Compare commits

..

3625 Commits

Author SHA1 Message Date
Kamil Braun
33901e1681 Fix version numbers in upgrade page title 2022-11-02 15:36:51 +01:00
guy9
097a65df9f adding top banner to the Docs website with a link to the ScyllaDB University fall LIVE event
Closes #11873
2022-11-02 10:20:40 +02:00
Nadav Har'El
b9d88a3601 cql/pytest: add reproducer for timestamp column validation issue
This patch adds a reproducing test for issue #11588, which is still open
so the test is expected to fail on Scylla ("xfail), and passes on Cassandra.

The test shows that Scylla allows an out-of-range value to be written to
timestamp column, but then it can't be read back.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11864
2022-11-01 08:11:01 +02:00
Botond Dénes
dc46bfa783 Merge 'Prepare repair for task manager integration' from Aleksandra Martyniuk
The PR prepares repair for task manager integration:
- Creates repair_module
- Keeps repair_module in repair_service
- Moves tracker methods to repair_module
- Changes UUID to task_id in repair module

Closes #11851

* github.com:scylladb/scylladb:
  repair: check shutdown with abort source in repair module
  repair: use generic module gate for repair module operations
  repair: move tracker to repair module
  repair: move next_repair_command to repair_module
  repair: generate repair id in repair module
  repair: keep shard number in repair_uniq_id
  repair: change UUID to task_id
  repair: add task_manager::module to repair_service
  repair: create repair module and task
2022-11-01 08:05:14 +02:00
Aleksandra Martyniuk
f2fe586f03 repair: check shutdown with abort source in repair module
In repair module the shutdown can be checked using abort_source.
Thus, we can get rid of shutdown flag.
2022-10-31 10:57:29 +01:00
Aleksandra Martyniuk
2d878cc9b5 repair: use generic module gate for repair module operations
Repair module uses a gate to prevent starting new tasks on shutdown.
Generic module's gate serves the same purpose, thus we can
use it also in repair specific context.
2022-10-31 10:56:36 +01:00
Aleksandra Martyniuk
4aae7e9026 repair: move tracker to repair module
Since both tracker and repair_module serve similar purpose,
it is confusing where we should seek for methods connected to them.
Thus, to make it more transparent, tracker class is deleted and all
its attributes and methods are moved to repair_module.
2022-10-31 10:55:36 +01:00
Aleksandra Martyniuk
a5c05dcb60 repair: move next_repair_command to repair_module
Number of the repair operation was counted both with
next_repair_command from tracer and sequence number
from task_manager::module.

To get rid of redundancy next_repair_command was deleted and all
methods using its value were moved to repair_module.
2022-10-31 10:54:39 +01:00
Aleksandra Martyniuk
c81260fb8b repair: generate repair id in repair module
repair_uniq_id for repair task can be generated in repair module
and accessed from the task.
2022-10-31 10:54:24 +01:00
Aleksandra Martyniuk
6432a26ccf repair: keep shard number in repair_uniq_id
Execution shard is one of the traits specific to repair tasks.
Child task should freely access shard id of its parent. Thus,
the shard number is kept in a repair_uniq_id struct.
2022-10-31 10:41:17 +01:00
guy9
276ec377c0 removed broken roadmap link
Closes #11854
2022-10-31 11:33:03 +02:00
Aleksandra Martyniuk
e2c7c1495d repair: change UUID to task_id
Change type of repair id from utils::UUID to task_id to distinguish
them from ids of other entities.
2022-10-31 10:07:08 +01:00
Aleksandra Martyniuk
dc80af33bc repair: add task_manager::module to repair_service
repair_service keeps a shared pointer to repair_module.
2022-10-31 10:04:50 +01:00
Aleksandra Martyniuk
576277384a repair: create repair module and task
Create repair_task_impl and repair_module inheriting from respectively
task manager task_impl and module to integrate repair operations with
task manager.
2022-10-31 10:04:48 +01:00
Takuya ASADA
159bc7c7ea install-dependencies.sh: use binary distributions of PIP package
We currently avoid compiling C code in "pip3 install scylla-driver", but
we actually providing portable binary distributions of the package,
so we should use it by "pip3 install --only-binary=:all: scylla-driver".
The binary distribution contains dependency libraries, so we won't have
problem loading it on relocatable python3.

Closes #11852
2022-10-31 10:38:36 +02:00
Botond Dénes
139fbb466e Merge 'Task manager extension' from Aleksandra Martyniuk
The PR adds changes to task manager that allow more convenient integration with modules.

Introduced changes:
- adds internal flag in task::impl that allows user to filter too specific tasks
- renames `parent_data` to more appropriate name `task_info`
- creates `tasks/types.hh` which allows using some types connected with task manager without the necessity to include whole task manager
- adds more flexible version of `make_task` method

Closes #11821

* github.com:scylladb/scylladb:
  tasks: add alternative make_task method
  tasks: rename parent_data to task_info and move it
  tasks: move task_id to tasks/types.hh
  tasks: add internal flag for task_manager::task::impl
2022-10-31 09:57:10 +02:00
Botond Dénes
2c021affd1 Merge 'storage_service, repair: use per-shard abort_source' from Benny Halevy
Prevent copying shared_ptr across shards
in do_sync_data_using_repair by allocating
a shared_ptr<abort_source> per shard in
node_ops_meta_data and respectively in node_ops_info.

Fixes #11826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11827

* github.com:scylladb/scylladb:
  repair: use sharded abort_source to abort repair_info
  repair: node_ops_info: add start and stop methods
  storage_service: node_ops_abort_thread: abort all node ops on shutdown
  storage_service: node_ops_abort_thread: co_return only after printing log message
  storage_service: node_ops_meta_data: add start and stop methods
  repair: node_ops_info: prevent accidental copy
2022-10-31 09:43:34 +02:00
Tenghuan He
e0948ba199 Add directory change instruction
Add directory change instruction while building scylla

Closes #11717
2022-10-30 23:53:02 +02:00
Pavel Emelyanov
477e0c967a scylla-gdb: Evaluate LSA object sizes dynamically
The lsa-segment command tries to walk LSA segment objects by decoding
their descriptors and (!) object sizes as well. Some objects in LSA have
dynamic sizes, i.e. those depending on the object contents. The script
tries to drill down the object internals to get this size, but bad news
is that nowadays there are many dynamic objects that are not covered.
Once stepped upon unsupported object, scylla-gdb likely stops because
the "next" descriptor happens to be in the middle of the object and its
parsing throws.

This patch fixes this by taking advantage of the virtual size() call of
the migrate_fn_type all LSA objects are linked with (indirectly). It
gets the migrator object, the LSA object itself and calls

  ((migrate_fn_type*)<migrator_ptr>)->size((const void*)<object_ptr>)

with gdb. The evaluated value is the live dynamic size of the object.

fixes: #11792
refs: #2455

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11847
2022-10-28 14:11:30 +03:00
Botond Dénes
74c9aa3a3f Merge 'removenode: allow specifying nodes to ignore using host_id' from Benny Halevy
Currently, when specifying nodes to ignore for replace or removenode,
we support specifying them only using their ip address.

As discussed in https://github.com/scylladb/scylladb/issues/11839 for removenode,
we intentionally require the host uuid for specifying the node to remove,
so the nodes to ignore (that are also done, otherwise we need not ignore them),
should be consistent with that and be specified using their host_id.

The series extends the apis and allows either the nodes ip address or their host_id
to be specified, for backward compatibility.

We should deprecate the ip address method over time and convert the tests and management
software to use the ignored nodes' host_id:s instead.

Closes #11841

* github.com:scylladb/scylladb:
  api: doc: remove_node: improve summary
  api, service: storage_service: removenode: allow passing ignore_nodes as uuid:s
  storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint
  locator: token_metadata: add parse_host_id_and_endpoint
  api: storage_service: remove_node: validate host_id
2022-10-28 13:35:04 +03:00
Benny Halevy
335a8cc362 api: doc: remove_node: improve summary
The current summary of the operation is obscure.
It refers to a token in the ring and the endpoint associated with it,
while the operation uses a host_id to identify a whole node.

Instead, clarify the summary to refer to a node in the cluster,
consistent with the description for the host_id parameter.
Also, describe the effect the call has on the data the removed node
logically owned.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:52:37 +03:00
Benny Halevy
9ef2631ec2 api, service: storage_service: removenode: allow passing ignore_nodes as uuid:s
Currently the api is inconsistent: requiring a uuid for the
host_id of the node to be removed, while the ignored nodes list
is given as comma-separated ip addresses.

Instead, support identifying the ignored_nodes either
by their host_id (uuid) or ip address.

Also, require all ignore_nodes to be of the same kind:
either UUIDs or ip addresses, as a mix of the 2 is likely
indicating a user error.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:49:03 +03:00
Benny Halevy
40cd685371 storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint
Allow specifying the dead node to ignore either as host_id
or ip address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Benny Halevy
b74807cb8a locator: token_metadata: add parse_host_id_and_endpoint
To be used for specifying nodes either by their
host_id or ip address and using the token_metadata
to resolve the mapping.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Benny Halevy
340a5a0c94 api: storage_service: remove_node: validate host_id
The node to be removed must be identified by its host_id.
Validate that at the api layer and pass the parsed host_id
down to storage_service::removenode.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Takuya ASADA
464b5de99b scylla_setup: allow symlink to --disks option
Currently, --disks options does not allow symlinks such as
/dev/disk/by-uuid/* or /dev/disk/azure/*.

To allow using them, is_unused_disk() should resolve symlink to
realpath, before evaluating the disk path.

Fixes #11634

Closes #11646
2022-10-28 07:24:11 +03:00
Botond Dénes
b744036840 Merge 'scylla_util.py: on sysconfig_parser, don't use double quote when it's possible' from Takuya ASADA
It seems like distribution original sysconfig files does not use double
quote to set the parameter when the value does not contain space.
Adding function to detect spaces in the value, don't usedouble quote
when it not detected.

Fixes #9149

Closes #9153

* github.com:scylladb/scylladb:
  scylla_util.py: adding unescape for sysconfig_parser
  scylla_util.py: on sysconfig_parser, don't use double quote when it's possible
2022-10-28 07:19:13 +03:00
Benny Halevy
44e1058f63 docs: nodetool/removenode: fix host_id in examples
removenode host_id must specify the host ID as a UUID,
not an ip address.

Fixes #11839

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11840
2022-10-27 14:29:36 +03:00
Benny Halevy
0ea8250e83 repair: use sharded abort_source to abort repair_info
Currently we use a single shared_ptr<abort_source>
that can't be copied across shards.

Instead, use a sharded<abort_source> in node_ops_info so that each
repair_info instance will use an (optional) abort_source*
on its own shard.

Added respective start and stop methodsm plus a local_abort_source
getter to get the shard-local abort_source (if available).

Fixes #11826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:18:30 +03:00
Benny Halevy
88f993e5ed repair: node_ops_info: add start and stop methods
Prepare for adding a sharded<abort_source> member.

Wire start/stop in storage_service::node_ops_meta_data.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:18:30 +03:00
Benny Halevy
c2f384093d storage_service: node_ops_abort_thread: abort all node ops on shutdown
A later patch adds a sharded<abort_source> to node_ops_info.
On shutdown, we must orderly stop it, so use node_ops_abort_thread
shutdown path (where node_ops_singal_abort is called will a nullopt)
to abort (and stop) all outstanding node_ops by passing
a null_uuid to node_ops_abort, and let it iterate over all
node ops to abort and stop them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:06 +03:00
Benny Halevy
0efd290378 storage_service: node_ops_abort_thread: co_return only after printing log message
Currently the function co_returns if (!uuid_opt)
so the log info message indicating it's stopped
is not printed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Benny Halevy
47e4761b4e storage_service: node_ops_meta_data: add start and stop methods
Prepare for starting and stopping repair node_ops_info

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Benny Halevy
5c25066ea7 repair: node_ops_info: prevent accidental copy
Delete node_ops_info copy and move constructors before
we add a sharded<abort_source> member for the per-shard repairs
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Takuya ASADA
cd6030d5df scylla_util.py: adding unescape for sysconfig_parser
Even we have __escape() for escaping " middle of the value to writing
sysconfig file, we didn't unescape for reading from sysconfig file.
So adding __unescape() and call it on get().
2022-10-27 16:39:47 +09:00
Takuya ASADA
de57433bcf scylla_util.py: on sysconfig_parser, don't use double quote when it's possible
It seems like distribution original sysconfig files does not use double
quote to set the parameter when the value does not contain space.
Adding function to detect spaces in the value, don't usedouble quote
when it not detected.

Fixes #9149
2022-10-27 16:36:27 +09:00
Aleksandra Martyniuk
6494de9bb0 tasks: add alternative make_task method
Task manager tasks should be created with make_task method since
it properly sets information about child-parent relationship
between tasks. Though, sometimes we may want to keep additional
task data in classes inheriting from task_manager::task::impl.
Doing it with existing make_task method makes it impossible since
implementation objects are created internally.

The commit adds a new make_task that allows to provide a task
implementation pointer created by caller. All the fields except
for the one connected with children and parent should be set before.
2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk
10d11a7baf tasks: rename parent_data to task_info and move it
parent_data struct contains info that is common	for each task,
not only in parent-child relationship context. To use it this way
without confusion, its name is changed to task_info.

In order to be able to widely and comfortably use task_info,
it is moved from tasks/task_manager.hh to tasks/types.hh
and slightly extended.
2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk
9ecc2047ac tasks: move task_id to tasks/types.hh 2022-10-26 14:01:05 +02:00
Aleksandra Martyniuk
e2e8a286cc tasks: add internal flag for task_manager::task::impl
It is convenient to create many different tasks implementations
representing more and more specific parts of the operation in
a module. Presenting all of them through the api makes it cumbersome
for user to navigate and track, though.

Flag internal is added to task_manager::task::impl so that the tasks
could be filtered before they are sent to user.
2022-10-26 14:01:05 +02:00
Pavel Emelyanov
e245780d56 gossiper: Request topology states in shadow round
When doing shadow round for replacement the bootstrapping node needs to
know the dc/rack info about the node it replaces to configure it on
topology. This topology info is later used by e.g. repair service.

fixes: #11829

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11838
2022-10-25 13:21:20 +03:00
Pavel Emelyanov
64c9359443 storage_proxy: Don't use default-initialized endpoint in get_read_executor()
After calling filter_for_query() the extra_replica to speculate to may
be left default-initialized which is :0 ipv6 address. Later below this
address is used as-is to check if it belongs to the same DC or not which
is not nice, as :0 is not an address of any existing endpoint.

Recent move of dc/rack data onto topology made this place reveal itself
by emitting the internal error due to :0 not being present on the
topology's collection of endpoints. Prior to this move the dc filter
would count :0 as belonging to "default_dc" datacenter which may or may
not match with the dc of the local node.

The fix is to explicitly tell set extra_replica from unset one.

fixes: #11825

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11833
2022-10-25 09:16:50 +03:00
Takuya ASADA
1a11a38add unified: move unified package contents to sub-directory
On most of the software distribution tar.gz, it has sub-directory to contain
everything, to prevent extract contents to current directory.
We should follow this style on our unified package too.

To do this we need to increment relocatable package version to '3.0'.

Fixes #8349

Closes #8867
2022-10-25 08:58:15 +03:00
Takuya ASADA
a938b009ca scylla_raid_setup: run uuidpath existance check only after mount failed
We added UUID device file existance check on #11399, we expect UUID
device file is created before checking, and we wait for the creation by
"udevadm settle" after "mkfs.xfs".

However, we actually getting error which says UUID device file missing,
it probably means "udevadm settle" doesn't guarantee the device file created,
on some condition.

To avoid the error, use var-lib-scylla.mount to wait for UUID device
file is ready, and run the file existance check when the service is
failed.

Fixes #11617

Closes #11666
2022-10-25 08:54:21 +03:00
Yaniv Kaul
cec21d10ed docs: Fix typo (patch -> batch)
See subject.

Closes #11837
2022-10-25 08:50:44 +03:00
Tomasz Grabiec
687df05e28 db: make_forwardable::reader: Do not emit range_tombstone_change with position past the range
Since the end bound is exclusive, the end position should be
before_key(), not after_key().

Affects only tests, as far as I know, only there we can get an end
bound which is a clustering row position.

Would cause failures once row cache is switched to v2 representation
because of violated assumptions about positions.

Introduced in 76ee3f029c

Closes #11823
2022-10-24 17:06:52 +03:00
Avi Kivity
9e34779c53 Update seastar submodule
* seastar 601e0776c0...f32ed00954 (28):
  > Merge 'treewide: more fmt 9 adjustments' from Avi Kivity
  > rpc: Remove nested class friend declaration from connection
  > reactor: advance the head pointer in batch
  > Add git submodule instructions to HACKING.md, resolves #541
  > dns: Handle TCP mode connect failure
  > future: s/make_exception_ptr/std::make_exception_ptr/
  > reactor: implement read_some(fd, buffer, len) in io_uring
  > reactor: remove unneeded "protected"
  > Merge 'reactor: support more network ops in io_uring backend' from Kefu Chai
  > reactor: Indentation fix after previous patch
  > io: Remove --max-io-requests concept
  > future: add concept constraints to handle_exception()
  > future: improve the doxygen document
  > aio_general_context: flush: provide 1 second grace for retries
  > reactor: destroy_scheduling_group: make sure scheduling_group is valid
  > reactor: pass a plain pointer to io_uring_wait_cqes()
  > gate: add move ctor and move assignment operator for gate
  > reactor: drop stale comment
  > reactor_config: update stale doc comments
  > test: alloc_test: Actually prevent dead allocation elimination
  > util/closeable: hold _obj with reference_wrapper<>
  > memory: Fix off-by-one in large allocation detection
  > util/closeable: add move ctor for deferred_stop
  > reactor: Remove some unused friend declarations
  > core/sharded.hh: tweak on comment for better readability
  > Merge 'fmt 9 ostream fix' from longlene
  > program_options: allow configure switch-stytle option programmatically
  > inet_address: Add helper to check for address being lo/any

Closes #11814
2022-10-21 21:30:07 +03:00
Botond Dénes
4aa0b16852 Merge 'distributed_loader: detect highest generation before populating column families' from Benny Halevy
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).

Otherwise, the generations of new sstables created when populating staging (via reshard or reshape) may collide with generations in the base directory, leading to https://github.com/scylladb/scylladb/issues/11789

Refs scylladb/scylladb#11789
Fixes scylladb/scylladb#11793

Closes #11795

* github.com:scylladb/scylladb:
  distributed_loader: populate_column_family: reindent
  distributed_loader: coroutinize populate_column_family
  distributed_loader: table_population_metadata: start: reindent
  distributed_loader: table_population_metadata: coroutinize start_subdir
  distributed_loader: table_population_metadata: start_subdir: reindent
  distributed_loader: pre-load all sstables metadata for table before populating it
2022-10-21 14:07:51 +03:00
Botond Dénes
e981bd4f21 Merge 'Alternator, MV: fix bug in some view updates which set the view key to its existing value' from Nadav Har'El
As described in issue #11801, we saw in Alternator when a GSI has both partition and sort keys which were non-key attributes in the base, cases where updating the GSI-sort-key attribute to the same value it already had caused the entire GSI row to be deleted.

In this series fix this bug (it was a bug in our materialized views implementation) and add a reproducing test (plus a few more tests for similar situations which worked before the patch, and continue to work after it).

Fixes #11801

Closes #11808

* github.com:scylladb/scylladb:
  test/alternator: add test for issue 11801
  MV: fix handling of view update which reassign the same key value
  materialized views: inline used-once and confusing function, replace_entry()
2022-10-21 10:49:28 +03:00
Botond Dénes
396d9e6a46 Merge 'Subscribe repair_info::abort on node_ops_meta_data::abort_source' from Pavel Emelyanov
The storage_service::stop() calls repair_service::abort_repair_node_ops() but at that time the sharded<repair_service> is already stopped and call .local() on it just crashes.

The suggested fix is to remove explicit storage_service -> repair_service kick. Instead, the repair_infos generated for the sake of node-ops are subscribed on the node_ops_meta_data's abort source and abort themselves automatically.

fixes: #10284

Closes #11797

* github.com:scylladb/scylladb:
  repair: Remove ops_uuid
  repair: Remove abort_repair_node_ops() altogether
  repair: Subscribe on node_ops_info::as abortion
  repair: Keep abort source on node_ops_info
  repair: Pass node_ops_info arg to do_sync_data_using_repair()
  repair: Mark repair_info::abort() noexcept
  node_ops: Remove _aborted bit
  node_ops: Simplify construction of node_ops_metadata
  main: Fix message about repair service starting
2022-10-21 10:08:43 +03:00
Avi Kivity
9ebac12e60 test: mutation-test: fix off-by-one in test_large_collection_allocation
The test wants to see that no allocations larger than 128k are present,
but sets the warning threshold to exactly 128k. Due to an off-by-one in
Seastar, this went unnoticed. However, now that the off-by-one in Seastar
is fixed [1], this test starts to fail.

Fix by setting the warning threshold to 128k + 1.

[1] 429efb5086

Closes #11817
2022-10-21 10:04:40 +03:00
Avi Kivity
f0643d1713 alternator: ttl: do not copy mutation while constructing a vector
The vector(initializer_list<T>) constructor copies the T since
initializer_list is read-only. Move the mutation instead.

This happens to fix a use-after-return on clang 15 on aarch64.
I'm fairly sure that's a miscompile, but the fix is worthwhile
regardless.

Closes #11818
2022-10-21 10:04:00 +03:00
Avi Kivity
db79f1eb60 Merge 'cql3: expr: Add unit tests for evaluate()' from Jan Ciołek
This PR adds some unit tests for the `expr::evaluate()` function.

At first I wanted to add the unit tests as part of #11658, but their size grew and grew, until I decided that they deserve their own pull request.

I found a few places where I think it would be better to behave in a different way, but nothing serious.

Closes #11815

* github.com:scylladb/scylladb:
  test/boost: move expr_test_utils.hh to .hh and .cc in test/lib
  cql3: expr: Add unit tests for bind_variable validation of collections
  cql3: expr: Add test for subscripted list and map
  cql3: expr: Add test for usertype_constructor
  cql3: expr: Add test for tuple_constructor
  cql3: expr: Add tests for evaluation of collection constructors
  cql3: expr: Add tests for evaluation of column_values and bind_variables
  cql3: expr: Add constant evaluation tests
  test/boost: Add expr_test_utils.hh
  cql3: Add ostream operator for raw_value
  cql3: add is_empty_value() to raw_value and raw_value_view
2022-10-20 22:55:34 +03:00
Jan Ciolek
4c4ed8e6df test/boost: move expr_test_utils.hh to .hh and .cc in test/lib
expr_test_utils.hh was a header file with helper methods for
expression tests. All functions were inline, because I didn't
know how to create and link a .cc file in test/boost.

Now the header is split into expr_test_utils.hh and expr_test_utils.cc
and moved to test/lib, which is designed to keep this kind of files.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 17:31:37 +02:00
Avi Kivity
6ce659be5b Merge "Deglobalize snitch" from Pavel E
"
Snitch was the junction of several services' deps because it was the
holder of endpoint->dc/rack mappings. Now this information is all on
topology object, so snitch can be finally made main-local
"

* 'br-deglobalize-snitch' of https://github.com/xemul/scylla:
  code: Deglobalize snitch
  tests: Get local reference on global snitch instance once
  gossiper: Pass current snitch name into checker
  snitch: Add sharded<snitch_ptr> arg to reset_snitch()
  api: Move update_snitch endpoint
  api: Use local snitch reference
  api: Unset snitch endpoints on stop
  storage_service: Keep local snitch reference
  system_keyspace: Don't use global snitch instance
  snitch: Add const snitch_ptr::operator->()
2022-10-20 16:51:24 +03:00
Avi Kivity
dd0b571d7e Update tools/java submodule (Scylla Cloud serverless config option)
* tools/java 5f2b91d774...87672be28e (1):
  > Add serverless Scylla Cloud config file option
2022-10-20 16:15:28 +03:00
Konstantin Osipov
8c920add42 test: (pytest) fix the pytest wrapper to work on Ubuntu
Ubuntu doesn't have python, only python2 and python3.

Closes #11810
2022-10-20 15:53:24 +03:00
Botond Dénes
669b225c67 reader_permit: resources: remove operator bool and >=
These cannot be meaningfully define for a vector value like resources.
To prevent instinctive misuse, remove them. Operator bool is replaced
with `non_zero()` which hopefully better expresses what to expected.
The comparison operator is just removed and inlined into its own user,
which actually help said user's readability.

Closes #11813
2022-10-20 15:25:11 +03:00
Jan Ciolek
75b27cb61c cql3: expr: Add unit tests for bind_variable validation of collections
evaluating a bind variable should validate
collection values.

Test that bound collection values are validated,
even in case of a nested collection.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:03 +02:00
Jan Ciolek
c4651e897f cql3: expr: Add test for subscripted list and map
Test that subscripting lists and maps works as expected.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:03 +02:00
Jan Ciolek
5a00c3dd76 cql3: expr: Add test for usertype_constructor
Test that evaluate(usertype_constructor) works
as expected.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:03 +02:00
Jan Ciolek
8f6309bd66 cql3: expr: Add test for tuple_constructor
Test that evaluate(tuple_constructor) works
as expected.

It was necessary to implement a custom function
for serializing tuples, because some tests
require the tuple to contain unset_value
or an empty value, which is impossible
to express using the exisiting code.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:03 +02:00
Jan Ciolek
5ae719d51a cql3: expr: Add tests for evaluation of collection constructors
Test that evaluate(collection_constructor) works as expected.

Added a bunch of utility methods for creating
collection values to expr_test_utils.hh.

I was forced to write custom serialization of
collections. I tried to use data_value,
but it doesn't allow to express unset_value
and empty values.

The custom serialization isnt actually used
in this specific commit, but it's needed
in the following ones.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-20 12:12:02 +02:00
Pavel Emelyanov
01b1f56bd7 code: Deglobalize snitch
All uses of snitch not have their own local referece. The global
instance can now be replaced with the one living in main (and tests)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:41 +03:00
Pavel Emelyanov
8e4e3f7185 tests: Get local reference on global snitch instance once
Some tests actively use global snitch instance. This patch makes each
test get a local reference and use it everywhere. Next patch will
replace global instance with local one

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:40 +03:00
Pavel Emelyanov
898579027d gossiper: Pass current snitch name into checker
Gossiper makes sure local snitch name is the same as the one of other
nodes in the ring. It now gets global snitch to get the name, this patch
passes the name as an argument, because the caller (storage_service) has
snitch instance local reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:38 +03:00
Pavel Emelyanov
1674882220 snitch: Add sharded<snitch_ptr> arg to reset_snitch()
The method replaces snitch instance on the existing sharded<snitch_ptr>
and the "existing" is nowadays the global instance. This patch changes
it to use local reference passed from API code

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:34 +03:00
Pavel Emelyanov
5fba0a7f65 api: Move update_snitch endpoint
It's now living in storage_service.cc, but non-global snitch is
available in endpoint_snitch.cc so move the endpoint handler there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:20 +03:00
Pavel Emelyanov
0d49b0e24a api: Use local snitch reference
The snitch/name endpoint needs snitch instance to get the name from.
Also the storage_service/reset_snitch endpoint will also need snitch
instance to call reset on.

This patch carries local snitch reference all thw way through API setup
and patches the get_name() call. The reset_snitch() will come in the
next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:31:45 +03:00
Pavel Emelyanov
c175ea33e2 api: Unset snitch endpoints on stop
Some time soon snitch API handlers will operate on local snitch
reference capture, so those need to be unset before the target local
variable variable goes away

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:31:12 +03:00
Pavel Emelyanov
ea8bfc4844 storage_service: Keep local snitch reference
Storage service uses snitch in several places:
- boot
- snitch-reconfigured subscription
- preferred IP reconnection

At this point it's worth adding storage_service->snitch explicit
dependency and patch the above to use local reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:30:00 +03:00
Pavel Emelyanov
52d6e56a10 system_keyspace: Don't use global snitch instance
There are two places to patch: .start() and .setup() and both only need
snitch to get local dc/rack from, nothing more. Thus both can live with
the explicit argument for now

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:29:26 +03:00
Pavel Emelyanov
f524a79fe9 snitch: Add const snitch_ptr::operator->()
To call snitch->something() on const snitch_ptr& variable later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:29:25 +03:00
Nadav Har'El
264f453b9d Merge 'Associate alternator user with its service level configuration' from Piotr Sarna
Until now, authentication in alternator served only two purposes:
 - refusing clients without proper credentials
 - printing user information with logs

After this series, this user information is passed to lower layers, which also means that users are capable of attaching service levels to roles, and this service level configuration will be effective with alternator requests.

tests: manually by adding more debug logs and inspecting that per-service-level timeout value was properly applied for an authenticated alternator user

Fixes #11379

Closes #11380

* github.com:scylladb/scylladb:
  alternator: propagate authenticated user in client state
  client_state: add internal constructor with auth_service
  alternator: pass auth_service and sl_controller to server
2022-10-19 23:27:48 +03:00
Avi Kivity
22f13e7ca3 Revert "Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity"
This reverts commit df8e1da8b2, reversing
changes made to 4ff204c028. It causes
a crash in debug mode on aarch64 (likely a coroutine miscompile).

Fixes #11809.
2022-10-19 21:28:55 +03:00
Alexander Turetskiy
636e14cc77 Alternator: Projection field added to return from DescribeTable which describes GSIs and LSIs.
The return from DescribeTable which describes GSIs and LSIs is missing
the Projection field. We do not yet support all the settings Projection
(see #5036), but the default which we support is ALL, and DescribeTable
should return that in its description.

Fixes #11470

Closes #11693
2022-10-19 19:01:08 +03:00
Avi Kivity
69199dbfba Merge 'schema_tables: limit concurrency' from Benny Halevy
To prevent stalls due to large number of tables.

Fixes scylladb/scylladb#11574

Closes #11689

* github.com:scylladb/scylladb:
  schema_tables: merge_tables_and_views reindent
  schema_tables: limit paralellism
2022-10-19 18:40:45 +03:00
Tomasz Grabiec
a979bbf829 dbuild: Do not fail if .gdbinit is missing
Closes #11811
2022-10-19 18:38:09 +03:00
Avi Kivity
6b0afb968d Merge 'reader_concurrency_semaphore: add set_resources()' from Botond Dénes
Allowing to change the total or initial resources the semaphore has. After calling `set_resources()` the semaphore will look like as if it was created with the specified amount of resources when created.

Use the new method in `replica::database::revert_initial_system_read_concurrency_boost()` so it doesn't lead to strange semaphore diagnostics output. Currently the system semaphore has 90/100 count units when there are no reads against it, which has led to some confusion.

I also plan on using the new facility in enterprise.

Closes #11772

* github.com:scylladb/scylladb:
  replica/database: revert initial boost to system semaphore with set_resources()
  reader_concurrency_semaphore: add set_resources()
2022-10-19 18:04:20 +03:00
Raphael S. Carvalho
ba6186a47f replica: Pick new generation for SSTables being moved from staging dir
When moving a SSTable from staging to base dir, we reused the generation
under the assumption that no SSTable in base dir uses that same
generation. But that's not always true.

When reshaping staging dir, reshape compaction can pick a generation
taken by a SSTable in base dir. That's because staging dir is populated
first and it doesn't have awareness of generations in base dir yet.

When that happens, view building will fail to move SSTable in staging
which shares the same generation as another in base dir.

We could have played with order of population, populating base dir
first than staging dir, but the fragility wouldn't be gone. Not
future proof at all.
We can easily make this safe by picking a new generation for the SSTable
being moved from staging, making sure no clash will ever happen.

Fixes #11789.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11790
2022-10-19 15:33:30 +03:00
Nadav Har'El
2e439c9471 test/alternator: add test for issue 11801
This patch adds a test reproducing issue #11801, and confirming that
the previous patch fixed it. Before the previous patch, the test passed
on DynamoDB but failed on Alternator.
The patch also adds four more passing tests which demonstrate that
issue #11801 only happened in the very specific case where:
 1. A GSI has two key attributes which weren't key attributes in the
    base, and
 2. An update sets the second of those attributes to the same value
    which it already had.

This bug was originally discovered and explained by @fee-mendes.

Refs #11801.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-10-19 14:36:48 +03:00
Benny Halevy
4d7f0be929 distributed_loader: populate_column_family: reindent 2022-10-19 14:18:38 +03:00
Benny Halevy
030afaa934 distributed_loader: coroutinize populate_column_family
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 14:18:04 +03:00
Benny Halevy
0f23ee14c9 distributed_loader: table_population_metadata: start: reindent 2022-10-19 14:16:59 +03:00
Benny Halevy
39cec4f304 distributed_loader: table_population_metadata: coroutinize start_subdir
Calling it in a seastar thread was done to reduce code churn
and facilitate backporting.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 14:16:59 +03:00
Benny Halevy
5749a54cab distributed_loader: table_population_metadata: start_subdir: reindent 2022-10-19 14:16:59 +03:00
Benny Halevy
119c0f3983 distributed_loader: pre-load all sstables metadata for table before populating it
We should scan all sstables in the table directory and its
subdirectories to determine the highest sstable version and generation
before using it for creating new sstables (via reshard or reshape).

Fixes scylladb/scylladb#11793

Note: table_population_metadata::start_subdir is called
in a seastar thread to facilitate backporting to old versions
that do not support coroutines yet.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 14:16:57 +03:00
Nadav Har'El
8f4243b875 MV: fix handling of view update which reassign the same key value
When a materialized view has a key (in Alternator, this can be two
keys) which was a regular column in the base table, and a base
update modifies that regular column, there are two distinct cases:

1. If the old and new key values are different, we need to delete the
   old view row, and create a new view row (with the different key).

2. If the old and new key values are the same, we just need to update
   the pre-existing row.

It's important not to confuse the two cases: If we try to delete and
create the *same* view row in the same timestamp, the result will be
that the row will be deleted (a tombstone wins over data if they have the
same timestamp) instead of updated. This is what we saw in issue #11801.

We had a bug that was seen when an update set the view key column to
the old value it already had: To compare the old and new key values
we used the function compare_atomic_cell_for_merge(), but this compared
not just they values but also incorrectly compared the metadata such as
a the timestamp. Because setting a column to the same value changes its
timestamp, we wrongly concluded that these to be different view keys
and used the delete-and-create code for this case, resulting in the
view row being deleted (as explained above).

The simple fix is to compare just the key values - not looking at
the metadata.

See tests reproducing this bug and confirming its fix in the next patch.

Fixes #11801

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-10-19 13:43:12 +03:00
Nadav Har'El
e1f8cb6521 materialized views: inline used-once and confusing function, replace_entry()
The replace_entry() function is nothing more than a convenience for
calling delete_old_entry() and then create_entry(). But it is only used
once in the code, and we can just open-code the two calls instead of
the one.

The reason I want to change it now is that the shortcut replace_entry()
helped hide a bug (#11801) - replace_entry() works incorrectly if the
old and new row have the same key, because if they do we get a deletion
and creation of the same row with the same timestamp - and the deletion
wins. Having the two calls not hidden by a convenience function makes
this potential problem more apparent.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-10-19 13:25:34 +03:00
Benny Halevy
ce22dd4329 schema_tables: merge_tables_and_views reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 13:05:41 +03:00
Benny Halevy
7ccb0e70f0 schema_tables: limit paralellism
To prevent stalls due to large number of tables.

Fixes scylladb/scylladb#11574

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-19 13:05:38 +03:00
Anna Stuchlik
7ec750fc63 docs: add the list of new metrics in 5.1
Closes #11703
2022-10-19 12:06:25 +03:00
Jan Ciolek
1b7acc758e cql3: expr: Add tests for evaluation of column_values and bind_variables
Add tests which test that evaluate(column_value)
and evaluate(bind_variable) work as expected.

values of columns and bind variables are
kept in evaluation_inputs, so we need to mock
them in order for evaluate() to work.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-19 10:30:51 +02:00
Jan Ciolek
0f29015d9f cql3: expr: Add constant evaluation tests
Add unit test for evaluating expr::constant values.
evaluate(constant) just returns constant.value,
so there is no point in trying all the possible combinations.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-19 10:30:42 +02:00
Anna Stuchlik
a066396cd3 doc: fix the command to create and sign a certificate so that the trusted certificate SHA256 is created
Closes #11758
2022-10-19 11:30:20 +03:00
Botond Dénes
37ebbc819a Merge 'Scylla-gdb lsa polishing' from Pavel Emelyanov
It was supposed to be fix for #2455, but eventually it turned out that #11792 blocks this progress but takes more efforts. So for now only a couple of small improvements (not to lose them by chance)

Closes #11794

* github.com:scylladb/scylladb:
  scylla-gdb: Make regions iterable object
  scylla-gdb: Dont print 0x0x
2022-10-19 06:54:49 +03:00
Botond Dénes
2d581e9e8f Merge "Maintain dc/rack by topology" from Pavel Emelyanov
"
There's an ongoing effort to move the endpoint -> {dc/rack} mappings
from snitch onto topology object and this set finalizes it. After it the
snitch service stops depending on gossiper and system keyspace and is
ready for de-globalization. As a nice side-effect the system keyspace no
longer needs to maintain the dc/rack info cache and its starting code gets
relaxed.

refs: #2737
refs: #2795
"

* 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits)
  system_keyspace: Dont maintain dc/rack cache
  system_keyspace: Indentation fix after previous patch
  system_keyspace: Coroutinuze build_dc_rack_info()
  topology: Move all post-configuration to topology::config
  snitch: Start early
  gossiper: Do not export system keyspace
  snitch: Remove gossiper reference
  snitch: Mark get_datacenter/_rack methods const
  snitch: Drop some dead dependency knots
  snitch, code: Make get_datacenter() report local dc only
  snitch, code: Make get_rack() report local rack only
  storage_service: Populate pending endpoint in on_alive()
  code: Populate pending locations
  topology: Put local dc/rack on topology early
  topology: Add pending locations collection
  topology: Make get_location() errors more verbose
  token_metadata: Add config, spread everywhere
  token_metadata: Hide token_metadata_impl copy constructor
  gosspier: Remove messaging service getter
  snitch: Get local address to gossip via config
  ...
2022-10-19 06:50:21 +03:00
Jan Ciolek
429600a957 test/boost: Add expr_test_utils.hh
Add a header file which will contain
utilities for writing expression tests.

For now it contains simple functions
like make_int_constant(), but there
are many more to come.

I feel like it's cleaner to put
all these functions in a separate
file instead of having them spread randomly
between tests.

It also enables code reuse so that
future expression tests can reuse
these functions instead of writing
them from scratch.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-18 22:48:33 +02:00
Jan Ciolek
855db49306 cql3: Add ostream operator for raw_value
It's possible to print raw_value_view,
but not raw_value. It would be useful
to be able to print both.

Implement printing raw_value
by creating raw_value_view from it
and printing the view.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-18 22:48:25 +02:00
Jan Ciolek
096c65d27f cql3: add is_empty_value() to raw_value and raw_value_view
An empty value is a value that is neither null nor unset,
but has 0 bytes of data.
Such values can be created by the user using
certain CQL functions, for example an empty int
value can be inserted using blobasint(0x).

Add a method to raw_value and raw_value_view,
which allows to check whether the value is empty.
This will be used in many places in which
we need to validate that a value isn't empty.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-18 22:47:48 +02:00
Pavel Emelyanov
3dc7c33847 repair: Remove ops_uuid
It used to be used to abort repair_info by the corresponding node-ops
uuid, but this code is no longer there, so it's good to drop the uuid as
well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
b835c3573c repair: Remove abort_repair_node_ops() altogether
This code is dead after previous patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
8231b4ec1b repair: Subscribe on node_ops_info::as abortion
When node_ops_meta_data aborts it also kicks repair to find and abort
all relevant repair_infos. Now it can be simplified by subscribing
repair_meta on the abort source and aborting it without explicit kick

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
bf5825daac repair: Keep abort source on node_ops_info
Next patches will need to subscribe on node_ops_meta_data's abort source
inside repair code, so keep the pointer on node_ops_info too. At the
same time, the node_ops_info::abort becomes obsolete, because the same
check can be performed via the abort_source->abort_requested()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
bbb7fca09c repair: Pass node_ops_info arg to do_sync_data_using_repair()
Next patches will need to know more than the ops_uuid. The needed info
is (well -- will be) sitting on node_ops_info, so pass it along

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
5e9c3c65b5 repair: Mark repair_info::abort() noexcept
Next patch will call it inside abort_source subscription callback which
requires the calling code to be noexcept

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
34458ec2c5 node_ops: Remove _aborted bit
A short cleanup "while at it" -- the node_ops_meta_data doesn't need to
carry dedicated _aborted boolean -- the abort source that sets it is
available instantly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:22 +03:00
Pavel Emelyanov
96f0695731 node_ops: Simplify construction of node_ops_metadata
It always constructs node_ops_info the same way

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:03:53 +03:00
Pavel Emelyanov
2fa58632b3 main: Fix message about repair service starting
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 17:23:17 +03:00
Botond Dénes
7fbad8de87 reader_concurrency_semaphore: unify admission logic across all paths
The semaphore currently has two admission paths: the
obtain_permit()/with_permit() methods which admits permits on user
request (the front door) and the maybe_admit_waiters() which admits
permits based on internal events like memory resource being returned
(the back door). The two paths used their own admission conditions
and naturally this means that they diverged in time. Notably,
maybe_admit_waiters() did not look at inactive readers assuming that if
there are waiters there cannot be inactive readers. This is not true
however since we merged the execution-stage into the semaphore. Waiters
can queue up even when there are inactive reads and thus
maybe_admit_waiters() has to consider evicting some of them to see if
this would allow for admitting new reads.
To avoid such divergence in the future, the admission logic was moved
into a new method can_admit_read() which is now shared between the two
method families. This method now checks for the possibility of evicting
inactive readers as well.
The admission logic was tuned slightly to only consider evicting
inactive readers if there is a real possibility that this will result
in admissions: notably, before this patch, resource availability was
checked before stalls were (used permits == blocked permits), so we
could evict readers even if this couldn't help.
Because now eviction can be started from maybe_admit_waiters(), which is
also downstream from eviction, we added a flag to avoid recursive
evict -> maybe admit -> evict ... loops.

Fixes: #11770

Closes #11784
2022-10-18 17:07:43 +03:00
Pavel Emelyanov
b5fd65af61 scylla-gdb: Make regions iterable object
This makes it re-usable across different commands (not there yet)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 16:09:47 +03:00
Pavel Emelyanov
0b6b0bd8d2 scylla-gdb: Dont print 0x0x
Formatting pointer adds 0x automatically, no need in adding it
explicitly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 16:09:09 +03:00
Botond Dénes
df8e1da8b2 Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity
indexed_table_select_statement::do_execute_base_query() is fairly
complicated and becomes a little simpler with coroutines.

Closes #11297

* github.com:scylladb/scylladb:
  cql3: indexed_table_select_statement: fix indentation
  cql3: indexed_table_select_statement: clarify loop termination
  cql3: indexed_table_select_statement: get rid of internal base_query_state struct
  cql3: indexed_table_select_statement: coroutinize do_execute_base_query()
  cql3: indexed_table_select_statement: de-result_wrap() do_execute_base_query()
2022-10-18 08:24:21 +03:00
Avi Kivity
50b1fd4cd2 cql3: indexed_table_select_statement: fix indentation
Restore normal indentation after coroutinization, no code changes.
2022-10-17 22:03:11 +03:00
Avi Kivity
3ad956ca2d cql3: indexed_table_select_statement: clarify loop termination
The loop terminates when we run out of keys. There are extra
conditions such as for short read or page limit, but these are
truly discovered during the loop and qualify as special
conditions, if you squint enough.
2022-10-17 22:03:11 +03:00
Avi Kivity
ec183d4673 cql3: indexed_table_select_statement: get rid of internal base_query_state struct
It was just a crutch for do_with(), and now can be replaced with
ordinary coroutine-protected variables. The member names were renamed
to the final names they were assigned within the do_with().
2022-10-17 22:03:11 +03:00
Avi Kivity
75e1321b08 cql3: indexed_table_select_statement: coroutinize do_execute_base_query()
Indentation and "infinite" for-loop left for later cleanup.

Note the last check for a utils::result<> failure is no longer needed,
since the previous checks for failure resulted in an immediate co_return
rather than propagating the failure into a variable as with continuations.

The lambda coroutine is stabilized with the new seastar::coroutine::lambda
facility.
2022-10-17 22:03:11 +03:00
Avi Kivity
8b019841d8 cql3: indexed_table_select_statement: de-result_wrap() do_execute_base_query()
It's an obstacle to coroutinization as it introduces more lambdas.
2022-10-17 22:03:11 +03:00
Tomasz Grabiec
4ff204c028 Merge 'cache: make all removals of cache items explicit' from Michał Chojnowski
This series is a step towards non-LRU cache algorithms.

Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor.

However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`.

This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction.

Closes #11716

* github.com:scylladb/scylladb:
  utils: lru: assert that evictables are unlinked before destruction
  utils: lru: remove unlink_from_lru()
  cache: make all cache unlinks explicit
2022-10-17 12:47:02 +02:00
Michał Chojnowski
a96433d3a4 utils: lru: assert that evictables are unlinked before destruction
Previous patches introduce the assumption that evictables are manually unlinked
before destruction, to allow for correct bookkeeping within the cache.
This assert assures that this assumptions is correct.

This is particularly important because the switch from automatic to explicit
unlinking had to be done manually. Destructor calls are invisible, so it's
possible that we have missed some automatic destruction site.
2022-10-17 12:07:27 +02:00
Michał Chojnowski
f340c9cca5 utils: lru: remove unlink_from_lru()
unlink_from_lru() allows for unlinking elements from cache without notifying
the cache. This messes up any potential cache bookkeeping.
Improved that by replacing all uses of unlink_from_lru() with calls to
lru::remove(), which does have access to cache's metadata.
2022-10-17 12:07:27 +02:00
Michał Chojnowski
d785364375 cache: make all cache unlinks explicit
Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning
that elements of the list unlink themselves from the list automatically on
destruction. Some parts of the code rely on that, and don't unlink them
manually.

However, this precludes accurate bookkeeping about the cache. Elements only have
access to themselves and their neighbours, not to any bookkeeping context.
Therefore, a destructor cannot update the relevant metadata.

In this patch, we fix this by adding explicit unlink calls to places where it
would be done by a destructor. In a following patch, we will add an assert to
the destructor to check that every element is unlinked before destruction.
2022-10-17 12:07:27 +02:00
Nadav Har'El
c31bf4184f test/cql-pytest: two reproducers for SI returning oversized pages
This patch has two reproducing tests for issue #7432, which are cases
where a paged query with a restriction backed by a secondary-index
returns pages larger than the desired page size. Because these tests
reproduce a still-open bug, they are both marked "xfail". Both tests
pass on Cassandra.

The two tests involve quite dissimilar casess - one involves requesting
an entire partition (and Scylla forgetting to page through it), and the
other involves GROUP BY - so I am not sure these two bugs even have the
same underlying cause. But they were both reported in #7432, so let's
have reproducers for both.

Refs #7432

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11586
2022-10-17 11:36:05 +03:00
Botond Dénes
d85208a574 replica/database: revert initial boost to system semaphore with set_resources()
Unlike the current method (which uses consume()), this will also adjust the
initial resources, adjusting the semaphore as if it was created with the
reduced amount of resources in the first place. This fixes the confusing
90/100 count resources seen in diagnostics dump outputs.
2022-10-17 07:39:20 +03:00
Botond Dénes
ecc7c72acd reader_concurrency_semaphore: add set_resources()
Allowing to change the total or initial resources the semaphore has.
After calling `set_resources()` the semaphore will look like as if it
was created with the specified amount of resources when created.
2022-10-17 07:39:20 +03:00
Avi Kivity
e5e7780f32 test: work around modern pytest rejecting site-packages
Modern (as of Fedora 37) pytest has the "-sP" flags in the Python command
line, as found in /usr/bin/pytest. This means it will reject the
site-packages directory, where we install the Scylla Python driver. This
causes all the tests to fail.

Work around it by supplying an alternative pytest script that does not
have this change.

Closes #11764
2022-10-17 07:18:33 +03:00
Nadav Har'El
9f02431064 test/cql-pytest: fix test_permissions.py when running with "--ssl"
The tests in test_permissions.py use the new_session() utility function
to create a new connection with a different logged-in user.

It models the new connection on the existing one, but incorrectly
assumed that the connection is NOT ssl. This made this test failed
with cql-pytest/run is passed the "--ssl" option.

In this patch we correctly infer the is_ssl state from the existing
cql fixture, instead of assuming it is false. After this pass,
"cql-pytest/run --ssl" works as expected for this test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11742
2022-10-17 06:46:46 +03:00
Tomasz Grabiec
c8a372ae7f test: db: Add test for row merging involving many versions
The test verifies that a row which participated in earlier merge, and
its cells lost on the timestamp check, behaves exactly like an empty
row and can accept any mutation.

This wasn't the case in versions prior to f006acc.

Closes #11787
2022-10-16 14:29:49 +03:00
Tomasz Grabiec
5d7e40af99 mvcc: Add snapshot details to the printout of partition_entry
Useful for debugging.

Closes #11788
2022-10-16 14:22:14 +03:00
Nadav Har'El
d2cd9b71b3 Merge 'Make tracing test run again, simplify backend registry and few related cleanups' from Pavel Emelyanov
It turned out that boost/tracing test is not run because its name doesn't match the *_test.cc pattern. While fixing it it turned out that the test cannot even start, because it uses future<>.get() calls outside of seastar::thread context. While patching this place the trace-backend registry was removed for simplicity. And, while at it, few more cleanups "while at it"

Closes #11779

* github.com:scylladb/scylladb:
  tracing: Wire tracing test back
  tracing: Indentation fix after previous patch
  tracing: Move test into thread
  tracing: Dismantle trace-backend registry
  tracing: Use class-registrator for backends
  tracing: Add constraint to trace_state::begin()
  tracing: Remove copy-n-paste comments from test
  tracing: Outline may_create_new_session
2022-10-16 12:32:17 +03:00
Nadav Har'El
1f936838ba Merge 'doc: fix the notes on the OS Support by Platform and Version page' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11773

This PR fixes the notes by removing repetition and improving the clairy of the notes on the OS Support page.
In addition, "Scylla" was replaced with "ScyllaDB" on related pages.

Closes #11783

* github.com:scylladb/scylladb:
  doc: replace Scylla with ScyllaDB
  doc: add a comment to remove in future versions any information that refers to previous releases
  doc: rewrite the notes to improve clarity
  doc: remove the reperitions from the notes
2022-10-16 10:13:50 +03:00
Tomasz Grabiec
87b7e7ff9c Merge 'storage_proxy: prepare for fencing, complex ops' from Avi Kivity
Following up on 69aea59d97, which added fencing support
for simple reads and writes, this series does the same for the
complex ops:
 - partition scan
 - counter mutation
 - paxos

With this done, the coordinator knows about all in-flight requests and
can delay topology changes until they are retired.

Closes #11296

* github.com:scylladb/scylladb:
  storage_proxy: hold effective_replication_map for the duration of a paxos transaction
  storage_proxy: move paxos_response_handler class to .cc file
  storage_proxy: deinline paxos_response_handler constructor/destructor
  storage_proxy: use consistent effective_replication_map for counter coordinator
  storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
  storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
  storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
  storage_proxy: query_singular: use fewer smart pointers
  storage_proxy: query_singular: simplify lambda captures
  locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata
  storage_proxy: use consistent token_metadata with rest of singular read
2022-10-14 15:44:35 +02:00
Pavel Emelyanov
6150214da3 Add rust/Cargo.lock to .gitignore
The file appears after build

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11776
2022-10-14 13:54:50 +03:00
Anna Stuchlik
09b0e3f63e doc: replace Scylla with ScyllaDB 2022-10-14 11:06:27 +02:00
Anna Stuchlik
9e2b7e81d3 doc: add a comment to remove in future versions any information that refers to previous releases 2022-10-14 10:53:17 +02:00
Anna Stuchlik
fc0308fe30 doc: rewrite the notes to improve clarity 2022-10-14 10:48:59 +02:00
Anna Stuchlik
1bd0bc00b3 doc: remove the reperitions from the notes 2022-10-14 10:32:52 +02:00
Botond Dénes
621e43a0c8 Merge 'dirty_memory_manager: tidy up' from Avi Kivity
A collection of small cleanups, and a bug fix.

Closes #11750

* github.com:scylladb/scylladb:
  dirty_memory_manager: move region_group data members to top-of-class
  dirty_memory_manager: update region_group comment
  dirty_memory_manager: remove outdated friend
  dirty_memory_manager: fold region_group::push_back() into its caller
  dirty_memory_manager: simplify blocked calculation in region_group::run_when_memory_available
  dirty_memory_manager: remove unneeded local from region_group::run_when_memory_is_available
  dirty_memory_manager: tidy up region_group::execution_permitted()
  dirty_memory_manager: reindent region_group::release_queued_allocations()
  dirty_memory_manager: convert region_group::release_queued_allocations() to a coroutine
  dirty_memory_manager: move region_group::_releaser after _shutdown_requested
  dirty_memory_manager: move region_group queued allocation releasing into a function
  dirty_memory_manager: fold allocation_queue into region_group
  dirty_memory_manager: don't ignore timeout in allocation_queue::push_back()
2022-10-14 06:56:42 +03:00
Avi Kivity
1feaa2dfb4 storage_proxy: handle_write: use coroutine::all() instead of when_all()
coroutine::all() saves an allocation. Since it's safe for lambda
coroutines, remove a coroutine::lambda wrapper.

Closes #11749
2022-10-14 06:56:16 +03:00
Tomasz Grabiec
ee2398960c Merge 'service/raft: simplify raft_address_map' from Kamil Braun
The `raft_address_map` code was "clever": it used two intrusive data structures and did a lot of manual lifetime management; raw pointer manipulation, manual deletion of objects... It wasn't clear who owns which object, who is responsible for deleting what. And there was a lot of code.

In this PR we replace one of the intrusive data structures with a good old `std::unordered_map` and make ownership clear by replacing the raw pointers with `std::unique_ptr`. Furthermore, some invariants which were not clear and enforced in runtime are now encoded in the type system.

The code also became shorter: we reduced its length from ~360 LOC to ~260 LOC.

Closes #11763

* github.com:scylladb/scylladb:
  service/raft: raft_address_map: get rid of `is_linked` checks
  service/raft: raft_address_map: get rid of `to_list_iterator`
  service/raft: raft_address_map: simplify ownership of `expiring_entry_ptr`
  service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
  service/raft: raft_address_map: don't use intrusive set for timestamped entries
  service/raft: raft_address_map: store reference to `timestamped_entry` in `expiring_entry_ptr`
2022-10-13 18:08:49 +02:00
Kamil Braun
954849799d test/topology: disable flaky test_decommission_add_column
Flaky due to #11780, causes next promotion failures.
We can reenable it after the issue is fixed or a workaround is found.
2022-10-13 17:45:46 +02:00
Pavel Emelyanov
707efb6dfb tracing: Wire tracing test back
The boost/tracing test is not run, because test.py boost suite collects
tests that match *_test.cc pattern. The tracing one apparently doesn't

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:59:13 +03:00
Pavel Emelyanov
5b67a2a876 tracing: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:59:08 +03:00
Pavel Emelyanov
53ac8536f1 tracing: Move test into thread
The test calls future<>.get()'s in its lambda which is only allowed in
seastar threads. It's not stepped upon because (surprise, surprise) this
test is not run at all. Next patch fixes it.

Meanwhile, the fix is in using cql_env_thread thing for the whole lambda
which runs in it seastar::async() context

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:57:35 +03:00
Pavel Emelyanov
5c8a61ace2 tracing: Dismantle trace-backend registry
It's not used any longer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:57:24 +03:00
Pavel Emelyanov
fe7d38661c tracing: Use class-registrator for backends
Currently the code uses its own class registration engine, but there's a
generic one in utils/ that applies here too. In fact, the tracing
backend registry is just a transparent wrapper over the generic one :\

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:56:24 +03:00
Pavel Emelyanov
1adb2c8cc3 tracing: Add constraint to trace_state::begin()
It expects that the function is (void) and returns back a string

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:56:08 +03:00
Pavel Emelyanov
0a6a5a242e tracing: Remove copy-n-paste comments from test
Tests don't have supervisor, so there's no sense in keeping these bits

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:55:40 +03:00
Pavel Emelyanov
79820c2006 tracing: Outline may_create_new_session
It's a private method used purely in tracing.cc, no need in compiling it
every time the header is met somewhere else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:55:14 +03:00
Kamil Braun
5a9371bcb0 service/raft: raft_address_map: get rid of is_linked checks
Being linked is an invariant of `expiring_entry_ptr`. Make it explicit
by moving the `_expiring_list.push_front` call into the constructor.
2022-10-13 15:17:07 +02:00
Kamil Braun
cdf3367c05 service/raft: raft_address_map: get rid of to_list_iterator
Unnecessary.
2022-10-13 15:17:06 +02:00
Kamil Braun
0e29495c38 service/raft: raft_address_map: simplify ownership of expiring_entry_ptr
The owner of `expiring_entry_ptr` was almost uniquely its corresponding
`timestamp_entry`; it would delete the expiring entry when it itself got
destroyed. There was one call to explicit `unlink_and_dispose`, which
made the picture unclear.

Make the picture clear: `timestamped_entry` now contains a `unique_ptr`
to its `expiring_entry_ptr`. The `unlink_and_dispose` was replaced with
`_lru_entry = nullptr`.

We can also get rid of the back-reference from `expiring_entry_ptr` to
`timestamped_entry`.

The code becomes shorter and simpler.
2022-10-13 15:16:40 +02:00
Petr Gusev
c76cf5956d removenode: don't stream data from the leaving node
If a removenode is run for a recently stopped node,
the gossiper may not yet know that the node is down,
and the removenode will fail with a Stream failed error
trying to stream data from that node.

In this patch we explicitly reject removenode operation
if the gossiper considers the leaving node up.

Closes #11704
2022-10-13 15:11:32 +02:00
Takuya ASADA
49d5e51d76 reloc: add support stripped binary installation for relocatable package
This add support stripped binary installation for relocatable package.
After this change, scylla and unified packages only contain stripped binary,
and introduce "scylla-debuginfo" package for debug symbol.
On scylla-debuginfo package, install.sh script will extract debug symbol
at /opt/scylladb/<dir>/.debug.

Note that we need to keep unstripped version of relocatable package for rpm/deb,
otherwise rpmbuild/debuild fails to create debug symbol package.
This version is renamed to scylla-unstripped-$version-$release.$arch.tar.gz.

See #8918

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #9005
2022-10-13 15:11:32 +02:00
Asias He
6134fe4d1f storage_service: Prevent removed node to rejoin in handle_state_normal
- Start n1, n2, n3 (127.0.0.3)
- Stop n3
- Change ip address of n3 to 127.0.0.33 and restart n3
- Decommission n3
- Start new node n4

The node n4 will learn from the gossip entry for 127.0.0.3 that node
127.0.0.3 is in shutdown status which means 127.0.0.3 is still part of
the ring.

This patch prevents this by checking the status for the host id on all
the entries. If any of the entries shows the node with the host id is in
LEFT status, reject to put the node in NORMAL status.

Fixes #11355

Closes #11361
2022-10-13 15:11:32 +02:00
Jan Ciolek
52bbc1065c cql3: allow lists of IN elements to be NULL
Requests like `col IN NULL` used to cause
an error - Invalid null value for colum col.

We would like to allow NULLs everywhere.
When a NULL occurs on either side
of a binary operator, the whole operation
should just evaluate to NULL.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #11775
2022-10-13 15:11:32 +02:00
Avi Kivity
19e62d4704 commitlog: delete unused "num_deleted" variable
Since d478896d46 we update the variable, but never read it.
Clang 15 notices and complains. Remove the variable to make it
happy.

Closes #11765
2022-10-13 15:11:32 +02:00
Avi Kivity
a2da08f9f9 storage_proxy: hold effective_replication_map for the duration of a paxos transaction
Luckily, all topology calculations are done in get_paxos_participants(),
so all we have to do is it hold the effective_replication_map for the
duration of the transaction, and pass it to get_paxos_participants().
This ensures that the coordinator knows about all in-flight requests
and can fence them from topology changes.
2022-10-13 14:27:26 +03:00
Avi Kivity
69aaa5e131 storage_proxy: move paxos_response_handler class to .cc file
It's not used elsewhere.
2022-10-13 14:27:26 +03:00
Avi Kivity
b2f3934e95 storage_proxy: deinline paxos_response_handler constructor/destructor
They have no business being inline as it's a heavyweight object.
2022-10-13 14:27:26 +03:00
Avi Kivity
94e4ff11be storage_proxy: use consistent effective_replication_map for counter coordinator
Hold the effective_replication_map while talking to the counter leader,
to allow for fencing in the future. The code is somewhat awkward because
the API allows for multiple keyspaces to be in use.

The error code generation, already broken as it doesn't use the correct
table, continues to be broken in that it doesn't use the correct
effective_replication_map, for the same reason.
2022-10-13 14:27:23 +03:00
Avi Kivity
406a046974 storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
query_partition_key_range captures a token_metadata_ptr and uses
it consistently in sequential calls to query_partition_key_range_concurrent
(via tail recursion), but each invocation of
query_partition_key_range_concurrent captures its own
effective_replication_map_ptr. Since these are captured at different times,
they can be inconsistent after the first iteration.

Fix by capturing it once in the caller and propagating it everywhere.
2022-10-13 13:56:52 +03:00
Avi Kivity
5d320e95d5 storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
Capture token_metadata by reference rather than smart pointer, since
out effective_replication_map_ptr protects it.
2022-10-13 13:56:52 +03:00
Avi Kivity
f75efa965f storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
Derive the token_metadata from the effective_replication_map rather than
getting it independently. Not a real bug since these were in the same
continuation, but safer this way.
2022-10-13 13:56:52 +03:00
Avi Kivity
161ce4b34f storage_proxy: query_singular: use fewer smart pointers
Capture token_metadata by reference since we're protecting it with
the mighty effective_replication_map_ptr. This saves a few instructions
to manage smart pointers.
2022-10-13 13:56:33 +03:00
Avi Kivity
efd89c1890 storage_proxy: query_singular: simplify lambda captures
The lambdas in query_singular do not outlive the enclosing coroutine,
so they can capture everything by reference. This simplifies life
for a future update of the lambda, since there's one thing less to
worry about.
2022-10-13 13:52:54 +03:00
Avi Kivity
d9955ab35b locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata
token_metadata is protected by holders of an effective_replication_map_ptr,
so it's just as safe and less expensive for them to obtain a reference
to token_metadata rather than a smart pointer, so give them that option with
a new accessor.
2022-10-13 13:46:04 +03:00
Avi Kivity
86a48cf12f storage_proxy: use consistent token_metadata with rest of singular read
query_singular() uses get_token_metadata_ptr() and later, in
get_read_executor(), captures the effective_replication_map(). This
isn't a bug, since the two are captured in the same continuation and
are therefore consistent, but a way to ensure it stays so is to capture
the effective_replication_map earlier and derive the token_metadata from
it.
2022-10-13 13:46:04 +03:00
Avi Kivity
720fc733f0 dirty_memory_manager: move region_group data members to top-of-class
Rather than have them spread out throughout the class.
2022-10-13 13:12:01 +03:00
Avi Kivity
61b780ae63 dirty_memory_manager: update region_group comment
It's still named region_group. I may merge the whole thing into
dirty_memory_manager to retire the name.
2022-10-13 13:09:01 +03:00
Avi Kivity
7a5fa1497c dirty_memory_manager: remove outdated friend
That friend no longer exists.
2022-10-13 13:03:43 +03:00
Avi Kivity
02b7697051 dirty_memory_manager: fold region_group::push_back() into its caller
It is too trivial to live.
2022-10-13 13:03:43 +03:00
Avi Kivity
d403ecbed9 dirty_memory_manager: simplify blocked calculation in region_group::run_when_memory_available
- apply De Morgan's law
 - merge if block into boolean calculation
2022-10-13 13:03:43 +03:00
Avi Kivity
cb6c7023c1 dirty_memory_manager: remove unneeded local from region_group::run_when_memory_is_available 2022-10-13 13:03:43 +03:00
Avi Kivity
39668d5ae2 dirty_memory_manager: tidy up region_group::execution_permitted()
- remove excess parentheses
 - apply De Morgan's law
 - remove unneeded this->
 - whitespace cleanups
2022-10-13 13:03:43 +03:00
Avi Kivity
02706e78f9 dirty_memory_manager: reindent region_group::release_queued_allocations() 2022-10-13 13:03:43 +03:00
Avi Kivity
128f1c8c21 dirty_memory_manager: convert region_group::release_queued_allocations() to a coroutine
Nicer and faster. We have a rare case where we hold a lock for the duration
of a call but we don't want to hold it until the future it returns is
resolved, so we have to resort to a minor trick.
2022-10-13 13:03:29 +03:00
Avi Kivity
aad4c1c5e9 dirty_memory_manager: move region_group::_releaser after _shutdown_requested
The function that is attached to _releaser depends on
_shutdown_requested. There is currently now use-before-init, since
the function (release_queued_allocations) starts with a yield(),
moving the first use to until after the initialization.

Since I want to get rid of the yield, reorder the fields so that
they are initialized in the right order.
2022-10-13 13:00:50 +03:00
Raphael S. Carvalho
ec79ac46c9 db/view: Add visibility to view updating of Staging SSTables
Today, we're completely blind about the progress of view updating
on Staging files. We don't know how long it will take, nor how much
progress we've made.

This patch adds visibility with a new metric that will inform
the number of bytes to be processed from Staging files.
Before any work is done, the metric tell us the total size to be
processed. As view updating progresses, the metric value is
expected to decrease, unless work is being produced faster than
we can consume them.

We're piggybacking on sstables::read_monitor, which allows the
progress metric to be updated whenever the SSTable reader makes
progress.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11751
2022-10-12 16:57:37 +03:00
Avi Kivity
2e79bb431c tools: change source_location location
std::experimental::source_location is provided by <experimental/source_location>,
not <source_location>. libstdc++ 12 insists, so change the header.

Closes #11766
2022-10-12 15:29:14 +03:00
Takuya ASADA
6b246dc119 locator::ec2_snitch: Retry HTTP request to EC2 instance metadata service
EC2 instance metadata service can be busy, ret's retry to connect with
interval, just like we do in scylla-machine-image.

Fixes #10250

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #11688
2022-10-12 13:59:06 +03:00
Kamil Braun
92dd1f7307 service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
`timestamped_entry` had two fields:

```
optional<clock_time_point> _last_accessed
expiring_entry_ptr* _lru_entry
```

The `raft_address_map` data structure maintained an invariant:
`_last_accessed` is set if and only if `_lru_entry` is not null.
This invariant could be broken for a while when constructing an expiring
`timestamped_entry`: the constructor was given an `expiring = true`
flag, which set the `_last_accessed` field; this was redundant, because
immediately after a corresponding `expiring_entry_ptr` was constructed
which again reset the `_last_accessed` field and set `_lru_entry`.

The code becomes simpler and shorter when we move `_last_accessed` field
into `expiring_entry_ptr`. The invariant is now guaranteed by the type
system: `_last_accessed` is no longer `optional`.
2022-10-12 12:22:57 +02:00
Kamil Braun
262b9473d5 service/raft: raft_address_map: don't use intrusive set for timestamped entries
Intrusive data structures are harder to reason about. In
`raft_address_map` there's a good reason to use an intrusive list for
storing `expiring_entry_ptr`s: we move the entries around in the list
(when their expiration times change) but we want for the objects to stay
in place because `timestamped_entry`s may point to them (although we
could simply update the pointers using the existing back-reference...)

However, there's not much reason to store `timestamped_entry` in an
intrusive set. It was basically used in one place: when dropping expired
entries, we iterate over the list of `expiring_entry_ptr`s and we want
to drop the corresponding `timestamped_entry` as well, which is easy
when we have a pointer to the entry and it's a member of an intrusive
container. But we can deal with it when using non-intrusive containers:
just `find` the element in the container to erase it.

The code becomes shorter with this change.

I also use a map instead of a set because we need to modify the
`timestamped_entry` which wouldn't be possible if it was used as an
`unordered_set` key. In fact using map here makes more sense: we were
using the intrusive set similarly to a map anyway because all lookups
were performed using the `_id` field of `timestamped_entry` (now the
field was moved outside the struct, it's used as the map's key).
2022-10-12 12:22:50 +02:00
Kamil Braun
3e84b1f69c Merge 'test.py: topology fix ssl var and improve pylint score' from Alecco
When code was moved to the new directory, a bug was reintroduced with `ssl` local hiding `ssl` module. Fix again.

Closes #11755

* github.com:scylladb/scylladb:
  test.py: improve pylint score for conftest
  test.py: fix variable name collision with ssl
2022-10-12 11:41:11 +02:00
Avi Kivity
f673d0abbe build: support fmt 9 ostream formatter deprecation
fmt 9 deprecates automatic fallback to std::ostream formatting.
We should migrate, but in order to do so incrementally, first enable
the deprecated fallback so the code continues to compile.

Closes #11768
2022-10-12 09:27:36 +03:00
Avi Kivity
0952cecfc9 build: mark abseil as a system header
Abseil is not under our control, so if a header generates a
warning, we can do nothing about it. So far this wasn't a problem,
but under clang 15 it spews a harmless deprecation warning. Silence
the warning by treating the header as a system header (which it is,
for us).

Closes #11767
2022-10-12 09:27:36 +03:00
Kamil Braun
0c13c85752 service/raft: raft_address_map: store reference to timestamped_entry in expiring_entry_ptr
The class was storing a pointer which couldn't be null.
A reference is a better fit in this case.
2022-10-11 17:21:01 +02:00
Asias He
810b424a8c storage_service: Reject to bootstrap new node when node has unknown gossip status
- Start a cluster with n1, n2, n3
- Full cluster shutdown n1, n2, n3
- Start n1, n2 and keep n3 as shutdown
- Add n4

Node n4 will learn the ip and uuid of n3 but it does not know the gossip
status of n3 since gossip status is published only by the node itself.
After full cluster shutdown, gossip status of n3 will not be present
until n3 is restarted again. So n4 will not think n3 is part of the
ring.

In this case, it is better to reject the bootstrap.

With this patch, one would see the following when adding n4:

```
ERROR 2022-09-01 13:53:14,480 [shard 0] init - Startup failed:
std::runtime_error (Node 127.0.0.3 has gossip status=UNKNOWN. Try fixing it
before adding new node to the cluster.)
```

The user needs to perform either of the following before adding a new node:

1) Run nodetool removenode to remove n3
2) Restart n3 to get it back to the cluster

Fixes #6088

Closes #11425
2022-10-11 15:47:34 +03:00
Botond Dénes
378c6aeebd Merge 'More Raft upgrade tests' from Kamil Braun
Refactor the existing upgrade tests, extracting some common functionality to
helper functions.

Add more tests. They are checking the upgrade procedure and recovery from
failure in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.

Add some new functionalities to `ScyllaRESTAPIClient` like injecting errors and
obtaining gossip generation numbers.

Extend the removenode function to allow ignoring dead nodes.

Improve checking for CQL availability when starting nodes to speed up testing.

Closes #11725

* github.com:scylladb/scylladb:
  test/topology_raft_disabled: more Raft upgrade tests
  test/topology_raft_disabled: refactor `test_raft_upgrade`
  test/pylib: scylla_cluster: pass a list of ignored nodes to removenode
  test/pylib: rest_client: propagate errors from put_json
  test/pylib: fix some type hints
  test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up
2022-10-11 15:30:00 +03:00
Kamil Braun
08e654abf5 Merge 'raft: (service) cleanups on the path for dynamic IP address support' from Konstantin Osipov
In preparation for supporting IP address changes of Raft Group 0:
1) Always use start_server_for_group0() to start a server for group 0.
   This will provide a single extension point when it's necessary to
   prompt raft_address_map with gossip data.
2) Don't use raft::server_address in discovery, since going forward
   discovery won't store raft::server_address. On the same token stop
   using discovery::peer_set anywhere outside discovery (for persistence),
   use a peer_list instead, which is easier to marshal.

Closes #11676

* github.com:scylladb/scylladb:
  raft: (discovery) do not use raft::server_address to carry IP data
  raft: (group0) API refactoring to avoid raft::server_address
  raft: rename group0_upgrade.hh to group0_fwd.hh
  raft: (group0) move the code around
  raft: (discovery) persist a list of discovered peers, not a set
  raft: (group0) always start group0 using start_server_for_group0()
2022-10-11 13:43:41 +02:00
Asias He
58c65954b8 storage_service: Reject decommission if nodes are down
- Start n1, n2, n3

- Apply network nemesis as below:
  + Block gossip traffic going from nodes 1 and 2 to node 3.
  + All the other rpc traffic flows normally, including gossip traffic
    from node 3 to nodes 1 and 2 and responses to node_ops commands from
    nodes 1 and 2 to node 3.

- Decommission n3

Currently, the decommission will be successful because all the network
traffic is ok. But n3 could not advertise status STATUS_LEFT to the rest
of the cluster due to the network nemesis applied. As a result, n1 and
n3 could not move the n3 from STATUS_LEAVING to STATUS_LEFT, so n3 will
stay in DL forever.

I know why the node stays DL forever. The problem is that with
node_ops_cmd based node operation, we still rely on the gossip status of
STATUS_LEFT from the node being decommissioned to notify other nodes
this node has finished decommission and can be moved from STATUS_LEAVING
to STATUS_LEFT.

This patch fixes by checking gossip liveness before running
decommission. Reject if required peer nodes are down.

With the fix, the decommission of n3 will fail like this:

$ nodetool decommission -p 7300
nodetool: Scylla API server HTTP POST to URL
'/storage_service/decommission' failed: std::runtime_error
(decommission[adb3950e-a937-4424-9bc9-6a75d880f23d]: Rejected
decommission operation, removing node=127.0.0.3, sync_nodes=[127.0.0.2,
127.0.0.3, 127.0.0.1], ignore_nodes=[], nodes_down={127.0.0.1})

Fixes #11302

Closes #11362
2022-10-11 14:09:28 +03:00
Botond Dénes
917fdb9e53 Merge "Cut database-system_keyspace circular dependency" from Pavel Emelyanov
"
There's one via the database's compaction manager and large data handler
sub-services. Both need system keyspace to put their info into, but the
latter needs database naturally via query_processor->storage_proxy link.

The solution is to make c.m. | l.d.h. -> sys.ks. dependency be weak with
the help of shared_from_this(), described in details in patch #2 commit
message.

As a (not-that-)side effect this set removes a bunch of global qctx
calls.

refs: #11684 (this set seem to increase the chance of stepping on it)
"

* 'br-sysks-async-users' of https://github.com/xemul/scylla:
  large_data_handler: Use local system_keyspace to update entries
  system_keyspace: De-static compaction history update
  compaction_manager: Relax history paths
  database: Plug/unplug system_keyspace
  system_keyspace: Add .shutdown() method
2022-10-11 08:52:04 +03:00
Nadav Har'El
ef0da14d6f test/cql-pytest: add simple tests for USE statement
This patch adds a couple of simple tests for the USE statement: that
without USE one cannot create a table without explicitly specifying
a keyspace name, and with USE, it is possible.

Beyond testing these specific feature, this patch also serves as an
example of how to write more tests that need to control the effective USE
setting. Specifically, it adds a "new_cql" function that can be used to
create a new connection with a fresh USE setting. This is necessary
in such tests, because if multiple tests use the same cql fixture
and its single connection, they will share their USE setting and there
is no way to undo or reset it after being set.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11741
2022-10-11 08:20:19 +03:00
Kamil Braun
df2fb21972 test/topology: reenable test_remove_node_add_column
After #11691 was merged the test should no longer be flaky. Reenable it.

Closes #11754
2022-10-11 08:18:20 +03:00
Pavel Emelyanov
8b8b37cdda system_keyspace: Dont maintain dc/rack cache
Some good news finally. The saved dc/rack info about the ring is now
only loaded once on start. So the whole cache is not needed and the
loading code in storage_service can be greatly simplified

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
775f42c8d1 system_keyspace: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
8f1df240c7 system_keyspace: Coroutinuze build_dc_rack_info()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
b6061bb97d topology: Move all post-configuration to topology::config
Because of snitch ex-dependencies some bits on topology were initialized
with nasty post-start calls. Now it all can be removed and the initial
topology information can be provided by topology::config

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
56d4863eb6 snitch: Start early
Snitch code doesn't need anything to start working, but it is needed by
the low-level token-metadata, so move the snitch to start early (and to
stop late)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
16188a261e gossiper: Do not export system keyspace
No users of it left. Despite the gossiper->system_keyspace dependency is
not needed either, keep it alive because gossiper still updates system
keyspace with feature masks, so chances are it will be reactivated some
time later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
2bb354b2e7 snitch: Remove gossiper reference
It doesn't need gossiper any longer. This change will allow starting
snitch early by the next patch, and eventually improving the
token-metadata start-up sequence

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
26f9472f21 snitch: Mark get_datacenter/_rack methods const
They are in fact such, but wasn't marked as const before because they
wanted to talk to non-const gossiper and system_keyspaces methods and
updated snitch internal caches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
e9bd912f79 snitch: Drop some dead dependency knots
After previous patches and merged branches snitch no longer needs its
method that gets dc/rack for endpoints from gossiper, system keyspace
and its internal caches.

This cuts the last but the biggest snitch->gossiper dependency.
Also this removes implicit snitch->system_keyspace dependency loop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
4206b1f98f snitch, code: Make get_datacenter() report local dc only
The continuation of the previous patch -- all the code uses
topology::get_datacenter(endpoint) to get peers' dc string. The topology
still uses snitch for that, but it already contains the needed data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
6c6711404f snitch, code: Make get_rack() report local rack only
All the code out there now calls snitch::get_rack() to get rack for the
local node. For other nodes the topology::get_rack(endpoint) is used.
Since now the topology is properly populated with endpoints, it can
finally be patched to stop using snitch and get rack from its internal
collections

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
bc813771e8 storage_service: Populate pending endpoint in on_alive()
A special-purpose add-on to the previous patch.

When messaging service accepts a new connection it sometimes may want to
drop it early based on whether the client is from the same dc/rack or
not. However, at this stage the information might have not yet had
chances to be spread via storage service pending-tokens updating paths,
so here's one more place -- the on_alive() callback

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
1be97a0a76 code: Populate pending locations
Previous patches added the concept of pending endpoints in the topology,
this patch populates endpoints in this state.

Also, the set_pending_ranges() is patched to make sure that the tokens
added for the enpoint(s) are added for something that's known by the
topology. Same check exists in update_normal_tokens()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
b61bd6cf56 topology: Put local dc/rack on topology early
Startup code needs to know the dc/rack of the local node early, way
before nodes starts any communication with the ring. This information is
available when snitch activates, but it starts _after_ token-metadata,
so the only way to put local dc/rack in topology is via a startup-time
special API call. This new init_local_endpoint() is temporary and will
be removed later in this set

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
da75552e1f topology: Add pending locations collection
Nowadays the topology object only keeps info about nodes that are normal
members of the ring. Nodes that are joining or bootstrapping or leaving
are out of it. However, one of the goals of this patchset is to make
topology object provide dc/rack info for _all_ nodes, even those in
transitive state.

The introduced _pending_locations is about to hold the dc/rack info for
transitive endpoints. When a node becomes member of the ring it is moved
from pending (if it's there) to current locations, when it leaves the
ring it's moved back to pending.

For now the new collection is just added and the add/remove/get API is
extended to maintain it, but it's not really populated. It will come in
the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
fa613285e7 topology: Make get_location() errors more verbose
Currently if topology.get_location() doesn't find an entry in its
collection(s) it throws standard out-of-range exception which's very
hard to debug.

Also, next patches will extend this method, the introduced here if
(_current_locations.contains()) makes this future change look nicer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
d60ebc5ace token_metadata: Add config, spread everywhere
Next patches will need to provide some early-start data for topology.
The standard way of doing it is via service config, so this patch adds
one. The new config is empty in this patch, to be filled later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
7c211e8e50 token_metadata: Hide token_metadata_impl copy constructor
Copying of token_metadata_impl is heavy operation and it's performed
internally with the help of the dedicated clone_async() method. This
method, in turn, doesn't copy the whole object in its copy-ctor, but
rather default-initializes it and carries the remaining fields later.

Having said that, the standart copy-ctor is better to be made private
and, for the sake of being more explicit, marked as shallow-copy-ctor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
072ef88ed1 gosspier: Remove messaging service getter
No code needs to borrow messaging from gossiper, which is nice

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
66bc84d217 snitch: Get local address to gossip via config
The property-file snitch gossips listen_address as internal-IP state. To
get this value it gets it from snitch->gossiper->messaging_service
chain. This change provides the needed value via config thus cutting yet
another snitch->gossiper dependency and allowing gossiper not to export
messaging service in the future

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
77bde21024 storage_service: Shuffle on_alive() callback
No functional changes, just keep some conditions from if()s as local
variables. This is the churn-reducing preparation for one of the the
next patches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
583204972e api: Don't report dc/rack for endpoints not in ring
When an endpoint is not in ring the snitch/get_{rack|datacenter} API
still return back some value. The value is, in fact, the default one,
because this is how snitch resolves it -- when it cannot find a node in
gossiper and system keyspace it just returns defaults.

When this happens the API should better return some error (bad param?)
but there's a bug in nodetool -- when the 'status' command collects info
about the ring it first collects the endpoints, then gets status for
each. If between getting an endpoint and getting its status the endpoint
disappears, the API would fail, but nodetool doesn't handle it.

Next patches will make .get_rack/_dc calls use in-topology collections
that don't fall-back to default values if the entry is not found in it,
so prepare the API in advance to return back defaults.

refs: #11706

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:12:47 +03:00
Konstantin Osipov
3e46c32d7b raft: (discovery) do not use raft::server_address to carry IP data
We plan to remove IP information from Raft addresses.
raft::server_address is used in Raft configuration and
also in discovery, which is a separate algorithm, as a handy data
structure, to avoid having new entities in RPC.

Since we plan to remove IP addresses from Raft configuration,
using raft::server_address in discovery and still storing
IPs in it would create ambiguity: in some uses raft::server_address
would store an IP, and in others - would not.

So switch to an own data structure for the purposes of discovery,
discovery_peer, which contains a pair ip, raft server id.

Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
2022-10-10 16:24:33 +03:00
Pavel Emelyanov
b1f4273f0d large_data_handler: Use local system_keyspace to update entries
The l._d._h.'s way to update system keyspace is not like in other code.
Instead of a dedicated helper on the system_keyspace's side it executes
the insertion query directly with the help of qctx.

Now when the l._d._h. has the weak system keyspace reference it can
execute queries on _it_ rather than on the qctx.

Just like in previous patch, it needs to keep the sys._k.s. weak
reference alive until the query's future resolves.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 16:20:59 +03:00
Pavel Emelyanov
907fd2d355 system_keyspace: De-static compaction history update
Compaction manager now has the weak reference on the system keyspace
object and can use it to update its stats. It only needs to take care
and keep the shared pointer until the respective future resolves.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 16:20:59 +03:00
Pavel Emelyanov
3e0b61d707 compaction_manager: Relax history paths
There's a virtual method on table_state to update the entry in system
keyspace. It's an overkill to facilitate tests that don't want this.
With new system_keyspace weak referencing it can be made simpled by
moving the updating call to the compaction_manager itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 16:20:59 +03:00
Pavel Emelyanov
f9b57df471 database: Plug/unplug system_keyspace
There's a circular dependency between system_keyspace and database. The
former needs the latter because it needs to execula local requests via
query_processor. The latter needs the former via compaction manager and
large data handler, database depends on both and these too need to
insert their entries into system keyspace.

To cut this loop the compaction manager and large data handler both get
a weak reference on the system keysace. Once system keyspace starts is
activcates this reference via the database call. When system keyspace is
shutdown-ed on stop, it deactivates the reference.

Technically the weak reference is implemented by marking the system_k.s.
object as async_sharded_service, and the "reference" in question is the
shared_from_this() pointer. When compaction manager or large data
handler need to update a system keyspace's table, they both hold an
extra reference on the system keyspace until the entry is committed,
thus making sure that sys._k.s. doesn't stop from under their feet. At
the same time, unplugging the reference on shutdown makes sure that no
new entries update will appear and the system_k.s. will eventually be
released.

It's not a C++ classical reference, because system_keyspace starts after
and stops before database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 16:20:59 +03:00
Konstantin Osipov
8857e017c7 raft: (group0) API refactoring to avoid raft::server_address
Replace raft::server_address in a few raft_group0 API
calls with raft::server_id.

These API calls do not need raft::server_address, i.e. the
address part, anyway, and since going forward raft::server_address
will not contain the IP address, stop using it in these calls.

This is a beginning of a multi-patch series to reduce
raft::server_address usage to core raft only.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
224dd9ce1e raft: rename group0_upgrade.hh to group0_fwd.hh
The plan is to add other group-0-related forward declarations
to this file, not just the ones for upgrade.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
e226624daf raft: (group0) move the code around
Move load/store functions for discovered peers up,
since going forward they'll be used to in start_server_for_group0(),
to extend the address map prior to start (and thus speed up
bootstrap).
2022-10-10 15:58:48 +03:00
Konstantin Osipov
199b6d6705 raft: (discovery) persist a list of discovered peers, not a set
We plan to reuse the discovery table to store the peers
after discovery is over, so load/store API must be generalized
to use outside discovery. This includes sending
the list of persisted peers over to a new member of the cluster.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
746322b740 raft: (group0) always start group0 using start_server_for_group0()
When IP addresses are removed from raft::configuration, it's key
to initialize raft_address_map with IP addresses before we start group
0. Best place to put this initialization is start_server_for_group0(),
so make sure all paths which create group 0 use
start_server_for_group0().
2022-10-10 15:58:48 +03:00
Kamil Braun
4974a31510 test/topology_raft_disabled: more Raft upgrade tests
The tests are checking the upgrade procedure and recovery from failure
in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.

Added some new functionalities to `ScyllaRESTAPIClient` like injecting
errors and obtaining gossip generation numbers.
2022-10-10 14:32:10 +02:00
Pavel Emelyanov
caed12c8f2 system_keyspace: Add .shutdown() method
Many services out there have one (sometimes called .drain()) that's
called early on stop and that's responsible for prearing the service for
stop -- aborting pending/in-flight fibers and alike.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 15:29:33 +03:00
Kamil Braun
4460b4e63c test/topology_raft_disabled: refactor test_raft_upgrade
Take reusable parts out of the test to helper functions.
2022-10-10 12:59:12 +02:00
Kamil Braun
fa8dcb0d54 test/pylib: scylla_cluster: pass a list of ignored nodes to removenode
The `removenode` operation normally requires the removing node to
contact every node in the cluster except the one that is being removed.
But if more than 1 node is down it's possible to specify a list of nodes
to ignore for the operation; the `/storage_service/remove_node` endpoint
accepts an `ignore_nodes` param which is a comma-separated list of IPs.

Extend `ScyllaRESTAPIClient`, `ScyllaClusterManager` and `ManagerClient`
so it's possible to pass the list of ignored nodes.

We also modify the `/cluster/remove-node` Manager endpoint to use
`put_json` instead of `get_text` and pass all parameters except the
initiator IP (the IP of the node who coordinates the `removenode`
operation) through JSON. This simplifies the URL greatly (it was already
messy with 3 parameters) and more closely resembles Scylla's endpoint.
2022-10-10 12:59:12 +02:00
Kamil Braun
130ab1d312 test/pylib: rest_client: propagate errors from put_json 2022-10-10 12:59:12 +02:00
Kamil Braun
63892326d5 test/pylib: fix some type hints 2022-10-10 12:59:12 +02:00
Kamil Braun
6e3fe13fcf test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up
Do a simple `SELECT` instead.

This speeds up tests - creating and dropping keyspaces is relatively
expensive, and we did this on every server restart.
2022-10-10 12:59:12 +02:00
Alejo Sanchez
7e2a3f2040 test.py: improve pylint score for conftest
Remove unused imports, fix long lines, add ignore flags.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-10 12:07:41 +02:00
Alejo Sanchez
aa1f4a321c test.py: fix variable name collision with ssl
Change variable name to avoid collision with module ssl.

This bug was reintroduced when moving code.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-10 11:59:13 +02:00
Pavel Emelyanov
53bad617c0 virtual_tables: Use token_metadata.is_member()
This method just jumps into topology.has_endpoint(). The change is
for consistency with other users of it and as a preparation for
topology.has_endpoint() future enhancements

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 12:16:19 +03:00
Tomasz Grabiec
fcf0628bc5 dbuild: Use .gdbinit from the host
Useful when starting gdb inside the dbuild container.
Message-Id: <20221007154230.1936584-1-tgrabiec@scylladb.com>
2022-10-09 11:14:33 +03:00
Petr Gusev
0923cb435f raft: mark removed servers as expiring instead of dropping them
There is a flaw in how the raft rpc endpoints are
currently managed. The io_fiber in raft::server
is supposed to first add new servers to rpc, then
send all the messages and then remove the servers
which have been excluded from the configuration.
The problem is that the send_messages function
isn't synchronous, it schedules send_append_entries
to run after all the current requests to the
target server, which can happen
after we have already removed the server from address_map.

In this patch the remove_server function is changed to mark
the server_id as expiring rather than synchronously dropping it.
This means all currently scheduled requests to
that server will still be able to resolve
the ip address for that server_id.

Fixes: #11228

Closes #11748
2022-10-07 19:08:34 +02:00
Avi Kivity
55606a51cb dirty_memory_manager: move region_group queued allocation releasing into a function
It's nicer to see a function release_queued_allocations() in a stack
trace rather than start_releaser(), which has done its work during
initialization.
2022-10-07 17:27:43 +03:00
Avi Kivity
3e60d6c243 dirty_memory_manager: fold allocation_queue into region_group
allocation_queue was extracted out of region_group in
71493c253 and 34d532236. But now that region_group refactoring is
mostly done, we can move them back in. allocation_queue has just one
user and is not useful standalone.
2022-10-07 17:27:40 +03:00
Avi Kivity
01368830b5 dirty_memory_manager: don't ignore timeout in allocation_queue::push_back()
In 34d5322368 ("dirty_memory_manager: move more allocation_queue
functions out of region_group") we accidentally started ignoring the
timeout parameter. Fix that.

No release branch has the breakage.
2022-10-07 17:19:56 +03:00
Kamil Braun
06b87869ba Merge 'Raft transport error' from Gusev Petr
The `add_entry` and `modify_config` methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a `seastar::rpc::closed_error` would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling `modify_config` in `raft_group0`,
which sometimes broke the `removenode` command.

An `intermittent_connection_error` exception was added earlier to
solve a similar problem with the `read_barrier` method. In this patch it
is renamed to `transport_error`, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the destination and whether any mutations were made.

In case of `read_barrier` it does not matter and we just retry, in case
of `add_entry` and `modify_config` we cannot retry because of possible mutations,
so we convert this exception to `commit_status_unknown`, which
the client has to handle.

Explicit comments have also been added to `raft::server` methods
describing all possible exceptions.

Closes #11691

* github.com:scylladb/scylladb:
  raft_group0: retry modify_config on commit_status_unknown
  raft: convert raft::transport_error to raft::commit_status_unknown
2022-10-07 15:53:22 +02:00
Petr Gusev
12bb8b7c8d raft_group0: retry modify_config on commit_status_unknown
modify_config can throw commit_status_unknown in case
of a leader change or when the leader is unavailable,
but the information about it has not yet reached the
current node. In this patch modify_config is run
again after some time in this case.
2022-10-07 13:34:23 +04:00
Petr Gusev
d79fbab682 raft: convert raft::transport_error to raft::commit_status_unknown
The add_entry and modify_config methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a seastar::rpc::closed_error would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling modify_config in raft_group0,
which sometimes broke the removenode command.

An intermittent_connection_error exception was added earlier to
solve a similar problem with the read_barrier method. In this patch it
is renamed to transport_error, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the target node and whether any
actions were performed on it.

In case of read_barrier it does not matter and we just retry. In case
of add_entry and modify_config we cannot retry
because the rpc calls are not idempotent, so we convert this
exception to commit_status_unknown, which the client has to handle.

Explicit comments have also been added to raft::server methods
describing all possible exceptions.
2022-10-07 13:34:16 +04:00
Botond Dénes
b247f29881 Merge 'De-static system_keyspace::get_{saved|local}_tokens()' from Pavel Emelyanov
Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env

Closes #11738

* github.com:scylladb/scylladb:
  system_keyspace: Make get_{local|saved}_tokens non static
  size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
  cql_test_env: Keep sharded<system_keyspace> reference
  size_estimate_virtual_reader: Keep system_keyspace reference
  system_keyspace: Pass sys_ks argument to install_virtual_readers()
  system_keyspace: Make make() non-static
  distributed_loader: Pass sys_ks argument to init_system_keyspace()
  system_keyspace: Remove dangling forward declaration
2022-10-07 11:28:32 +03:00
Botond Dénes
992afc5b8c Merge 'storage_proxy: coroutinize some functions with do_with' from Avi Kivity
do_with() is a sure indicator for coroutinization, since it adds
an allocation (like the coroutine does with its frame). Therefore
translating a function with do_with is at least a break-even, and
usually a win since other continuations no longer allocate.

This series converts most of storage_proxy's function that have
do_with to coroutines. Two remain, since they are not simple
to convert (the do_with() is kept running in the background and
its future is discarded).

Individual patches favor minimal changes over final readability,
and there is a final patch that restores indentation.

The patches leave some moves from coroutine reference parameters
to the coroutine frame, this will be cleaned up in a follow-up. I wanted
this series not to touch headers to reduce rebuild times.

Closes #11683

* github.com:scylladb/scylladb:
  storage_proxy: reindent after coroutinization
  storage_proxy: convert handle_read_digest() to a coroutine
  storage_proxy: convert handle_read_mutation_data() to a coroutine
  storage_proxy: convert handle_read_data() to a coroutine
  storage_proxy: convert handle_write() to a coroutine
  storage_proxy: convert handle_counter_mutation() to a coroutine
  storage_proxy: convert query_nonsingular_mutations_locally() to a coroutine
2022-10-07 07:37:37 +03:00
Nadav Har'El
72dbce8d46 docs, alternator: mention S3 Import feature in compatibility.md
In August 2022, DynamoDB added a "S3 Import" feature, which we don't yet
support - so let's document this missing feature in the compatibility
document.

Refs #11739.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11740
2022-10-06 19:50:16 +03:00
Avi Kivity
20bad62562 Merge 'Detect and record large collections' from Benny Halevy
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.

A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table.  Documentation has been updated respectively.

A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partition, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't.  Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it cross the said threshold.

Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions is a smaller change overall, hence it was preferred.

Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.

In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.

Closes #11449

Closes #11674

* github.com:scylladb/scylladb:
  sstables: mx/writer: optimize large data stats members order
  sstables: mx/writer: keep large data stats entry as members
  db: large_data_handler: dynamically update config thresholds
  utils/updateable_value: add transforming_value_updater
  db/large_data_handler: cql_table_large_data_handler: record large_collections
  db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
  db/large_data_handler: cql_table_large_data_handler: move ctor out of line
  docs: large-rows-large-cells-tables: fix typos
  db/system_keyspace: add collection_elements column to system.large_cells
  gms/feature_service: add large_collection_detection cluster feature
  test: sstable_3_x_test: add test_sstable_too_many_collection_elements
  test: lib: simple_schema: add support for optional collection column
  test: lib: simple_schema: build schema in ctor body
  test: lib: simple_schema: cql: define s1 as static only if built this way
  db/large_data_handler: maybe_record_large_cells: consider collection_elements
  db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
  sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
  sstables: mx/writer: add large_data_type::elements_in_collection
  db/large_data_handler: get the collection_elements_count_threshold
  db/config: add compaction_collection_elements_count_warning_threshold
  test: sstable_3_x_test: add test_sstable_write_large_cell
  test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
  test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
  test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
  test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
2022-10-06 18:28:21 +03:00
Avi Kivity
62a4d2d92b Merge 'Preliminary changes for multiple Compaction Groups' from Raphael "Raph" Carvalho
What's contained in this series:
- Refactored compaction tests (and utilities) for integration with multiple groups
    - The idea is to write a new class of tests that will stress multiple groups, whereas the existing ones will still stress a single group.
- Fixed a problem when cloning compound sstable set (cannot be triggered today so I didn't open a GH issue)
- Many changes in replica::table for allowing integration with multiple groups

Next:
- Introduce for_each_compaction_group() for iterating over groups wherever needed.
- Use for_each_compaction_group() in replica::table operations spanning all groups (API, readers, etc).
- Decouple backlog tracker from compaction strategy, to allow for backlog isolation across groups
- Introduce static option for defining number of compaction groups and implement function to map a token to its respective group.
- Testing infrastructure for multiple compaction groups (helpful when testing the dynamic behavior: i.e. merging / splitting).

Closes #11592

* github.com:scylladb/scylladb:
  sstable_resharding_test: Switch to table_for_tests
  replica: Move compacted_undeleted_sstables into compaction group
  replica: Use correct compaction_group in try_flush_memtable_to_sstable()
  replica: Make move_sstables_from_staging() robust and compaction group friendly
  test: Rename column_family_for_tests to table_for_tests
  sstable_compaction_test: Use column_family_for_tests::as_table_state() instead
  test: Don't expose compound set in column_family_for_tests
  test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user()
  sstable_compaction_test: Merge table_state_for_test into column_family_for_tests
  sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables()
  sstable_compaction_test: Switch to table_state in compact_sstables()
  sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests
2022-10-06 18:23:47 +03:00
Kamil Braun
f94d547719 test.py: include modes in log file name
Instead of `test.py.log`, use:
`test.py.dev.log`
when running with `--mode dev`,
`test.py.dev-release.log`
when running with `--mode dev --mode release`,
and so on.

This is useful in Jenkins which is running test.py multiple times in
different modes; a later run would overwrite a previous run's test.py
file. With this change we can preserve the test.py files of all of these
runs.

Closes #11678
2022-10-06 18:20:39 +03:00
Kamil Braun
3af68052c4 test/topology: disable flaky test_remove_node_add_column test
The test was added recently and since then causes CI failures.

We suspect that it happens if the node being removed was the Raft group
0 leader. The removenode coordinator tries to send to it the
`remove_from_group0` request and fails.
A potential fix is in review: #11691.
2022-10-06 17:04:42 +02:00
Pavel Emelyanov
59da903054 system_keyspace: Make get_{local|saved}_tokens non static
Now all callers have system_keyspace reference at hand. This removes one
more user of the global qctx object

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 18:02:09 +03:00
Pavel Emelyanov
b03f1e7b17 size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
This method static calls system_keyspace::get_local_tokens(). Having the
system_keyspace reference will make this method non-static

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 18:00:09 +03:00
Pavel Emelyanov
4c099bb3ed cql_test_env: Keep sharded<system_keyspace> reference
There's a test_get_local_ranges() call in size-estimate reader which
will need system keyspace reference. There's no other place for tests to
get it from but the cql_test_env thing

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:59:21 +03:00
Pavel Emelyanov
34e8e5959f size_estimate_virtual_reader: Keep system_keyspace reference
The s._e._v._reader::fill_buffer() method needs system keyspace to get
node's local tokens. Now it's a static method, having system_keyspace
reference will make it non-static

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:58:07 +03:00
Pavel Emelyanov
04552f2d58 system_keyspace: Pass sys_ks argument to install_virtual_readers()
The size-estimate-virtual-reader will need it, now it's available as
"this" from system_keyspace::make() method

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:57:13 +03:00
Pavel Emelyanov
1938412d7a system_keyspace: Make make() non-static
This helper needs system_keyspace reference and using "this" as this
looks natural. Also this de-static-ification makes it possible to put
some sense into the invoke_on_all() call from init_system_keyspace()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:56:11 +03:00
Pavel Emelyanov
9f79525f8e distributed_loader: Pass sys_ks argument to init_system_keyspace()
It's final destination is virtual tabls registration code called from
init_system_keyspace() eventually

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:55:03 +03:00
Pavel Emelyanov
e996503f0d system_keyspace: Remove dangling forward declaration
It doesn't match the real system_keyspace_make() definition and is in
fact not needed, as there's another "real" one in database.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:54:22 +03:00
Vlad Zolotarov
8195dab92a scylla_prepare: correctly handle a former 'MQ' mode
Fixes a regression introduced in 80917a1054:
"scylla_prepare: stop generating 'mode' value in perftune.yaml"

When cpuset.conf contains a "full" CPU set the negation of it from
the "full" CPU set is going to generate a zero mask as a irq_cpu_mask.
This is an illegal value that will eventually end up in the generated
perftune.yaml, which in line will make the scylla service fail to start
until the issue is resolved.

In such a case a irq_cpu_mask must represent a "full" CPU set mimicking
a former 'MQ' mode.

Fixes #11701
Tested:
 - Manually on a 2 vCPU VM in an 'auto-selection' mode.
 - Manually on a large VM (48 vCPUs) with an 'MQ' manually
   enforced.
Message-Id: <20221004004237.2961246-1-vladz@scylladb.com>
2022-10-06 17:43:37 +03:00
Avi Kivity
9932c4bd62 Merge 'cql3: Make CONTAINS NULL and CONTAINS KEY NULL return false' from Jan Ciołek
Currently doing `CONTAINS NULL` or `CONTAINS KEY NULL` on a collection evaluates to `true`.

This is a really weird behaviour. Collections can't contain `NULL`, even if they wanted to.
Any operation that has a NULL on either side should evaluate to `NULL`, which is interpreted as `false`.
In Cassandra trying to do `CONTAINS NULL` causes an error.

Fixes: #10359

The only problem is that this change is not backwards compatible. Some existing code might break.

Closes #11730

* github.com:scylladb/scylladb:
  cql3: Make CONTAINS KEY NULL return false
  cql3: Make CONTAINS NULL return false
2022-10-06 17:08:56 +03:00
Petr Gusev
40bd9137f8 removenode: add warning in case of exception
The removenode_abort logic that follows the warning
may throw, in which case information about
the original exception was lost.

Fixes: #11722
Closes #11735
2022-10-06 13:49:26 +02:00
Benny Halevy
480b4759a9 idl: streaming: include stream_fwd.hh
To keep the idl definition of plan_id from
getting out of sync with the one in stream_fwd.hh.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11720
2022-10-06 13:49:26 +02:00
Kamil Braun
962ee9ba7b Merge 'Make raft_group0 -> system_keyspace dependency explicit' from Pavel Emelyanov
The raft_group0 code needs system_keyspace and now it gets one from gossiper. This gossiper->system_keyspace dependency is in fact artificial, gossiper doesn't need system ks, it's there only to let raft and snitch call gossiper.get_system_keyspace().

This makes raft use system ks directly, snitch is patched by another branch

Closes #11729

* github.com:scylladb/scylladb:
  raft_group0: Use local reference
  raft_group0: Add system keyspace reference
2022-10-06 13:49:26 +02:00
Tomasz Grabiec
023f78d6ae test: lib: random_mutation_generator: Introduce a switch for generating simpler mutations for easier debugging
Closes #11731
2022-10-06 13:49:26 +02:00
Raphael S. Carvalho
14d6459efc compaction: Make compaction_manager stop more robust
Commit aba475fe1d accidentally fixed a race, which happens in
the following sequence of events:

1) storage service starts drain() via API for example
2) main's abort source is triggered, calling compaction_manager's do_stop()
via subscription.
        2.1) do_stop() initiates the stop but doesn't wait for it.
        2.2) compaction_manager's state is set to stopped, such that
        compaction_manager::stop() called in defer_verbose_shutdown()
        will wait for the stop and not start a new one.
3) drain() calls compaction_manager::drain() changing the state from
stopped to disabled.
4) main calls compaction_manager::stop() (as described in 2.2) and
incorrectly tries to stop the manager again, because the state was
changed in step 3.

aba475fe1d accidentally fixed this problem because drain() will no
longer take place if it detects the shutdown process was initiated
(it does so by ignoring drain request if abort source's subscription
was unlinked).

This shows us that looking at the state to determine if stop should be
performed is fragile, because once the state changes from A to B,
manager doesn't know the state was A. To make it robust, we can instead
check if the future that stores stop's promise is engaged, meaning that
the stop was already initiated and we don't have to start a new one.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11711
2022-10-06 13:49:26 +02:00
Botond Dénes
753f671eaa Merge 'dirty_memory_manager: simplify, clarify, and document' from Avi Kivity
This series undoes some recent damage to clarity, then
goes further by renaming terms around dirty_memory_manager
to be clearer. Documentation is added.

Closes #11705

* github.com:scylladb/scylladb:
  dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
  dirty_memory_manager: rename _virtual_region_group
  api: column_family: fix memtable off-heap memory reporting
  dirty_memory_manager: unscramble terminology
2022-10-06 13:49:26 +02:00
Tomasz Grabiec
4c8dc41f75 Merge 'Handle storage_io_error's ENOSPC when flushing' from Pavel Emelyanov
This is the continuation of the a980510654 that tries to catch ENOSPCs reported via storage_io_error similarly to how defer_verbose_shutdown() does on stop

Closes #11664

* github.com:scylladb/scylladb:
  table: Handle storage_io_error's ENOSPC when flushing
  table: Rewrap retry loop
2022-10-06 13:49:26 +02:00
Raphael S. Carvalho
fcdff50a35 sstable_resharding_test: Switch to table_for_tests
Important step for multiple compaction groups. As a bonus, lots
of boilerplate is removed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
cf3f93304e replica: Move compacted_undeleted_sstables into compaction group
Compacted undeleted sstables are relevant for avoiding data resurrection
in the purge path. As token ranges of groups won't overlap, it's
better to isolate this data, so to prevent one group from interfering
with another.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
56ac62bbd6 replica: Use correct compaction_group in try_flush_memtable_to_sstable()
We need to pass the compaction_group received as a param, not the one
retrieved via as_table_state(). Needed for supporting multiple
groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
707ebf9cf7 replica: Make move_sstables_from_staging() robust and compaction group friendly
Off-strategy can happen in parallel to view building.

A semaphore is used to ensure they don't step on each other's
toe.

If off-strategy completes first, then move_sstables_from_staging()
won't find the SSTable alive and won't reach code to add
the file to the backlog tracker.

If view building completes first, the SSTable exists, but it's
not reshaped yet (has repair origin) and shouldn't be
added to the backlog tracker.

Off-strategy completion code will make sure new sstables added
to main set are accounted by the backlog tracker, so
move_sstables_from_staging() only need to add to tracker files
which are certainly not going through a reshape compaction.

So let's take these facts into account to make the procedure
more robust and compaction group friendly. Very welcome change
for when multiple groups are supported.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
7d82373e3a test: Rename column_family_for_tests to table_for_tests
To avoid confusion, as replica::column_family was already renamed
to replica::table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
e56bfecd8d sstable_compaction_test: Use column_family_for_tests::as_table_state() instead
That's important for multiple compaction groups. Once replica::table
supports multiple groups, there will be no table::as_table_state(),
so for testing table with a single group, we'll be relying on
column_family_for_tests::as_table_state().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
5a028ca4dc test: Don't expose compound set in column_family_for_tests
The compound set shouldn't be exposed in main_sstables() because
once we complete the switch to column_family_for_tests::table_state,
can happen compaction will try to remove or add elements to its
set snapshot, and compound set isn't allowed to either ops.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
b16d6c55b1 test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user()
Needed once we switch to column_family_for_tests::table_state, so unit
tests relying on correct value will still work

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
a6d24a763a sstable_compaction_test: Merge table_state_for_test into column_family_for_tests
This change will make table_state_for_test the table_state of
column_family_for_tests. Today, an unit test has to keep a reference
to them both and logically couple them, but that's error prone.

This change is also important when replica::table supports multiple
compaction groups, so unit tests won't have to directly reference
the table_state of table, but rather use the one managed by
column_family_for_tests.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
6a0eabd17a sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
a6affea008 sstable_compaction_test: Switch to table_state in compact_sstables()
The switch is important once we have multiple compaction groups,
as a single table may own several groups. There will no longer be
a replica::table::as_table_state().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:19 -03:00
Raphael S. Carvalho
2aa6518486 sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests
Lots of boilerplate is reduced, and will also help to complete the
switch from replica::table to compaction::table_state in the unit
tests.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-05 21:37:18 -03:00
Jan Ciolek
a2c359a741 cql3: Make CONTAINS KEY NULL return false
A binary operator like this:
{1: 2, 3: 4} CONTAINS KEY NULL
used to evaluate to `true`.

This is wrong, any operation involving null
on either side of the operator should evaluate
to NULL, which is interpreted as false.

This change is not backwards compatible.
Some existing code might break.

partially fixes: #10359

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-05 18:15:44 +02:00
Jan Ciolek
bbfef4b510 cql3: Make CONTAINS NULL return false
A binary operator like this:
[1, 2, 3] CONTAINS NULL
used to evaluate to `true`.

This is wrong, any operation involving null
on either side of the operator should evaluate
to NULL, which is interpreted as false.

This change is not backwards compatible.
Some existing code might break.

partially fixes: #10359

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-10-05 18:15:15 +02:00
Pavel Emelyanov
fb8ed684fa raft_group0: Use local reference
It now grabs one from gossiper which is weird. A bit later it will be
possible to remove gossiper->system_keyspace dependency

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-05 17:35:58 +03:00
Pavel Emelyanov
8570fe3c30 raft_group0: Add system keyspace reference
The sharded<system_keyspace> is already started by the time raft_group0
is created

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-05 17:35:13 +03:00
Michał Chojnowski
a0204c17c5 treewide: remove mentions of seastar::thread::should_yield()
thread_scheduling_group has been retired many years ago.
Remove the leftovers, they are confusing.

Closes #11714
2022-10-05 12:26:37 +03:00
Michał Chojnowski
8aa24194b7 row_cache: remove a dead try...catch block in eviction
All calls in the try block have been noexcept for some time.
Remove the try...catch and the associated misleading comment to avoid confusing
source code readers.

Closes #11715
2022-10-05 12:23:47 +03:00
Benny Halevy
7286f5d314 sstables: mx/writer: optimize large data stats members order
Since `_partition_size_entry` and `_rows_in_partition_entry`
are accessed at the same time when updated, and similarly
`_cell_size_entry` and `_elements_in_collection_entry`,
place the member pairs closely together to improve data
cache locality.

Follow the same order when preparing the
`scylla_metadata::large_data_stats` map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:54:04 +03:00
Benny Halevy
8c8a0adb40 sstables: mx/writer: keep large data stats entry as members
To save the map lookup on the hot write path,
keep each large data stats entry as a member in the writer
object and build a map for storing the disk_hash in the
scylla metadata only when finalizing it in consume_end_of_stream.

Fixes #11686

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:54:04 +03:00
Benny Halevy
2c4ff71d2b db: large_data_handler: dynamically update config thresholds
make the various large data thresholds live-updateable
and construct the observers and updaters in
cql_table_large_data_handler to dynamically update
the base large_data_handler class threshold members.

Fixes #11685

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:53:40 +03:00
Benny Halevy
6d582054c0 utils/updateable_value: add transforming_value_updater
Automatically updates a value from a utils::updateable_value
Where they can be of different types.
An optional transfom function can provide an additional transformation
when updating the value, like multiplying it by a factor for unit conversion,
for example.

To be used for auto-updating the large data thresholds
from the db::config.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:52:49 +03:00
Botond Dénes
4c13328788 Merge 'Return all sstables in table::get_sstable_set()' from Raphael "Raph" Carvalho
This fixes a regression introduced by 1e7a444, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes user of the interface, such as get_sstables_by_partition_key() (used by API to return sstable name list which contains a particular key), to miss files in the maintenance set.

Fixes https://github.com/scylladb/scylladb/issues/11681.

Closes #11682

* github.com:scylladb/scylladb:
  replica: Return all sstables in table::get_sstable_set()
  sstables: Fix cloning of compound_sstable_set
2022-10-05 06:55:50 +03:00
Pavel Emelyanov
2c1ef0d2b7 sstables.hh: Remove unused headers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11709
2022-10-04 23:37:07 +02:00
Raphael S. Carvalho
827750c142 replica: Return all sstables in table::get_sstable_set()
get_sstable_set() as its name implies is not confined to the main
or maintenance set, nor to a specific compaction group, so let's
make it return the compound set which spans all groups, meaning
all sstables tracked by a table will be returned.

This is a regression introduced in 1e7a444. It affects the API
to return sstable list containing a partition key, as sstables
in maintenance would be missed, fooling users of the API like
tools that could trust the output.

Each compaction group is returning the main and maintenance set
in table_state's main_sstable_set() and maintenance_sstable_set(),
respectively.

Fixes #11681.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-04 10:43:27 -03:00
Raphael S. Carvalho
eddf32b94c sstables: Fix cloning of compound_sstable_set
The intention was that its clone() would actually clone the content
of an existing set into a new one, but the current impl is actually
moving the sets instead of copying them. So the original set
becomes invalid. Luckily, this problem isn't triggered as we're
not exposing the compound set in the table's interface, so the
compound_sstable_set::clone() method isn't being called.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-04 10:43:25 -03:00
Felipe Mendes
f67bb43a7a locator: ec2_snitch: IMDSv2 support
Access to AWS Metadata may be configured in three distinct ways:
   1 - Optional HTTP tokens and HTTP endpoint enabled: The default as it works today
   2 - Required HTTP tokens and HTTP endpoint enabled: Which support is entirely missing today
   3 - HTTP endpoint disabled: Which effectively forbids one to use Ec2Snitch or Ec2MultiRegionSnitch

This commit makes the 2nd option the default which is not only AWS recommended option, but is also entirely compatible with the 1st option.
In addition, we now validate the HTTP response when querying the IMDS server. Therefore - should a HTTP 403 be received - Scylla will
properly notify users on what they are trying to do incorrectly in their setup.

The commit was tested under the following circumstances (covering all 3 variants):
 - Ec2Snitch: IMDSv2 optional & required, and HTTP server disabled.
 - Ec2MultiRegionSnitch: IMDSv2 optional & required, and HTTP server disabled.

Refs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html
      https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html
      https://github.com/scylladb/scylladb/issues/9987
Fixes: https://github.com/scylladb/scylladb/issues/10490
Closes: https://github.com/scylladb/scylladb/issues/10490

Closes #11636
2022-10-04 15:48:42 +03:00
Avi Kivity
37c6b46d26 dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
The "virtual dirty" term is not very informative. "Virtual" means
"not real", but it doesn't say in which way it isn't real.

In this case, virtual dirty refers to real dirty memory, minus
the portion of memtables that has been written to disk (but not
yet sealed - in that case it would not be dirty in the first
place).

I chose to call "the portion of memtables that has been written
to disk" as "spooled memory". At least the unique term will cause
people to look it up and may be easier to remember. From that
we have "unspooled memory".

I plan to further change the accounting to account for spooled memory
rather than unspooled, as that is a more natural term, but that is left
for later.

The documentation, config item, and metrics are adjusted. The config
item is practically unused so it isn't worth keeping compatibility here.
2022-10-04 14:03:59 +03:00
Avi Kivity
d02c407769 dirty_memory_manager: rename _virtual_region_group
Since we folded _real_region_group into _virtual_region_group, the
"virtual" tag makes no sense any more, so remove it.
2022-10-04 14:01:45 +03:00
Avi Kivity
b0814bdd42 api: column_family: fix memtable off-heap memory reporting
We report virtual memory used, but that's not a real accounting
of the actual memory used. Use the correct real_memory_used() instead.

Note that this isn't a recent regression and was probably broken forever.
However nobody looks at this measure (and it's usually close to the
correct value) so nobody noticed.

Since it's so minor, I didn't bother filing an issue.
2022-10-04 13:56:29 +03:00
Avi Kivity
bc2fcf5187 dirty_memory_manager: unscramble terminology
Before 95f31f37c1 ("Merge 'dirty_memory_manager: simplify
region_group' from Avi Kivity"), we had two region_group
objects, one _real_region_group and another _virtual_region_group,
each with a set of "soft" and "hard" limits and related functions
and members.

In 95f31f37c1, we merged _real_region_group into _virtual_region_group,
but unfortunately the _real_region_group members received the "hard"
prefix when they got merged. This overloads the meaning of "hard" -
is it related to soft/hard limit or is it related to the real/virtual
distinction?

This patch applied some renaming to restore consistency. Anything
that came from _virtual_region_group now has "virtual" in its name.
Anything that came from _real_region_group now has "real" in its name.
The terms are still pretty bad but at least they are consistent.
2022-10-04 13:56:28 +03:00
Kamil Braun
c200ae2228 Merge 'test.py topology Scylla REST API client' from Alecco
- Separate `aiohttp` client code
- Helper to access Scylla server REST API
- Use helper both in `ScyllaClusterManager` (test.py process) and `ManagerClient` (pytest process)
- Add `removenode` and `decommission` operations.

Closes #11653

* github.com:scylladb/scylladb:
  test.py: Scylla REST methods for topology tests
  test.py: rename server_id to server_ip
  test.py: HTTP client helper
  test.py: topology pass ManagerClient instead of...
  test.py: delete unimplemented remove server
  test.py: fix variable name ssl name clash
2022-10-04 11:50:18 +02:00
Botond Dénes
169a8a66f2 compatible_ring_position_or_view: make it cheap to copy
This class exists for one purpose only: to serve as glue code between
dht::ring_position and boost::icl::interval_map. The latter requires
that keys in its intervals are:
* default constructible
* copyable
* have standalone compare operations

For this reason we have to wrap `dht::ring_position` in a class,
together with a schema to provide all this. This is
`compatible_ring_position`. There is one further requirement by code
using the interval map: it wants to do lookups without copying the
lookup key(s). To solve this, we came up with
`compatible_ring_position_or_view` which is a union of a key or a key
view + schema. As we recently found out, boost::icl copies its keys **a
lot**. It seems to assume these keys are cheap to copy and carelessly
copies them around even when iterating over the map. But
`compatible_ring_position_or_view` is not cheap to copy as it copies a
`dht::ring_position` which allocates, and it does that via an
`std::optional` and `std::variant` to add insult to injury.
This patch make said class cheap to copy, by getting rid of the variant
and storing the `dht::ring_position` via a shared pointer. The view is
stored separately and either points to the ring position stored in the
shared pointer or to an outside ring position (for lookups).

Fixes: #11669

Closes #11670
2022-10-04 12:00:21 +03:00
Piotr Dulikowski
51f813d89b storage_proxy: update rate limited reads metric when coordinator rejects
The decision to reject a read operation can either be made by replicas,
or by the coordinator. In the second case, the

  scylla_storage_proxy_coordinator_read_rate_limited

metric was not incremented, but it should. This commit fixes the issue.

Fixes: #11651

Closes #11694
2022-10-04 10:33:58 +03:00
Pavel Emelyanov
9cd1f777a5 database.hh: Remove unused headers
Use forward declarations when needed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11667
2022-10-04 09:01:38 +03:00
Botond Dénes
5fd4b1274e Merge 'compaction_manager: Don't let ENOSPC throw out of ::stop() method' from Pavel Emelyanov
The seastar defer_stop() helper is cool, but it forwards any exception from the .stop() towards the caller. In case the caller is main() the exception causes Scylla to abort(). This fires, for example, in compaction_manager::stop() when it steps on ENOSPC

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11662

* github.com:scylladb/scylladb:
  compaction_manager: Swallow ENOSPCs in ::stop()
  exceptions: Mark storage_io_error::code() with noexcept
2022-10-04 08:54:22 +03:00
Nadav Har'El
3a30fbd56c test/alternator: fix timeout in flaky test test_ttl_stats
The test `test_metrics.py::test_ttl_stats` tests the metrics associated
with Alternator TTL expiration events. It normally finishes in less than a
second (the TTL scanning is configured to run every 0.5 seconds), so we
arbitrarily set a 60 second timeout for this test to allow for extremely
slow test machines. But in some extreme cases even this was not enough -
in one case we measured the TTL scan to take 63 seconds.

So in this patch we increase the timeout in this test from 60 seconds
to 120 seconds. We already did the same change in other Alternator TTL
tests in the past - in commit 746c4bd.

Fixes #11695

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11696
2022-10-04 08:50:51 +03:00
Benny Halevy
46ebffcc93 db/large_data_handler: cql_table_large_data_handler: record large_collections
When the large_collection_detection cluster feature is enabled,
select the internal_record_large_cells_and_collections method
to record the large collection cell, storing also the collection_elements
column.

We want to do that only when the cluster feature is enabled
to facilitate rollback in case rolling upgrade is aborted,
otherwise system.large_cells won't be backward compatible
and will have to be deleted manually.

Delete the sstable from system.large_cells if it contains
elements_in_collection above threshold.

Closes #11449

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:10 +03:00
Benny Halevy
3f8bba202f db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
For recording collection_elements of large_collections when
the large_collection_detection feature is enabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:10 +03:00
Benny Halevy
dc4e7d8e01 db/large_data_handler: cql_table_large_data_handler: move ctor out of line
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:09 +03:00
Benny Halevy
f4c3070002 docs: large-rows-large-cells-tables: fix typos
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:09 +03:00
Benny Halevy
2f49eebb04 db/system_keyspace: add collection_elements column to system.large_cells
And bump the schema version offset since the new schema
should be distinguishable from the previous one.

Refs scylladb/scylladb#11660

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:08 +03:00
Benny Halevy
9ad41c700e gms/feature_service: add large_collection_detection cluster feature
And a corresponding db::schema_feature::SCYLLA_LARGE_COLLECTIONS

We want to enable the schema change supporting collection_elements
only when all nodes are upgraded so that we can roll back
if the rolling upgrade process is aborted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:07 +03:00
Benny Halevy
9eeb8f2971 test: sstable_3_x_test: add test_sstable_too_many_collection_elements
Test that collections with too many elements are detected properly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:07 +03:00
Benny Halevy
3c11937b00 test: lib: simple_schema: add support for optional collection column
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:06 +03:00
Benny Halevy
7b5f2d2e53 test: lib: simple_schema: build schema in ctor body
Rather when initializing _s.

Prepare for adding an optional collection column.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:06 +03:00
Benny Halevy
db01641a44 test: lib: simple_schema: cql: define s1 as static only if built this way
Keep the with_static ctor parameter as private member
to be used by the cql() method to define s1 either as static or not.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:05 +03:00
Benny Halevy
6dadca2648 db/large_data_handler: maybe_record_large_cells: consider collection_elements
Detect large_collections when the number of collection_elements
is above the configured threshold.

Next step would be to record the number of collection_elements
in the system.large_cells table, when the respective
cluster feature is enabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:05 +03:00
Benny Halevy
27ee75c54e db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
Log in debug level when deleting large data entry
from system table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:04 +03:00
Benny Halevy
7dead10742 sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
And update the sstable elements_in_collection
stats entry.

Next step would be to forward it to
large_data_handler().maybe_record_large_cells().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:41:58 +03:00
Benny Halevy
54ab038825 sstables: mx/writer: add large_data_type::elements_in_collection
Add a new large_data_stats type and entry for keeping
the collection_elements_count_threshold and the maximum value
of collection_elements.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:41:56 +03:00
Benny Halevy
a107f583fd db/large_data_handler: get the collection_elements_count_threshold
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:11 +03:00
Benny Halevy
167ec84eeb db/config: add compaction_collection_elements_count_warning_threshold
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:10 +03:00
Benny Halevy
5e88e6267e test: sstable_3_x_test: add test_sstable_write_large_cell
based on cell size threshold.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:09 +03:00
Benny Halevy
3980415d97 test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:09 +03:00
Benny Halevy
3eb4cda8ea test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:08 +03:00
Benny Halevy
0a9d3f24e6 test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
This way, the boost infrastructure prints the offending values
if the test assertion fails.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:07 +03:00
Benny Halevy
9668dd0e2d test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
So it would be reproducible based on the test random-seed

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:30:51 +03:00
Kamil Braun
114419d6ab service/raft: raft_group0_client: read on-disk an in-memory group0 upgrade atomically
`set_group0_upgrade_state` writes the on-disk state first, then
in-memory state second, both under a write lock.
`get_group0_upgrade_state` would only take the lock if the in-memory
state was `use_pre_raft_procedures`.

If there's an external observer who watches the on-disk state to decide
whether Raft upgrade finished yet, the following could happen:
1. The node wrote `use_post_raft_procedures` to disk but didn't update
   the in-memory state yet, which is still `synchronize`.
2. The external client reads the table and sees that the state is
   `use_post_raft_procedures`, and deduces that upgrade has finished.
3. The external client immediately tries to perform a schema change. The
   schema change code calls `get_group0_upgrade_state` which does not
   take the read lock and returns `synchronize`. The schema change gets
   denied because schema changes are not allowed in `synchronize`.

Make sure that `get_group0_upgrade_state` cannot execute in-between
writing to disk and updating the in-memory state by always taking the
read lock before reading the in-memory state. As it was before, it will
immediately drop the lock if the state is not `use_pre_raft_procedures`.

This is useful for upgrade tests, which read the on-disk state to decide
whether upgrade has finished and often try to perform a schema change
immediately afterwards.

Closes #11672
2022-10-03 19:04:16 +02:00
Alejo Sanchez
abf1425ad4 test.py: Scylla REST methods for topology tests
Provide a helper client for Scylla REST requests. Use it on both
ScyllaClusterManager (e.g. remove node, test.py process) and
ManagerClient (e.g. get uuid, pytest process).

For now keep using IPs as key in ScyllaCluster, but this will be changed
to UUID -> IP in the future. So, for now, pass both independently. Note
the UUID must be obtained from the server before stopping it.

Refresh client driver connection when decommissioning or removing
a node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 19:01:03 +02:00
Alejo Sanchez
86c752c2a0 test.py: rename server_id to server_ip
In ScyllaCluster currently servers are tracked by the host IP. This is
not the host id (UUID). Fix the variable name accordingly

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 19:01:03 +02:00
Alejo Sanchez
a7a0b446f0 test.py: HTTP client helper
Split aiohttp client to a shared helper file.

While there, move aiohttp session setup back to constructors. When there
were teardown issues it looked it could be caused by aiohttp session
being created outside a coroutine. But this is proven not to be the case
after recent fixes. So move it back to the ManagerClient constructor.

On th other hand, create a close() coroutine to stop the aiohttp session.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 19:01:03 +02:00
Alejo Sanchez
41dbdf0f70 test.py: topology pass ManagerClient instead of...
cql connection

When there are topology changes, the driver needs to be updated. Instead
of passing the CassandraCluster.Connection, pass the ManagerClient
instance which manages the driver connection inside of it.

Remove workaround for test_raft_upgrade.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 19:00:47 +02:00
Alejo Sanchez
0c3a06d0d7 test.py: delete unimplemented remove server
Delete of Unused and unimplemented broken version of remove server.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 18:57:38 +02:00
Alejo Sanchez
98bc4c198f test.py: fix variable name ssl name clash
Change variable ssl to use_ssl to avoid clash with ssl module.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-10-03 18:57:38 +02:00
Avi Kivity
7626fd573a storage_proxy: reindent after coroutinization 2022-10-03 19:33:39 +03:00
Avi Kivity
019b18b232 storage_proxy: convert handle_read_digest() to a coroutine
The do_with() makes it at least a break-even, but there's some allocating
continuations that make it a win.

A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
2022-10-03 19:33:39 +03:00
Avi Kivity
aa5f4bf1f3 storage_proxy: convert handle_read_mutation_data() to a coroutine
The do_with() makes it at least a break-even, but there's some allocating
continuations that make it a win.

A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
2022-10-03 19:33:39 +03:00
Avi Kivity
bcd134e9b8 storage_proxy: convert handle_read_data() to a coroutine
The do_with() makes it at least a break-even, but there's some allocating
continuations that make it a win.

A variable named cmd had two different definitions (a value and a
lw_shared_ptr) that lived in different scopes. I renamed one to cmd1
to disambiguate. We should probably move that to the caller, but that
is not done here.
2022-10-03 19:33:39 +03:00
Avi Kivity
167c8b1b5e storage_proxy: convert handle_write() to a coroutine
A do_with() makes this at least a break-even.

Some internal lambdas were not converted since they commonly
do not allocate or block.

A finally() continuation is converted to seastar::defer().
2022-10-03 19:33:39 +03:00
Avi Kivity
741d6609a5 storage_proxy: convert handle_counter_mutation() to a coroutine
The do_with means the coroutine conversion is free, and conversion
of parallel_for_each to coroutine::parallel_for_each saves a
possible allocation (though it would not have been allocated usually.

An inner continuation is not converted since it usually doesn't
block, and therefore doesn't allocate.
2022-10-03 19:33:39 +03:00
Avi Kivity
ac5fae4b93 storage_proxy: convert query_nonsingular_mutations_locally() to a coroutine
It's simpler, and the do_with() allocation + task cancels out the
coroutine allocation + task.
2022-10-03 19:33:29 +03:00
Pavel Emelyanov
d22b130af1 compaction_manager: Swallow ENOSPCs in ::stop()
When being stopped compaction manager may step on ENOSPC. This is not a
reason to fail stopping process with abort, better to warn this fact in
logs and proceed as if nothing happened

refs: #11245

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-03 18:54:48 +03:00
Pavel Emelyanov
7ba1f551f3 exceptions: Mark storage_io_error::code() with noexcept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-03 18:50:06 +03:00
Kamil Braun
67ee6500e3 service/raft: raft_group_registry: pass direct_fd_pinger by reference
It was passed to `raft_group_registry::direct_fd_proxy` by value. That
is a bug, we want to pass a reference to the instance that is living
inside `gossiper`.

Fortunately this bug didn't cause problems, because the pinger is only
used for one function, `get_address`, which looks up an address in a map
and if it doesn't find it, accesses the map that lives inside
`gossiper` on shard 0 (and then caches it in the local copy).

Explicitly delete the copy constructor of `direct_fd_pinger` so this
doesn't happen again.

Closes #11661
2022-10-03 16:40:35 +02:00
Tomasz Grabiec
9dae2b9c02 Merge 'mutation_fragment_stream_validator: various API improvements' from Botond Dénes
The low-level `mutation_fragment_stream_validator` gets `reset()` methods that until now only the high-level `mutation_fragment_stream_validating_filter` had.
Active tombstone validation is pushed down to the low level validator.
The low level validator, which was a pain to use until now due to being very fussy on which subset of its API one used, is made much more robust, not requiring the user to stick to a subset of its API anymore.

Closes #11614

* github.com:scylladb/scylladb:
  mutation_fragment_stream_validator: make interface more robust
  mutation_fragment_stream_validator: add reset() to validating filter
  mutation_fragment_stream_validator: move active tomsbtone validation into low level validator
2022-10-03 16:23:46 +02:00
Botond Dénes
95f31f37c1 Merge 'dirty_memory_manager: simplify region_group' from Avi Kivity
region_group evolved as a tree, each node of which contains some
regions (memtables). Each node has some constraints on memory, and
can start flushing and/or stop allocation into its memtables and those
below it when those constraints are violated.

Today, the tree has exactly two nodes, only one of which can hold memtables.
However, all the complexity of the tree remains.

This series applies some mechanical code transformations that remove
the tree structure and all the excess functionality, leaving a much simpler
structure behind.

Before:
 - a tree of region_group objects
 - each with two parameters: soft limit and hard limit
 - but only two instances ever instantiated
After:
 - a single region_group object
 - with three parameters - two from the bottom instance, one from the top instance

Closes #11570

* github.com:scylladb/scylladb:
  dirty_memory_manager: move third memory threshold parameter of region_group constructor to reclaim_config
  dirty_memory_manager: simplify region_group::update()
  dirty_memory_manager: fold region_group::notify_hard_pressure_relieved into its callers
  dirty_memory_manager: clean up region_group::do_update_hard_and_check_relief()
  dirty_memory_manager: make do_update_hard_and_check_relief() a member of region_group
  dirty_memory_manager: remove accessors around region_group::_under_hard_pressure
  dirty_memory_manager: merge memory_hard_limit into region_group
  dirty_memory_manager: rename members in memory_hard_limit
  dirty_memory_manager: fold do_update() into region_group::update()
  dirty_memory_manager: simplify memory_hard_limit's do_update
  dirty_memory_manager: drop soft limit / soft pressure members in memory_hard_limit
  dirty_memory_manager: de-template do_update(region_group_or_memory_hard_limit)
  dirty_memory_manager: adjust soft_limit threshold check
  dirty_memory_manager: drop memory_hard_limit::_name
  dirty_memory_manager: simplify memory_hard_limit configuration
  dirty_memory_manager: fold region_group_reclaimer into {memory_hard_limit,region_group}
  dirty_memory_manager: stop inheriting from region_group_reclaimer
  dirty_memory_manager: test: unwrap region_group_reclaimer
  dirty_memory_manager: change region_group_reclaimer configuration to a struct
  dirty_memory_manager: convert region_group_reclaimer to callbacks
  dirty_memory_manager: consolidate region_group_reclaimer constructors
  dirty_memory_manager: rename {memory_hard_limit,region_group}::notify_relief
  dirty_memory_manager: drop unused parameter to memory_hard_limit constructor
  dirty_memory_manager: drop memory_hard_limit::shutdown()
  dirty_memory_manager: split region_group hierarchy into separate classes
  dirty_memory_manager: extract code block from region_group::update
  dirty_memory_manager: move more allocation_queue functions out of region_group
  dirty_memory_manager: move some allocation queue related function definitions outside class scope
  dirty_memory_manager: move region_group::allocating_function and related classes to new class allocation_queue
  dirty_memory_manager: remove support for multiple subgroups
2022-10-03 13:22:47 +03:00
Botond Dénes
5621cdd7f9 db/view/view_builder: don't drop partition and range tombstones when resuming
The view builder builds the views from a given base table in
view_builder::batch_size batches of rows. After processing this many
rows, it suspends so the view builder can switch to building views for
other base tables in the name of fairness. When resuming the build step
for a given base table, it reuses the reader used previously (also
serving the role of a snapshot, pinning sstables read from). The
compactor however is created anew. As the reader can be in the middle of
a partition, the view builder injects a partition start into the
compactor to prime it for continuing the partition. This however only
included the partition-key, crucially missing any active tombstones:
partition tombstone or -- since the v2 transition -- active range
tombstone. This can result in base rows covered by either of this to be
resurrected and the view builder to generate view updates for them.
This patch solves this by using the detach-state mechanism of the
compactor which was explicitly developed for situations like this (in
the range scan code) -- resuming a read with the readers kept but the
compactor recreated.
Also included are two test cases reproducing the problem, one with a
range tombstone, the other with a partition tombstone.

Fixes: #11668

Closes #11671
2022-10-03 11:28:22 +03:00
Avi Kivity
2c744628ae Update abseil submodule
* abseil 9e408e05...7f3c0d78 (193):
  > Allows absl::StrCat to accept types that implement AbslStringify()
  > Merge pull request #1283 from pateldeev:any_inovcable_rename_true
  > Cleanup: SmallMemmove nullify should also be limited to 15 bytes
  > Cleanup: implement PrependArray and PrependPrecise in terms of InlineData
  > Cleanup: Move BitwiseCompare() to InlineData, and make it layout independent.
  > Change kPower10Table bounds to be half-open
  > Cleanup some InlineData internal layout specific details from cord.h
  > Improve the comments on the implementation of format hooks adl tricks.
  > Expand LogEntry method docs.
  > Documentation: Remove an obsolete note about the implementation of `Cord`.
  > `absl::base_internal::ReadLongFromFile` should use `O_CLOEXEC` and handle interrupts to `read`
  > Allows absl::StrFormat to accept types which implement AbslStringify()
  > Add common_policy_traits - a subset of hash_policy_traits that can be shared between raw_hash_set and btree.
  > Split configuration related to cycle clock into separate headers
  > Fix -Wimplicit-int-conversion and -Wsign-conversion warnings in btree.
  > Implement Eisel-Lemire for from_chars<float>
  > Import of CCTZ from GitHub.
  > Adds support for "%v" in absl::StrFormat and related functions for bool values. Note that %v prints bool values as "true" and "false" rather than "1" and "0".
  > De-pointerize LogStreamer::stream_, and fix move ctor/assign preservation of flags and other stream properties.
  > Explicitly disallows modifiers for use with %v.
  > Change the macro ABSL_IS_TRIVIALLY_RELOCATABLE into a type trait - absl::is_trivially_relocatable - and move it from optimization.h to type_traits.h.
  > Add sparse and string copy constructor benchmarks for hash table.
  > Make BTrees work with custom allocators that recycle memory.
  > Update the readme, and (internally) fix some export processes to better keep it up-to-date going forward.
  > Add the fact that CHECK_OK exits the program to the comment of CHECK_OK.
  > Adds support for "%v" in absl::StrFormat and related functions for numeric types, including integer and floating point values. Users may now specify %v and have the format specifier deduced. Integer values will print according to %d specifications, unsigned values will use %u, and floating point values will use %g. Note that %v does not work for `char` due to ambiguity regarding the intended output. Please continue to use %c for `char`.
  > Implement correct move constructor and assignment for absl::strings_internal::OStringStream, and mark that class final.
  > Add more options for `BM_iteration` in order to see better picture for choosing trade off for iteration optimizations.
  > Change `EndComparison` benchmark to not measure iteration. Also added `BM_Iteration` separately.
  > Implement Eisel-Lemire for from_chars<double>
  > Add `-llog` to linker options when building log_sink_set in logging internals.
  > Apply clang-format to btree.h.
  > Improve failure message: tell the values we don't like.
  > Increase the number of per-ObjFile program headers we can expect.
  > Fix "unsafe narrowing" warnings in absl, 8/n.
  > Fix format string error with an explicit cast
  > Add a case to detect when the Bazel compiler string is explicitly set to "gcc", instead of just detecting Bazel's default "compiler" string.
  > Fix "unsafe narrowing" warnings in absl, 10/n.
  > Fix "unsafe narrowing" warnings in absl, 9/n.
  > Fix stacktrace header includes
  > Add a missing dependency on :raw_logging_internal
  > CMake: Require at least CMake 3.10
  > CMake: install artifacts reflect the compiled ABI
  > Fixes bug so that `%v` with modifiers doesn't compile. `%v` is not intended to work with modifiers because the meaning of modifiers is type-dependent and `%v` is intended to be used in situations where the type is not important. Please continue using if `%s` if you require format modifiers.
  > Convert algorithm and container benchmarks to cc_binary
  > Merge pull request #1269 from isuruf:patch-1
  > InlinedVector: Small improvement to the max_size() calculation
  > CMake: Mark hash_testing as a public testonly library, as it is with Bazel
  > Remove the ABSL_HAVE_INTRINSIC_INT128 test from pcg_engine.h
  > Fix ClangTidy warnings in btree.h and btree_test.cc.
  > Fix log StrippingTest on windows when TCHAR = WCHAR
  > Refactors checker.h and replaces recursive functions with iterative functions for readability purposes.
  > Refactors checker.h to use if statements instead of ternary operators for better readability.
  > Import of CCTZ from GitHub.
  > Workaround for ASAN stack safety analysis problem with FixedArray container annotations.
  > Rollback of fix "unsafe narrowing" warnings in absl, 8/n.
  > Fix "unsafe narrowing" warnings in absl, 8/n.
  > Changes mutex profiling
  > InlinedVector: Correct the computation of max_size()
  > Adds support for "%v" in absl::StrFormat and related functions for string-like types (support for other builtin types will follow in future changes). Rather than specifying %s for strings, users may specify %v and have the format specifier deduced. Notably, %v does not work for `const char*` because we cannot be certain if %s or %p was intended (nor can we be certain if the `const char*` was properly null-terminated). If you have a `const char*` you know is null-terminated and would like to work with %v, please wrap it in a `string_view` before using it.
  > Fixed header guards to match style guide conventions.
  > Typo fix
  > Added some more no_test.. tags to build targets for controlling testing.
  > Remove includes which are not used directly.
  > CMake: Add an option to build the libraries that are used for writing tests without requiring Abseil's tests be built (default=OFF)
  > Fix "unsafe narrowing" warnings in absl, 7/n.
  > Fix "unsafe narrowing" warnings in absl, 6/n.
  > Release the Abseil Logging library
  > Switch time_state to explicit default initialization instead of value initialization.
  > spinlock.h: Clean up includes
  > Fix minor typo in absl/time/time.h comment: "ToDoubleNanoSeconds" -> "ToDoubleNanoseconds"
  > Support compilers that are unknown to CMake
  > Import of CCTZ from GitHub.
  > Change bit_width(T) to return int rather than T.
  > Import of CCTZ from GitHub.
  > Merge pull request #1252 from jwest591:conan-fix
  > Don't try to enable use of ARM NEON intrinsics when compiling in CUDA device mode. They are not available in that configuration, even if the host supports them.
  > Fix "unsafe narrowing" warnings in absl, 5/n.
  > Fix "unsafe narrowing" warnings in absl, 4/n.
  > Import of CCTZ from GitHub.
  > Update Abseil platform support policy to point to the Foundational C++ Support Policy
  > Import of CCTZ from GitHub.
  > Add --features=external_include_paths to Bazel CI to ignore warnings from dependencies
  > Merge pull request #1250 from jonathan-conder-sm:gcc_72
  > Merge pull request #1249 from evanacox:master
  > Import of CCTZ from GitHub.
  > Merge pull request #1246 from wxilas21:master
  > remove unused includes and add missing std includes for absl/status/status.h
  > Sort INTERNAL_DLL_TARGETS for easier maintenance.
  > Disable ABSL_HAVE_STD_IS_TRIVIALLY_ASSIGNABLE for clang-cl.
  > Map the absl::is_trivially_* functions to their std impl
  > Add more SimpleAtod / SimpleAtof test coverage
  > debugging: handle alternate signal stacks better on RISCV
  > Revert change "Fix "unsafe narrowing" warnings in absl, 4/n.".
  > Fix "unsafe narrowing" warnings in absl, 3/n.
  > Fix "unsafe narrowing" warnings in absl, 4/n.
  > Fix "unsafe narrowing" warnings in absl, 2/n.
  > debugging: honour `STRICT_UNWINDING` in RISCV path
  > Fix "unsafe narrowing" warnings in absl, 1/n.
  > Add ABSL_IS_TRIVIALLY_RELOCATABLE and ABSL_ATTRIBUTE_TRIVIAL_ABI macros for use with clang's __is_trivially_relocatable and [[clang::trivial_abi]].
  > Merge pull request #1223 from ElijahPepe:fix/implement-snprintf-safely
  > Fix frame pointer alignment check.
  > Fixed sign-conversion warning in code.
  > Import of CCTZ from GitHub.
  > Add missing include for std::unique_ptr
  > Do not re-close files on EINTR
  > Renamespace absl::raw_logging_internal to absl::raw_log_internal to match (upcoming) non-raw logging namespace.
  > Check for negative return values from ReadFromOffset
  > Use HTTPS RFC URLs, which work regardless of the browser's locale.
  > Avoid signedness change when casting off_t
  > Internal Cleanup: removing unused internal function declaration.
  > Make Span complain if constructed with a parameter that won't outlive it, except if that parameter is also a span or appears to be a view type.
  > any_invocable_test: Re-enable the two conversion tests that used to fail under MSVC
  > Add GetCustomAppendBuffer method to absl::Cord
  > debugging: add hooks for checking stack ranges
  > Minor clang-tidy cleanups
  > Support [[gnu::abi_tag("xyz")]] demangling.
  > Fix -Warray-parameter warning
  > Merge pull request #1217 from anpol:macos-sigaltstack
  > Undo documentation change on erase.
  > Improve documentation on erase.
  > Merge pull request #1216 from brjsp:master
  > string_view: conditional constexpr is no longer needed for C++14
  > Make exponential_distribution_test a bigger test (timeout small -> moderate).
  > Move Abseil to C++14 minimum
  > Revert commit f4988f5bd4176345aad2a525e24d5fd11b3c97ea
  > Disable C++11 testing, enable C++14 and C++20 in some configurations where it wasn't enabled
  > debugging: account for differences in alternate signal stacks
  > Import of CCTZ from GitHub.
  > Run flaky test in fewer configurations
  > AnyInvocable: Move credits to the top of the file
  > Extend visibility of :examine_stack to an upcoming Abseil Log.
  > Merge contiguous mappings from the same file.
  > Update versions of WORKSPACE dependencies
  > Use ABSL_INTERNAL_HAS_SSE2 instead of __SSE2__
  > PR #1200: absl/debugging/CMakeLists.txt: link with libexecinfo if needed
  > Update GCC floor container to use Bazel 5.2.0
  > Update GoogleTest version used by Abseil
  > Release absl::AnyInvocable
  > PR #1197: absl/base/internal/direct_mmap.h: fix musl build on mips
  > absl/base/internal/invoke: Ignore bogus warnings on GCC >= 11
  > Revert GoogleTest version used by Abseil to commit 28e1da21d8d677bc98f12ccc7fc159ff19e8e817
  > Update GoogleTest version used by Abseil
  > explicit_seed_seq_test: work around/disable bogus warnings in GCC 12
  > any_test: expand the any emplace bug suppression, since it has gotten worse in GCC 12
  > absl::Time: work around bogus GCC 12 -Wrestrict warning
  > Make absl::StdSeedSeq an alias for std::seed_seq
  > absl::Optional: suppress bogus -Wmaybe-uninitialized GCC 12 warning
  > algorithm_test: suppress bogus -Wnonnull warning in GCC 12
  > flags/marshalling_test: work around bogus GCC 12 -Wmaybe-uninitialized warning
  > counting_allocator: suppress bogus -Wuse-after-free warning in GCC 12
  > Prefer to fallback to UTC when the embedded zoneinfo data does not contain the requested zone.
  > Minor wording fix in the comment for ConsumeSuffix()
  > Tweak the signature of status_internal::MakeCheckFailString as part of an upcoming change
  > Fix several typos in comments.
  > Reformulate documentation of ABSL_LOCKS_EXCLUDED.
  > absl/base/internal/invoke.h: Use ABSL_INTERNAL_CPLUSPLUS_LANG for language version guard
  > Fix C++17 constexpr storage deprecation warnings
  > Optimize SwissMap iteration by another 5-10% for ARM
  > Add documentation on optional flags to the flags library overview.
  > absl: correct the stack trace path on RISCV
  > Merge pull request #1194 from jwnimmer-tri:default-linkopts
  > Remove unintended defines from config.h
  > Ignore invalid TZ settings in tests
  > Add ABSL_HARDENING_ASSERTs to CordBuffer::SetLength() and CordBuffer::IncreaseLengthBy()
  > Fix comment typo about absl::Status<T*>
  > In b-tree, support unassignable value types.
  > Optimize SwissMap for ARM by 3-8% for all operations
  > Release absl::CordBuffer
  > InlinedVector: Limit the scope of the maybe-uninitialized warning suppression
  > Improve the compiler error by removing some noise from it. The "deleted" overload error is useless to users. By passing some dummy string to the base class constructor we use a valid constructor and remove the unintended use of the deleted default constructor.
  > Merge pull request #714 from kgotlinux:patch-2
  > Include proper #includes for POSIX thread identity implementation when using that implementation on MinGW.
  > Rework NonsecureURBGBase seed sequence.
  > Disable tests on some platforms where they currently fail.
  > Fixed typo in a comment.
  > Rollforward of commit ea78ded7a5f999f19a12b71f5a4988f6f819f64f.
  > Add an internal helper for logging (upcoming).
  > Merge pull request #1187 from trofi:fix-gcc-13-build
  > Merge pull request #1189 from renau:master
  > Allow for using b-tree with `value_type`s that can only be constructed by the allocator (ignoring copy/move constructors).
  > Stop using sleep timeouts for Linux futex-based SpinLock
  > Automated rollback of commit f2463433d6c073381df2d9ca8c3d8f53e5ae1362.
  > time.h: Use uint32_t literals for calls to overloaded MakeDuration
  > Fix typos.
  > Clarify the behaviour of `AssertHeld` and `AssertReaderHeld` when the calling thread doesn't hold the mutex.
  > Enable __thread on Asylo
  > Add implementation of is_invocable_r to absl::base_internal for C++ < 17, define it as alias of std::is_invocable_r when C++ >= 17
  > Optimize SwissMap iteration for aarch64 by 5-6%
  > Fix detection of ABSL_HAVE_ELF_MEM_IMAGE on Haiku
  > Don’t use generator expression to build .pc Libs lines
  > Update Bazel used on MacOS CI
  > Import of CCTZ from GitHub.

Closes #11687
2022-10-03 11:06:37 +03:00
Botond Dénes
f4540ef0d6 Merge 'Upgrade nix devenv' from Michael Livshin
To recap: the Nix devenv ({default,shell,flake}.nix and friends) in Scylla is a nicer (for those who consider it so, that is) alternative to dbuild: a completely deterministic build environment without Docker.

In theory we could support much more (creating installable packages, container images, various deployment affordances, etc. -- Nix is, among other things, a kind of parallel-to-everything-else devops realm) but there is clearly no demand and besides duplicating the work the release team is already doing (and doing just fine, needless to say) would be pointless and wasteful.

This PR reflects the accumulated changes that I have been carrying locally for the past year or so.  The version currently in master _probably_ can still build Scylla, but that Scylla certainly would not pass unit tests.

What the previous paragraph seems to mean is, apparently I'm the only active user of Nix devenv for Scylla.  Which, in turn, presents some obvious questions for the maintainers:

- Does this need to live in the Scylla source at all?  (The changes to non-Nix-specific parts are minimal and unobtrusive, but they are still changes)
- If it's left in, who is going to maintain it going forward, should more users somehow appear?  (I'm perfectly willing to fix things up when alerted, but no timeliness guarantees)

Closes #9557

* github.com:scylladb/scylladb:
  nix: add README.md
  build: improvements & upgrades to Nix dev environment
  build: allow setting SCYLLA_RELEASE from outside
2022-10-03 09:40:09 +03:00
Botond Dénes
2041744132 Merge 'readers/mutlishard: don't mix coroutines and continuations in the do_fill_buffer()' from Avi Kivity
The combination is hard to read and modify.

Closes #11665

* github.com:scylladb/scylladb:
  readers/multishard: restore shard_reader_v2::do_fill_buffer() indentation
  readers/multishard: convert shard_reader_v2::do_fill_buffer() to a pure coroutine
2022-10-03 06:51:20 +03:00
Nadav Har'El
b8f8eb8710 Merge 'Improve test.py logging' from Kamil Braun
Include the unique test name (the unique name distinguishes between different test repeats) and the test case name where possible. Improve printing of clusters: include the cluster name and stopped servers. Fix some logging calls and add new ones.

Examples:
```
------ Starting test test_topology ------
```
became this:
```
------ Starting test test_topology.1::test_add_server_add_column ------
```

This:
```
INFO> Leasing Scylla cluster {127.191.142.1, 127.191.142.2, 127.191.142.3} for test test_add_server_add_column
```
became this:
```
INFO> Leasing Scylla cluster ScyllaCluster(name: 02cdd180-40d1-11ed-8803-3c2c30d32d96, running: {127.144.164.1, 127.144.164.2, 127.144.164.3}, stopped: {}) for test test_topology.1::test_add_server_add_column
```

Closes #11677

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: improve cluster printing
  test/pylib: don't pass test_case_name to after-test endpoint
  test/pylib: scylla_cluster: track current test case name and print it
  test.py: pass the unique test name (e.g. `test_topology.1`) to cluster manager
  test/pylib: scylla_cluster: pass the test case name to `before_test`
  test/pylib: use "test_case_name" variable name when talking about test cases
2022-10-02 20:48:50 +03:00
Pavel Emelyanov
2b8636a2a9 storage_proxy.hh: Remove unused headers
Add needed forward declarations and fix indirect inclusions in some .ccs

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11679
2022-10-02 20:48:50 +03:00
Michał Chojnowski
4563cbe595 logalloc: prevent false positives in reclaim_timer
reclaim_timer uses a coarse clock, but does not account for
the measurement error introduced by that -- it can falsely
report reclaims as stalls, even if they are shorter by a full
coarse clock tick from the requested threshold
(blocked-reactor-notify-ms).

Notably, if the stall threshold happens to be smaller or equal to coarse
clock resolution, Scylla's log gets spammed with false stall reports.
The resolution of coarse clocks in Linux is 1/CONFIG_HZ. This is
typically equal to 1 ms or 4 ms, and stall thresholds of this order
can occur in practice.

Eliminate false positives by requiring the measured reclaim duration to
be at least 1 clock tick longer than the configured threshold for it to
be considered a stall.

Fixes #10981

Closes #11680
2022-10-02 13:41:40 +03:00
Avi Kivity
372eadf542 Merge "perftune related improvements in scylla_* scripts" from Vlad Zolotarov
"
This series adds a long waited transition of our auto-generation
code to irq_cpu_mask instead of 'mode' in perftune.yaml.

And then it fixes a regression in scylla_prepare perftune.yaml
auto-generation logic.
"

* 'scylla_prepare_fix_regression-v1' of https://github.com/vladzcloudius/scylla:
  scylla_prepare + scylla_cpuset_setup: make scylla_cpuset_setup idempotent without introducing regressions
  scylla_prepare: stop generating 'mode' value in perftune.yaml
2022-10-02 13:25:13 +03:00
Michael Livshin
d178ac17dc nix: add README.md
Signed-off-by: Michael Livshin <repo@cmm.kakpryg.net>
2022-10-02 12:26:02 +03:00
Michael Livshin
7bd13be3f2 build: improvements & upgrades to Nix dev environment
* Add some more useful stuff to the shell environment, so it actually
  works for debugging & post-mortem analysis.

* Wrap ccache & distcc transparently (distcc will be used unless
  NODISTCC is set to a non-empty value in the environment; ccache will
  be used if CCACHE_DIR is not empty).

* Package the Scylla Python driver (instead of the C* one).

* Catch up to misc build/test requirements (including optional) by
  requiring or custom-packaging: wasmtime 0.29.0, cxxbridge,
  pytest-asyncio, liburing.

* Build statically-linked zstd in a saner and more idiomatic fashion.

* In pure builds (where sources lack Git metadata), derive
  SCYLLA_RELEASE from source hash.

* Refactor things for more parameterization.

* Explicitly stub out installPhase (seeing that "nix build" succeeds
  up to installPhase means we didn't miss any dependencies).

* Add flake support.

* Add copious comments.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-10-02 11:47:16 +03:00
Michael Livshin
839d8f40e6 build: allow setting SCYLLA_RELEASE from outside
The extant logic for deriving the value of SCYLLA_RELEASE from the
source tree has those assumptions:

* The tree being built includes Git metadata.

* The value of `date` is trustworthy and interesting.

* There are no uncommitted changes (those relevant to building,
  anyway).

The above assumptions are either irrelevant or problematic in pure
build environments (such as the sandbox set up by `nix-build`):

* Pure builds use cleaned-up sources with all timestamps reset to Unix
  time 0.  Those cleaned-up sources are saved (in the Nix store, for
  example) and content-hashed, so leaving the (possibly huge) Git
  metadata increases the time to copy the sources and wastes disk
  space (in fact, Nix in flake mode strips `.git` unconditionally).

* Pure builds run in a sandbox where time is, likewise, reset to Unix
  time 0, so the output of `date` is neither informative nor useful.

Now, the only build step that uses Git metadata in the first place is
the SCYLLA_RELEASE value derivation logic.  So, essentially, answering
the question "is the Git metadata needed to build Scylla" is a matter
of definition, and is up to us.  If we elect to ignore Git metadata
and current time, we can derive SCYLLA_RELEASE value from the content
hash of the cleaned-up tree, regardless of the way that tree was
arrived at.

This change makes it possible to skip the derivation of SCYLLA_RELEASE
value from Git metadata and current time by way of setting
SCYLLA_RELEASE in the environment.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-10-02 11:47:16 +03:00
Avi Kivity
17b1cb4434 dirty_memory_manager: move third memory threshold parameter of region_group constructor to reclaim_config
Place it along the other parameters.
2022-09-30 22:17:37 +03:00
Avi Kivity
ecf30ee469 dirty_memory_manager: simplify region_group::update()
We notice there are two separate conditions controlling a call to
a single outcome, notify_pressure_relief(). Merge them into a single
boolean variable.
2022-09-30 22:15:45 +03:00
Avi Kivity
230fff299a dirty_memory_manager: fold region_group::notify_hard_pressure_relieved into its callers
It is trivial.
2022-09-30 22:11:01 +03:00
Avi Kivity
12b81173b9 dirty_memory_manager: clean up region_group::do_update_hard_and_check_relief()
Remove synthetic "rg" local.
2022-09-30 22:09:09 +03:00
Avi Kivity
e1bad8e883 dirty_memory_manager: make do_update_hard_and_check_relief() a member of region_group
It started life as something shared between memory_hard_limit and
region_group, but now that they are back being the same thing, we
can make it a member again.
2022-09-30 22:04:26 +03:00
Avi Kivity
6b21c10e9e dirty_memory_manager: remove accessors around region_group::_under_hard_pressure
It is now only accessed from within the class, so the
accessors don't help anything.
2022-09-30 21:59:46 +03:00
Avi Kivity
6a02bb7c2b dirty_memory_manager: merge memory_hard_limit into region_group
The two classes always have a 1:1 or 0:1 relationship, and
so we can just move all the members of memory_hard_limit
into region_group, with the functions that track the relationship
(memory_hard_limit::{add,del}()) removed.

The 0:1 relationship is maintained by initializing the
hard limit parameter with std::numeric_limits<size_t>::max().
The _hard_total_memory variable is always checked if it is
greater than this parameter in order to do anything, and
with this default it can never be.
2022-09-30 21:59:38 +03:00
Avi Kivity
45ab24e43d dirty_memory_manager: rename members in memory_hard_limit
In preparation for merging memory_hard_limit into region_group,
disambiguate similarly named members by adding the word "hard" in
random places.

memory_hard_limit and region_group are candidates for merging
because they constantly reference each other, and memory_hard_limit
does very little by itself.
2022-09-30 21:47:33 +03:00
Avi Kivity
aca96c4103 readers/multishard: restore shard_reader_v2::do_fill_buffer() indentation 2022-09-30 19:19:51 +03:00
Avi Kivity
b08196f3b3 readers/multishard: convert shard_reader_v2::do_fill_buffer() to a pure coroutine
do_full_buffer() is an eclectic mix of coroutines and continuations.
That makes it hard to follow what is running sequentially and
concurrently.

Convert it into a pure coroutine by changing internal continuations
to lambda coroutines. These lambda coroutines are guarded with
seastar::coroutine::lambda. Furthermore, a future that is co_awaited
is converted to immediate co_await (without an intermediate future),
since seastar::coroutine::lambda only works if the coroutine is awaited
in the same statement it is defined on.
2022-09-30 19:19:48 +03:00
Kamil Braun
b2cf610567 test/pylib: scylla_cluster: improve cluster printing
Print the cluster name and stopped servers in addition to the running
servers.

Fix a logging call which tried to print a server in place of a cluster
and even at that it failed (the server didn't have a hostname yet so it
printed as an empty string). Add another logging call.
2022-09-30 17:00:05 +02:00
Kamil Braun
05ed3769dd test/pylib: don't pass test_case_name to after-test endpoint
It's redundant now, the manager tracks the current test case using
before-test endpoint calls.
2022-09-30 16:41:45 +02:00
Kamil Braun
dc6f37b7f7 test/pylib: scylla_cluster: track current test case name and print it
Use `_before_test` calls to track the current test case name.
Concatenate it with the unique test name like this:
`test_topology.1::test_add_server_add_column`, and print it
instead of the test case name.
2022-09-30 16:38:35 +02:00
Kamil Braun
5be818d73b test.py: pass the unique test name (e.g. test_topology.1) to cluster manager
This helps us distinguish the different repeats of a test in logs.
Rename the variable accordingly in `ScyllaClusterManager`.
2022-09-30 16:24:10 +02:00
Kamil Braun
fde4642472 test/pylib: scylla_cluster: pass the test case name to before_test
We pass the test case name to `after_test` - so make it consistent.
Arguably, the test case name is more useful (as it's more precise) than
the test name.
2022-09-30 16:17:59 +02:00
Kamil Braun
43d8b4a214 test/pylib: use "test_case_name" variable name when talking about test cases
Distinguish "test name" (e.g. `test_topology`) from "test case name"
(e.g. `test_add_server_add_column` - a test case inside
`test_topology`).
2022-09-30 16:15:48 +02:00
Botond Dénes
060dda8e00 Merge 'Reduce dependencies on large data handler header' from Benny Halevy
Reduce the false dependencies on db/large_data_handler.hh by
not including it from commonly used header files, and rather including
it only in the source files that actually need it.

The is in preparation for https://github.com/scylladb/scylladb/issues/11449

Closes #11654

* github.com:scylladb/scylladb:
  test: lib: do not include db/large_data_handler.hh in test_service.hh
  test: lib: move sstable test_env::impl ctor out of line
  sstables: do not include db/large_data_handler.hh in sstables.hh
  api/column_family: add include db/system_keyspace.hh
2022-09-30 13:27:38 +03:00
Tomasz Grabiec
5268f0f837 test: lib: random_mutation_generator: Don't generate mutations with marker uncompacted with shadowable tombstone
The generator was first setting the marker then applied tombstones.

The marker was set like this:

  row.marker() = random_row_marker();

Later, when shadowable tombstones were applied, they were compacted
with the marker as expected.

However, the key for the row was chosen randomly in each iteration and
there are multiple keys set, so there was a possibility of a key clash
with an earlier row. This could override the marker without applying
any tombstones, which is conditional on random choice.

This could generate rows with markers uncompacted with shadowable tombstones.

This broken row_cache_test::test_concurrent_reads_and_eviction on
comparison between expected and read mutations. The latter was
compacted because it went through an extra merge path, which compacts
the row.

Fix by making sure there are no key clashes.

Closes #11663
2022-09-30 11:27:01 +03:00
Kamil Braun
1793d43b15 test/pylib: scylla_cluster: mark server_remove as not implemented
The `server_remove` function did a very weird thing: it shut down a
server and made the framework 'forget' about it. From the point of view
of the Scylla cluster and the driver the server was still there.

Replace the function's body with `raise NotImplementedError`. In the
future it can be replaced with an implementation that calls
`removenode` on the Scylla cluster.

Remove `test_remove_server_add_column` from `test_topology`. It
effectively does the same thing as `test_stop_server_add_column`, except
that the framework also 'forgets' about the stopped server. This could
lead to weird situations because the forgotten server's IP could be
reused in another test that was running concurrently with this test.

Closes #11657
2022-09-29 21:03:18 +03:00
Pavel Emelyanov
6a5b0d6c70 table: Handle storage_io_error's ENOSPC when flushing
Commit a9805106 (table: seal_active_memtable: handle ENOSPC error)
made memtable flushing code stand ENOSPC and continue flusing again
in the hope that the node administrator would provide some free space.

However, it looks like the IO code may report back ENOSPC with some
exception type this code doesn't expect. This patch tries to fix it

refs: #11245

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-29 19:16:30 +03:00
Pavel Emelyanov
826244084e table: Rewrap retry loop
The existing loop is very branchy in its attempts to find out whether or
not to abort. The "allowed_retries" count can be a good indicator of the
decision taken. This makes the code notably shorter and easier to extend

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-29 19:14:46 +03:00
Benny Halevy
776b009c0f test: lib: do not include db/large_data_handler.hh in test_service.hh
It was needed for defining and referencing nop_lp_handler
and in sstable_3_x_test for testing the large_data_handler.

Remove the include from the commonly used header file
to reduce the false dependencies on large_data_handler.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 18:36:16 +03:00
Benny Halevy
678d88576b test: lib: move sstable test_env::impl ctor out of line
To prepare for removing the include of db/large_data_handler.hh
from test/lib/test_services.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 18:35:12 +03:00
Botond Dénes
ad04f200d3 Merge 'database: automatically take snapshot of base table views' from Benny Halevy
The logic to reject explicit snapshot of views/indexes was improved in aa127a2dbb. However, we never implemented auto-snapshot of
view/indexes when taking a snapshot of the base table.

This is implemented in this patch.

The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.

Fixes #11612

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11616

* github.com:scylladb/scylladb:
  database: automatically take snapshot of base table views
  api: storage_service: reject snapshot of views in api layer
2022-09-29 13:33:31 +03:00
Benny Halevy
ae7fd1c7b2 sstables: do not include db/large_data_handler.hh in sstables.hh
Reduce dependencies by only forward-declaring
class db::large_data_handler in sstables.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 12:42:58 +03:00
Benny Halevy
fb7e55b0a8 api/column_family: add include db/system_keyspace.hh
For db::system_keyspace::load_view_build_progress that currently
indirectly satisfied via sstables/sstables.hh ->
db/large_data_handler.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 12:42:54 +03:00
Nadav Har'El
4a7794fb64 alternator: better error message when adding a GSI to an existing table
Due to issue #11567, Alternator do not yet support adding a GSI to an
existing table via UpdateTable with the GlobalSecondaryIndexUpdates
parameter.

However, currently, we print a misleading error message in this case,
complaining about the AttributeDefinitions parameter. This parameter
is also required with GlobalSecondaryIndexUpdates, but it's not the
main problem, and the user is likely to be confused why the error message
points to that specific paramter and what it means that this parameter
is claimed to be "not supported" (while it is supported, in CreateTable).
With this patch, we report that GlobalSecondaryIndexUpdates is not
supported.

This patch does not fix the unsupported feature - it just improves
the error message saying that it's not supported.

Refs #11567

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11650
2022-09-29 09:00:31 +03:00
Asias He
c194c811df repair: Yield in repair_service::do_decommission_removenode_with_repair
When walking through the ranges, we should yield to prevent stalls. We
do similar yield in other node operations.

Fix a stall in 5.1.dev.20220724.f46b207472a3 with build-id
d947aaccafa94647f71c1c79326eb88840c5b6d2

```
!INFO | scylla[6551]: Reactor stalled for 10 ms on shard 0. Backtrace:
0x4bbb9d2 0x4bba630 0x4bbb8e0 0x7fd365262a1f 0x2face49 0x2f5caff
0x36ca29f 0x36c89c3 0x4e3a0e1
````

Fixes #11146

Closes #11160
2022-09-28 18:21:35 +03:00
Avi Kivity
cf3830a249 Merge 'Add support for TRUNCATE USING TIMEOUT' from Benny Halevy
Extend the cql3 truncate statement to accept attributes,
similar to modification statements.

To achieve that we define cql3::statements::raw::truncate_statement
derived from raw::cf_statement, and implement its pure virtual
prepare() method to make a prepared truncate_statement.

The latter is no longer derived from raw::cf_statement,
and just stores a schema_ptr to get to the keyspace and column_family.

`test_truncate_using_timeout` cql-pytest was added to test
the new USING TIMEOUT feature.

Fixes #11408

Also, update docs/cql/ddl.rst truncate-statement section respectively.

Closes #11409

* github.com:scylladb/scylladb:
  docs: cql-extensions: add TRUNCATE to USING TIMEOUT section.
  docs: cql: ddl: add support for TRUNCATE USING TIMEOUT
  cql3, storage_proxy: add support for TRUNCATE USING TIMEOUT
  cql3: selectStatement: restrict to USING TIMEOUT in grammar
  cql3: deleteStatement: restrict to USING TIMEOUT|TIMESTAMP in grammar
2022-09-28 18:19:03 +03:00
Avi Kivity
19374779bb Merge 'Fix large data warning and docs' from Benny Halevy
The series contains fixes for system.large_* log warning and respective documentation.
This prepares the way for adding a new system.large_collections table (See #11449):

Fixes #11620
Fixes #11621
Fixes #11622

the respective fixes should be backported to different release branches, based on the respective patches they depend on (mentioned in each issue).

Closes #11623

* github.com:scylladb/scylladb:
  docs: adjust to sstable base name
  docs: large-partition-table: adjust for additional rows column
  docs: debugging-large-partition: update log warning example
  db/large_data_handler: print static cell/collection description in log warning
  db/large_data_handler: separate pk and ck strings in log warning with delimiter
2022-09-28 17:52:23 +03:00
Nadav Har'El
de1bc147bc Merge 'test.py: cleanups in topology test suites' from Kamil Braun
Fix the type of `create_server`, rename `topology_for_class` to `get_cluster_factory`, simplify the suite definitions and parameters passed to `get_cluster_factory`

Closes #11590

* github.com:scylladb/scylladb:
  test.py: replace `topology` with `cluster_size` in Topology tests
  test.py: rename `topology_for_class` to `get_cluster_factory`
  test/pylib: ScyllaCluster: fix create_server parameter type
2022-09-28 15:19:54 +03:00
Kamil Braun
1bcc28b48b test/topology_raft_disabled: reenable test_raft_upgrade
The test was disabled due to a bug in the Python driver which caused the
driver not to reconnect after a node was restarted (see
scylladb/python-driver#170).

Introduce a workaround for that bug: we simply create a new driver
session after restarting the nodes. Reenable the test.

Closes #11641
2022-09-28 15:13:42 +03:00
Mikołaj Grzebieluch
be8fcba8c1 raft: broadcast_tables: add support for bind variables
Extended the queries language to support bind variables which are bound in the
execution stage, before creating a raft command.

Adjusted `test_broadcast_tables.py` to prepare statements at the beginning of the test.

Fixed a small bug in `strongly_consistent_modification_statement::check_access`.

Closes #11525
2022-09-28 09:54:59 +03:00
Alejo Sanchez
02933c9b82 test.py: close aiohttp session for topology tests
Close the aiohttp ClientSession after pytest session finishes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11648
2022-09-27 18:09:08 +02:00
Kamil Braun
82481ae31b Merge 'raft server, log size limit in bytes' from Gusev Petr
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.

snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is greater than
 max_log_size - (size of trailing entries).

Closes #11397

* github.com:scylladb/scylladb:
  raft replication_test, make backpressure test to do actual backpressure
  raft server, shrink_to_fit on log truncation
  raft server, release memory if add_entry throws
  raft server, log size limit in bytes
2022-09-27 14:25:08 +02:00
Kamil Braun
ed67f0e267 Merge 'test.py: fix topology init error handling' from Alecco
When there are errors starting the first cluster(s) the logs of the server logs are needed. So move `.start()` to the `try` block in `test.py` (out of `asynccontextmanager`).

While there, make `ScyllaClusterManager.start()` idempotent.

Closes #11594

* github.com:scylladb/scylladb:
  test.py: fix ScyllaClusterManager start/stop
  test.py: fix topology init error handling
2022-09-27 11:36:07 +02:00
Petr Gusev
bc50b7407f raft replication_test, make backpressure test to do actual backpressure
Before this patch this test didn't actually experience any backpressure since all the commands were executed sequentially.
2022-09-27 12:04:14 +04:00
Petr Gusev
cbfe033786 raft server, shrink_to_fit on log truncation
We don't want to keep memory we don't use, shrink_to_fit guarantees that.

In fact, boost::deque frees up memory when items are deleted, so this change has little effect at the moment, but it may pay off if we change the container in the future.
2022-09-27 12:02:36 +04:00
Petr Gusev
b34dfed307 raft server, release memory if add_entry throws
We consume memory from semaphore in add_entry_on_leader, but never release it if add_entry throws.
2022-09-27 12:02:34 +04:00
Benny Halevy
b178813cba docs: cql-extensions: add TRUNCATE to USING TIMEOUT section.
List the queries that support the TIMEOUT parameter.

Mention the newly added support for TRUNCATE
USING TIMEOUT.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Benny Halevy
b0bad0b153 docs: cql: ddl: add support for TRUNCATE USING TIMEOUT
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Benny Halevy
64140ccf05 cql3, storage_proxy: add support for TRUNCATE USING TIMEOUT
Extend the cql3 truncate statement to accept attributes,
similar to modification statements.

To achieve that we define cql3::statements::raw::truncate_statement
derived from raw::cf_statement, and implement its pure virtual
prepare() method to make a prepared truncate_statement.

The latter, statements::truncate_statement, is no longer derived
from raw::cf_statement, and just stores a schema_ptr to get to the
keyspace and column_family names.

`test_truncate_using_timeout` cql-pytest was added to test
the new USING TIMEOUT feature.

Fixes #11408

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Benny Halevy
27d3e48005 cql3: selectStatement: restrict to USING TIMEOUT in grammar
It is preferred to reject USING TLL / TIMESTAMP at the grammar
level rather than functionally validating the USING attributes.

test_using_timeout was adjusted respectively to expect the
`SyntaxException` error rather than `InvalidRequest`.

Note that cql3::statements::raw::select_statement validate_attrs
now asserts that the ttl or the timestamp attributes aren't set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Benny Halevy
0728d33d5f cql3: deleteStatement: restrict to USING TIMEOUT|TIMESTAMP in grammar
It is preferred to reject USING TLL / TIMESTAMP at the grammar
level rather than functionally validating the USING attributes.

test_using_timeout was adjusted respectively to expect the
`SyntaxException` error rather than `InvalidRequest`.

Note that now delete_statement ctor asserts that the ttl
attribute is not set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 18:30:39 +03:00
Kamil Braun
696bdb2de7 test.py: replace topology with cluster_size in Topology tests
First, a reminder of a few basic concepts in Scylla:
- "topology" is a mapping: for each node, its DC and Rack.
- "replication strategy" is a method of calculating replica sets in
  a cluster. It is not a cluster-global property; each keyspace can have
  a different replication strategy. A cluster may have multiple
  keyspaces.
- "cluster size" is the number of nodes in a cluster.

Replication strategy is orthogonal to topology. Cluster size can be
derived from topology and is also orthogonal to replication strategy.

test.py was confusing the three concepts together. For some reason,
Topology suites were specifying a "topology" parameter which contained
replication strategy details - having nothing to do with topology. Also
it's unclear why a test suite would specify anything to do with
replication strategies - after all, a test may create keyspaces with
different replication strategies, and a suite may contain multiple
different tests.

Get rid of the "topology" parameter, replace it with a simple
"cluster_size". In the future we may re-introduce it when we actually
implement the possibility to start clusters with custom topologies
(which involves configuring the snitch etc.) Simplify the test.py code.
2022-09-26 15:17:50 +02:00
Botond Dénes
895522db23 mutation_fragment_stream_validator: make interface more robust
The validator has several API families with increasing amount of detail.
E.g. there is an `operator()(mutation_fragment_v2::kind)` and an
overload also taking a position. These different API families
currently cannot be mixed. If one uses one overload-set, one has to
stick with it, not doing so will generate false-positive failures.
This is hard to explain in documentation to users (provided they even
read it). Instead, just make the validator robust enough such that the
different API subsets can be mixed in any order. The validator will try
to make most of the situation and validate as much as possible.
Behind the scenes all the different validation methods are consolidated
into just two: one for the partition level, the other for the
intra-partition level. All the different overloads just call these
methods passing as much information as they have.
A test is also added to make sure this works.
2022-09-26 13:26:26 +03:00
Kamil Braun
0725ab3a3e test.py: rename topology_for_class to get_cluster_factory
The previous name had nothing to do with what the function calculated
and returned (it returned a `create_cluster` function; the standard name
for a function that constructs objects would be 'factory', so
`get_cluster_factory` is an appropriate name for a function that returns
cluster factories).
2022-09-26 11:45:44 +02:00
Kamil Braun
06cc4f9259 test/pylib: ScyllaCluster: fix create_server parameter type
The only usage of `ScyllaCluster` constructor passed a `create_server`
function which expected a `List[str]` for the second parameter, while
the constructor specified that the function should expect an
`Optional[List[str]]`. There was no reason for the latter, we can easily
fix this type error.

Also give a type hint for `create_cluster` function in
`PythonTestSuite.topology_for_class`. This is actually what catched the
type error.
2022-09-26 11:45:44 +02:00
Petr Gusev
27e60ecbf4 raft server, log size limit in bytes
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.

snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is
greater than max_log_size - (size of trailing entries).
2022-09-26 13:10:10 +04:00
Benny Halevy
d32c497cd9 database: automatically take snapshot of base table views
The logic to reject explicit snapshot of views/indexes
was improved in aa127a2dbb.
However, we never implemented auto-snapshot of
view/indexes when taking a snapshot of the base table.

This is implemented in this patch.

The implementation is built on top of
ba42852b0e
so it would be hard to backport to 5.1 or earlier
releases.

Fixes #11612

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 11:02:54 +03:00
Benny Halevy
55b0b8fe2c api: storage_service: reject snapshot of views in api layer
Rather than pushing the check to
`snapshot_ctl::take_column_family_snapshot`, just check
that explcitly when taking a snapshot of a particular
table by name over the api.

Other paths that call snapshot_ctl::take_column_family_snapshot
are internal and use it to snap views already.

With that, we can get rid of the allow_view_snapshots flag
that was introduced in aab4cd850c.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-26 10:44:56 +03:00
Botond Dénes
4d017b6d7e mutation_fragment_stream_validator: add reset() to validating filter
Allow the high level filtering validator to be reset() to a certain
position, so it can be used in situations where the consumption is not
continuous (fast-forwarding or paging).
2022-09-26 10:17:28 +03:00
Botond Dénes
a8cbf66573 mutation_fragment_stream_validator: move active tomsbtone validation into low level validator
Currently the active range tombstone change is validated in the high
level `mutation_fragment_stream_validating_stream`, meaning that users of
the low-level `mutation_fragment_stream_validator` don't benefit from
checking that tombstones are properly closed.
This patch moves the validation down to the low-level validator (which
is what the high-level one uses under the hood too), and requires all
users to pass information about changes to the active tombstone for each
fragment.
2022-09-26 10:17:27 +03:00
Nadav Har'El
868a884b79 test/cql-pytest: add reproducer for ignored IS NOT NULL
This test reproduces issue #10365: It shows that although "IS NOT NULL" is
not allowed in regular SELECT filters, in a materialized view it is allowed,
even for non-key columns - but then outright ignored and does not actually
filter out anything - a fact which already surprised several users.

The test also fails on Cassandra - it also wrongly allows IS NOT NULL
on the non-key columns but then ignores this in the filter. So the test
is marked with both xfail (known to fail on Scylla) and cassandra_bug
(fails on Cassandra because of what we consider to be a Cassandra bug).

Refs #10365
Refs #11606

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11615
2022-09-26 09:02:08 +03:00
Anna Stuchlik
c5285bcb14 doc: remove the section about updating OS packages during upgrade from upgrade guides for Ubunut and Debian (from 4.5 to 4.6)
Closes #11629
2022-09-26 08:04:02 +03:00
Avi Kivity
ad2f1dc704 Merge 'Avoid default initialization of token_metadata and topology when not needed' from Pavel Emelyanov
The goal is not to default initialize an object when its fields are about to be immediately overwritten by the consecutive code.

Closes #11619

* github.com:scylladb/scylladb:
  replication_strategy: Construct temp tokens in place
  topology: Define copy-sonctructor with init-lists
2022-09-25 18:08:42 +03:00
Jan Ciolek
ac152af88c expression: Add for_each_boolean factor
boolean_factors is a function that takes an expression
and extracts all children of the top level conjunction.

The problem is that it returns a vector<expression>,
which is inefficent.

Sometimes we would like to iterate over all boolean
factors without allocations. for_each_boolean_factor
is implemented for this purpose.

boolean_factors() can be implemented using
for_each_boolean_factor, so it's done to
reduce code duplication.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-09-25 16:34:22 +03:00
Avi Kivity
2a538f5543 Merge 'Cut one of snitch->gossiper links' from Pavel Emelyanov
Snitch uses gossiper for several reasons, one of is to re-gossip the topology-related app states when property-file snitch config changes. This set cuts this link by moving re-gossiping into the existing storage_service::snitch_reconfigured() subscription. Since initial snitch state gossiping happens in storage service as well, this change is not unreasonable.

Closes #11630

* github.com:scylladb/scylladb:
  storage_service: Re-gossiping snitch data in reconfiguration callback
  storage_service: Coroutinize snitch_reconfigured()
  storage_service: Indentation fix after previous patch
  storage_service: Reshard to shard-0 earlier
  storage_service: Refactor snitch reconfigured kick
2022-09-25 16:08:48 +03:00
Benny Halevy
a1adbf1f59 docs: adjust to sstable base name
Since 244df07771 (scylla 5.1),
only the sstable basename is kept in the large_* system tables.
The base path can be determined from the keyspace and
table name.

Fixes #11621

Adjust the examples in documentation respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:38:13 +03:00
Benny Halevy
33924201cc docs: large-partition-table: adjust for additional rows column
Since a7511cf600 (scylla 5.0),
sstables containing partitions with too many rows are recorded in system.large_partitions.

Adjust the doc respectively.

Fixes #11622

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:38:13 +03:00
Benny Halevy
92ff17c6e3 docs: debugging-large-partition: update log warning example
The log warning format has changed since f3089bf3d1
and was fixed in the previous patch to include
a delimiter between the partition key, clustering key, and
column name.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:38:13 +03:00
Benny Halevy
fcbbc3eb9c db/large_data_handler: print static cell/collection description in log warning
When warning about a large cell/collection in a static row,
print that fact in the log warning to make it clearer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:37:42 +03:00
Benny Halevy
4670829502 db/large_data_handler: separate pk and ck strings in log warning with delimiter
Currently (since f3089bf3d1),
when printing a warning to the log about large rows and/or cells
the clustering key string is concatenated to the partition key string,
rendering the warning confsing and much less useful.

This patch adds a '/' delimiter to separate the fields,
and also uses one to separate the clustering key from the column name
for large cells.  In case of a static cell, the clustering key is null
hence the warning will look like: `pk//column`.

This patch does NOT change anything in the large_* system
table schema or contents.  It changes only the log warning format
that need not be backward compatible.

Fixes #11620

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-25 14:36:41 +03:00
Pavel Emelyanov
47958a4b37 storage_service: Re-gossiping snitch data in reconfiguration callback
Nowadays it's done inside snitch, and snitch needs to carry gossiper
refernece for that. There's an ongoing effort in de-globalizing snitch
and fixing its dependencies. This patch cuts this snitch->gossiper link
to facilitate the mentioned effort.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:55 +03:00
Pavel Emelyanov
932566d448 storage_service: Coroutinize snitch_reconfigured()
Next patch will add more sleeping code to it and it's simpler if the new
call is co_await-ed rather than .then()-ed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:55 +03:00
Pavel Emelyanov
7fee98cad0 storage_service: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:55 +03:00
Pavel Emelyanov
3d4ea2c628 storage_service: Reshard to shard-0 earlier
It makes next patch shorter and nicer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:55 +03:00
Pavel Emelyanov
11b79f9f80 storage_service: Refactor snitch reconfigured kick
The snitch_reconfigured calls update_topology with local node bcast
address argument. Things get simpler if the callee gets the address
itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-23 14:31:43 +03:00
Botond Dénes
b9d55ee02f Merge 'Add cassandra functional - show warn/err when tombstone threshold reached.' from Taras Borodin
Add cassandra functional - show warn/err when tombstone_warn_threshold/tombstone_failure_threshold reached on select, by partitions. Propagate raw query_string from coordinator to replicas.

Closes #11356

* github.com:scylladb/scylladb:
  add utf8:validate to operator<< partition_key with_schema.
  Show warn message if `tombstone_warn_threshold` reached on querier.
2022-09-23 05:53:47 +03:00
Pavel Emelyanov
9e7407ff91 replication_strategy: Construct temp tokens in place
Otherwise, the token_metadata object is default-initialized, then it's
move-assigned from another object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-22 19:19:32 +03:00
Pavel Emelyanov
d540af2cb0 topology: Define copy-sonctructor with init-lists
Otherwise the topology is default-constructed, then its fields
are copy-assigned with the data from the copy-from reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-22 19:18:58 +03:00
Tomasz Grabiec
ccbfe2ef0d Merge 'Fix invalid mutation fragment stream issues' from Botond Dénes
Found by a fragment stream validator added to the mutation-compactor (https://github.com/scylladb/scylladb/pull/11532). As that PR moves very slowly, the fixes for the issues found are split out into a PR of their own.
The first two of these issues seems benign, but it is important to remember that how benign an invalid fragment stream is depends entirely on the consumer of said stream. The present consumer of said streams may swallow the invalid stream without problem now but any future change may cause it to enter into a corrupt state.
The last one is a non-benign problem (again because the consumer reacts badly already) causing problems when building query results for range scans.

Closes #11604

* github.com:scylladb/scylladb:
  shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied
  readers/mutation_readers: compacting_reader: remember injected partition-end
  db/view: view_builder::execute(): only inject partition-start if needed
2022-09-22 17:57:27 +02:00
Taras Borodin
c155ae1182 add utf8:validate to operator<< partition_key with_schema. 2022-09-22 16:42:31 +03:00
TarasBor
1f4a93da78 Show warn message if tombstone_warn_threshold reached on querier.
When querier read page with tombstones more than `tombstone_warn_threshold` limit - warning message appeared in logs.
If `tombstone_warn_threshold:0` feature disabled.

Refs scylladb#11410
2022-09-22 16:42:31 +03:00
Avi Kivity
0cbaef31c1 dirty_memory_manager: fold do_update() into region_group::update()
There is just one caller, and folding the two functions enables
simplification.
2022-09-22 15:51:19 +03:00
Avi Kivity
8672f2248c dirty_memory_manager: simplify memory_hard_limit's do_update
do_update() has an output parameter (top_relief) which can either
be set to an input parameter or left alone. Simplify it by returning
bool and letting the caller reuse the parameter's value instead.
2022-09-22 15:50:48 +03:00
Avi Kivity
1858268377 dirty_memory_manager: drop soft limit / soft pressure members in memory_hard_limit
They are write-only.

This corresponds to the fact that memory_hard_limit does not do
flushing (which is initiated by crossing the soft limit), it only
blocks new allocations.
2022-09-22 14:59:38 +03:00
Asias He
9ed401c4b2 streaming: Add finished percentage metrics for node ops using streaming
We have added the finished percentage for repair based node operations.

This patch adds the finished percentage for node ops using the old
streaming.

Example output:

scylla_streaming_finished_percentage{ops="bootstrap",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="decommission",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="rebuild",shard="0"} 0.561945
scylla_streaming_finished_percentage{ops="removenode",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="repair",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="replace",shard="0"} 1.000000

In addition to the metrics, log shows the percentage is added.

[shard 0] range_streamer - Finished 2698 out of 2817 ranges for rebuild, finished percentage=0.95775646

Fixes #11600

Closes #11601
2022-09-22 14:19:34 +03:00
Avi Kivity
8369741063 dirty_memory_manager: de-template do_update(region_group_or_memory_hard_limit)
We made this function a template to prevent code duplication, but now
memory_hard_limit was sufficiently simplified so that the implementations
can start to diverge.
2022-09-22 14:16:43 +03:00
Avi Kivity
76ced5a60c dirty_memory_manager: adjust soft_limit threshold check
Use `>` rather than `>=` to match the hard limit check. This will
aid simplification, since for memory_hard_limit the soft and hard limits
are identical.

This should not cause any material behavior change, we're not sensitive
to single byte accounting. Typical limits are on the order of gigabytes.
2022-09-22 14:06:01 +03:00
Avi Kivity
b9eb26cd77 dirty_memory_manager: drop memory_hard_limit::_name
It's write-only.
2022-09-22 14:01:57 +03:00
Avi Kivity
c64fb66cc3 dirty_memory_manager: simplify memory_hard_limit configuration
We observe that memory_hard_limit's reclaim_config is only ever
initialized as default, or with just the hard_limit parameter.
Since soft_limit defaults to hard_limit, we can collapse the two
into a limit. The reclaim callbacks are always left as the default
no-op functions, so we can eliminate them too.

This fits with memory_hard_limit only being responsible for the hard
limit, and for it not having any memtables to reclaim on its own.
2022-09-22 13:56:59 +03:00
Avi Kivity
2f907dc47d dirty_memory_manager: fold region_group_reclaimer into {memory_hard_limit,region_group}
region_group_reclaimer is used to initialize (by reference) instances of
memory_hard_limit and region_group. Now that it is a final class, we
can fold it into its users by pasting its contents into those users,
and using the initializer (reclaim_config) to initialize the users. Note
there is a 1:1 relationship between a region_group_reclaimer instance
and a {memory_hard_limit,region_group} instance.

It may seem like code duplication to paste the contents of one class into
two, but the two classes use region_group_reclaimer differently, and most
of the code is just used to glue different classes together, so the
next patches will be able to get rid of much of it.

Some notes:
 - no_reclaimer was replaced by a default reclaim_config, as that's how
   no_reclaimer was initialized
 - all members were added as private, except when a caller required one
   to be public
 - an under_presssure() member already existed, forwarding to the reclaimer;
   this was just removed.
2022-09-22 13:56:59 +03:00
Avi Kivity
d8f857e74b dirty_memory_manager: stop inheriting from region_group_reclaimer
This inheritance makes it harder to get rid of the class. Since
there are no longer any virtual functions in the class (apart from
the destructor), we can just convert it to a data member. In a few
places, we need forwarding functions to make formerly-inherited functions
visible to outside callers.

The virtual destructor is removed and the class is marked final to
verify it is no longer a base class anywhere.
2022-09-22 13:56:59 +03:00
Avi Kivity
26f3a123a5 dirty_memory_manager: test: unwrap region_group_reclaimer
In one test, region_group_reclaimer is wrapped in another class just
to toggle a bool, but with the new callbacks it's easy to just use
a bool instead.
2022-09-22 13:56:59 +03:00
Avi Kivity
1d3508e02c dirty_memory_manager: change region_group_reclaimer configuration to a struct
It's just so much nicer.

The "threshold" limit was renamed to "hard_limit" to contrast it with
"soft_limit" (in fact threshold is a good name for soft_limit, since
it's a point where the behavior begins to change, but that's too much
of a change).
2022-09-22 13:56:59 +03:00
Avi Kivity
2c54c7d51e dirty_memory_manager: convert region_group_reclaimer to callbacks
region_group_reclaimer is partially policy (deciding when to reclaim)
and partially mechanism (implementing reclaim via virtual functions).

Move the mechanism to callbacks. This will make it easy to fold the
policy part into region_group and memory_hard_limit. This folding is
expected to simplify things since most of region_group_reclaimer is
cross-class communication.
2022-09-22 13:56:59 +03:00
Avi Kivity
8fa0652e68 dirty_memory_manager: consolidate region_group_reclaimer constructors
Delegate to other constructors rather than repeating the code. Doesn't
help much here, but simplifies the next patch.
2022-09-22 13:56:59 +03:00
Avi Kivity
5efbfa4cab dirty_memory_manager: rename {memory_hard_limit,region_group}::notify_relief
It clashes with region_group_reclaimer::notify_relief, which does something
different. Since we plan to merge region_group_reclaimer into
memory_hard_limit and region_group (this can simplify the code), we
need to avoid duplicate function names.
2022-09-22 13:56:59 +03:00
Avi Kivity
a72ac14154 dirty_memory_manager: drop unused parameter to memory_hard_limit constructor 2022-09-22 13:56:59 +03:00
Avi Kivity
fca5689052 dirty_memory_manager: drop memory_hard_limit::shutdown()
It is empty.
2022-09-22 13:56:59 +03:00
Avi Kivity
152136630c dirty_memory_manager: split region_group hierarchy into separate classes
Currently, region_group forms a hierarchy. Originally it was a tree,
but previous work whittled it down to a parent-child relationship
(with a single, possible optional parent, and a single child).

The actual behavior of the parent and child are very different, so
it makes sense to split them. The main difference is that the parent
does not contain any regions (memtables), but the child does.

This patch mechanically splits the class. The parent is named
memory_hard_limit (reflecting its role to prevent lsa allocation
above the memtable configured hard limit). The child is still named
region_group.

Details of the transformation:
 - each function or data member in region_group is either moved to
   memory_hard_limit, duplicated in memory_hard_limit, or left in
   region_group.
 - the _regions and _blocked_requests members, which were always
   empty in the parent, were not duplicated. Any member that only accessed
   them was similarly left alone.
 - the "no_reclaimer" static member which was only used in the parent
   was moved there. Similarly the constructor which accepted it
   was moved.
 - _child was moved to the parent, and _parent was kept in the child
   (more or less the defining change of the split) Similarly
   add(region_group*) and del(region_group*) (which manage _child) were moved.
 - do_for_each_parent(), which iterated to the top of the tree, was removed
   and its callers manually unroll the loop. For the parent, this is just
   a single iteration (since we're iterating towards the root), for the child,
   this can be two iterations, but the second one is usually simpler since
   the parent has many members removed.
 - do_update(), introduced in the previous patch, was made a template that
   can act on either the parent or the child. It will be further simplified
   later.
 - some tests that check now-impossible topologies were removed.
 - the parent's shutdown() is trivial since it has no _blocked_requests,
   but it was kept to reduce churn in the callers.
2022-09-22 13:56:59 +03:00
Avi Kivity
009bd63217 dirty_memory_manager: extract code block from region_group::update
A mechanical transformation intended to allow reuse later. The function
doesn't really deserve to exist on its own, so it will be swallowed back
by its callers later.
2022-09-22 13:56:59 +03:00
Avi Kivity
34d5322368 dirty_memory_manager: move more allocation_queue functions out of region_group
More mechanical changes, reducing churn for later patches.
2022-09-22 13:56:59 +03:00
Avi Kivity
4bc2638cf9 dirty_memory_manager: move some allocation queue related function definitions outside class scope
It's easier to move them to a new owner (allocation_queue) if they are
not defined in the class.
2022-09-22 13:56:59 +03:00
Avi Kivity
71493c2539 dirty_memory_manager: move region_group::allocating_function and related classes to new class allocation_queue
region_group currently fulfills two roles: in one role, when instantiated
as dirty_memory_manager::_virtual_region_group, it is responsible
for holding functions that allocate memtable memory (writes) and only
allowing them to run when enough dirty memory has been flushed from other
memtables. The other role, when instantiated as
dirty_memory_manager::_real_region_group, is to provide a hard stop when
the total amount of dirty memory exceeds the limit, since the other limit
is only estimated.

We want to simplify the whole thing, which means not using the same class
for two different roles (or rather, we can use it for both roles if we
simplify the internals significantly).

As a first step towards clarifying what functionality is used in what
role, move some classes related to holding allocating functions to a new
class allocation_queue. We will gradually move move content there, reducing
the amount of role confusion in region_group.

Type aliases are added to reduce churn.
2022-09-22 13:56:59 +03:00
Avi Kivity
d21d2cdb3e dirty_memory_manager: remove support for multiple subgroups
We only have one parent/child relationship in the region group
hierarchy, so support for more is unneeded complexity. Replace
the subgroup vector with a single pointer, and delete a test
for the removed functionality.
2022-09-22 13:56:59 +03:00
Botond Dénes
0ccb23d02b shard_reader: do_fill_buffer(): only update _end_of_stream after buffer is copied
Commit 8ab57aa added a yield to the buffer-copy loop, which means that
the copy can yield before done and the multishard reader might see the
half-copied buffer and consider the reader done (because
`_end_of_stream` is already set) resulting in the dropping the remaining
part of the buffer and in an invalid stream if the last copied fragment
wasn't a partition-end.

Fixes: #11561
2022-09-22 13:54:36 +03:00
Botond Dénes
16a0025dc3 readers/mutation_readers: compacting_reader: remember injected partition-end
Currently injecting a partition-end doesn't update
`_last_uncompacted_kind`, which will allow for a subsequent
`next_partition()` call to trigger injecting a partition-end, leading to
an invalid mutation fragment stream (partition-end after partition-end).
Fix by changing `_last_uncompacted_kind` to `partition_end` when
injecting a partition-end, making subsequent injection attempts noop.

Fixes: #11608
2022-09-22 13:54:36 +03:00
Botond Dénes
681e6ae77f db/view: view_builder::execute(): only inject partition-start if needed
When resuming a build-step, the view builder injects the partition-start
fragment of the last processed partition, to bring the consumer
(compactor) into the correct state before it starts to consume the
remainder of the partition content. This results in an invalid fragment
stream when the partition was actually over or there is nothing left for
the build step. Make the inject conditional on when the reader contains
more data for the partition.

Fixes: #11607
2022-09-22 13:54:36 +03:00
Nadav Har'El
517c1529aa docs: update docs/alternator/getting-started.md
Update several aspects of the alternator/getting-started.md which were
not up-to-date:

* When the documented was written, Alternator was moving quickly so we
  recommended running a nightly version. This is no longer the case, so
  we should recommend running the latest stable build.

* The link to the download link is no longer helpful for getting Docker
  instructions (it shows some generic download options). Instead point to
  our dockerhub page.

* Replace mentions of "Scylla" by the new official name, "ScyllaDB".

* Miscelleneous copy-edits.

Fixes #11218

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11605
2022-09-22 11:08:05 +03:00
Piotr Sarna
481240b8b4 Merge 'Alternator: Run more TTL tests by default (and add a test for metrics)' from Nadav Har'El
We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).

In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.

I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.

Fixes #11374.
Refs https://github.com/scylladb/scylla-monitoring/issues/1783

Closes #11384

* github.com:scylladb/scylladb:
  test/alternator: insert test names into Scylla logs
  rest api: add a new /system/log operation
  alternator ttl: log warning if scan took too long.
  alternator,ttl: allow sub-second TTL scanning period, for tests
  test/alternator: skip fewer Alternator TTL tests
  test/alternator: test Alternator TTL metrics
2022-09-22 09:47:50 +02:00
Botond Dénes
ef7471c460 readers/mutation_reader: stream validator: fix log level detection logic
The mutation fragment stream validator filter has a detailed debug log
in its constructor. To avoid putting together this message when the log
level is above debug, it is enclosed in an if, activated when log level
is debug or trace... at least that was intended. Actually the if is
activated when the log level is debug or above (info, warn or error) but
is only actually logged if the log level is exactly debug. Fix the logic
to work as intended.

Closes #11603
2022-09-22 09:41:45 +03:00
Pavel Emelyanov
7ae73c665b gossiper: Remove some dead code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11599
2022-09-22 06:58:29 +03:00
Pavel Emelyanov
5edeecf39b token_metadata: Provide dc/rack for bootstrapping nodes
The token_metadata::calculate_pending_ranges_for_bootstrap() makes a
clone of itself and adds bootstrapping nodes to the clone to calculate
ranges. Currently added nodes lack the dc/rack which affects the
calculations the bad way.

Unfortunately, the dc/rack for those nodes is not available on topology
(yet) and needs pretty heavy patching to have. Fortunately, the only
caller of this method has gossiper at hand to provide the dc/rack from.

fixes: #11531

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11596
2022-09-22 06:55:52 +03:00
Petr Gusev
210d9dd026 raft: fix snapshots leak
applier_fiber could create multiple snapshots between
io_fiber run. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.

In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.

_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.

The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.

Closes #11530
2022-09-21 12:46:26 +02:00
Kamil Braun
3b096b71c1 test/topology_raft_disabled: disable test_raft_upgrade
For some reason, the test is currently flaky on Jenkins. Apparently the
Python driver does not reconnect to the cluster after the cluster
restarts (well it does, but then it disconnects from one of the nodes
and never reconnects again). This causes the test to hang on "waiting
until driver reconnects to every server" until it times out.

Disable it for now so it doesn't block next promotion.
2022-09-21 12:32:40 +02:00
Nadav Har'El
22bb35e2cb Merge 'doc: update the "Counting all rows in a table is slow" page' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11373

- Updated the information on the "Counting all rows in a table is slow" page.
- Added COUNT to the list of selectors of the SELECT statement (somehow it was missing).
- Added the note to the description of the COUNT() function with a link to the KB page for troubleshooting if necessary. This will allow the users to easily find the KB page.

Closes #11417

* github.com:scylladb/scylladb:
  doc: add a comment to remove the note in version 5.1
  doc: update the information on the Countng all rows page and add the recommendation to upgrade ScyllaDB
  doc: add a note to the description of COUNT with a reference to the KB article
  doc: add COUNT to the list of acceptable selectors of the SELECT statement
2022-09-21 12:32:40 +02:00
Alejo Sanchez
510215d79a test.py: fix ScyllaClusterManager start/stop
Check existing is_running member to avoid re-starting.

While there, set it to false after stopping.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-21 11:42:02 +02:00
Alejo Sanchez
933d93d052 test.py: fix topology init error handling
Start ScyllaClusterManager within error handling so the ScyllaCluster
logs are available in case of error starting up.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-21 09:15:25 +02:00
Nadav Har'El
91bccee9be Update tools/java submodule
* tools/java b004da9d1b...5f2b91d774 (1):
  > install.sh is using wrong permissions for install cqlsh files

Fixes #11584
2022-09-20 14:42:34 +03:00
Avi Kivity
2cec417426 Merge 'tools: use the standard allocator' from Botond Dénes
Tools want to be as little disrupting to the environment they run in as possible, because they might be run in a production environment, next to a running scylladb production server. As such, the usual behavior of seastar applications w.r.t. memory is an anti-pattern for tools: they don't want to reserve most of the system memory, in fact they don't want to reserve any amount, instead consuming as much as needed on-demand.
To achieve this, tools want to use the standard allocator. To achieve this they need a seastar option to to instruct seastar to *not* configure and use the seastar allocator and they need LSA to cooperate with the standard allocator.
The former is provided by https://github.com/scylladb/seastar/pull/1211.
The latter is solved by introducing the concept of a `segment_store_backend`, which abstracts away how the memory arena for segments is acquired and managed. We then refactor the existing segment store so that the seastar allocator specific parts are moved to an implementation of this backend concept, then we introduce another backend implementation appropriate to the standard allocator.
Finally, tools configure seastar with the newly introduced option to use the standard allocator and similarly configure LSA to use the standard allocator appropriate backend.

Refs: https://github.com/scylladb/scylladb/issues/9882
This is the last major code piece in scylla for making tools production ready.

Closes #11510

* github.com:scylladb/scylladb:
  test/boost: add alternative variant of logalloc test
  tools: use standard allocator
  utils/logalloc: add use_standard_allocator_segment_pool_backend()
  utils/logalloc: introduce segment store backend for standard allocator
  utils/logalloc: rebase release segment-store on segment-store-backend
  utils/logalloc: introduce segment_store_backend
  utils/logalloc: push segment alloc/dealloc to segment_store
  test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
2022-09-20 12:59:34 +03:00
Nadav Har'El
4a453c411d Merge 'doc: add the upgrade guide from 5.0 to 5.1' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11376

This PR adds the upgrade guide from version 5.0 to 5.1. It involves adding new files (5.0-to-5.1) and language/formatting improvements to the existing content (shared by several upgrade guides).

Closes #11577

* github.com:scylladb/scylladb:
  doc: upgrade the command to upgrade the ScyllaDB image from 5.0 to 5.1
  doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1
2022-09-20 11:52:59 +03:00
Nadav Har'El
d81bedd3be Merge 'doc: add ScyllaDB image upgrade guides for patch releases' from Anna Stuchlik
This PR adds the missing upgrade guides for upgrading the ScyllaDB image to a patch release:

- ScyllaDB 5.0: /upgrade/upgrade-opensource/upgrade-guide-from-5.x.y-to-5.x.z/upgrade-guide-from-5.x.y-to-5.x.z-image/
- ScyllaDB Enterprise: /upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image/ (the file name is wrong and will be fixed with another PR)

In addition, the section regarding the recommended upgrade procedure has been improved.
Fixes https://github.com/scylladb/scylladb/issues/11450
Fixes https://github.com/scylladb/scylladb/issues/11452

Closes #11460

* github.com:scylladb/scylladb:
  doc: update the commands to upgrade the ScyllaDB image
  doc: fix the filename in the index to resolve the warnings and fix the link
  doc: apply feedback by adding she step fo load the new repo and fixing the links
  doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst
  doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file
  doc: update the Enterprise guide to include the Enterprise-onlyimage file
  doc: update the image files
  doc: split the upgrade-image file to separate files for Open Source and Enterprise
  doc: clarify the alternative upgrade procedures for the ScyllaDB image
  doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z
  doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z
2022-09-20 11:51:26 +03:00
Botond Dénes
4ef7b080e3 docs/using-scylla/migrate-scylla.rst: remove link to unirestore
It points to a private scylladb repo, which has no place in user-facing
documentation. For now there is no public replacement, but a similar
functionality is in the works for Scylla Manager.

Fixes: #11573

Closes #11580
2022-09-20 11:46:28 +03:00
Anna Stuchlik
7b2209f291 doc: upgrade the command to upgrade the ScyllaDB image from 5.0 to 5.1 2022-09-20 10:42:47 +02:00
Anna Stuchlik
db75adaf9a doc: update the commands to upgrade the ScyllaDB image 2022-09-20 10:36:18 +02:00
Nadav Har'El
4c93a694b7 cql: validate bloom_filter_fp_chance up-front
Scylla's Bloom filter implementation has a minimal false-positive rate
that it can support (6.71e-5). When setting bloom_filter_fp_chance any
lower than that, the compute_bloom_spec() function, which writes the bloom
filter, throws an exception. However, this is too late - it only happens
while flushing the memtable to disk, and a failure at that point causes
Scylla to crash.

Instead, we should refuse the table creation with the unsupported
bloom_filter_fp_chance. This is also what Cassandra did six years ago -
see CASSANDRA-11920.

This patch also includes a regression test, which crashes Scylla before
this patch but passes after the patch (and also passes on Cassandra).

Fixes #11524.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11576
2022-09-20 06:18:51 +03:00
Botond Dénes
60991358e8 Merge 'Improvements to test/lib/sstable_utils.hh' from Raphael "Raph" Carvalho
Changes done to avoid pitfalls and fix issues of sstable-related unit tests

Closes #11578

* github.com:scylladb/scylladb:
  test: Make fake sstables implicitly belong to current shard
  test: Make it clearer that sstables::test::set_values() modify data size
2022-09-20 06:14:07 +03:00
Nadav Har'El
a1ff865c77 Merge 'test/topology_raft_disabled: write basic raft upgrade test' from Kamil Braun
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.

It waits until driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.

The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).

Closes #11563

* github.com:scylladb/scylladb:
  test/topology_raft_disabled: write basic raft upgrade test
  test: setup logging in topology suites
2022-09-19 20:27:08 +03:00
Alejo Sanchez
087ae521c5 test.py: make client fail if before test check fails
Check if request to server side (test.py) failed and raise if so.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11575
2022-09-19 18:04:07 +02:00
Raphael S. Carvalho
2f52698a26 test: Make fake sstables implicitly belong to current shard
Fake SSTables will be implicitly owned by the shard that created them,
allowing them to be called on procedures that assert the SSTables
are owned by the current shard, like the table's one that rebuilds
the sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-19 12:05:24 -03:00
Raphael S. Carvalho
697f200319 test: Make it clearer that sstables::test::set_values() modify data size
By adding a param with default value, we make it clear in the interface
that the procedure modifies sstable data size.

It can happen one calls this function without noticing it overrides
the data size previously set using a different function.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-19 12:01:24 -03:00
Anna Stuchlik
2513497f9a doc: add the guide to upgrade ScyllaDB from 5.0 to 5.1 2022-09-19 16:06:24 +02:00
Kamil Braun
b770443300 test/topology_raft_disabled: write basic raft upgrade test
The test changes the servers' configuration to include `raft`
in the `experimental-features` list, then restarts them.

It waits until driver reconnects to every server after restarting.
Then it checks that upgrade eventually finishes on every server by
querying `group0_upgrade_state` key in `system.scylla_local`. Finally,
it performs a schema change and verifies that a corresponding entry has
appeared in `system.group0_history`.

The commit also increases the number of clusters in the suite cluster
pool. Since the suite contains only one test at this time this only has
an effect if we run the test multiple times (using `--repeat`).
2022-09-19 13:29:35 +02:00
Kamil Braun
fd986bfed1 test: setup logging in topology suites
Make it possible to use logging from within tests in the topology
suites. The tests are executed using `pytest`, which uses a `pytest.ini`
file for logging configuration.

Also cleanup the `pytest.ini` files a bit.
2022-09-19 12:23:11 +02:00
Nadav Har'El
711dcd56b6 docs/alternator: refer to the right issue
In compatibility.md where we refer to the missing ability to add a GSI
to an existing table - let's refer to a new issue specifically about this
feature, instead of the old bigger issue about UpdateItem.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11568
2022-09-19 11:05:07 +03:00
Piotr Sarna
5597bc8573 Merge 'Alternator: test and fix crashes and errors...
when using ":attrs" attribute' from Nadav Har'El

This PR improves the testing for issue #5009 and fixes most of it (but
not all - see below). Issue #5009 is about what happens when a user
tries to use the name `:attrs` for an attribute - while Alternator uses
a map column with that name to hold all the schema-less attributes of an
item.  The tests we had for this issue were partial, and missed the
worst cases which could result in Scylla crashing on specially-crafted
PutItem or UpdateItem requests.

What the tests missed were the cases that `:attrs` is used as a
**non-key**. So in this PR we add additional tests for this case,
several of them fail or even crash Scylla, and then we fix all these
cases.

Issue #5009 remains open because using `:attrs` as the name of a **key**
is still not allowed. But because it results in a clean error message
when attempting to create a table with such a key, I consider this
remaining problem very minor.

Refs #5009.

Closes #11572

* github.com:scylladb/scylladb:
  alternator: fix crashes an errors when using ":attrs" attribute
  alternator: improve tests for reserved attribute name ":attrs"
2022-09-19 09:48:06 +02:00
Nadav Har'El
999ca2d588 alternator: fix crashes an errors when using ":attrs" attribute
Alternator uses a single column, a map, with the deliberately strange
name ":attrs", to hold all the schema-less attributes of an item.
The existing code is buggy when the user tries to write to an attribute
with this strange name ":attrs". Although it is extremely unlikely that
any user would happen to choose such a name, it is nevertheless a legal
attribute name in DynamoDB, and should definitely not cause Scylla to crash
as it does in some cases today.

The bug was caused by the code assuming that to check whether an attribute
is stored in its own column in the schema, we just need to check whether
a column with that name exists. This is almost true, except for the name
":attrs" - a column with this name exists, but it is a map - the attribute
with that name should be stored *in* the map, not as the map. The fix
is to modify that check to special-case ":attrs".

This fix makes the relevant tests, which used to crash or fail, now pass.

This fix solves most of #5009, but one point is not yet solved (and
perhaps we don't need to solve): It is still not allowed to use the
name ":attrs" for a **key** attribute. But trying to do that fails cleanly
(during the table creation) with an appropriate error message, so is only
a very minor compatibility issue.

Refs #5009

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-19 10:30:11 +03:00
Nadav Har'El
6f8dca3760 alternator: improve tests for reserved attribute name ":attrs"
As explained in issue #5009, Alternator currently forbids the special
attribute name ":attrs", whereas DynamoDB allows any string of approriate
length (including the specific string ":attrs") to be used.

We had only a partial test for this incompatibility, and this patch
improves the testing of this issue. In particular, we were missing a
test for the case that the name ":attrs" was used for a non-key
attribute (we only tested the case it was used as a sort key).

It turns out that Alternator crashes on the new test, when the test tries
to write to a non-key attribute called ":attrs", so we needed to mark
the new test with "skip". Moreover, it turns out that different code paths
handle the attribute name ":attrs" differently, and also crash or fail
in other ways - so we added more than one xfailing and skipped tests
that each fails in a different place (and also a few tests that do pass).

As usual, the new tests we checked to pass on DynamoDB.

Refs #5009

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-19 10:30:06 +03:00
Botond Dénes
3003f1d747 Merge 'alternator: small documentation and comment fixes' from Nadav Har'El
This tiny series fixes some small error and out-of-date information in Alternator documentation and code comments.

Closes #11547

* github.com:scylladb/scylladb:
  alternator ttl: comment fixes
  docs/alternator: fix mention of old alternator-test directory
2022-09-19 09:27:53 +03:00
Kamil Braun
348582c4c8 test/pylib: pool: make it possible to free up space
Some tests mark clusters as 'dirty', which makes them non-reusable by
later tests; we don't want to return them to the pool of clusters.

This use-case was covered by the `add_one` function in the `Pool` class.
However, it had the unintended side effect of creating extra clusters
even if there were no more tests that were waiting for new clusters.

Rewrite the implementation of `Pool` so it provides 3 interface
functions:
- `get` borrows an object, building it first if necessary
- `put` returns a borrowed object
- `steal` is called by a borrower to free up space in the pool;
  the borrower is then responsible for cleaning up the object.

Both `put` and `steal` wake up any outstanding `get` calls. Objects are
built only in `get`, so no objects are built if none are needed.

Closes #11558
2022-09-18 12:05:57 +03:00
Botond Dénes
22128977e4 test/boost: add alternative variant of logalloc test
Which intializes LSA with use_standard_allocator_segment_pool_backend()
running the logalloc_test suite on the standard allocator segment pool
backend. To avoid duplicating the test code, the new test-file pulls in
the test code via #include. I'm not proud of it, but it works and we
test LSA with both the debug and standard memory segment stores without
duplicating code.
2022-09-16 14:57:23 +03:00
Botond Dénes
13ace7a05e Merge "Fix RPC sockets configuration wrt topology" from Pavel Emelyanov
"
Messaging service checks dc/rack of the target node when creating a
socket. However, this information is not available for all verbs, in
particular gossiper uses RPC to get topology from other nodes.

This generates a chicken-and-egg problem -- to create a socket messaging
service needs topology information, but in order to get one gossiper
needs to create a socket.

Other than gossiper, raft starts sending its APPEND_ENTRY messages early
enough so that topology info is not avaiable either.

The situation is extra-complicated with the fact that sockets are not
created for individual verbs. Instead, verbs are groupped into several
"indices" and socket is created for it. Thus, the "gossiping" index that
includes non-gossiper verbs will create topology-less socket for all
verbs in it. Worse -- raft sends messages w/o solicited topology, the
corresponding socket is created with the assumption that the peer lives
in default dc and rack which doesn't matchthe local nodes' dc/rack and
the whole index group gets the "randomly" configured socket.

Also, the tcp-nodelay tries to implement similar check, but uses wrong
index of 1, so it's also fixed here.
"

* 'br-messaging-topology-ignoring-clients' of https://github.com/xemul/scylla:
  messaging_service: Fix gossiper verb group
  messaging_service: Mind the absence of topology data when creating sockets
  messaging_service: Templatize and rename remove_rpc_client_one
2022-09-16 13:27:56 +03:00
Botond Dénes
6a0db84706 tools: use standard allocator
Use the new seastar option to instruct seastar to not initialize and use
the seastar allocator, relying on the standard allocator instead.
Configure LSA with the standard allocator based segment store backend:
* scylla-types reserves 1MB for LSA -- in theory nothing here should use
  LSA, but just in case...
* scylla-sstable reserves 100MB for LSA, to avoid excessive trashing in
  the sstable index caches.

With this, tools now should allocate memory on demand, without reserving
a large chunk of (or all of) the available memory, as regular seastar
apps do.
2022-09-16 13:07:01 +03:00
Botond Dénes
a55903c839 utils/logalloc: add use_standard_allocator_segment_pool_backend()
Creating a standard-memory-allocator backend for the segment store.
This is targeted towards tools, which want to configure LSA with a
segment store backend that is appropriate for the standard allocator
(which they want to use).
We want to be able to use this in both release and debug mode. The
former will be used by tools and the latter will be used to run the
logalloc tests with this new backend, making sure it works and doesn't
regress. For this latter, we have to allow the release and debug stores
to coexist in the same build and for the debug store to be able to
delegate to the release store when the standard allocator backend is
used.
2022-09-16 13:02:40 +03:00
Kamil Braun
595472ac59 Merge 'Don't use qctx in CDC tables quering' from Pavel Emelyanov
There's a bunch of helpers for CDC gen service in db/system_keyspace.cc. All are static and use global qctx to make queries. Fortunately, both callers -- storage_service and cdc_generation_service -- already have local system_keyspace references and can call the methods via it, thus reducing the global qctx usage.

Closes #11557

* github.com:scylladb/scylladb:
  system_keyspace: De-static get_cdc_generation_id()
  system_keyspace: De-static cdc_is_rewritten()
  system_keyspace: De-static cdc_set_rewritten()
  system_keyspace: De-static update_cdc_generation_id()
2022-09-16 11:52:01 +02:00
Kamil Braun
0a6f601996 Merge 'Raft test topology fix request paths and API response handling' from Alecco
- Raise on response not HTTP 200 for `.get_text()` helper
- Fix API paths
- Close and start a fresh driver when restarting a server and it's the only server in the cluster
- Fix stop/restart response as text instead of inspecting (errors are status 500 and raise exceptions)

Closes #11496

* github.com:scylladb/scylladb:
  test.py: handle duplicate result from driver
  test.py: log server restarts for topology tests
  test.py: log actions for topology tests
  Revert "test.py: restart stopped servers before...
  test.py: ManagerClient API fix return text
  test.py: ManagerClient raise on HTTP != 200
  test.py: ManagerClient fix paths to updated resource
2022-09-16 11:29:10 +02:00
Botond Dénes
c1c74005b7 utils/logalloc: introduce segment store backend for standard allocator
To be used by tools, this store backend is compatible with the standard
allocator as it acquires the memory arena for segments via mmap().
2022-09-16 12:16:57 +03:00
Botond Dénes
d2a7ebbe66 utils/logalloc: rebase release segment-store on segment-store-backend
Rebase the seastar allocator based segment store implementation on the
recently introduced segment store backend which is now abstracts away
how memory for segments is obtained.
This patch also introduces an explicit `segment_npos` to be used for
cases when a segment -> index mapping fails (segment doesn't belong to
the store). Currently the seastar allocator based store simply doesn't
handle this case, while the standard allocator based store uses 0 as the
implicit invalid index.
2022-09-16 12:16:57 +03:00
Botond Dénes
3717f7740d utils/logalloc: introduce segment_store_backend
We want to make it possible to select the segment-store to be used for
LSA -- the seastar allocator based one or the standard allocator based
on -- at runtime. Currently this choice is made at compile time via
preprocessor switches.
The current standard memory based store is specialized for debug build,
we want something more similar to the seastar standard memory allocator
based one. So we introduce a segment store backend for the current
seastar allocator based store, which abstracts how the backing memory
for all segments is allocated/freed, while keeping the segment <-> index
mapping common. In the next patches we will rebase the current seastar
allocator based segment store on this backend and later introduce
another backend for standard allocator, targeted for release builds.
2022-09-16 12:16:57 +03:00
Botond Dénes
5ea4d7fb39 utils/logalloc: push segment alloc/dealloc to segment_store
Currently the actual alloc/dealloc of memory for segments is located
outside the segment stores. We want to abstract away how segments are
allocated, so we move this logic too into the segment store. For now
this results in duplicate code in the two segment store implementations,
but this will soon be gone.
2022-09-16 12:16:57 +03:00
Botond Dénes
e82ea2f3ad test/boost/logalloc_test: make test_compaction_with_multiple_regions exception-safe
Said test creates two vectors, the vector storage being allocated with
the default allocator, while its content being allocated on LSA. If an
exception is thrown however, both are freed via the default allocator,
triggering an assert in LSA code. Move the cleanup into a `defer()` so
the correct cleanup sequence is executed even on exceptions.
2022-09-16 12:16:57 +03:00
Pavel Emelyanov
e221bb0112 system_keyspace: De-static get_cdc_generation_id()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-16 08:34:15 +03:00
Pavel Emelyanov
fe48b66c0a cross-shard-barrier: Capture shared barrier in complete
When cross-shard barrier is abort()-ed it spawns a background fiber
that will wake-up other shards (if they are sleeping) with exception.

This fiber is implicitly waited by the owning sharded service .stop,
because barrier usage is like this:

    sharded<service> s;
    co_await s.invoke_on_all([] {
        ...
        barrier.abort();
    });
    ...
    co_await s.stop();

If abort happens, the invoke_on_all() will only resolve _after_ it
queues up the waking lambdas into smp queues, thus the subseqent stop
will queue its stopping lambdas after barrier's ones.

However, in debug mode the queue can be shuffled, so the owning service
can suddenly be freed from under the barrier's feet causing use after
free. Fortunately, this can be easily fixed by capturing the shared
pointer on the shared barrier instead of a regular pointer on the
shard-local barrier.

fixes: #11303

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11553
2022-09-16 08:21:02 +03:00
Michał Chojnowski
78850884d2 test: perf: perf_fast_forward: fix an error message
The test is supposed to give a helpful error message when the user forgets to
run --populate before the benchmark. But this must have become broken at some
point, because execute_cql() terminates the program with an unhelpful
("unconfigured table config") message, which doesn't mention --populate.

Fix that by catching the exception and adding the helpful tip.

Closes #11533
2022-09-15 19:30:10 +02:00
Avi Kivity
d3b8c0c8a6 logalloc: don't crash while reporting reclaim stalls if --abort-on-seastar-bad-alloc is specified
The logger is proof against allocation failures, except if
--abort-on-seastar-bad-alloc is specified. If it is, it will crash.

The reclaim stall report is likely to be called in low memory conditions
(reclaim's job is to alleviate these conditions after all), so we're
likely to crash here if we're reclaiming a very low memory condition
and have a large stall simultaneously (AND we're running in a debug
environment).

Prevent all this by disabling --abort-on-seastar-bad-alloc temporarily.

Fixes #11549

Closes #11555
2022-09-15 19:24:39 +02:00
Pavel Emelyanov
4f67898e7b system_keyspace: De-static cdc_is_rewritten()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-15 18:44:59 +03:00
Pavel Emelyanov
736021ee98 system_keyspace: De-static cdc_set_rewritten()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-15 18:44:53 +03:00
Pavel Emelyanov
b3d139bbdb system_keyspace: De-static update_cdc_generation_id()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-15 18:44:40 +03:00
Michał Chojnowski
cdb3e71045 sstables: add a flag for disabling long-term index caching
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spacially) sparse.
We want to have a way to disable it for the affected workloads.

There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.

This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.

Consequences of this choice:

- The per-SSTable partition_index_cache is unused. Every index_reader has
  its own, and they die together. Independent reads can no longer reuse the
  work of other reads which hit the same index pages. This is not crucial,
  since partition accesses have no (natural) spatial locality. Note that
  the original reason for partition_index_cache -- the ability to share
  reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
  (uncached) input stream from the index file, and every
  bsearch_clustered_cursor has its own cached_file, which dies together with
  the cursor. Note that the cursor still can perform its binary search with
  caching. However, it won't be able to reuse the file pages read by
  index_reader. In particular, if the promoted index is small, and fits inside
  the same file page as its index_entry, that page will be re-read.
  It can also happen that index_reader will read the same index file page
  multiple times. When the summary is so dense that multiple index pages fit in
  one index file page, advancing the upper bound, which reads the next index
  page, will read the same index file page. Since summary:disk ratio is 1:2000,
  this is expected to happen for partitions with size greater than 2000
  partition keys.

Fixes #11202
2022-09-15 17:16:26 +03:00
David Garcia
3cc80da6af docs: update theme 1.3
Update conf.py

Closes #11330
2022-09-15 16:56:41 +03:00
Anna Stuchlik
e5c9f3c8a2 doc: fix the filename in the index to resolve the warnings and fix the link 2022-09-15 15:53:23 +02:00
Anna Stuchlik
338b45303a doc: apply feedback by adding she step fo load the new repo and fixing the links 2022-09-15 15:40:20 +02:00
Alejo Sanchez
92129f1d47 test.py: handle duplicate result from driver
Sometimes the driver calls twice the callback on ready done future with
a None result. Log it and avoid setting the local future twice.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 15:12:50 +02:00
Alejo Sanchez
2da7304696 test.py: log server restarts for topology tests
Add missing logging for server restart.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 15:10:29 +02:00
Alejo Sanchez
61a92afa2d test.py: log actions for topology tests
For debugging, log driver connection, before and after checks, and
topology changes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 15:10:29 +02:00
Botond Dénes
05ef13a627 Merge 'Add support to split large partitions across SSTables' from Raphael "Raph" Carvalho
Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundary, so a large partition is stored in a single file. But that can cause many problems, like memory pressure (e.g.: https://github.com/scylladb/scylladb/issues/4217), and incremental compaction can also not fulfill its promise as the file storing the large partition can only be released once exhausted.

The first step was to add clustering range metadata for first and last partition keys (retrieved from promoted index), which is crucial to determine disjointness at clustering level, and also the order at which the disjoint files should be opened for incremental reading.

The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run.

The final step was to introduce the option for compaction to split large partition being written if it has exceeded the size threshold.

What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the assumption of the disjoint invariant after the second step aforementioned.

Closes #11233

* github.com:scylladb/scylladb:
  test: Add test for large partition splitting on compaction
  compaction: Add support to split large partitions
  sstable: Extend sstable_run to allow disjointness on the clustering level
  sstables: simplify will_introduce_overlapping()
  test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
  test: lib: Fix inefficient merging of mutations in make_sstable_containing()
  sstables: Keep track of first partition's first pos and last partition's last pos
  sstables: Rename min/max position_range to a descriptive name
  sstables_manager: Add sstable metadata reader concurrency semaphore
  sstables: Add ability to find first or last position in a partition
2022-09-15 16:08:56 +03:00
Alejo Sanchez
604f7353ef Revert "test.py: restart stopped servers before...
teardown..."

This reverts commit df1ca57fda.

In order to prevent timeouts on teardown queries, the previous commit
added functionality to restart servers that were down. This issue is
fixed in fc0263fc9b so there's no longer need to restart stopped servers
on test teardown.
2022-09-15 14:47:01 +02:00
Alejo Sanchez
ed81f1a85c test.py: ManagerClient API fix return text
For ManagerClient request API, don't return status, raise an exception.
Server side errors are signaled by status 500, not text body.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 14:47:01 +02:00
Alejo Sanchez
4a5f2418ec test.py: ManagerClient raise on HTTP != 200
Raise an exception if the request result is not HTTP 200 for .get()
helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 14:47:01 +02:00
Alejo Sanchez
a84bde38c0 test.py: ManagerClient fix paths to updated resource
Fix missing path renames for server-side rename
"node" -> "server" API.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-15 14:47:01 +02:00
Kamil Braun
728161003a Merge 'raft server, abort on background errors' from Gusev Petr
Halted background fibers render raft server effectively unusable, so
report this explicitly to the clients.

Fix: #11352

Closes #11370

* github.com:scylladb/scylladb:
  raft server, status metric
  raft server, abort group0 server on background errors
  raft server, provide a callback to handle background errors
  raft server, check aborted state on public server public api's
2022-09-15 14:12:11 +02:00
Alejo Sanchez
b8f68729b0 test.py: Pool add fresh when item not returned
Pool.get() might have waiting callers, so if an item is not returned
to the pool after use, tell the pool to add a new one and tell the pool
an entry was taken (used for total running entries, i.e. clusters).

Use it when a ScyllaCluster is dirty and not returned.

While there improve logging and docstrings.

Issue reported by @kbr-.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11546
2022-09-15 13:56:44 +03:00
Pavel Emelyanov
82162be1f1 messaging_service: Remove init/uninit helpers
These two are just getting in the way when touching inter-components
dependencies around messaging service. Without it m.-s. start/stop
just looks like any other service out there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11535
2022-09-15 11:54:46 +03:00
Raphael S. Carvalho
0a8afe18ca cql: Reject create and alter table with DateTieredCompactionStrategy
It's been ~1 year (2bf47c902e) since we set restrict_dtcs
config option to WARN, meaning users have been warned about the
deprecation process of DTCS.

Let's set the config to TRUE, meaning that create and alter statements
specifying DTCS will be rejected at the CQL level.

Existing tables will still be supported. But the next step will
be about throwing DTCS code into the shadow realm, and after that,
Scylla will automatically fallback to STCS (or ICS) for users which
ignored the deprecation process.

Refs #8914.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11458
2022-09-15 11:46:18 +03:00
Alejo Sanchez
7e3389ee43 test.py: schema timeout less than request timeout
When a server is down, the driver expects multiple schema timeouts
within the same request to handle it properly.

Found by @kbr-

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11544
2022-09-15 11:43:52 +03:00
Raphael S. Carvalho
a04047f390 compaction: Properly handle stop request for off-strategy
If user stops off-strategy via API, compaction manager can decide
to give up on it completely, so data will sit unreshaped in
maintenance set, preventing it from being compacted with data
in the main set. That's problematic because it will probably lead
to a significant increase in read and space amplification until
off-strategy is triggered again, which cannot happen anytime
soon.

Let's handle it by moving data in maintenance set into main one,
even if unreshaped. Then regular compaction will be able to
continue from where off-strategy left off.

Fixes #11543.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11545
2022-09-15 09:21:22 +03:00
Nadav Har'El
33e6a88d9a alternator ttl: comment fixes
This patch fixes a few errors and out-of-date descriptions in comments
in alternator/ttl.cc. No functional changes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-15 00:03:43 +03:00
Nadav Har'El
8af9437508 docs/alternator: fix mention of old alternator-test directory
The directory that used to be called alternator-test is now (and has
been for a long time) really test/alternator. So let's fix the
references to it in docs/alternator/alternator.md.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-15 00:03:43 +03:00
Pavel Emelyanov
2c74062962 messaging_service: Fix gossiper verb group
When configuring tcp-nodelay unconditionally, messaging service thinks
gossiper uses group index 1, though it had changed some time ago and now
those verbs belong to group 0.

fixes: #11465

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:40:47 +03:00
Pavel Emelyanov
7bdad47de2 messaging_service: Mind the absence of topology data when creating sockets
When a socket is created to serve a verb there may be no topology
information regarding the target node. In this case current code
configures socket as if the peer node lived in "default" dc and rack of
the same name. If topology information appears later, the client is not
re-connected, even though it could providing more relevant configuration
(e.g. -- w/o encryption)

This patch checks if the topology info is needed (sometimes it's not)
and if missing it configures the socket in the most restrictive manner,
but notes that the socket ignored the topology on creation. When
topology info appears -- and this happens when a node joins the cluster
-- the messaging service is kicked to drop all sockets that ignored the
topology, so thay they reconnect later.

The mentioned "kick" comes from storage service on-join notification.
More correct fix would be if topology had on-change notification and
messaging service subscribed on it, but there are two cons:

- currently dc/rack do not change on the fly (though they can, e.g. if
  gossiping property file snitch is updated without restart) and
  topology update effectively comes from a single place

- updating topology on token-metadata is not like topology.update()
  call. Instead, a clone of token metadata is created, then update
  happens on the clone, then the clone is committed into t.m. Though
  it's possible to find out commit-time which nodes changed their
  topology, but since it only happens on join this complexity likely
  doesn't worth the effort (yet)

fixes: #11514
fixes: #11492
fixes: #11483

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:30:51 +03:00
Pavel Emelyanov
5ffc9d66ec messaging_service: Templatize and rename remove_rpc_client_one
It actually finds and removes a client and in its new form it also
applies filtering function it, so some better name is called for

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-14 20:30:07 +03:00
Raphael S. Carvalho
20a6483678 test: Add test for large partition splitting on compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:23:19 -03:00
Raphael S. Carvalho
e2ccafbe38 compaction: Add support to split large partitions
Adds support for splitting large partitions during compaction.

Large partitions introduce many problems, like memory overhead and
breaks incremental compaction promise. We want to split large
partitions across fixed-size fragments. We'll allow a partition
to exceed size limit by 10%, as we don't want to unnecessarily split
partitions that just crossed the limit boundary.

To avoid having to open a minimal of 2 fragments in a read, partition
tombstone will be replicated to every fragment storing the
partition.

The splitting isn't enabled by default, and can be used by
strategies that are run aware like ICS. LCS still cannot support
it as it's still using physical level metadata, not run id.

An incremental reader for sstable runs will follow soon.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:23:16 -03:00
Raphael S. Carvalho
4bc24acf81 sstable: Extend sstable_run to allow disjointness on the clustering level
After commit 0796b8c97a, sstable_run won't accept a fragment
that introduces key overlapping. But once we split large partitions,
fragments in the same run may store disjoint clustering ranges
of the same partition. So we're extending sstable_run to look
at clustering dimension, so fragments storing disjoint clustering
ranges of the same large partition can co-exist in the same run.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
574e656793 sstables: simplify will_introduce_overlapping()
An element S1 is completely ordered before S2, if S1's last key is
lower than S2's first key.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
13942ec947 test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
That's where it belongs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
e1560c6b7f test: lib: Fix inefficient merging of mutations in make_sstable_containing()
make_sstable_containing() was absurdly slow when merging thousands of
mutations belonging to the same key, as it was unnecessarily copying
the mutation for every merge, producing bad complexity.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
5937765009 sstables: Keep track of first partition's first pos and last partition's last pos
With first partition's first position and last partition's last
partition, we'll be able to determine which fragments composing a
sstable run store a large partition that was split.

Then sstable run will be able to detect if all fragments storing
a given large partition are disjoint in the clustering level.

Fixes #10637.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
a4bbdfcc58 sstables: Rename min/max position_range to a descriptive name
The new descriptive name is important to make a distinction when
sstable stores position range for first and last rows instead
of min and max.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
e099a9bf3b sstables_manager: Add sstable metadata reader concurrency semaphore
Let's introduce a reader_concurrency_semaphore for reading sstable
metadata, to avoid an OOM due to unlimited concurrency.
The concurrency on startup is not controlled, so it's important
to enforce a limit on the amount of memory used by the parallel
readers.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
9bcad9ffa8 sstables: Add ability to find first or last position in a partition
This new method allows sstable to load the first row of the first
partition and last row of last partition.

That's useful for incremental reading of sstable run which will
be split at clustering boundary.

To get the first row, it consumes the first row (which can be
either a clustering row or range tombstone change) and returns
its position_in_partition.
To get the last row, it does the same as above but in reverse
mode instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:48 -03:00
Nadav Har'El
77467bcbcd Merge 'test/pylib: APIs to read and modify configuration from tests' from Kamil Braun
We introduce `server_get_config` to fetch the entire configuration dict
and `update_config` to update a value under the given key.

Closes #11493

* github.com:scylladb/scylladb:
  test/pylib: APIs to read and modify configuration from tests
  test/pylib: ScyllaServer: extract _write_config_file function
  test/pylib: ScyllaCluster: extend ActionReturn with dict data
  test/pylib: ManagerClient: introduce _put_json
  test/pylib: ManagerClient: replace `_request` with `_get`, `_get_text`
  test: pylib: store server configuration in `ScyllaServer`
2022-09-14 18:49:55 +03:00
Kefu Chai
2a74a0086f docs: fix typos
* s/udpates/updates/
* s/opetarional/operational/

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Closes #11541
2022-09-14 17:04:05 +03:00
Kamil Braun
73bf781e17 test/pylib: APIs to read and modify configuration from tests
We introduce `server_get_config` to fetch the entire configuration dict
and `update_config` to update a value under the given key.
2022-09-14 12:46:41 +02:00
Kamil Braun
1f550428a9 test/pylib: ScyllaServer: extract _write_config_file function
For refreshing the on-disk config file with the config stored in dict
form in the `self.config` field.
2022-09-14 12:46:41 +02:00
Kamil Braun
52e52e8503 test/pylib: ScyllaCluster: extend ActionReturn with dict data
For returning types more complex than text. Also specify a default empty
string value for the `msg` field for non-text return values.
2022-09-14 12:46:41 +02:00
Kamil Braun
c9348ae8ea test/pylib: ManagerClient: introduce _put_json
For sending PUT requests to the Manager (such as updating
configuration).
2022-09-14 12:46:41 +02:00
Kamil Braun
d81c722476 test/pylib: ManagerClient: replace _request with _get, _get_text
`_request` performed a GET request and extracted a text body out of the
response.

Split it into `_get`, which only performs the request, and `_get_text`,
which calls `_get` and extracts the body as text.

Also extract a `_resource_uri` function which will be used for other
request types.
2022-09-14 12:46:41 +02:00
Kamil Braun
9d39e14518 test: pylib: store server configuration in ScyllaServer
In following commits we will make this configuration accessible from
tests through the Manager (for fetching and updating).
2022-09-14 12:46:41 +02:00
Nadav Har'El
cf30432715 Merge 'test: add a topology suite with Raft disabled' from Kamil Braun
Add a suite which is basically equivalent to `topology` except that it
doesn't start servers with Raft enabled.

The suite will be used to test the Raft upgrade procedure.

The suite contains a basic test just to check the suite itself can run;
the test will be removed when 'real' tests are added.

Closes #11487

* github.com:scylladb/scylladb:
  test.py: PythonTestSuite: sum default config params with user-provided ones
  test: add a topology suite with Raft disabled
  test: pylib: use Python dicts to manipulate `ScyllaServer` configuration
  test: pylib: store `config_options` in `ScyllaServer`
2022-09-14 13:37:44 +03:00
Pavel Emelyanov
43131976e9 updateable_value: Update comment about cross-shard copying
refs: #7316

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11538
2022-09-14 12:35:56 +02:00
Michał Chojnowski
9b6fc553b4 db: commitlog: don't print INFO logs on shutdown
The intention was for these logs to be printed during the
database shutdown sequence, but it was overlooked that it's not
the only place where commitlog::shutdown is called.
Commitlogs are started and shut down periodically by hinted handoff.
When that happens, these messages spam the log.

Fix that by adding INFO commitlog shutdown logs to database::stop,
and change the level of the commitlog::shutdown log call to DEBUG.

Fixes #11508

Closes #11536
2022-09-14 11:30:53 +03:00
Avi Kivity
a24a8fd595 Update seastar submodule
* seastar cbb0e888d8...601e0776c0 (1):
  > coroutine: explain and mitigate the lambda coroutine fiasco

Closes #11537
2022-09-13 22:37:29 +03:00
Petr Gusev
4ff0807cd0 raft server, status metric 2022-09-13 19:34:22 +04:00
Alejo Sanchez
6799e766ca test.py: topology increment timeouts even more
Due to slow debug machines timing out, bump up all timeouts
significantly.

The cause was ExecutionProfile request_timeout. Also set a high
heartbeat timeout and bump already set timeouts to be safe, too.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11516
2022-09-13 11:57:31 +02:00
Piotr Dulikowski
e69b44a60f exception: fix the error code used for rate_limit_exception
Per-partition rate limiting added a new error type which should be
returned when Scylla decides to reject an operation due to per-partition
rate limit being exceeded. The new error code requires drivers to
negotiate support for it, otherwise Scylla will report the error as
`Config_error`. The existing error code override logic works properly,
however due to a mistake Scylla will report the `Config_error` code even
if the driver correctly negotiated support for it.

This commit fixes the problem by specifying the correct error code in
`rate_limit_exception`'s constructor.

Tested manually with a modified version of the Rust driver which
negotiates support for the new error. Additionally, tested what happens
when the driver doesn't negotiate support (Scylla properly falls back to
`Config_error`).

Branches: 5.1
Fixes: #11517

Closes #11518
2022-09-13 11:46:15 +02:00
Nadav Har'El
8ece63c433 Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails'
This series introduces two configurable options when working with TWCS tables:

- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.

Refs: #6923
Fixes: #9029

Closes #11445

* github.com:scylladb/scylladb:
  tests: cql_query_test: add mixed tests for verifying TWCS guard rails
  tests: cql_query_test: add test for TWCS window size
  tests: cql_query_test: add test for TWCS tables with no TTL defined
  cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
  cql: add max window restriction for TimeWindowCompactionStrategy
  time_window_compaction_strategy: reject invalid window_sizes
  cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
2022-09-12 23:55:51 +03:00
Botond Dénes
045b053228 Update seastar submodule
* seastar 2b2f6c08...cbb0e888 (10):
  > memory: allow user to select allocator to be used at runtime
  > perftune.py: correct typos
  > Merge 'seastar-addr2line: support more flexible syslog-style backtraces' from Benny Halevy
  > Fix instruction count for start_measuring_time
  > build: s/c-ares::c-ares/c-ares::cares/
  > Merge 'shared_ptr_debug_helper: turn assert into on_internal_error_abort' from Benny Halevy
  > test: fix use after free in the loopback socket
  > doc/tutorial.md: fix docker command for starting hello-world_demo
  > httpd: add a ctor without addr parameter
  > dns: dns_resolver: sock_entry: move-construct tcp/udp entries in place

Closes #11526
2022-09-12 18:34:22 +03:00
Avi Kivity
62ac3432c9 Merge "Always notify dropped RPC connections" from Pavel E
"
This set makes messaging service notify connection drop listeners
when connection is dropped for _any_ reason and cleans things up
around it afterwards
"

* 'br-messaging-notify-connection-drop' of https://github.com/xemul/scylla:
  messaging_service: Relax connection drop on re-caching
  messaging_service: Simplify remove_rpc_client_one()
  messaging_service: Notify connection drop when connection is removed
2022-09-12 17:02:51 +03:00
Yaron Kaikov
27e326652b build_docker.sh:fix python2 dependency
Following the revert of b004da9d1b which solved https://github.com/scylladb/scylla-pkg/issues/3094

updating docker dependency to match `scylla-tools-java` requirements

Closes #11522
2022-09-12 13:33:06 +03:00
Kamil Braun
2fe3e67a47 gms: feature_service: don't distinguish between 'known' and 'supported' features
`feature_service` provided two sets of features: `known_feature_set` and
`supported_feature_set`. The purpose of both and the distinction between
them was unclear and undocumented.

The 'supported' features were gossiped by every node. Once a feature is
supported by every node in the cluster, it becomes 'enabled'. This means
that whatever piece of functionality is covered by the feature, it can
by used by the cluster from now on.

The 'known' set was used to perform feature checks on node start; if the
node saw that a feature is enabled in the cluster, but the node does not
'know' the feature, it would refuse to start. However, if the feature
was 'known', but wasn't 'supported', the node would not complain. This
means that we could in theory allow the following scenario:
1. all nodes support feature X.
2. X becomes enabled in the cluster.
3. the user changes the configuration of some node so feature X will
   become unsupported but still known.
4. The node restarts without error.

So now we have a feature X which is enabled in the cluster, but not
every node supports it. That does not make sense.

It is not clear whether it was accidental or purposeful that we used the
'known' set instead of the 'supported' set to perform the feature check.

What I think is clear, is that having two sets makes the entire thing
unnecessarily complicated and hard to think about.

Fortunately, at the base to which this patch is applied, the sets are
always the same. So we can easily get rid of one of them.

I decided that the name which should stay is 'supported', I think it's
more specific than 'known' and it matches the name of the corresponding
gossiper application state.

Closes #11512
2022-09-12 13:09:12 +03:00
Takuya ASADA
cd5320fe60 install.sh: add --without-systemd option
Since we fail to write files to $USER/.config on Jenkins jobs, we need
an option to skip installing systemd units.
Let's add --without-systemd to do that.

Also, to detect the option availability, we need to increment
relocatable package version.

See scylladb/scylla-dtest#2819

Closes #11345
2022-09-12 13:04:00 +03:00
Avi Kivity
521127a253 Update tools/jmx submodule
* tools/jmx 06f2735...88d9bdc (1):
  > install.sh: add --without-systemd option
2022-09-12 13:02:16 +03:00
Kamil Braun
ce7bb8b6d0 test.py: PythonTestSuite: sum default config params with user-provided ones
Previously, if the suite.yaml file provided
`extra_scylla_config_options` but didn't provide values for `authorizer`
or `authenticator` inside the config options, the harness wouldn't give
any defaults for these keys. It would only provide defaults for these
keys if suite.yaml didn't specify `extra_scylla_config_options` at all.

It makes sense to give the user the ability to provide extra options
while relying on harness defaults for `authenticator` and `authorizer`
if the user doesn't care about them.
2022-09-12 11:58:05 +02:00
Kamil Braun
1661fe9f37 test: add a topology suite with Raft disabled
Add a suite which is basically equivalent to `topology` except that it
doesn't start servers with Raft enabled.

The suite will be used to test the Raft upgrade procedure.

The suite contains a basic test just to check the suite itself can run;
the test will be removed when 'real' tests are added.
2022-09-12 11:58:05 +02:00
Kamil Braun
311806244d test: pylib: use Python dicts to manipulate ScyllaServer configuration
Previously we used a formattable string to represent the configuration;
values in the string were substituted by Python's formatting mechanism
and the resulting string was stored to obtain the config file.

This approach had some downsides, e.g. it required boilerplate work to
extend: to add a new config options, you would have to modify this
template string.

Instead we can represent the configuration as a Python dictionary. Dicts
are easy to manipulate, for example you can sum two dicts; if a key
appears in both, the second dict 'wins':
```
{1:1} | {1:2} == {1:2}
```

This makes the configuration easy to extend without having to write
boilerplate: if the user of `ScyllaServer` wants to add or override a
config option, they can simply add it to the `config_options` dict and
that's it - no need to modify any internal template strings in
`ScyllaServer` implementation like before. The `config_options` dict is
simply summed with the 'base' config dict of `ScyllaServer`
(`config_options` is the right summand so anything in there overrides
anything in the base dict).

An example of this extensibility is the `authenticator` and `authorizer`
options which no longer appear in `scylla_cluster.py` module after this
change, they only appear in the suite.yaml file.

Also, use "workdir" option instead of specifying data dir, commitlog
dir etc. separately.
2022-09-12 11:57:58 +02:00
Kamil Braun
fd19825eaa test: pylib: store config_options in ScyllaServer
Previously the code extracted `authenticator` and `authorizer` keys from
the config options and stored them.

Store the entire dict instead. The new code is easier to extend if we
want to make more options configurable.
2022-09-12 11:57:18 +02:00
Pavel Emelyanov
5663b3eda5 messaging_service: Relax connection drop on re-caching
When messaging_service::get_rpc_client() picks up cached socket and
notices error on it, it drops the connection and creates a new one. The
method used to drop the connection is the one that re-lookups the verb
index again, which is excessive. Tune this up while at it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-12 12:05:02 +03:00
Nadav Har'El
b0371b6bf8 test/alternator: insert test names into Scylla logs
The output of test/alternator/run ends in Scylla's full log file, where
it is hard to understand which log messages are related to which test.
In this patch, we add a log message (using the new /system/log REST API)
every time a test is started and ends.

The messages look like this:

   INFO  2022-08-29 18:07:15,926 [shard 0] api - /system/log:
   test/alternator: Starting test_ttl.py::test_describe_ttl_without_ttl
   ...
   INFO  2022-08-29 18:07:15,930 [shard 0] api - /system/log:
   test/alternator: Ended test_ttl.py::test_describe_ttl_without_ttl

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
a81310e23d rest api: add a new /system/log operation
Add a new REST API operation, taking a log level and a message, and
printing it into the Scylla log.

This can be useful when a test wants to mark certain positions in the
log (e.g., to see which other log messages we get between the two
positions). An alternative way to achieve this could have been for the
test to write directly into the log file - but an on-disk log file is
only one of the logging options that Scylla support, and the approach
in this patch allows to add log message regardless of how Scylla keeps
the logs.

In motivation of this feature is that in the following patch the
test/alternator framework will add log messages when starting and
ending tests, which can help debug test failures.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
b9792ffb06 alternator ttl: log warning if scan took too long.
Currently, we log at "info" level how much time remained at the end of
a full TTL scan until the next scanning period (we sleep for that time).
If the scan was slower than the period, we didn't print anything.

Let's print a warning in this case - it can be useful for debugging,
and also users should know when their desired scan period is not being
honored because the full scan is taking longer than the desired scan
period.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
e7e9adc519 alternator,ttl: allow sub-second TTL scanning period, for tests
Alternator has the "alternator_ttl_period_in_seconds" parameter for
controlling how often the expiration thread looks for expired items to
delete. It is usually a very large number of seconds, but for tests
to finish quickly, we set it to 1 second.
With 1 second expiration latency, test/alternator/test_ttl.py took 5
seconds to run.

In this patch, we change the parameter to allow a  floating-point number
of seconds instead of just an integer. Then, this allows us to halve the
TTL period used by tests to 0.5 seconds, and as a result, the run time of
test_ttl.py halves to 2.5 seconds. I think this is fast enough for now.

I verified that even if I change the period to 0.1, there is no noticable
slowdown to other Alternator tests, so 0.5 is definitely safe.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
746c4bd9eb test/alternator: skip fewer Alternator TTL tests
Most of the Alternator TTL tests are extremely slow on DynamoDB because
item expiration may be delayed up to 24 hours (!), and in practice for
10 to 30 minutes. Because of this, we marked most of these tests
with the "veryslow" mark, causing them to be skipped by default - unless
pytest is given the "--runveryslow" option.

The result was that the TTL tests were not run in the normal test runs,
which can allow regressions to be introduced (luckily, this hasn't happened).

However, this "veryslow" mark was excessive. Many of the tests are very
slow only on DynamoDB, but aren't very slow on Scylla. In particular,
many of the tests involve waiting for an item to expire, something that
happens after the configurable alternator_ttl_period_in_seconds, which
is just one second in our tests.

So in this patch, we remove the "veryslow" mark from 6 tests of Alternator TTL
tests, and instead use two new fixtures - waits_for_expiration and
veryslow_on_aws - to only skip the test when running on DynamoDB or
when alternator_ttl_period_in_seconds is high - but in our usual test
environment they will not get skipped.

Because 5 of these 6 tests wait for an item to expire, they take one
second each and this patch adds 5 seconds to the Alternator test
runtime. This is unfortunate (it's more than 25% of the total Alternator
test runtime!) but not a disaster, and we plan to reduce this 5 second
time futher in the following patch, but decreasing the TTL scanning
period even further.

This patch also increases the timeout of several of these tests, to 120
seconds from the previous 10 seconds. As mentioned above, normally,
these tests should always finish in alternator_ttl_period_in_seconds
(1 second) with a single scan taking less than 0.2 seconds, but in
extreme cases of debug builds on overloaded test machines, we saw even
60 seconds being passed, so let's increase the maximum. I also needed
to make the sleep time between retries smaller, not a function of the
new (unrealistic) timeout.

4 more tests remain "veryslow" (and won't run by default) because they
are take 5-10 seconds each (e.g., a test which waits to see that an item
does *not* get expired, and a test involving writing a lot of data).
We should reconsider this in the future - to perhaps run these tests in
our normal test runs - but even for now, the 6 extra tests that we
start running are a much better protection against regressions than what
we had until now.

Fixes #11374

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

x

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Nadav Har'El
297109f6ee test/alternator: test Alternator TTL metrics
This patch adds a test for the metrics generated by the background
expiration thread run for Alternator's TTL feature.

We test three of the four metrics: scylla_expiration_scan_passes,
scylla_expiration_scan_table and scylla_expiration_items_deleted.
The fourth metric, scylla_expiration_secondary_ranges_scanned, counts the
number of times that this node took over another node's expiration duty.
so requires a multi-node cluster to test, and we can't test it in the
single-node cluster test framework.

To see TTL expiration in action this test may need to wait up to the
setting of alternator_ttl_period_in_seconds. For a setting of 1
second (the default set by test/alternator/run), this means this
test can take up to 1 second to run. If alternator_ttl_period_in_seconds
is set higher, the test is skipped unless --runveryslow is requested.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Botond Dénes
a0392bc1eb Merge 'doc: update the default SStable format' from Anna Stuchlik
The purpose of this PR is to update the information about the default SStable format.
It

Closes #11431

* github.com:scylladb/scylladb:
  doc: simplify the information about default formats in different versions
  doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format
  doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page
  doc: fix additional information regarding the ME format on the SStable 3.x page
  doc: add the ME format to the table
  add a comment to remove the information when the documentation is versioned (in 5.1)
  doc: replace Scylla with ScyllaDB
  doc: fix the formatting and language in the updated section
  doc: fix the default SStable format
2022-09-12 09:50:01 +03:00
Pavel Emelyanov
f3dfc9dbd4 system_keyspace: Don't load preferred IPs if not asked for
If snitch->prefer_local() is false, advertised (via gossiper)
INTERNAL_IPs are not suggested to messaging service to use. The same
should apply to boot-time when messaging service is loaded with those
IPs taken from the system.peers table.

fixes: #11353
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2172/

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220909144800.23122-1-xemul@scylladb.com>
2022-09-12 09:48:23 +03:00
Botond Dénes
9db940ff1b Merge "Make network_topology_strategy_test use topology" from Pavel Emelyanov
"
The test in question plays with snitches to simulate the topology
over which tokens are spread. This set replaces explicit snitch
usage with temporary topology object.

Some snitch traces are still left, but those are for token_metadata
internal which still call global snitch for DC/RACK.
"

* 'br-tests-use-topology-not-snitch' of https://github.com/xemul/scylla:
  network_topology_strategy_test: Use topology instead of snitch
  network_topology_strategy_test: Populate explicit topology
2022-09-12 09:40:17 +03:00
Avi Kivity
6c797587c7 dirty_memory_manager: region_group: remove sorting of subgroups
dirty_memory_manager tracks lsa regions (memtables) under region_group:s,
in order to be able to pick up the largest memtable as a candidate for
flushing.

Just as region_group:s contain regions, they can also contain other
region_group:s in a nested structure. It also tracks the nested region_group
that contains the largest region in a binomial heap.

This latter facility is no longer used. It saw use when we had the system
dirty_memory_manager nested under the user dirty_memory_manager, but
that proved too complicated so it was undone. We still nest a virtual
region_group under the real region_group, and in fact it is the
virtual region_group that holds the memtables, but it is accessed
directly to find the largest memtable (region_group::get_largest_region)
and so all the mechanism that sorts region_group:s is bypassed.

Start to dismantle this house of cards by removing the subgroup
sorting. Since the hierarchy has exactly one parent and one child,
it's clearly useless. This is seen by the fact that we can just remove
everything related.

We still need the _subgroups member to hold the virtual region_group;
it's replaced by a vector. I verified that the non-intrusive vector
is exception safe since push_back() happens at the very end; in any
case this is early during setup where we aren't under memory pressure.

A few tests that check the removed functionality are deleted.

Closes #11515
2022-09-12 09:29:08 +03:00
Botond Dénes
0e2d6cfd61 Merge 'Introduce Compaction Groups' from Raphael "Raph" Carvalho
Compaction group can be defined as a set of files that can be compacted together. Today, all sstables belonging to a table in a given shard belong to the same group. So we can say there's one group per table per shard. As we want to eventually allow isolation of data that shouldn't be mixed, e.g. data from different vnodes, then we want to have more than one group per table per shard. That's why compaction groups is being introduced here.

Today, all memtables and sstables are stored in a single structure per table. After compaction groups, there will be memtables and sstables for each group in the table.

As we're taking an incremental approach, table still supports a single group. But work was done on preparing table for supporting multiple groups. Completing that work is actually the next step. Also, a procedure for deriving the group from token is introduced, but today it always return the single group owned by the table. Once multiple groups are supported, then that procedure should be implemented to map a token to a group.

No semantics was changed by this series.

Closes #11261

* github.com:scylladb/scylladb:
  replica: Move memtables to compaction_group
  replica: move compound SSTable set to compaction group
  replica: move maintenance SSTable set to compaction_group
  replica: move main SSTable set to compaction_group
  replica: Introduce compaction_group
  replica: convert table::stop() into coroutine
  compaction_manager: restore indentation
  compaction_manager: Make remove() and stop_ongoing_compactions() noexcept
  test: sstable_compaction_test: Don't reference main sstable set directly
  test: sstable_utils: Set data size fields for fake SSTable
  test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable
2022-09-12 09:28:44 +03:00
Botond Dénes
5374f0edbf Merge 'Task manager' from Aleksandra Martyniuk
Task manager for observing and managing long-running, asynchronous tasks in Scylla
with the interface for the user. It will allow listing of tasks, getting detailed
task status and progression, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.

At first it will support repair and compaction tasks, and possibly more in the future.

Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.

Task manager's tasks are implemented in two two layers: the abstract
and the implementation one. The latter is a pure virtual class which needs
to be overriden by each module. Abstract layer provides the methods that
are shared by all modules and the access to module-specific methods.

Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.

User can access task manager rest api interface to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish

To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.

Fixes: #9809

Closes #11216

* github.com:scylladb/scylladb:
  test: task manager api test
  task_manager: test api layer implementation
  task_manager: add test specific classes
  task_manager: test api layer
  task_manager: api layer implementation
  task_manager: api layer
  task_manager: keep task_manager reference in http_context
  start sharded task manager
  task_manager: create task manager object
2022-09-12 09:26:46 +03:00
Petr Gusev
1b5fa4088e raft server, abort group0 server on background errors 2022-09-12 10:16:43 +04:00
Petr Gusev
e92dc9c15b raft server, provide a callback to handle background errors
Fix: #11352
2022-09-12 10:16:43 +04:00
Petr Gusev
c57238d3d6 raft server, check aborted state on public server public api's
Fix: #11352
2022-09-12 10:16:40 +04:00
Felipe Mendes
6a3d8607b4 tests: cql_query_test: add mixed tests for verifying TWCS guard rails
This patch adds set of 10 cenarios that have been unveiled during additional testing.
In particular, most of the scenarios cover ALTER TABLE statements, which - if not handled -
may break the guardrails safe-mode. The situations covered are:

- STCS->TWCS with no TTL defined
- STCS->TWCS with small TTL
- STCS->TWCS with large TTL value
- TWCS table with small to large TTL
- No TTL TWCS to large TTL and then small TTL
- twcs_max_window_count LiveUpdate - Decrease TTL
- twcs_max_window_count LiveUpdate - Switch CompactionStrategy
- No TTL TWCS table to STCS
- Large TTL TWCS table, modify attribute other than compaction and default_time_to_live
- Large TTL STCS table, fail to switch to TWCS with no TTL explicitly defined
2022-09-11 17:57:14 -03:00
Felipe Mendes
a7a91e3216 tests: cql_query_test: add test for TWCS window size
This patch adds a test for checking the validity of tables using TimeWindowCompactionStrategy
with an incorrect number of compaction windows.

The twcs_max_window_count LiveUpdate-able parameter is also disabled during the execution of the
test in order to ensure that users can effectively disable the enforcement, should they want.
2022-09-11 17:38:25 -03:00
Felipe Mendes
1c5d46877e tests: cql_query_test: add test for TWCS tables with no TTL defined
This patch adds a testcase for TimeWindowCompactionStrategy tables created with no
default_time_to_live defined. It makes use of the LiveUpdate-able restrict_twcs_default_ttl
parameter in order to determine whether TWCS tables without TTL should be forbidden or not.

The test replays all 3 possible variations of the tri_mode_restriction and verifies tables
are correctly created/altered according to the current setting on the replica which receives
the request.
2022-09-11 16:55:46 -03:00
Felipe Mendes
7fec4fcaa6 cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
TimeWindowCompactionStrategy (TWCS) tables are known for being used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired such as in a sliding window. Failure to do so may create unbounded windows - which - depending on the compaction window chosen, may introduce severe latency and operational problems, due to unbounded window growth.

However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we can not simply forbid table creations without a default_time_to_live explicitly set to any value other than 0.

The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn":

We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as - ideally - a default_time_to_live value should always be expected to prevent applications failing to ingest data against the database ommitting the `USING TTL` keyword.
2022-09-11 16:50:42 -03:00
Felipe Mendes
a3356e866b cql: add max window restriction for TimeWindowCompactionStrategy
The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then we end up in a situation on where users of TWCS end up underestimating their window buckets when using TWCS. Unfortunately, scenarios on which one employs a default_time_to_live setting of 1 year but a window size of 30 minutes are not rare enough.

Such configuration is known to only make harm to a workload: As more and more windows are created, the number of SSTables will grow in the same pace, and the situation will only get worse as the number of shards increase.

This commit introduces the twcs_max_window_count option, which defaults to 50, and will forbid the Creation or Alter of tables which get past this threshold. A value of 0 will explicitly skip this check.

Note: this option does not forbid the creation of tables with a default_time_to_live=0 as - even though not recommended - it is perfectly possible for a TWCS table with default TTL=0 to have a bound window, provided any ingestion statements make use of 'USING TTL' within the CQL statement, in addition to it.
2022-09-11 16:50:22 -03:00
Felipe Mendes
f1ffb501f0 time_window_compaction_strategy: reject invalid window_sizes
Scylla mistakenly allows an user to configure an invalid TWCS window_size <= 0, which effectively breaks the notion of compaction windows.
Interestingly enough, a <= 0 window size should be considered an undefined behavior as either we would create a new window every 0 duration (?) or the table would behave as STCS, the reader is encouraged to figure out which one of these is true. :-)

Cassandra, on the other hand, will properly throw a ConfigurationException when receiving such invalid window sizes and we now match the behavior to the same as Cassandra's.

Refs: #2336
2022-09-11 16:40:03 -03:00
Raphael S. Carvalho
f5715d3f0b replica: Move memtables to compaction_group
Now memtables live in compaction_group. Also introduced function
that selects group based on token, but today table always return
the single group managed by it. Once multiple groups are supported,
then the function should interpret token content to select the
group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
f4579795e6 replica: move compound SSTable set to compaction group
The group is now responsible for providing the compound set.
table still has one compound set, which will span all groups for
the cases we want to ignore the group isolation.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
6717d96684 replica: move maintenance SSTable set to compaction_group
This commit is restricted to moving maintenance set into compaction_group.
Next, we'll introduce compound set into it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
ce8e5f354c replica: move main SSTable set to compaction_group
This commit is restricted to moving main set into compaction_group.
Next, we'll move maintenance set into it and finally the memtable.

A method is introduced to figure out which group a sstable belongs
to, but it's still unimplemented as table is still limited to
a single group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
4871f1c97c replica: Introduce compaction_group
Compaction group is a new abstraction used to group SSTables
that are eligible to be compacted together. By this definition,
a table in a given shard has a single compaction group.
The problem with this approach is that data from different vnodes
is intermixed in the same sstable, making it hard to move data
in a given sstable around.
Therefore, we'll want to have multiple groups per table.

A group can be thought of an isolated LSM tree where its memtable
and sstable files are isolated from other groups.

As for the implementation, the idea is to take a very incremental
approach.
In this commit, we're introducing a single compaction group to
table.
Next, we'll migrate sstable and maintenance set from table
into that single compaction group. And finally, the memtable.

Cache will be shared among the groups, for simplicity.
It works due to its ability to invalidate a subset of the
token range.

There will be 1:1 relationship between compaction_group and
table_state.
We can later rename table_state to compaction_group_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
a6ecadf3de replica: convert table::stop() into coroutine
await_pending_ops() is today marked noexcept, so doesn't have to
be implemented with finally() semantics.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
44913ebbd0 compaction_manager: restore indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
888660fa44 compaction_manager: Make remove() and stop_ongoing_compactions() noexcept
stop_ongoing_compactions() is made noexcept too as it's called from
remove() and we want to make the latter noexcept, to allow compaction
group to qualify its stop function as noexcept too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
65414e6756 test: sstable_compaction_test: Don't reference main sstable set directly
Preparatory change for main sstable set to be moved into compaction
group. After that, tests can no longer direct access the main
set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
dfa7273127 test: sstable_utils: Set data size fields for fake SSTable
So methods that look at data size and require it to be higher than 0
will work on fake SSTables created using set_values().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
4fa8159a13 test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable
column_family_test::add_sstable will soon be changed to run in a thread,
and it's not needed in this procedure, so let's remove its usage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Jadw1
ba461aca8b cql-pytest: more neutral command in cql_test_connection fixture
I found 'use system` to not be neutral enough (e.g. in case of testing
describe statement). `BEGIN BATCH APPLY BATCH` sounds better.

Closes #11504
2022-09-11 18:49:06 +03:00
Nadav Har'El
d71098a3b8 Update tools/java submodule
* tools/java b7a0c5bd31...b004da9d1b (1):
  > Revert "dist/debian:add python3 as dependency"
2022-09-11 17:45:43 +03:00
Pavel Emelyanov
bbad3eac63 pylib: Cast port number config to int explicitly
Otherwise it crashes some python versions.

The cast was there before a2dd64f68f
explicitly dropped one while moving the code between files.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11511
2022-09-09 18:08:08 +02:00
Kamil Braun
be1ef9d2a7 gms: feature_service: remove the USES_RAFT feature
It was not and won't be used for anything.
Note that the feature was always disabled or masked so no node ever
announced it, thus it's safe to get rid of.

Closes #11505
2022-09-09 18:05:46 +02:00
Michał Chojnowski
47844689d8 token_metadata: make local_dc_filter a lambda, not a std::function
This std::function causes allocations, both on construction
and in other operations. This costs ~2200 instructions
for a DC-local query. Fix that.

Closes #11494
2022-09-09 18:05:46 +02:00
Michał Chojnowski
af7ace3926 utils: config_file: fix handling of workdir,W in the YAML file
Option names given in db/config.cc are handled for the command line by passing
them to boost::program_options, and by YAML by comparing them with YAML
keys.
boost::program_options has logic for understanding the
long_name,short_name syntax, so for a "workdir,W" option both --workdir and -W
worked, as intended. But our YAML config parsing doesn't have this logic
and expected "workdir,W" verbatim, which is obviously not intended. Fix that.

Fixes #7478
Fixes #9500
Fixes #11503

Closes #11506
2022-09-09 18:05:46 +02:00
Kamil Braun
dba595d347 Merge 'Minimal implementation of Broadcast Tables' from Mikołaj Grzebieluch
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.

The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.

In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.

In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.

Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing

Closes #11164

* github.com:scylladb/scylladb:
  raft: broadcast_tables: add broadcast_kv_store test
  raft: broadcast_tables: add returning query result
  raft: broadcast_tables: add execution of intermediate language
  raft: broadcast_tables: add compilation of cql to intermediate language
  raft: broadcast_tables: add definition of intermediate language
  db: system_keyspace: add broadcast_kv_store table
  db: config: add BROADCAST_TABLES feature flag
2022-09-09 18:05:37 +02:00
Aleksandra Martyniuk
55cd8fe3bf test: task manager api test
Test of a task manager api.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
ec86410094 task_manager: test api layer implementation
The implementation of a test api that helps testing task manager
api. It provides methods to simulate the operations that can happen
on modules and theirs task. Through the api user can: register
and unregister the test module and the tasks belonging to the module,
and finish the tasks with success or custom error.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
b1fa6e49af task_manager: add test specific classes
Add test_module and test_task classes inheriting from respectively
task_manager::module and task_manager::task::impl that serve
task manager testing.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
42f36db55b task_manager: test api layer
The test api that helps testing task manager api. It can be used
to simulate the operations that can happen on modules and theirs
task. Through the api user can: register and unregister the test
module and the tasks belonging to the module, and finish the tasks
with success or custom error.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
c9637705a6 task_manager: api layer implementation
The implementation of a task manager api layer. It provides
methods to list the modules registered in task_manager, list
tasks belonging to the given module, abort, wait for or retrieve
a status of the given task.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
07043cee68 task_manager: api layer
The task manager api layer. It can be used to list the modules
registered in task_manager, list tasks belonging to the given
module, abort, wait for or retrieve a status of the given task.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
b87a0a74ab task_manager: keep task_manager reference in http_context
Keep a reference to sharded<task_manager> as a member
of http_context so it can be reached from rest api.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
9e68c8d445 start sharded task manager
Sharded task manager object is started in main.cc.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
2439e55974 task_manager: create task manager object
Implementation of a task manager that allows tracking
and managing asynchronous tasks.

The tasks are represented by task_manager::task class providing
members common to all types of tasks. The methods that differ
among tasks of different module can be overriden in a class
inheriting from task_manager::task::impl class. Each task stores
its status containing parameters like id, sequence number, begin
and end time, state etc. After the task finishes, it is kept
in memory for configurable time or until it is unregistered.
Tasks need to be created with make_task method.

Each module is represented by task_manager::module type and should
have an access to task manager through task_manager::module methods.
That allows to easily separate and collectively manage data
belonging to each module.
2022-09-09 14:29:28 +02:00
Pavel Emelyanov
24d68e1995 messaging_service: Simplify remove_rpc_client_one()
Make it void as after previous patch no code is interesed in this value

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-09 12:50:41 +03:00
Pavel Emelyanov
ca92ed65e5 messaging_service: Notify connection drop when connection is removed
There are two methods to close an RPC socket in m.s. -- one that's
called on error path of messaging_service::send_... and the other one
that's called upon gossiper down/leave/cql-off notifications. The former
one notifies listeners about connection drop, the latter one doesn't.

The only listener is the storage-proxy which, in turn, kicks database to
release per-table cache hitrate data. Said that, when a node goes down
(or when an operator shuts down its transport) the hit-rate stats
regarding this node are leaked.

This patch moves notification so that any socket drop calls notification
and thus releases the hitrates.

fixes: #11497

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-09 12:47:38 +03:00
Kamil Braun
0efdc45d59 Merge 'test.py: remove top level conftest and improve logging' from Alecco
- To isolate the different pytest suites, remove the top level conftest
  and move needed contents to existing `test/pylib/cql_repl/conftest.py`
  and `test/topology/conftest.py`.
- Add logging to CQL and Python suites.
- Log driver version for CQL and topology tests.

Closes #11482

* github.com:scylladb/scylladb:
  test.py: enable log capture for Python suite
  test.py: log driver name/version for cql/topology
  test.py: remove top level conftest.py
2022-09-08 16:25:24 +02:00
Anna Stuchlik
54d6d8b8cc doc: fix the version name in file upgrade-guide-from-2021.1-to-2022.1-image.rst 2022-09-08 15:38:11 +02:00
Anna Stuchlik
6ccc838740 doc: rename the upgrade-image file to upgrade-image-opensource and update all the links to that file 2022-09-08 15:38:11 +02:00
Anna Stuchlik
22317f8085 doc: update the Enterprise guide to include the Enterprise-onlyimage file 2022-09-08 15:38:11 +02:00
Anna Stuchlik
593f987bb2 doc: update the image files 2022-09-08 15:38:10 +02:00
Anna Stuchlik
42224dd129 doc: split the upgrade-image file to separate files for Open Source and Enterprise 2022-09-08 15:38:10 +02:00
Anna Stuchlik
64a527e1d3 doc: clarify the alternative upgrade procedures for the ScyllaDB image 2022-09-08 15:38:10 +02:00
Anna Stuchlik
5136d7e6d7 doc: add the upgrade guide for ScyllaDB Image from 2022.x.y. to 2022.x.z 2022-09-08 15:38:10 +02:00
Anna Stuchlik
f1ef6a181e doc: add the upgrade guide for ScyllaDB Image from 5.x.y. to 5.x.z 2022-09-08 15:38:10 +02:00
Mikołaj Grzebieluch
eb610c45fe raft: broadcast_tables: add broadcast_kv_store test
Test queries scylla with following statements:
* SELECT value FROM system.broadcast_kv_store WHERE key = CONST;
* UPDATE system.broadcast_kv_store SET value = CONST WHERE key = CONST;
* UPDATE system.broadcast_kv_store SET value = CONST WHERE key = CONST IF value = CONST;
where CONST is string randomly chosen from small set of random strings
and half of conditional updates has condition with comparison to last
written value.
2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch
803115d061 raft: broadcast_tables: add returning query result
Intermediate language added new layer of abstraction between cql
statement and quering mutations, thus this commit adds new layer of
abstraction between mutations and returning query result.

Result can't be directly returned from `group0_state_machine::apply`, so
we decided to hold query results in map inside `raft_group0_client`. It can
be safely read after `add_entry_unguarded`, because this method waits
for applying raft command. After translating result to `result_message`
or in case of exception, map entry is erased.
2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch
db88525774 raft: broadcast_tables: add execution of intermediate language
Extended `group0_command` to enable transmission of `raft::broadcast_tables::query`.
Added `add_entry_unguarded` method in `raft_group0_client` for dispatching raft
commands without `group0_guard`.
Queries on group0_kv_store are executed in `group_0_state_machine::apply`,
but for now don't return results. They don't use previous state id, so they will
block concurrent schema changes, but these changes won't block queries.

In this version snapshots are ignored.
2022-09-08 15:25:36 +02:00
Mikołaj Grzebieluch
82df8a9905 raft: broadcast_tables: add compilation of cql to intermediate language
We decided to extend `cql_statement` hierarchy with `strongly_consistent_modification_statement`
and `strongly_consistent_select_statement`. Statements operating on
system.broadcast_kv_store will be compiled to these new subclasses if
BROADCAST_TABLES flag is enabled.

If the query is executed on a shard other than 0 it's bounced to that shard.
2022-09-08 15:25:36 +02:00
Wojciech Mitros
7effd4c53a wasm: directly handle recycling of invalidated instance
An instance may be invalidated before we try to recycle it.
We perform this by setting its value to a nullopt.
This patch adds a check for it when calculating its size.
This behavior didn't cause issues before because the catch
clause below caught errors caused by calling value() on
a nullopt, even though it was intended for errors from
get_instance_size.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #11500
2022-09-08 15:39:28 +03:00
Geoffrey Beausire
f435276d2e Merge tokens for everywhere_topology
With EverywhereStrategy, we know that all tokens will be on the same node and the data is typically sparse like LocalStrategy.

Result of testing the feature:
Cluster: 2 DC, 2 nodes in each DC, 256 tokens per nodes, 14 shards per node

Before: 154 scanning operations
After: 14 scanning operations (~10x improvement)

On bigger cluster, it will probably be even more efficient.

Closes #11403
2022-09-08 15:33:23 +03:00
Mikołaj Grzebieluch
c541d5c363 raft: broadcast_tables: add definition of intermediate language
In broadcast tables, raft command contains a whole program to be executed.
Sending and parsing on each node entire CQL statement is inefficient,
thus we decided to compile it to an intermediate language which can be
easily serializable.

This patch adds a definition of such a language. For now, only the following
types of statements can be compiled:
* select value where key = CONST from system.broadcast_kv_store;
* update system.broadcast_kv_store set value = CONST where key = CONST;
* update system.broadcast_kv_store set value = CONST where key = CONST if value = CONST;
where CONST is string literal.
2022-09-08 14:03:51 +02:00
Michał Chojnowski
0c54e7c5c7 sstables: index_reader: remove a stray vsprintf call from the hot path
sstable::get_filename() constructs the filename from components, which
takes some work. It happens to be called on every
index_reader::index_reader() call even though it's only used for TRACE
logs. That's 1700 instructions (~1% of a full query) wasted on every
SSTable read. Fix that.

Closes #11485
2022-09-08 14:29:23 +03:00
Michał Chojnowski
c61b901828 utils: logalloc: prefer memory::free_memory() to memory::stats().free_memory
The former is a shortcut that does not involve a copy of all stats.
This saves some instructions in the hot path.

Closes #11495
2022-09-08 14:12:20 +03:00
Botond Dénes
438aaf0b85 Merge 'Deglobalize repair history maps' from Benny Halevy
Change a8ad385ecd introduced
```
thread_local std::unordered_map<utils::UUID, seastar::lw_shared_ptr<repair_history_map>> repair_history_maps;
```

We're trying to avoid global scoped variables as much as we can so this should probably be embedded in some sharded service.

This series moves the thread-local `repair_history_maps` instances to `compaction_manager`
and passes a reference to the shard compaction_manager to functions that need it for compact_for_query
and compact_for_compaction.

Since some paths don't need it and don't have access to the compactio_manager,
the series introduced `utils::optional_reference<T>` that allows to pass nullopt.
In this case, `get_gc_before_for_key` behaves in `tombstone_gc_mode::repair` as if the table wasn't repaired and tombstones are not garbage-collected.

Fixes #11208

Closes #11366

* github.com:scylladb/scylladb:
  tombstone_gc: deglobalize repair_history_maps
  mutation_compactor: pass tombstone_gc_state to compact_mutation_state
  mutation_partition: compact_for_compaction_v2: get tombstone_gc_state
  mutation_partition: compact_for_compaction: get tombstone_gc_state
  mutation_readers: pass tombstone_gc_state to compating_reader
  sstables: get_gc_before_*: get tombstone_gc_state from caller
  compaction: table_state: add virtual get_tombstone_gc_state method
  db: view: get_tombstone_gc_state from compaction_manager
  db: view: pass base table to view_update_builder
  repair: row_level: repair_update_system_table_handler: get get_tombstone_gc_state for db compaction_manager
  replica: database: get_tombstone_gc_state from compaction_manager
  compaction_manager: add tombstone_gc_state
  replica: table: add get_compaction_manager function
  tombstone_gc: introduce tombstone_gc_state
  repair_service: simplify update_repair_time error handling
  tombstone_gc: update_repair_time: get table_id rather than schema_ptr
  tombstone_gc: delete unused forward declaration
  database: do not drop_repair_history_map_for_table in detach_column_family
2022-09-08 14:08:38 +03:00
Botond Dénes
9d1cc5e616 Merge 'doc: update the OS support for versions 2022.1 and 2022.2' from Anna Stuchlik
The scope of this PR:
- Removing support for Ubuntu 16.04 and Debian 9.
- Adding support for Debian 11.

Closes #11461

* github.com:scylladb/scylladb:
  doc: remove support for Debian 9 from versions 2022.1 and 2022.2
  doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2
  doc: add support for Debian 11 to versions 2022.1 and 2022.2
2022-09-08 13:27:47 +03:00
Anna Stuchlik
0dee507c48 doc: fix the upgrade version in the upgrade guide for RHEL and CentOS
Closes #11477
2022-09-08 13:26:49 +03:00
Alejo Sanchez
4190c61dbf test.py: enable log capture for Python suite
Enable pytest log capture for Python suite. This will help debugging
issues in remote machines.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-08 11:37:32 +02:00
Alejo Sanchez
c6a048827a test.py: log driver name/version for cql/topology
Log the python driver name and version to help debugging on third party
machines.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-08 11:37:32 +02:00
Alejo Sanchez
a2dd64f68f test.py: remove top level conftest.py
Remove top level conftest so different suites have their own (as it was
before).

Move minimal functionality into existing test/pylib/cql_repl/conftest.py
so cql tests can run on their own.

Move param setting into test/topology/conftest.py.

Use uuid module for unique keyspace name for cql tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-09-08 11:37:32 +02:00
Alejo Sanchez
d892d194fb test.py: remove spurious after test check
Before/after test checks are done per test case, there's no longer need
to check after pytest finishes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11489
2022-09-08 11:33:37 +02:00
Felipe Mendes
7ccf8ed221 cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
As check_restricted_table_properties() is invoked both within CREATE TABLE and ALTER TABLE CQL statements,
we currently have no way to determine whether the operation was either a CREATE or ALTER. In many situations,
it is important to be able to distinguish among both operations, such as - for example - whether a table already has
a particular property set or if we are defining it within the statement.

This patch simply adds a std::optional<schema_ptr> to check_restricted_table_properties() and updates its caller.

Whenever a CREATE TABLE statement is issued, the method is called as a std::nullopt, whereas if an ALTER TABLE is
issued instead, we call it with a schema_ptr.
2022-09-07 21:27:32 -03:00
Kamil Braun
ff4430d8ea test: topology: make imports friendlier for tools (such as mypy)
When importing from `pylib`, don't modify `sys.path` but use the fact
that both `test/` and `test/pylib/` directories contain an `__init__.py`
file, so `test.pylib` is a valid module if we start with `test/` as the
Python package root.

Both `pytest` and `mypy` (and I guess other tools) understand this
setup.

Also add an `__init__.py` to `test/topology/` so other modules under the
`test/` directory will be able to import stuff from `test/topology/`
(i.e. from `test.topology.X import Y`).

Closes #11467
2022-09-07 23:52:50 +03:00
Karol Baryła
1c2eef384d transport/server.cc: Return correct size of decompressed lz4 buffer
An incorrect size is returned from the function, which could lead to
crashes or undefined behavior. Fix by erroring out in these cases.

Fixes #11476
2022-09-07 10:58:23 +03:00
Nadav Har'El
e5f6adf46c test/alternator: improve tests for DescribeTable for indexes
I created new issues for each missing field in DescribeTable's
response for GSIs and LSIs, so in this patch we edit the xfail
messages in the test to refer to these issues.

Additionally, we only had a test for these fields for GSIs, so this
patch also adds a similar test for LSIs. I turns out there is a
difference between the two tests -  the two fields IndexStatus and
ProvisionedThroughput are returned for GSIs, but not for LSIs.

Refs #7750
Refs #11466
Refs #11470
Refs #11471

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11473
2022-09-07 09:50:16 +02:00
Benny Halevy
e9cfe9e572 tombstone_gc: deglobalize repair_history_maps
Move the thread-local instances of the
per-table repair history maps into compaction_manager.

Fixes #11208

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
8b38893895 mutation_compactor: pass tombstone_gc_state to compact_mutation_state
Used in get_gc_before.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
d86810d22c mutation_partition: compact_for_compaction_v2: get tombstone_gc_state
To be passed down to compact_mutation_state in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
0627667a06 mutation_partition: compact_for_compaction: get tombstone_gc_state
And pass down to `do_compact`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
7e4612d3aa mutation_readers: pass tombstone_gc_state to compating_reader
To be passed further done to `compact_mutation_state` in
a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:14 +03:00
Benny Halevy
572d534d0d sstables: get_gc_before_*: get tombstone_gc_state from caller
Pass the tombstone_gc_state from the compaction_strategy
to sstables get_gc_before_* functions using the table state
to get to the tombstone_gc_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Benny Halevy
2cd3fc2f36 compaction: table_state: add virtual get_tombstone_gc_state method
and override it in table::table_state to get the tombstone_gc_state
from the table's compaction_manager.

It is going to be used in the next patched to pass the gc state
from the compaction_strategy down to sstables and compaction.

table_state_for_test was modified to just keep a null
tombstone_gc_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Benny Halevy
6fb4b5555d db: view: get_tombstone_gc_state from compaction_manager
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Benny Halevy
71ede6124a db: view: pass base table to view_update_builder
To be used by generate_update() for getting the
tombstone_gc_state via the table's compaction_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:04:23 +03:00
Benny Halevy
6a11c410fd repair: row_level: repair_update_system_table_handler: get get_tombstone_gc_state for db compaction_manager
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:04:16 +03:00
Benny Halevy
3b0147390b replica: database: get_tombstone_gc_state from compaction_manager
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
Benny Halevy
8b841e1207 compaction_manager: add tombstone_gc_state
Add a tombstone_gc_state member and methods to get it.

Currently the tombstone_gc_state is default constructed,
but a following patch will move the thread-local
repair history maps into the compaction_manager as a member
and then the _tombstone_gc_state member will be initialized
from that member.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
Benny Halevy
1ce50439af replica: table: add get_compaction_manager function
so to let a view get the tombstone_gc_state via
the compaction_manager of the base table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
Benny Halevy
5dd15aa3c8 tombstone_gc: introduce tombstone_gc_state
and use it to access the repair history maps.

At this introductory patch, we use default-constructed
tombstone_gc_state to access the thread-local maps
temporarily and those use sites will be replaced
in following patches that will gradually pass
the tombstone_gc_state down from the compaction_manager
to where it's used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
Benny Halevy
b2b211568e repair_service: simplify update_repair_time error handling
There's no need for per-shard try/catch here.
Just catch exceptions from the overall sharded operation
to update_repair_time.

Also, update warning to indicate that only updating the repair history
time failed, not "Loading repair history".

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 22:43:08 +03:00
Benny Halevy
7d13811297 tombstone_gc: update_repair_time: get table_id rather than schema_ptr
The function doesn't need access to the whole schema.
The table_id is just enough to get by.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 22:43:08 +03:00
Benny Halevy
442d43181c tombstone_gc: delete unused forward declaration
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 22:43:08 +03:00
Benny Halevy
3d88fe9729 database: do not drop_repair_history_map_for_table in detach_column_family
drop_repair_history_map_for_table is called on each shard
when database::truncate is done, and the table is stopped.

dropping it before the table is stopped is too early.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 22:43:08 +03:00
Nadav Har'El
ee606a5d52 Merge 'doc: fix the CQL version in the Interfaces table' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/816
Fix https://github.com/scylladb/scylla-docs/issues/1613

This PR fixes the CQL version in the Interfaces page, so that it is the same as in other places across the docs and in sync with the version reported by the ScyllaDB (see https://github.com/scylladb/scylla-doc-issues/issues/816#issuecomment-1173878487).

To make sure the same CQL version is used across the docs, we should use the `|cql-version| `variable rather than hardcode the version number on several pages.
The variable is specified in the conf.py file:
```
rst_prolog = """
.. |cql-version| replace:: 3.3.1
"""
```

Closes #11320

* github.com:scylladb/scylladb:
  doc: add the Cassandra version on which the tools are based
  doc: fix the version number
  doc: update the Enterprise version where the ME format was introduced
  doc: add the ME format to the Cassandar Compatibility page
  doc: replace Scylla with ScyllaDB
  doc: rewrite the Interfaces table to the new format to include more information about CQL support
  doc: remove the CQL version from pages other than Cassandra compatibility
  doc: fix the CQL version in the Interfaces table
2022-09-06 19:02:44 +03:00
Asias He
792a91b1fa storage_service: Drop ignore dead nodes option for bootstrap and decommission in log
The ignore dead node options are not really supported at the moment.
Drop it in the log to reduce confusion.

Closes #11426
2022-09-06 18:21:21 +03:00
Anna Stuchlik
4c7aa5181e doc: remove support for Debian 9 from versions 2022.1 and 2022.2 2022-09-06 14:04:22 +02:00
Anna Stuchlik
dfc7203139 doc: remove support for Ubuntu 16.04 from versions 2022.1 and 2022.2 2022-09-06 14:01:35 +02:00
Anna Stuchlik
dd4979ffa8 doc: add support for Debian 11 to versions 2022.1 and 2022.2 2022-09-06 13:54:08 +02:00
Pavel Emelyanov
398e9f8593 network_topology_strategy_test: Use topology instead of snitch
Most of the test's cases use rack-inferring snitch driver and get
DC/RACK from it via the test_dc_rack() helper. The helper was introduced
in one of the previous sets to populate token metadata with some DC/RACK
as normal tokens manipulations required respective endpoint in topology.

This patch removes the usage of global snitch and replaces it with the
pre-populated topology. The pre-population is done in rack-inferring
snitch like manner, since token_metadata still uses global snitch and
the locations from snitch and this temporary topology should match.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-06 12:26:30 +03:00
Pavel Emelyanov
d8b2940cd8 network_topology_strategy_test: Populate explicit topology
There's a test case that makes its own snitch driver that generates
pre-claculated DC/RACK data for test endpoints. This patch replaces this
custom snitch driver with a standalone topology object.

Note: to get DC/RACK info from this topo the get_location() is used
since the get_rack()/get_datacenter() are still wrappers around global
snitch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-06 12:24:39 +03:00
Botond Dénes
b89b84ad3c compaction: scrub/abort: be more verbose
Currently abort-mode scrub exits with a message which basically says
"some problem was found", with no details on what problem it found. Add
a detailed error report on the found problem before aborting the scrub.

Closes #11418
2022-09-06 11:42:34 +03:00
Avi Kivity
3dc39474ec Merge 'tools/scylla-types: add tokenof and shardof actions' from Botond Dénes
`tokenof` calculates and prints the token of a partition-key.
`shardof` calculates the token and finds the owner shard of a partition-key. The number of shards has to be provided by the `--sharads` parameter. Ignore msb bits param can be tweaked with the `--ignore-msb-bits` parameter, which defaults to 12.

Examples:
```
$ scylla types tokenof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
(file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): -5043005771368701888

$ scylla types shardof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType --shards=7 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
(file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): token: -5043005771368701888, shard: 1
```

Closes #11436

* github.com:scylladb/scylladb:
  tools/scylla-types: add shardof action
  tools/scylla-types: pass variable_map to action handlers
  tools/scylla-types: add tokenof action
  tools/scylla-types: extract printing code into functions
2022-09-06 11:25:54 +03:00
Pavel Emelyanov
42c9f35374 topology: Mark compare_endpoints() arguments as const
Continuation to debfcc0e (snitch: Move sort_by_proximity() to topology).
The passed addresses are not modified by the helper. They are not yet
const because the method was copy-n-pasted from snitch where it wasn't
such.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220906074708.29574-1-xemul@scylladb.com>
2022-09-06 11:03:13 +03:00
Yaron Kaikov
4459cecfd6 Docs: fix wrong manifest file for enterprise releases
In
https://docs.scylladb.com/stable/upgrade/upgrade-enterprise/upgrade-guide-from-2021.1-to-2022.1/upgrade-guide-from-2022.1-to-2022.1-image.html,
manifest file location is pointing the wrong filename for enterprise

Fixing

Closes #11446
2022-09-06 06:28:16 +03:00
Avi Kivity
ae4b2ee583 locator: token_metadata: drop unused and dangerous accessors
The mutable get_datacenter_endpoints() and get_datacenter_racks() are
dangerous since they expose internal members without enforcing class
invariants. Fortunately they are unused, so delete them.

Closes #11454
2022-09-06 06:08:02 +03:00
Avi Kivity
3f8cb608c3 Merge "Move auxiliary topology sorters from snitch" from Pavel E
"
There are two helpers on snitch that manipulate lists of nodes taking their
dc/rack into account. This set moves these methods from snitch to topology
and storage proxy.
"

* 'br-snitch-move-proximity-sorters' of https://github.com/xemul/scylla:
  snitch: Move sort_by_proximity() to topology
  topology: Add "enable proximity sorting" bit
  code: Call sort_endpoints_by_proximity() via topology
  snitch, code: Remove get_sorted_list_by_proximity()
  snitch: Move is_worth_merging_for_range_query to proxy
2022-09-05 17:25:08 +03:00
Anna Stuchlik
b0ebf0902c doc: add the Cassandra version on which the tools are based 2022-09-05 14:45:15 +02:00
Pavel Emelyanov
debfcc0eff snitch: Move sort_by_proximity() to topology
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:17:04 +03:00
Pavel Emelyanov
41973c5bf7 topology: Add "enable proximity sorting" bit
There's one corner case in nodes sorting by snitch. The simple snitch
code overloads the call and doesn't sort anything. The same behavior
should be preserved by (future) topology implementation, but it doesn't
know the snitch name. To address that the patch adds a boolean switch on
topology that's turned off by main code when it sees the snitch is
"simple" one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:15:07 +03:00
Pavel Emelyanov
b6fdea9a79 code: Call sort_endpoints_by_proximity() via topology
The method is about to be moved from snitch to topology, this patch
prepares the rest of the code to use the latter to call it. The
topology's method just calls snitch, but it's going to change in the
next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:14:01 +03:00
Pavel Emelyanov
4184091f1c snitch, code: Remove get_sorted_list_by_proximity()
There are two sorting methods in snitch -- one sorts the list of
addresses in place, the other one creates a sorted copy of the passed
const list (in fact -- the passed reference is not const, but it's not
modified by the method). However, both callers of the latter anyway
create their own temporary list of address, so they don't really benefit
from snitch generating another copy.

So this patch leaves just one sorting method -- the in-place one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:11:37 +03:00
Pavel Emelyanov
642e50f3e3 snitch: Move is_worth_merging_for_range_query to proxy
Proxy is the only place that calls this method. Also the method name
suggests it's not something "generic", but rather an internal logic of
proxy's query processing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:10:46 +03:00
Anna Stuchlik
7294fce065 doc: simplify the information about default formats in different versions 2022-09-05 11:36:24 +02:00
Avi Kivity
3ef8d616f6 Merge 'Fix wrong commit on scylla_raid_setup: prevent mount failed for /var/lib/scylla(#11399)' from Takuya ASADA
On #11399, I mistakenly committed bug fix of first patch (40134ef) to second one (8835a34).
So the script will broken when 40134ef only, it's not looks good when we backport it to older version.
Let's revert commits and make them single commit.

Closes #11448

* github.com:scylladb/scylladb:
  scylla_raid_setup: prevent mount failed for /var/lib/scylla
  Revert "scylla_raid_setup: check uuid and device path are valid"
  Revert "scylla_raid_setup: prevent mount failed for /var/lib/scylla"
2022-09-05 12:16:10 +03:00
Mikołaj Grzebieluch
726658f073 db: system_keyspace: add broadcast_kv_store table
First implementation of strongly consistent everywhere tables operates on simple table
representing string to string map.

Add hard-coded schema for broadcast_kv_store table (key text primary key,
value text). This table is under system keyspace and is created if and only if
BROADCAST_TABLES feature is enabled.
2022-09-05 11:11:08 +02:00
Mikołaj Grzebieluch
5b1421cc33 db: config: add BROADCAST_TABLES feature flag
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
2022-09-05 11:11:08 +02:00
Avi Kivity
e3cdc8c4d3 Update tools/java submodule (python3 dependency)
* tools/java 6995a83cc1...b7a0c5bd31 (1):
  > dist/debian:add python3 as dependency
2022-09-05 12:08:24 +03:00
Takuya ASADA
d676c22f09 scylla_raid_setup: prevent mount failed for /var/lib/scylla
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.
Also added code to check make sure uuid and uuid based device path are valid.

Fixes #11359

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2022-09-05 17:52:49 +09:00
Takuya ASADA
ede7da366b Revert "scylla_raid_setup: check uuid and device path are valid"
This reverts commit 40134efee4.
2022-09-05 17:52:42 +09:00
Takuya ASADA
841c686301 Revert "scylla_raid_setup: prevent mount failed for /var/lib/scylla"
This reverts commit 8835a34ab6.
2022-09-05 17:52:41 +09:00
Piotr Sarna
2379a25ade alternator: propagate authenticated user in client state
From now on, when an alternator user correctly passed an authentication
step, their assigned client_state will have that information,
which also means proper access to service level configuration.
Previously the username was only used in tracing.
2022-09-05 10:43:29 +02:00
Anna Stuchlik
39e6002fc8 doc: fix the version number 2022-09-05 10:04:34 +02:00
Piotr Sarna
66f7ab666f client_state: add internal constructor with auth_service
The constructor can be used as backdoor from frontends other than
CQL to create a session with an authenticated user, with access
to its attached service level information.
2022-09-05 10:03:00 +02:00
Piotr Sarna
9511c21686 alternator: pass auth_service and sl_controller to server
It's going to be needed to recreate a client state for an authenticated
user.
2022-09-05 10:03:00 +02:00
Anna Stuchlik
81f32899d0 doc: update the Enterprise version where the ME format was introduced 2022-09-05 10:02:57 +02:00
Botond Dénes
f8b38cbe09 Merge 'doc: add support for Ubuntu 22.04 in ScyllaDB Enterprise' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11430

@tzach I've added support for Ubuntu 22.04 to the row for version 2022.2. Does that version support Debian 11? That information is also missing (it was only added to OSS 5.0 and 5.1).

Closes #11437

* github.com:scylladb/scylladb:
  doc: add support for Ubuntu 22.04 to the Enterprise table
  doc: rename the columns in the Enterpise section to be in sync with the OSS section
2022-09-05 06:42:55 +03:00
Anna Stuchlik
41b91e3632 doc: fix the architecture type on the upgrade page
Closes #11438
2022-09-05 06:30:51 +03:00
Botond Dénes
21ef0c64f1 tools/scylla-types: add shardof action
Decorates a partition key and calculates which shard it belongs to,
given the shard count (--shards) and the ignore msb bits
(--ignore-msb-bits) parameters. The latter is optional and is defaulted to
12.

Example:

    $ scylla types shardof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType --shards=7 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
    (file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): token: -5043005771368701888, shard: 1
2022-09-05 06:22:57 +03:00
Botond Dénes
4333d33f01 tools/scylla-types: pass variable_map to action handlers
Allowing them to have get the value of extra command line parameters.
2022-09-05 06:22:55 +03:00
Botond Dénes
58d4f22679 tools/scylla-types: add tokenof action
Calculate and print the token of a partition-key.
Example:

    $ scylla types tokenof --full-compound -t UTF8Type -t SimpleDateType -t UUIDType 000d66696c655f696e7374616e63650004800049190010c61a3321045941c38e5675255feb0196
    (file_instance, 2021-03-27, c61a3321-0459-41c3-8e56-75255feb0196): -5043005771368701888
2022-09-05 06:20:10 +03:00
Botond Dénes
be9d1c4df4 sstables: crawling mx-reader: make on_out_of_clustering_range() no-op
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.

Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.

Fixes: #11421

Closes #11422
2022-09-04 20:02:50 +03:00
Botond Dénes
3e69fe0fe7 scylla-gdb.py: scylla repairs: print only address of repair_meta
Instead of the entire object. Repair meta is a large object, its
printout floods the output of the command. Print only its address, the
user can print the objects it is interested in.

Closes #11428
2022-09-04 19:58:42 +03:00
Yaron Kaikov
9f9ee8a812 build_docker.sh: Build docker based on Ubuntu:22.04
Ubuntu 20.04 has less than 3 years of OS support remaining.

We should switch to Ubuntu 22.04 to reduce the need for OS upgrades in newly installed clusters.

Closes #11440
2022-09-04 14:00:27 +03:00
Avi Kivity
61769d3b21 Merge "Make messaging service use topology for DC/RACK" from Pavel E
"
Messaging needs to know DC/RACK for nodes to decide whether it needs to
do encryption or compression depending on the options. As all the other
services did it still uses snitch to get it, but simple switch to use
topology needs extra care.

The thing is that messaging can use internal IP instead of endpoints.
Currently it's snitch who tries har^w somehow to resolve this, in
particular -- if the DC/RACK is not found for the given argument it
assumes that it might be internal IP and calls back messaging to convert
it to the endpoint. However, messaging does know when it uses which
address and can do this conversion itself.

So this set eliminates few more global snitch usages and drops the
knot tieing snitch, gossiper and messaging with each-other.
"

* 'br-messaging-use-topology-1.2' of https://github.com/xemul/scylla:
  messaging: Get DC/RACK from topology
  messaging, topology: Keep shared_token_metadata* on messaging
  messaging: Add is_same_{dc|rack} helpers
  snitch, messaging: Dont relookup dc/rack on internal IP
2022-09-04 13:54:34 +03:00
Pavel Emelyanov
6dedc69608 topology: Do not add bootstrapping nodes to topology
Recent change in topology (commit 4cbe6ee9 titled
"topology: Require entry in the map for update_normal_tokens()")
made token_metadata::update_normal_tokens() require the entry presense
in the embedded topology object. Respectively, the commit in question
equipped most callers of update_normal_tokens() with preceeding
topology update call to satisfy the requirement.

However, tokens are put into token_metadata not only for normal state,
but also for bootstrapping, and one place that added bootstrapping
tokens errorneously got topology update. This is wrong -- node must
not be present in the topology until switching into normal state. As
the result several tests with bootstrapping nodes started to fail.

The fix removes topology update for bootstrapping nodes, but this
change reveals few other places that piggy-backed this mistaken
update, so noy _they_ need to update topology themselves.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2040/
       update_cluster_layout_tests.py::test_simple_add_new_node_while_schema_changes_with_repair
       update_cluster_layout_tests.py::test_simple_kill_new_node_while_bootstrapping_with_parallel_writes_in_multidc
       repair_based_node_operations_test.py::test_lcs_reshape_efficiency

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220902082753.17827-1-xemul@scylladb.com>
2022-09-04 13:53:38 +03:00
Avi Kivity
16a3e55aa1 Update seastar submodule
* seastar f2d70c4a17...2b2f6c080e (4):
  > perftune.py: special case a former 'MQ' mode in the new auto-detection code
  > iostream: Generalize flush and batched flush
  > Merge "Equip sharded<>::invoke_on_all with unwrap_sharded_args" from Pavel E
  > Merge "perftune.py: cosmetic fixes" from VladZ

Closes #11434
2022-09-04 10:19:48 +03:00
Anna Stuchlik
5d09e1a912 doc: add the ME format to the Cassandar Compatibility page 2022-09-02 15:12:30 +02:00
Anna Stuchlik
dfb7a221db doc: update the SSTables 3.0 Statistics File Format to add the UUID host_id option of the ME format 2022-09-02 14:55:11 +02:00
Anna Stuchlik
f1184e1470 doc: add the information regarding the ME format to the SSTables 3.0 Data File Format page 2022-09-02 14:48:58 +02:00
Anna Stuchlik
177a1d4396 doc: fix additional information regarding the ME format on the SStable 3.x page 2022-09-02 14:41:55 +02:00
Anna Stuchlik
b3eacdca75 doc: add the ME format to the table 2022-09-02 14:26:16 +02:00
Anna Stuchlik
af4d1b80d8 doc: add support for Ubuntu 22.04 to the Enterprise table 2022-09-02 12:43:04 +02:00
Anna Stuchlik
947f8769f4 doc: rename the columns in the Enterpise section to be in sync with the OSS section 2022-09-02 12:31:57 +02:00
Pavel Emelyanov
f0580aedaf messaging: Get DC/RACK from topology
Now everything is prepared for that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-02 11:34:57 +03:00
Botond Dénes
be70fcf587 tools/scylla-types: extract printing code into functions
To make the individual overloads on the exact type usable on their own.
2022-09-02 07:46:18 +03:00
Botond Dénes
2c46c24608 Merge 'doc: change the tool names to "Scylla SStable" and "Scylla Types"' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11393

- Rename the tool names across the docs.
- Update the examples to replace `scylla-sstable` and `scylla-types` with `scylla sstable` and `scylla types`, respectively.

Closes #11432

* github.com:scylladb/scylladb:
  doc: update the tool names in the toctree and reference pages
  doc: rename the scylla-types tool as Scylla Types
  doc: rename the scylla-sstable tool as Scylla SStable
2022-09-01 16:32:18 +03:00
Anna Stuchlik
18da200669 doc: update the tool names in the toctree and reference pages 2022-09-01 15:09:12 +02:00
Anna Stuchlik
c255399f27 doc: rename the scylla-types tool as Scylla Types 2022-09-01 15:05:44 +02:00
Anna Stuchlik
d0cb24feaa doc: rename the scylla-sstable tool as Scylla SStable 2022-09-01 14:45:19 +02:00
Anna Stuchlik
1834d5d121 add a comment to remove the information when the documentation is versioned (in 5.1) 2022-09-01 12:57:15 +02:00
Anna Stuchlik
476107912c doc: replace Scylla with ScyllaDB 2022-09-01 12:52:58 +02:00
Anna Stuchlik
8aae8a3cef doc: fix the formatting and language in the updated section 2022-09-01 12:50:04 +02:00
Anna Stuchlik
ff4ae879cb doc: fix the default SStable format 2022-09-01 12:47:11 +02:00
Pavel Emelyanov
e147681d85 messaging, topology: Keep shared_token_metadata* on messaging
Messaging will need to call topology methods to compare DC/RACK of peers
with local node. Topology now resides on token metadata, so messaging
needs to get the dependency reference.

However, messaging only needs the topology when it's up and running, so
instead of producing a life-time reference, add a pointer, that's set up
on .start_listen(), before any client pops up, and is cleared on
.shutdown() after all connections are dropped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Pavel Emelyanov
551c51b5bf messaging: Add is_same_{dc|rack} helpers
For convenience of future patching

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Pavel Emelyanov
c08c370c2c snitch, messaging: Dont relookup dc/rack on internal IP
When getting dc/rack snitch may perform two lookups -- first time it
does it using the provided IP, if nothing is found snitch assumes that
the IP is internal one, gets the corresponding public one and searches
again.

The thing is that the only code that may come to snitch with internal
IP is the messaging service. It does so in two places: when it tries
to connect to the given endpoing and when it accepts a connection.

In the former case messaging performs public->internal IP conversion
itself and goes to snitch with the internal IP value. This place can get
simpler by just feeding the public IP to snich, and converting it to the
internal only to initiate the connection.

In the latter case the accepted IP can be either, but messaging service
has the public<->private map onboard and can do the conversion itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Avi Kivity
fe401f14de Merge 'scylla_raid_setup: prevent mount failed for /var/lib/scylla' from Takuya ASADA
Just like 4a8ed4cc6f, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.

Also added code to check make sure uuid and uuid based device path are valid.

Fixes #11359

Closes #11399

* github.com:scylladb/scylladb:
  scylla_raid_setup: prevent mount failed for /var/lib/scylla
  scylla_raid_setup: check uuid and device path are valid
2022-08-31 19:59:38 +03:00
Kefu Chai
a5e696fab8 storage_service, test: drop unused storage_service_config
this setting was removed back in
dcdd207349, so despite that we are still
passing `storage_service_config` to the ctor of `storage_service`,
`storage_service::storage_service()` just drops it on the floor.

in this change, `storage_service_config` class is removed, and all
places referencing it are updated accordingly.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Closes #11415
2022-08-31 19:49:13 +03:00
Botond Dénes
2a3012db7f docs/README.md: expand prerequisites list
poetry and make was missing from the list.

Closes #11391
2022-08-31 17:00:59 +03:00
Botond Dénes
cb98d4f5da docs: admin-tools: remove defunct sstable-index
Said tool was supplanted by scylla-sstable in 4.6. Remove the page as
well as all references to it.

Closes #11392
2022-08-31 17:00:04 +03:00
Avi Kivity
a9a230afbe scripts: introduce script to apply email, working around google groups brokeness
Google Groups recently started rewriting the From: header, garbaging
our git log. This script rewrites it back, using the Reply-To header
as a still working source.

Closes #11416
2022-08-31 14:47:24 +03:00
Botond Dénes
b9fc504fb2 Merge 'doc: cql-extensions.md: improve description of synchronous views' from Nadav Har'El
It was pointed out to me that our description of the synchronous_updates
materialized-view option does not make it clear enough what is the
default setting, or why a user might want to use this option.

This patch changes the description to (I hope) better address these
issues.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11404

* github.com:scylladb/scylladb:
  doc: cql-extensions.md: replace "Scylla" by "ScyllaDB"
  doc: cql-extensions.md: improve description of synchronous views
2022-08-31 14:33:39 +03:00
Avi Kivity
2ab5cbd841 Merge 'Docs: document how scylla-sstable obtains its schema' from Botond Dénes
This is a very important aspect of the tool that was completely missing from the document before. Also add a comparison with SStableDump.
Fixes: https://github.com/scylladb/scylladb/issues/11363

Closes #11390

* github.com:scylladb/scylladb:
  docs: scylla-sstable.rst: add comparison with SStableDump
  docs: scylla-sstable.rst: add section about providing the schema
2022-08-31 14:28:52 +03:00
Anna Stuchlik
72b77b8c78 doc: add a comment to remove the note in version 5.1 2022-08-31 12:49:10 +02:00
Anna Stuchlik
b4bbd1fd53 doc: update the information on the Countng all rows page and add the recommendation to upgrade ScyllaDB 2022-08-31 12:39:05 +02:00
Nadav Har'El
ad0f6158c4 doc: cql-extensions.md: replace "Scylla" by "ScyllaDB"
It was recently decided that the database should be referred to as
"ScyllaDB", not "Scylla". This patch changes existing references
in docs/cql/cql-extensions.md.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-31 13:23:24 +03:00
Nadav Har'El
ec8e98e403 doc: cql-extensions.md: improve description of synchronous views
It was pointed out to me that our description of the synchronous_updates
materialized-view option does not make it clear enough what is the
default setting, or why a user might want to use this option.

This patch changes the description to (I hope) better address these
issues.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-31 13:22:24 +03:00
Anna Stuchlik
180fc73695 doc: add a note to the description of COUNT with a reference to the KB article 2022-08-31 12:11:12 +02:00
Anna Stuchlik
cff849d845 doc: add COUNT to the list of acceptable selectors of the SELECT statement 2022-08-31 11:59:20 +02:00
Avi Kivity
421557b40a Merge "Provide DC/RACK when populating topology" from Pavel E
"
The topology object maintains all sort of node/DC/RACK mappings on
board. When new entries are added to it the DC and RACK are taken
from the global snitch instance which, in turn, checks gossiper,
system keyspace and its local caches.

This set make topology population API require DC and RACK via the
call argument. In most of the cases the populating code is the
storage service that knows exactly where to get those from.

After this set it will be possible to remove the dependency knot
consiting of snitch, gossiper, system keyspace and messaging.
"

* 'br-topology-dc-rack-info' of https://github.com/xemul/scylla:
  toplogy: Use the provided dc/rack info
  test: Provide testing dc/rack infos
  storage_service: Provide dc/rack for snitch reconfiguration
  storage_service: Provide dc/rack from system ks on start
  storage_service: Provide dc/rack from gossiper for replacement
  storage_service: Provide dc/rack from gossiper for remotes
  storage_service,dht,repair: Provide local dc/rack from system ks
  system_keyspace: Cache local dc-rack on .start()
  topology: Some renames after previous patch
  topology: Require entry in the map for update_normal_tokens()
  topology: Make update_endpoint() accept dc-rack info
  replication_strategy: Accept dc-rack as get_pending_address_ranges argument
  dht: Carry dc-rack over boot_strapper and range_streamer
  storage_service: Make replacement info a real struct
2022-08-31 12:53:06 +03:00
Benny Halevy
c284c32f74 Update seastar submodule
* seastar f9f5228b74...f2d70c4a17 (51):

  >  cmake: attach property to Valgrind not to hwloc
  >  Create the seastar_memory logger in all builds
  >  drop unused parameters
  >  Merge "Unify pollable_fd shutdown and abort_{reader|writer}" from Pavel E
  > >  pollable_fd: Replace two booleans with a mask
  > >  pollable_fd: Remove abort_reader/_writer
  >  Merge "Improve Rx channels assignment" from Vlad
  > >  perftune.py: fix comments of IRQ ordering functors
  > >  perftune.py: add VIRTIO fast path IRQs ordering functor
  > >  perftune.py: reduce number of Rx channels to the number of IRQ CPUs
  > >  perftune.py: introduce a --num-rx-queues parameter
  >  program_options: enable optional selection_value
  >  .gitignore: ignore the directories generated by VS Code and CLion.
  >  httpd: compare the Connection header value in a case-insensitive manner.
  >  httpd: move the logic of keepalive to a separate method.
  >  register one default priority class for queue
  >  Reset _total_stats before each run
  >  log: add colored logging support
  >  Merge "perftune.py: add NUMA aware auto-detection for big machines" from Vlad
  > >  perftune.py: mention 'irq_cpu_mask' in the description of the script operation
  > >  perftune.py: NetPerfTuner: fix bits counting in self.irqs_cpu_mask wider than 32 bits
  > >  perftune.py: PerfTuneBase.cpu_mask_is_zero(cpu_mask): cosmetics: fix a comment and a variable name
  > >  perftune.py: PerfTuneBase.cpu_mask_is_zero(cpu_mask): take into account omitted zero components of the mask
  > >  perftune.py: PerfTuneBase.compute_cpu_mask_for_mode(): cosmetics: fix a variable name
  > >  perftune.py: stop printing 'mode' in --dump-options-file
  > >  perftune.py: introduce a generic auto_detect_irq_mask(cpu_mask) function
  > >  perftune.py: DiskPerfTuner: use self.irqs_cpu_mask for tuning non-NVME disks
  > >  perftune.py: stop auto-detecting and using 'mode' internally
  > >  perftune.py: introduce --get-irq-cpu-mask command line parameter
  > >  perftune.py: introduce --irq-core-auto-detection-ratio parameter
  >  build: add a space after function name
  >  Update HACKING.md
  >  log: do not inherit formatter<seastar::log_level> from formatter<string_view>
  >  Merge "Mark connected_socket::shutdown_...'s internals noexcept" from Pavel E
  > > native-stack: Mark tcp::in_state() (and its wrappers) const noexcept
  > > native-stack: Mark tcb::close and tcb::abort_reader noexcept
  > > native-stack: Mark tcp::connection::close_{read|write}() noexcept
  > > native-stack: Mark tcb::clear_delayed_ack() and tcb::stop_retransmit_timer() noexcept
  > > tls: Mark session::close() noexcept
  > > file_desc: Add fdinfo() helper
  > > posix-stack: Mark posix_connected_socket_impl::shutdown_{input|output}() noexcept
  > > tests: Mark loopback_buffer::shutdown() noexcept
  >  Merge "Enhance RPC connection error injector" from Pavel E
  > >  loopback_socket: Shuffle error injection
  > >  loopback_socket: Extend error injection
  > >  loopback_socket: Add one-shot errors
  > >  loopback_socket: Add connection error injection
  > >  rpc_test: Extend error injector with kind
  > >  rpc_test: Inject errors on all paths
  > >  rpc_test: Use injected connect error
  > >  rpc_test: De-duplicate test socket creation
  >  Merge 'tls: vec_push: handle async errors rather than throwing on_internal_error' from Benny Halevy
  > >  tls: do_handshake: handle_output_error of gnutls_handshake
  > >  tls: session: vec_push: return output_pending error
  > >  tls: session: vec_push: reindent
  >  log: disambiguate formatter<log_level> from operator<<
  >  tls_test: Fix spurious fail in test_x509_client_with_builder_system_trust_multiple (et al)

Fixes scylladb/scylladb#11252

Closes #11401
2022-08-31 12:12:48 +03:00
Botond Dénes
dca351c2a6 Merge 'doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1' from Anna Stuchlik
This PR is related to  https://github.com/scylladb/scylla-docs/issues/4124 and https://github.com/scylladb/scylla-docs/issues/4123.

**New Enterprise Upgrade Guide from 2021.1 to 2022.2**
I've added the upgrade guide for ScyllaDB Enterprise image. In consists of 3 files:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst
upgrade/_common/upgrade-image.rst
 /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst

**Modified Enterprise Upgrade Guides 2021.1 to 2022.2**
I've modified the existing guides for Ubuntu and Debian to use the same files as above, but exclude the image-related information:
/upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p1.rst + /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian-p2.rst = /upgrade/_common/upgrade-guide-v2022-ubuntu-and-debian.rst

To make things simpler and remove duplication, I've replaced the guides for Ubuntu 18 and 20 with a generic Ubuntu guide.

**Modified Enterprise Upgrade Guides from 4.6 to 5.0**
These guides included a bug: they included the image-related information (about updating OS packages), because a file that includes that information was included by mistake. What's worse, it was duplicated. After the includes were removed, image-related information is no longer included in the Ubuntu and Debian guides (this fixes https://github.com/scylladb/scylla-docs/issues/4123).

I've modified the index file to be in sync with the updates.

Closes #11285

* github.com:scylladb/scylladb:
  doc: reorganize the content to list the recommended way of upgrading the image first
  doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file
  doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information
  doc: update the guides for Ubuntu and Debian to remove image information and the OS version number
  doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1
2022-08-31 07:24:55 +03:00
Gleb Natapov' via ScyllaDB development
0d20830863 direct_failure_detector: reduce severity of ping error logging
Having an error while pinging a peer is not a critical error. The code
retires and move on. Lets log the message with less severity since
sometimes those error may happen (for instance during node replace
operation some nodes refuse to answer to pings) and dtest complains that
there are unexpected errors in the logs.

Message-Id: <Ywy5e+8XVwt492Nc@scylladb.com>
2022-08-31 07:11:59 +03:00
Raphael S. Carvalho
631b2d8bdb replica: rename table::on_compaction_completion and coroutinize it
on_compaction_completion() is not very descriptive. let's rename
it, following the example of
update_sstable_lists_on_off_strategy_completion().

Also let's coroutinize it, so to remove the restriction of running
it inside a thread only.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11407
2022-08-31 06:17:20 +03:00
Nadav Har'El
a797512148 Merge 'Raft test topology start stopped servers' from Alecco
Test teardown involves dropping the test keyspace. If there are stopped servers occasionally we would see timeouts.

Start stopped servers after a test is finished (and passed).

Revert previous commit making teardown async again.

Closes #11412

* github.com:scylladb/scylladb:
  test.py: restart stopped servers before teardown...
  Revert "test.py: random tables make DDL queries async"
2022-08-30 22:48:47 +03:00
Pavel Emelyanov
e5e75ba43c Merge 'scylla-gdb.py: bring scylla reads-stats up-to-date' from Botond Dénes
Said command is broken since 4.6, as the type of `reader_concurrency_semaphore::_permit_list` was changed without an accompanying update to this command. This series updates said command and adds it to the list of tested commands so we notice if it breaks in the future.

Closes #11389

* github.com:scylladb/scylladb:
  test/scylla-gdb: test scylla read-stats
  scylla-gdb.py: read_stats: update w.r.t. post 4.5 code
  scylla-gdb.py: improve string_view_printer implementation
2022-08-30 20:24:02 +03:00
Nadav Har'El
56d714b512 Merge 'Docs: Update support OS' from Tzach Livyatan
This PR change the CentOS 8 support to Rocky, and add 5.1 and 2022.1, 2022.2 rows to the list of Scylla releases

Closes #11383

* github.com:scylladb/scylladb:
  OS support page: use CentOS not Centos
  OS support page: add 5.1, 2022.1 and 2022.2
  OS support page: Update CentOS 8 to Rocky 8
2022-08-30 18:02:44 +03:00
Anna Stuchlik
0d3285dd3c doc: replace Scylla with ScyllaDB 2022-08-30 15:35:06 +02:00
Anna Stuchlik
ab04ed2fda doc: rewrite the Interfaces table to the new format to include more information about CQL support 2022-08-30 15:31:41 +02:00
Anna Stuchlik
4ac5574f1d doc: remove the CQL version from pages other than Cassandra compatibility 2022-08-30 13:58:26 +02:00
Alejo Sanchez
df1ca57fda test.py: restart stopped servers before teardown...
for topology tests

Test teardown involves dropping the test keyspace. If there are stopped
servers occasionally we would see timeouts.

Start stopped servers after a test is finished.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-30 11:40:40 +02:00
Alejo Sanchez
e5eac22a37 Revert "test.py: random tables make DDL queries async"
This reverts commit 67c91e8bcd.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-30 10:54:33 +02:00
Anna Stuchlik
99cee1aceb doc: reorganize the content to list the recommended way of upgrading the image first 2022-08-30 10:11:02 +02:00
Anna Stuchlik
ffe6f97c06 doc: update the image upgrade guide for ScyllaDB image to include the location of the manifest file 2022-08-30 10:01:56 +02:00
Tzach Livyatan
4e413787d2 doc: Fix nodetool flush example
`nodetool flush` have a space between *keyspace* and *table* names

See
https://docs.scylladb.com/stable/operating-scylla/nodetool-commands/flush
for the right syntax.

Fixes #11314

Closes #11334
2022-08-29 15:06:38 +03:00
Nadav Har'El
eed65dfc2d Merge 'db: schema_tables: Make table creation shadow earlier concurrent changes' from Tomasz Grabiec
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:

scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.

The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.

If the schemas were the same, or not conflicting in a way which leads
to abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.

The proposed fix is to make CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different, older table will be dropped.

Fixes #11396

Closes #11398

* github.com:scylladb/scylladb:
  db: schema_tables: Make table creation shadow earlier concurrent changes
  db: schema_tables: Fix formatting
  db: schema_mutations: Make operator<<() print all mutations
  schema_mutations: Make it a monoid by defining appropriate += operator
2022-08-29 14:21:07 +03:00
Tomasz Grabiec
ae8d2a550d db: schema_tables: Make table creation shadow earlier concurrent changes
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:

scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.

The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.

If the schemas were the same, or not conflicting in a way which leads
to abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.

The proposed fix is to make CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different, older table will be dropped.

Fixes #11396
2022-08-29 12:06:02 +02:00
Benny Halevy
d588e2a7c5 release: properly evaluate SCYLLA_BUILD_MODE_* macros
Patch 765d2f5e46 did not
evaluate the #if SCYLLA_BUILD_MODE directives properly
and it always matched SCYLLA_BULD_MODE == release.

This change fixes that by defining numerical codes
for each build mode and using macro expansion to match
the define SCYLLA_BUILD_MODE against these codes.

Also, ./configure.py was changes to pass SCYLLA_BUILD_MODE
to all .cc source files, and makes sure it is defined
in build_mode.hh.

Support was added for coverage build mode,
and an #error was added if SCYLLA_BUILD_MODE
was not recognized by the #if ladder directives.

Additional checks verifying the expected SEASTAR_DEBUG
against SCYLLA_BUILD_MODE were added as well,

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11387
2022-08-29 10:20:19 +03:00
Botond Dénes
0f4666010a docs: scylla-sstable.rst: add comparison with SStableDump
The two tools have very similar goals, user might wonder when to use one
or the other.
Also add a link to sstabledump.rst to scylla-sstable.
2022-08-29 08:29:14 +03:00
Botond Dénes
65da6a26a3 docs: scylla-sstable.rst: add section about providing the schema
Providing the schema for the scylla-sstable tool is an important topic
that was completely missing from the description so far.
2022-08-29 08:29:09 +03:00
Alejo Sanchez
67c91e8bcd test.py: random tables make DDL queries async
There are async timeouts for ALTER queries. Seems related to othe issues
with the driver and async.

Make these queries synchronous for now.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11394
2022-08-28 10:38:39 +03:00
Felipe Mendes
fd5cb85a7a alternator - Doc - Update DescribeTable response and introduce hashing function differences
This commit introduces the following changes to Alternator compability doc:

* As of https://github.com/scylladb/scylladb/pull/11298 Alternator will return ProvisionedThroughput in DescribeTable API calls. We add the fact that tables will default to a BillingMode of PAY_PER_REQUEST (this wasn't made explicit anywhere in the docs), and that the values for RCUs/WCUs are hardcoded to 0.
* Mention the fact that ScyllaDB (thus Alternator) hashing function is different than AWS proprietary implementation for DynamoDB. This is mostly of an implementation aspect rather than a bug, but it may cause user confusion when/if comparing the ResultSet between DynamoDB and Alternator returned from Table Scans.

Refs: https://github.com/scylladb/scylladb/issues/11222
Fixes: https://github.com/scylladb/scylladb/issues/11315

Closes #11360
2022-08-28 10:29:07 +03:00
Takuya ASADA
8835a34ab6 scylla_raid_setup: prevent mount failed for /var/lib/scylla
Just like 4a8ed4c, we also need to wait for udev event completion to
create /dev/disk/by-uuid/$UUID for newly formatted disk, to mount the
disk just after formatting.

Fixes #11359
2022-08-27 03:27:44 +09:00
Takuya ASADA
40134efee4 scylla_raid_setup: check uuid and device path are valid
Added code to check make sure uuid and uuid based device path are valid.
2022-08-27 03:08:31 +09:00
Tomasz Grabiec
661db2706f db: schema_tables: Fix formatting 2022-08-26 17:37:48 +02:00
Tomasz Grabiec
a020c4644c db: schema_mutations: Make operator<<() print all mutations 2022-08-26 16:48:15 +02:00
Tomasz Grabiec
cf034c1891 schema_mutations: Make it a monoid by defining appropriate += operator 2022-08-26 16:48:15 +02:00
Kamil Braun
6c16ae4868 Merge 'raft, limit for command size' from Gusev Petr
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.

Closes #11318

* github.com:scylladb/scylladb:
  raft, use max_command_size to satisfy commitlog limit
  raft, limit for command size
2022-08-26 12:20:58 +02:00
Pavel Emelyanov
6405aba748 toplogy: Use the provided dc/rack info
Previous patches made all the callers of topology.update_endpoint()
(via token_metadata.update_topology()) provide correct dc/rack info
for the endpoint. It's now possible to stop using global snitch by
topology and just rely on the dc/rack argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 10:02:00 +03:00
Pavel Emelyanov
10e8804417 test: Provide testing dc/rack infos
There's a test that's sensitive to correct dc/rack info for testing
entries. To populate them it uses global rack-inferring snitch instance
or a special "testing" snitch. To make it continue working add a helper
that would populate the topology properly (spoiler: next branch will
replace it with explicitly populated topology object).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 10:00:04 +03:00
Pavel Emelyanov
f6abc3f759 storage_service: Provide dc/rack for snitch reconfiguration
When snitch reconfigures (gossiper-property-file one) it kicks storage
service so that it updates itself. This place also needs to update the
dc/rack info about itself, the correct (new) values are taken from the
snitch itself.

There's a bug here -- system.local table it not update with new data
until restart.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:58:34 +03:00
Pavel Emelyanov
f8614fe039 storage_service: Provide dc/rack from system ks on start
When a node starts it loads the information about peers from
system.peers table and populates token metadata and topology with this
information. The dc/rack are taken from the sys-ks cache here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:57:15 +03:00
Pavel Emelyanov
5d5782a086 storage_service: Provide dc/rack from gossiper for replacement
When a node it started to replace another node it updates token metadata
and topology with the target information eary. The tokens are now taken
from gossiper shadow round, this patch makes the same for dc/rack info.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:55:31 +03:00
Pavel Emelyanov
6b70358616 storage_service: Provide dc/rack from gossiper for remotes
When a node is notified about other nodes state change it may want to
update the topology information about it. In all those places the
dc/rack into about the peer is provided by the gossiper.

Basically, these updates mirror the relevant updates of tokens on the
token metadata object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:53:54 +03:00
Pavel Emelyanov
43e83c5415 storage_service,dht,repair: Provide local dc/rack from system ks
When a node starts it adds itself to the topology. Mostly it's done in
the storage_service::join_cluster() and whoever it calls. In all those
places the dc/rack for the added node is taken from the system keyspace
(it's cache was populated with local dc/rack by the previous patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:52:16 +03:00
Pavel Emelyanov
a03d6f7751 system_keyspace: Cache local dc-rack on .start()
There's a cache of endpoint:{dc,rack} on system keyspace cache, but the
local node is not there, because this data is populated from the peers
table, while local node's dc/rack is in snitch (or system.local table).

At the same time, storage_service::join_cluster() and whoever it calls
(e.g. -- the repair) will need this info on start and it's convenient
to have this data on sys-ks cache.

It's not on the peers part of the cache because next branch removes this
map and it's going to be very clumsy to have a whole container with just
one enty in it.

There's a peer code in system_keyspace::setup() that gets the local node
dc/rack and committs it into the system.local table. However, putting
the data into cache is done on .start(). This is because cql-test-env
needs this data cached too, but it doesn't call sys_ks.setup(). Will be
cleaned some other day.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:47:30 +03:00
Pavel Emelyanov
c043f6fa96 topology: Some renames after previous patch
The topology::update_endpoint() is now a plain wrapper over private
::add_endpoint() method of the same class. It's simpler to merge them

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:46:26 +03:00
Pavel Emelyanov
4cbe6ee9f4 topology: Require entry in the map for update_normal_tokens()
The method in question tries to be on the safest side and adds the
enpoint for which it updates the tokens into the topology. From now on
it's up to the caller to put the endpoint into topology in advance.

So most of what this patch does is places topology.update_endpoint()
into the relevant places of the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:44:08 +03:00
Pavel Emelyanov
5fc9854eae topology: Make update_endpoint() accept dc-rack info
The method in question populates topology's internal maps with endpoint
vs dc/rack relations. As for today the dc/rack values are taken from the
global snitch object (which, in turn, goes to gossiper, system keyspace
and its internal non-updateable cache for that).

This patch prepares the ground for providing the dc/rack externally via
argument. By now it's just and argument with empty strings, but next
patches will populate it with real values (spoiler: in 99% it's storage
service that calls this method and each call will know where to get it
from for sure)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:41:09 +03:00
Pavel Emelyanov
7305061674 replication_strategy: Accept dc-rack as get_pending_address_ranges argument
The method creates a copy of token metadata and pushes an endpoint (with
some tokens) into it. Next patches will require providing dc/rack info
together with the endpoint, this patch prepares for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:39:44 +03:00
Pavel Emelyanov
360c4f8608 dht: Carry dc-rack over boot_strapper and range_streamer
Both classes may populate (temporarly clones of) token metadata object
with endpoint:tokens pairs for the endpoint they work with. Next patches
will require that endpoint comes with the dc/rack info. This patch makes
sure dht classes have the necessary information at hand (for now it's
just empty pair of strings).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:37:02 +03:00
Pavel Emelyanov
c7a3fed225 storage_service: Make replacement info a real struct
This is to extend it in one of the next patches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:36:16 +03:00
Botond Dénes
4d33812a77 test/scylla-gdb: test scylla read-stats
This command was not run before, allowing it to silently break.
2022-08-26 08:08:28 +03:00
Botond Dénes
82c157368a scylla-gdb.py: read_stats: update w.r.t. post 4.5 code
scylla_read_stats is not up-to-date wr.r.t. the type of
`reader_concurrency_semaphore::_permit_list`, which was changed in 4.6.
Bring it up-to-date, keeping it backwards compatible with 4.5 and older
releases.
2022-08-26 07:25:40 +03:00
Botond Dénes
107fd97f45 scylla-gdb.py: improve string_view_printer implementation
The `_M_str` member of an `std::string_view` is not guaranteed to be a
valid C string (.e.g. be null terminated). Printing it directly often
resulted in printing partial strings or printing gibberish, effecting in
particular the semaphore diagnostics dumps (scylla read-stats).
Use a more reliable method: read `_M_len` amount of bytes from `_M_str`
and decode as UTF-8.
2022-08-26 07:25:11 +03:00
Avi Kivity
0dbcd13a0f config: change logging::settings constructor call to use designated initializer
Safer wrt reordering, and more readable too.

Closes #11382
2022-08-26 06:14:01 +03:00
Konstantin Osipov
4e128bafb5 docs: clarify the tricky field of row existence in LWT
Closes #11372
2022-08-26 06:10:45 +03:00
Vlad Zolotarov
c538cc2372 scylla_prepare + scylla_cpuset_setup: make scylla_cpuset_setup idempotent without introducing regressions
This patch fixes the regression introduced by 3a51e78 which broke
a very important contract: perftune.yaml should not be "touched"
by Scylla scriptology unless explicitly requested.

And a call for scylla_cpuset_setup is such an explicit request.

The issue that the offending patch was intending to fix was that
cpuset.conf was always generated anew for every call of
scylla_cpuset_setup - even if a resulting cpuset.conf would come
out exactly the same as the one present on the disk before tha call.

And since the original code was following the contract mentioned above
it was also deleting perftune.yaml every time too.
However, this was just an unavoidable side-effect of that cpuset.conf
re-generation.

The above also means that if scylla_cpuset_setup doesn't write to cpuset.conf
we should not "touch" perftune.yaml and vise versa.

This patch implements exactly that together with reverting the dangerous
logic introduced by 3a51e78.

Fixes #11385
Fixes #10121
2022-08-25 13:03:02 -04:00
Vlad Zolotarov
80917a1054 scylla_prepare: stop generating 'mode' value in perftune.yaml
Modern perftune.py supports a more generic way of defining IRQ CPUs:
'irq_cpu_mask'.

This patch makes our auto-generation code create a perftune.yaml
that uses this new parameter instead of using outdated 'mode'.

As a side effect, this change eliminates the notion of "incorrect"
value in cpuset.conf - every value is valid now as long as it fits into
the 'all' CPU set of the specific machine.

Auto-generated 'irq_cpu_mask' is going to include all bits from 'all'
CPU mask except those defined in cpuset.conf.

Fixes #9903
2022-08-25 13:02:57 -04:00
Benny Halevy
765d2f5e46 release: define SCYLLA_BUILD_MODE_STR by stringifying SCYLLA_BUILD_MODE
Currently SCYLLA_BULD_MODE is defined as a string by the cxxflags
generated by configure.py.  This is not very useful since one cannot use
it in a @if preprocessor directive.

Instead, use -DSCYLLA_BULD_MODE=release, for example, and define a
SCYLLA_BULD_MODE_STR as the dtirng representation of it.

In addition define the respective
SCYLLA_BUILD_MODE_{RELEASE,DEV,DEBUG,SANITIZE} macros that can be easily
used in @ifdef (or #ifndef :)) for conditional compilation.

The planned use case for it is to enable a task_manager test module only
in non-release modes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11357
2022-08-25 16:50:42 +02:00
Tzach Livyatan
e86cd3684e OS support page: use CentOS not Centos 2022-08-25 17:33:50 +03:00
Wojciech Mitros
49dba4f0c1 functions: fix dropping of a keyspace with an aggregate in it
Currently, if a keyspace has an aggregate and the keyspace
is dropped, the keyspace becomes corrupted and another keyspace
with the same name cannot be created again

This is caused by the fact that when removing an aggregate, we
call create_aggregate() to get values for its name and signature.
In the create_aggregate(), we check whether the row and final
functions for the aggregate exist.
Normally, that's not an issue, because when dropping an existing
aggregate alone, we know that its UDFs also exist. But when dropping
and entire keyspace, we first drop the UDFs, making us unable to drop
the aggregate afterwards.

This patch fixes this behavior by removing the create_aggregate()
from the aggregate dropping implementation and replacing it with
specific calls for getting the aggregate name and signature.

Additionally, a test that would previously fail is added to
cql-pytest/test_uda.py where we drop a keyspace with an aggregate.

Fixes #11327

Closes #11375
2022-08-25 16:28:57 +02:00
Tzach Livyatan
f6157a38a0 OS support page: add 5.1, 2022.1 and 2022.2 2022-08-25 16:44:40 +03:00
Tzach Livyatan
1697f17d90 OS support page: Update CentOS 8 to Rocky 8 2022-08-25 16:43:24 +03:00
Tomasz Grabiec
83850e247a Merge 'raft: server: handle aborts when waiting for config entry to commit' from Kamil Braun
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.

The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` in
resolving the promise exceptionally.

Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.

Fixes #11288.

Closes #11325

* github.com:scylladb/scylladb:
  test: raft: randomized_nemesis_test: additional logging
  raft: server: handle aborts when waiting for config entry to commit
2022-08-25 12:49:09 +02:00
Avi Kivity
df87949241 Merge "Remove batch tokens update helper" from Pavel E
"
On token_metadata there are two update_normal_tokens() overloads --
one updates tokens for a single endpoint, another one -- for a set
(well -- std::map) of them. Other than updating the tokens both
methods also may add an endpoint to the t.m.'s topology object.

There's an ongoing effort in moving the dc/rack information from
snitch to topology, and one of the changes made in it is -- when
adding an entry to topology, the dc/rack info should be provided
by the caller (which is in 99% of the cases is the storage service).
The batched tokens update is extremely unfriendly to the latter
change. Fortunately, this helper is only used by tests, the core
code always uses fine-grained tokens updating.
"

* 'br-tokens-update-relax' of https://github.com/xemul/scylla:
  token_metadata: Indentation fix after prevuous patch
  token_metadata: Remove excessive empty tokens check
  token_metadata: Remove batch tokens updating method
  tests: Use one-by-one tokens updating method
2022-08-25 12:01:58 +02:00
Wojciech Mitros
9e6e8de38f tests: prevent test_wasm from occasional failing
Some cases in test_wasm.py assumed that all cases
are ran in the same order every time and depended
on values that should have been added to tables in
previous cases. Because of that, they were sometimes
failing. This patch removes this assumption by
adding the missing inserts to the affected cases.

Additionally, an assert that confirms low miss
rate of udfs is more precise, a comment is added
to explain it clearly.

Closes #11367
2022-08-25 11:32:06 +03:00
Kamil Braun
90233551be test: raft: randomized_nemesis_test: don't access failure detector service after it's stopped
It could happen that we accessed failure detector service after it was
stopped if a reconfiguration happened in the 'right' moment. This would
resolve in an assertion failure. Fix this.

Closes #11326
2022-08-25 11:32:06 +03:00
Tomasz Grabiec
1d0264e1a9 Merge 'Implement Raft upgrade procedure' from Kamil Braun
Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.

Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
  table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
  entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
  schema operations.

With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).

This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.

Read the design doc, then read the commits in topological order
for best reviewing experience.

---

I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.

As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).

Closes #10835

* github.com:scylladb/scylladb:
  service/raft: raft_group0: implement upgrade procedure
  service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
  service/raft: raft_group0: introduce local loggers for group 0 and upgrade
  service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
  service/raft: raft_group0_client: prepare for upgrade procedure
  service/raft: introduce `group0_upgrade_state`
  db: system_keyspace: introduce `load_peers`
  idl-compiler: introduce cancellable verbs
  message: messaging_service: cancellable version of `send_schema_check`
2022-08-25 11:32:06 +03:00
Pavel Emelyanov
d8c5044eee token_metadata: Indentation fix after prevuous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Pavel Emelyanov
8238c38e9f token_metadata: Remove excessive empty tokens check
After the previous patch empty passed tokens make the helper co_return
early, so this if is the dead code

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Pavel Emelyanov
056d21c050 token_metadata: Remove batch tokens updating method
No users left.

The endpoint_tokens.empty() check is removed, only tests could trigger
it, but they didn't and are patched out.

Indentation is left broken

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Pavel Emelyanov
1d437302a8 tests: Use one-by-one tokens updating method
Tests are the only users of batch tokens updating "sugar" which
actually makes things more complicated

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Pavel Emelyanov
18fa5038b1 replication_strategy: Remove unused method
The get_pending_address_ranges() accepting a single token is not in use,
its peer that accepts a set of tokens is

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11358
2022-08-23 20:23:50 +02:00
Avi Kivity
6ce5e9079c Merge 'utils/logalloc: consolidate lsa state in shard tracker' from Botond Dénes
Currently the state of LSA is scattered across a handful of global variables. This series consolidates all these into a single one: the shard tracker. Beyond reducing the number of globals (the less globals, the better) this paves the way for a planned de-globalization of the shard tracker itself.
There is one separate global left, the static migrators registry. This is left as-is for now.

Closes #11284

* github.com:scylladb/scylladb:
  utils/logalloc: remove reclaim_timer:: globals
  utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
  utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg
  utils/logalloc: move global stat accessors to tracker
  utils/logalloc: allocating_section: don't use the global tracker
  utils/logalloc: pass down tracker::impl reference to segment_pool
  utils/logalloc: move segment pool into tracker
  utils/logalloc: add tracker member to basic_region_impl
  utils/logalloc: make segment independent of segment pool
2022-08-23 18:51:14 +02:00
Benny Halevy
a980510654 table: seal_active_memtable: handle ENOSPC error
Aborting too soon on ENOSPC is too harsh, leading to loss of
availability of the node for reads, while restarting it won't
solve the ENOSPC condition.

Fixes #11245

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11246
2022-08-23 17:58:20 +02:00
Tomasz Grabiec
9c4e32d2e2 Merge 'raft: server: drop waiters in applier_fiber instead of io_fiber' from Kamil Braun
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by any remaining entry dropping waiters (if
the node was no longer a leader).

If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.

The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.

Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.

Improve an existing test to reproduce this scenario more frequently.

Fixes #11235.

Closes #11308

* github.com:scylladb/scylladb:
  test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
  raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
  raft: server: use `visit` instead of `holds_alternative`+`get`
2022-08-23 17:19:44 +02:00
Avi Kivity
fd9d8ddb3e Merge 'distributed_loader: Restore separate processing of keyspace init prio/normal' from Calle Wilund
Fixes #11349

In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).

This functionality is however required for certain features of the enterprise version (ear).
As such is needs to be restored and reenabled. This patch set does so, adapted
to the recent version of this file.

Closes #11350

* github.com:scylladb/scylladb:
  distributed_loader: Restore separate processing of keyspace init prio/normal
  Revert "distributed_loader: Remove unused load-prio manipulations"
2022-08-23 16:25:48 +02:00
Kamil Braun
e350e37605 service/raft: raft_group0: implement upgrade procedure
A listener is created inside `raft_group0` for acting when the
SUPPORTS_RAFT feature is enabled. The listener is established after the
node enters NORMAL status (in `raft_group0::finish_setup_after_join()`,
called at the end of `storage_service::join_cluster()`).

The listener starts the `upgrade_to_group0` procedure.

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from
  `system.peers` table)
- enter `synchronize` upgrade state, in which group 0 operations are
  disabled (see earlier commit which implemented this logic)
- wait until all members of group 0 entered `synchronize` state or some
  member entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used
  for schema operations (only those for now).

The devil lies in the details, and the implementation is ugly compared
to this nice description; for example there are many retry loops for
handling intermittent network failures. Read the code.

`leave_group0` and `remove_group0` were adjusted to handle the upgrade
procedure being run correctly; if necessary, they will wait for the
procedure to finish.

If the upgrade procedure gets stuck (and it may, since it requires all
nodes to be available to contact them to correctly establish a single
group 0 raft cluster); or if a running cluster permanently loses a
majority of nodes, causing group 0 unavailability; the cluster admin
is not left without help.

We introduce a recovery mode, which allows the admin to
completely get rid of traces of existing group 0 and restart the
upgrade procedure - which will establish a new group 0. This works even
in clusters that never upgraded but were bootstrapped using group 0 from
scratch.

To do that, the admin does the following on every node:
- writes 'recovery' under 'group0_upgrade_state' key
  in `system.scylla_local` table,
- truncates the `system.discovery` table,
- truncates the `system.group0_history` table,
- deletes group 0 ID and group 0 server ID from `system.scylla_local`
  (the keys are `raft_group0_id` and `raft_server_id`
then the admin performs a rolling restart of their cluster. The nodes
restart in a "group 0 recovery mode", which simply means that the nodes
won't try to perform any group 0 operations. Then the admin calls
`removenode` to remove the nodes that are down. Finally, the admin
removes the `group0_upgrade_state` key from `system.scylla_local`,
rolling-restarts the cluster, and the cluster should establish group 0
anew.

Note that this recovery procedure will have to be extended when new
stuff is added to group 0 - like topology change state. Indeed, observe
that a minority of nodes aren't able to receive committed entries from a
leader, so they may end up in inconsistent group 0 states. It wouldn't
be safe to simply create group 0 on those nodes without first ensuring
that they have the same state from which group 0 will start.
Right now the state only consist of schema tables, and the upgrade
procedure ensures to synchronize them, so even if the nodes started in
inconsistent schema states, group 0 will correctly be established.
(TODO: create a tracking issue? something needs to remind us of this
 whenever we extend group 0 with new stuff...)
2022-08-23 13:51:01 +02:00
Kamil Braun
b42dfbc0aa test: raft: randomized_nemesis_test: additional logging
Add some more logging to `randomized_nemesis_test` such as logging the
start and end of a reconfiguration operation in a way that makes it easy
to find one given the other in the logs.
2022-08-23 13:14:30 +02:00
Kamil Braun
efad6fe9b4 raft: server: handle aborts when waiting for config entry to commit
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.

The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` in
resolving the promise exceptionally.

Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.

Fixes #11288.
2022-08-23 13:14:29 +02:00
Calle Wilund
54aca8e814 distributed_loader: Restore separate processing of keyspace init prio/normal
Fixes #11349

In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).

This functionality is however required for certain features of the enterprise version (ear).
As such is needs to be restored and reenabled. This patch and revert before it does so, adapted
to the recent version of this file.
2022-08-23 10:39:19 +00:00
Calle Wilund
d9c391e366 Revert "distributed_loader: Remove unused load-prio manipulations"
This reverts commit 7396de72b1.

In 7396de7 (and refactorings before it) the set of prioritized keyspaces (and processing thereof)
was removed, due to apparent non-usage (which is true for open-source version).

This functionality is however required for certain features of the enterprise version (ear).
As such is needs to be restored and reenabled. This reverts the actual commit, patch after
ensures we use the prio set.
2022-08-23 10:34:05 +00:00
Avi Kivity
5d1ff17ddf Merge 'Streaming: define plan_id as a strong tagged_uuid type' from Benny Halevy
This series turns plan_id from a generic UUID into a strong type so it can't be used interchangeably with other uuid's.
While at it, streaming/stream_fwd.hh was added for forward declarations and the definition of plan_id.
Also, `stream_manager::update_progress` parameter name was renamed to plan_id to represent its assumed content, before changing its type to `streaming::plan_id`.

Closes #11338

* github.com:scylladb/scylladb:
  streaming: define plan_id as a strong tagged_uuid type
  stream_manager: update_progress: rename cf_id param to plan_id
  streaming: add forward declarations in stream_fwd.hh
2022-08-23 10:48:34 +02:00
Petr Gusev
aa88d58539 raft, use max_command_size to satisfy commitlog limit
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.
2022-08-23 12:09:32 +04:00
Tomasz Grabiec
0e5b86d3da Merge 'Optimize mutation consume of range tombstones in reverse' from Benny Halevy
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.

Instead, iterate over the range_tombsotne_list in reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.

While at it, this series contains some other cleanups on this path
to improve the code readability and maybe make the compiler's life
easier as for optimizing the cleaned-up code.

Closes #11271

* github.com:scylladb/scylladb:
  mutation: consume_clustering_fragments: get rid of reversed_range_tombstones;
  mutation: consume_clustering_fragments: reindent
  mutation: consume_clustering_fragments: shuffle emit_rt logic around
  mutation: consume, consume_gently: simplify partition_start logic
  mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor
  mutation: consume_clustering_fragments: keep the reversed schema in cookie
  mutation: clustering_iterators: get rid of current_rt
  mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
2022-08-23 10:05:39 +02:00
Botond Dénes
5bc499080d utils/logalloc: remove reclaim_timer:: globals
One of them (_active_timer) is moved to shard tracker, the other is made
a simple local in reclaim_timer.
2022-08-23 10:38:58 +03:00
Botond Dénes
5f8971173e utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
We want to consolidate all the logalloc state into a single object: the
shard tracker. Replacing this global with a member in said object is
part of this effort.
2022-08-23 10:38:58 +03:00
Botond Dénes
499b9a3a7c utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg 2022-08-23 10:38:58 +03:00
Botond Dénes
7d17d675af utils/logalloc: move global stat accessors to tracker
These are pretend free functions, accessing globals in the background,
make them a member of the tracker instead, which everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
2022-08-23 10:38:58 +03:00
Botond Dénes
f406151a86 utils/logalloc: allocating_section: don't use the global tracker
Instead, get the tracker instance from the region. This requires adding
a `region&` parameter to `with_reserve()`.
This brings us one step closer to eliminating the global tracker.
2022-08-23 10:38:58 +03:00
Botond Dénes
e968866fa1 utils/logalloc: pass down tracker::impl reference to segment_pool
To get rid of some usages of `shard_tracker()`.
2022-08-23 10:38:58 +03:00
Botond Dénes
3bd94e41bf utils/logalloc: move segment pool into tracker
Instead of a separate global segment pool instance, make it a member of
the already global tracker. Most users are inside the tracker instance
anyway. Outside users can access the pool through the global tracker
instance.
2022-08-23 10:38:58 +03:00
Botond Dénes
5b86dfc35a utils/logalloc: add tracker member to basic_region_impl
For now this member is initialized from the global tracker instance. But
it allows the members of region impl to be detached from said global,
making a step towards removing it.
2022-08-23 10:38:58 +03:00
Botond Dénes
f4056bd344 utils/logalloc: make segment independent of segment pool
segment has some members, which simply forward the call to a
segment_pool method, via the global segment_pool instance. Remove these
and make the callers use the segment pool directly instead.
2022-08-23 10:38:58 +03:00
Nadav Har'El
9c15659194 Merge 'test.py: bump timeout of async requests for topology' from Alecco
Topology tests do async requests using the Python driver. The driver's
API for async doesn't use the session timeout.

Pass 60 seconds timeout (default is 10) to match the session's.

Fixes https://github.com/scylladb/scylladb/issues/11289

Closes #11348

* github.com:scylladb/scylladb:
  test.py: bump schema agreement timeout for topology tests
  test.py: bump timeout of async requests for topology
  test.py: fix bad indent
2022-08-23 10:30:59 +03:00
Raya Kurlyand
bc7539cff0 Update auditing.rst
https://github.com/scylladb/scylladb/issues/11341

Closes #11347
2022-08-23 06:59:41 +03:00
Botond Dénes
331033adae Merge 'Fix frozen mutation consume ordering' from Benny Halevy
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.

This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.

Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.

Add a unit test that verifies that.

Fixes #11198

Closes #11269

* github.com:scylladb/scylladb:
  mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
  frozen_mutation: consume and consume_gently in-order
  frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
  frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
  frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
  frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
2022-08-23 06:37:04 +03:00
Alejo Sanchez
01cac33472 test.py: bump schema agreement timeout for topology tests
Increase the schema agreement timeout to match other timeouts.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-22 21:07:55 +02:00
Alejo Sanchez
f9d31112cf test.py: bump timeout of async requests for topology
Topology tests do async requests using the Python driver. The driver's
API for async doesn't use the session timeout.

Pass 60 seconds timeout (default is 10) to match the session's.

This will hopefully will fix timeout failures on debug mode.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-22 21:07:03 +02:00
Benny Halevy
357e805e1f mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
So that the frozen_mutation consumer can return
stop_iteration::yes if it wishes to stop consuming at
some clustering position.

In this case, on_end_of_partition must still be called
so a closing range_tombstone_change can be emitted to the consumer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 20:12:58 +03:00
Mikołaj Sielużycki
b5380baf8a frozen_mutation: consume and consume_gently in-order
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.

This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.

Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.

Add a unit test that verifies that.

Fixes #11198

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 20:12:20 +03:00
Kamil Braun
e0c6153adf test: raft: randomized_nemesis_test: more chaos in remove_leader_with_forwarding_finishes
Improve the randomness of this test, making it a bit easier to
reproduce the scenarios that the test aims to catch.

Increase timeouts a bit to account for this additional randomness.
2022-08-22 18:53:48 +02:00
Kamil Braun
db2a3deda1 raft: server: drop waiters in applier_fiber instead of io_fiber
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by any remaining entry dropping waiters (if
the node was no longer a leader).

If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.

The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.

Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.

Fixes #11235.
2022-08-22 18:53:44 +02:00
Kamil Braun
5badf20c7a raft: server: use visit instead of holds_alternative+get
In `std::holds_alternative`+`std::get` version, the `get` performs a
redundant check. Also `std::visit` gives a compile-time exhaustiveness
check (whether we handled all possible cases of the `variant`).
2022-08-22 18:47:48 +02:00
Benny Halevy
314e45d957 streaming: define plan_id as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 19:45:30 +03:00
Benny Halevy
add612bc52 mutation: consume_clustering_fragments: get rid of reversed_range_tombstones;
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.

Instead, iterator over the range_tombsotne_list in reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 19:42:52 +03:00
Alejo Sanchez
87c233b36b test.py: fix bad indent
Fix leftover bad indent

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-22 14:29:54 +02:00
Nadav Har'El
941c719a23 alternator: return ProvisionedThroughput in DescribeTable
DescribeTable is currently hard-coded to return PAY_PER_REQUEST billing
mode. Nevertheless, even in PAY_PER_REQUEST mode, the DescribeTable
operation must return a ProvisionedThroughput structure, listing both
ReadCapacityUnits and WriteCapacityUnits as 0. This requirement is not
stated in some DynamoDB documentation but is explictly mentioned in
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ProvisionedThroughput.html
Also in empirically, DynamoDB returns ProvisionedThroughput with zeros
even in PAY_PER_REQUEST mode. We even had an xfailing test to confirm this.

The ProvisionedThroughput structure being missing was a problem for
applications like DynamoDB connectors for Spark, if they implicitly
assume that ProvisionedThroughput is returned by DescribeTable, and
fail (as described in issue #11222) if it's outright missing.

So this patch adds the missing ProvisionedThroughput structure, and
the xfailing test starts to pass.

Note that this patch doesn't change the fact that attempting to set
a table to PROVISIONED billing mode is ignored: DescribeTable continues
to always return PAY_PER_REQUEST as the billing mode and zero as the
provisioned capacities.

Fixes #11222

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11298
2022-08-22 09:58:09 +02:00
Takuya ASADA
60e8f5743c systemd: drop StandardOutput=syslog
On recent version of systemd, StandardOutput=syslog is obsolete.
We should use StandardOutput=journal instead, but since it's default value,
so we can just drop it.

Fixes #11322

Closes #11339
2022-08-22 10:47:37 +03:00
Benny Halevy
fa7033bc2b configure: add --perf-tests-debuginfo option
Provides separate control over debuginfo for perf tests
since enabling --tests-debuginfo affects both today
causing the Jenkins archives of perf tests binaries to
inflate considerably.

Refs https://github.com/scylladb/scylla-pkg/issues/3060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11337
2022-08-21 19:08:21 +03:00
Konstantin Osipov
4b6ed7796b test.py: extend documentation
Add documentation about python, topology tests,
server pooling and provide some debugging tips.

Closes #11317
2022-08-21 17:55:49 +03:00
Benny Halevy
3554533e2c stream_manager: update_progress: rename cf_id param to plan_id
Before changing its type to streaming::plan_id
this patch clarifies that the parameter actually represents
the plan id and not the table id as its name suggests.

For reference, see the call to update_progress in
`stream_transfer_task::execute`, as well as the function
using _stream_bytes which map key is the plan id.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-21 16:56:41 +03:00
Benny Halevy
c1fc0672a5 streaming: add forward declarations in stream_fwd.hh
To be used for defining streaming::plan_id
in the next patcvh.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-21 16:00:02 +03:00
Anna Stuchlik
3b93184680 doc: fix the description of ORDER BY for the SELECT statement
Closes #11272
2022-08-21 15:28:15 +03:00
Tzach Livyatan
8fc58300ea Update Alternator Markdown file to use automatic link notation
Closes #11335
2022-08-21 13:32:57 +03:00
Piotr Sarna
484004e766 Merge 'Fix mutation commutativity with shadowable tombstone'
from Tomasz Grabiec

This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.

No known production impact.

Refs https://github.com/scylladb/scylladb/issues/11307

Closes #11312

* github.com:scylladb/scylladb:
  test: mutation_test: Add explicit test for mutation commutativity
  test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
  db: mutation_partition: Drop unnecessary maybe_shadow()
  db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
  mutation_partition: row: make row marker shadowing symmetric
2022-08-20 16:46:32 +02:00
Kamil Braun
2ba1fb0490 service/raft: raft_group0: extract tracker from persistent_discovery::run
Extract it to a top-level abstraction, write comments.
It will be reused in the following commit.
2022-08-19 19:15:19 +02:00
Kamil Braun
f7e02a7de9 service/raft: raft_group0: introduce local loggers for group 0 and upgrade 2022-08-19 19:15:19 +02:00
Kamil Braun
ac5f4248a9 service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
During the upgrade procedure nodes will want to obtain the upgrade state
of other nodes to proceed. This is what the new verb is for.
2022-08-19 19:15:19 +02:00
Kamil Braun
43687be1f1 service/raft: raft_group0_client: prepare for upgrade procedure
Now, whether an 'group 0 operation' (today it means schema change) is
performed using the old or new methods, doesn't depend on the local RAFT
fature being enabled, but on the state of the upgrade procedure.

In this commit the state of the upgrade is always
`use_pre_raft_procedures` because the upgrade procedure is not
implemented yet. But stay tuned.

The upgrade procedure will need certain guarantees: at some point it
switches from `use_pre_raft_procedures` to `synchronize` state. During
`synchronize` schema changes must be disabled, so the procedure can
ensure that schema is in sync across the entire cluster before
establishing group 0. Thus, when the switch happens, no schema change
can be in progress.

To handle all this weirdness we introduce `_upgrade_lock` and
`get_group0_upgrade_state` which takes this lock whenever it returns
`use_pre_raft_procedures`. Creating a `group0_guard` - which happens at
the start of every group 0 operation - will take this lock, and the lock
holder shall be stored inside the guard (note: the holder only holds the
lock if `use_pre_raft_procedures` was returned, no need to hold it for
other cases). Because `group0_guard` is held for the entire duration of
a group 0 operation, and because the upgrade procedure will also have to
take this lock whenever it wants to change the upgrade state (it's an
rwlock), this ensures that no group 0 operation that uses the old ways
is happening when we change the state.

We also implement `wait_until_group0_upgraded` using a condition
variable. It will be used by certain methods during upgrade (later
commits; stay tuned).

Some additional comments were written.
2022-08-19 19:15:19 +02:00
Kamil Braun
7e56251aea service/raft: introduce group0_upgrade_state
Define an enum class, `group0_upgrade_state`, describing the state of
the upgrade procedure (implemented in later commits).

Provide IDL definitions for (de)serialization.

The node will have its current upgrade state stored on disk in
`system.scylla_local` under the `group0_upgrade_state` key. If the key
is not present we assume `use_pre_raft_procedures` (meaning we haven't
started upgrading yet or we're at the beginning of upgrade).

Introduce `system_keyspace` accessor methods for storing and retrieving
the on-disk state.
2022-08-19 19:15:19 +02:00
Kamil Braun
547134faf4 db: system_keyspace: introduce load_peers
Load the addresses of our peers from `system.peers`.

Will be used be the Raft upgrade procedure to obtain the set of all
peers.
2022-08-19 19:15:18 +02:00
Kamil Braun
a5b465b796 idl-compiler: introduce cancellable verbs
The compiler allowed passing a `with_timeout` flag to a verb definition;
it then generated functions for sending and handling RPCs that accepted
a timeout parameter.

We would like to generate functions that accept an `abort_source` so an
RPC can be cancelled from the sender side. This is both more and less
powerful than `with_timeout`. More powerful because you can abort on
other conditions than just reaching a certain point in time. Less
powerful because you can't abort the receiver. In any case, sometimes
useful.

For this the `cancellable` flag was added.
You can't use `with_timeout` and `cancellable` at the same verb.

Note that this uses an already existing function in RPC module,
`send_message_cancellable`.
2022-08-19 19:15:18 +02:00
Kamil Braun
9e5a81da4a message: messaging_service: cancellable version of send_schema_check
This RPC will be used during the Raft upgrade procedure during schema
synchronization step.

Make a version which can be cancelled when the upgrade procedure gets
aborted.
2022-08-19 19:15:18 +02:00
Nadav Har'El
516089beb0 Merge 'Raft test topology II part 1' from Alecco
- Remove `ScyllaCluster.__getitem__()`  (pending request by @kbr- in a previous pull request), for this remove all direct access to servers from caller code
- Increase Python driver timeouts (req by @nyh)
- Improve `ManagerClient` API requests: use `http+unix://<sockname>/<resource>` instead of `http://localhost/<resource>` and callers of the helper method only pass the resource
- Improve lint and type hints

Closes #11305

* github.com:scylladb/scylladb:
  test.py: remove ScyllaCluster.__getitem__()
  test.py: ScyllaCluster check kesypace with any server
  test.py: ScyllaCluster server error log method
  test.py: ScyllaCluster read_server_log()
  test.py: save log point for all running servers
  test.py: ScyllaCluster provide endpoint
  test.py: build host param after before_test
  test.py: manager client disable lint warnings
  test.py: scylla cluster lint and type hint fixes
  test.py: increase more timeouts
  test.py: ManagerClient improve API HTTP requests
2022-08-18 20:27:50 +03:00
Alejo Sanchez
fe07f9ceed test.py: make topology conftest module paths work when imported
To allow other suites to use topology suite conftest, add pylib to the
module lookup path.

Closes #11313
2022-08-18 20:22:35 +03:00
Konstantin Osipov
7481f0d404 test.py: simplify CQL test search
No need to repeat code available in the base class.

Closes #11156
2022-08-18 19:28:43 +03:00
Benny Halevy
7747b8fa33 sstables: define run_identifier as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11321
2022-08-18 19:03:10 +03:00
Avi Kivity
35fbba3a5b Revert "gms: gossiper: include nodes with empty feature sets when calculating enabled features"
This reverts commit 08842444b4. It causes
a failure in test_shutdown_all_and_replace_node.

Fixes #11316.
2022-08-18 15:01:50 +03:00
Kamil Braun
b52429f724 Merge 'raft: relax some error severity' from Gleb Natapov
Dtest fails if it sees an unknown errors in the logs. This series
reduces severity of some errors (since they are actually expected during
shutdown) and removes some others that duplicate already existing errors
that dtest knows how to deal with. Also fix one case of unhandled
exception in schema management code.

* 'dtest-fixes-v1' of github.com:gleb-cloudius/scylla:
  raft: getting abort_requested_exception exception from a sm::apply is not a critical error
  schema_registry: fix abandoned feature warning
  service: raft: silence rpc::closed_errors in raft_rpc
2022-08-18 12:16:44 +02:00
Anna Stuchlik
dc307b6895 doc: fix the CQL version in the Interfaces table 2022-08-18 12:02:42 +02:00
Petr Gusev
eedfd7ad9b raft, limit for command size
Adds max_command_size to the raft configuration and
restricts commands to this limit.
2022-08-18 13:35:49 +04:00
Tomasz Grabiec
5a9df433c6 test: mutation_test: Add explicit test for mutation commutativity 2022-08-17 17:39:54 +02:00
Tomasz Grabiec
3d9efee3bf test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
Given 3 row mutations:

m1 = {
      marker: {row_marker: dead timestamp=-9223372036854775803},
      tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}}
}

m2 = {
      marker: {row_marker: timestamp=-9223372036854775805}
}

m3 = {
      tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}}
}

We get different shadowable tombstones depending on the order of merging:

(m1 + m2) + m3 = {
       marker: {row_marker: dead timestamp=-9223372036854775803},
       tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}}

m1 + (m2 + m3) = {
       marker: {row_marker: dead timestamp=-9223372036854775803},
       tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}}
}

The reason is that in the second case the shadowable tombstone in m3
is shadwed by the row marker in m2. In the first case, the marker in
m2 is cancelled by the dead marker in m1, so shadowable tombstone in
m3 is not cancelled (the marker in m1 does not cancel because it's
dead).

This wouldn't happen if the dead marker in m1 was accompanied by a
hard tombstone of the same timestamp, which would effectively make the
difference in shadowable tombstones irrelevant.

Found by row_cache_test.cc::test_concurrent_reads_and_eviction.

I'm not sure if this situation can be reached in practice (dead marker
in mv table but no row tombstone).

Work it around for tests by producing a row tombstone if there is a
dead marker.

Refs #11307
2022-08-17 17:39:54 +02:00
Tomasz Grabiec
56e5b6f095 db: mutation_partition: Drop unnecessary maybe_shadow()
It is performed inside row_tombstone::apply() invoked in the preceding line.
2022-08-17 17:39:54 +02:00
Tomasz Grabiec
9c66c9b3f0 db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
When the row has a live row marker which shadows the shadowable
tombstone, the shadowable tombstone should not be effective. The code
assumes that _shadowable always reflects the current tombstone, so
maybe_shadow() needs to be called whenever marker or regular tombstone
changes. This was not ensured by row::apply(tombstone).

This causes problems in tests which use random_mutation_generator,
which generates mutations which would violate this invariant, and as a
result, mutation commutativity would be violated.

I am not aware of problems in production code.
2022-08-17 17:34:13 +02:00
Botond Dénes
778f5adde7 mutation_partition: row: make row marker shadowing symmetric
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakyness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.

This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.

There is still one vulnerability left: `row_marker& row_marker()`, which
allow overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.

Fixes: #9483

Tests: unit(dev)

Closes #9519
2022-08-17 17:22:13 +02:00
Benny Halevy
8f0376bba1 mutation: consume_clustering_fragments: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 16:45:20 +03:00
Benny Halevy
749371c2b0 mutation: consume_clustering_fragments: shuffle emit_rt logic around
To prepare for a following patch that will get rid of
the cookie.reversed_range_tombstones list.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 16:44:23 +03:00
Benny Halevy
0e21073c38 mutation: consume, consume_gently: simplify partition_start logic
Concentrate the logic in a single
(!cookie.partition_start_consumed) block

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 15:49:12 +03:00
Benny Halevy
d661b84d51 mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor
and set crs and rts only in the block where they are used,
so we can get rid of reversed_range_tombstones.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 15:30:36 +03:00
Benny Halevy
f1b7a1a6f1 mutation: consume_clustering_fragments: keep the reversed schema in cookie
Rather than reversing the schema on every call
just keep the potentially reversed schema in cookie.

Othwerwise, cookie.schema was write only.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 15:30:36 +03:00
Benny Halevy
a230ea0019 mutation: clustering_iterators: get rid of current_rt
It is currently write-only.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 15:30:16 +03:00
Benny Halevy
017f9b4131 mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 14:43:52 +03:00
Alejo Sanchez
d732d776ed test.py: remove ScyllaCluster.__getitem__()
Users of ScyllaCluster should not directly manage its ScyllaServers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
729f8e2834 test.py: ScyllaCluster check kesypace with any server
Directly pick any server instead of calling self[0].

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
7ad7a5e718 test.py: ScyllaCluster server error log method
Provide server error logs to caller (test.py).

Avoids direct access to list of servers.

To be done later: pick the failed server. For now it just provides the
log of one server.

While there, fix type hints.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
e755207fcc test.py: ScyllaCluster read_server_log()
Instead of accessing the first server, now test.py asks ScyllaCluster
for the server log.

In a later commit, ScyllaCluster will pick the appropriate server.

Also removes another direct access to the list of servers we want to get
rid of.
2022-08-17 10:24:48 +02:00
Alejo Sanchez
f141ab95f9 test.py: save log point for all running servers
For error reporting, before a test a mark of the log point in time is
saved. Previously, only the log of the first server was saved. Now it's
done for all running servers.

While there, remove direct access to servers on test.py.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
8fff636776 test.py: ScyllaCluster provide endpoint
For pytest CQL driver connections a host id (IP) is used. Provide it
with a method.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
5bd266424e test.py: build host param after before_test
If no server started, there is no server in the cluster list. So only
build the pytest --host param after before_test check is done.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
30c8e961ba test.py: manager client disable lint warnings
Disable noisy lint warnings.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
2b4c7fbb8a test.py: scylla cluster lint and type hint fixes
Add missing docstrings, reorder imports, add type hints, improve
formatting, fix variable names, fix line lengths, iterate over dicts not
keys, and disable noisy lint warnings.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
566a4ebf4e test.py: increase more timeouts
Increase Python driver connection timeouts to deal with extreme cases
for slow debug builds in slow machines as done (and explained) in
95bd02246a.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Alejo Sanchez
ce27c02d91 test.py: ManagerClient improve API HTTP requests
Use the AF Unix socket name as host name instead of localhost and avoid
repeating the full URL for callers of _request() for the Manager API
requests from the client.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-17 10:24:48 +02:00
Benny Halevy
1b997a8514 frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
It is a range_tombstone_change, not a range_tombstone.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 10:17:42 +03:00
Benny Halevy
87fd4a7d82 frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
If the consumer return stop_iteration::yes for a flushed
row (static or clustered, we should return early and
no consume any more fragments, until `on_end_of_partition`,
where we may still consume a closing range_tombstone_change
past the last consumed row.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 10:17:42 +03:00
Benny Halevy
f11a5e2ec8 frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
Consuming the static row is the first ooportunity for
the consumer to return stop_iteration::yes, so there's no
point in checking `_stop_consuming` before consuming it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 10:17:42 +03:00
Benny Halevy
4b4eb9037a frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
We already flushed rt_gen when building the current_row
When we get to flush_rows_and_tombstones, we should
just consume it, as the passed position is not if the
current_row but rather a position following it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 10:17:42 +03:00
Nadav Har'El
055340ae39 cql-pytest: increase more timeouts
In commit 7eda6b1e90, we increased the
request_timeout parameter used by cql-pytest tests from the default of
10 seconds to 120 seconds. 10 seconds was usually more than enough for
finishing any Scylla request, but it turned out that in some extreme
cases of a debug build running on an extremely over-committed machine,
the default timeout was not enough.

Recently, in issue #11289 we saw additional cases of timeouts which
the request_timeout setting did *not* solve. It turns out that the Python
CQL driver has two additional timeout settings - connect_timeout and
control_connection_timeout, which default to 5 seconds and 2 seconds
respectively. I believe that most of the timeouts in issue #11289
come from the control_connection_timeout setting - by changing it
to a tiny number (e.g., 0.0001) I got the same error messages as those
reported in #11289. The default of that timeout - 2 seconds - is
certainly low enough to be reached on an extremely over-committed
machine.

So this patch significantly increases both connect_timeout and
control_connection_timeout to 60 seconds. We don't care that this timeout
is ridiculously large - under normal operations it will never be reached.
There is no code which loops for this amount of time, for example.

Refs #11289 (perhaps even Fixes, we'll need to see that the test errors
go away).

NOTE: This patch only changes test/cql-pytest/util.py, which is only
used by the cql-pytest test suite. We have multiple other test suites which
copied this code, and those test suites might need fixing separately.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11295
2022-08-16 19:11:59 +03:00
Kamil Braun
08842444b4 gms: gossiper: include nodes with empty feature sets when calculating enabled features
Right now, if there's a node for which we don't know the features
supported by this node (they are neither persisted locally, nor gossiped
by that node), we would skip this node in calculating the set
of enabled features and potentially enable a feature which shouldn't be
enabled - because that node may not know it. We should only enable a
feature when we know that all nodes have upgraded and know the feature.

This bug caused us problems when we tried to move RAFT out of
experimental. There are dtests such as `partitioner_tests.py` in which
nodes would enable features prematurely, which caused the Raft upgrade
procedure to break (the procedure starts only when all nodes upgrade
and announce that they know the SUPPORTS_RAFT cluster feature).

Closes #11225
2022-08-16 19:07:41 +03:00
Piotr Sarna
cf30d4cbcf Merge 'Secondary index of collection columns' from Nadav Har'El
This pull request introduces global secondary-indexing for non-frozen collections.

The intent is to enable such queries:

```
CREATE TABLE test(int id, somemap map<int, int>, somelist<int>, someset<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));

-- index on test(c) is the same as index on (values(c))
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);

SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
```

We use here all-familiar materialized views (MVs). Scylla treats all the
collections the same way - they're a list of pairs (key, value). In case
of sets, the value type is dummy one. In case of lists, the key type is
TIMEUUID. When describing the design, I will forget that there is more
than one collection type.  Suppose that the columns in the base table
were as follows:

```
pkey int, ckey1 int, ckey2 int, somemap map<int, text>, PRIMARY KEY(pkey, ckey1, ckey2)
```

The MV schema is as follows (the names of columns which are not the same
as in base might be different). All the columns here form the primary
key.

```
-- for index over entries
indexed_coll (int, text), idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over keys
indexed_coll int, idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over values
indexed_coll text, idx_token long, pkey int, ckey1 int, ckey2 int, coll_keys_for_values_index int
```

The reason for the last additional column is that the values from a collection might not be unique.

Fixes #2962
Fixes #8745
Fixes #10707

This patch does not implement **local** secondary indexes for collection columns: Refs #10713.

Closes #10841

* github.com:scylladb/scylladb:
  test/cql-pytest: un-xfail yet another passing collection-indexing test
  secondary index: fix paging in map value indexing
  test/cql-pytest: test for paging with collection values index
  cql, view: rename and explain bytes_with_action
  cql, index: make collection indexing a cluster feature
  test/cql-pytest: failing tests for oversized key values in MV and SI
  cql: fix secondary index "target" when column name has special characters
  cql, index: improve error messages
  cql, index: fix default index name for collection index
  test/cql-pytest: un-xfail several collecting indexing tests
  test/cql-pytest/test_secondary_index: verify that local index on collection fails.
  docs/design-notes/secondary_index: add `VALUES` to index target list
  test/cql-pytest/test_secondary_index: add randomized test for indexes on collections
  cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra
  cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail.
  test/boost/secondary_index_test: update for non-frozen indexes on collections
  test/cql-pytest: Uncomment collection indexes tests that should be working now
  cql, index: don't use IS NOT NULL on collection column
  cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows
  cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accomodate for indexes over collections
  cql3, index: Use entries() indexes on collections for queries
  cql3, index: Use keys() and values() indexes on collections for queries.
  types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
  cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function
  db/view/view.cc: compute view_updates for views over collections
  view info: has_computed_column_depending_on_base_non_primary_key
  column_computation: depends_on_non_primary_key_column
  schema, index/secondary_index_manager: make schema for index-induced mv
  index/secondary_index_manager: extract keys, values, entries types from collection
  cql3/statements/: validate CREATE INDEX for index over a collection
  cql3/statements/create_index_statement,index_target: rewrite index target for collection
  column_computation.hh, schema.cc: collection_column_computation
  column_computation.hh, schema.cc: compute_value interface refactor
  Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))`
2022-08-16 14:18:51 +02:00
Nadav Har'El
fbb0b66d0c test/cql-pytest: fix run's "--ssl" option
Commit 23acc2e848 broke the "--ssl" option of test/cql-pytest/run
(which makes Scylla - and cqlpytest - use SSL-encrypted CQL).
The problem was that there was a confusion between the "ssl" module
(Python's SSL support) and a new "ssl" variable. A rename and a missing
"import" solves the breakage.

We never noticed this because Jenkins does *not* run cql-pytest/run
with --ssl (actually, it no longer runs cql-pytest/run at all).
It is still a useful option for checking SSL-related problems in Scylla
and Seastar.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11292
2022-08-16 12:29:05 +02:00
Kamil Braun
4e35e62597 Merge 'Raft test topology part 3' from Alecco
Test schema changes when there was an underlying topology change.

- per test case checks of cluster health and cycling
- helper class to do cluster manager API requests
- tests can perform topology changes: stop/start/restart servers
- modified clusters are marked dirty and discarded after the test case
- cql connection is updated per topology change and per cluster change

Closes #11266

* github.com:scylladb/scylladb:
  test.py: test topology and schema changes
  test.py: ClusterManager API mark cluster dirty
  test.py: call before/after_test for each test case
  test.py: handle driver connection in ManagerClient
  test.py: ClusterManager API and ManagerClient
  test.py: improve topology docstring
2022-08-16 11:00:26 +02:00
Avi Kivity
afa7960926 Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" for any reason in the semaphore, will not end up in said readers accessing a dangling table reference when destroyed later.

Fixes: https://github.com/scylladb/scylladb/issues/11264

Closes #11273

* github.com:scylladb/scylladb:
  querier: querier_cache: remove now unused evict_all_for_table()
  database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
  reader_concurrency_semaphore: add evict_inactive_reads_for_table()
2022-08-15 19:05:59 +03:00
Botond Dénes
d56dcb842c db/virtual_table: add virtual destructor to virtual_table
It should have had one, derived instances are stored and destroyed via
the base-class. The only reason this haven't caused bugs yet is that
derived instances happen to not have any non-trivial members yet.

Closes #11293
2022-08-15 16:58:05 +03:00
Avi Kivity
73d4930815 Merge 'test/lib: various improvements to sstable test env' from Botond Dénes
A mixed bag of improvements developed as part of another PR (https://github.com/scylladb/scylladb/pull/10736). Said PR was closed so I'm submitting these improvements separately.

Closes #11294

* github.com:scylladb/scylladb:
  test/lib: move convenience table config factory to sstable_test_env
  test/lib/sstable_test_env: move members to impl struct
  test/lib/sstable_utils: use test_env::do_with_async()
2022-08-15 16:57:01 +03:00
Botond Dénes
92e5f438a4 querier: querier_cache: remove now unused evict_all_for_table() 2022-08-15 14:16:41 +03:00
Botond Dénes
2b1eb6e284 database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
Instead of querier_cache::evict_all_for_table(). The new method cover
all queriers and in addition any other inactive reads registered on the
semaphore. In theory by the time we detach a table, no regular inactive
reads should be in the semaphore anymore, but if there is any still, we
better evict them before the table is destroyed, they might attempt to
access it in when destroyed later.
2022-08-15 14:16:41 +03:00
Botond Dénes
e55ccbde8f reader_concurrency_semaphore: add evict_inactive_reads_for_table()
Allowing for evicting all inactive reads that belong to a certain table.
2022-08-15 14:16:41 +03:00
Botond Dénes
c8ef356859 test/lib: move convenience table config factory to sstable_test_env
All users of `column_family_test_config()`, get the semaphore parameter
for it from `sstable_test_env`. It is clear that the latter serves as
the storage space for stable objects required by the table config. This
patch just enshrines this fact by moving the config factory method to
`sstable_test_env`, so it can just get what it needs from members.
2022-08-15 11:23:59 +03:00
Botond Dénes
c0e017e0f7 test/lib/sstable_test_env: move members to impl struct
All present members of sstable_test_env are std::unique_ptr<>:s because
they require stable addresses. This makes their handling somewhat
awkward. Move all of them into an internal `struct impl` and make that
member a unique ptr.
2022-08-15 11:20:09 +03:00
Botond Dénes
a9f296ed47 test/lib/sstable_utils: use test_env::do_with_async()
Instead of manually instantiating test_env.
2022-08-15 11:19:27 +03:00
Botond Dénes
a9573b84c5 Merge 'commitlog: Revert/modify fac2bc4 - do footprint add in delete' from Calle Wilund
Fixes #11184
Fixes #11237

In prev (broken) fix for https://github.com/scylladb/scylladb/issues/11184 we added the footprint for left-over
files (replay candidates) to disk footprint on commitlog init.

This effectively prevents us from creating segments iff we have tight limits. Since we nowadays do quite a bit of inserts _before_ commitlog replay (system.local, but...) we can end up in a situation where we deadlock start because we cannot get to the actual replay that will eventually free things.

Another, not thought through, consequence is that we add a single footprint to _all_ commitlog shard instances - even though only shard 0 will get to actually replay + delete (i.e. drop footprint).
So shards 1-X would all be either locked out or performance degraded.

Simplest fix is to add the footprint in delete call instead. This will lock out segment creation until delete call is done, but this is fast. Also ensures that only replay shard is involved.

To further emphasize this, don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when required. This changes very little, but we at last don't store
pointless string lists in shards 1 to X, and also we can potentially ask for the list twice.
More to the point, goes better hand-in-hand with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.

Closes #11251

* github.com:scylladb/scylladb:
  commitlog: Make get_segments_to_replay on-demand
  commitlog: Revert/modify fac2bc4 - do footprint add in delete
2022-08-15 09:10:32 +03:00
Botond Dénes
8f10413087 Merge 'doc: describe specifying workload attributes with service levels' from Anna Stuchlik
Fix https://github.com/scylladb/scylladb/issues/11197

This PR adds a new page where specifying workload attributes with service levels is described and adds it to the menu.

Also, I had to fix some links because of the warnings.

Closes #11209

* github.com:scylladb/scylladb:
  doc: remove the reduntant space from index
  doc: update the syntax for defining service level attributes
  doc: rewording
  doc: update the links to fix the warnings
  doc: add the new page to the toctree
  doc: add the descrption of specifying workload attributes with service levels
  doc: add the definition of workloads to the glossary
2022-08-15 07:14:28 +03:00
Nadav Har'El
c8b5c3595e Merge 'cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()' from Avi Kivity
Increase readability in preparation for managing topology with
effective_replication_map (continuing 69aea59d9).

Closes #11290

* github.com:scylladb/scylladb:
  cql3: select_statement: improve loop termination condition in  indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: reindent indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()
  cql3: select_statement: de-result_wrap indexed_table_select_statement::do_execute_base_query()
2022-08-14 23:26:06 +03:00
Nadav Har'El
4a4231ea53 Merge 'storage_proxy: coroutinize some counter mutate functions' from Avi Kivity
In preparation for effective_replication_map hygiene, convert
some counter functions to coroutines to simplify the changes.

Closes #11291

* github.com:scylladb/scylladb:
  storage_proxy: mutate_counters_on_leader: coroutinize
  storage_proxy: mutate_counters: coroutinize
  storage_proxy: mutate_counters: reorganize error handling
2022-08-14 23:16:42 +03:00
Avi Kivity
8070cdbbf9 storage_proxy: mutate_counters_on_leader: coroutinize
Simplify ahead of refactoring for consistent effective_replication_map.
2022-08-14 17:36:58 +03:00
Avi Kivity
6e330d98d2 storage_proxy: mutate_counters: coroutinize
Simplify ahead of refactoring for consistent effective_replication_map.

This is probably a pessimization of the error case, but the error case
will be terrible in any case unless we resultify it.
2022-08-14 17:28:46 +03:00
Avi Kivity
105b066ff7 storage_proxy: mutate_counters: reorganize error handling
Move the error handling function where it's used so the code
is more straightforward.

Due to some std::move()s later, we must still capture the schema early.
2022-08-14 17:13:22 +03:00
Avi Kivity
fbaa280acd cql3: select_statement: improve loop termination condition in indexed_table_select_statement::do_execute_base_query()
Move the termination condition to the front of the loop so it's
clear why we're looping and when we stop.

It's less than perfectly clean since we widen the scope of some variables
(from loop-internal to loop-carried), but IMO it's clearer.
2022-08-14 15:40:45 +03:00
Avi Kivity
60c7c11c96 cql3: select_statement: reindent indexed_table_select_statement::do_execute_base_query()
Reindent after coroutinization. No functional changes.
2022-08-14 15:35:36 +03:00
Avi Kivity
492dc6879e cql3: select_statement: coroutinize indexed_table_select_statement::do_execute_base_query()
It's much easier to maintain this way. Since it uses ranges_to_vnodes,
it interacts with topology and needs integration into
effective_replication_map management.

The patch leaves bad indentation and an infinite-looking loop in
the interest of minimization, but that will be corrected later.

Note, the test for `!r.has_value()` was eliminated since it was
short-circuited by the test for `!rqr.has_value()` returning from
the coroutine rather than propagating an error.
2022-08-14 15:31:45 +03:00
Avi Kivity
973034978c cql3: select_statement: de-result_wrap indexed_table_select_statement::do_execute_base_query()
We use result_wrap() in two places, but that makes coroutinizing the
containing function a little harder, since it's composed of more lambdas.

Remove the wrappers, gaining a bit of performance in the error case.
2022-08-14 15:22:18 +03:00
Kamil Braun
b4c5b79f5e db: system_distributed_keyspace: don't call on_internal_error in check_exists
The function `check_exists` checks whether a given table exists, giving
an error otherwise. It previously used `on_internal_error`.

`check_exists` is used in some old functions that insert CDC metadata to
CDC tables. These tables are no longer used in newer Scylla versions
(they were replaced with other tables with different schema), and this
function is no longer called. The table definitions were removed and
these tables are no longer created. They will only exists in clusters
that were upgraded from old versions of Scylla (4.3) through a sequence
of upgrades.

If you tried to upgrade from a very old version of Scylla which had
neither the old or the new tables to a modern version, say from 4.2 to
5.0, you would get `on_internal_error` from this `check_exists`
function. Fortunately:
1. we don't support such upgrade paths
2. `on_internal_error` in production clusters does not crash the system,
   only throws. The exception would be catched, printed, and the system
   would run (just without CDC - until you finished upgrade and called
   the propoer nodetool command to fix the CDC module).

Unfortunately, there is a dtest (`partitioner_tests.py`) which performs
an unsupported upgrade scenario - it starts Scylla from Cassandra (!)
work directories, which is like upgrading from a very old version of
Scylla.

This dtest was not failing due to another bug which masked the problem.
When we try to fix the bug - see #11225 - the dtest starts hitting the
assertion in `check_exists`. Because it's a test, we configure
`on_internal_error` to crash the system.

The point of this commit is to not crash the system in this rare
scenario which happens only in some weird tests. We now throw
`std::runtime_error` instead of calling `on_internal_error`. In the
dtest, we already ignore the resulting CDC error appearing in the logs
(see scylladb/scylla-dtest#2804). Together with this change, we'll be
able to fix the #11225 bug and pass this test.

Closes #11287
2022-08-14 13:12:03 +03:00
Nadav Har'El
329068df99 test/cql-pytest: un-xfail yet another passing collection-indexing test
After collection indexing has been implemented, yet another test which
failed because of #2962 now passes. So remove the "xfail" marker.

Refs #2962

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
f6f18b187a secondary index: fix paging in map value indexing
When indexing a map column's *values*, if the same value appears more
than once, the same row will appear in the index more than once. We had
code that removed these duplicates, but this deduplication did not work
across page boundaries. We had two xfailing tests to demonstrate this bug.

In this patch we fix this bug by looking at the page's start and not
generating the same row again, thereby getting the same deduplication
we had inside pages - now across pages.

The previously-xfailing tests now pass, and their xfail tag is removed.
I also added another test, for the case where the base table has only
partition keys without clustering keys. This second test is important
because the code path for the partition-key-only case is different,
and the second test exposed a bug in it as well (which is also fixed
in this patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
dc445b9a73 test/cql-pytest: test for paging with collection values index
If a map has several keys with the same value, then the "values(m)" index
must remember all of them as matching the same row - because later we may
remove one of these keys from the map but the row would still need to
match the value because of the remaining keys.

We already had a test (test_index_map_values) that although the same row
appears more than once for this value, when we search for this value the
result only returns the row once. Under the hood, Scylla does find the
same value multiple times, but then eliminates the duplicate matched raw
and returns it only once.

But there is a complication, that this de-duplication does not easily
span *paging*. So in this patch we add a test that checks that paging
does not cause the same row to be returned more than once.
Unfortunately, this test currently fails on Scylla so marked "xfail".
It passes on Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
5d556115a1 cql, view: rename and explain bytes_with_action
The structure "bytes_with_action" was very hard to understand because of
its mysterious and general-sounding name, and no comments.

In this patch I add a large comment explaining its purpose, and rename
it to a more suitable name, view_key_and_action, which suggests that
each such object is about one view key (where to add a view row), and
an additional "action" that we need to take beyond adding the view row.

This is the best I can do to make this code easier to understand without
completely reorganizing it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
8b00c91c13 cql, index: make collection indexing a cluster feature
Prevent a user from creating a secondary index on a collection column if
the cluster has any nodes which don't support this feature. Such nodes
will not be able to correctly handle requests related to this index,
so better not allow creating one.

Attempting to create an index on a collection before the entire cluster
supports this feature will result in the error:

    Indexing of collection columns not supported by some older nodes
    in this cluster. Please upgrade them.

Tested by manually disabling this feature in feature_service.cc and
seeing this error message during collection indexing test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
aa86f808a6 test/cql-pytest: failing tests for oversized key values in MV and SI
In issue #9013, we noticed that if a value larger than 64 KB is indexed,
the write fails in a bad way, and we fixed it. But the test we wrote
when fixing that issue already suggested that something was still wrong:
Cassandra failed the write cleanly, with an InvalidRequest, while Scylla
failed with a mysterious WriteFailure (with a relevant error message
only in the log).

This patch adds several xfailing tests which demonstrate what's still
wrong. This is also summarized in issue #8627:

1. A write of an oversized value to an indexed column returns the wrong
   error message.
2. The same problem also exists when indexing a collection, and the indexed
   key or value is oversized.
3. The situation is even less pleasant when adding an index to a table
   with pre-existing data and an oversized value. In this case, the
   view building will fail on the bad row, and never finish.
4. We have exactly the same bugs not just with indexes but also with
   materialized views. Interestingly, Cassandra has similar bugs in
   materialized views as well (but not in the secondary index case,
   where Cassandra does behave as expected).

Refs #8627.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
2c244c6e09 cql: fix secondary index "target" when column name has special characters
Unfortunately, we encode the "target" of a secondary index in one of
three ways:

1. It can be just a column name
2. It can be a string like keys(colname) - for the new type of
   collection indexes introduced in this series.
3. It can be a JSON map ({ ... }). This form is used for local indexes.

The code parsing this target - target_parser::parse() - needs not to
confuse these different formats. Before this patch, if the column name
contains special characters like braces or parentheses (this is allowed
in CQL syntax, via quoting), we can confuse case 1, 2, and 3: A column
named "keys(colname)" will be confused for case 2, and a column named
"{123}" will be confused with case 3.

This problem can break indexing of some specially-crafted column names -
as reproduced by test_secondary_index.py::test_index_quoted_names.

The solution adopted in this patch is that the column name in case 1
should be escaped somehow so it cannot be possibly confused with either
cases 2 and 3. The way we chose is to convert the column name to CQL (with
column_definition::as_cql_name()). In other words, if the column name
contains non-alphanumeric characters, it is wrapped in quotes and also
quotes are doubled, as in CQL. The result of this can't be confused
with case 2 or 3, neither of which may begin with a quote.

This escaping is not the minimal we could have done, but incidentally it
is exactly what Cassandra does as well, so I used it as well.

This change is *mostly* backward compatible: Already-existing indexes will
still have unescaped column names stored for their "target" string,
and the unescaping code will see they are not wrapped in quotes, and
not change them. Backward compatibility will only fail on existing indexes
on columns whose name begin and end in the quote characters - but this
case is extremely unlikely.

This patch illustrates how un-ideal our index "target" encoding is,
but isn't what made it un-ideal. We should not have used three different
formats for the index target - the third representation (JSON) should
have sufficed. However, two two other representations are identical
to Cassandra's, so using them when we can has its compatibility
advantages.

The patch makes test_secondary_index.py::test_index_quoted_names pass.

Fixes #10707.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
56204a3794 cql, index: improve error messages
Before this patch, trying to create an index on entries(x) where x is
not a map results in an error message:

  Cannot create index on index_keys_and_values of column x

The string "index_keys_and_values" is strange - Cassandra prints the
easier to understand string "entries()" - which better corresponds to
what the user actually did.

It turns out that this string "index_keys_and_values" comes from an
elaborate set of variables and functions spanning multiple source files,
used to convert our internal target_type variable into such a string.
But although this code was called "index_option" and sounded very
important, it was actually used just for one thing - error messages!

So in this patch we drop the entire "index_option" abstraction,
replacing it by a static trivial function defined exactly where
it's used (create_index_statement.cc), which prints a target type.
While at it, we print "entries()" instead of "index_keys_and_values" ;-)

After this patch, the
test_secondary_index.py::test_index_collection_wrong_type

finally passes (the previous patch fixed the default table names it
assumes, and this patch fixes the expected error messages), so its
"xfail" tag is removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
84461f1827 cql, index: fix default index name for collection index
When creating an index "CREATE INDEX ON tbl(keys(m))", the default name
of the index should be tbl_m_idx - with just "m". The current code
incorrectly used the default name tbl_m_keys_idx, so this patch adds
a test (which passes on Cassandra, and after this patch also on Scylla)
and fixes the default name.

It turns out that the default index name was based on a mysterious
index_target::as_string(), which printed the target "keys(m)" as
"m_keys" without explaining why it was so. This method was actually
used only in three places, and all of them wanted just the column
name, without the "_keys" suffix! So in this patch we rename the
mysterious as_string() to column_name(), and use this function instead.

Now that the default index name uses column_name() and gets just
column_name(), the correct default index name is generated, and the
test passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Nadav Har'El
94ba03a4d6 test/cql-pytest: un-xfail several collecting indexing tests
After the previous patches implemented collection indexing, several
tests in test/cql-pytest/test_secondary_index.py that were marked
with "xfail" started to pass - so here we remove the xfail.

Only three collection indexing tests continue to xfail:
test_secondary_index.py::test_index_collection_wrong_type
test_secondary_index.py::test_index_quoted_names (#10707)
test_secondary_index.py::test_local_secondary_index_on_collection (#10713)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Michał Radwański
2690ecd65d test/cql-pytest/test_secondary_index: verify that local index on
collection fails.

Collection indexing is being tracked by #2962. Global secondary index
over collection is enabled by #10123. Leave this test to track this
behaviour.

Related issue: #10713
2022-08-14 10:29:52 +03:00
Michał Radwański
1d852a9c7f docs/design-notes/secondary_index: add VALUES to index target list
A new secondary index target is being supported, which is `VALUES(v)`.
2022-08-14 10:29:52 +03:00
Michał Radwański
25f4c905f5 test/cql-pytest/test_secondary_index: add randomized test for indexes on
collections
2022-08-14 10:29:52 +03:00
Michał Radwański
2a8289c101 cql-pytest/cassandra_tests/.../secondary_index_test: fix error message
in test ported from Cassandra
2022-08-14 10:29:52 +03:00
Michał Radwański
fb476702a7 cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test
ported from Cassandra is expected to fail, since Scylla assumes that
comparison with null doesn't throw error, just evaluates to false. Since
it's not a bug, but expected behavior from the perspective of Scylla, we
don't mark it as xfail.
2022-08-14 10:29:52 +03:00
Michał Radwański
f572051ee9 test/boost/secondary_index_test: update for non-frozen indexes on
collections
2022-08-14 10:29:52 +03:00
Karol Baryła
9e377b2824 test/cql-pytest: Uncomment collection indexes tests that should be working now 2022-08-14 10:29:52 +03:00
Nadav Har'El
67990d2170 cql, index: don't use IS NOT NULL on collection column
When the secondary-index code builds a materialized view on column x, it
adds "x IS NOT NULL" to the where-clause of the view, as required.

However, when we index a collection column, we index individual pieces
of the collection (keys, values), the the entire collection, so checking
if the entire collection is null does not make sense. Moreover, for a
collection column x, "x IS NOT NULL" currently doesn't work and throws
errors when evaluating that expression when data is written to the table.

The solution used in this patch is to simply avoid adding the "x IS NOT
NULL" when creating the materialized view for a collection index.
Everything works just fine without it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Michał Radwański
bd44bc3e35 cql3/statements/select_statement: for index on values of collection,
don't emit duplicate rows

The index on collection values is special in a way, as its' clustering key contains not only the
base primary key, but also a column that holds the keys of the cells in the collection, which
allows to distinguish cells with different keys but the same value.
This has an unwanted consequence, that it's possible to receive two identical base table primary
keys from indexed_table_select_statement::find_index_clustering_rows. Thankfully, the duplicate
primary keys are guaranteed to occur consequently.
2022-08-14 10:29:52 +03:00
Michał Radwański
10e241988e cql/expr/expression, index/secondary_index_manager: needs_filtering and
index_supports_expression rewrite to accomodate for indexes over
collections
2022-08-14 10:29:52 +03:00
Karol Baryła
ac97086855 cql3, index: Use entries() indexes on collections for queries
Previous commit added the ability to use GSI over non-frozen collections in queries,
but only the keys() and values() indexes. This commit adds support for the missing
index type - entries() index.

Signed-off-by: Karol Baryła <karol.baryla@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:52 +03:00
Karol Baryła
7966841d37 cql3, index: Use keys() and values() indexes on collections for queries.
Previous commits added the possibility of creating GSI on non-frozen collections.
This (and next) commit allow those indexes to actually be used by queries.
This commit enables both keys() and values() indexes, as they are pretty similar.
2022-08-14 10:29:52 +03:00
Karol Baryła
aa47f4a15c types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
std::begin in concept for build_value_fragmented's parameter allows creating it from an array
2022-08-14 10:29:52 +03:00
Michał Radwański
e6521ff8ba cql3/statements/index_target: throw exception to signalize that we
didn't miss returning from function

GCC doesn't consider switches over enums to be exhaustive. Replace
bogous return value after a switch where each of the cases return, with
an exception.
2022-08-14 10:29:52 +03:00
Michał Radwański
32289d681f db/view/view.cc: compute view_updates for views over collections
For collection indexes, logic of computing values for each of the column
needed to change, since a single particular column might produce more
than one value as a result.

The liveness info from individual cells of the collection impacts the
liveness info of resulting rows. Therefore it is needed to rewrite the
control flow - instead of functions getting a row from get_view_row and
later computing row markers and applying it, they compute these values
by themselves.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:49 +03:00
Michał Radwański
112086767c view info: has_computed_column_depending_on_base_non_primary_key
In case of secondary indexes, if an index does not contain any column from
the base which makes up for the primary key, then it is assumed that
during update, a change to some cells from the base table cannot cause
that we're dealing with a different row in the view. This however
doesn't take into account the possibility of computed columns which in
fact do depend on some non-primary-key columns. Introduce additional
property of an index,
has_computed_column_depending_on_base_non_primary_key.
2022-08-14 10:29:14 +03:00
Michał Radwański
4cfd264e5d column_computation: depends_on_non_primary_key_column
depends_on_non_primary_key_column for a column computation is needed to
detect a case where the primary key of a materialized view depends on a
non primary key column from the base table, but at the same time, the view
itself doesn't have non-primary key columns. This is an issue, since as
for now, it was assumed that no non-primary key columns in view schema
meant that the update cannot change the primary key of the view, and
therefore the update path can be simplified.
2022-08-14 10:29:14 +03:00
Michał Radwański
f1a9def2e1 schema, index/secondary_index_manager: make schema for index-induced mv
Indexes over collections use materialized views. Supposing that we're
dealing with global indexes, and that pk, ck were the partition and
clustering keys of the base table, the schema of the materialized view,
apart from having idx_token (which is used to preserve the order on the
entries in the view), has a computed column coll_value (the name is not
guaranteed to be exactly) and potentially also
coll_keys_for_values_index, if the index was over collection values.
This is needed, since values in a specific collection need not be
unique.

To summarize, the primary key is as follows:

coll_value, idx_token, pk, ck, coll_keys_for_values_index?

where coll_value is the computed value from the collection, be it a key
from the collection, a value from the collection, or the tuple containing
both.
2022-08-14 10:29:14 +03:00
Michał Radwański
60d50f6016 index/secondary_index_manager: extract keys, values, entries types from collection
These functions are relevant for indexes over collections (creating
schema for a materialized view related to the index).

Signed-off-by: Michał Radwański <michal.radwanski@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-08-14 10:29:14 +03:00
Michał Radwański
cbe33f8d7a cql3/statements/: validate CREATE INDEX for index over a collection
Allow CQL like this:
CREATE INDEX idx ON table(some_map);
CREATE INDEX idx ON table(KEYS(some_map));
CREATE INDEX idx ON table(VALUES(some_map));
CREATE INDEX idx ON table(ENTRIES(some_map));
CREATE INDEX idx ON table(some_set);
CREATE INDEX idx ON table(VALUES(some_set));
CREATE INDEX idx ON table(some_list);
CREATE INDEX idx ON table(VALUES(some_list));

This is needed to support creating indexes on collections.
2022-08-14 10:29:13 +03:00
Michał Radwański
997682ed72 cql3/statements/create_index_statement,index_target: rewrite index target for collection
The syntax used for creating indexes on collections that is present in
Cassandra is unintuitive from the internal representation point of view.

For instance, index on VALUES(some_set) indexes the set elements, which
in the internal representation are keys of collection. Rewrite the index
target after receiving it, so that the index targets are consistent with
the representation.
2022-08-14 10:29:13 +03:00
Michał Radwański
ebc4ad4713 column_computation.hh, schema.cc: collection_column_computation
This type of column computation will be used for creating updates to
materialized views that are indexes over collections.

This type features additional function, compute_values_with_action,
which depending on an (optional) old row and new row (the update to the
base table) returns multiple bytes_with_action, a vector of pairs
(computed value, some action), where the action signifies whether a
deletion of row with a specific key is needed, or creation thereby.
2022-08-14 10:29:13 +03:00
Michał Radwański
2babee2cdc column_computation.hh, schema.cc: compute_value interface refactor
The compute_value function of column_computation has had previously the
following signature:
    virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;

This is superfluous, since never in the history of Scylla, the last
parameter (row) was used in any implentation, and never did it happen
that it returned bytes_opt. The absurdity of this interface can be seen
especially when looking at call sites like following, where dummy empty
row was created:
```
    token_column.get_computation().compute_value(
            *_schema, pkv_linearized, clustering_row(clustering_key_prefix::make_empty()));
```
2022-08-14 10:29:13 +03:00
Michał Radwański
166afd46b5 Cql.g, treewide: support cql syntax INDEX ON table(VALUES(collection))
Brings support of cql syntax `INDEX ON table(VALUES(collection))`, even
though there is still no support for indexes over collections.
Previously, index_target::target_type::values was refering to values of
a regular (non-collection) column. Rename it to `regular_values`.

Fixes #8745.
2022-08-14 10:29:13 +03:00
Piotr Sarna
fe617ed198 Merge 'db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column' from Piotr Dulikowski
Previously, the `system.local`'s `rpc_address` column kept local node's
`rpc_address` from the scylla.yaml configuration. Although it sounds
like it makes sense, there are a few reasons to change it to the value
of scylla.yaml's `broadcast_rpc_address`:

- The `broadcast_rpc_address` is the address that the drivers are
  supposed to connect to. `rpc_address` is the address that the node
  binds to - it can be set for example to 0.0.0.0 so that Scylla listens
  on all addresses, however this gives no useful information to the
  driver.
- The `system.peers` table also has the `rpc_address` column and it
  already keeps other nodes' `broadcast_rpc_address`es.
- Cassandra is going to do the same change in the upcoming version 4.1.

Fixes: #11201

Closes #11204

* github.com:scylladb/scylladb:
  db/system_keyspace: fix indentation after previous patch
  db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column
2022-08-12 16:24:28 +02:00
Anna Stuchlik
41362829b5 doc: fix the upgrade guides for Ubuntu and Debian by removing image-related information 2022-08-12 14:39:10 +02:00
Anna Stuchlik
b45ba69a6c doc: update the guides for Ubuntu and Debian to remove image information and the OS version number 2022-08-12 14:05:49 +02:00
Anna Stuchlik
24acffc2ce doc: add the upgrade guide for ScyllaDB image from 2021.1 to 2022.1 2022-08-12 13:47:03 +02:00
Piotr Sarna
1ab4c6aab3 Merge 'cql3: enable collections as UDA accumulators' from Wojciech Mitros
Currently, the initial values of UDA accumulators are converted
to strings using the to_string() method and from strings using the
from_string() method. The from_string() method is not implemented
for collections, and it can't be implemented without changing the
string format, because in that format, we cannot differentiate
whether a separator is a part of a value or is an actual separator
between values. In particular, the separators are not escaped
in the collection values.

Instead of from_string()/to_string() the cql parser is used
for creating a value from a string (the same , and to_parsable_string()
is used to converting a value into a string.

A test using a list as an accumulator is added to
cql-pytest/test_uda.py.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #11250

* github.com:scylladb/scylladb:
  cql3: enable collections as UDA accumulators
  cql3: extend implementation of to_bytes for raw_value
2022-08-12 12:51:17 +02:00
Botond Dénes
ceb1cdcb7a Merge 'doc: fix the typo on the Fault Tolerance page' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/438
In addition, I've replaced "Scylla" with "ScyllaDB" on that page.

Closes #11281

* github.com:scylladb/scylladb:
  doc: replace Scylla with ScyllaDB on the Fault Tolerance page
  doc: fis the typo in the note
2022-08-12 06:58:39 +03:00
Nadav Har'El
c27f431580 test/alternator: fix a flaky test for full-table scan page size
This patch fixes the test test_scan.py::test_scan_paging_missing_limit
which failed in a Jenkins run once (that we know of).

That test verifies that an Alternator Scan operation *without* an explicit
"Limit" is nevertheless paged: DynamoDB (and also Scylla) wanted this page
size to be 1 MB, but it turns out (see #10327) that because of the details
of how Scylla's scan works, the page size can be larger than 1 MB. How much
larger? I ran this test hundreds of times and never saw it exceed a 3 MB
page - so the test asserted the page must be smaller than 4 MB. But now
in one run - we got to this 4 MB and failed the test.

So in this patch we increase the table to be scanned from 4 MB to 6 MB,
and assert the page size isn't the full 6 MB. The chance that this size will
eventually fail as well should be (famous last words...) very small for
two reasons: First because 6 MB is even higher than I the maximum I saw
in practice, and second because empirically I noticed that adding more
data to the table reduces the variance of the page size, so it should
become closer to 1 MB and reduce the chance of it reaching 6 MB.

Refs #10327

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11280
2022-08-12 06:57:45 +03:00
Botond Dénes
2a39d6518d Merge 'doc: clarify the disclaimer about reusing deleted counter column values' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/857

Closes #11253

* github.com:scylladb/scylladb:
  doc: language improvemens to the Counrers page
  doc: fix the external link
  doc: clarify the disclaimer about reusing deleted counter column values
2022-08-12 06:56:28 +03:00
Botond Dénes
10371441c9 Merge 'docs: add a disclaimer about not supporting local counters by SSTableLoader' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/867
Plus some language, formatting, and organization improvements.

Closes #11248

* github.com:scylladb/scylladb:
  doc: language, formatting, and organization improvements
  doc: add a disclaimer about not supporting local counters by SSTableLoader
2022-08-12 06:55:00 +03:00
Benny Halevy
d295d8e280 everywhere: define locator::host_id as a strong tagged_uuid type
So it can be distinguished from other uuid-based
identifiers in the system.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11276
2022-08-12 06:01:44 +03:00
Botond Dénes
69aea59d97 Merge 'storage_proxy: use consistent topology, prepare for fencing' from Avi Kivity
Replication is a mix of several inputs: tokens and token->node mappings (topology),
the replication strategy, replication strategy parameters. These are all captured
in effective_replication_map.

However, if we use effective_replication_map:s captured at different times in a single
query, then different uses may see different inputs to effective_replication_map.

This series protects against that by capturing an effective_replication_map just
once in a query, and then using it. Furthermore, the captured effective_replication_map
is held until the query completes, so topology code can know when a topology is no
longer is use (although this isn't exploited in this series).

Only the simple read and write paths are covered. Counters and paxos are left for
later.

I don't think the series fixes any bugs - as far as I could tell everything was happening
in the same continuation. But this series ensures it.

Closes #11259

* github.com:scylladb/scylladb:
  storage_proxy: use consistent topology
  storage_proxy: use consistent replication map on read path
  storage_proxy: use consistent replication map on write path
  storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map
  consistency_level: accept effective_replication_map as parameter, rather than keyspace
  consistency_level: be more const when using replication_strategy
2022-08-12 06:00:30 +03:00
Alejo Sanchez
10baac1c84 test.py: test topology and schema changes
Add support for topology changes: add/stop/remove/restart/replace node.

Test simple schema changes when changing topology.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
7f32fc0cc7 test.py: ClusterManager API mark cluster dirty
Allow tests to manually mark current cluster dirty.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
a585a82ad1 test.py: call before/after_test for each test case
Preparing for topology tests with changing clusters, run before and
after checks per test case.

Change scope of pytest fixtures to function as we need them per test
casse.

Add server and client API logic.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
eedc866433 test.py: handle driver connection in ManagerClient
Preparing for cluster cycling, handle driver connection in ManagerClient.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
fe561a7dbd test.py: ClusterManager API and ManagerClient
Add an API via Unix socket to Manager so pytests can query information
about the cluster. Requests are managed by ManagerClient helper class.

The socket is placed inside a unique temporary directory for the
Manager (as safe temporary socket filename is not possible in Python).

Initial API services are manager up, cluster up, if cluster is dirty,
cql port, configured replicas (RF), and list of host ids.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Alejo Sanchez
aad015d4e2 test.py: improve topology docstring
Improve docstring of TopologyTestSuite to reflect its differences with
other test suites.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-11 23:39:13 +02:00
Avi Kivity
a2c4f5aa1a storage_proxy: use consistent topology
Derive the topology from captured and stable effective_replication_map
instead of getting a fresh topology from storage_proxy, since the
fresh topology may be inconsistent with the running query.

digest_read_resolver did not capture an effective_replication_map, so
that is added.
2022-08-11 17:58:42 +03:00
Avi Kivity
883518697b storage_proxy: use consistent replication map on read path
Capture a replication map just once in
abstract_read_executor::_effective_replication_map_ptr. Although it isn't
used yet, it serves to keep a reference count on topology (for fencing),
and some accesses to topology within reads still remain, which can be
converted to use the member in a later patch.
2022-08-11 17:58:42 +03:00
Avi Kivity
01a614fb4d storage_proxy: use consistent replication map on write path
Capture a replication map just once in
abstract_write_handler::_effective_replication_map_ptr and use it
in all write handlers. A few accesses to get the topology still remain,
they will be fixed up in a later patch.
2022-08-11 17:58:42 +03:00
Avi Kivity
f1b0e3d58e storage_proxy: convert get_live{,_sorted}_endpoints() to accept an effective_replication_map
Allow callers to use consistent effective_replication_map:s across calls
by letting the caller select the object to use.
2022-08-11 17:58:42 +03:00
Avi Kivity
46bd0b1e62 consistency_level: accept effective_replication_map as parameter, rather than keyspace
A keyspace is a mutable object that can change from time to time. An
effective_replication_map captures the state of a keyspace at a point in
time and can therefore be consistent (with care from the caller).

Change consistency_level's functions to accept an effective_replication_map.
This allows the caller to ensure that separate calls use the same
information and are consistent with each other.

Current callers are likely correct since they are called from one
continuation, but it's better to be sure.
2022-08-11 17:58:42 +03:00
Avi Kivity
1078d1bfda consistency_level: be more const when using replication_strategy
We don't modify the replication_strategy here, so use const. This
will help when the object we get is const itself, as it will be in
the next patches.
2022-08-11 17:58:42 +03:00
Wojciech Mitros
48bd752971 cql3: enable collections as UDA accumulators
Currently, the initial values of UDA accumulators are converted
to strings using the to_string() method and from strings using the
from_string() method. The from_string() method is not implemented
for collections, and it can't be implemented without changing the
string format, because in that format, we cannot differentiate
whether a separator is a part of a value or is an actual separator
between values. In particular, the separators are not escaped
in the collection values. For example, a list with string elements:
'a, b', 'c' would be represented as a string 'a, b, c', while now
it is represented as "['a, b', 'c']".
Some types that were parsable are now represented in a different
way. For example, a tuple ('a', null, 0) was represented as
"a:\@:0", and now it is "('a', null, 0)".

Instead of from_string()/to_string() the cql parser is used
for creating a value from a string (the same , and to_parsable_string()
is used to converting a value into a string.

A test using a list as an accumulator is added to
cql-pytest/test_uda.py.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-08-11 16:23:57 +02:00
Anna Stuchlik
f5a49688ae doc: replace Scylla with ScyllaDB on the Fault Tolerance page 2022-08-11 16:14:33 +02:00
Anna Stuchlik
7218a977df doc: fis the typo in the note 2022-08-11 16:09:49 +02:00
Botond Dénes
d407d3b480 Merge 'Calculate effective_replication_map: prevent stalls with everywhere_replication_strategy' from Benny Halevy
For replication strategies like "everywhere"
and "local" that return the same set of endpoints
for all tokens, we can call rs->calculate_natural_endpoints
one once and reuse the result for all token.

Note that ideally the replication_map could contain only
a single token range for this case, but that does't seem to work yet.

Add `maybe_yield()` calls to the tight loop
to prevent reactor stalls on large clusters when copying
a long vector returned by everywhere_replication_strategy
to potentially 1000's of tokens in the map.

Nicholas Peshek wrote in
https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370

about similar patch by Geoffrey Beausire:
994c6ecf3c

> Yep. That dropped our startup from 3000+ seconds to about 40.

Fixes #10337

Closes #11277

* github.com:scylladb/scylladb:
  abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies
  abstract_replication_strategy: add has_uniform_natural_endpoints
2022-08-11 15:19:47 +03:00
Gleb Natapov
e5157b27ad raft: getting abort_requested_exception exception from a sm::apply is not a critical error
During shutdown it is normal to get abort_requested_exception exception
from a state machine "apply" method. Do not rethrow it as
state_machine_error, just abort an applier loop with an info message.
2022-08-11 15:11:21 +03:00
Gleb Natapov
9977851eb1 schema_registry: fix abandoned feature warning
maybe_sync ignores failed feature in case waiting is aborted. Fix it.
2022-08-11 15:11:21 +03:00
Gleb Natapov
eed8e19813 service: raft: silence rpc::closed_errors in raft_rpc
Before the patch if an RPC connection was established already then the
close error was reported by the RPC layer and then duplicated by
raft_rpc layer. If a connection cannot be established because the remote
node is already dead RPC does not report the error since we decided that
in that case gossiper and failure detector messages can be used to
detect the dead node case and there is no reason to pollute the logs
with recurring errors. This aligns raft behaviour with what we already
have in storage_proxy that does not report closed errors as well.
2022-08-11 15:11:21 +03:00
Anna Stuchlik
1603129275 doc: remove the reduntant space from index 2022-08-11 12:36:16 +02:00
Anna Stuchlik
ee258cb0af doc: update the syntax for defining service level attributes 2022-08-11 12:32:38 +02:00
Petr Gusev
4bc6611829 raft read_barrier, retry over intermittent rpc failures
If the leader was unavailable during read_barrier,
closed_error occurs, which was not handled in any way
and eventually reached the client. This patch adds retries in this case.

Fix: scylladb#11262
Refs: #11278

Closes #11263
2022-08-11 13:31:19 +03:00
Amnon Heiman
5ac20ac861 Reduce the number of per-scheduling group metrics
This patch reduces the number of metrics ScyllaDB generates.

Motivation: The combination of per-shard with per-scheduling group
generates a lot of metrics. When combined with histograms, which require
many metrics, the problem becomes even bigger.

The two tools we are going to use:
1. Replace per-shard histograms with summaries
2. Do not report unused metrics.

The storage_proxy stats holds information for the API and the metrics
layer.  We replaced timed_rate_moving_average_and_histogram and
time_estimated_histogram with the unfied
timed_rate_moving_average_summary_and_histogram which give us an option
to report per-shard summaries instead of histogram.

All the counters, histograms, and summaries were marked as
skip_when_empty.

The API was modified to use
timed_rate_moving_average_summary_and_histogram.

Closes #11173
2022-08-11 13:31:19 +03:00
Benny Halevy
9167b857e9 abstract_replication_strategy: calculate_effective_replication_map: optimize for static replication strategies
For replication strategies like "everywhere"
and "local" that return the same set of endpoints
for all tokens, we can call rs->calculate_natural_endpoints
one once and reuse the result for all token.

Note that ideally the replication_map could contain only
a single token range for this case, but that does't seem to work yet.

Add maybe_yield() calls to the tight loop
to prevent reactor stalls on large clusters when copying
a long vector returned by everywhere_replication_strategy
to potentially 1000's of tokens in the map.

Nicholas Peshek wrote in
https://github.com/scylladb/scylladb/issues/10337#issuecomment-1211152370

about similar patch by Geoffrey Beausire:
994c6ecf3c

> Yep. That dropped our startup from 3000+ seconds to about 40.

Fixes #10337

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-11 10:35:29 +03:00
Benny Halevy
eb678e723b abstract_replication_strategy: add has_uniform_natural_endpoints
So that using calaculate_natural_endpoints can be optimized
for strategies that return the same endpoints for all tokens,
namely everywhere_replication_strategy and local_strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-11 10:34:14 +03:00
Calle Wilund
a729c2438e commitlog: Make get_segments_to_replay on-demand
Refs #11237

Don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when
required. This changes very little, but we at last don't store
pointless string lists in shards 1 to X, and also we can potentially
ask for the list twice. More to the point, goes better hand-in-hand
with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
2022-08-11 06:41:23 +00:00
Nadav Har'El
d03bd82222 Revert "test: move scylla_inject_error from alternator/ to cql-pytest/"
This reverts commit 8e892426e2 and fixes
the code in a different way:

That commit moved the scylla_inject_error function from
test/alternator/util.py to test/cql-pytest/util.py and renamed
test/alternator/util.py. I found the rename confusing and unnecessary.
Moreover, the moved function isn't even usable today by the test suite
that includes it, cql-pytest, because it lacks the "rest_api" fixture :-)
so test/cql-pytest/util.py wasn't the right place for it anyway.
test/rest_api/rest_util.py could have been a good place for this function,
but there is another complication: Although the Alternator and rest_api
tests both had a "rest_api" fixture, it has a different type, which led
to the code in rest_api which used the moved function to have to jump
through hoops to call it instead of just passing "rest_api".

I think the best solution is to revert the above commit, and duplicate
the short scylla_inject_error() function. The duplication isn't an
exact copy - the test/rest_api/rest_util.py version now accepts the
"rest_api" fixture instead of the URL that the Alternator version used.

In the future we can remove some of this duplication by having some
shared "library" code but we should do it carefully and starting with
agreeing on the basic fixtures like "rest_api" and "cql", without that
it's not useful to share small functions that operate on them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11275
2022-08-11 06:43:26 +03:00
Wojciech Mitros
42e0fb90ea cql3: extend implementation of to_bytes for raw_value
When called with a null_value or an unset_value,
raw_value::to_bytes() threw an std::get error for
wrong variant. This patch adds a description for
the errors thrown, and adds a to_bytes_opt() method
that instead of throwing returns a std::nullopt.
2022-08-10 16:40:22 +02:00
Avi Kivity
e9cbc9ee85 Merge 'Add support for empty replica pages' from Botond Dénes
Many tombstones in a partition is a problem that has been plaguing queries since the inception of Scylla (and even before that as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size nor the row number one. Hence, large spans of tombstones (be that row- or range-tombstones) are problematic: the query can time out while processing this span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as short one, even if the page is empty currently. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.

Upgrade sanity test was conducted as following:
* Created cluster of 3 nodes with RF=3 with master version
* Wrote small dataset of 1000 rows.
* Deleted prefix of 980 rows.
* Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started by `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.

Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op,  12.0 tasks/op,   43007 insns/op,        0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op,  12.0 tasks/op,   43181 insns/op,        0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instruction/op.

Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672

Closes #11053

* github.com:scylladb/scylladb:
  test/cql-pytest: add test for query tombstone page limit
  query-result-writer: stop when tombstone-limit is reached
  service/pager: prepare for empty pages
  service/storage_proxy: set smallest continue pos as query's continue pos
  service/storage_proxy: propagate last position on digest reads
  query: result_merger::get() don't reset last-pos on short-reads and last pages
  query: add tombstone-limit to read-command
  service/storage_proxy: add get_tombstone_limit()
  query: add tombstone_limit type
  db/config: add config item for query tombstone limit
  gms: add cluster feature for empty replica pages
  tree: don't use query::read_command's IDL constructor
2022-08-10 13:38:06 +03:00
Avi Kivity
32b405d639 Merge 'doc: change the default for the overprovisioned option' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-doc-issues/issues/842

This PR changes the default for the `overprovisioned` option from `disabled` to `enabled`, according to https://github.com/scylladb/scylla-doc-issues/issues/842.
In addition, I've used this opportunity to replace "Scylla" with "ScyllaDB" on the updated page.

Closes #11256

* github.com:scylladb/scylladb:
  doc: replace Scylla with ScyllaDB in the product name
  doc: change the default for the overprovisioned option
2022-08-10 12:43:44 +03:00
Raphael S. Carvalho
ace6334619 replica: table: kill unused _sstables_staging
Good change as it's one less thing to worry about in compaction
group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-08-10 12:32:13 +03:00
Kamil Braun
cff595211e Merge 'Raft test topology part 2' from Alecco
Give cluster control to pytests. While there add missing stop gracefully and add server to ScyllaCluster.
Clusters can be marked dirty but they are not recycled yet. This will be done in a later series.

Closes #11219

* github.com:scylladb/scylladb:
  test.py: ScyllaCluster add_server() mark dirty
  test.py: ScyllaCluster add server management
  test.py: improve seeds for new servers
  test.py: Topology tests and Manager for Scylla clusters
  test.py: rename scylla_server to scylla_cluster
  test.py: function for python driver connection
  test.py: ScyllaCluster add_server helper
  test.py: shutdown control connection during graceful shutdown
  test.py: configurable authenticator and authorizer
  test.py: ScyllaServer stop gracefully
  test.py: FIXME for bad cluster log handling logic
2022-08-10 11:13:21 +02:00
Michał Chojnowski
de0f2c21ec configure.py: make messaging_service.cc the first source file
Currently messaging_service.o takes the longest of all core objects to
compile.  For a full build of build/release/scylla, with current ninja
scheduling, on a 32-hyperthread machine, the last ~16% of the total
build time is spent just waiting on messaging_service.o to finish
compiling.

Moving the file to the top of the list makes ninja start its compilation
early and gets rid of that single-threaded tail, improving the total build
time.

Closes #11255
2022-08-10 11:18:09 +03:00
Calle Wilund
8116c56807 commitlog: Revert/modify fac2bc4 - do footprint add in delete
Fixes #11184
Fixes #11237

In prev (broken) fix for #11184 we added the footprint for left-over
files (replay candidates) to disk footprint on commitlog init.
This effectively prevents us from creating segments iff we have tight
limits. Since we nowadays do quite a bit of inserts _before_ commitlog
replay (system.local, but...) we can end up in a situation where we
deadlock start because we cannot get to the actual replay that will
eventually free things.
Another, not thought through, consequence is that we add a single
footprint to _all_ commitlog shard instances - even though only
shard 0 will get to actually replay + delete (i.e. drop footprint).
So shards 1-X would all be either locked out or performance degraded.

Simplest fix is to add the footprint in delete call instead. This will
lock out segment creation until delete call is done, but this is fast.
Also ensures that only replay shard is involved.
2022-08-10 08:04:03 +00:00
Botond Dénes
e27127bb7f test/cql-pytest: add test for query tombstone page limit
Check that the replica returns empty pages as expected, when a large
tombstone prefix/span is present. Large = larger than the configured
query_tombstone_limit (using a tiny value of 10 in the test to avoid
having to write many tombstones).
2022-08-10 09:14:59 +03:00
Tomasz Grabiec
8ee5b69f80 test: row_cache: Use more narrow key range to stress overlapping reads more
This makes catching issues related to concurrent access of same or
adjacent entries more likely. For example, catches #11239.

Closes #11260
2022-08-10 06:53:54 +03:00
Botond Dénes
7730419f5c query-result-writer: stop when tombstone-limit is reached
The query result writer now counts tombstones and cuts the page (marking
it as a short one) when the tombstone limit is reached. This is to avoid
timing out on large span of tombstones, especially prefixes.
In the case of unpaged queries, we fail the read instead, similarly to
how we do with max result size.
If the limit is 0, the previous behaviour is used: tombstones are not
taken into consideration at all.
2022-08-10 06:03:38 +03:00
Botond Dénes
8066dbc635 service/pager: prepare for empty pages
The pager currently assumes that an empty pages means the query is
exhausted. Lift this assumption, as we will soon have empty short
pages.
Also, paging using filtering also needs to use the replica-provided
last-position when the page is empty.
2022-08-10 06:03:38 +03:00
Botond Dénes
6a7dedfe34 service/storage_proxy: set smallest continue pos as query's continue pos
We expect each replica to stop at exactly the same position when the
digests match. Soon however, if replicas have a lot of tombstones, some
may stop earlier then the others. As long as all digests match, this is
fine but we need to make sure we continue from the smallest such
positions on the next page.
2022-08-10 06:03:38 +03:00
Botond Dénes
2656968db2 service/storage_proxy: propagate last position on digest reads
We want to transmit the last position as determined by the replica on
both result and digest reads. Result reads already do that via the
query::result, but digest reads don't yet as they don't return the full
query::result structure, just the digest field from it. Add the last
position to the digest read's return value and collect these in the
digest resolver, along with the returned digests.
2022-08-10 06:03:37 +03:00
Botond Dénes
8c0dd99f7c query: result_merger::get() don't reset last-pos on short-reads and last pages
When merging multiple query-results, we use the last-position of the
last result in the combined one as the combined result's last position.
This only works however if said last result was included fully.
Otherwise we have to discard the last-position included with the result
and the pager will use the position of the last row in the combined
result as the last position.
The commit introducing the above logic mistakenly discarded the last
position when the result is a short read or a page is not full. This is
not necessary and even harmful as it can result in an empty combined
result being delivered to the pager, without a last-position.
2022-08-10 06:01:49 +03:00
Botond Dénes
d1d53f1b84 query: add tombstone-limit to read-command
Propagate the tombstone-limit from coordinator to replicas, to make sure
all is using the same limit.
2022-08-10 06:01:47 +03:00
Anna Stuchlik
43cc17bf5d doc: replace Scylla with ScyllaDB in the product name 2022-08-09 16:19:55 +02:00
Anna Stuchlik
d21b92fb13 doc: change the default for the overprovisioned option 2022-08-09 16:09:29 +02:00
Anna Stuchlik
c3dbb9706e doc: language improvemens to the Counrers page 2022-08-09 14:35:44 +02:00
Alejo Sanchez
05afca2199 test.py: ScyllaCluster add_server() mark dirty
When changing topology, tests will add servers. Make add_server mark the
cluster dirty. But mark the cluster as not dirty after calling
add_server when installing the cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
f1a6e4bda9 test.py: ScyllaCluster add server management
Preparing for topology changes, implement the primitives for managing
ScyllaServers in ScyllaCluster. The states are started, stopped, and
removed. Started servers can be stopped or restarted. Stopped servers
can be started. Stopped servers can be removed (destroyed).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
a6448458bb test.py: improve seeds for new servers
Instead of only using last started server as seed, use all started
servers as seed for new servers.

This also avoids tracking last server's state.

Pass empty list instead of None.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
83dab6045b test.py: Topology tests and Manager for Scylla clusters
Preparing to cycle clusters modified (dirty) and use multiple clusters
per topology pytest, introduce Topology tests and Manager class to
handle clusters.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
14328d1e42 test.py: rename scylla_server to scylla_cluster
This file's most important class is ScyllaCluster, so rename it
accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
dcd8d77f34 test.py: function for python driver connection
Isolate python driver connection on its own function. Preparing for
harness client fixture to handle the connection.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
1db31ebfdc test.py: ScyllaCluster add_server helper
For future use from API.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Konstantin Osipov
c81c8af1ba test.py: shutdown control connection during graceful shutdown 2022-08-09 14:26:13 +02:00
Alejo Sanchez
bc494754e8 test.py: configurable authenticator and authorizer
For scylla servers, keep default PasswordAuthenticator and
CassandraAuthorizer but allow this to be configurable per test suite.
Use AllowAll* for topology test suite.

Disabling authentication avoids complications later for topology tests
as system_auth kespace starts with RF=1 and tests take down nodes. The
keyspace would need to change RF and run repair. Using AllowAll avoids
this problem altogether.

A different cql fixture is created without auth for topology tests.

Topology tests require servers without auth from scylla.yaml conf.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
6437a7f467 test.py: ScyllaServer stop gracefully
Add stop_gracefully() method. Terminates a server in a clean way.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Alejo Sanchez
573ed429ad test.py: FIXME for bad cluster log handling logic
The code in test.py using a ScyllaCluster is getting a server id and
taking logs from only the first server.

If there is a failure in another server it's not reported properly.

And CQL connection will go only to the first server.

Also, it might be better to have ScyllaCluster to handle these matters
and be more opaque.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-09 14:26:13 +02:00
Anna Stuchlik
15c24ba3e0 doc: fix the external link 2022-08-09 14:20:54 +02:00
Anna Stuchlik
82d1f67378 doc: clarify the disclaimer about reusing deleted counter column values 2022-08-09 14:12:37 +02:00
Avi Kivity
be44fd63f9 Merge 'Make get_range_addresses async and hold effective_replication_map_ptr around it' from Benny Halevy
This series converts the synchronous `effective_replication_map::get_range_addresses` to async
by calling the replication strategy async entry point with the same name, as its callers are already async
or can be made so easily.

To allow it to yield and work on a coherent view of the token_metadata / topology / replication_map,
let the callers of this patch hold a effective_replication_map per keyspace and pass it down
to the (now asynchronous) functions that use it (making affected storage_service methods static where possible
if they no longer depend on the storage_service instance).

Also, the repeated calls to everywhere_replication_strategy::calculate_natural_endpoints
are optimized in this series by introducing a virtual abstract_replication_strategy::has_static_natural_endpoints predicate
that is true for local_strategy and everywhere_replication_strategy, and is false otherwise.
With it, functions repeatedly calling calculate_natural_endpoints in a loop, for every token, will call it only once since it will return the same result every time anyhow.

Refs #11005

Doesn't fix the issue as the large allocation still remains until we make change dht::token_range_vector chunked (chunked_vector cannot be used as is at the moment since we require the ability to push also to the front when unwrapping)

Closes #11009

* github.com:scylladb/scylladb:
  effective_replication_map: make get_range_addresses asynchronous
  range_streamer: add_ranges and friends: get erm as param
  storage_service: get_new_source_ranges: get erm as param
  storage_service: get_changed_ranges_for_leaving: get erm as param
  storage_service: get_ranges_for_endpoint: get erm as param
  repair: use get_non_local_strategy_keyspaces_erms
  database: add get_non_local_strategy_keyspaces_erms
  database: add get_non_local_strategy_keyspaces
  storage_service: coroutinize update_pending_ranges
  effective_replication_map: add get_replication_strategy
  effective_replication_map: get_range_addresses: use the precalculated replication_map
  abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
  abstract_replication_strategy: reindent
  utils: sequenced_set: expose set and `contains` method
  abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
  utils: sequenced_set: templatize VectorType
  utils: sanitize sequenced_set
  utils: sequenced_set: delete mutable get_vector method
2022-08-09 13:25:53 +03:00
Benny Halevy
f01a526887 docs: debugging: mention use of release number on backtrace.scylladb.com
Following scylladb/scylla_s3_reloc_server@af17e4ffcd
(scylladb/scylla_s3_reloc_server#28), the release number can
be used to search the relcatable package and/or decode a respective
backtrace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11247
2022-08-09 12:49:59 +03:00
Avi Kivity
d4c986e4fa Merge 'doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 20.04' from Anna Stuchlik
Ubuntu 22.04 is supported by both ScyllaDB Open Source 5.0 and Enterprise 2022.1.

Closes #11227

* github.com:scylladb/scylladb:
  doc: add the redirects from Ubuntu version specific to version generic pages
  doc: remove version-speific content for Ubuntu and add the generic page to the toctree
  doc: rename the file to include Ubuntu
  doc: remove the version number from the document and add the link to Supported Versions
  doc: add a generic page for Ubuntu
  doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1
2022-08-09 12:49:16 +03:00
Asias He
12ab2c3d8d storage_service: Prevent removed node to restart and join the cluster
1) Start node1,2,3
2) Stop node3
3) Run nodetool removenode $host_id_of_node3
4) Restart node3

Step 4 is wrong and not allowed. If it happens it will bring back node3
to the cluster.

This patch adds a check during node restart to detect such operation
error and reject the restart.

With this patch, we would see the following in step 4.

```
init - Startup failed: std::runtime_error (The node 127.0.0.3 with
host_id fa7e500a-8617-4de4-8efd-a0e177218ee8 is removed from the
cluster. Can not restart the removed node to join the cluster again!)
```

Refs #11217

Closes #11244
2022-08-09 12:46:21 +03:00
Avi Kivity
1d4bf115e2 Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec
Scenario:

cache = [
    row(pos=2, continuous=false),
    row(pos=after(2), dummy=true)
]

Scanning read starts, starts populating [-inf, before(2)] from sstables.

row(pos=2) is evicted.

cache = [
    row(pos=after(2), dummy=true)
]

Scanning read finishes reading from sstables.

Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.

The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.

Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).

Fixes #11239

Closes #11240

* github.com:scylladb/scylladb:
  test: mvcc: Fix illegal use of maybe_refresh()
  tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
  tests: row_cache_test: Introduce one_shot mode to throttle
  row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
2022-08-09 12:39:10 +03:00
Anna Stuchlik
e753b4e793 doc: language, formatting, and organization improvements 2022-08-09 10:34:22 +02:00
Tomasz Grabiec
f59d2d9bf8 range_tombstone_list: Avoid amortized_reserve()
We can use std::in_place_type<> to avoid constructing op before
calling emplace_back(). As a reuslt, we can avoid reserving space. The
reserving was there to avoid the need to roll-back in case
emplace_back() throws.

Kudos to Kamil for suggesting this.

Closes #11238
2022-08-09 11:34:16 +03:00
Avi Kivity
8d37370a71 Revert "Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj"
This reverts commit bcadd8229b, reversing
changes made to cf528d7df9. Since
4bd4aa2e88 ("Merge 'memtable, cache: Eagerly
compact data with tombstones' from Tomasz Grabiec"), memtable is
self-compacting and the extra compaction step only reduces throughput.

The unit test in memtable_test.cc is not reverted as proof that the
revert does not cause a regression.

Closes #11243
2022-08-09 11:23:29 +03:00
Anna Stuchlik
61d33cb2a8 doc: add a disclaimer about not supporting local counters by SSTableLoader 2022-08-09 10:00:14 +02:00
Anna Stuchlik
4be88e1a79 doc: add the redirects from Ubuntu version specific to version generic pages 2022-08-09 09:43:28 +02:00
Raphael S. Carvalho
337390d374 forward_service: execute_on_this_shard: avoid reallocation and copy
avoid about log2(256)=8 reallocations when pushing partition ranges to
be fetched. additionally, also avoid copying range into ranges
container. current_range will not contain the last range, after
moved, but will still be engaged by the end of the loop, allowing
next iteration to happen as expected.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11242
2022-08-09 09:08:53 +02:00
Botond Dénes
1b669cefed service/storage_proxy: add get_tombstone_limit()
To be used by coordinator side code to determine the correct tombstone
limit to pass to read-command (tombstone limit field added in the next
commit). When this limit is non-zero, the replica will start cutting
pages after the tombstone limit is surpassed.
This getter works similarly to `get_max_result_size()`: if the cluster
feature for empty replica pages is set, it will return the value
configured via db::config::query_tombstone_limit. System queries always
use a limit of 0 (unlimited tombstones).
2022-08-09 10:00:40 +03:00
Botond Dénes
8cd2ef7a42 query: add tombstone_limit type
Will be used in read_command. Add it before it is added to read-command
so we can use the unlimited constant in code added in preparation to
that.
2022-08-09 10:00:40 +03:00
Botond Dénes
33f0447ba0 db/config: add config item for query tombstone limit
This will be the value used to break pages, after processing the
specified amount of tombstones. The page will be cut even if empty.
We could maybe use the already existing tombstone_{warn,fail}_threshold
instead and use them as a soft/hard limit pair, like we did with page
sizes.
2022-08-09 10:00:40 +03:00
Botond Dénes
1bc14b5e3b gms: add cluster feature for empty replica pages
So we can start using them only when the entire cluster supports it.
2022-08-09 10:00:40 +03:00
Botond Dénes
60a0e3d88b tree: don't use query::read_command's IDL constructor
It is not type safe: has multiple limits passed to it as raw ints, as
well as other types that ints implicitly convert to. Furthermore the row
limit is passed in two separate fields (lower 32 bits and upper 32
bits). All this make this constructor a minefield for humans to use. We
have a safer constructor for some time but some users of the old one
remain. Move them to the safe one.
2022-08-09 10:00:37 +03:00
Tomasz Grabiec
05b0a62132 test: mvcc: Fix illegal use of maybe_refresh()
maybe_refresh() can only be called if the cursor is pointing at a row.
The code was calling it before the cursor was advanced, and was
thus relying on implementation detail.
2022-08-09 02:28:56 +02:00
Tomasz Grabiec
ce624048d9 tests: row_cache_test: Add test_eviction_of_upper_bound_of_population_range()
Reproducer for #11239.
2022-08-09 02:28:56 +02:00
Tomasz Grabiec
6aaa6f8744 tests: row_cache_test: Introduce one_shot mode to throttle 2022-08-09 02:28:56 +02:00
Tomasz Grabiec
a6a61eaf96 row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy
Scenario:

cache = [
    row(pos=2, continuous=false),
    row(pos=after(2), dummy=true)
]

Scanning read starts, starts populating [-inf, before(2)] from sstables.

row(pos=2) is evicted.

cache = [
    row(pos=after(2), dummy=true)
]

Scanning read finishes reading from sstables.

Refreshes cache cursor via
partition_snapshot_row_cursor::maybe_refresh(), which calls
partition_snapshot_row_cursor::advance_to() because iterators are
invalidated. This advances the cursor to
after(2). no_clustering_row_between(2, after(2)) returns true, so
advance_to() returns true, and maybe_refresh() returns true. This is
interpreted by the cache reader as "the cursor has not moved forward",
so it marks the range as complete, without emitting the row with
pos=2. Also, it marks row(pos=after(2)) as continuous, so later reads
will also miss the row.

The bug is in advance_to(), which is using
no_clustering_row_between(a, b) to determine its result, which by
definition excludes the starting key.

Discovered by row_cache_test.cc::test_concurrent_reads_and_eviction
with reduced key range in the random_mutation_generator (1024 -> 16).

Fixes #11239
2022-08-09 02:28:56 +02:00
Takuya ASADA
3ffc978166 main: move preinit_description to main()
We don't need to wait for handling version options after scylla_main()
called, we can handle it in main() instead.

Closes #11221
2022-08-08 18:31:43 +03:00
Benny Halevy
91ab8ee1c3 effective_replication_map: make get_range_addresses asynchronous
So it may yield, preenting reactor stalls as seen in #11005.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
9b2af3f542 range_streamer: add_ranges and friends: get erm as param
Rather than getting it in the callee, let the caller
(e.g.  storage_service)
hold the erm and pass it down to potentially multiple
async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
194b9af8d6 storage_service: get_new_source_ranges: get erm as param
Rather than getting it in the callee, let the caller
hold the erm and pass it down to potentially multiple
async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
b50c79eab3 storage_service: get_changed_ranges_for_leaving: get erm as param
Rather than getting it in the callee, let the caller
hold the erm and pass it down to potentially multiple
async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
a5d7ade237 storage_service: get_ranges_for_endpoint: get erm as param
Let its caller Pass the effective_replication_map ptr
so we can get it at the top level and keep it alive
(and coherent) through multiple asynchronous calls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
cffe00cc58 repair: use get_non_local_strategy_keyspaces_erms
Use get_non_local_strategy_keyspaces_erms for getting
a coherent set of keyspace names and their respective
effective replication strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
db5c5ca59e database: add get_non_local_strategy_keyspaces_erms
To be used for getting a coheret set of all keyspaces
with non-local replication strategy and their respective
effective_replication_map.

As an example, use it in this patch in
storage_service::update_pending_ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
7ee6048255 database: add get_non_local_strategy_keyspaces
For node operations, we currently call get_non_system_keyspaces
but really want to work on all keyspace that have non-local
replication strategy as they are replicated on other nodes.

Reflect that in the replica::database function name.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
d8484b3ee6 storage_service: coroutinize update_pending_ranges
Before we make a change in getting the
keyspaces and their effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
e541009f65 effective_replication_map: add get_replication_strategy
And use it in storage_service::get_changed_ranges_for_leaving.

A following patch will pass the e_r_m to
storage_service::get_changed_ranges_for_leaving, rather than
getting it there.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
6794e15163 effective_replication_map: get_range_addresses: use the precalculated replication_map
There is no need to call get_natural_endpoints for every token
in sorted_tokens order, since we can just get the precalculated
per-token endpoints already in the _replication_map member.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
1d4aea4441 abstract_replication_strategy: get_pending_address_ranges: prevent extra vector copies
Reduce large allocations and reactor stalls seen in #11005
by open coding `get_address_ranges` and using std::vector::insert
to efficiently appending the ranges returned by `get_primary_ranges_for`
onto the returned token_range_vector in contrast to building
an unordered_multimap<inet_address, dht::token_range> first in
`get_address_ranges` and traversing it and adding one token_range
at a time.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
7811b0d0aa abstract_replication_strategy: reindent 2022-08-08 17:31:00 +03:00
Benny Halevy
ebe1edc091 utils: sequenced_set: expose set and contains method
And use that in sights using the endpoint set
returned by abstract_replication_strategy::calculate_natural_endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
7017ad6822 abstract_replication_strategy: calculate_natural_endpoints: return endpoint_set
So it could be used also for easily searching for an endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
38934413d4 utils: sequenced_set: templatize VectorType
Se we can use basic_sequenced_set<T, std::small_vector<T, N>>
as locator::endpoint_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
df380c30b9 utils: sanitize sequenced_set
And templatize its Vector type so it can be used
with a small_vector for inet_address_vector_replica_set.

Mark the methods const/noexcept as needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Benny Halevy
57d9275d4a utils: sequenced_set: delete mutable get_vector method
It is dangerous to use since modifying the
sequenced_set vector will make it go out of sync
with the associated unordered_set member, making
the object unusable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:00 +03:00
Yaron Kaikov
2fe2306efb configure.py: add date-stamp parameter
When starting `Build` job we have a situation when `x86` and `arm`
starting in different dates causing the all process to fail

As suggested by @avikivity , adding a date-stamp parameter and will pass
it through downstream jobs to get one release for each job

Ref: scylladb/scylla-pkg#3008

Closes #11234
2022-08-08 17:28:38 +03:00
Anna Stuchlik
7d4770c116 doc: remove version-speific content for Ubuntu and add the generic page to the toctree 2022-08-08 16:18:20 +02:00
Anna Stuchlik
eb60e5757a doc: rename the file to include Ubuntu 2022-08-08 16:12:02 +02:00
Anna Stuchlik
011e2fad60 doc: remove the version number from the document and add the link to Supported Versions 2022-08-08 16:11:14 +02:00
Anna Stuchlik
83c08ac5fa doc: add a generic page for Ubuntu 2022-08-08 16:04:59 +02:00
Avi Kivity
871127f641 Update tools/java submodule
* tools/java ad6764b506...6995a83cc1 (1):
  > dist/debian: drop upgrading from scylla-tools < 2.0
2022-08-08 16:51:14 +03:00
Botond Dénes
49c00fa989 Merge 'Define strong uuid-class types for table_id, table_schema_version and query_id' from Benny Halevy
We would like to define more distinct types
that are currently defined as aliases to utils::UUID to identify
resources in the system, like table id and schema
version id.

As with counter_id, the motivation is to restrict
the usage of the distinct types so they can be used
(assigned, compared, etc.) only with objects of
the same type.  Using with a generic UUID will
then require explicit conversion, that we want to
expose.

This series starts with cleaning up the idl header definition
by adding support for `import` and `include` statements in the idl-compiler.
These allow the idl header to become self-sufficient
and then remove manually-added includes from source files.
The latter usually need only the top level idl header
and it, in turn, should include other headers if it depends on them.

Then, a UUID_class template was defined as a shared boiler plate
for the various uuid-class.  First, we convert counter_id to use it,
rather than mimicking utils::UUID on its own.

On top of utils::UUID_class<T>, we define table_id,
table_schema_version, and query_id.

Following up on this series, we should define more commonly used
types like: host_id, streaming_plan_id, paxos_ballot_id.

Fixes #11207

Closes #11220

* github.com:scylladb/scylladb:
  query-request, everywhere: define and use query_id as a strong type
  schema, everywhere: define and use table_schema_version as a strong type
  schema, everywhere: define and use table_id as a strong type
  schema: include schema_fwd.hh in schema.hh
  system_keyspace: get_truncation_record: delete unused lambda capture
  utils: uuid: define appending_hash<utils::tagged_uuid<Tag>>
  utils: tagged_uuid: rename to_uuid() to uuid()
  counters: counter_id: use base class create_random_id
  counters: base counter_id on utils::tagged_uuid
  utils: tagged_uuid: mark functions noexcept
  utils: tagged_uuid: bool: reuse uuid::bool operator
  raft: migrate tagged_id definition to utils::tagged_uuid
  utils: uuid: mark functions noexcept
  counters: counter_id delete requirement for triviality
  utils: bit_cast: require TriviallyCopyable To
  repair: delete unused include of utils/bit_cast.hh
  bit_cast: use std::bit_cast
  idl: make idl headers self-sufficient
  db: hints: sync_point: do not include idl definition file
  db/per_partition_rate_limit: tidy up headers self-sufficiency
  idl-compiler: include serialization impl and visitors in generated dist.impl.hh files
  idl-compiler: add include statements
  idl_test: add a struct depending on UUID
2022-08-08 13:20:40 +03:00
Benny Halevy
c71ef330b2 query-request, everywhere: define and use query_id as a strong type
Define query_id as a tagged_uuid
So it can be differentiated from other uuid-class types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:13:28 +03:00
Benny Halevy
2b017ce285 schema, everywhere: define and use table_schema_version as a strong type
Define table_schema_version as a distinct tagged_uuid class,
So it can be differentiated from other uuid-class types,
in particular table_id.

Added reversed(table_schema_version) for convenience
and uniformity since the same logic is currently open coded
in several places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:45 +03:00
Benny Halevy
257d74bb34 schema, everywhere: define and use table_id as a strong type
Define table_id as a distinct utils::tagged_uuid modeled after raft
tagged_id, so it can be differentiated from other uuid-class types,
in particular from table_schema_version.

Fixes #11207

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:09:41 +03:00
Benny Halevy
26aacb328e schema: include schema_fwd.hh in schema.hh
And remove repeated definitions and forward declarations
of the same types in both places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:28 +03:00
Benny Halevy
6e77ad9392 system_keyspace: get_truncation_record: delete unused lambda capture
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:28 +03:00
Benny Halevy
a390b8475b utils: uuid: define appending_hash<utils::tagged_uuid<Tag>>
And simplify usage for appending_hash<counter_shard_view>
respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:28 +03:00
Benny Halevy
8235cfdf7a utils: tagged_uuid: rename to_uuid() to uuid()
To make it more generic, similar to other uuid() get
methods we have.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
813cffc2b5 counters: counter_id: use base class create_random_id
Rather than defining generate_random,
and use respectively in unit tests.
(It was inherited from raft::internal::tagged_id.)

This allows us to shorten counter_id's definition
to just using utils::tagged_uuid<struct counter_id_tag>.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
e9cc24bc18 counters: base counter_id on utils::tagged_uuid
Use the common base class for uuid-based types.

tagged_uuid::to_uuid defined here for backward
compatibility, but it will be renamed in the next patch
to uuid().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
082d5efca8 utils: tagged_uuid: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
1b78f8ba82 utils: tagged_uuid: bool: reuse uuid::bool operator
utils::UUID defined operator bool the same way,
rely on it rather than reimplementing it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
6436c614d7 raft: migrate tagged_id definition to utils::tagged_uuid
So it can be used for other types in the system outside
of raft, like counter_id, table_id, table_schema_version,
and more.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
f0567ab853 utils: uuid: mark functions noexcept
Before we define a tagged_uuid template.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
ea91ccaa20 counters: counter_id delete requirement for triviality
This stemmed from utils/bit_cast overly strict requirement.
Now that it was relaxed, these is no need for this static assert
as counter_id is trivially copyable, and that is checked
by bit_cast {read,write}_unaligned

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
c68e929b80 utils: bit_cast: require TriviallyCopyable To
Like std::bit_cast (https://en.cppreference.com/w/cpp/numeric/bit_cast)
we only require To to be trivially copyable.

There's no need for it to be a trivial type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
2948a4feb6 repair: delete unused include of utils/bit_cast.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
79000bc02e bit_cast: use std::bit_cast
Now that scylla requries c++20 there's no
need to define our own implementation in utils/bit_cast.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
1fda686f96 idl: make idl headers self-sufficient
Add include statements to satisfy dependencies.

Delete, now unneeded, include directives from the upper level
source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
cfc7e9aa59 db: hints: sync_point: do not include idl definition file
idl definition files are not intended for direct
inclusion in .cc files.

Data types it represents are supposed to be defined
in regular C++ header, so define them in db/hints/scyn_point.hh
and include it rather then idl/hinted_handoff.idl.hh.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
82fa205723 db/per_partition_rate_limit: tidy up headers self-sufficiency
include what's needed where needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
83811b8e35 idl-compiler: include serialization impl and visitors in generated dist.impl.hh files
They are generally required by the serialization implementation.
This will simplify using them without having to hand pick
what header to include in the .cc file that includes them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
da4f0aae37 idl-compiler: add include statements
For generating #include directives in the generated files,
so we don't have to hand-craft include the dependencies
in the right order.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Benny Halevy
4f275a17b4 idl_test: add a struct depending on UUID
For testing the next change which adds
import and include statements to the idl language.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Avi Kivity
ba42852b0e Merge 'Overhaul truncate and snapshot' from Benny Halevy
This series is aimed at fixing #11132.

To get there, the series untangles the functions that currently depend on the the cross-shard coordination in table::snapshot,
namely database::truncate and consequently database::drop_column_family.

database::get_table_on_all_shards is added here as a helper to get a foreign shared ptr of the the table shard from all shards,
and it is later used by multiple functions to truncate and then take a snapshot of the sharded table.

database::truncate_table_on_all_shards is defined to orchestrate the truncate process end-to-end, flushing or clearing all table shards before taking a snapshot if needed, using the newly defined table::snapshot_on_all_shards, and by that leaving only the discard_sstables job to the per-shard database::truncate function.

The latter, snapshot_on_all_shards, orchestrates the snapshot process on all shards - getting rid of the per-shard table::snapshot function (after refactoring take_snapshot and finalize_snapshot out of it), and the associated dreaded data structures: snapshot_manager and pending_snapshots.

Fixes #11132.

Closes #11133

* github.com:scylladb/scylladb:
  table: reindent write_schema_as_cql
  table: coroutinize write_schema_as_cql
  table: seal_snapshot: maybe_yield when iterating over the table names
  table: reindent seal_snapshot
  table: coroutinize seal_snapshot
  table: delete unused snapshot_manager and pending_snapshots
  table: delete unused snapshot function
  table: snapshot_on_all_shards: orchestrate snapshot process
  table: snapshot: move pending_snapshots.erase from seal_snapshot
  table: finalize_snapshot: take the file sets as a param
  table: make seal_snapshot a static member
  table: finalize_snapshot: reindent
  table: refactor finalize_snapshot out of snapshot
  table: snapshot: keep per-shard file sets in snapshot_manager
  table: take_snapshot: return foreign unique ptr
  table: take_snapshot: maybe yield in per-sstable loop
  table: take_snapshot: simplify tables construction code
  table: take_snapshot: reindent
  table: take_snapshot: simplify error handling
  table: refactor take_snapshot out of snapshot
  utils: get rid of joinpoint
  database: get rid of timestamp_func
  database: truncate: snapshot table in all-shards layer
  database: truncate: flush table and views in all-shards layer
  database: truncate: stop and disable compaction in all-shards layer
  database: truncate: move call to set_low_replay_position_mark to all-shards layer
  database: truncate: enter per-shard table async_gate in all-shards layer
  database: truncate: move check for schema_tables keyspace to all-shards layer.
  database: snapshot_table_on_all_shards: reindent
  table: add snapshot_on_all_shards
  database: add snapshot_table_on_all_shards
  database: rename {flush,snapshot}_on_all and make static
  database: drop_table_on_all_shards: truncate and stop table in upper layer
  database: drop_table_on_all_shards: get all table shards before drop_column_family on each
  database: drop_column_family: define table& cf
  database: drop_column_family: reuse uuid for evict_all_for_table
  database: drop_column_family: move log message up a layer
  database: truncate: get rid of the unused ks param
  database: add truncate_table_on_all_shards
  database: drop_table_on_all_shards: do not accept a truncated_at timestamp_func
  database: truncate: get optional snapshot_name from caller
  database: truncate: fix assert about replay_position low_mark
  database_test: apply_mutation on the correct db shard
2022-08-07 19:15:42 +03:00
Benny Halevy
45ce635527 table: reindent write_schema_as_cql
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
3b2cce068a table: coroutinize write_schema_as_cql
and make sure to always close the output stream.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
dbae7807d1 table: seal_snapshot: maybe_yield when iterating over the table names
Add maybe_yield calls in tight loop, potentially
over thousands of sstable names to prevent reactor stalls.
Although the per-sstable cost is very small, we've experienced
stalls realted to printing in O(#sstables) in compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
3ba0c72b77 table: reindent seal_snapshot
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
41a2d09a5d table: coroutinize seal_snapshot
Handle exceptions, making sure the output
stream is properly closed in all cases,
and an intermediate error, if any, is returned as the
final future.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
5316dbbe78 table: delete unused snapshot_manager and pending_snapshots
Now that snapshot orchestration in snapshot_on_all_shards
doesn't use snapshot_manager, get rid of the data structure.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
cca9068cfb table: delete unused snapshot function
Now that snapshot orchestration is done solely
in snapshot_on_all_shards, the per-shard
snapshot function can be deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
351a3a313d table: snapshot_on_all_shards: orchestrate snapshot process
Call take_snapshot on each shard and collect the
returns snapshot_file_set.

When all are done, move the vector<snapshot_file_set>
to finalize_snapshot.

All that without resorting to using the snapshot_manager
nor calling table::snapshot.

Both will deleted in the following patches.

Fixes #11132

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
84dfd2cabb table: snapshot: move pending_snapshots.erase from seal_snapshot
Now that seal_snapshot doesn't need to lookup
the snapshot_manager in pending_snapshots to
get to the file_sets, erasing the snapshot_manager
object can be done in table::snapshot which
also inserted it there.

This will make it easier to get rid of it in a later patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
39276cacc3 table: finalize_snapshot: take the file sets as a param
and pass it to seal_snapshot, so that the latter won't
need to lookup and access the snapshot_manager object.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
4dd56bbd6d table: make seal_snapshot a static member
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
7cb0a3f6f4 table: finalize_snapshot: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
12716866a9 table: refactor finalize_snapshot out of snapshot
Write schema.cql and the files manifest in finalize_snapshot.
Currently call it from table::snapshot, but it will
be called in a later patch by snapshot_on_all_shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
240f83546d table: snapshot: keep per-shard file sets in snapshot_manager
To simplify processing of the per-shard file names
for generating the manifest.

We only need to print them to the manifest at the
end of the process, so there's no point in copying
them around in the process, just move the
foreign unique unordered_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
5100c1ba68 table: take_snapshot: return foreign unique ptr
Currently copying the sstable file names are created
and destroyed on each shard and are copied by the
"coordinator" shards using submit_to, while the
coroutine holds the source on its stack frame.

To prepare for the next patches that refactor this
code so that the coordinator shard will submit_to
each shard to perform `take_snapshot` and return
the set of sstrings in the future result, we need
to wrap the result in a foreign_ptr so it gets
freed on the shard that created it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
b54626ad0e table: take_snapshot: maybe yield in per-sstable loop
There could be thousands of sstables so we better
cosider yielding in the tight loop that copies
the sstable names into the unordered_set we return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
24a1a4069e table: take_snapshot: simplify tables construction code
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
75e38ebccc table: take_snapshot: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
67c1d00f44 table: take_snapshot: simplify error handling
Don't catch exception but rather just return
them in the return future, as the exception
is handled by the caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
ff6508aa53 table: refactor take_snapshot out of snapshot
Do the actual snapshot-taking code in a per-shard
take_snapshot function, to be called from
snapshot_on_all_shards in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
37b7a9cce2 utils: get rid of joinpoint
Now that it is no longer used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
56f336d1aa database: get rid of timestamp_func
Pass an optional truncated_at time_point to
truncate_table_on_all_shards instead of the over-complicated
timestamp_func that returns the same time_point on all shards
anyhow, and was only used for coordination across shards.

Since now we synchronize the internal execution phase in
truncate_table_on_all_shards, there is no longer need
for this timestamp_func.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
b640c4fd17 database: truncate: snapshot table in all-shards layer
With that the database layer does no longer
need to invoke the private table::snapshot function,
so it can be defriended from class table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
af0c71aa12 database: truncate: flush table and views in all-shards layer
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
6e07e6b7ac database: truncate: stop and disable compaction in all-shards layer
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
e78dad1dfb database: truncate: move call to set_low_replay_position_mark to all-shards layer
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
a8bd3d97b6 database: truncate: enter per-shard table async_gate in all-shards layer
Start moving the per-shard state establishment logic
to truncate_table_on_all_shards, so that we would evetually
do only the truncate logic per-se in the per-shard truncate function.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
ff028316f2 database: truncate: move check for schema_tables keyspace to all-shards layer.
Now that the per-shard truncate function is called
only from truncate_table_on_all_shards, we can reject the schema_tables
keyspace in the upper layer.  There's no need to check that on each shard.

While at it, reuse `is_system_keyspace`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
fbe1fa1370 database: snapshot_table_on_all_shards: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
4d4ca40c38 table: add snapshot_on_all_shards
Called from the respective database entry points.

Will be called also from the database drop / truncate path
and will be used for central coordination of per-shard
table::snapshot so we don't have to depend on the snapshot_manager
mechanism that is fragile and currently causes abort if we fail
to allocate it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
be56a73e78 database: add snapshot_table_on_all_shards
We need to snapshot a single table in several paths.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
d96b56fee2 database: rename {flush,snapshot}_on_all and make static
Follow the convention of drop_table_on_all_shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
a1eed1a6e9 database: drop_table_on_all_shards: truncate and stop table in upper layer
truncate the table on all shards then stop it on shards
in the upper layer rather than in the per-shard drop_column_family()
function, so we can further refactor truncate later, flushing
and taking snapshot on all shards, before truncating.

With that, rename drop_column_family to detach_columng_family
as now it only deregisters the column family from containers
that refer to it (even via its uuid) and then its caller
is reponsible to take it from there.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
92cb7d448b database: drop_table_on_all_shards: get all table shards before drop_column_family on each
Se we the upper layer can flush, snapshot, and truncate
the table on all shards, step by step.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
0aaaefbb5c database: drop_column_family: define table& cf
To reduce the churn in the following patch
that will pass the table& as a parameter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
bb1e5ffb8c database: drop_column_family: reuse uuid for evict_all_for_table
cf->schema()->id() is the same one returned
by find_uuid(ks_name, cf_name);

As a follow up, we should define a concrete
table_id type and rename schema::id() to schema::table_id()
to return it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
e800e1e720 database: drop_column_family: move log message up a layer
Print once on "coordinator" shard.

And promote to info level as it's important to log
when we're dropping a table (and if we're going to take a snapshot).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
ca78a63873 database: truncate: get rid of the unused ks param
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
46e2a7c83b database: add truncate_table_on_all_shards
As a first step to decouple truncate from flush
and snpashot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
5e8c05f1a8 database: drop_table_on_all_shards: do not accept a truncated_at
timestamp_func

Since in the drop_table case we want to discard ALL
sstables in the table, not only those with `max_data_age()`
up until drop started.

Fixes #11232

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:52:51 +03:00
Benny Halevy
574909c78f database: truncate: get optional snapshot_name from caller
Before we change drop_table_on_all_shards to always
pass db_clock::time_point::max() in the next patch,
let it pass a unique snapshot name, otherwise
the snapshot name will always be based on the constant, max
time_point.

Refs #11232

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:03:19 +03:00
Benny Halevy
474b2fdf37 database: truncate: fix assert about replay_position low_mark
This assert was tweaked several times:
Introduced in 83323e155e,
then fixed in b2b1a1f7e1 to account
for no rp from discard_sstables, then in
9620755c7f to account for
cases we do not flush the table, then again in
71c5dc82df to make that more accurate.

But, the assert wasn't correct in the first place
in the sense that we first get `low_mark` which
represents the highest replay_position at the time truncate
was called, but then we call discard_sstables with a time_point
of `truncated_at` that we get from the caller via the timestamp_func,
and that one could be in the past, before truncate was called -
hence discard_sstables with that timestamp may very well
return a replay_position from older sstables, prior to flush
that can be smaller than the low_mark.

Fix this assert to account for that case.

The real fix to this issue is to have a truncate_tombstone
that will carry an authoritative api::timstamp (#11230)

Fixes #11231

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 09:18:06 +03:00
Benny Halevy
9f5e13800d database_test: apply_mutation on the correct db shard
Following up on 1c26d49fba,
apply mutations on the correct db shard in all test cases
before we define and use database::truncate_table_on_all_shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 09:18:06 +03:00
Tomasz Grabiec
7f80602b01 db: range_tombstone_list: Avoid quadratic behavior when applying
Range tombstones are kept in memory (cache/memtable) in
range_tombstone_list. It keeps them deoverlapped, so applying a range
tombstone which covers many range tombstones will erase existing range
tombstones from the list. This operation needs to be exception-safe,
so range_tombstone_list maintains an undo log. This undo log will
receive a record for each range tombstone which is removed. For
exception safety reasons, before pushing an undo log entry, we reserve
space in the log by calling std::vector::reserve(size() + 1). This is
O(N) where N is the number of undo log entries. Therefore, the whole
application is O(N^2).

This can cause reactor stalls and availability issues when replicas
apply such deletions.

This patch avoids the problem by reserving exponentially increasing
amount of space. Also, to avoid large allocations, switches the
container to chunked_vector.

Fixes #11211

Closes #11215
2022-08-05 20:34:07 +03:00
Kamil Braun
d84a93d683 Merge 'Raft test topology part 1' from Alecco
These are the first commits out of #10815.

It starts by moving pytest logic out of the common `test/conftest.py`
and into `test/topology/conftest.py`, including removing the async
support as it's not used anywhere else.

There's a fix of a bug of leaving tables in `RandomTables.tables` after
dropping all of them.

Keyspace creation is moved out of `conftest.py` into `RandomTables` as
it makes more sense and this way topology tests avoid all the
workarounds for old version (topology needs ScyllaDB 5+ for Raft,
anyway).

And a minor fix.

Closes #11210

* github.com:scylladb/scylladb:
  test.py: fix type hint for seed in ScyllaServer
  test.py: create/drop keyspace in tables helper
  test.py: RandomTables clear list when dropping all tables
  test.py: move topology conftest logic to its own
  test.py: async topology tests auto run with pytest_asyncio
2022-08-05 17:56:16 +02:00
Anna Stuchlik
d48ae5a9e0 doc: add the upgrade guide from 5.0 to 2022.1 on Ubuntu 2022.1 2022-08-05 17:49:01 +02:00
Warren Krewenki
4178ccd27f gossiper: Correct typo in log message
Closes #11212
2022-08-05 18:21:36 +03:00
Alejo Sanchez
ec70e26f12 test.py: fix type hint for seed in ScyllaServer
Param seed can be None (e.g. first server) so fix type hint accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Alejo Sanchez
1d7789e5a9 test.py: create/drop keyspace in tables helper
Since all topology test will use the helper, create the keyspace in the
helper.

Avoid the need of dropping all tables per test and just drop the
keyspace.

While there, use blocking CQL execution so it can be used in the
constructor and avoids possible issues with scheduling on cleanup. Also,
creation and drop should happen only once per cluster and no test should
be running changes (either not started or finished).

All topology tests are for Scylla with Raft. So don't use the Cassandra
this_dc workaround as it's unnecessary for Scylla.

Remove return type of random_tables fixture to match other fixtures
everywhere else.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Alejo Sanchez
9a019628f5 test.py: RandomTables clear list when dropping all tables
Clear the list of active tables when dropping them.

While there do the list element exchange atomically across active and
removed tables lists.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Alejo Sanchez
f6aa0d7bd7 test.py: move topology conftest logic to its own
Move asyncio, Raft checks, and RandomTables to topology test suite's own
conftest file.

While there, use non-async version of pre-checks to avoid unnecessary
complexity (we want async tests, not async setup, for now).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Alejo Sanchez
f665779cdb test.py: async topology tests auto run with pytest_asyncio
Async tests and fixtures in the topology directory are expected to run
with pytest_asyncio (not other async frameworks). Force this with auto
mode.

CI has an older pytest_asyncio version lacking pytest_asyncio.fixture.
Auto mode helps avoiding the need of it and tests and fixtures can just
be marked with regular @pytest.mark.async.

This way tests can run in both older and newer versions of the packages.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-08-05 13:05:26 +02:00
Botond Dénes
fbbe2529c1 Merge "Remove global snitch usage from consistency_level.cc" from Pavel Emelyanov
"
There are several helpers in this .cc file that need to get datacenter
for endpoints. For it they use global snitch, because there's no other
place out there to get that data from.

The whole dc/rack info is now moving to topology, so this set patches
the consistency_level.cc to get the topology. This is done two ways.
First, the helpers that have keyspace at hand may get the topology via
ks's effective_replication_map.

Two difficult cases are db::is_local() and db.count_local_endpoints()
because both have just inet_address at hand. Those are patched to be
methods of topology itself and all their callers already mess with
token metadata and can get topology from it.
"

* 'br-consistency-level-over-topology' of https://github.com/xemul/scylla:
  consistency_level: Remove is_local() and count_local_endpoints()
  storage_proxy: Use topology::local_endpoints_count()
  storage_proxy: Use proxy's topology for DC checks
  storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
  storage_proxy: Use topology local_dc_filter in its methods
  storage_proxy: Mark some digest_read_resolver methods private
  forwarding_service: Use topology local_dc_filter
  storage_service: Use topology local_dc_filter
  consistency_level: Use topology local_dc_filter
  consitency-level: Call count_local_endpoints from topology
  consistency_level: Get datacenter from topology
  replication_strategy: Remove hold snitch reference
  effective_replication_map: Get datacenter from topology
  topology: Add local-dc detection shugar
2022-08-05 13:31:55 +03:00
Anna Stuchlik
4bc7833a0b doc: update the link to CQL3 type mapping on GitHub
Closes #11224
2022-08-05 13:21:29 +03:00
Pavel Emelyanov
c3718b7a6e consistency_level: Remove is_local() and count_local_endpoints()
No code uses them now -- switched to use topology -- so thse two can be
dropped together with their calls for global snitch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
9c662ee0e5 storage_proxy: Use topology::local_endpoints_count()
A continuation of the previous patches -- now all the code that needs
this helper have proxy pointer at hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
9a50d318b6 storage_proxy: Use proxy's topology for DC checks
Several proxy helper classes need to filter endpoints by datacenter.
Since now the have shared_ptr<proxy> on-board, they can get topology
via proxy's token metadata

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
183a2d5a83 storage_proxy: Keep shared_ptr<proxy> on digest_read_resolver
It will be needed to get token metadata from proxy. The resolver in
question is created and maintained by abstract_read_executor which
already has shared_ptr<proxy>, so it just gives its copy

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:48 +03:00
Pavel Emelyanov
e1ea801b67 storage_proxy: Use topology local_dc_filter in its methods
The proxy has token metadata pointer, so it can use its topology
reference to filter endpoints by datacenter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
6f515f852d storage_proxy: Mark some digest_read_resolver methods private
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
9a19414c62 forwarding_service: Use topology local_dc_filter
The service needs to filter out non-local endpoints for its needs. The
service carries token metadata pointer and can get topology from it to
fulfill this goal

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
2423e1c642 storage_service: Use topology local_dc_filter
The storage-service API calls use db::is_local() helper to filter out
tokens from non-local datacenter. In all those places topology is
available from the token metadata pointer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
0da8caba1d consistency_level: Use topology local_dc_filter
The filter_for_query() helper has keyspace at hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
de58b33eee consitency-level: Call count_local_endpoints from topology
Similar to previous patch, in those places with keyspace object at
hand the topology can be obtained from ks' replication map

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
f84ee8f0fb consistency_level: Get datacenter from topology
In some of db/consistency_level.cc helpers the topology can be
obtained from keyspace's effective replication map

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:47 +03:00
Pavel Emelyanov
00f166809e replication_strategy: Remove hold snitch reference
When the strategy is constructed there's no place to get snitch from
so the global instance is used. However, after previous patch the
replication strategy no longer needs snitch, so this dependency can
be dropped

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:43 +03:00
Pavel Emelyanov
298213f27f effective_replication_map: Get datacenter from topology
Now it gets it from snitch, but the dc/rack info is being relocated
onto topology. The topology is in turn already there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-05 12:19:31 +03:00
Calle Wilund
fac2bc41ba commitlog: Include "segments_to_replay" in initial footprint
Fixes #11184

Not including it here can cause our estimate of "delete or not" after replay
to be skewed in favour of retaining segments as (new) recycles (or even flip
a counter), and if we have repeated crash+restarts we could be accumulating
an effectivly ever increasing segment footprint

Closes #11205
2022-08-05 12:16:53 +03:00
Pavel Emelyanov
527b345079 Merge 'storage_proxy: introduce a remote "subservice"' from Kamil Braun
Introduce a `remote` class that handles all remote communication in `storage_proxy`: sending and receiving RPCs, checking the state of other nodes by accessing the gossiper, and fetching schema.

The `remote` object lives inside `storage_proxy` and right now it's initialized and destroyed together with `storage_proxy`.

The long game here is to split the initialization of `storage_proxy` into two steps:
- the first step, which constructs `storage_proxy`, initializes it "locally" and does not require references to `messaging_service` and `gossiper`.
- the second step will take those references and add the `remote` part to `storage_proxy`.

This will allow us to remove some cycles from the service (de)initialization order and in general clean it up a bit. We'll be able to start `storage_proxy` right after the `database` (without messaging/gossiper). Similar refactors are planned for `query_processor`.

Closes #11088

* github.com:scylladb/scylladb:
  service: storage_proxy: pass `migration_manager*` to `init_messaging_service`
  service: storage_proxy: `remote`: make `_gossiper` a const reference
  gms: gossiper: mark some member functions const
  db: consistency_level: `filter_for_query`: take `const gossiper&`
  replica: table: `get_hit_rate`: take `const gossiper&`
  gms: gossiper: move `endpoint_filter` to `storage_proxy` module
  service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager`
  service: storage_proxy: establish private section in `remote`
  service: storage_proxy: remove `migration_manager` pointer
  service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote`
  service: storage_proxy: remove `_gossiper` field
  alternator: ttl: pass `gossiper&` to `expiration_service`
  service: storage_proxy: move `truncate_blocking` implementation to `remote`
  service: storage_proxy: introduce `is_alive` helper
  service: storage_proxy: remove `_messaging` reference
  service: storage_proxy: move `connection_dropped` to `remote`
  service: storage_proxy: make `encode_replica_exception_for_rpc` a static function
  service: storage_proxy: move `handle_write` to `remote`
  service: storage_proxy: move `handle_paxos_prune` to `remote`
  service: storage_proxy: move `handle_paxos_accept` to `remote`
  service: storage_proxy: move `handle_paxos_prepare` to `remote`
  service: storage_proxy: move `handle_truncate` to `remote`
  service: storage_proxy: move `handle_read_digest` to `remote`
  service: storage_proxy: move `handle_read_mutation_data` to `remote`
  service: storage_proxy: move `handle_read_data` to `remote`
  service: storage_proxy: move `handle_mutation_failed` to `remote`
  service: storage_proxy: move `handle_mutation_done` to `remote`
  service: storage_proxy: move `handle_paxos_learn` to `remote`
  service: storage_proxy: move `receive_mutation_handler` to `remote`
  service: storage_proxy: move `handle_counter_mutation` to `remote`
  service: storage_proxy: remove `get_local_shared_storage_proxy`
  service: storage_proxy: (de)register RPC handlers in `remote`
  service: storage_proxy: introduce `remote`
2022-08-04 17:50:20 +03:00
Alejo Sanchez
97f0e11c3a test.py: handle properly pytest ouput file for CQL tests
Previously, if pytest itself failed (e.g. bad import or unexpected
parameter), there was no output file but test.py tried to copy it and
failed.

Change the logic of handling the output file to first check if the
file is there. Then if it's worth keeping it, *move* it to the test
directory for easier comparison and maintenance. Else, if it's not worth
keeping, discard it.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11193
2022-08-04 16:48:53 +02:00
Pavel Emelyanov
cf0f912e59 cdc: Handle sleep-aborted exception on stop
When update_streams_description() fails it spawns a fiber and retries
the update in the background once every 60s. If the sleeping between
attempts is aborted, the respective exceptional future happens to be
ignored and warned in logs.

fixes: #11192

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220802132148.20688-1-xemul@scylladb.com>
2022-08-04 13:03:29 +02:00
Kamil Braun
0a4e701b50 service: storage_proxy: pass migration_manager* to init_messaging_service
`migration_manager` lifetime is longer than the lifetime of "storage
proxy's messaging service part" - that is, `init_messaging_service` is
called after `migration_manager` is started, and `uninit_messaging_service`
is called before `migration_manager` is stopped. Thus we don't need to
hold an owning pointer to `migration_manager` here.

Later, when `init_messaging_service` will actually construct `remote`,
this will be a reference, not a pointer.

Also observe that `_mm` in `remote` is only used in handlers, and
handlers are unregistered before `_mm` is nullified, which ensures that
handlers are not running when `_mm` is nullified. (This argument shows
why the code made sense regardless of our switch from shared_ptr to raw
ptr).
2022-08-04 12:19:43 +02:00
Kamil Braun
a08be82ce2 service: storage_proxy: remote: make _gossiper a const reference 2022-08-04 12:19:43 +02:00
Kamil Braun
a1aa9cf3f7 gms: gossiper: mark some member functions const 2022-08-04 12:19:43 +02:00
Kamil Braun
a9fd156a1b db: consistency_level: filter_for_query: take const gossiper& 2022-08-04 12:19:38 +02:00
Kamil Braun
7b4146dd2a replica: table: get_hit_rate: take const gossiper&
It doesn't use any non-const members.
2022-08-04 12:16:09 +02:00
Kamil Braun
566e5f2a4f gms: gossiper: move endpoint_filter to storage_proxy module
The function only uses one public function of `gossiper` (`is_alive`)
and is used only in one place in `storage_proxy`.

Make it a static function private to the `storage_proxy` module.

The function used a `default_random_engine` field in `gossiper` for
generating random numbers. Turn this field into a static `thread_local`
variable inside the function - no other `gossiper` members used the
field.
2022-08-04 12:16:09 +02:00
Kamil Braun
078900042f service: storage_proxy: pass shared_ptr<gossiper> to start_hints_manager
No need to call `_remote->gossiper().shared_from_this()` from within
storage_proxy now.
2022-08-04 12:16:09 +02:00
Kamil Braun
d9d10d87ec service: storage_proxy: establish private section in remote
Only the (un)init, send_*, and `is_alive` functions are public,
plus a getter for gossiper.
2022-08-04 12:16:05 +02:00
Kamil Braun
7364d453dd service: storage_proxy: remove migration_manager pointer
The ownership is passed to `remote`, which now contains
a `shared_ptr<migration_manager>`.
2022-08-04 12:15:36 +02:00
Kamil Braun
bcc22ed1dc service: storage_proxy: remove calls to storage_proxy::remote() from remote
Catch `this` in the lambdas.
2022-08-04 12:15:36 +02:00
Kamil Braun
eddd3b8226 service: storage_proxy: remove _gossiper field
Access `gossiper` through `_remote`.
Later, all those accesses will handle missing `remote`.

Note that there are also accesses through the `remote()` internal getter.

The plan is as follows:
- direct accesses through `_remote` will be modified to handle missing
  `_remote` (these won't cause an error)
- `remote()` will throw if `_remote` is missing (`remote()` is only used
  for operations which actually need to send a message to a remote node).
2022-08-04 12:15:35 +02:00
Kamil Braun
ab946e392f alternator: ttl: pass gossiper& to expiration_service
This allows us to remove the `gossiper()` getter from `storage_proxy`.
2022-08-04 12:12:43 +02:00
Kamil Braun
242e31d56e service: storage_proxy: move truncate_blocking implementation to remote
The truncate operation always truncates a table on the entire cluster,
even for local tables. And it always does it by sending RPCs (the node
sends an RPC to itself too). Thus it fits in the remote class.

If we want to add a possibility to "truncate locally only" and/or change
the behavior for local tables, we can add a branch in
`storage_proxy::truncate_blocking`.

Refs: #11087
2022-08-04 12:12:43 +02:00
Kamil Braun
3e73de9a40 service: storage_proxy: introduce is_alive helper
A helper is introduced both in `remote` and in `storage_proxy`.

The `storage_proxy` one calls the `remote` one. In the future it will
also handle a missing `remote`. Then it will report only the local node
to be alive and other nodes dead while `remote` is missing.

The change reduces the number of functions using the `_gossiper` field
in `storage_proxy`.
2022-08-04 12:12:41 +02:00
Jenkins Promoter
0ce19e7812 release: prepare for 5.2.0-dev 2022-08-04 13:09:55 +03:00
Botond Dénes
df203a48af Merge "Remove reconnectable_snitch_helper" from Pavel Emelyanov
"
The helper is in charge of receiving INTERNAL_IP app state from
gossiper join/change notifications, updating system.peers with it
and kicking messaging service to update its preferred ip cache
along with initiating clients reconnection.

Effectively this helper duplicates the topology tracking code in
storage-service notifiers. Removing it makes less code and drops
a bunch of unwanted cross-components dependencies, in particular:

- one qctx call is gone
- snitch (almost) no longer needs to get messaging from gossiper
- public:private IP cache becomes local to messaging and can be
  moved to topology at low cost

Some nice minor side effect -- this helper was left unsubscribed
from gossiper on stop and snitch rename. Now its all gone.
"

* 'br-remove-reconnectible-snitch-helper-2' of https://github.com/xemul/scylla:
  snitch: Remove reconnectable snitch helper
  snitch, storage_service: Move reconnect to internal_ip kick
  snitch, storage_service: Move system.peers preferred_ip update
  snitch: Export prefer-local
2022-08-04 13:06:05 +03:00
Anna Stuchlik
532aa6e655 doc: update the links to Manager and Operator
Closes #11196
2022-08-04 11:38:39 +03:00
Avi Kivity
785ea869fb Merge 'tools/scylla-sstable: introduce the write operation' from Botond Dénes
Implementing json2sstable functionality. It allows generating an sstable from a JSON description of its content. Uses identical schema to dump-data, so it is possible to regenerate an existing sstable, by feeding the output of dump-data to write.
Most of the scylla storage engine features are supported. The only non-supported features are counters and non-strictly atomic data types (including frozen collections, tuples and UDTs).

Example invocation:
```
scylla sstable write --system-schema system_schema.columns --input-file ./input.json --generation 0
```

Refs: https://github.com/scylladb/scylladb/issues/9681

Future plans:
* Complete support for remaining features (counters and non-atomic types).
* Make sstable format configurable on the command line.

Closes #11181

* github.com:scylladb/scylladb:
  test/cql-pytest: test_tools.py: add test for sstable write
  test/cql-pytest: test-tools.py actually test with multiple sstables
  test/cql-pytest: test_tools.py: reduce the number of test-cases
  tools/scylla-sstable: introduce the write operation
  tools/scylla-sstable: add support for writer operations
  tools/scylla-sstable: dump-data: write bound-weight as int
  tools/scylla-sstable: dump-data: always write deletion time for cell tombstones
  tools/scylla-sstable: dump-data: add timezone to deletion_time
  types: publish timestamp_from_string()
2022-08-03 19:18:31 +03:00
Wojciech Mitros
64c03a2d24 wasm: fix compilation without libwasmtime
Some segments of code using wasmtime were not under an
ifdef SCYLLA_ENABLE_WASMTIME, making Scylla unable to compile
on machines without wasmtime. This patch adds the ifdef where
needed.

Closes #11200
2022-08-03 18:16:02 +03:00
Anna Stuchlik
143455d7ac doc: rewording 2022-08-03 16:58:29 +02:00
Anna Stuchlik
f2af63ddd5 doc: update the links to fix the warnings 2022-08-03 15:12:41 +02:00
Anna Stuchlik
1d61550c64 doc: add the new page to the toctree 2022-08-03 15:03:48 +02:00
Anna Stuchlik
756b9a278f doc: add the descrption of specifying workload attributes with service levels 2022-08-03 14:57:50 +02:00
Raphael S. Carvalho
5757cc5160 mutation_reader_merger: fix indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220803003010.11551-1-raphaelsc@scylladb.com>
2022-08-03 14:33:07 +03:00
Anna Stuchlik
2fa175a819 doc: add the definition of workloads to the glossary 2022-08-03 13:31:07 +02:00
Piotr Dulikowski
4f2adc14de db/system_keyspace: fix indentation after previous patch 2022-08-03 13:19:19 +02:00
Piotr Dulikowski
eff8a6368c db/system_keyspace: in system.local, use broadcast_rpc_address in rpc_address column
Previously, the `system.local`'s `rpc_address` column kept local node's
`rpc_address` from the scylla.yaml configuration. Although it sounds
like it makes sense, there are a few reasons to change it to the value
of scylla.yaml's `broadcast_rpc_address`:

- The `broadcast_rpc_address` is the address that the drivers are
  supposed to connect to. `rpc_address` is the address that the node
  binds to - it can be set for example to 0.0.0.0 so that Scylla listens
  on all addresses, however this gives no useful information to the
  driver.
- The `system.peers` table also has the `rpc_address` column and it
  already keeps other nodes' `broadcast_rpc_address`es.
- Cassandra is going to do the same change in the upcoming version 4.1.

Fixes: #11201
2022-08-03 13:19:03 +02:00
Botond Dénes
19441881bc test/cql-pytest: test_tools.py: add test for sstable write
We can now do a full circle: dump an sstable to json, generate an
sstable from it, then dump again and compare to the original json.
Expand the existing simple_no_clustering_table and
simple_clustering_table schema/data to improve coverage of things like
TTL, tombstones and static rows.
2022-08-03 14:00:50 +03:00
Botond Dénes
5d5c3b3fe3 test/cql-pytest: test-tools.py actually test with multiple sstables
The test-cases in this suite have a parameter to run with one or
multiple input sstables. This was broken as each test table generated a
single sstable. Fix this so we actually get single/multiple input
sstable coverage.
2022-08-03 14:00:50 +03:00
Botond Dénes
bd772d095f test/cql-pytest: test_tools.py: reduce the number of test-cases
Currently this test-case exercises all the available component dumpers
with many different schemas. This doesn't add any value for most of the
dumpers, save for the dump-data one. It does have a cost however in
run-time of these test-cases. Test the dumpers which are mostly
indifferent to the schema with just a single one, cutting the number of
generated test-cases from 70 to 30.
2022-08-03 14:00:50 +03:00
Botond Dénes
d0eaa72bd7 tools/scylla-sstable: introduce the write operation
Allows generating an sstable based on a JSON description of its content.
Uses identical schema to dump-data, so it is possible to regenerate an
existing sstable, by feeding the output of dump-data to write.
Most of the scylladb storage engine features is supported, with the
exception of the following:
* counters
* non-strictly atomic types, including frozen collections, tuples or
  UDTs.
2022-08-03 14:00:02 +03:00
Botond Dénes
4377be30ba tools/scylla-sstable: add support for writer operations
Currently it is assumed that all operations read sstables. They get a
non-empty list of sstables as input and have no means to create
sstable-writers.
We want to add support for operations that write sstables. For this, we
relax the current top-level check about the sstable list not being
empty. We defer this empty-check for operations that actually need input
sstables. Furthermore, the operation_func gains an sstable_manager&
argument, to allow operations to create sstable writers.
Operations are now read-write capable.

In addition to the above the documentation language is adjusted to not
assume read-only operations.
2022-08-03 13:49:22 +03:00
Botond Dénes
87443d2da0 tools/scylla-sstable: dump-data: write bound-weight as int
No reason for it to be witten as string, the documentation even says it
is an integer.
2022-08-03 13:49:22 +03:00
Botond Dénes
ef786f9b85 tools/scylla-sstable: dump-data: always write deletion time for cell tombstones
Said field is not optional for dead cells - it is mandatory for all
tombstones, including cell tombstones.
2022-08-03 13:49:22 +03:00
Botond Dénes
833ed03533 tools/scylla-sstable: dump-data: add timezone to deletion_time
Deletion time is always in UTC but whoever looks at the JSON has no way
to know that. In particular date-time parsers assume local timezone in
its absence which of course results incorrect deletion_time after
parsing.
2022-08-03 13:49:17 +03:00
Takuya ASADA
d7dfd0a696 main: run --version before app_template initialize
Even on the environment which causes error during initalize Scylla,
"scylla --version" should be able to run without error.
To do so, we need to parse and execute these options before
initializing Scylla/Seastar classes.

Fixes #11117

Closes #11179
2022-08-03 11:25:28 +03:00
Kamil Braun
2aff2fea00 service: storage_proxy: remove _messaging reference
All uses of `messaging_service&` have been moved to `remote`.
2022-08-02 19:55:12 +02:00
Kamil Braun
cf931c7863 service: storage_proxy: move connection_dropped to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
2203d4fa09 service: storage_proxy: make encode_replica_exception_for_rpc a static function
No need for this ugly template to be part of the `storage_proxy` header.
2022-08-02 19:55:12 +02:00
Kamil Braun
3499bc7731 service: storage_proxy: move handle_write to remote
It is a helper used by `receive_mutation_handler`
and `handle_paxos_learn`.
2022-08-02 19:55:12 +02:00
Kamil Braun
ba88ad8db0 service: storage_proxy: move handle_paxos_prune to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
548767f91e service: storage_proxy: move handle_paxos_accept to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
807c7f32de service: storage_proxy: move handle_paxos_prepare to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
0e431e7c03 service: storage_proxy: move handle_truncate to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
f8c1ba357f service: storage_proxy: move handle_read_digest to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
43997af40f service: storage_proxy: move handle_read_mutation_data to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
80586a0c7e service: storage_proxy: move handle_read_data to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
00c0ee44bd service: storage_proxy: move handle_mutation_failed to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
b9c436c6e0 service: storage_proxy: move handle_mutation_done to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
178536d5d2 service: storage_proxy: move handle_paxos_learn to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
f309886fac service: storage_proxy: move receive_mutation_handler to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
fad14d2094 service: storage_proxy: move handle_counter_mutation to remote 2022-08-02 19:55:12 +02:00
Kamil Braun
93325a220f service: storage_proxy: remove get_local_shared_storage_proxy
Its remaining uses are trivial to remove.

Note: in `handle_counter_mutation` we had this piece of code:
```
            }).then([trace_state_ptr = std::move(trace_state_ptr), &mutations, cl, timeout] {
                auto sp = get_local_shared_storage_proxy();
                return sp->mutate_counters_on_leader(...);
```

Obtaining a `shared_ptr` to `storage_proxy` at this point is
no different from obtaining a regular pointer:
- The pointer is obtained inside `then` lambda body, not in the capture
  list. So if the goal of obtaining a `shared_ptr` here was to keep
  `storage_proxy` alive until the `then` lambda body is executed, that
  goal wasn't achieved because the pointer was obtained too late.
- The `shared_ptr` is destroyed as soon as `mutate_counters_on_leader`
  returns, it's not stored anywhere. So it doesn't prolong the lifetime
  of the service.

I replaced this with a simple capture of `this` in the lambda.
2022-08-02 19:55:12 +02:00
Kamil Braun
5148eafbd6 service: storage_proxy: (de)register RPC handlers in remote 2022-08-02 19:55:12 +02:00
Kamil Braun
f174645ab5 service: storage_proxy: introduce remote
Move most accesses to `_messaging` to this struct
(functions that send RPCs).
2022-08-02 19:55:10 +02:00
Avi Kivity
a4844826fc Merge 'Decouple compaction manager from database' from Benny Halevy
Start compaction_manager as a sharded service
and pass a reference to it to the database rather
than having the database construct its own compaction_manager.

This is part of the wider scope effort to decouple compaction from replica database and table.

Closes #11099

* github.com:scylladb/scylladb:
  compaction_manager: perform_cleanup, perform_sstable_upgrade: use a lw_shared_ptr for owned token ranges
  compaction: cleanup, upgrade: use a lw_shared_ptr for owned token ranges
  main: start compaction_manager as a sharded service
  compaction_manager: keep config as member
  backlog_controller: keep scheduling_group by value
  backlog_controller: scheduling_group: keep io_priority_class by value
  backlog_controller: scheduling_group: define default member initializers
  backlog_controller: get rid of _interval member
2022-08-02 19:02:46 +03:00
Avi Kivity
6fd2496501 Merge 'token_metadata: keep the set of normal token owners as a member' from Benny Halevy
token_metadata: impl: keep the set of normal token owners as a member

We don't need to recalculate the unique set of normal token
everytime we change `_token_to_endpoint_map`.
Similarly, this doesn't have to be done in `get_all_endpoints`.

Instead we can maintain it inexpensively in
`remove_endpoint`, and let `count_normal_token_owners`
just return its size and `get_all_endpoints` just return
the saved set.

Closes #11128
Fixes #11146

Closes #11158

* github.com:scylladb/scylladb:
  token_metadata: allow update_normal_token_owners to yield
  token_metadata: get_all_endpoints: return const unordered_set<inet_address>&
  token_metadata: impl: keep the set of normal token owners as a member
2022-08-02 16:49:41 +03:00
Avi Kivity
665c85aefe Merge 'multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed but it is
pointless, if the partition has no more data, just drop the header, we
won't need it on the next page.

The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.

Fixes: https://github.com/scylladb/scylladb/issues/9482

Closes #11175

* github.com:scylladb/scylladb:
  test/cql-pytest: add regression test for "IDL frame truncated" error
  mutation_compactor: detach_state(): make it no-op if partition was exhausted
  querier: use full_position in shard_mutation_querier
2022-08-02 16:41:15 +03:00
Avi Kivity
e526facef2 Merge 'Fix undefined behavior during eviction' from Tomasz Grabiec
When the last non-dummy row is evicted from a partition, the partition
entry is evicted as well. The existing logic in on_evicted() leaves
the last dummy row in the partition version before evicting the
partition entry. This row may still be attached to the LRU. Eviction
of partition entry goes through mutation_cleaner::clear_gently(). If
this is preempted, the destruction may proceed in the background. If
evicition happens on the remaining row in that entry before it's
destroyed, the code will hit undefined behavior. on_evicted() calls
partition_version::is_referenced_from_entry(), which is unspecified
when the version is enqueued in the mutation_cleaner. It returns
incorrect value for the last item remaining in the LRU (middle entires evict fine).
In that case, eviction will try to access non-existent containing partition_entry,
causing undefined behavior.

Caught by debug-mode cql_query_test.test_clustering_filtering with
raft enabled. Where it manifested like this:

  partition_version.hh:328:16: runtime error: load of value 7, which is not a valid value for type 'bool'
  SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior partition_version.hh:328:16 in
  Aborting on shard 0.

Instances of this issue outside of the unit test environment are not
known as of yet.

This change makes is_referenced_from_entry() return the correct value
even for versions which are queued in the mutation cleaner.

Fixes https://github.com/scylladb/scylladb/issues/11140

The series also contains some related cleanups and minor fixes for issues which
could come up later.

Closes #11187

* github.com:scylladb/scylladb:
  cache_tracker: Make clear() leave no garbage
  partition_snapshot_row_cursor: Fix over-counting of rows
  row_cache: Fix undefined behavior during eviction under some conditions
2022-08-02 16:40:23 +03:00
Avi Kivity
268e4abe77 Merge 'wasm: reuse instances for wasm UDFs' from Wojciech Mitros
Calling WebAssembly UDFs requires wasmtime instance. Creating such an instance is expensive,
but these instances can be reused for subsequent calls of the same UDF on various inputs.

This patch introduces a way of reusing wasmtime instances: a wasm instance cache.
The cache stores a wasmtime instance for each UDF and scheduling group. The instances are
evicted using LRU strategy and their size is based on the size of their wasm memories.

The instances stored in the cache are also dropped when the UDF is dropped itself. For that reason,
the first patch modifies the current implementation of UDF dropping, so that the instance dropping may be added
later. The patch also removes the need of compiling the UDF again when dropping it.

The second patch contains the implementation and use of the new cache. The cache is implemented
in `lang/wasm_instance_cache.hh` and the main ways of using it are the `run_script` methods from `wasm.hh`

The third patch adds tests to `test_wasm.py` that check the correctness and performance of the new
cache. The tests confirm the instance reuse, size limits, instance eviction after timeout and after dropping the UDF.

Closes #10306

* github.com:scylladb/scylladb:
  wasm: test instances reuse
  wasm: reuse UDF instances
  schema_tables: simplify merge_functions and avoid extra compilation
2022-08-02 13:51:16 +03:00
Botond Dénes
38d0db4be5 Merge 'doc: remove the Manger documentation from the core ScyllaDB docs' from Anna Stuchlik
In this PR, I have:

- removed the docs for Manager  (including the sources for Manager 2.1 and the upgrade guides).
- added redirects to https://manager.docs.scylladb.com/.
- replaced the internal links with external links to https://manager.docs.scylladb.com/.

Closes #11162

* github.com:scylladb/scylladb:
  doc: update the link to fix the warning about duplicate targets
  Update docs/kb/gc-grace-seconds.rst
  Update docs/_utils/redirects.yaml
  doc: update the links to Manager
  doc: add the link to manager.docs.scylladb.com to the toctree
  doc: remove the docs for Manager - the Manager page, the guide for Manager 2.1, Manger upgrade guides
  doc: add redirections from Manager 2.1 to the Manager docs
  doc: add redirections to manager.docs.scylladb.com
2022-08-02 12:29:37 +03:00
Anna Stuchlik
cec54229fa doc: update the link to fix the warning about duplicate targets 2022-08-02 11:21:02 +02:00
Anna Stuchlik
780597b0f9 Update docs/kb/gc-grace-seconds.rst
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-08-02 11:21:02 +02:00
Anna Stuchlik
849cdd715b Update docs/_utils/redirects.yaml
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-08-02 11:20:59 +02:00
Anna Stuchlik
8cad0de042 doc: update the links to Manager 2022-08-02 11:18:16 +02:00
Anna Stuchlik
ba67dfeca6 doc: add the link to manager.docs.scylladb.com to the toctree 2022-08-02 11:15:14 +02:00
Anna Stuchlik
f72e16b013 doc: remove the docs for Manager - the Manager page, the guide for Manager 2.1, Manger upgrade guides 2022-08-02 11:14:07 +02:00
Anna Stuchlik
c9db3bd7ea doc: add redirections from Manager 2.1 to the Manager docs 2022-08-02 11:12:10 +02:00
Anna Stuchlik
3b5add05a7 doc: add redirections to manager.docs.scylladb.com 2022-08-02 11:12:05 +02:00
Tomasz Grabiec
4c33d1650d cache_tracker: Make clear() leave no garbage
Prremption during partition entry eviciton could put it in the
mutation cleaner.

No known issues caused by this. Affects only tests.
2022-08-02 11:02:22 +02:00
Tomasz Grabiec
a58fee1dcf partition_snapshot_row_cursor: Fix over-counting of rows
insert_before() may need to allocate memory for a btree, so may
fail. Call cache_tracker::insert() only after successful instance so
that row counters reflect the correct state. On failure, the entry
will be unlinked automatically by rows_entry destructor, but row
counters in the cache_tracker will not be automatically decremented.
2022-08-02 11:02:22 +02:00
Benny Halevy
0dfd92d0b3 token_metadata: allow update_normal_token_owners to yield
Given #11146, we see a 10ms stall when calculate_natural_endpoints
calls get_all_endpoints that up until this patch performed a
similar loop on the `_token_to_endpoint_map`, so to prevent such
a stall with large number of tokens, turn update_normal_token_owners
async, and allow yielding in the per-token tight loop.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 10:49:32 +03:00
Benny Halevy
4f8ccef2c1 token_metadata: get_all_endpoints: return const unordered_set<inet_address>&
There's no need to transform it into a vector.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 10:49:08 +03:00
Benny Halevy
a980f94d85 token_metadata: impl: keep the set of normal token owners as a member
We don't need to recalculate the unique set of normal token
everytime we change `_token_to_endpoint_map`.
Similarly, this doesn't have to be done in `get_all_endpoints`.

Instead we can maintain it inexpensively in
`remove_endpoint`, and let `count_normal_token_owners`
just return its size and `get_all_endpoints` just return
the saved set.

Note that currently topology is not updated accurately
in update_normal_token() and it may contain endpoint
that do no longer own any tokens.

If we did update topology accurately there, we
could use its locations map instead as its keys are equivalent
to the unordered_set<inet_address> we implement here.

Closes #11128

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 10:49:07 +03:00
Botond Dénes
5ea6700e23 types: publish timestamp_from_string()
It looks like it is a better option for timestamp parsing than anything
current C++ stdlib can offer. What a pity.
2022-08-02 10:33:01 +03:00
Benny Halevy
14faa3b6f4 compaction_manager: perform_cleanup, perform_sstable_upgrade: use a lw_shared_ptr for owned token ranges
And completely get rid of the dependency on replica::database.

Also, add respective rest_api tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 08:08:11 +03:00
Benny Halevy
e1fe598760 compaction: cleanup, upgrade: use a lw_shared_ptr for owned token ranges
Currently they are copied for the get_sstables function
so this change reduces copies.

Also, it will allow further decoupling of compaction_manager
from replica::database, by letting the caller of
perform_cleanup and perform_sstable_upgrade get the
owned token ranges from db and pass it to the perform_*
functions in the following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:57:41 +03:00
Benny Halevy
e4e92d44ae main: start compaction_manager as a sharded service
And pass a reference to it to the database rather
than having the database construct its own compaction_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:50:15 +03:00
Benny Halevy
7f70949693 compaction_manager: keep config as member
Rather than keeping separate, duplicated members.

And define helpers to get those members.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:48:01 +03:00
Benny Halevy
c9a9720247 backlog_controller: keep scheduling_group by value
There is no need to keep a mutable reference to the
scheduling_group passed at construction time since
setting / updating shares is using the schedulig_group /
io_priority_class id as a handle, and the id itself is never
changed by the backlog_controller.

Note that the class names are misleading, in hind sight,
they would better be called scheduling_group_id
and io_priority_class_id, respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:38:40 +03:00
Benny Halevy
78ad1c70a2 backlog_controller: scheduling_group: keep io_priority_class by value
Exactly like the cpu scheduling_group, io_priority_class
contains the class id, which is a handle to the io_priority_class
and so can be kept by value, rather than by reference,
and be safely copied around.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:38:40 +03:00
Benny Halevy
450ecd60c6 backlog_controller: scheduling_group: define default member initializers
To prepare for the next patch, implement default initialization
of the scheduling_group and io_priority_class, to the default values.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:38:40 +03:00
Benny Halevy
3e6622180e backlog_controller: get rid of _interval member
It isn't used outside the constructor.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:38:40 +03:00
Botond Dénes
11af489e84 test/cql-pytest: add regression test for "IDL frame truncated" error 2022-08-02 06:43:24 +03:00
Botond Dénes
70b4158ce0 mutation_compactor: detach_state(): make it no-op if partition was exhausted
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.

This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
2022-08-02 06:43:24 +03:00
Botond Dénes
cdd3a364cb querier: use full_position in shard_mutation_querier
Instead of a separate partition key and position-in-partition.
This continues the recently started effort to standardize storing of
full positions on `full_position`.

This patch is also a hidden preparation for read_context::save_readers()
multishard_mutation_query.cc) no longer being able to get partition key
from compaction state in the future.
2022-08-02 06:43:24 +03:00
Botond Dénes
768a5c8b5a Merge 'doc: add the upgrade guide from 5.0 to 2022.1' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-docs/issues/4125

I've added the upgrade guides from 5.0 to 2022.1. They are based on the previous upgrade guides from Open Source to Enterprise.

Closes #11108

* github.com:scylladb/scylladb:
  doc: apply feedback about scylla-enterprise-machine-image
  doc: update the note about installing scylla-enterprise-machine-image
  update the info about installing scylla-enterprise-machine-image during upgrade
  doc: add the requirement to install scylla-enterprise-machine-image if the previous version was installed with an image
  doc: update the info about metrics in 2022.1 compared to 5.0
  doc: minor formatting and language fixes
  doc: add the new guide to the toctree
  doc: add the upgrade guide from 5.0 to 2022.1
2022-08-02 06:23:41 +03:00
Anna Stuchlik
8e0e603c48 doc: remove Drivers from getting startded index to avoid duplication and reflect the project structure
Closes #11163
2022-08-02 06:20:18 +03:00
Botond Dénes
d532fd7896 Merge 'doc: remove the Monitoring Stack documentation from the core ScyllaDB docs' from Anna Stuchlik
I created this branch to remove the external docs (Manager, Monitoring, Operator) from the core ScyllaDB documentation.
However, to make reviewing easier, this PR only covers removing the docs for ScyllaDB Monitoring Stack. I'm going to send other PRs to cover Manager and Operator.

In this PR, I have:
- removed the docs for ScyllaDB Monitoring Stack (including the sources for old versions).
- added redirects to https://monitoring.docs.scylladb.com/.
- replaced the internal links with external links to https://monitoring.docs.scylladb.com/.

Closes #11151

* github.com:scylladb/scylladb:
  doc: fix the link to the Monitoring Stack
  doc: fix the links in the manager section
  doc: add the external link to Monitoring Stack to the menu
  doc: replace the links to Monitoring Stack
  doc: add the redirections for Monitoring Stack
  doc: delete the Monitoring Stack documentation form the ScyllaDB docs and remove it from the toctree
2022-08-02 06:03:52 +03:00
Tomasz Grabiec
a459d9ab98 row_cache: Fix undefined behavior during eviction under some conditions
When the last non-dummy row is evicted from a partition, the partition
entry is evicted as well. The existing logic in on_evicted() leaves
the last dummy row in the partition version before evicting the
partition entry. This row may still be attached to the LRU. Eviction
of partition entry goes through mutation_cleaner::clear_gently(). If
this is preempted, the destruction may proceed in the background. If
evicition happens on the remaining row in that entry before it's
destroyed, the code will hit undefined behavior. on_evicted() calls
partition_version::is_referenced_from_entry(), which is unspecified
when the version is enqueued in the mutation_cleaner. It returns
incorrect value for the last item remaining in the LRU. In that case
eviction will try to access non-existent containing partition_entry,
causing undefined behavior.

Caught by debug-mode cql_query_test.test_clustering_filtering with
raft enabled. Where it manifested like this:

  partition_version.hh:328:16: runtime error: load of value 7, which is not a valid value for type 'bool'
  SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior partition_version.hh:328:16 in
  Aborting on shard 0.

Instances of this issue outside of the unit test environment are not
known as of yet.

This change makes is_referenced_from_entry() return the correct value
even for versions which are queued in the mutation cleaner.

Fixes #11140
2022-08-01 23:53:15 +02:00
Raphael S. Carvalho
934af9be52 mutation_reader_merger: Drop unneeded readers as soon as possible
Today, mutation_reader_merger drops unneeded readers in batches of 4,
meaning that the merger is having to keep the memory used by 3
unneeded readers in addition to the ones being currently read from.
As each may own a lot of memory, the combined effect of this waste,
coming from parallel reads, can potentially cause memory pressure.

This batching behavior was introduced in b524f96a74,
when readers had to be destroyed synchronously, as flat_mutation_reader
lacked an async close interface. But we have gone a long way since
then. Readers can be closed asynchronously and outstanding I/O
requests will be cancelled on close.

Now, we'll close readers as soon they're uneeded, one at a time,
using a continuation chain. If we're submitting close calls faster
than we can retire them, then we wait for their completion,
preventing memory usage from growing unbounded.

The benefit of this new approach will be very good when combining
disjoint readers, where only one is active at a time for producing
fragments. As soon as we're done with the current one, then it will
be closed allowing its memory to be released, before we move on
to the next reader that follows.

Refs #11040.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11167
2022-08-01 20:06:29 +03:00
Anna Stuchlik
f7269d0f3b doc: update the description of vitrual tables on the Enterprise Features page
Closes #11097
2022-08-01 17:52:43 +03:00
Benny Halevy
edd308c705 config: use ordered map for experimental features
So that the help string will be sorted lexicographically.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11178
2022-08-01 17:40:10 +03:00
Piotr Sarna
dd2417618e forward_service: limit the number of partition ranges fetched
The forward service uses a vector of ranges owned by a particular
shard in order to split and delegate the work. The number can
grow large though, which can cause large allocations.
This commit limits the number of ranges handled at a time to 256.

Fixes #10725

Closes #11182
2022-08-01 17:36:34 +03:00
Benny Halevy
663f2e2a8f Update seastar submodule
* seastar 1d4432ed28...f9f5228b74 (33):
  > intent: drop unused headers
  > resource: Improve incorrect --smp option handling
  > util/log: make the width shard_id field fixed in log message
  > Update building-docker.md
  > batch_flush: Replace circular buffer with slist
  > linux-aio: Sanitize get_user_data helpers
  > pipe: add missing return in pipe's operator='s
  > Merge 'core, rpc: silence couple warnings from GCC-12' from Kefu Chai
  > tls: vec_push: handle synchronous error from put

Fixes #11118

  > seastar/rpc: add fmt::ostream_formatter<> for rpc::connection_id
  > test: io_queue_test: remove unused lambda capture
  > Merge "Split oversized requests" from Pavel E
  > test: Add test for over-sized request submission
  > io_queue: Add AIO stats for requests splitting
  > io_queue: Remove capped ticket making
  > io_queue: Relax ticket making
  > io_queue: Split oversized request on submission
  > io_request: Add .split(size_t max_length) method
  > io_queue: Keep the iovec memory on io-desc
  > io_queue: Add io_direction_and_length::read/write aliases
  > io_request: Simplify io_direction_and_length manipulations
  > reactor: Move submit_io_...() into io_queue
  > file, reactor: Use iovec_len()
  > utils: Put iovec manipulation helpers into util
  > rpc: init all member variables
  > core/simple-stream: do not qualify return type with "const"
  > core, rpc: do not pass unused parameters
  > reactor: Check the io-properties being YAML::Map
  > rpc: Fix formatting on some fmt lib versions
  > Merge "Make RPC server connection negotiation synchronous" from Pavel E
  > rpc: Fix indentation after previous patch
  > rpc: Make server::connection::negotiate synchronous

Fixes #10950

  > doc: fix redundant double wording in tutorial.md

Closes #11176
2022-08-01 17:06:28 +03:00
Anna Stuchlik
4204fc3096 doc: apply feedback about scylla-enterprise-machine-image 2022-08-01 14:35:24 +02:00
Anna Stuchlik
9fe7aa5c9a doc: update the note about installing scylla-enterprise-machine-image 2022-08-01 14:22:45 +02:00
Piotr Sarna
d4abb73389 Merge 'scrub compaction: count validation errors...
and return status over the rest api' from Aleksandra Martyniuk

Currently, scrub returns to user the number indicating operation
result as follows:
- 1 when the operation was aborted;
- 3 in validate and segregate modes when validation errors were found
  (and in segregate mode - fixed);
- 0 if operation ended successfully.

To achieve so, if an operation was aborted in abort mode, then
the exception is propagated to storage_service.cc. Also the number
of validation errors for current scrub is gathered and summed
from each shard there.

The number of validation errors is counted and registered in metrics.
Metrics provide common counters for all scrub operation within
a compaction manager, though. Thus, to check the exact number
of validation errors, the comparison of counter value before and after
scrub operation needs to be done.

Closes #11074

* github.com:scylladb/scylladb:
  scrub compaction: return status indicating aborted operations over the rest api
  test: move scylla_inject_error from alternator/ to cql-pytest/
  scrub compaction: count validation errors and return status over the rest api
  scrub compaction: count validation errors for specific scrub task
  compaction: extract statistics in compaction_result
  scrub compaction: register validation errors in metrics
  scrub compaction: count validation errors
2022-08-01 12:05:00 +02:00
Anna Stuchlik
d84d2e6faa Merge branch 'master' into anna-remove-external-docs 2022-08-01 11:50:03 +02:00
guy9
4d24097b4b adding Documentation website top banner options, with the current setting set to hide the banner
Closes #11172
2022-08-01 10:32:07 +03:00
Botond Dénes
2c4e06330d Merge "Remove _replicating_nodes and _removing_node" from Pavel Emelyanov
"
Commit 829b4c14 (repair: Make removenode safe by default) turned these
two to be read only (in fact, erase- and clear- from too).
"

* 'br-dangling-replicating-nodes' of https://github.com/xemul/scylla:
  storage_service: Relax confirm_replication()
  storage_service: Remove _removing_node
  storage_service: Remove _replicating_nodes
2022-08-01 10:25:40 +03:00
Tzach Livyatan
6088bdea91 Docs: Add more information about Raft v2
Closes #11057
2022-08-01 09:00:18 +03:00
Tzach Livyatan
33aa50e783 Docs: move the consistency calculator to a dedticate page
Closes #11149
2022-08-01 08:59:30 +03:00
Botond Dénes
8ea1ebdb88 Merge 'doc: remove the Operator documentation pages from the core ScyllaDB docs' from Anna Stuchlik
This PR removes the existing Operator documentation pages from the core ScyllaDB docs. I have:
- removed the Operator page and replaced it with the link to the Operator documentation.
- created a redirect.
- updated the links to the Operator.

Closes #11154

* github.com:scylladb/scylladb:
  Update docs/operating-scylla/index.rst
  doc: fix the link to Operator
  add the redirect to the Operator
  replace the internal links with the external link to Operator
  add the external link to Operator to the toctree
  doc: remove the Operator docs from the core documentation
2022-08-01 07:00:21 +03:00
Tzach Livyatan
520b9c88b7 Docs: Update the git repo path to scylladb/scylladb
Closes #11171
2022-07-31 15:32:09 +03:00
Pavel Emelyanov
29768a2d02 gitattributes: Mark *.svg as binary
The goal is to put .svg files under git grep's radar. Otherwise a
pretty innocent 'git grep db::is_local' dumps the contents of the
docs/kb/flamegraph.svg on the screen, because it a) contains the
grep pattern and b) is looooong one-liner

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220730090026.8537-1-xemul@scylladb.com>
2022-07-31 15:25:24 +03:00
Avi Kivity
00cec159d6 Revert "Merge 'multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes"
This reverts commit c3bad157e5, reversing
changes made to e66809d051. The checks it
adds are triggered by some dtests. While it's possible the check is
triggered due to an existing problem, better to investigate it out-of-tree.

Fixes #11169.
2022-07-31 15:24:33 +03:00
Pavel Emelyanov
ee0828b506 topology: Add local-dc detection shugar
It's often needed to check if an endpoint sits in the same DC as the
current node. It can be done by

    topo.get_datacenter() == topo.get_datacenter(endpoint)

but in some cases a RAII filter function can be helpful.

Also there's a db::count_local_endpoints() that is surprisingly in use,
so add it to topology as well. Next patches will make use of both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-30 17:58:45 +03:00
Pavel Emelyanov
22fdc03b71 storage_service: Relax confirm_replication()
This method is called from REPLICATION_FINISHED handler and now just
logs a message. The verb is probably worth keeping for compatibility
at least for some time. The logging itself can be moved into handler's
lambda

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-29 11:47:37 +03:00
Pavel Emelyanov
c8f9d1237f storage_service: Remove _removing_node
This optional is always disengaged

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-29 11:47:11 +03:00
Pavel Emelyanov
4d08554a92 storage_service: Remove _replicating_nodes
The set in question is read-and-ease-only

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-29 11:45:42 +03:00
Aleksandra Martyniuk
6ea5bc96d7 scrub compaction: return status indicating aborted operations
over the rest api

Performing compaction scrub user did not know whether an operation
was aborted.

If compaction scrub is aborted, return status the user gets over
rest api is set to 1.
2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk
8e892426e2 test: move scylla_inject_error from alternator/ to cql-pytest/
Move scylla_inject_error from alternator/ to cql-pytest/ so it
can be reached from various tests dirs. alternator/util.py is
renamed to alternator/alternator_util.py to avoid name shadowing.
2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk
f1980f8dc6 scrub compaction: count validation errors and return status over the rest api
Performing compaction scrub user did not know whether any validation
errors were encountered.

The number of validation errors per given compaction scrub is gathered
and summed from each shard. Basing on that value return status over
the rest api is set to 3 if any validation errors were encountered.
2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk
7d457cffb8 scrub compaction: count validation errors for specific scrub task
The number of validation errors per given compaction scrub on given
shard is passed up to perform_task() function.
2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk
3a805a9d9b compaction: extract statistics in compaction_result
Statistics from compaction_result are extracted to new struct
compaction_stats and stored as a field of compaction_result.
2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk
a80c187b20 scrub compaction: register validation errors in metrics
The number of validation errors is registered in metrics. Metrics
provide common counters for all scrub operation within a compaction
manager, though. Thus, to check the exact number of validation errors,
the comparison of counters before and after scrub operation needs
to be done.
2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk
ab85dab05d scrub compaction: count validation errors
The number of validation errors encountered during scrub compaction
is counted.
2022-07-29 09:35:20 +02:00
Anna Stuchlik
5699da2357 Merge branch 'scylladb:master' into anna-remove-external-docs 2022-07-29 09:28:23 +02:00
Benny Halevy
cf47db2bdb token_metadata: document that update_normal_tokens is unsafe
Currently, if token_metadata_impl::update_normal_tokens
throws an exception before it's done, it leaves the
token_metadata_impl members partially updated
and we have no way of recovering from that.

The existing use cases take that into account
and always call it on a cloned, temporary copy of the token
metadata, so if it throws, the temporary copy is tossed away
without being applied back.

So just cement this, by adding cautions in the token_metadata
class declaration.

Closes #11127

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220728144821.130518-1-bhalevy@scylladb.com>
2022-07-29 05:38:56 +03:00
Avi Kivity
c3bad157e5 Merge 'multishard_mutation_query: don't unpop partition header of spent partition' from Botond Dénes
When stopping the read, the multishard reader will dismantle the
compaction state, pushing back (unpopping) the currently processed
partition's header to its originating reader. This ensures that if the
reader stops in the middle of a partition, on the next page the
partition-header is re-emitted as the compactor (and everything
downstream from it) expects.
It can happen however that there is nothing more for the current
partition in the reader and the next fragment is another partition.
Since we only push back the partition header (without a partition-end)
this can result in two partitions being emitted without being separated
by a partition end.
We could just add the missing partition-end when needed but it is
pointless, if the partition has no more data, just drop the header, we
won't need it on the next page.

The missing partition-end can generate an "IDL frame truncated" message
as it ends up causing the query result writer to create a corrupt
partition entry.

Fixes: https://github.com/scylladb/scylla/issues/9482

Closes #11137

* github.com:scylladb/scylladb:
  test/cql-pytest: add regression test for "IDL frame truncated" error
  query: query_result_builder: add check for missing partition-end
  mutation_compactor: detach_state(): make it no-op if partition was exhausted
  querier: use full_position in shard_mutation_querier
2022-07-28 20:14:15 +03:00
Avi Kivity
e66809d051 Merge 'Memtable flush: wait for sstable count reduction if needed' from Benny Halevy
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if there
the bucket to compact is inflated, having too many
sstables.  In that case we don't want to add fuel
to the fire by creating yet another sstable.

Fixes #4116

Closes #10954

* github.com:scylladb/scylla:
  table: Add test where compaction doesn't keep up with flush rate.
  compaction_manager: add maybe_wait_for_sstable_count_reduction
  time_window_compaction_strategy: get_sstables_for_compaction: clean up code
  time_window_compaction_strategy: make get_sstables_for_compaction idempotent
  time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages
  leveled_manifest: pass compaction_counter as const&
2022-07-28 19:11:04 +03:00
Anna Stuchlik
4e2a41f53c Update docs/operating-scylla/index.rst
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-07-28 15:25:16 +02:00
Anna Stuchlik
2845b4e598 doc: fix the link to the Monitoring Stack 2022-07-28 15:23:26 +02:00
Anna Stuchlik
69bf768907 doc: fix the link to Operator 2022-07-28 15:18:03 +02:00
Anna Stuchlik
792d1412d6 add the redirect to the Operator 2022-07-28 15:16:29 +02:00
Anna Stuchlik
302da44859 replace the internal links with the external link to Operator 2022-07-28 15:13:03 +02:00
Anna Stuchlik
70b79c6867 add the external link to Operator to the toctree 2022-07-28 15:04:15 +02:00
Anna Stuchlik
966c3423ad doc: remove the Operator docs from the core documentation 2022-07-28 15:00:57 +02:00
Anna Stuchlik
63a8ef7030 doc: fix the links in the manager section 2022-07-28 14:54:30 +02:00
Anna Stuchlik
3e2ffaf91e doc: add the external link to Monitoring Stack to the menu 2022-07-28 14:44:07 +02:00
Avi Kivity
09a6b93ddf Merge 'logalloc: region: properly track listeners when moved' from Benny Halevy
Currently logalloc::region is relying on boost binomial_heap handle to properly move listeners registration when the region (when derived from dirty_memory_manager_logalloc::size_tracked_region) is moved, like boost::intrusive link hooks do -
hence 81e20ceaab/dirty_memory_manager.cc (L89-L90) does nothing.

Unfortunately, this doesn't work as expected.

This series adds a unit test that verifies the move semantics
and a fix to size_tracked_region and region_group code to make it pass.

Also "logalloc: region: get_impl might be called on disengaged _impl when moved"
fixes a couple corner cases where the shared _impl could be dereferenced when disengaged, and
the change also adds a unit test for that too.

Closes #11141

* github.com:scylladb/scylla:
  logalloc: region: properly track listeners when moved
  logalloc: region_impl: add moved method
  logalloc: region: merge: optimize getting other impl
  logalloc: region: merge: call region_impl::unlisten
  logalloc: region: call unlisten rather than open coding it
  logalloc: region move-ctor: initialize _impl
  logalloc: region: get_impl might be called on disengaged _impl when moved
2022-07-28 15:29:54 +03:00
Mikołaj Sielużycki
e0c6e1ef3c table: Add test where compaction doesn't keep up with flush rate.
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.

(cherry picked from commit b5684aa96d)
(cherry picked from commit 25407a7e41)
2022-07-28 14:43:33 +03:00
Benny Halevy
f26e655646 compaction_manager: add maybe_wait_for_sstable_count_reduction
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if there
the bucket to compact is inflated, having too many
sstables.  In that case we don't want to add fuel
to the fire by creating yet another sstable.

Fixes #4116

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 14:43:30 +03:00
Benny Halevy
69d4a16908 time_window_compaction_strategy: get_sstables_for_compaction: clean up code
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 14:22:03 +03:00
Benny Halevy
c450f3ee11 time_window_compaction_strategy: make get_sstables_for_compaction idempotent
To make sure fully_expired sstables are not missed
if get_sstables_for_compaction is called just heuristically,
change the state by setting _last_expired_check
to the current time only when no fully_expired_sstables are found
among the candidates.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 14:22:03 +03:00
Benny Halevy
3d07882431 time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages
Print the compaction_strategy `this` pointer
so we can distinguish between different instance of the
compaction_strategy object (some code paths copy it and
some may instantiate a branch new compaction_strategy object).

The motivation is detecting when the side effects of this function are
applied on the "master" instance, stored in the table shard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 14:22:03 +03:00
Benny Halevy
a149022ed4 leveled_manifest: pass compaction_counter as const&
It is not modified by the leveld_manifest functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 14:22:03 +03:00
Botond Dénes
11985bb173 Update tools/java submodule
* tools/java 1e7b872a61...ad6764b506 (1):
  > scylla-tools-java: Update "six" library used by cqlsh/python driver

Closes #11148
2022-07-28 13:43:21 +03:00
Anna Stuchlik
4b0ec11136 update the info about installing scylla-enterprise-machine-image during upgrade 2022-07-28 12:07:01 +02:00
Anna Stuchlik
30f564f0e6 doc: replace the links to Monitoring Stack 2022-07-28 11:53:07 +02:00
Anna Stuchlik
fc71f6facb doc: add the redirections for Monitoring Stack 2022-07-28 11:18:07 +02:00
Anna Stuchlik
0ce10be0a9 doc: delete the Monitoring Stack documentation form the ScyllaDB docs and remove it from the toctree 2022-07-28 11:09:43 +02:00
Nadav Har'El
cae1a41b30 Merge 'doc: update the README file in the docs directory' from Anna Stuchlik
The purpose of this PR is to update the README file in the `docs` folder to:
- Explain the contents of the folder (user docs vs developer docs).
- Add more information to help contributors.
- Remove outdated information.

Closes #11134

* github.com:scylladb/scylla:
  docs: remove outdated information -Vale support, Lint, warning about livereload
  doc: improve the section about knowledge base articles in README
  doc: replace distribution names with a generic phrase: Linux distributions
  doc: remove irrelevant guidelines for contributors from README
  doc: language improvements in the doc's README
  doc: reogrganize the content in the doc's README
  doc: update the Prerequisites section in the doc's README
  doc: remove redundant information from README in the docs folder
  doc: add key information to the introduction in README in the docs folder
2022-07-28 11:48:27 +03:00
Botond Dénes
26f1295536 Merge 'mutation: Ignore dummy rows when consuming clustering fragments' from Mikołaj Sielużycki
consume_clustering_fragments already ignores dummy rows, but does it in
the wrong place. Currently they're ignored after comparing them with
range tombstones. This change skips them before any useful work is done
with them.

Consider a simplified mutation reversal scenario scenario (ckp is
clustering key prefix, -1, 0, 1 are bound_weights):

schema_ptr s = schema_builder{"ks", "cf"}
    .with_column("pk", bytes_type, column_kind::partition_key)
    .with_column("ck1", bytes_type, column_kind::clustering_key)
    .build();

Input range tombstone positions:
    {clustered, ckp{}, before}
    {clustered, ckp{1}, after}

Clustering rows:
    {clustered, ckp{2}, equal}
    {clustered, ckp{}, after} // dummy row

During reversal, clustering rows are read backwards, and reversed range
tombstone positions are read forwards (because the range tombstones are
reversed and applied backwards). The read order in the example above is:

Reversed range tombstone positions:
    1: {clustered, ckp{}, before}
    2: {clustered, ckp{1}, before}

Clustering rows read backwards:
    3: {clustered, ckp{}, after} // dummy row
    4: {clustered, ckp{2}, equal}

Then we effectively do the merge part of merge sort, trying to put all
fragments in order according to their positions from the two lists
above. However, the dummy row is used in the comparison, and it compares
to be gt each of the reversed range tombstone positions. Then we
try to emit the clustering row, but only at that point we notice it's
dummy and should be skipped. Subsequent row with ckp{2} is compared to
the last used range tombstone position and the fragments are out of
order (in reversed schema, ckp{2} should come before ckp{1}).

The solution is to move the logic skipping the dummy clustering rows to
the beginning of the loop, so they can be ignored before they're used.

Fixes: https://github.com/scylladb/scylla/issues/11147

Closes #11129

* github.com:scylladb/scylla:
  mutation: Add test if mutations are consumed in order
  test: Move validating_consumer to test/lib/mutation_assertions.hh
  mutation: Ignore dummy rows when consuming clustering fragments
2022-07-28 11:18:36 +03:00
Benny Halevy
f6645313d8 logalloc: region: properly track listeners when moved
And add targeted unit tests for that.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 11:17:55 +03:00
Anna Stuchlik
da7f6cdec4 doc: add the requirement to install scylla-enterprise-machine-image if the previous version was installed with an image 2022-07-28 09:53:27 +02:00
Benny Halevy
1d9862dab3 logalloc: region_impl: add moved method
Don't open-code calling the region_impl
_listeners->moved() in region move-constructor
and move-assignment op.

The other._impl->_region might be different then &other
post region::merge so let the region_impl
decide which region* is moved from.

The new_region is also set to region_impl->_region
so need to open-code that either in the said call sites.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:49 +03:00
Benny Halevy
cd4dbb1cae logalloc: region: merge: optimize getting other impl
The other _impl is presumed to be engaged already,
so just call other.get_impl() once for both use cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:36 +03:00
Benny Halevy
a547cb79e8 logalloc: region: merge: call region_impl::unlisten
We can't be sure that the other_impl->_region == &other
since it could be a result of a previous merge,
so don't decide for it which region to unlisten to,
let it use its current _region.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:27 +03:00
Benny Halevy
003216de59 logalloc: region: call unlisten rather than open coding it
Current ~region and region::operator= open-code
region_impl::unlisten.  Just call it so it will be
easier to maintain.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:11 +03:00
Benny Halevy
cff953535c logalloc: region move-ctor: initialize _impl
There's no need to default-initialize it
and then move-assign it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:49:05 +03:00
Benny Halevy
c7d77e4076 logalloc: region: get_impl might be called on disengaged _impl when moved
First check if _impl is engaged before accessing it
to set its _region = this in the move constructor and
move assignment operator.

Add unit test for these odd orner cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 10:48:58 +03:00
Botond Dénes
079e425ef1 test/cql-pytest: add regression test for "IDL frame truncated" error 2022-07-28 09:02:28 +03:00
Botond Dénes
b23ce76b27 query: query_result_builder: add check for missing partition-end
If the reader feeding the result builder is missing a partition-end
between two partition, or at end-of-stream, the result builder will
write a corrupt partition-entry into the result, ending up in an
"IDL Frame truncated" error.
It is trivial to add a check for this and this will result in a much
more clear error message, then the mysterious frame truncated error
mentioned above.
2022-07-28 09:02:28 +03:00
Botond Dénes
f119554106 mutation_compactor: detach_state(): make it no-op if partition was exhausted
detach_state() allows the user to resume a compaction process later,
without having to keep the compactor object alive. This happens by
generating and returning the mutation fragments the user has to re-feed
to a newly constructed compactor to bring it into the exact same state
the current compactor was at the point of stopping the compaction.
This state includes the partition-header (partition-start and static-row
if any) and the currently active range tombstone.
Detaching the state is pointless however when the compaction was stopped
such that the currently compacted partition was completely exhausted.
Allowing the state to be detached in this case seems benign but it
caused a subtle bug in the main user of this feature: the partition
range scan algorithm, where the fragments included in the detached state
were pushed back into the reader which produced them. If the partition
happened to be exhausted -- meaning the next fragment in the reader was
a partition-start or EOS -- this resulted in the partition being
re-emitted later without a partition-end, resulting in corrupt
query-result being generated, in turn resulting in an obscure "IDL frame
truncated" error.

This patch solves this seemingly benign but sinister bug by making the
return value of `detach_state()` an std::optional and returning a
disengaged optional when the partition was exhausted.
2022-07-28 09:02:26 +03:00
Botond Dénes
afa694a20c querier: use full_position in shard_mutation_querier
Instead of a separate partition key and position-in-partition.
This continues the recently started effort to standardize storing of
full positions on `full_position`.

This patch is also a hidden preparation for read_context::save_readers()
multishard_mutation_query.cc) no longer being able to get partition key
from compaction state in the future.
2022-07-28 08:19:23 +03:00
Anna Stuchlik
82f96327d4 docs: remove outdated information -Vale support, Lint, warning about livereload 2022-07-27 22:07:40 +02:00
Botond Dénes
c54d19427d mutation_compactor: don't ignore consumer's stop request on range tombstone
Broken since the v2 output support was introduced (ad435dc).
No known adverse affects, besides mutation reads stopping a little later
than desired (on the next non-range-tombstone-change fragment) and hence
consuming more memory than the limit set for them.

Fixes: #11138

Closes #11139
2022-07-27 22:24:29 +03:00
Avi Kivity
2c0932cc41 Merge 'Reduce the amount of per-table metrics' from Amnon Heiman
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.

The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Multiple metrics should only be reported for the user tables and the condition that checked it was not updated when more non-user keyspaces were added.
2. Second, instead of a histogram, per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).

Closes #11058

* github.com:scylladb/scylla:
  Add summary_test
  database: Reduce the number of per-table metrics
  replica/table.cc: Do not register per-table metrics for system
  histogram_metrics_helper.hh: Add to_metrics_summary function
  Unified histogram, estimated_histogram, rates, and summaries
  Split the timed_rate_moving_average into data and timer
  utils/histogram.hh: should_sample should use a bitmask
  estimated_histogram: add missing getter method
2022-07-27 22:01:08 +03:00
Avi Kivity
2d4caa0134 Update tools/java submodule
* tools/java d0143b447c...1e7b872a61 (2):
  > scylla-tools-java: Update "six" library used by cqlsh/python driver
  > Add Scylla-specific table options to Option enum
Fixes scylladb/scylla#10856.
2022-07-27 21:41:18 +03:00
Avi Kivity
4438865a26 Merge 'memtable flush error handling' from Benny Halevy
The series unifies memtable flush error handling into table::seal_active_memtable
following up on f6d9d6175f.

The goal here is to prevent an infinite retry loop as in #10498
by aborting on any error that is not bad_alloc.

Fixes #10498

Closes #10691

* github.com:scylladb/scylla:
  test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
  test: memtable_test: failed_flush_prevents_writes: extend error injection
  table: seal_active_memtable: abort if retried for too long
  table: seal_active_memtable: abort on unexpected error
  table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
  dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
  dirty_memory_manager: flush_permit: add has_sstable_write_permit
  dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept
  dirty_memory_manager: flush_permit: make _sstable_write_permit optional
  table: reindent seal_active_memtable
  table: coroutinize seal_active_memtable
  memtable_list: mark functions noexcept
  commitlog: make discard_completed_segments and friends noexcept
  dirty_memory_manager: flush_when_needed: target error handling at flush_one
  database: delete unused seal_delayed_fn_type
  dirty_memory_manager: mark functions noexcept
  memtable: mark functions noexcept
  memtable: memtable_encoding_stats_collector: mark functions noexcept
  encoding_state: mark functions noexcept
  logalloc: mark free functions noexcept
  logalloc: allocating_section: mark functions noexcept
  logalloc: allocating_section: guard: mark constructor noexcept
  logalloc: reclaim_lock: mark functions noexcept
  logalloc: tracker_reclaimer_lock: mark constructor noexcept
  logalloc: mark shard_tracker noexcept
  logalloc: region: mark functions const/noexcept
  logalloc: basic_region_impl: mark functions noexcept
  logalloc: region_impl: mark functions noexcept
  utils: log_heap: mark functions noexcept
  logalloc: region_impl: object_descriptor: mark functions noexcept
  logalloc: region_group: mark functions noexcept
  logalloc: tracker: mark functions const/noexcept
  logalloc: tracker::impl: make region_occupancy and friends const
  logalloc: tracker::impl: occupancy: get rid of reclaiming_lock
  logalloc: tracker::impl: mark functions noexcept
  logalloc: segment: mark functions const / noexcept
  logalloc: segment_pool: add const variant of descriptor method
  logalloc: segment_pool: move descriptor method to class definition
  logalloc: segment_pool: mark functions const/noexcept
  logalloc: segment_pool: delete unused free_or_restore_to_reserve method
  utils: dynamic_bitset: mark functions noexcept
  utils: dynamic_bitset: delete unused members
  logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload
  logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const
  logalloc: segment_store: make can_allocate_more_segments const
  logalloc: segment_store: mark functions noexcept
  logalloc: segment_descriptor: mark functions noexcept
  logalloc: occupancy_stats: mark functions noexcept
  min_max_tracker: mark functions noexcept
  gc_clock, db_clock: mark functions noexcept
  dirty_memory_manager: region_group: mark functions noexcept
  dirty_memory_manager: region_group: make simple constructor noexcept
  dirty_memory_manager: region_group_reclaimer mark functions noexcept
  logalloc: lsa_buffer: mark functions noexcept
2022-07-27 19:08:59 +03:00
Amnon Heiman
3658aa9ec2 Add summary_test
This patch adds unit tests for the summary implementation.
2022-07-27 16:58:52 +03:00
Amnon Heiman
99a060126d database: Reduce the number of per-table metrics
This patch reduces the number of metrics that is reported per table, when
the per-table flag is on.

When possible, it moves from time_estimated_histogram and
timed_rate_moving_average_and_histogram to use the unified timer.

Instead of a histogram per shard, it will now report a summary per shard
and a histogram per node.

Counters, histograms, and summaries will not be reported if they were
never used.

The API was updated accordingly so it would not break.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Amnon Heiman
c31a58f2e9 replica/table.cc: Do not register per-table metrics for system
There is a set of per-table metrics that should only be registered for
user tables.
As time passes there are more keyspaces that are not for the user
keyspace and there is now a function that covers all those cases.

This patch replaces the implementation to use is_internal_keyspace.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Amnon Heiman
9a3e70adfb histogram_metrics_helper.hh: Add to_metrics_summary function
The to_metrics_summary is a helper function that create a metrics type
summary from a timed_rate_moving_average_with_summary object.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Amnon Heiman
c220e3a00f Unified histogram, estimated_histogram, rates, and summaries
Currently, there are two metrics reporting mechanisms: the metrics layer
and the API. In most cases, they use the same data sources. The main
difference is around histograms and rate.

The API calculates an exponentially weighted moving average using a
timer that decays the average on each time tick. It calculates a
poor-man histogram by holding the last few entries (typically the last
256 entries). The caller to the API uses those last entries to build a
histogram.

We want to add summaries to Scylla. Similar to the API rate and
histogram, summaries are calculated per time interval.

This patch creates a unified mechanism by introducing an object that
would hold both the old-style histogram and the new
(estimated_histogram). On each time tick, a summary would be calculated.
In the future, we'll replace the API to report summaries instead of the
old-style histogram and deprecate the old style completely.

summary_calculator uses two estimated_histogram to calculate a summary.

timed_rate_moving_average_summary_and_histogram is a unifed class for
ihistogram, rates, summary, and estimated_histogram and will replace
timed_rate_moving_average_and_histogram.

Follow-up patches would move code from using
timed_rate_moving_average_and_histogram to
timed_rate_moving_average_summary_and_histogram.  By keeping the API it
would make the transition easy.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:25 +03:00
Avi Kivity
a03a33dcaf Merge 'forward_service: reduce allocations in forward_service' from Piotr Sarna
This series refactors the code to get rid of unnecessary
allocations by extracing a helper requires_thread() function,
as well as by removing std::optional usage in forward_result,
now that it's possible to merge empty results with each other,
both ways (#11064).

Closes #11120

* github.com:scylladb/scylla:
  forward_service: remove redundant optional from forward_service
  forward_service: open-code running a Sestar thread
  forward_service: add requires_thread helper
2022-07-27 16:29:00 +03:00
Avi Kivity
71bec22117 Merge 'Speed up bootstrap with large number of tokens in the cluster 10X' from Asias He
=== Setup ===
1) start node1 with
```
scylla --num-tokens 20000 --smp 1
```
The large number of tokens per node is used to simulate large number of nodes in the cluster (large total number of tokens for the cluster).

2) start node2 with
```
scylla --num-tokens 20000 --smp 1
```
3) Measure the time to finish bootstrap

=== Result ===

1) With speed up patch:
```
node1 (16s)
INFO  2022-06-21 14:30:00,038 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 with build-id d78b6233e8227975cc26259280ceabf2cf7817b9 starting ...
INFO  2022-06-21 14:30:16,019 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 initialization completed.
node2 (bootstrap node,174s)
INFO  2022-06-21 14:30:40,954 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 with build-id d78b6233e8227975cc26259280ceabf2cf7817b9 starting ...
INFO  2022-06-21 14:33:34,899 [shard 0] init - Scylla version 5.1.dev-0.20220621.a7b927bda764 initialization completed.
```
2) Without speed up patch:
```
node1 (171s)
INFO  2022-06-21 14:38:49,065 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 with build-id f22bfa5a75887258ab48ee092ec49b5299365168 starting ...
INFO  2022-06-21 14:41:40,601 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 initialization completed.
node2 (bootstrap node, 1181s)
INFO  2022-06-21 14:41:46,997 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 with build-id f22bfa5a75887258ab48ee092ec49b5299365168 starting ...
INFO  2022-06-21 15:01:27,507 [shard 0] init - Scylla version 5.1.dev-0.20220621.6f4bfea99431 initialization completed.
```

The improvements for bootstrap time:

node1: 171s  / 16s = 10.68X
node2: 1181s / 174s = 6.78X

Refs #10337
Refs #10817
Refs #10836
Refs #10837

Closes #10850

* github.com:scylladb/scylla:
  locator: Speed up abstract_replication_strategy::get_address_ranges
  locator: Speed up simple_strategy::calculate_natural_endpoint
  token_metadata: Speed up count_normal_token_owners
2022-07-27 16:04:09 +03:00
Anna Stuchlik
b31cb94944 doc: improve the section about knowledge base articles in README 2022-07-27 13:49:53 +02:00
Raphael S. Carvalho
0796b8c97a sstables: Enforce disjoint invariant in sstable_run
We know that sstable_run is supposed to contain disjoint files only,
but this assumption can temporarily break when switching strategies
as TWCS, for example, can incorrectly pick the same run id for
sstables in different windows during segregation. So when switching
from TWCS to ICS, it could happen a sstable_run won't contain disjoint
files. We should definitely fix TWCS and any other strategy doing
that, but sstable_run should have disjointness as actual invariant,
not be relaxed on it. Otherwise, we cannot build readers on this
assumption, so more complicated logic have to be added to merge
overlapping files.
After this patch, sstable_run will reject insertion of a file that
will cause the invariant to break, so caller will have to check
that and push that file into a different sstable run.

Closes #11116
2022-07-27 14:48:28 +03:00
Anna Stuchlik
ecf1633cb3 doc: replace distribution names with a generic phrase: Linux distributions 2022-07-27 13:28:05 +02:00
Anna Stuchlik
456f2f7c47 doc: remove irrelevant guidelines for contributors from README 2022-07-27 13:18:20 +02:00
Benny Halevy
bb9eddc67f test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
Now that memtable flush error handling was moved entirely
to table::seal_active_memtable, we don't need to notify_soft_pressure
to keep retry going.  The inifinite retry loop should
eventually either succeed or die (by isolating the node or aborting)
on its own.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
b5abbb971f test: memtable_test: failed_flush_prevents_writes: extend error injection
Inject errors into all seal_active_memtable distinct error
handling sites.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
a5911619c0 table: seal_active_memtable: abort if retried for too long
If we haven't been able to flush the memtable
in ~30 minutes (based on the number of retries)
just abort assuming that the OOM
condition is permanent rather than transient.

Refs #4344

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
bc18f750c6 table: seal_active_memtable: abort on unexpected error
Currently when we can't write the flushed sstable
due to corruption in the memtable we get into
an infinite retry loop (see #10498).

Until we can go into maintenance mode, the next best thing
would be to abort, though there is still a risk that
commitlog replay will reproduce the corruption in the
memtable and we's end up with an infinite crash loop.
(hence #10498 is not Fixed with this patch)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:57 +03:00
Benny Halevy
f0a597a252 table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
And let seal_active_memtable decide about how to handle them
as now all flush error handling logic is implemented there.

In particular, unlike today, sstable write errors will
cause internal error rather than loop forever.

Also, check for shutdown earlier to ignore errors
like semaphore_broken that might happen when
the table is stopped.

Refs #10498

(The issue will be considered fixed when going
into maintenance mode on write errors rather than
throwing internal error and potentially retrying forever)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:04:55 +03:00
Benny Halevy
d55a2ac762 dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
Currently flush is retried both by dirty_memory_manager::flush_when_needed
and table::seal_active_memtable, which may be called by other paths
like table::flush.

Unify the retry logic into seal_active_memtable so that
we have similar error handling semantics on all paths.

Refs #4174
Refs #10498

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
93f835a2dd dirty_memory_manager: flush_permit: add has_sstable_write_permit
after release_sstable_write_permit is called,
_sstable_write_permit will have no value.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
b3dcc77c66 dirty_memory_manager: flush_permit: release_sstable_write_permit: mark noexcept
Neither exchanging the std:;optional nor moving
the sstable_write_permit throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
53355eb95d dirty_memory_manager: flush_permit: make _sstable_write_permit optional
So we can safely test whether it was released or not
by release_sstable_write_permit in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
67479e4243 table: reindent seal_active_memtable 2022-07-27 13:43:17 +03:00
Benny Halevy
00941452d5 table: coroutinize seal_active_memtable
As a first step to making it robust using
state machine driven retries.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
d3acd80cf5 memtable_list: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
5991482049 commitlog: make discard_completed_segments and friends noexcept
To simplify table::seal_active_memtable error handling
and retry logic.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
863e9d9e6a dirty_memory_manager: flush_when_needed: target error handling at flush_one
Now that everything prior to flush_one is noexcept
make table::seal_active_memtable and the paths that call it
noexcept, making sure that any errors are returned only
as exceptional futures, and handle them in flush_when_needed().

The original handle_exception had a broader scope than now needed,
so this change is mostly technical, to show that we can narrow down
the error handling to the continuation of flush_one - and verify that
the unit test is not broken.
A later patch moves this error handling logic away to seal_active_memtable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
73e50bc97d database: delete unused seal_delayed_fn_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
73e5cd0448 dirty_memory_manager: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
fcb3347c7a memtable: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
2d1ba0d7d8 memtable: memtable_encoding_stats_collector: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
ad85e720f9 encoding_state: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
6e961ead3b logalloc: mark free functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
705b42efe2 logalloc: allocating_section: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
f9db708376 logalloc: allocating_section: guard: mark constructor noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
5416808367 logalloc: reclaim_lock: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
95b0e41abb logalloc: tracker_reclaimer_lock: mark constructor noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
ed9e036509 logalloc: mark shard_tracker noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
d6e6ffc741 logalloc: region: mark functions const/noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
2beee4a6cd logalloc: basic_region_impl: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
3ba85c3bbd logalloc: region_impl: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
d838456be2 utils: log_heap: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
3f96818c03 logalloc: region_impl: object_descriptor: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
0866548b27 logalloc: region_group: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
fe50c76dbc logalloc: tracker: mark functions const/noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:40:50 +03:00
Benny Halevy
71c21a83ad logalloc: tracker::impl: make region_occupancy and friends const
No that they don't modify the tracker impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:40:18 +03:00
Benny Halevy
1c0c01cc24 logalloc: tracker::impl: occupancy: get rid of reclaiming_lock
It was added in d20fae96a2
as a precaution not to invalidate iterators while
traversing _regions.  However it is not requried as no allocation
is done on this synchronous path - therefore there is no
point in preventing reclaim.

This will allow making the respective functions const
as they merely return stats and do not modify the tracker impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:39:18 +03:00
Benny Halevy
888e225113 logalloc: tracker::impl: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:39:16 +03:00
Benny Halevy
f0027f60d4 logalloc: segment: mark functions const / noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:34:56 +03:00
Benny Halevy
830912cfa0 logalloc: segment_pool: add const variant of descriptor method
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:34:48 +03:00
Benny Halevy
f318d1664e logalloc: segment_pool: move descriptor method to class definition
To make the implementation inline and to prepare
for the next patch that adds a const overload of
this method.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:34:37 +03:00
Benny Halevy
35899463d4 logalloc: segment_pool: mark functions const/noexcept
Some methods were also marked inline when declared in the class
definition and in the ir definition site to provide a hint to
the compiler to inline them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:33:47 +03:00
Benny Halevy
02e74696f2 logalloc: segment_pool: delete unused free_or_restore_to_reserve method
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:33:21 +03:00
Benny Halevy
00dae56e19 utils: dynamic_bitset: mark functions noexcept
dynamic_bitset allocates only when constructed.
then on it doesn't throw.

Though not that accessing bits out of range
is undefined behavior.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:32:36 +03:00
Benny Halevy
d911d03344 utils: dynamic_bitset: delete unused members
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:32:08 +03:00
Benny Halevy
da87a4a248 logalloc: segment_store, segment_pool: idx_from_segment: get a const segment* in const overload
To maintain the const chain from segment via segment_store to
segment_pool.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:28:21 +03:00
Benny Halevy
947f71ce91 logalloc: segment_store, segment_pool: return const segment* from segment_from_idx() const
Maintain the const chain by returning a const segment*
from segment_from_idx() const overload.

And add a respective mutable overload to return a mutable segment*.

This is done for a similar change in idx_from_segment.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:27:30 +03:00
Benny Halevy
17902da66c logalloc: segment_store: make can_allocate_more_segments const
Add a const noexcept overload of `find_empty()` so that
can_allocate_more_segments can be const noexcept as well.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:26:07 +03:00
Benny Halevy
2ae61d5209 logalloc: segment_store: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:19:58 +03:00
Benny Halevy
852c23b97a logalloc: segment_descriptor: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:18:15 +03:00
Benny Halevy
a49619a601 logalloc: occupancy_stats: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:17:43 +03:00
Benny Halevy
721e94dcf1 min_max_tracker: mark functions noexcept
Based on tracked types being nothrow copy and move construtible.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:17:27 +03:00
Benny Halevy
b5f9a3d44e gc_clock, db_clock: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:17:01 +03:00
Benny Halevy
724692e7f4 dirty_memory_manager: region_group: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:16:02 +03:00
Benny Halevy
6aaec0928a dirty_memory_manager: region_group: make simple constructor noexcept
By std::moving its sstring name arg.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:13:33 +03:00
Benny Halevy
c386339730 dirty_memory_manager: region_group_reclaimer mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:06:32 +03:00
Mikołaj Sielużycki
9f5655bb97 mutation: Add test if mutations are consumed in order
It explicitly interleaves clustering rows with range tombstones and
ensures the last clustering row is dummy.
2022-07-27 11:22:55 +02:00
Mikołaj Sielużycki
9c43f1266a test: Move validating_consumer to test/lib/mutation_assertions.hh 2022-07-27 11:19:50 +02:00
Anna Stuchlik
b3648c1403 doc: language improvements in the doc's README 2022-07-27 10:28:59 +02:00
Anna Stuchlik
d4bc030705 doc: reogrganize the content in the doc's README 2022-07-27 10:18:01 +02:00
Anna Stuchlik
e989eed75b doc: update the Prerequisites section in the doc's README 2022-07-27 10:03:09 +02:00
Anna Stuchlik
48703adf59 doc: remove redundant information from README in the docs folder 2022-07-27 09:56:24 +02:00
Anna Stuchlik
3f097f3285 doc: add key information to the introduction in README in the docs folder 2022-07-27 09:50:13 +02:00
Mikołaj Sielużycki
09da47d87e mutation: Ignore dummy rows when consuming clustering fragments
consume_clustering_fragments already ignores dummy rows, but does it in
the wrong place. Currently they're ignored after comparing them with
range tombstones. This change skips them before any useful work is done
with them.

Consider a simplified mutation reversal scenario scenario (ckp is
clustering key prefix, -1, 0, 1 are bound_weights):

schema_ptr s = schema_builder{"ks", "cf"}
    .with_column("pk", bytes_type, column_kind::partition_key)
    .with_column("ck1", bytes_type, column_kind::clustering_key)
    .build();

Range tombstones:
    range_tombstone rt1{ckp{}, bound_kind::incl_start, ckp{1}, bound_kind::incl_end, tombstone{ts + 0, tp}};
    range_tombstone rt2{ckp{1}, bound_kind::excl_start, ckp{}, bound_kind::incl_end, tombstone{ts + 1, tp}};

Input range tombstone positions:
    {clustered, ckp{}, before}
    {clustered, ckp{1}, after}

Clustering rows:
    {clustered, ckp{2}, equal}
    {clustered, ckp{}, after} // dummy row

During reversal, clustering rows are read backwards, and reversed range
tombstone positions are read forwards (because the range tombstones are
reversed and applied backwards). Position of rows is not
reversed, as regular rows always have equal positions (which does not
hold for dummy rows, which causes the problem in this case).
The read order in the example above is:

Reversed range tombstone positions:
    1: {clustered, ckp{}, before}
    2: {clustered, ckp{1}, before}

Clustering rows read backwards:
    3: {clustered, ckp{}, after} // dummy row
    4: {clustered, ckp{2}, equal}

Then we effectively do the merge part of merge sort, trying to put all
fragments in order according to their positions from the two lists
above. However, the dummy row is used in the comparison, and it compares
to be gt each of the reversed range tombstone positions. Then we
try to emit the clustering row, but only at that point we notice it's
dummy and should be skipped. Subsequent row with ckp{2} is compared to
the last used range tombstone position and the fragments are out of
order (in reversed schema, ckp{2} should come before ckp{1}).

The solution is to move the logic skipping the dummy clustering rows to
the beginning of the loop, so they can be ignored before they're used.
2022-07-27 09:32:56 +02:00
Benny Halevy
a6356539bf logalloc: lsa_buffer: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 10:22:35 +03:00
Botond Dénes
81e20ceaab Merge 'logalloc, dirty_memory_manager: move region_groups to dirty_memory_manager' from Avi Kivity
logalloc manages regions of log-structured allocated memory, and region_groups
containing such regions and other region_groups. region_groups were introduced
for accounting purposes - first to limit the amount of memory in memtables, then to
match new dirty memory allocation rate with memtable flushing rate so we never
hit a situation where allocation rate exceeded flush rate, and we exceed our limit.

The problem is that the abstraction is very weak - if we want to change anything
in memtable flush control we'll need to change region_groups too - and also
expensive to maintain.

The solution is to break the abstraction and move region_groups to memtable
dirty memory management code. Instead introduce a new, simpler abstraction,
the region_listener, which communicates changes in region memory consumption
to an external piece of code, which can then choose to do with it what it likes.

The long term plan is to completely remove region_groups and fold them into dirty_memory_manager:
 - make each memtable a region_listener so it gets called back after size changes
 - make memtables inform their dirty_memory_manager about the size to dirty_memory_manager can decide to throttle writes and which memtable to pick to flush

Closes #10839

* github.com:scylladb/scylla:
  logalloc: drop region_impl public accessors
  logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc
  logalloc: relax lifetime rules around region_listener
  logalloc, dirty_memory_manager: move region_group and associated code
  logalloc: expose tracker_reclaimer_lock
  logalloc: reimplement tracker_reclaim_lock to avoid using hidden classes
  logalloc: reduce friendship between region and region_group
  logalloc: decouple region_group from region
  memtable: stop using logalloc::region::group() to test for flushed memtables
2022-07-26 17:08:37 +03:00
Amnon Heiman
72414b613b Split the timed_rate_moving_average into data and timer
This patch split the timed_rate_moving_average functionality into two, a
data class: rates_moving_average, and a wrapper class
timed_rate_moving_average that uses a timer to update the rates
periodically.

To make the transition as simple as possible timed_rate_moving_average,
takes the original API.

A new helper class meter_timer was introduced to handle the timer update
functionality.

This change required minimal code adaptation in some other parts of the
code.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-26 15:59:33 +03:00
Amnon Heiman
5bf51ed4af utils/histogram.hh: should_sample should use a bitmask
This patch fixes a bug in should_sample that uses its bitmask
incorrectly.

basic_ihistogram has a feature that allows it to sample values instead
of taking a timer each time.

To decide if it should sample or not, it uses a bitmask. The bitmask
is of the form 2^n-1, which means 1 out of 2^n will be sampled.

For example, if the mask is 0x1 (2^2-1) 1 out of 2 will be sampled.
If the mask is 0x7 (2^3-1) 1 out of 8 will be sampled.

There was a bug in the should_sampled() method.
The correct form is (value&mask) == mask

Ref #2747
It does not solve all of #2747, just the bug part of it.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-26 15:59:33 +03:00
Amnon Heiman
99bc6d882b estimated_histogram: add missing getter method
This patch adds the square bracket operator method that was missing.
2022-07-26 15:59:33 +03:00
Nadav Har'El
cb8a67dc98 Merge 'Allow materialized views to by synchronous' from Piotr Sarna
This pull request introduces a "synchronous mode" for global views. In this mode, all view updates are applied synchronously as if the view was local.

Marking view as a synchronous one can be done using `CREATE MATERIALIZED VIEW` and `ALTER MATERIALIZED VIEW`. E.g.:
```cql
ALTER MATERIALIZED VIEW ks.v WITH synchronous_updates = true;
```

Marking view as a synchronous one was done using tags (originally used by alternator). No big modifications in the view's code were needed.

Fixes: https://github.com/scylladb/scylla/issues/10545

Closes #11013

* github.com:scylladb/scylla:
  cql-pytest: extend synchronous mv test with new cases
  cql-pytest: allow extra parameters in new_materialized_view
  docs: add a paragraph on view synchronous updates
  test/boost/cql_query_test: add test setting synchronous updates property
  test: cql-pytest: add a test for synchronous mode materialized views
  db: view: react to synchronous updates tag
  cql3: statements: cf_prop_defs: apply synchronous updates tag
  alternator, db: move the tag code to db/tags
  cql3: statements: add a synchronous_updates property
2022-07-26 15:42:51 +03:00
Alejo Sanchez
5014bd0d51 test.py: add missing CQL test discovery
Add missing build_test_list() to CQLApproval test.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11124
2022-07-26 15:40:04 +03:00
David Garcia
18c7006ac5 doc: Remove unused redirects
Closes #11112
2022-07-26 14:15:45 +03:00
Asias He
6152f5b858 locator: Speed up abstract_replication_strategy::get_address_ranges
To get the list of tokens for a given node, we loop through all the
tokens and calculate the nodes that are responsible for the token.

In case of the everywhere_topology, we know any node that is part of the
the ring will be responsible for all tokens.

This patch adds a fast path for everywhere_topology to avoid calculating
natural endpoints.

Refs #10337
Refs #10817
Refs #10836
Refs #10837
2022-07-26 18:53:09 +08:00
Asias He
9a8a80527b locator: Speed up simple_strategy::calculate_natural_endpoint
If the number of nodes in the cluster is smaller than the desired
replication factor we should return the loop when endpoints already
contains all the nodes in the cluster because no more nodes could be
added to endpoints lists

Refs #10337
Refs #10817
Refs #10836
Refs #10837
2022-07-26 18:53:09 +08:00
Asias He
4c714dfe3b token_metadata: Speed up count_normal_token_owners
Currently, a set of nodes is built from _token_to_endpoint_map to get
the number of nodes in _token_to_endpoint_map.

To make it faster so we can call it on a fast path in the following
patch, a _nr_normal_token_owners member is introduced to track the
number.

Refs #10337
Refs #10817
Refs #10836
Refs #10837
2022-07-26 18:53:09 +08:00
Pavel Emelyanov
40d6ea973c snitch: Remove reconnectable snitch helper
It's now no-op

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:51:05 +03:00
Pavel Emelyanov
b91f7e9ec4 snitch, storage_service: Move reconnect to internal_ip kick
The same thing as in previous patch -- when gossiper issues
on_join/_change notification, storage service can kick messaging
service to update its internal_ip cache and reconnect to the peer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:48:46 +03:00
Pavel Emelyanov
1bf8b0dd92 snitch, storage_service: Move system.peers preferred_ip update
Currently the INTERNAL_IP state is updated using reconnectable helper
by subscribing on on_join/on_change events from gossiper. The same
subscription exists in storage service (it's a bit more elaborated by
checking if the node is the part of the ring which is OK).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:48:46 +03:00
Pavel Emelyanov
0abd2c1e52 snitch: Export prefer-local
The boolean bit says whether "the system" should prefer connecting to
the address gossiper around via INTERNAL_IP. Currently only gossiping
property file snitch allows to tune it and ec2-multiregion snitch
prefers internal IP unconditionally.

So exporting consists of 2 pieces:

- add prefer_local() snitch method that's false by default or returns
  the (existing) _prefer_local bit for production snitch base
- set the _prefer_local to true by ec2-multiregion snitch

While at it the _prefer_local is moved to production_snitch_base for
uniformity with the new prefer_local() call

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-26 13:48:04 +03:00
Piotr Sarna
abc5a7b7ec forward_service: remove redundant optional from forward_service
This commit refactors the code to get rid of unnecessary
std::optional usage in forward_result, since now it's possible
to merge empty results with each other, both ways (#11064).
2022-07-26 12:02:55 +02:00
Alejo Sanchez
2e39642728 test.py: fix log handling on error for Python and CQL tests
Fix mixing of log filename and log summary in error reporting for
CQLApprovalTest and PythonTest.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11125
2022-07-26 11:38:14 +03:00
Alejo Sanchez
302b703efe test.py: remove unused global
random_tables gets the keyspace from caller so remove leftover counter.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11123
2022-07-26 11:35:40 +03:00
Anna Stuchlik
38c2c4f7df doc: update the info about metrics in 2022.1 compared to 5.0 2022-07-26 10:25:26 +02:00
Avi Kivity
5b541bed72 logalloc: drop region_impl public accessors
With the region heap handle removed from logalloc::region, there is
nothing remaining there that needs violation of the abstraction
boundary, so we can drop these hacks.
2022-07-26 11:12:10 +03:00
Avi Kivity
2cb5f79e9d logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc
The region_group mechanism used an intrusive heap handle embedded in
logalloc::region to allow region_group:s to track the largest region. But
with region_group moved out of logalloc, the handle is out of place.

Move it out, introducing a new intermediate class size_tracked_region
to hold the heap handle. We might eventually merge the new class into
memtable (which derives from it), but that requires a large rearrangement
of unit tests, so defer that.
2022-07-26 11:12:10 +03:00
Avi Kivity
ee720fa23b logalloc: relax lifetime rules around region_listener
Currently, a region_listener is added during construction and removed
during destruction. This was done to mimick the old region(region_group&)
constructor, as region_listener replaces region_group.

However, this makes moving the binomial heap handle outside logalloc
difficult. The natural place for the handle is in a derived class
of logalloc::region (e.g. memtable), but members of this derived class
will be destroyed earlier than the logalloc::region here. We could play
trickes with an earlier base class but it's better to just decouple
region lifecycle from listener lifecycle.

Do that be adding listen()/unlisten() methods. Some small awkwardness
remains in that merge() implicitly unlistens (see comment in
region::unlisten).

Unit tests are adjusted.
2022-07-26 11:12:10 +03:00
Avi Kivity
fbe8ea7727 logalloc, dirty_memory_manager: move region_group and associated code
region_group is an abstraction that allows accounting for groups of
regions, but the cost/benefit ratio of maintaining the abstraction
is poor. Each time we need to change decision algorithm of memtable
flushing (admittedly rarely), we need to distill that into an abstraction
for region_groups and then use it. An example is virtual regions groups;
we wanted to account for the partially flushed memtables and had to
invent region groups to stand in their place.

Rather than continuing to invest in the abstraction, break it now
and move it to the memtable dirty memory manager which is responsible
for making those decisions. The relevant code is moved to
dirty_memory_manager.hh and dirty_memory_manager.cc (new file), and
a new unit test file is added as well.

A downside of the change is that unit testing will be more difficult.
2022-07-26 11:12:10 +03:00
Avi Kivity
bffee2540f logalloc: expose tracker_reclaimer_lock
tracker_reclaimer_lock is used by region_group, which is being moved
out of logalloc, so expose it.
2022-07-26 11:12:10 +03:00
Avi Kivity
4ba0658670 logalloc: reimplement tracker_reclaim_lock to avoid using hidden classes
Right now tracker_reclaim_lock uses tracker::impl::reclaiming_lock,
which won't be visible if we want to expose tracker_reclaim_lock and
use it from another translation unit. However, it's simple to switch
to an implementation that doesn't require an unknown-size data member,
and instead increment a counter via a pointer, so do that.
2022-07-26 11:12:10 +03:00
Avi Kivity
652ab6f4a2 logalloc: reduce friendship between region and region_group
- add conversions between region and region_impl
 - add accessor for the binomial heap handle
 - add accessor for region_impl::id()
 - remove friend declarations

This helps in moving region_group to a different source file, where
the definitions of region_impl will not be visible.
2022-07-26 11:12:10 +03:00
Avi Kivity
c91ee9d04e logalloc: decouple region_group from region
As a first step in moving region_group away from logalloc, decouple
communications between region and region_group. We introduce region_listener,
that listens for the events that region passed directly to region_group.
A region_group now installs a region_listener in a region, instead of
having region know about the region_group directly.

This decoupling is still leaky:
 - merge() chooses to forget the merged-from region's region_listener.
  This happens to be suitable for the only user of merge().
 - We're still embedding the binomial heap handle, used by region_group
   to keep track of region sizes, in regions. A complete decoupling would
   transfer that responsibility to region_group.
2022-07-26 11:12:03 +03:00
Avi Kivity
cb1251199a memtable: stop using logalloc::region::group() to test for flushed memtables
Currently, the memtable reader uses logalloc::region::group() to test
for whether a memtable has been flushed. If a memtable doesn't belong
to a region group (from dirty_memory_manager), it is flushed.

This is quite tortuous - logalloc::region::merge() makes the merged-from
region identical to the merged-to region. The merged-to region, the cache,
doesn't have a group, so the check works.

Since we're making region groups part of dirty_memory_manager, the cache
will no longer have this indirect way of communication with memtable. But
instead we can use a direct callback it already has -
on_detach_from_region_group(). Use that to set a flag, and examine it in
the read path.
2022-07-26 11:07:25 +03:00
David Garcia
5067de6d3f docs: Fix broken links
Closes #11092
2022-07-26 10:53:17 +03:00
Piotr Sarna
626fb75949 forward_service: open-code running a Sestar thread
Previous interface forced the caller to allocate forward_aggregates
in order to be able to conditionally run the merging code inside
a Seastar thread, which is suboptimal. By open-coding the condition,
it's possible to drop the do_with, saving an allocation.
2022-07-26 08:10:47 +02:00
Piotr Sarna
e8f2565371 forward_service: add requires_thread helper
It will be needed later to be able to decide if seastar thread
is needed for merging forward service results.
2022-07-26 08:10:47 +02:00
Avi Kivity
29c28dcb0c Merge 'Unstall get_range_to_address_map' from Benny Halevy
Prevent stalls in this path as seen in performance testing.

Also, add a respective rest_api test.

Fixes #11114

Closes #11115

* github.com:scylladb/scylla:
  storage_service: reserve space in get_range_to_address_map and friends
  storage_service: coroutinize get_range_to_address_map and friends
  storage_service: pass replication map to get_range_to_address_map and friends
  storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer
  test: rest_api: test range_to_endpoint_map and describe_ring
2022-07-25 18:06:28 +03:00
Piotr Sarna
c195ce1b82 query: allow merging non-empty forward_result with an empty one
Merging empty results was already allowed, but in one way only:

empty.merge(nonempty, r); // was permitted
nonempty.merge(empty, r); // not permitted

With this commit, both methods are permitted.
In order to remove copying, the other result is now taken
by rvalue reference, with all call sites being updated
accordingly.

Fixes #10446
Fixes #10174

Closes #11064
2022-07-25 18:06:28 +03:00
Benny Halevy
bc5f6cf45d storage_service: reserve space in get_range_to_address_map and friends
To reduce the chance of reallocation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Avi Kivity
4fde9414dc Merge 'logalloc reclaim_timer improvements' from Michael Livshin
* round up reported time to microseconds
* add backtrace if stall detected
* add call site name (hierarchical when timers are nested)
* put timers in more places
* reduce possible logspam in nested timers by making sure to report on things only once and to not report on durations smaller than those already reported on

Closes #10576

* github.com:scylladb/scylla:
  utils: logalloc: fix indentation
  utils: logalloc: split the reclaim_timer in compact_and_evict_locked()
  utils: logalloc: report segment stats if reclaim_segments() times out
  utils: logalloc: reclaim_timer: add optional extra log callback
  utils: logalloc: reclaim_timer: report non-decreasing durations
  utils: logalloc: have reclaim_timer print reserve limits
  utils: logalloc: move reclaim timer destructor for more readability
  utils: logalloc: define a proper bundle type for reclaim_timer stats
  utils: logalloc: add arithmetic operations to segment_pool::stats
  utils: logalloc: have reclaim timers detect being nested
  utils: logalloc: add more reclaim_timers
  utils: logalloc: move reclaim_timer to compact_and_evict_locked
  utils: logalloc: pull reclaim_timer definition forward
  utils: logalloc: reclaim_timer make tracker optional
  utils: logalloc: reclaim_timer: print backtrace if stall detected
  utils: logalloc: reclaim_timer: get call site name
  utils: logalloc: reclaim_timer: rename set_result
  utils: logalloc: reclaim_timer: rename _reserve_segments member
  utils: logalloc: reclaim_timer round up microseconds
2022-07-25 18:06:28 +03:00
Benny Halevy
5eb31eff64 storage_service: coroutinize get_range_to_address_map and friends
And add calls to maybe_yield to prevent stalls in this path
as seen in performance testing.

Also, add a respective rest_api test.

Fixes #11114

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Tomasz Grabiec
76d20aeb96 Merge 'Refactor group 0 operations (joining, leaving, removing).' from Kamil Braun
A series of refactors to the `raft_group0` service.
Read the commits in topological order for best experience.

This PR is more or less equivalent to the second-to-last commit of PR https://github.com/scylladb/scylla/pull/10835, I split it so we could have an easier time reviewing and pushing it through.

Closes #11024

* github.com:scylladb/scylla:
  service: storage_service: additional assertions and comments
  service/raft: raft_group0: additional logging, assertions, comments
  service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0`
  service/raft: raft_group0: rewrite `remove_from_group0`
  service/raft: raft_group0: rewrite `leave_group0`
  service/raft: raft_group0: split `leave_group0` from `remove_from_group0`
  service/raft: raft_group0: introduce `setup_group0`
  service/raft: raft_group0: introduce `load_my_addr`
  service/raft: raft_group0: make some calls abortable
  service/raft: raft_group0: remove some temporary variables
  service/raft: raft_group0: refactor `do_discover_group0`.
  service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0`
  service/raft: raft_group0: extract `start_server_for_group0` function
  service/raft: raft_group0: create a private section
  service/raft: discovery: `seeds` may contain `self`
2022-07-25 18:06:28 +03:00
Benny Halevy
3d62a1592f storage_service: pass replication map to get_range_to_address_map and friends
Before they are made asynchronous in the next patch,
so they work on a coherent snapshot of the token_metadata and
replication map as their caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Petr Gusev
52142bb8b3 raft_group_registry, is_alive for non-existent server_id
We could yield between updating the list of servers in raft/fsm
and updating the raft_address_map, e.g. in case of a set_configuration.
If tick_leader happens before the raft_address_map is updated,
is_alive will be called with server_id that is not in the map yet.

Fix: scylladb/scylla-dtest#2753

Closes #11111
2022-07-25 18:06:28 +03:00
Benny Halevy
0b474866a3 storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer
It is only needed for the "storage_service/describe_ring" api
and service/storage_service shouldn't bother with it.
It's an api sugar coating.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Yaron Kaikov
c42c5111eb SCYLLA-VERSION-GEN: use semver-compatible version
Setting Scylla to use semantic versioning. (Ref: https://semver.org/)

Closes: https://github.com/scylladb/scylla/issues/9543

Closes #10957
2022-07-25 18:06:28 +03:00
Benny Halevy
429f110110 test: rest_api: test range_to_endpoint_map and describe_ring
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Nadav Har'El
85688a7a7e Merge 'cql3: grammar: make the whereClause production return a single expression' from Avi Kivity
Currently, the WHERE clause grammar is constrained to a conjunction of
relations: `WHERE a = ? AND b = ? AND c > ?`. The restriction happens in three
places:

1. the grammar will refuse to parse anything else
2. our filtering code isn't prepared for generic expressions
3. the interface between the grammar and the rest of the cql3 layer is via a vector of terms rather than an expression

While most of the work will be in extending the filtering code, this series tackles the
interface; it changes the `whereClause` production to return an expression rather than
a vector. Since much of cql3 layer is interested in terms, a new boolean_factors() function
is introduced to convert an expression to its boolean terms.

Closes #11105

* github.com:scylladb/scylla:
  cql3: grammar: make where clause return an expression
  cql3: util: deinline where clause utilities
  cql3: util: change where clause utilities to accept a single expression rather than a vector of terms
  cql3: statement_restrictions: accept a single expression rather than a vector
  cql3: statement_restrictions: merge `if` and `for`
  cql3: select_statement: remove wrong but harmless std::move() in prepare_restrictions
  cql3: expr: add boolean_factors() function to factorize an expression
  cql3: expression: define operator==() for expressions
  cql3: values: add operator==() for raw_value
2022-07-25 18:06:28 +03:00
Piotr Sarna
277aa30965 cql-pytest: extend synchronous mv test with new cases
The new cases cover:
 - a materialized view created with synchronous updates from the start
 - a materialized view created with synchronous updates,
   but then alter to not have synchronous updates anymore
2022-07-25 10:00:28 +02:00
Piotr Sarna
52f5ba16dc cql-pytest: allow extra parameters in new_materialized_view
The extra parameters can include a WITH clause.
2022-07-25 10:00:28 +02:00
Piotr Sarna
43c09eb9e6 docs: add a paragraph on view synchronous updates
The paragraph explains what synchronous view updates are
and how to set them up.
2022-07-25 10:00:28 +02:00
Michał Sala
c7b78cfd81 test/boost/cql_query_test: add test setting synchronous updates property
The test checks if a synchronous_updates property can be set via ALTER
MATERIALIZED VIEW or CREATE MATERIALIZED VIEW statements.
2022-07-25 09:53:33 +02:00
Michał Sala
2993bbc33b test: cql-pytest: add a test for synchronous mode materialized views
The test verifies if a synchronous updates code path was triggered in a
view that had synchronous_updates property set to true.
Done by inspecting query traces.
2022-07-25 09:53:33 +02:00
Michał Sala
d573ab0b58 db: view: react to synchronous updates tag
Code that waited for all remote view updates was already there. This
commit modifies the conditions of this wait to take into account the
"synchronous mode" (enabled when db::SYNCHRONOUS_VIEW_UPDATES_TAG_KEY is
set).
2022-07-25 09:53:33 +02:00
Michał Sala
128806f022 cql3: statements: cf_prop_defs: apply synchronous updates tag
This commit defines a new tag key (SYNCHRONOUS_VIEW_UPDATES_TAG_KEY) to
be used for marking "synchronous mode" views. This key is used in
`cf_prop_defs::apply_to_builder` if the properties contain
KW_SYNCHRONOUS_UPDATES.
2022-07-25 09:53:33 +02:00
Michał Sala
041cb77ad0 alternator, db: move the tag code to db/tags
Tags are a useful mechanism that could be used outside of alternator
namespace. My motivation to move tags_extension and other utilities to
db/tags/ was that I wanted to use them to mark "synchronous mode" views.

I have extracted `get_tags_of_table`, `find_tag` and `update_tags`
method to db/tags/utils.cc and moved alternator/tags_extension.hh to
db/tags/.

The signature of `get_tags_of_table` was changed from `const
std::map<sstring, sstring>&` to `const std::map<sstring, sstring>*`
Original behavior of this function was to throw an
`alternator::api_error` exception. This was undesirable, as it
introduced a dependency on the alternator module. I chose to change it
to return a potentially null value, and added a wrapper function to the
alternator module - `get_tags_of_table_or_throw` to keep the previous
throwing behavior.
2022-07-25 09:53:33 +02:00
Michał Sala
494e7fc5f5 cql3: statements: add a synchronous_updates property
This property can be used with CREATE MATERIALIZED VIEW and ALTER
MATERIALIZED VIEW statements. Setting it allows global views to enter
"synchronous mode". In this mode, all view updates are also applied
synchronously as if the view was local. This may reduce their
availability, but has the benefit of propagating a potential
inconsistency risk (in form of a write error) to the user, who can
respond to it appropriately (e.g. retry the write or fix the view
later).
2022-07-25 09:53:33 +02:00
Botond Dénes
b673b4bee3 Merge 'let scylla-gdb.py recognize coroutines' from Michael Livshin
"scylla task_histogram" and "scylla fiber" will now show coroutine "promises".

Refs #10894

Closes #11071

* github.com:scylladb/scylla:
  test: gdb: test that "task_histogram -a" finds some coroutines
  scylla-gdb.py: recognize coroutine-related symbols as task types
  scylla-gdb.py: whitelist the .text section for task "vtables"
  scylla-gdb.py: fix an error message
2022-07-25 06:48:45 +03:00
Nadav Har'El
f1e3494a10 cql-pytest: fix a test to not fail on very slow machines
The cql-pytest cassandra_tests/validation/operations/select_test.py::
testSelectWithAlias uses a TTL but not because it wants to test the TTL
feature - it just wants to check the SELECT aliasing feature. The test
writes a TTL of 100 and then reads it back using an alias. We would
normally expect to read back 100 or 99, but to guard against a very slow
test machine, the test verified that we read back something between 70
and 100. I thought that allowing a ridiculous 30 second delay between
the write and the read requests was more than enough.

But in one run of the aarch64 debug build, this ridiculous 30 seconds
wasn't ridiculous enough - the delay ended up 35 seconds, and the
test failed!

So in this patch, I just make it even more ridiculous - we write 1000
and expect to read something over 100 - allowing a 900 second delay
in the test.

Note that neither the earlier 30-second or current 900-second delay
slows down the test in any way - this test will normally complete in
milliseconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11085
2022-07-24 21:17:59 +03:00
Avi Kivity
9823e75d16 cql3: grammar: make where clause return an expression
In preparation of the relaxation of the grammar to return any expression,
change the whereClause production to return an expression rather than
terms. Note that the expression is still constrained to be a conjunction
of relations, and our filtering code isn't prepared for more.

Before the patch, if the WHERE clause was optional, the grammar would
pass an empty vector of expressions (which is exactly correct). After
the patch, it would pass a default-constructed expression. Now that
happens to be an empty conjunction, which is exactly what's needed, but
it is too accidental, so the patch changes optional WHERE clauses to
explicitly generate an empty conjunction if the WHERE clause wasn't
specified.
2022-07-22 20:14:48 +03:00
Avi Kivity
a037f9a086 cql3: util: deinline where clause utilities
Some where clause related functions were unnecessarily inline; another
was just recently de-templated. Move them to .cc.
2022-07-22 20:14:48 +03:00
Avi Kivity
fd663bcb94 cql3: util: change where clause utilities to accept a single expression rather than a vector of terms
Conversion to terms happens internally via boolean_factors().
2022-07-22 20:14:48 +03:00
Avi Kivity
a5dd588465 cql3: statement_restrictions: accept a single expression rather than a vector
Move closer to the goal of accepting a generic expression for WHERE
clause by accepting a generic expression in statement_restrictions. The
various callers will synthesize it from a vector of terms.
2022-07-22 20:14:48 +03:00
Avi Kivity
43aca25496 cql3: statement_restrictions: merge if and for
A `for` loop does nothing on an empty container, so no need for an
extra `if` for that condition. Drop the `if`.
2022-07-22 20:14:48 +03:00
Avi Kivity
4aa0a03b7e cql3: select_statement: remove wrong but harmless std::move() in prepare_restrictions
std::move(_where_clause) is wrong, because _where_clause is used later
(when analyzing GROUP BY), but also harmless (because the
statement_restrictions constructor accepts it by const reference).

To avoid confusion in the next patch where we'll pass _where_clause
to a different function, remove the bad std::move() in advance here.
2022-07-22 20:14:48 +03:00
Avi Kivity
8085b9f57a cql3: expr: add boolean_factors() function to factorize an expression
When analyzing a WHERE clause, we want to separate individual
factors (usually relations), and later partition them into
partition key, clustering key, and regular column relations. The
first step is separation, for which this helper is added.

Currently, it is not required since the grammar supplies the
expression in separated form, but this will not work once it is
relaxed to allow any expression in the WHERE clause.

A unit test is added.
2022-07-22 20:14:48 +03:00
Avi Kivity
1efb2fecbe cql3: expression: define operator==() for expressions
This is useful for tests, to check that expression manipulations
yield the expected results.
2022-07-22 20:14:48 +03:00
Avi Kivity
eec441d365 cql3: values: add operator==() for raw_value
This is useful for implementing operator==() for expressions, which in
turn require comparing constants, which contain raw_values.

Note that this is not CQL comparison (that would be implemented
in cql3::expr::evaluate() and would return a CQL boolean, not a C++
boolean, but a traditional C++ value comparison.
2022-07-22 20:13:49 +03:00
Anna Stuchlik
f46b207472 doc: update the links that are false external links and result in 404
Closes #11086
2022-07-22 14:17:42 +03:00
Anna Stuchlik
4bb9060268 doc: minor formatting and language fixes 2022-07-22 12:50:53 +02:00
Anna Stuchlik
2ade33f317 doc: add the new guide to the toctree 2022-07-22 12:36:43 +02:00
Anna Stuchlik
c9b0c6fdbf doc: add the upgrade guide from 5.0 to 2022.1 2022-07-22 12:32:08 +02:00
Botond Dénes
d12d429c47 Merge 'doc: add the upgrage guides from 2022.x.y to 2022.x.z' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-docs/issues/4041

I've added the upgrade guides from 2022.x.y to 2022.x.z. They are based on the previous upgrade guides for patch releases.

Closes #11104

* github.com:scylladb/scylla:
  doc: add the new upgrade guide to the toctree
  doc: add the upgrage guides from 2022.x.y to 2022.x.z
2022-07-22 09:06:16 +03:00
Michael Livshin
0f1a884c90 test: gdb: test that "task_histogram -a" finds some coroutines
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-21 19:12:21 +03:00
Michael Livshin
6cbb367ba7 scylla-gdb.py: recognize coroutine-related symbols as task types
The criteria is too permissive because coroutine symbols (those
without the "[clone .resume]" part at the end, anyway) look like
normal function names; hopefully this won't give too many false
positives to become a problem.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-21 19:12:21 +03:00
Michael Livshin
f2c37b772d scylla-gdb.py: whitelist the .text section for task "vtables"
Actual vtables do not reside there, but coroutine object vptrs point
at the actual coroutine code, which is.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-21 19:12:21 +03:00
Michael Livshin
080bd7c481 scylla-gdb.py: fix an error message
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-21 19:12:21 +03:00
Gleb Natapov
f1f1176963 service: raft: do not allow downgrading non expiring entry to expiring one in raft_address_map
Expiring entries are added when a message is received from an unknown
host. If the host is later added to the raft configuration they become
non expiring. After that they can only be removed when the host is
dropped from the configuration, but they should never become expiring
again.

Refs #10826
2022-07-21 17:40:04 +02:00
Anna Stuchlik
23515c8695 doc: add the new upgrade guide to the toctree 2022-07-21 17:05:28 +02:00
Anna Stuchlik
bf5bf44ddd doc: add the upgrage guides from 2022.x.y to 2022.x.z 2022-07-21 16:46:06 +02:00
Asias He
39db15d2cb misc_services: Fix cache hitrate update
This patch avoids unncessary CACHE_HITRATES updates through gossip.

After this patch:

Publish CACHE_HITRATES in case:

- We haven't published it at all
- The diff is bigger than 1% and we haven't published in the last 5 seconds
- The diff is really big 10%

Note: A peer node can know the cache hitrate through read_data
read_mutation_data and read_digest RPC verbs which have cache_temperature in
the response. So there is no need to update CACHE_HITRATES through gossip in
high frequency.

We do the recalculation faster if the diff is bigger than 0.01. It is useful to
do the calculation even if we do not publish the CACHE_HITRATES though gossip,
since the recalculation will call the table->set_global_cache_hit_rate to set
the hitrate.

Fixes #5971

Closes #11079
2022-07-21 11:31:30 +03:00
Nadav Har'El
5faf3c711d doc, alternator: document the possibility of write reordering
In issue #10966, a user noticed that Alternator writes may be reordered
(a later write to an item is ignored with the earlier write to the same
item "winning") if Scylla nodes do not have synchronized time and if
always_use_lwt write isolation mode is not used.

In this patch I add to docs/alternator/compatibility.md a section about
this issue, what causes it, and how to solve or at least mitigate it.

Fixes #10966

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11094
2022-07-21 09:22:56 +02:00
Kamil Braun
4e42aeb0df service: storage_service: additional assertions and comments 2022-07-20 19:39:29 +02:00
Kamil Braun
25bb8384af service/raft: raft_group0: additional logging, assertions, comments
Move some rare logs from TRACE to INFO level.
Add some assertions.
Write some more comments, including FIXMEs and TODOs.

Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
2022-07-20 19:39:29 +02:00
Kamil Braun
c9f1ec1268 service/raft: raft_group0: pass seed list and as_voter flag to join_group0
Group 0 discovery would internally fetch the seed list from gossiper.
Gossiper would return the seed list from conf/scylla.yaml. This seed
list is proper for the bootstrapping scenario - we specify the initial
contact points for a node that joins a cluster.

We'll have to use a different list of seeds for group 0 discovery for
the upgrade scenario. Prepare for that by taking the seed list as a
parameter.

In the bootstrap scenario we'll pass the seed list down from
`storage_service::join_cluster`.

Additionally, `join_group0` now takes an `as_voter` flag, which is
`false` in the bootstrap scenario (we initially join as a non-voter) but
will be `true` in the upgrade scenario.
2022-07-20 19:39:29 +02:00
Kamil Braun
684d8171ca service/raft: raft_group0: rewrite remove_from_group0
See previous commit. `remove_from_group0` had a similar problem as
`leave_group0`: it would handle the case where `raft_group0::_group0`
variant was not `raft::group_id` (i.e. we haven't joined group 0), but
RAFT local feature was enabled - i.e. the yet-unimplemented upgrade case
- by running discovery and calling `send_group0_modify_config`.

Instead, if we see that we've joined group 0 before, assume that we're
still a member and simply use the Raft `modify_config` API to remove
another server. If we're not a member it means we either decommissioned
or were removed by someone else; then we have no business trying to
remove others. There's also the unimplemented upgrade case but that will
come in another pull request.

Finally, add some logic for handling an edge case: suppose we joined
group 0 recently and we still didn't fully update our RPC address map
(it's being updated asynchronously by Raft's io_fiber). Thus we may fail
to find a member of group 0 in the address map. To handle this, ensure
we're up-to-date by performing a Raft read barrier.

State some assumptions in a comment.

Add a TODO for handling failures.

Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
2022-07-20 19:39:29 +02:00
Kamil Braun
eeeef0bc50 service/raft: raft_group0: rewrite leave_group0
One of the following cases is true:
1. RAFT local feature is disabled. Then we don't do anything related to
  group 0.
2. RAFT local feature is enabled and when we bootstrapped, we joined
  group 0. Then `raft_group0::_group0` variant holds the
  `raft::group_id` alternative.
3. RAFT local feature is enabled and when we bootstrapped we didn't join
  group 0. This means the RAFT local feature was disabled when we
  bootstrapped and we're in the (unimplemented yet) upgrade scenario.
  `raft_group0::_group0` variant holds the `std::monostate` alternative.

The problem with the previous implementation was that it checked for the
conditions of the third case above - that RAFT local feature is enabled
but `_group0` does not hold `raft::group_id` - and if those conditions
were true, it executed some logic that didn't really make sense: it ran
the discovery algorithm and called `send_group0_modify_config` RPC.

In this rewrite I state some assumptions that `leave_group0` makes:
- we've finished the startup procedure.
- we're being run during decommission - after the node entered LEFT
  status.

In the new implementation, if `_group0` does not hold `raft::group_id`
(checked by the internal `joined_group0()` helper), we simply return.
This is the yet-unimplemented upgrade case left for a follow-up PR.

Otherwise we fetch our Raft server ID (at this point it must be present
- otherwise it's a fatal error) and simply call `modify_config` from the
`raft::server` API.

Remove unnecessary call to `_shutdown_gate.hold()` (this is not a
background task).
2022-07-20 19:39:29 +02:00
Kamil Braun
75608bcd2f service/raft: raft_group0: split leave_group0 from remove_from_group0
`leave_group0` was responsible for both removing a different node from
group 0 and removing ourselves (leaving) group 0. The two scenarios are
a bit different and the handling will be rewritten in following commits.

Split `leave_group0` into two functions. Remove the incorrect comment
about idempotency - saying that the procedure is idempotent is an
oversimplification, one could argue it's incorrect since the second call
simply hangs, at least in the case of leaving group 0; following commits
will state what's happening more precisely.

Add some additional logging and assertions where the two functions are
called in `storage_service`.
2022-07-20 19:39:29 +02:00
Kamil Braun
ee0219dfe3 service/raft: raft_group0: introduce setup_group0
Contains all logic for deciding to join (or not join) group 0.

Prepare for the case where we don't want to join group 0 immediately on
startup - the upgrade scenario (will be implemented in a follow-up).

Move the group 0 setup step earlier in `storage_service::join_cluster`.

`join_group0()` is now a private member of `raft_group0`. Some more
comments were written.
2022-07-20 19:39:29 +02:00
Kamil Braun
4b0db59671 service/raft: raft_group0: introduce load_my_addr
Compared to `load_or_create_my_addr` this function assumes that
the address is already present on disk; if not, it's a fatal error.

Use it in places where it would indeed be a fatal error
if the address was missing.
2022-07-20 19:39:29 +02:00
Kamil Braun
f0f9aa5c7d service/raft: raft_group0: make some calls abortable
There are some calls to `modify_config` which should react to aborts
(e.g. when we shutdown Scylla).

There are also calls to `send_group0_modify_config` which should
probably also react to aborts, but the functions don't take
an abort_source parameter. This is fixable but I left TODOs for now.
2022-07-20 19:39:29 +02:00
Kamil Braun
ab8c3c6742 service/raft: raft_group0: remove some temporary variables
Make the code a bit shorter.
2022-07-20 19:39:29 +02:00
Kamil Braun
b193ea8ec0 service/raft: raft_group0: refactor do_discover_group0.
The function no longer accesses the `_group0` variant directly, instead
it is made a member of `service::persistent_discovery`; the caller
guarantees that `persistent_discovery` is not destroyed before the
function finishes.

The function is now named `run`. A short comment was written at the
declaration site.

Make some members of `persistent_discovery` private, as they are only
used by `run`.

Simplify `struct tracker`, store the discovery output separately
(`struct tracker` is now responsible for a single thing).

Enclose the `parallel_for_each` over requests in a common coroutine
which keeps alive all the necessary things for the loop body and
performs the last step which was previously inside a `then`.
2022-07-20 19:39:29 +02:00
Kamil Braun
6d9d493e2a service/raft: raft_group0: rename create_server_for_group to create_server_for_group0 2022-07-20 19:39:28 +02:00
Kamil Braun
54d9219257 service/raft: raft_group0: extract start_server_for_group0 function
Extract part of the code from `join_group0`. Add some comments.
This part will be reused.
2022-07-20 19:38:53 +02:00
Kamil Braun
dca1ce52ed service/raft: raft_group0: create a private section
Move member functions and fields used internally by the `raft_group0`
class into a private section.

Write some comments.
2022-07-20 19:38:53 +02:00
Kamil Braun
d28170b1a5 service/raft: discovery: seeds may contain self
The set of seeds passed to the discovery algorithm may contain `self`.
The implementation will filter the `self` out (it calls `step(seeds)`;
`step` iterates over the given list of peers and ignores `_self`).

Specify this at the `discovery` constructor declaration site.

Simplify the code constructing `persistent_discovery` in
`raft_group0::discover_group0` using this assumption.
2022-07-20 19:38:53 +02:00
Wojciech Mitros
5590493abd wasm: test instances reuse
Add a test for a wasm aggregate function
which uses the new metrics to check if the cache has
been hit at least once.

Also check that the cache can get reused on different
queries, by testing that the number of queries is
higher than the number of cache misses.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-07-20 18:19:25 +02:00
Wojciech Mitros
9281ba3919 wasm: reuse UDF instances
When executing a wasm UDF, most of the time is spent on
setting up the instance. To minimize its cost, we reuse
the instance using wasm::instance_cache.

This patch adds a wasm instance cache, that stores
a wasmtime instance for each UDF and scheduling group.
The instances are evicted using LRU strategy. The
cache may store some entries for the UDF after evicting
the instance, but they are evicted when the corresponding
UDF is dropped, which greatly limits their number.

The size of stored instances is estimated using the size
of their WASM memories. In order to be able to read the
size of memory, we require that the memory is exported
by the client.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-07-20 18:19:22 +02:00
Wojciech Mitros
d7a933068a schema_tables: simplify merge_functions and avoid extra compilation
Currently, we have 2 mere_functions methods, where one is only the only
call to the other. We can replace them with a simple one.

The merge_functions method compiles a UDF (using create_func) only to
read its signature. We can avoid that by reading it from the row ourselves.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-07-20 18:10:21 +02:00
Nadav Har'El
59f684d2c3 Merge 'doc: fix the links to Alternator' from Anna Stuchlik
The Scylla Alternator documentation is now part of the Scylla user documentation (previously it was dev documentation).
This PR updates the links to the Alternator documentation.

Closes #11089

* github.com:scylladb/scylla:
  doc: update the link to Alternator for DynamoDB users
  doc: fix the links to Alternator
2022-07-20 18:38:08 +03:00
Anna Stuchlik
c1122c8f54 doc: update the link to Alternator for DynamoDB users 2022-07-20 17:02:48 +02:00
Avi Kivity
13a64d8ab2 Merge 'Remove all remaining restrictions classes' from Jan Ciołek
This PR removes all code that used classes `restriction`, `restrictions` and their children.

There were two fields in `statement_restrictions` that needed to be dealt with: `_clustering_columns_restrictions` and `_nonprimary_key_restrictions`.

Each function was reimplemented to operate on the new expression representaiion and eventually these fields weren't needed anymore.

After that the restriction classes weren't used anymore and could be deleted as well.

Now all of the code responsible for analyzing WHERE clause and planning a query works on expressions.

Closes #11069

* github.com:scylladb/scylla:
  cql3: Remove all remaining restrictions code
  cql3: Move a function from restrictions class to the test
  cql3: Remove initial_key_restrictions
  cql3: expr: Remove convert_to_restriction
  cql3: Remove _new from _new_nonprimary_key_restrictions
  cql3: Remove _nonprimary_key_restrictions field
  cql3: Reimplement uses of _nonprimary_key_restrictions using expression
  cql3: Keep a map of single column nonprimary key restrictions
  cql3: Remove _new from _new_clustering_columns_restrictions
  cql3: Remove _clustering_columns_restrictions from statement_restrictions
  cql3: Use a variable instead of dynamic cast
  cql3: Use the new map of single column clustering restrictions
  cql3: Keep a map of single column clustering key restrictions
  cql3: Return an expression in get_clustering_columns_restrctions()
  cql3: Reimplement _clustering_columns_restrictions->has_supporting_index()
  cql3: Don't create single element conjunction
  cql3: Add expr::index_supports_some_column
  cql3: Reimplement has_unrestricted_components()
  cql3: Reimplement _clustering_columns_restrictions->need_filtering()
  cql3: Reimplement num_prefix_columns_that_need_not_be_filtered
  cql3: Use the new clustering restrictions field instead of ->expression
  cql3: Reimplement _clustering_columns_restrictions->size() using expressions
  cql3: Reimplement _clustering_columns_restrictions->get_column_defs() using expressions
  cql3: Reimplement _clustering_columns_restrictions->is_all_eq() using expressions
  cql3: expr: Add has_only_eq_binops function
  cql3: Reimplement _clustering_columns_restrictions->empty() using expressions
2022-07-20 18:01:15 +03:00
Anna Stuchlik
ee53105e12 doc: fix the links to Alternator 2022-07-20 16:52:52 +02:00
Avi Kivity
89a935625d Merge 'doc: create a new CQL reference section to aggregate CQL information - V2' from Anna Stuchlik
This PR is V2 of https://github.com/scylladb/scylla/pull/11065.

The scope of updates:
- Created a _/cql/_ folder.
- Moved all the CQL-related pages from _/getting-started/_ to _/cql/_ .
- Moved the _cql-extensions.md_ file from _/dev/_ to  _/cql/_ .
- Removed the outdated files and references.
- Updated the links to the CQL-related pages.

Closes #11083

* github.com:scylladb/scylla:
  doc: update the links following the content reorganization
  doc: remove the outdated cql pages and delete them from the indexes
  doc: add index.rst for the cql folder and add it to toctree
  doc: move cql-extensions.md from the dev docs to the cql folder
  doc: move the CQL pages from getting-started to cql
  doc: add redirections for the CQL pages
2022-07-20 17:28:24 +03:00
Benny Halevy
6fd479b151 Update seastar submodule
* seastar 6d4a0cb7a3...1d4432ed28 (11):
  > rpc: Ignore failed future in connection::send()
  > install-dependencies: centos-{7,8}: use {DTS,GTS}-11 instead of {DTS,GTS}-9
  > coroutine: change access specifier of seastar::task member
  > Merge 'build: try to enable io_uring if it is not specified' from Kefu Chai
  > build: find_package() only if necessary
  > Merge "rpc: handle connection negotiation error during stream sink creation " from Gleb
  > build: try to enable io_uring if it is not specified
  > test: rpc: add test that inject error during stream connection negotiation.
  > test: rpc: inject errors only on streaming connections
  > test: rpc: allow specifying after what limit a connection start producing errors
  > rpc: do not destroy stream connection without stopping in case of negotiation failure
Fixes #10943

Closes #11082
2022-07-20 16:25:21 +03:00
Anna Stuchlik
4f2b12becc doc: update the links following the content reorganization 2022-07-20 13:07:51 +02:00
Anna Stuchlik
969a7b44e9 doc: remove the outdated cql pages and delete them from the indexes 2022-07-20 12:34:09 +02:00
Botond Dénes
014c5b56a3 query-result: move last_pos up to query::result
query_result was the wrong place to put last position into. It is only
included in data-responses, but not on digest-responses. If we want to
support empty pages from replicas, both data and digest responses have
to include the last position. So hoist up the last position to the
parent structure: query::result. This is a breaking change inter-node
ABI wise, but it is fine: the current code wasn't released yet.

Closes #11072
2022-07-20 13:28:09 +03:00
Anna Stuchlik
8915c5df0b doc: add index.rst for the cql folder and add it to toctree 2022-07-20 12:25:13 +02:00
Tomasz Grabiec
04f9a150be Merge 'raft: split can_vote field form server_address to separate struct' from Kamil Braun
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many context where `can_vote` is
irrelevant.

Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.

Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. Replace the constructor with helper
functions which specify in comments that they are supposed to be used in
tests or in contexts where `info` doesn't matter (e.g. when checking
presence in an `unordered_set`, where the equality operator and hash
operate only on the `id`).

Closes #11047

* github.com:scylladb/scylla:
  raft: fsm: fix `entry_size` calculation for config entries
  raft: split `can_vote` field from `server_address` to separate struct
  serializer_impl: generalize (de)serialization of `unordered_set`
  to_string: generalize `operator<<` for `unordered_set`
2022-07-20 12:20:52 +02:00
Anna Stuchlik
4f897f149d doc: move cql-extensions.md from the dev docs to the cql folder 2022-07-20 12:20:25 +02:00
Anna Stuchlik
2dda3dbfb5 doc: move the CQL pages from getting-started to cql 2022-07-20 12:18:54 +02:00
Anna Stuchlik
862bf306ad doc: add redirections for the CQL pages 2022-07-20 12:12:53 +02:00
Asias He
482ee369d0 storage_service: Increase watchdog_interval for node ops
The node operations using node_ops_cmd have the following procedure:

1) Send node_ops_cmd::replace_prepare to all nodes
2) Send node_ops_cmd::replace_heartbeat to all nodes

In a large cluster 1) might take a long time to finish, as a result when
the node starts to perform 2), the heartbeat timer on the peer nodes which
is 30s might have already timed out. This fails the whole node
opeartions.

We have patches to make 1) more efficient and faster.

https://github.com/scylladb/scylla/pull/10850
https://github.com/scylladb/scylla/pull/10822

In addition to that, this patch increases the heartbeat timeout to reduce
the false positive of timeout.

Refs #10337
Refs #11078

Closes #11081
2022-07-20 12:56:17 +03:00
Jan Ciolek
599bcd6ea7 cql3: Remove all remaining restrictions code
The classes restriction, restrictions and its children
aren't used anywhere now and can be safely removed.

Some includes need to be modified for the code to compile.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:31 +02:00
Jan Ciolek
bff0b87c18 cql3: Move a function from restrictions class to the test
statement_restrictions_test uses a function that is defined
in multi_column_restriction.hh.

This file will be removed soon and for the test to still work
the function is moved to the test source.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:31 +02:00
Jan Ciolek
b269e5a24d cql3: Remove initial_key_restrictions
initial_key restrictions was a class used by statement_restrictions
to represent empty restrictions of different types and simplify
restriction merging logic. They are not used anymore and can
be removed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:31 +02:00
Jan Ciolek
4f92c64e1b cql3: expr: Remove convert_to_restriction
This function isn't used anywhere anymore
and can be removed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:31 +02:00
Jan Ciolek
d7e954307f cql3: Remove _new from _new_nonprimary_key_restrictions
The _new prefix was used to distinguish the new field
from the old represenation.

Now the new field has fully replaced the old one
and _new can be removed from its name.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:31 +02:00
Jan Ciolek
b6ae72f095 cql3: Remove _nonprimary_key_restrictions field
All code that made use of _nonprimary_key_restrictions
has been modified to use _new_nonprimary_key_restrictions
instead.

The field can be removed.

Additionally the old code responsible for adding new restrictions
can be fully removed, everything is now done using add_restriction.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:31 +02:00
Jan Ciolek
9d1ba07471 cql3: Reimplement uses of _nonprimary_key_restrictions using expression
All parts of the code that use _nonprimary_key_restrictions
are changed to use _new_nonprimary_key_restrictions instead.
I decided not to split this into multiple commits,
as there isn't a lot of changes and they are
analogous to the ones done before for partition
and clustering columns.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:30 +02:00
Jan Ciolek
2c28554390 cql3: Keep a map of single column nonprimary key restrictions
Keep a map of extracted restrictions for each restricted nonprimar column.
This map will be useful, just like the ones for clustering and partition
columns.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:30 +02:00
Jan Ciolek
0e8f437f24 cql3: Remove _new from _new_clustering_columns_restrictions
The _new was used to distinguish from the old field
during transition. Now the old field has been deleted
and the new one can take its place.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:27 +02:00
Botond Dénes
6e20cb3255 Merge 'database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard' from Benny Halevy
Currently, all the mutations this test generates are applied on shard 0.
In rare cases, this may lead to the following crash, when the flushed
sstable doesn't contain any key that belongs to the current shard,
as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log
```
WARN  2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}}
ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db)
ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322
```

Instead, generate random keys and apply them on their
owning shard, and truncate all database shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11066

* github.com:scylladb/scylla:
  database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard
  table: try_flush_memtable_to_sstable: consume: close reader on error
2022-07-20 09:06:07 +03:00
Jan Ciolek
4fac3be535 cql3: Remove _clustering_columns_restrictions from statement_restrictions
All code using the _clustering_columns_restrictions field
has been modified to instead use _new_clustering_columns_restrictions
expression representation.

The old field can now be removed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 00:41:22 +02:00
Jan Ciolek
bf3f00413e cql3: Use a variable instead of dynamic cast
There is a dynamic cast used to determine whether
clustering columns are restricted by a multi column
restriction.

Instead of doing that we can just use the _has_multi_column
variable.

It's also used a few lines higher, which means that
it should be already initialized.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 00:41:22 +02:00
Jan Ciolek
a0884760ab cql3: Use the new map of single column clustering restrictions
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 00:41:07 +02:00
Avi Kivity
5a30f9b789 Merge 'Distributed aggregate query' from Michał Jadwiszczak
This PR extends #9209. It consists of 2 main points:

To enable parallelization of user-defined aggregates, reduction function was added to UDA definition. Reduction function is optional and it has to be scalar function that takes 2 arguments with type of UDA's state and returns UDA's state

All currently implemented native aggregates got their reducible counterpart, which return their state as final result, so it can be reduced with other result. Hence all native aggregates can now be distributed.

Local 3-node cluster made with current master. `node1` updated to this branch. Accessing node with `ccm <node-name> cqlsh`

I've tested belowed things from both old and new node:
- creating UDA with reduce function - not allowed
- selecting count(*) - distributed
- selecting other aggregate function - not distributed

Fixes: #10224

Closes #10295

* github.com:scylladb/scylla:
  test: add tests for parallelized aggregates
  test: cql3: Add UDA REDUCEFUNC test
  forward_service: enable multiple selection
  forward_service: support UDA and native aggregate parallelization
  cql3:functions: Add cql3::functions::functions::mock_get()
  cql3: selection: detect parallelize reduction type
  db,cql3: Move part of cql3's function into db
  selection: detect if selectors factory contains only simple selectors
  cql3: reducible aggregates
  DB: Add `scylla_aggregates` system table
  db,gms: Add SCYLLA_AGGREGATES schema features
  CQL3: Add reduce function to UDA
  gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature
2022-07-19 19:05:19 +03:00
Avi Kivity
1f21c1ecc8 Merge "Add IO throttling to streaming class" from Pavel E
"
Same thing was done for compaction class some time ago, now
it's time for streaming to keep repair-generated IO in bounds.
This set mostly resembles the one for compaction IO class with
the exception that boot-time reshard/reshape currently runs in
streaming class, but that's nod great if the class is throttled,
so the set also moves boot-time IO into default IO class.
"

* 'br-streaming-class-throttling-2' of https://github.com/xemul/scylla:
  distributed_loader: Populate keyspaces in default class
  streaming: Maintain class bandwidth
  streaming: Pass db::config& to manager constructor
  config: Add stream_io_throughput_mb_per_sec option
  sstables: Keep priority class on sstable_directory
2022-07-19 17:10:25 +03:00
Jan Ciolek
9a03a09422 cql3: Keep a map of single column clustering key restrictions
Having this map is useful in a bunch of places.

To keep code simple it could be created from scratch each time,
but it's also used in do_filter, so this could actually
affect performance.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-19 16:02:01 +02:00
Jan Ciolek
2b7ffd57fb cql3: Return an expression in get_clustering_columns_restrctions()
get_clustering_columns_restrctions() used to return
a shared pointer to the clustering_restrictions class.

Now everything is being converted to expression,
so it should return an expression as well.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-19 16:02:01 +02:00
Jan Ciolek
ebbbc3291a cql3: Reimplement _clustering_columns_restrictions->has_supporting_index()
The code is copied from the corresponding restrictions classes.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-19 16:01:40 +02:00
Pavel Emelyanov
07460761fb Merge "Make compaction_static_shares and memtable_flush_static_shares live updateable" from Igor Ribeiro Barbosa Duarte (3):
Currently, after updating the static shares it's necessary
to restart the cluster. This patch series makes
compaction_static_shares and memtable_flush_static_shares
live updateable so that this restart isn't necessary anymore.

dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_liveupdate_compaction_static_shares
ci: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1412/

* https://github.com/igorribeiroduarte/scylla/tree/make_compaction_static_shares_live_updateable:
  memtable_flush: Make memtable_flush_static_shares liveupdateable
  compaction: Make compaction_static_shares liveupdateable
  backlog_controller: Unify backlog_controller constructors
2022-07-19 16:55:55 +03:00
Benny Halevy
1c26d49fba database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard
Currently, all the mutations this test generates are applied on shard 0.
In rare cases, this may lead to the following crash, when the flushed
sstable doesn't contain any key that belongs to the current shard,
as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log
```
WARN  2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}}
ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db)
ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322
```

Instead, generate random keys and apply them on their
owning shard, and truncate all database shards.

Fixes #11076

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-19 16:55:11 +03:00
Jan Ciolek
991fd5e4db cql3: Don't create single element conjunction
In case the expression is empty and we want to merge it
with a new restriction we can just set the expression
to the new restriction.

Later this will make it easier to distinguish which case
of multi column restrictions are we dealing with.

IN and EQ can only have a single binary operator,
but slice might have two.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-19 15:38:33 +02:00
Jan Ciolek
c7495fa59e cql3: Add expr::index_supports_some_column
Add a function that checks if there is an index
which supports one of the columns present in
the given expression.

This functionality will soon be needed for
clustering and nonprimary columns so it's
good to separate into a reusable function.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-19 15:38:20 +02:00
Benny Halevy
f60ff44fdf table: try_flush_memtable_to_sstable: consume: close reader on error
If an exception is throws in `consume` before
write_memtable_to_sstable is called or if the latter fails,
we must close the reader passed to it.

Fixes #11075

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-19 16:35:59 +03:00
Igor Ribeiro Barbosa Duarte
3b19bcf1a1 memtable_flush: Make memtable_flush_static_shares liveupdateable
This patch makes memtable_flush_static_shares liveupdateable
to avoid having to restart the cluster after updating
this config.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-07-19 10:10:46 -03:00
Igor Ribeiro Barbosa Duarte
8dd0f4672d compaction: Make compaction_static_shares liveupdateable
This patch makes compaction_static_shares liveupdateable
to avoid having to restart the cluster after updating
this config.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-07-19 10:10:46 -03:00
Igor Ribeiro Barbosa Duarte
c2ee6492e6 backlog_controller: Unify backlog_controller constructors
This patch adds the _static_shares variable to the backlog_controller so that
instead of having to use a separate constructor when controller is disabled,
we can use a single constructor and periodically check on the adjust method
if we should use the static shares or the controller. This will be useful on
the next patches to make compaction_static_shares and memtable_flush_static_shares
live updateable.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-07-19 10:06:12 -03:00
Takuya ASADA
752be6536a rename relocatable packages
Currently, we use following naming convention for relocatable package
filename:
  ${package_name}-${arch}-package-${version}.${release}.tar.gz
But this is very different with Linux standard packaging system such as
.rpm and .deb.
Let's align the convention to .rpm style, so new convention should be:
  ${package_name}-${version}-${release}.${arch}.tar.gz

Closes #9799

Closes #10891

* tools/java de8289690e...d0143b447c (1):
  > build_reloc.sh: rename relocatable packages

* tools/jmx fe351e8...06f2735 (1):
  > build_reloc.sh: rename relocatable packages

* tools/python3 e48dcc2...bf6e892 (1):
  > reloc/build_reloc.sh: rename relocatable packages
2022-07-19 15:46:49 +03:00
David Garcia
dcb5550bc3 doc: update url to docs.scylladb.com
Closes #11050
2022-07-19 13:42:25 +03:00
Pavel Emelyanov
85d32485d9 config: Mark compaction_throughput_mb_per_sec option as Used
Otherwise it's not shown in the --help output.
Should've been the part of 868c3be0

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220716085221.26634-1-xemul@scylladb.com>
2022-07-19 13:18:17 +03:00
Pavel Emelyanov
55d4fa49f7 distributed_loader: Populate keyspaces in default class
The streaming class throughput can be limitd with the respective option.
Doing boot-time reshard/reshape doesn't need to obey it, as the node is
not yet up but instead should get there as soon as possible.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 12:21:13 +03:00
Pavel Emelyanov
96d6be7daf streaming: Maintain class bandwidth
Same as was done in b112a983 for compaction manager

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 12:19:56 +03:00
Pavel Emelyanov
a246b6d3eb streaming: Pass db::config& to manager constructor
The stream_manager will bookkeep the streaming bandwidth option, to
subscribe on its changes it needs the config reference. It would be
better if it was stream_manager::config, but currently subscription on
db::config::<stuff> updates is not very shard-friendly, so we need to
carry the config reference itself around.

Similar trouble is there for compaction_manager. The option is passed
through its own config, but the config is created on each shard by
database code. Stream manager config would be created once by main code
on shard 0.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 12:18:08 +03:00
Pavel Emelyanov
7d0110cd31 config: Add stream_io_throughput_mb_per_sec option
It's going to control the bandwidth for the streaming prio class.
For now it's jsut added but does't work for real

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 12:14:41 +03:00
Pavel Emelyanov
a56e2c83f3 sstables: Keep priority class on sstable_directory
Current code accepts priotity class as an argument to various functions
that need it and all its callers use streaming class. Next patches will
needs to sometimes use default class, but it will require heavy patching
of the distributed loader. Things get simpler if the priority class is
kept on sstable_directory on start.

This change also simplifies the ongoing effort on unification of sched
and IO classes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 12:14:41 +03:00
Pavel Emelyanov
2a63c6f647 scylla-gdb: Don't show empty smp queues
When collecting a histogram of smp-queues population empty queues also
count, but it makes the output very long and not very informative.
Skipping empty queues increases signal / noise ratio.

v2:
- print the number of omitted empty queues

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220718180912.2931-1-xemul@scylladb.com>
2022-07-19 11:33:24 +03:00
Gleb Natapov
d40106d3a9 raft: remove unused code
Message-Id: <YtUh8Hs+nQQ8+hLY@scylladb.com>
2022-07-18 21:21:45 +03:00
Botond Dénes
af31a89afa scylla-gdb.py: scylla fiber: reverse backward fiber
We want to print the backwards fiber in reverse, starting with the
furthest-away task in the chain. For this, the task list returned by
`_walk()` has to be reversed.

Closes #11062
2022-07-18 20:20:37 +03:00
Kamil Braun
7c377cc457 raft: fsm: fix entry_size calculation for config entries
We forgot about `can_vote`.

Stumbled on this while separating `can_vote` to separate struct.
Note that `entry_size` is still inaccurate (#11068) but the patch is an
improvement.

Refs: #11068
2022-07-18 18:24:50 +02:00
Kamil Braun
daf9c53bb8 raft: split can_vote field from server_address to separate struct
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many context where `can_vote` is
irrelevant.

Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.

Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
  transparent hash and comparator functions, which allow invoking
  `contains` and similar functions by providing a different key type (in
  this case `raft::server_id` in set of addresses, for example).
- constructing addresses without `info`s in tests. For this we provide
  helper functions in the test helpers module and use them.
2022-07-18 18:22:10 +02:00
Kamil Braun
f5d274d866 serializer_impl: generalize (de)serialization of unordered_set
Be able to (de)serialize sets with different Hash or KeyEqual
specializations.
2022-07-18 18:20:33 +02:00
Kamil Braun
5907049ecc to_string: generalize operator<< for unordered_set
Be able to print sets with different Hash or KeyEqual specializations.
Use a variadic template, as it was done for `operator<<` for
`unordered_map`.
2022-07-18 18:20:33 +02:00
Jan Ciolek
c2d20adc49 cql3: Reimplement has_unrestricted_components()
The code is copied from:
clustering_key_restrictions::has_unrestricted_components

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:49:23 +02:00
Jan Ciolek
85ebe99eb5 cql3: Reimplement _clustering_columns_restrictions->need_filtering()
The code is copied from:
single_column_primary_key_restrictions<clustering_key>::needs_filtering

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:49:09 +02:00
Jan Ciolek
d3a2a77b99 cql3: Reimplement num_prefix_columns_that_need_not_be_filtered
The code is copied from:
single_column_primary_key_restrictions<clustering_key>
::num_prefix_columns_that_need_not_be_filtered

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:48:55 +02:00
Jan Ciolek
1914d21f7b cql3: Use the new clustering restrictions field instead of ->expression
Instead of writing
_clustering_columns_restrictions->expression
It's better to use the new field:
_new_clustering_columns_restrictions

These expressions should be the same.
It removes another use of the unwanted restrictions field.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:48:36 +02:00
Jan Ciolek
360087c580 cql3: Reimplement _clustering_columns_restrictions->size() using expressions
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:46:14 +02:00
Jan Ciolek
92df275868 cql3: Reimplement _clustering_columns_restrictions->get_column_defs() using expressions
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:45:50 +02:00
Jan Ciolek
88da7ae0dc cql3: Reimplement _clustering_columns_restrictions->is_all_eq() using expressions
Use the freshly added function to replace old calls to ->is_all_eq().

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:45:35 +02:00
Jan Ciolek
6cf0981aa6 cql3: expr: Add has_only_eq_binops function
Add a function which checks that an expression
contains only binary operators with '='.

Right now this check is done only in a single place,
but soon the same check will have to be done
for clustering columns as well, so the code
is moved to a separate function to prevent duplication.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:45:06 +02:00
Jan Ciolek
b84787efac cql3: Reimplement _clustering_columns_restrictions->empty() using expressions
All occurences of _clustering_columns_restrictions->empty()
have been replaced with code that operates on the new
expression representation: _new_clustering_columns_restrictions.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-18 17:44:50 +02:00
Anna Stuchlik
274691f45e doc: fix the example of UDT in the SELECT statement
Closes #11067
2022-07-18 18:26:06 +03:00
Pavel Emelyanov
62d95f09de view: De-futurize make_view_update_builder()
It doesn't sleep, just returns ready future with builder

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1384
       it's red because e-mail notification is broken (scylla-pkg#2988)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220718132529.30751-1-xemul@scylladb.com>
2022-07-18 17:15:48 +03:00
Jadw1
7497fda370 test: add tests for parallelized aggregates 2022-07-18 15:25:42 +02:00
Jadw1
c95a0a9fe6 test: cql3: Add UDA REDUCEFUNC test
Adds test checking if reduction function is correctly assigned to UDA
and if the information is stored in `scylla_aggregates` system table.
2022-07-18 15:25:41 +02:00
Jadw1
182438c5f8 forward_service: enable multiple selection
Enables parallelization of query like `SELECT MIN(x), MAX(x)`.
Compatibility is ensured under the same cluster feature as
UDA and native aggregates parallelization. (UDA_NATIVE_PARALLELIZED_AGGREGATION)
2022-07-18 15:25:41 +02:00
Jadw1
29a0be75da forward_service: support UDA and native aggregate parallelization
Enables parallelization of UDA and native aggregates. The way the
query is parallelized is the same as in #9209. Separate reduction
type for `COUNT(*)` is left for compatibility reason.
2022-07-18 15:25:41 +02:00
Jadw1
a0a6d87c1b cql3:functions: Add cql3::functions::functions::mock_get()
`mock_get` was created only for forward_service use, thus it only checks for
aggregate functions if no declared function was found.

The reason for this function is, there is no serialization of `cql3::selection::selection`,
so functions lying underneath these selections has to be refound.

Most of this code is copied from `functions::get()`, however `functions::get()` is not used because it requires to
mock or serialize expressions and `functions::find()` is not enough,
because it does not search for dynamic aggregate functions
2022-07-18 15:25:41 +02:00
Jadw1
6d977fcf88 cql3: selection: detect parallelize reduction type
Detects type of reduction if it is possible. Separate case for
`COUNT(*)` is left for compatibility reason. By now only single
selection is supported.
2022-07-18 15:25:41 +02:00
Jadw1
59498caeca db,cql3: Move part of cql3's function into db
Moving `function`, `function_name` and `aggregate_function` into
db namespace to avoid including cql3 namespace into query-request.
For now, only minimal subset of cql3 function was moved to db.
2022-07-18 15:25:41 +02:00
Jadw1
6b63417bc8 selection: detect if selectors factory contains only simple selectors
Because `selection` is not serializable and it has to be send via network
to parallelize query, we have to mock the selection. To simplify
the mocking, for now only single selectors for aggregate's arguments
are allowed (no casting or other functions as arguments).
2022-07-18 15:25:41 +02:00
Jadw1
0f08c8e099 cql3: reducible aggregates
Introduces reducible aggregates which don't return final result
but accumulator, that can be later reduced.
2022-07-18 15:25:41 +02:00
Jadw1
d13f347621 DB: Add scylla_aggregates system table
Saving information about UDA's reduce function to `scylla_aggregates`
table and distributing it across cluster.
2022-07-18 15:25:37 +02:00
Jadw1
2c46222e31 db,gms: Add SCYLLA_AGGREGATES schema features
This schema feature will be used to guard
system_schema.scylla_aggregates schema table.
2022-07-18 14:18:48 +02:00
Jadw1
d8f3461147 CQL3: Add reduce function to UDA
Add optional field to UDA, that describes reduce function to allow
parallelization of UDA aggregates.
2022-07-18 14:18:48 +02:00
Jadw1
346fb08680 gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature
Feature that indicate whether the cluter supports optional UDA
parameter (reduction function) and parallelization of uda and
native aggregates.
2022-07-18 14:18:48 +02:00
Botond Dénes
9afd2dc428 Merge 'Make compaction manager switch to table abstraction ' from Raphael "Raph" Carvalho
This work gets us a step closer to compaction groups.

Everything in compaction layer but compaction_manager was converted to table_state.

After this work, we can start implementing compaction groups, as each group will be represented by its own table_state. User-triggered operations that span the entire table, not only a group, can be done by calling the manager operation on behalf of each group and then merging the results, if any.

Closes #11028

* github.com:scylladb/scylla:
  compaction: remove forward declaration of replica::table
  compaction_manager: make add() and remove() switch to table_state
  compaction_manager: make run_custom_job() switch to table_state
  compaction_manager: major: switch to table_state
  compaction_manager: scrub: switch to table_state
  compaction_manager: upgrade: switch to table_state
  compaction: table_state: add get_sstables_manager()
  compaction_manager: cleanup: switch to table_state
  compaction_manager: offstrategy: switch to table_state()
  compaction_manager: rewrite_sstables(): switch to table_state
  compaction_manager: make run_with_compaction_disabled() switch to table_state
  compaction_manager: compaction_reenabler: switch to table_state
  compaction_manager: make submit(T) switch to table_state
  compaction_manager: task: switch to table_state
  compaction: table_state: Add is_auto_compaction_disabled_by_user()
  compaction: table_state: Add on_compaction_completion()
  compaction: table_state: Add make_sstable()
  compaction_manager: make can_proceed switch to table_state
  compaction_manager: make stop compaction procedures switch to table_state
  compaction_manager: make get_compactions() switch to table_state
  compaction_manager: change task::update_history() to use table_state instead
  compaction_manager: make can_register_compaction() switch to table_state
  compaction_manager: make get_candidates() switch to table_state
  compaction_manager: make propagate_replacement() switch to table_state
  compaction: Move table::in_strategy_sstables() and switch to table_state
  compaction: table_state: Add maintenance sstable set
  compaction_manager: make has_table_ongoing_compaction() switch to table_state
  compaction_manager: make compaction_disabled() switch to table_state
  compaction_manager: switch to table_state for mapping of compaction_state
  compaction_manager: move task ctor into source
2022-07-18 15:18:29 +03:00
Nadav Har'El
a02db6f928 Merge 'doc: add the upgrade guide from 2021.1 to 2022.1' from Anna Stuchlik
Fix https://github.com/scylladb/scylla-docs/issues/4040
Fix https://github.com/scylladb/scylla-docs/issues/4128

This PR adds the upgrade guides from ScyllaDB Enterprise 2021.1 to 2022.1. They are based on the previous guides.

Closes #11036

* github.com:scylladb/scylla:
  doc: add the description of the new metrics in 2022.1
  doc: remove the upgrade guide for Ubuntu 16.04 (no longer supported in version 2022.1)
  doc: remove the outdated warning
  Update docs/upgrade/_common/upgrade-guide-from-2021.1-to-2022.1-ubuntu-and-debian.rst
  Update docs/upgrade/_common/upgrade_to_2022_warning.rst
  doc: add a space on line 60 to fix the warning
  doc: document metric update for 2022.1
  doc: add the upgrade guide from 2021.1 to 2022.1
2022-07-18 14:14:57 +03:00
Avi Kivity
94b6aebf01 Merge 'Remove dropped table directory' from Benny Halevy
This series adds removal of dropped table directory when it has no remaining snapshots.

There are 2 code paths that take of that:
1. when the table is dropped and there are no active snapshots for it (typically when auto_snapshot disabled).
2. or when the last snapshot is cleared, leaving no other snapshot for a dropped table.

Unit tests were extended to covert these scenarios.

Fixes #10896

Closes #11001

* github.com:scylladb/scylla:
  legacy_schema_migrator: simplify drop_legacy_tables
  database: clear_snapshot: remove dropped table directory when it has no remaining snapshots
  database: clear_snapshot: make it a coroutine and use thread
  database_test: add clear_multiple_snapshots test
  database: make drop_column_family private
  schema_tables: merge_tables_and_views: use drop_table_on_all_shards
  database_test: drop_table_with_snapshots: test auto_snapshot
  database_test: populate_from_quarantine_works: pass optional db:config to do_with_some_data
  database: drop_table_on_all_shards: remove table directory having no snapshots
  sstables: define table_subdirectories
  sstables: officially define pending_delete_dir
  database: add drop_table_on_all_shards
2022-07-18 13:41:35 +03:00
Benny Halevy
3f0402db68 legacy_schema_migrator: simplify drop_legacy_tables
There is no need for utils::make_joinpoint now
that the function calls replica::database::drop_table_on_all_shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-18 10:28:18 +03:00
Benny Halevy
bbbbea65fb database: clear_snapshot: remove dropped table directory when it has no remaining snapshots
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
c70a675d77 database: clear_snapshot: make it a coroutine and use thread
and use an async thread around `directory_lister`
rather than `lister::scan_dir` to simplify the implementation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
e710fe527c database_test: add clear_multiple_snapshots test
Based on the `clear_snapshot` test.

Test with multiple snapshots and different
combinations of parameters to database::clear_snapshot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
d7564b9081 database: make drop_column_family private
Now that all users are converted to use the public
entry point - drop_table_on_all.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
71aad45757 schema_tables: merge_tables_and_views: use drop_table_on_all_shards
So that the dropped table's directory can be
removed after it has been dropped on all shards
if it has no snapshots.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
ae3b1b5a64 database_test: drop_table_with_snapshots: test auto_snapshot
Refactor test_drop_table_with_auto_snapshot out of
drop_table_with_snapshots, adding a auto_snapshot param,
controlling how to configure the cql_test_env db:.config::auto_snapshot,
so we can test both cases - auto_snapshot enabled and disabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
af6805dd75 database_test: populate_from_quarantine_works: pass optional db:config to do_with_some_data
Instead of just `tmpdir_for_data`, so we can easily set auto_snapshot
for `drop_table_with_snapshots` in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
2e37dcf62a database: drop_table_on_all_shards: remove table directory having no snapshots
If the table to remove has no snapshots then
completely remove its directory on storage
as the left-over directory slows down operations on the keyspace
and makes searching for live tables harder.

Fixes #10896

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
dd481e9f58 sstables: define table_subdirectories
Define a constexpr array of all official table sub-dorectories.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
c4a42c3a3f sstables: officially define pending_delete_dir
Rather than using the "pending_delete" string
in `pending_delete_dir_basename()`, so it can
be orderly removed in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Benny Halevy
e005629afb database: add drop_table_on_all_shards
Runs drop_column_family on all database shards.
Will be extended later to consider removing the table directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Raphael S. Carvalho
246e945086 compaction: remove forward declaration of replica::table
compaction_manager.cc still cannot stop including replica/database.hh
because upgrade and scrub still take replica::database as param,
but I'll remove it soon in another series.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
a94d974835 compaction_manager: make add() and remove() switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
31655acb5e compaction_manager: make run_custom_job() switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
9a1efc69d0 compaction_manager: major: switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
cebe6e22cb compaction_manager: scrub: switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
d29f7070d9 compaction_manager: upgrade: switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
c2678ca661 compaction: table_state: add get_sstables_manager()
That will be needed for retrieving sstable manager in
perform_sstable_upgrade().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
bdd049afd6 compaction_manager: cleanup: switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
f547e0f2fb compaction_manager: offstrategy: switch to table_state()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
538d412fba compaction_manager: rewrite_sstables(): switch to table_state
rewrite_sstables() is used by maintenance compactions that perform
an operation on a single file at a time.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
79e385057f compaction_manager: make run_with_compaction_disabled() switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
79f91fe61e compaction_manager: compaction_reenabler: switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
7c1d178f4e compaction_manager: make submit(T) switch to table_state
Now that submit() switched to table_state, compaction_reenabler
and friends can switch to table_state too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
a176022272 compaction_manager: task: switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
43136a3ca7 compaction: table_state: Add is_auto_compaction_disabled_by_user()
auto_compaction_disabled_by_user is a configuration that can be enabled
or disabled on a particular table. We're adding this interface to
avoid having to push the configuration for every compaction_state,
which would result in redundant information as the configuration
value is the same for all table states.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
1deeeff825 compaction: table_state: Add on_compaction_completion()
The idea is that we'll have a single on-completion interface for both
"in-strategy" and off-strategy compactions, so not to pollute table_state
with one interface for each.
replica::table::on_compaction_completion is being moved into private namespace.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
1520580212 compaction: table_state: Add make_sstable()
compaction_manager needs this interface when setting the sstable
creation lambda in compaction_descriptor, which is then forwarded
into the actual compaction procedure.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
956c3997cb compaction_manager: make can_proceed switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
7a9908dbf1 compaction_manager: make stop compaction procedures switch to table_state
they're used to stop all ongoing compaction on behalf of a given table
T. Today, each table has a single table_state representing it, but after
we implement compaction groups, we'll need to call the procedure for
each group in a table. But the discussion doesn't belong here, as
compaction group work will only come later. By the time being, we're
only making compaction manager fully switch to table_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
b6126395e1 compaction_manager: make get_compactions() switch to table_state
The only external user of get_compactions() doesn't use any filtering,
so after table_state switch, one will be allowed to get all jobs
running associated with a table_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
309d73c584 compaction_manager: change task::update_history() to use table_state instead
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
598ede607f compaction_manager: make can_register_compaction() switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
61510af62a compaction_manager: make get_candidates() switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
b5417096e2 compaction_manager: make propagate_replacement() switch to table_state
propagate_replacement is used by incremental compaction to notify
ongoing compaction about sstable list updates, such that the
ongoing job won't hold reference to exhausted sstables.

So it needs to switch to table_state, too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
cb05142d58 compaction: Move table::in_strategy_sstables() and switch to table_state
in_strategy_sstables() doesn't have to be implemented in table, as it's
simply about main set with maintenance and staging files filtered out.

Also, let's make it switch to table_state as part of ongoing work.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
23e21ed5bc compaction: table_state: Add maintenance sstable set
Needed for off-strategy compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
e4d9cdf284 compaction_manager: make has_table_ongoing_compaction() switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
ff9e9524e6 compaction_manager: make compaction_disabled() switch to table_state
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
b47ed727c7 compaction_manager: switch to table_state for mapping of compaction_state
manager stores a state for each table. As we're transitioning towards
table_state, the mapping of a table to compaction state will now use
table_state ptr as key. table_state ptr is stable and its lifetime
is the same as table.

we're temporarily adding a ptr to compaction_state, as there's lots
of dependency on replica::table, but we'll get rid of it once
we complete the transition.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
45a4f8d1fa compaction_manager: move task ctor into source
That's to be able to get table_state from table in subsequent patch,
as table only has a forward declaration to it in compaction_manager.hh
to avoid including database.hh.
Once everything is moved to table_state, then ctor can be moved
back into header.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-16 21:35:06 -03:00
Raphael S. Carvalho
4bfcead2ba compaction_manager: stop using infinite loop in run_offstrategy_compaction()
we can have a better flow than infinite loop -> break for exit
condition.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11045
2022-07-16 16:39:58 +03:00
Botond Dénes
b3227ee9b4 CODEOWNERS: add @psarna and @nyh as owners for docs/alternator
Closes #11048
2022-07-16 11:39:04 +03:00
Aleksandra Martyniuk
7871989551 api: list of the user keyspaces contains only user keyspaces
storage_service/keyspaces?type=user along with user keyspaces returned
the keyspaces that were internal but non-system.

The list of the keyspaces for the user option
(storage_service/keyspaces?type=user) contains neither system nor
internal but only user keyspaces.

Fixes: #11042

Closes #11049
2022-07-15 20:42:30 +02:00
Michael Livshin
ca21ce8e6f utils: logalloc: fix indentation
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
bcb7404a0e utils: logalloc: split the reclaim_timer in compact_and_evict_locked()
(Into one for the compact part and one for the evict part)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
007d8fb5c9 utils: logalloc: report segment stats if reclaim_segments() times out
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
1d700442ae utils: logalloc: reclaim_timer: add optional extra log callback
The idea is to let the caller add arbitrary extra info to the timeout
report.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
abd7b9f01c utils: logalloc: reclaim_timer: report non-decreasing durations
The hope is that this reduces logspam without losing utility.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
07fdcb268e utils: logalloc: have reclaim_timer print reserve limits
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
256b911fbd utils: logalloc: move reclaim timer destructor for more readability
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
c15e384507 utils: logalloc: define a proper bundle type for reclaim_timer stats
And define/use arithmetics on it.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
0eefbfa3cc utils: logalloc: add arithmetic operations to segment_pool::stats
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Michael Livshin
3fced65542 utils: logalloc: have reclaim timers detect being nested
Make sure that inner timers don't waste CPU measuring anything.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
76ca93b779 utils: logalloc: add more reclaim_timers
Measure stalls at higher resolution.

Refs #6189

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
42db63d012 utils: logalloc: move reclaim_timer to compact_and_evict_locked
track compact_and_evict_locked timing from
all call paths, not only from compact_and_evict.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
fd2b4a4b7d utils: logalloc: pull reclaim_timer definition forward
So it can be used in functions defined earlier in the source file
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
33785d261e utils: logalloc: reclaim_timer make tracker optional
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
acd82d3b25 utils: logalloc: reclaim_timer: print backtrace if stall detected
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
239992f16c utils: logalloc: reclaim_timer: get call site name
Before adding even more call sites, print the call site
name in the report.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
c4d64c3bf7 utils: logalloc: reclaim_timer: rename set_result
Rename set_result to set_memory_released
to make it clearer what the result means.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
5ce0038e6a utils: logalloc: reclaim_timer: rename _reserve_segments member
Rename reclaim_timer::_reserve_segments to _segments_to_release
as it is clearer and more suitable for later patches
that will add reclaim_timers in more functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Benny Halevy
c34d1a7705 utils: logalloc: reclaim_timer round up microseconds
better report 29000 us than 28999 us.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-14 19:40:09 +03:00
Avi Kivity
1cb64de8d8 Merge 'Codeowners update' from Botond Dénes
Remove some stale entries, add new entries for docs/.

Closes #11046

* github.com:scylladb/scylla:
  CODEOWNERS: add owners for docs/
  CODEOWNERS: remove @haaawk
2022-07-14 15:51:08 +03:00
Botond Dénes
a7a36c5189 CODEOWNERS: add owners for docs/
User documentation was recently migrated to scylla.git, and this is
maintained by non scylla-core people. Add entries for docs/ so they are
notified when somebody submits changes to docs/.
2022-07-14 15:43:04 +03:00
Botond Dénes
2af12bbaaa CODEOWNERS: remove @haaawk
He is no longer with the company.
2022-07-14 15:41:26 +03:00
Anna Stuchlik
68750b7612 doc: migrate the update about dropping tables from the scylla-docs repo
Closes #11032
2022-07-14 14:08:54 +03:00
Anna Stuchlik
95725c04d1 doc: remove the info about Katacoda and the link to the nonexistent lab
Closes #11031
2022-07-14 14:05:59 +03:00
Anna Stuchlik
3254378375 doc: add the description of the new metrics in 2022.1 2022-07-14 12:29:46 +02:00
Avi Kivity
e69e485396 Merge 'Preparatory work for compaction manager switch to table state' from Raphael "Raph" Carvalho
These are cleanups needed for upcoming series that will make manager switch to table abstraction.

Closes #11037

* github.com:scylladb/scylla:
  compaction_manager: remove unused variable in rewrite_sstable()
  table: remove ref from on_compaction_completion() signature
  table: use compaction_completion_desc to describe changes for off-strategy
  compaction_manager: rename table_state's get_sstable_set to main_sstable_set
2022-07-14 13:08:38 +03:00
Anna Stuchlik
e756bf5067 doc: remove the upgrade guide for Ubuntu 16.04 (no longer supported in version 2022.1) 2022-07-14 12:02:58 +02:00
Anna Stuchlik
68eae4d4e0 doc: remove the outdated warning 2022-07-14 11:55:15 +02:00
Petr Gusev
86299ad194 raft: server: fix comment for set_configuration
follow-up to https://github.com/scylladb/scylla/pull/10905 as discussed in the comments.

Closes #11035
2022-07-14 11:37:35 +02:00
Tomasz Grabiec
cfd785a02b utils: memory_data_sink: Override mandatory buffer_size()
The default implementation aborts. The class has bit rot because it was unused.
2022-07-14 11:56:20 +03:00
David Garcia
0ee5b50bac doc: create migration redirections
Update redirects

Closes #11022
2022-07-14 11:53:41 +03:00
Anna Stuchlik
773be9fc02 Update docs/upgrade/_common/upgrade-guide-from-2021.1-to-2022.1-ubuntu-and-debian.rst
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-07-14 10:24:27 +02:00
Anna Stuchlik
bca51c9bb9 Update docs/upgrade/_common/upgrade_to_2022_warning.rst
Co-authored-by: Tzach Livyatan <tzach.livyatan@gmail.com>
2022-07-14 09:17:51 +02:00
Benny Halevy
dc93564247 storage_proxy: abstract_read_resolver: swallow gate_closed exception
Like other errors triggered on shutdown,
this one is triggered by #8995.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11029
2022-07-14 09:26:34 +03:00
Avi Kivity
98aa3ec99b Update seastar submodule
* seastar 7d8d846b26...6d4a0cb7a3 (18):
  > io: Adjust IO latency goal on fair-queue level
Fixes #10927
  > coroutine: exception: deprecate return_exception(exception_ptr)
  > Merge "Make fair-queue class manipulations noexcept" from Pavel E
  > lowres_timers: Put timeout to infinity if no timers armed
  > util/conversion: support IEC prefix like "Ki"
  > util/conversion: use string_view instead of string
  > thread: fix backtrace termination for s390x on clang
  > *: add fmt::ostream_formatter<> so {fmt} can use operator<<
  > net: Remove operator<< for ipv4_addr
  > rpc-impl: Log "caught exception" when catching exception
  > rpc: Don't format non-trivial types with format specifier
  > sharded: use std::invoke() to call mapper function
  > Merge 'Avoid false-positive warnings in Gcc 12.1.1' from Nadav Har'El
  > tls_test: Remove dns bottle neck + improve read loop in google connect test
  > Revert "sstring: restore compatibility with std::string"
  > sstring: restore compatibility with std::string
  > tls_test: Make google https connect routine loop buffer reads
  > coroutine: add buffer support to async generator

Closes #11033
2022-07-13 18:34:15 +03:00
Anna Stuchlik
ca8c5fd2c4 doc: add a space on line 60 to fix the warning 2022-07-13 16:37:54 +02:00
Raphael S. Carvalho
f6ab220c2a compaction_manager: remove unused variable in rewrite_sstable()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-13 11:26:57 -03:00
Raphael S. Carvalho
d3d9b13d9d table: remove ref from on_compaction_completion() signature
Now update_sstable_lists_on_off_strategy_completion() and
on_compaction_completion() can be called from the same unified
interface.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-13 11:25:51 -03:00
Anna Stuchlik
ee1eb85ebc doc: document metric update for 2022.1 2022-07-13 16:24:30 +02:00
Anna Stuchlik
b3566391e6 doc: add the upgrade guide from 2021.1 to 2022.1 2022-07-13 16:17:51 +02:00
Raphael S. Carvalho
ca58054485 table: use compaction_completion_desc to describe changes for off-strategy
To make it possible to add a single interface in table_state for
updating sstable list on behalf of both off-strategy and in-strategy
compactions, update_sstable_lists_on_off_strategy_completion() will
work with compaction_completion_desc too for describing sstable set
changes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-13 11:16:19 -03:00
Raphael S. Carvalho
f52ad722f3 compaction_manager: rename table_state's get_sstable_set to main_sstable_set
With compaction_manager switching to table_state, we'll need to
introduce a method in table_state to return maintenance set.
So better to have a descriptive name for main set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-07-13 11:12:33 -03:00
Raphael S. Carvalho
7d97e15c43 bytes_ostream: Avoid waste by rounding up allocation size to power-of-two
- bytes_ostream has a default initial chunk size of 512.
- let's say we call bytes_ostream::write() to write 500 bytes.
- as next_alloc_size() takes into account space to hold chunk metadata
(24 bytes) + chunk data, then 512 bytes is not enough, so it returns
500 + 24 instead to be allocated.
- when allocating next chunk, next_alloc_size() will use the size of
existing chunk, which is 500 bytes (without metadata) and multiply it
to 2 (growth factor), so 1000 bytes is allocated for it.

So allocations can be non power-of-two, resulting in memory waste.

When seastar is allocating from small pools, the waste is not terrible
(although accumulated small wastes can be problematic), but once
allocations pass the large threshold (16k), then alignment is 4k
(page size) and the waste is not negligible.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11027
2022-07-13 16:51:13 +03:00
Konstantin Osipov
a645bb3622 test.py: recursively search for pytests
cql-pytest contains subdirectories with tests
ported from Cassandra. It's desirable to preserve
the same layout and file names for these tests as in
the original source tree. To do that, add support for recursive
search of tests to PythonTestSuite. The log files for
the tests which are found recursively are created in subdirs
of the test tmpdir.

While implementing the feature, switch to using pathlib,
since a) it supports rglob (recursive glob) and b) it
was requested in one of the earlier reviews.

Closes #11018
2022-07-13 14:59:29 +03:00
Nadav Har'El
eaf3579c15 test/alternator: several more simple tests for UpdateItem
This patch adds several more tests for Alternator's UpdateItem operation.
These tests verify a few simple cases that, surprisingly, never had test
coverage. The new tests pass (on both DynamoDB and Alternator) so did not
expose any bug.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11025
2022-07-12 21:48:33 +02:00
Avi Kivity
e1e4a73793 Merge 'Always register fully expired sstables for compaction' from Benny Halevy
If the compaction_descriptor returned by `time_window_compaction_strategy::get_sstables_for_compaction`
is marked with `has_only_fully_expired::yes` it should always be compacted
since `time_window_compaction_strategy::get_sstables_for_compaction` is not idempotent.

It sets `_last_expired_check` and if compaction is postponed and retried before
`expired_sstable_check_frequency` has passed, it will not look for those fully-expired
sstables again. Plus, compacting them is the cheapest possible as it does not require
reading anything, just deleting the input sstables, so there's no reason not postpone it.

Also, extend `max_ongoing_compaction_test` to test serialization of compaction jobs with the same weight.

Fixes #10989

Closes #10990

* github.com:scylladb/scylla:
  compaction_manager: always register descriptor with fully expired sstables for compaction
  test: max_ongoing_compaction_test: test serialization of regular compaction with same weight
  test: max_ongoing_compaction_test: reindent refactored code
  test: max_ongoing_compaction_test: define compact_all_tables lambda
  test: max_ongoing_compaction_test: refactor make_table_with_single_fully_expired_sstable
  test: max_ongoing_compaction_test: reduce number of tables
2022-07-12 18:40:01 +03:00
Nadav Har'El
761ca88aa8 Merge 'Add test cases for granting and revoking data permissions' from Piotr Sarna
This series adds the infrastructure needed for testing user permissions, like the ability to create temporary roles and CQL sessions which log in as different users, and a few initial test cases for granting and revoking permissions.

Closes #10998

* github.com:scylladb/scylla:
  cql-pytest: add a case for granting/revoking data permissions
  cql-pytest: add new_user and new_session utils
  cql-pytest: speed up permissions refresh period for tests
2022-07-12 18:31:33 +03:00
Avi Kivity
ea4a907090 Merge 'mutation_compactor: remove emit only live rows parameter' from Botond Dénes
Said parameter is a convenience so downstream consumers of the
mutation compactors don't have to check the `bool is_live` already
passed to them. This convenience however causes a template parameter and
additional logic for the compactor. As the most prominent of these
consumers (the query result builder) will soon have to switch to
`emit_only_live_rows::no` for other reasons anyway (it will want to count
tombstones), we take the opportunity to switch everybody to ::no. This
can be done with very little additional complexity to these consumers --
basically an additional if or two. With everybody using the `::no` variant
of the compactor, we can remove this template parameter and the logic
associated with it altogether.

Closes #10931

* github.com:scylladb/scylla:
  multishard_mutation_query: remove now pointless compact_for_result_state typedef
  mutation_compactor: remove only-live related logic
  mutation_compactor: remove emit_only_live_rows template parameter
  mutation_compactor: remove unused compact_mutation_state::parameters
  querier: remove {data,mutation}_querier aliases
  querier: remove now pointless emit_only_live_rows template parameter
  tree: use emit_only_live_rows::no
  querier: querier_cache: de-override insert() methods
2022-07-12 17:30:46 +03:00
Takuya ASADA
23973f9591 Support installing pip provided command symlinks to /usr/bin
This is part of support installing executables from PIP package,
now we support installing executable from PIP package but it will
install under /opt/scylladb/python3/bin.
To call these commands without speciying full path, we also need to install
symlink to /usr/bin.
To do this, we need new list which specifies command name for symlink.

Closes #10748
2022-07-12 17:26:05 +03:00
David Garcia
0c2a18af2d doc: enable faster builds
Closes #11023
2022-07-12 16:33:38 +03:00
Nadav Har'El
15ed0a441e Merge 'scylla-gdb.py: assortment of task filtering improvements, scylla fiber going backwards' from Botond Dénes
This series includes an assortment of loosely related improvements developed for a recent investigation. The changes include:
* Fix broken `std_deque` wrapper.
* Make `scylla smp-queues` fast.
* Teach `scylla smp-queues` to filter for both sender CPU (`--from`) and receiver CPU (`--to`) or both.
* Teach `scylla smp-queues` to make histogram over content of the queues -- i.e. the type of tasks in the smp queues.
* Teach `scylla smp-queues` to filter for tasks belonging to a certain scheduling group.
* Teach `scylla task_histogram` to include only tasks in the histogram.
* Teach `scylla task_histogram` to filter for tasks belonging to a certain scheduling group.
* Teach `scylla-fiber` to walk in both directions.

And some refactoring.

Fixes: https://github.com/scylladb/scylla/issues/7059

Closes #11019

* github.com:scylladb/scylla:
  docs/dev/debugging.md: update continuation chain traversal guide
  scylla-gdb.py: scylla fiber: walk continuation chain in both directions
  scylla-gdb.py: scylla fiber: allow passing analyzed pointers to _probe_pointer()
  scylla-gdb.py: scylla fiber: hoist preparatory code out of _walk()
  scylla-gdb.py: scylla task_histogram: add --scheduling-groups option
  scylla-gdb.py: scylla task_histogram: add --filter-tasks option
  scylla-gdb.py: scylla task_histogram: use histogram class
  scylla-gdb.py: scylla-fiber: extract symbol matching logic
  scylla-gdb.py: histogram: add limit feature
  scylla-gdb.py: histogram: handle formatting errors
  scylla-gdb.py: intrusive_slist: avoid infinite recursion in __len__()
  scylla-gdb.py: scylla smp-queues: add --scheduling-group option
  scylla-gdb.py: scylla smp-queues: add --content switch
  scylla-gdb.py: smp-queue: add filtering capability
  scylla-gdb.py: make scylla smp-queues fast
  scylla-gdb.py: fix disagreement between std_deque len() and iter()
2022-07-12 15:38:23 +03:00
Piotr Sarna
fcd8dfa694 cql-pytest: add a case for granting/revoking data permissions
The test cas checks if permissions set for a non-superuser user
are enforced.
2022-07-12 13:44:21 +02:00
Benny Halevy
6332816ccf compaction_manager: always register descriptor with fully expired sstables for compaction
If the compaction_descriptor returned by time_window_compaction_strategy::get_sstables_for_compaction
is marked with has_only_fully_expired::yes it should always be compacted
since time_window_compaction_strategy::get_sstables_for_compaction is not idempotent.

It sets _last_expired_check and if compaction is postponed and retried before
expired_sstable_check_frequency has passed, it will not look for those fully-expired
sstables again. Plus, compacting them is the cheapest possible as it does not require
reading anything, just deleting the input sstables, so there's no reason not postpone it.

Fixes #10989

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-12 12:04:04 +03:00
Benny Halevy
cfc7a5065a test: max_ongoing_compaction_test: test serialization of regular compaction with same weight
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-12 12:04:03 +03:00
Benny Halevy
65a5e0a7bb test: max_ongoing_compaction_test: reindent refactored code
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-12 12:03:32 +03:00
Benny Halevy
5212e81475 test: max_ongoing_compaction_test: define compact_all_tables lambda
To test both expired and non-expired sstables scenarios
we need to pass this helper function the expected number
of sstables before compaction and after compaction.

When compaction a set of fully-expired sstables,
we expect none to remain, while when the set of sstables
is not fully expired, we'll expect 1 output sstable
after compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-12 12:00:27 +03:00
Benny Halevy
fe4a59372e test: max_ongoing_compaction_test: refactor make_table_with_single_fully_expired_sstable
So we can use the lower-level build blocks to
test compaction serialization of both fully-expired
and non-fully-expired sstables scenarios in the following patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-12 11:56:41 +03:00
Benny Halevy
d18fc6a7ed test: max_ongoing_compaction_test: reduce number of tables
There is no need to test 100 tables.
10 tables are enough so make the test complete faster.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-12 11:53:01 +03:00
Konstantin Osipov
4a2d645a0f scylla-gdb: fix "scylla netw" command
The command wasn't tested fully,
and when tested, started failing scylla-gdb test.

Before Raft, the list of connections to print in a single-node setup
was always empty, so a mistake in the gdb script command 'scylla netw'
didn't lead to a test failure.

With raft, there is always an RPC connection to self after initial
bootstrap, and the test begins to print connections (and fail, because
there is a bug in the printing code).

Fix that bug.

Closes #11012
2022-07-12 10:10:40 +03:00
Botond Dénes
ac9935b645 multishard_mutation_query: remove now pointless compact_for_result_state typedef
No need to switch on the now defunct emit_only_live_rows.
2022-07-12 08:44:33 +03:00
Botond Dénes
17509e9664 mutation_compactor: remove only-live related logic
We removed the template parameter in the previous patch, now we can
remove the logic related to it.
2022-07-12 08:44:32 +03:00
Botond Dénes
4d2ce5c304 mutation_compactor: remove emit_only_live_rows template parameter
Now that we use emit_only_live_rows::no everywhere we can remove this
template parameters. Only the template parameter is removed, the
internal logic around it is left in place (will be removed in a next
patch), by hard-wiring `only_live()`.
2022-07-12 08:43:49 +03:00
Botond Dénes
9ee8ef5930 mutation_compactor: remove unused compact_mutation_state::parameters 2022-07-12 08:41:51 +03:00
Botond Dénes
f912f5f373 querier: remove {data,mutation}_querier aliases
They now both mean the same thing: querier.
2022-07-12 08:41:51 +03:00
Botond Dénes
c77fe427c5 querier: remove now pointless emit_only_live_rows template parameter 2022-07-12 08:41:51 +03:00
Botond Dénes
bedc82e52c tree: use emit_only_live_rows::no
emit_only_live_rows is a convenience so downstream consumers of the
mutation compactors don't have to check the `bool is_live` already
passed to them. This convenience however causes a template parameter and
additional logic for the compactor. As the most prominent of these
consumers (the query result builder) will soon have to switch to
emit_only_live_rows::no for other reasons anyway (it will want to count
tombstones), we take the opportunity to switch everybody to ::no. This
can be done with very little additional complexity to these consumer --
basically an additional if or two.
This prepares the ground for removing this template parameter and the
associate logic from the compactor.
2022-07-12 08:41:51 +03:00
Botond Dénes
742dc10185 querier: querier_cache: de-override insert() methods
Soon, the currently two distinct types of queriers will be merged, as
the template parameter differentiating them will be gone. This will make
using type based overload for insert() impossible, as 2 out of the 3
types will be the same. Use different names instead.
2022-07-12 08:41:48 +03:00
Botond Dénes
5c56125187 docs/dev/debugging.md: update continuation chain traversal guide
`scylla fiber` is the way to traverse in both directions now.
2022-07-12 07:27:45 +03:00
Botond Dénes
89595f5b12 scylla-gdb.py: scylla fiber: walk continuation chain in both directions
Parameterize _walk() with a method that does the actual walking. This is
a trivial change as it was already delegating all the walking logic to
_do_walk(). The latter is renamed to _walk_forward() and we add a new
method called _walk_backward() which implements walking the continuation
chain backwards (towards tasks waited on by the queried task).
The starting task is now printed at index #0, tasks waited on by the
starting task have negative indexes, tasks waiting on the starting task
have positive indexes (like before).
With this scylla fiber can be used to dump an entire fiber (barring any
difficulties detecting following more special tasks like threads).
2022-07-12 07:08:54 +03:00
Botond Dénes
1d5547ae22 scylla-gdb.py: scylla fiber: allow passing analyzed pointers to _probe_pointer()
A future caller will have pre-analyzed pointers to pass to said method,
in which case we want to avoid re-running the expensive process.
2022-07-12 06:21:03 +03:00
Botond Dénes
b41493d165 scylla-gdb.py: scylla fiber: hoist preparatory code out of _walk()
We soon want to teach _walk() to walk in both directions. In preparation
to that, we extract all generic preparatory code that is related to the
starting task and combining arguments. This now resides in invoke(),
_walk() should only be concerned with traversing the continuation chain.
2022-07-12 06:14:44 +03:00
Nadav Har'El
f5ff687b64 Merge 'cql3: Reorganize expr::to_restriction' from Jan Ciołek
This PR introduces improvements to `expr::to_restriction` and prepares the validation part for restriction classes removal.

`expr::to_restriction` is currently used to take a restriction from the WHERE clause, prepare it, perform some validation checks and finally convert it to an instance of the restriction class.

Soon we will get rid of the restriction class.

In preparation for that `expr::to_restriction` is split into two independent parts:
* The part that prepares and validates a binary_operator
* The part that converts a binary_operator to restriction

Thanks to this split getting rid of restriction class will be painless, we will just stop using the second part.

`to_restriction.cc` is replaced by `restrictions.hh/cc`. In the future we can put all the restriction expressions code there to avoid clutter in `expression.hh/cc`.

This change made it much easier to fix #10631, so I did that as well.

Fixes: #10631

Closes #10979

* github.com:scylladb/scylla:
  cql-pytest: Test that IS NOT only accepts NULL
  cql-pytest: Enable testInvalidCollectionNonEQRelation
  cql3: Move single element IN restrictions handling
  cql3: Check for disallowed operators early
  cql3: Simplify adding restrictions
  cql3: Reorganize to_restriction code
  cql3: Fix IS NOT NULL check in to_restriction
  cql3: Swap order of arguments in error message
2022-07-12 00:26:34 +03:00
Avi Kivity
53e0dc7530 bytes_ostream: base on managed_bytes
bytes_ostream is an incremental builder for a discontiguous byte container.
managed_bytes is a non-incremental (size must be known up front) byte
container, that is also compatible with LSA. So far, conversion between
them involves copying. This is unfortunate, since query_result is generated
as a bytes_ostream, but is later converted to managed_bytes (today, this
is done in cql3::expr::get_non_pk_values() and
compound_view_wrapper::explode(). If the two types could be made compatible,
we could use managed_bytes_view instead of creating new objects and avoid
a copy. It's also nicer to have one less vocabulary type.

This patch makes bytes_ostream use managed_bytes' internal representation
(blob_storage instead of bytes_ostream::chunk) and provides a conversion
to managed_bytes. All bytes_ostream users are left in place, but the goal
is to make bytes_ostream a write-only type with the only observer a conversion
to managed_bytes.

It turns out to be relatively simple. The internal representations were
already similar. I made blob_storage::ref_type self-initializing to
reduce churn (good practice anyway) and added a private constructor
to managed_bytes for the conversion.

Note that bytes_ostream can only be used to construct a non-LSA managed_bytes,
but LSA uses of managed_bytes are very strictly controlled (the entry
points to memtable and cache) so that's not a problem.

A unit test is added.

Closes #10986
2022-07-12 00:23:29 +03:00
Pavel Emelyanov
5526738794 view: Fix trace-state pointer use after move
It's moved into .mutate_locally() but it captured and used in its
continuation. It works well just because moved-from pointer looks like
nullptr and all the tracing code checks for it to be non-such.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1266/
       (CI job failed on post-actions thus it's red)

Fixes #11015

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220711134152.30346-1-xemul@scylladb.com>
2022-07-11 17:20:51 +03:00
Avi Kivity
34886ce1a1 Merge 'Allow regular compaction during major' from Benny Halevy
After acquiring the _compaction_state write lock,
select all sstables using get_candidates and register them
as compacting, then unlock the _compaction_state lock
to let regular compaction run in parallel.

Also, run major compaction in maintenance scheduling group.
We should separate the scheduling groups used for major compaction
from the the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.

Fixes #10961

Closes #10984

* github.com:scylladb/scylla:
  compaction_manager: major_compaction_task: run in maintenance scheduling groupt
  compaction_manager: allow regular compaction to run in parallel to major
2022-07-11 17:11:51 +03:00
Jan Ciolek
012f7d5b1a cql-pytest: Test that IS NOT only accepts NULL
The IS_NOT operator can only be used during materialized view creation
and it can only be used to express IS NOT NULL.
Trying to write something like IS NOT 42 should cause an error.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-11 15:47:16 +02:00
Jan Ciolek
22e605f823 cql-pytest: Enable testInvalidCollectionNonEQRelation
The wrong error message has been fixed and
now the test passes.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-11 15:47:16 +02:00
Jan Ciolek
38e115edf7 cql3: Move single element IN restrictions handling
Restrictions like
col IN (1)
get converted to
col = 1
as an optimization/simplification.

This used to be done in prepare_binary_operator,
but it fits way better inside of
validate_and_prepare_new_restriction.

When it was being done in prepare_binary_operator
the conversion happened before validation checks
and the error messages would describe an equality
restriction despite the user making an IN restriction.

Now the conversion happens after all validation
is finished, which ensures that all checks are
being done on the original expression.

Fixes: #10631

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-11 15:47:16 +02:00
Jan Ciolek
cb504b2d6e cql3: Check for disallowed operators early
Move checking for disallowed operators
earlier in the code flow.
This is needed to pass some tests that
expect one error message instead of the other.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-11 15:47:16 +02:00
Jan Ciolek
62155846bc cql3: Simplify adding restrictions
The code that adds restrictions in statement_restrictions.cc
is unnecessarily convoluted.

The code to handle IS NOT NULL is actually repeated twice,
once in the constructor and once in add_is_not_restriction.
I missed this when I orignally modified this code.
There is no need to keep duplicate code, we can just
use the new add_is_not_restriction.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-11 15:47:16 +02:00
Jan Ciolek
debd7399fd cql3: Reorganize to_restriction code
expr::to_restriction is currently used to
take a restriction from the WHERE clause,
prepare it, perform some validation checks
and finally convert it to an instance of
the restriction class.

Soon we will get rid of the restriction class.

In preparation for that expr::to_restriction
is split into two independent parts:
* The part that prepares and validates a binary_operator
* The part that converts a binary_operator to restriction

Thanks to this split getting rid of restriction class
will be painless, we will just stop using the
second part.

This commit splits expr::to_restriction into two functions;
* validate_and_prepare_new_restriction
* convert_to_restriction
that handle each of those parts.

All helper validation methods in the anonymous namespace
are copied from the to_restriction.cc file.

to_restriction.cc isn't the best filename for the new functionality,
so it has been renamed to restrictions.hh/cc.
In the future all the code regarding restrictions could be
put there to reduce clutter in expression.hh/cc

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-11 15:47:16 +02:00
Jan Ciolek
5be574fe51 cql3: Fix IS NOT NULL check in to_restriction
expr::to_restriction performs a check to see if
the restriction is of form: `col IS NOT NULL`

There is a mistake in this check.
It uses is<null>(prepared_binop.rhs)
to determine if the right hand side of binary operator
is a null, but the binary operator is already prepared.

During preparation expr::null is converted to expr::constant
and that wouldn't be detected by this check.

The check has been changed to check for null constant instead
of expr::null.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-11 15:47:15 +02:00
Jan Ciolek
4142b27d85 cql3: Swap order of arguments in error message
The error message displays two arguments in
a specific order, but the tests actually
expect them to be swapped.

Swap the arguments to match the expected
error messages in tests.

It wasn't detected earlier because the
check was never reached, but this will change
soon in the following commits.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-11 15:47:13 +02:00
Botond Dénes
c11875d025 scylla-gdb.py: scylla task_histogram: add --scheduling-groups option
Causing the histogram to be made from the scheduling groups of the found
tasks. Allows for finding out which scheduling group dominates in-memory
tasks. This currently cannot be determined, scylla task-queues only
includes ready tasks.
2022-07-11 16:39:49 +03:00
Botond Dénes
2c3c9563b6 scylla-gdb.py: scylla task_histogram: add --filter-tasks option
Allowing to include only task objects in the histogram. Leads to
histograms with less noise but might exclude potentially important items
due to the filtering being inexact.
2022-07-11 16:35:09 +03:00
Botond Dénes
b3fd03f02d scylla-gdb.py: scylla task_histogram: use histogram class
Instead of open-coded alternative. Spares some lines of code and makes
future patching easier.
2022-07-11 16:17:29 +03:00
Botond Dénes
46eeb874fc scylla-gdb.py: scylla-fiber: extract symbol matching logic
Into its own class. Soon there will be another user for it.
2022-07-11 16:17:29 +03:00
Botond Dénes
79d5a4cccb scylla-gdb.py: histogram: add limit feature
Limiting the number of printed lines to the top "limit" ones if
provided.
2022-07-11 16:08:29 +03:00
Botond Dénes
a80f5f2ba4 scylla-gdb.py: histogram: handle formatting errors
Both the content and the formatter method is caller-provided. Mistakes
are easy to come by. Instead of aborting the entire operation, just a
print an error if an item fails to format.
2022-07-11 16:06:25 +03:00
Botond Dénes
753c1608dd scylla-gdb.py: intrusive_slist: avoid infinite recursion in __len__()
Said method currently uses a list() to iterate over all elements,
determining the length. Passing `self` to `list()` will however make
call `len()` first, causing infinite recursion.
2022-07-11 15:53:01 +03:00
Botond Dénes
935f9f7ac4 scylla-gdb.py: scylla smp-queues: add --scheduling-group option
Allowing for filtering for async work items belonging to a certain
scheduling group.
2022-07-11 15:41:29 +03:00
Botond Dénes
fe0f4d4dc1 scylla-gdb.py: scylla smp-queues: add --content switch
When present on the command line, the histogram is created over the
content of the queues, rather than the number of items in them.
It is possible to filter in combination with --content. In particular it
can be used to see the content of a single queue when all three of
`--to`, `--from` and `--content` is present on the command line.
2022-07-11 15:41:02 +03:00
Nadav Har'El
a504d120d0 Merge 'docs: migrate the docs from the scylla-docs repo' from Anna Stuchlik
This PR migrates the ScyllaDB end-user documentation from the [scylla-docs](https://github.com/scylladb/scylla-docs/) repository, according to the [migration plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit?usp=sharing). All the files are added to the `docs` subfolder.

 **This PR does not cover any content changes.**

How to test this PR:

1. Go to `scylla/docs`.
2. Run `make preview`. The docs should build without any warnings.
3. Open http://127.0.0.1:5500/ in your browser. You should see the documentation landing page:
![image](https://user-images.githubusercontent.com/37244380/177358869-af9f1b78-e528-4d0d-9479-cc69e25f3b67.png)

Closes #10976

* github.com:scylladb/scylla:
  doc: fix errors -fix the indent in the conf.py file
  doc: fix the path to Alternator
  doc: fix errors - add Alternator to the toctree
  doc: fix errors- update the conf.py file
  doc: fix errors - remove the CNAME file
  doc: add the CNAME and robots files
  doc: move index and README from scylla-docs repo
  doc: move the documentation from the scylla-docs repo
  doc: remove the old index file
2022-07-11 15:26:06 +03:00
Botond Dénes
abc77d07d5 scylla-gdb.py: smp-queue: add filtering capability
Allow filtering for the from/to cpu (or both). Useful when looking for
queues going to a certain CPU.
2022-07-11 15:19:13 +03:00
Botond Dénes
73563d9800 scylla-gdb.py: make scylla smp-queues fast
Currently scylla smp-queues has O(count(vobjects)) time complexity as it
works by scanning all objects with a vptr and searching them for a
pointer to one of the smp message queues. This is very inefficient and
unnecessary. It much better to just look at the queues themselves and
sum up the number of items in them. This completes in 1-2 seconds on a
core where the old algorithm didn't complete in 2h+.
2022-07-11 15:19:13 +03:00
Botond Dénes
966373fcfb scylla-gdb.py: fix disagreement between std_deque len() and iter()
std_deque implementation was broken, with __len__() and __iter__()
disagreeing about the size of the container. Turns out both are wrong in
certain situations. Fix the iteration logic and re-base both __len__()
and __iter__() on the same node iteration code to prevent future
disagreements.
2022-07-11 15:19:13 +03:00
Avi Kivity
957bf48eb2 Merge 'Don't throw exceptions on the replica side when handling single partition reads and writes' from Piotr Dulikowski
This PR gets rid of exception throws/rethrows on the replica side for writes and single-partition reads. This goal is achieved without using `boost::outcome` but rather by replacing the parts of the code which throw with appropriate seastar idioms and by introducing two helper functions:

1.`try_catch` allows to inspect the type and value behind an `std::exception_ptr`. When libstdc++ is used, this function does not need to throw the exception and avoids the very costly unwind process. This based on the "How to catch an exception_ptr without even try-ing" proposal mentioned in https://github.com/scylladb/scylla/issues/10260.

This function allows to replace the current `try..catch` chains which inspect the exception type and account it in the metrics.

Example:

```c++
// Before
try {
    std::rethrow_exception(eptr);
} catch (std::runtime_exception& ex) {
    // 1
} catch (...) {
    // 2
}

// After
if (auto* ex = try_catch<std::runtime_exception>(eptr)) {
    // 1
} else {
    // 2
}
```

2. `make_nested_exception_ptr` which is meant to be a replacement for `std::throw_with_nested`. Unlike the original function, it does not require an exception being currently thrown and does not throw itself - instead, it takes the nested exception as an `std::exception_ptr` and produces another `std::exception_ptr` itself.

Apart from the above, seastar idioms such as `make_exception_future`, `co_await as_future`, `co_return coroutine::exception()` are used to propagate exceptions without throwing. This brings the number of exception throws to zero for single partition reads and writes (tested with scylla-bench, --mode=read and --mode=write).

Results from `perf_simple_query`:

```
Before (719724e4df):
  Writes:
    Normal:
      127841.40 tps ( 56.2 allocs/op,  13.2 tasks/op,   50042 insns/op,        0 errors)
    Timeouts:
      94770.81 tps ( 53.1 allocs/op,   5.1 tasks/op,   78678 insns/op,  1000000 errors)
  Reads:
    Normal:
      138902.31 tps ( 65.1 allocs/op,  12.1 tasks/op,   43106 insns/op,        0 errors)
    Timeouts:
      62447.01 tps ( 49.7 allocs/op,  12.1 tasks/op,  135984 insns/op,   936846 errors)

After (d8ac4c02bfb7786dc9ed30d2db3b99df09bf448f):
  Writes:
    Normal:
      127359.12 tps ( 56.2 allocs/op,  13.2 tasks/op,   49782 insns/op,        0 errors)
    Timeouts:
      163068.38 tps ( 52.1 allocs/op,   5.1 tasks/op,   40615 insns/op,  1000000 errors)
  Reads:
    Normal:
      151221.15 tps ( 65.1 allocs/op,  12.1 tasks/op,   43028 insns/op,        0 errors)
    Timeouts:
      192094.11 tps ( 41.2 allocs/op,  12.1 tasks/op,   33403 insns/op,   960604 errors)
```

Closes #10368

* github.com:scylladb/scylla:
  database: avoid rethrows when handling exceptions from commitlog
  database: convert throw_commitlog_add_error to use make_nested_exception_ptr
  utils: add make_nested_exception_ptr
  storage_proxy: don't rethrow when inspecting replica exceptions on write path
  database: don't rethrow rate_limit_exception
  storage_proxy: don't rethrow the exception in abstract_read_resolver::error
  utils/exceptions.cc: don't rethrow in is_timeout_exception
  utils/exceptions: add try_catch
  utils: add abi/eh_ia64.hh
  storage_proxy: don't rethrow exceptions from replicas when accounting read stats
  message: get rid of throws in send_message{,_timeout,_abortable}
  database/{query,query_mutations}: don't rethrow read semaphore exceptions
2022-07-11 14:01:41 +03:00
Anna Stuchlik
0bf99b25c9 doc: fix errors -fix the indent in the conf.py file 2022-07-11 12:31:59 +02:00
Anna Stuchlik
ae9ed315d1 doc: fix the path to Alternator 2022-07-11 12:31:59 +02:00
Anna Stuchlik
a1d9f0f0c8 doc: fix errors - add Alternator to the toctree 2022-07-11 12:31:30 +02:00
Anna Stuchlik
81949bbc7a doc: fix errors- update the conf.py file 2022-07-11 12:18:47 +02:00
Anna Stuchlik
2e95bd0ed1 doc: fix errors - remove the CNAME file 2022-07-11 12:17:33 +02:00
Anna Stuchlik
7b5dfde56a doc: add the CNAME and robots files 2022-07-11 12:16:53 +02:00
Anna Stuchlik
8d86dfa929 doc: move index and README from scylla-docs repo 2022-07-11 12:14:40 +02:00
Anna Stuchlik
6e97b83b60 doc: move the documentation from the scylla-docs repo 2022-07-11 12:14:02 +02:00
Anna Stuchlik
bb41457f73 doc: remove the old index file 2022-07-11 12:12:15 +02:00
Piotr Sarna
23acc2e848 cql-pytest: add new_user and new_session utils
These helpers can be used to create a new user and connect
to the cluster using custom credentials to log in.
2022-07-11 10:49:15 +02:00
Piotr Sarna
cf57b369e7 cql-pytest: speed up permissions refresh period for tests
The default refresh period for permissions in both Scylla and Cassandra
is 2 seconds, which is usually perfectly fine for production
environments, but it introduces a significant delay in automatic
test cases. The refresh period is hereby set to 100ms, which allows
test_permissions.py cases to run in around 1s for Scylla instead of
tens of seconds.
2022-07-11 10:30:01 +02:00
Nadav Har'El
cc69177dcc config: fix printing of experimental feature list
Recently we noticed a regression where with certain versions of the fmt
library,

   SELECT value FROM system.config WHERE name = 'experimental_features'

returns string numbers, like "5", instead of feature names like "raft".

It turns out that the fmt library keep changing their overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have to conflicting ways to print it:
  1. We have an explicit operator<<.
  2. We have an *implicit* convertor to the type held by T.

We were hoping that the operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertable to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.

The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.

I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit convertor would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).

Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.

This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.

Fixes #11003.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11004
2022-07-11 09:17:30 +02:00
Nadav Har'El
4a4d9ec9c0 cql-pytest: remove "xfail" mark from two passing tests
Fix two cql-pytest that have been "XPASS"ing (unexpectedly passing)
by removing the "xfail" (expecting failure) mark from them:

One test was for an issue that has already been fixed (refs #10081).
The second test was a translated Cassandra test that should never
have failed because it doesn't trigger the issue that supposedly failed
it (that test sets a large value for a non-indexed column, so doesn't
trigger the problem we have with large values in an indexed column).

Closes #11006
2022-07-11 08:34:19 +03:00
Nadav Har'El
0a71151bc4 test/cql-pytest: avoid deprecation message
When running test/cql-pytest, pytest prints one warning at the end:

   /home/nyh/scylla/test/cql-pytest/test_secondary_index.py:82: DeprecationWarning: ResultSet indexing support will be removed in 4.0.
   Consider using ResultSet.one() to get a single row.
   assert any([index_name in event.description for event in cql.execute(query, trace=True).get_query_trace().events])

So in this patch I do exactly what the warning recommends - use one().

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11002
2022-07-11 08:01:23 +03:00
Nadav Har'El
2581b54ea0 test/{alternator,redis}: stop using deprecated "disutils" package
Python has deprecated the distutils package. In several places in the
Alternator and Redis test suites, we used distutils.version to check if
the library is new enough for running the test (and skip the test if
it's too old). On new versions of Python, we started getting deprecation
warnings such as:

    DeprecationWarning: The distutils package is deprecated and slated for
    removal in Python 3.12. Use setuptools or check PEP 632 for potential
    alternatives

PEP 632 recommends using package.version instead of distutils.version,
and indeed it works well. After applying this patch, Alternator and
Redis test runs no long end in silly deprecation warnings.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11007
2022-07-11 08:00:45 +03:00
Benny Halevy
7e2d2cf1c1 table: snapshot: coroutine::return_exception_ptr
Otherwise, we lose the returned exception_ptr type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11000
2022-07-10 17:56:24 +03:00
Nadav Har'El
2437a42b64 Merge 'cql-pytest: add test_permissions.py' from Piotr Sarna
This new test suite is expected to gather all kinds of permissions
tests - granting, revoking, authorizing, and so on.
Right now it contains a single minimal test which ensures that
the default superuser can be granted applicable permissions,
which they already have anyway.

The test suite added in this pull request will also be useful
when developing #10633 - permissions for UDF/UDA infrastructure.

Closes #10991

* github.com:scylladb/scylla:
  cql-pytest: add initial permissions test suite
  cql-pytest: enable CassandraAuthorizer for Scylla and Cassandra
2022-07-10 09:30:10 +03:00
cvybhu
80dda2bb97 cql3: expr: Fix handling reversed types in limits()
There was a bug which caused incorrect results of limits()
for columns with reversed clustering order.
Such columns have reversed_type as their type and this
needs to be taken into account when comparing them.

It was introduced in 6d943e6cd0.
This commit replaced uses of get_value_comparator
with type_of. The difference between them is that
get_value_comparator applied ->without_reversed()
on the result type.

Because the type was reversed, comparisons like
1 < 2 evaluated to false.

This caused the test testIndexOnKeyWithReverseClustering
to fail, but sadly it wasn't caught by CI because
the CI itself has a bug that makes it skip some tests.
The test passes now, although it has to be run manually
to check that.

Fixes: #10918

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>

Closes #10994
2022-07-10 09:24:06 +03:00
Nadav Har'El
a7fa29bceb cross-tree: fix header file self-sufficiency
Scylla's coding standard requires that each header is self-sufficient,
i.e., it includes whatever other headers it needs - so it can be included
without having to include any other header before it.

We have a test for this, "ninja dev-headers", but it isn't run very
frequently, and it turns out our code deviated from this requirement
in a few places. This patch fixes those places, and after it
"ninja dev-headers" succeeds again.

Fixes #10995

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10997
2022-07-08 12:59:14 +03:00
Avi Kivity
3b20407f25 Merge 'db: Avoid memtable flush latency on schema merge' from Tomasz Grabiec
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.

This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.

We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.

Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.

I reproduced bad behavior locally on my machine with a tired (high latency)
SSD disk, load driver remote. Under high load, I saw table alter (server-side part) taking
up to 10 seconds before. After the patch, it takes up to 200 ms (50:1 improvement).
Without load, it is 300ms vs 50ms.

Fixes #8272
Fixes #8309
Fixes #1459

Closes #10333

* github.com:scylladb/scylla:
  config: Introduce force_schema_commit_log option
  config: Introduce unsafe_ignore_truncation_record
  db: Avoid memtable flush latency on schema merge
  db: Allow splitting initiatlization of system tables
  db: Flush system.scylla_local on change
  migration_manager: Do not drop system.IndexInfo on keyspace drop
  Introduce SCHEMA_COMMITLOG cluster feature
  frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations
  db/commitlog: Improve error messages in case of unknown column mapping
  db/commitlog: Fix error format string to print the version
  db: Introduce multi-table atomic apply()
2022-07-07 16:03:50 +03:00
Benny Halevy
acae3cc223 treewide: stop use of deprecated coroutine::make_exception
Convert most use sites from `co_return coroutine::make_exception`
to `co_await coroutine::return_exception{,_ptr}` where possible.

In cases this is done in a catch clause, convert to
`co_return coroutine::exception`, generating an exception_ptr
if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10972
2022-07-07 15:02:16 +03:00
Piotr Sarna
2e61e50e97 cql-pytest: add initial permissions test suite
This new test suite is expected to gather all kinds of permissions
tests - granting, revoking, authorizing, and so on.
Right now it contains a single minimal test which ensures that
the default superuser can be granted applicable permissions,
which they already have anyway.
2022-07-07 13:45:26 +02:00
Piotr Sarna
1dc116f4dc cql-pytest: enable CassandraAuthorizer for Scylla and Cassandra
In order to be able to test permissions, an authorizer different
than AllowAllAuthorizer (default) must be set.
CassandraAuthorizer is thus enabled - it works on default user/password
pair, so it doesn't introduce any regressions to the test suite.
2022-07-07 13:45:26 +02:00
Avi Kivity
bfc521ee9c Merge "Activate compaction_throughput_mb_per_sec option" from Pavel E
"
The option controlls the IO bandwidth of the compaction sched class.
It's not set to be 16MB/s, but is unused. This set makes it 0 by
default (which means unlimited), live-updateable and plugs it to the
seastar sched group IO throttling.

branch: https://github.com/xemul/scylla/tree/br-compaction-throttling-3
tests: unit(dev),
       v2: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1010/ ,
       v2: manual config update
"

* 'br-compaction-throttling-3-a' of https://github.com/xemul/scylla:
  compaction_manager: Add compaction throughput limit
  updateable_value: Support dummy observing
  serialized_action: Allow being observer for updateable_value
  config: Tune the config option
2022-07-07 13:14:07 +03:00
Tomasz Grabiec
6622e3369a config: Introduce force_schema_commit_log option 2022-07-06 22:08:56 +02:00
Tomasz Grabiec
b8d20335a4 config: Introduce unsafe_ignore_truncation_record
The node now refuses to boot if schema tables were truncated.
This adds a config option to ignore truncation records as a
workaround if user truncated them manually.
2022-07-06 22:08:56 +02:00
Tomasz Grabiec
6b316f267f db: Avoid memtable flush latency on schema merge
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.

This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.

We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.

Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.

One complication with this change is that replay_position markers are
commitlog-domain specific and cannot cross domains. They are recorded
in various places which survive node restart: sstables are annotated
with the maximum replay position, and they are present inside
truncation records. The former annotation is used by "truncate"
operation to drop sstables. To prevent old replay positions from being
interpreted in the context in the new schema commitlog domain, the
change refuses to boot if there are truncation records, and also
prohibits truncation of schema tables.

The boot sequence needs to know whether the cluster feature associated
with this change was enabled on all nodes. Fetaures are stored in
system.scylla_local. Because we need to read it before initializing
schema tables, the initialization of tables now has to be split into
two phases. The first phase initializes all system tables except
schema tables, and later we initialize schema tables, after reading
stored cluster features.

The commitlog domain is switched only when all nodes are upgraded, and
only after new node is restarted. This is so that we don't have to add
risky code to deal with hot-switching of the commitlog domain. Cold
switching is safer. This means that after upgrade there is a need for
yet another rolling restart round.

Fixes #8272
Fixes #8309
Fixes #1459
2022-07-06 22:08:56 +02:00
Tomasz Grabiec
c5ad05c819 db: Allow splitting initiatlization of system tables
We will need some system tables to be initialized earlier in the boot
so that system.scylla_local can be read before schema tables are
initialized.
2022-07-06 22:08:56 +02:00
Tomasz Grabiec
9b3f96047f db: Flush system.scylla_local on change
So that it can be read before commit log replay.

SCHEMA_COMMITLOG feature relies on that.
2022-07-06 22:08:56 +02:00
Tomasz Grabiec
609bf1d547 migration_manager: Do not drop system.IndexInfo on keyspace drop
It's not needed anymore because system.IndexInfo is a virtual table
calculated from view info.

The drop accesses a table which is outside system_schema keyspace
so crosses commit log domain. This will trigger an internal from
database::apply() on schema merge once the code switches to use
the schema commit log and require that all mutations which are
part of the schema change belong to a single commit log domain.

We could theoretically move system.IndexInfo to the schema commit log
domain. It's not easy though because table initialization at boot
needs to be split, and current functions for initailization work
at keyspace granularity, not table granularity.
2022-07-06 22:08:56 +02:00
Tomasz Grabiec
62df9f446c Introduce SCHEMA_COMMITLOG cluster feature 2022-07-06 22:08:56 +02:00
Tomasz Grabiec
6c112bf854 frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations 2022-07-06 22:08:56 +02:00
Tomasz Grabiec
4eb4689d8c db/commitlog: Improve error messages in case of unknown column mapping
Include the table id, and also add a debug-level log line with replay pos
which is similar to the one logged when no error happens.
2022-07-06 22:08:56 +02:00
Tomasz Grabiec
f62eb186b4 db/commitlog: Fix error format string to print the version
It always printed {} instead.
2022-07-06 22:08:56 +02:00
Tomasz Grabiec
6444d959dc db: Introduce multi-table atomic apply()
Will be used to apply schema mutations atomically.
2022-07-06 22:08:56 +02:00
Benny Halevy
e3f561db31 compaction_manager: major_compaction_task: run in maintenance scheduling groupt
We should separate the scheduling groups used for major compaction
from the the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-06 18:18:45 +03:00
Benny Halevy
a9dc7b1841 compaction_manager: allow regular compaction to run in parallel to major
After acquiring the _compaction_state write lock,
select all sstables using get_candidates and register them
as compacting, then unlock the _compaction_state lock
to let regular compaction run in parallel.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-06 14:44:27 +03:00
Petr Gusev
6cdd5b9ff5 raft, set_configuration fix: don't use dummy entries
Leader which ceases to be a leader as a result of a
execute_modify_config cannot wait for a dummy record to be
committed because io_fiber aborts current waiters as soon as it
detects a lost of leadership.

This commit excludes dummy entries from the configuration change
procedure. A special promise is set on io_fiber when it gets a
non-joint configuration, and set_configuration just waits for
the corresponding future instead of a dummy record.

Fixes: #10010

Closes #10905
2022-07-06 11:26:59 +02:00
Avi Kivity
419fe65259 Revert "Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki"
This reverts commit aa8f135f64, reversing
changes made to 9a88bc260c. The patch
causes hangs during flush.

Also reverts parts of 411231da75 that impacted the unit test.

Fixes #10897.
2022-07-06 12:19:02 +03:00
Asias He
a33c370f9a gossip: Speed up wait for gossip settle
In a large cluster, a node would receive frequent and periodic gossip
application state updates like CACHE_HITRATES or VIEW_BACKLOG from peer
nodes. Those states are not critical. They should not be counted for the
_msg_processing counter which is used to decide if gossip is settled.

This patch fixes the long settle on every restart issue reported by
users.

Refs #10337

Closes #10892
2022-07-06 11:26:32 +03:00
Pavel Emelyanov
b112a98318 compaction_manager: Add compaction throughput limit
Re-use eisting compaction_throughput_mb_per_sec option, push it down to
compaction manager via config and update the nderlying compaction sched
class when the option is (live)updated.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-06 08:17:08 +03:00
Pavel Emelyanov
b86d11cf67 updateable_value: Support dummy observing
An updateable_value() may come without source attached. One of the
options how this can happen is if the value sits on a service config.
It's a good option to make the config have some default initialization
for the option, but in this case observe()ing an option by the service
would step on null pointer dereference.

Said that, if a value without source is tried to be observed -- assume
that it's OK, but the value would never change, so a dummy observer is
to be provided.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-06 08:17:08 +03:00
Pavel Emelyanov
dd0f60ef24 serialized_action: Allow being observer for updateable_value
Live-updating an option may involve running some action when the option
changes, not just getting its new value into somewhere. The action is
nice to be run as serialized action to batch config updates.

Said that, here's a sugar to write

    serialized_action _foo = [this] { return foo(); };
    observer<> _o = option.observe(_foo.make_observer());

instead of

    serialized_action _foo = [this] { return foo(); };
    observer<> _o = option.observe([this] {
        // waited with .join on stop
        (void)_foo.trigger();
    });

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-06 08:17:08 +03:00
Mikołaj Grzebieluch
ea3da23f3b cql3: refactor of native types parsing
Native types were parsed directly to data_type, where varchar and text were
parsed to utf8_type. To get the name of the type there was a call to
the data_type method thus getting the name of the varchar type returns "text".

To fix this, added new nonterminal type_unreserved_keyword, which parse native
types to their names. It replaced native_or_internal_type in unreserved_function_keyword.

unreserved_function_keyword is also used to parse usernames, keyspace names, index names,
column identifieres, service levels and role names, so this bug was repaired also in them.

Fixes: #10642

Closes #10960
2022-07-05 18:09:17 +02:00
Piotr Dulikowski
b2504a707c database: avoid rethrows when handling exceptions from commitlog
The database::do_apply and database::apply_with_commitlog are now
changed so that they don't rethrow exceptions returned from the
commitlog.
2022-07-05 16:41:09 +02:00
Piotr Dulikowski
5264fdb3d0 database: convert throw_commitlog_add_error to use make_nested_exception_ptr
Now, throw_commitlog_add_error is renamed to throw_commitlog_add_error.
Instead of wrapping the currently executing exception and rethrowing it,
it takes an std::exception_ptr, wraps it and also returns
std::exception_ptr.
2022-07-05 16:41:09 +02:00
Piotr Dulikowski
1b8aacfee1 utils: add make_nested_exception_ptr
The utils::make_nested_exception_ptr function works similar to
std::throw_with_nested, but instead of storing the currently thrown
exception as the nested exception and then immediately throwing the new
exception, it receives the nested exception as an std::exception_ptr and
also returns an std::exception_ptr.

If the standard library supports it, the function does not perform any
throws. Otherwise the fallback logic performs two throws.
2022-07-05 16:41:09 +02:00
Piotr Dulikowski
2008db58c4 storage_proxy: don't rethrow when inspecting replica exceptions on write path
Now, storage_proxy::send_to_live_endpoints doesn't rethrow exceptions
received from the replica logic when inspecting them.
2022-07-05 16:41:09 +02:00
Piotr Dulikowski
eff462a0e7 database: don't rethrow rate_limit_exception
Now, utils::try_catch is used to detect whether the write operation
failed due to a rate limit exception.
2022-07-05 16:41:09 +02:00
Piotr Dulikowski
ffb95c4840 storage_proxy: don't rethrow the exception in abstract_read_resolver::error
Now, the abstract_read_resolver::error uses the utils::try_catch utility
to analyse the error received from replica instead of rethrowing it.
2022-07-05 16:41:09 +02:00
Piotr Dulikowski
969a2b4b47 utils/exceptions.cc: don't rethrow in is_timeout_exception
Now, is_timeout_exception doesn't need to rethrow the exception in order
to determine whether it's a timeout exception.
2022-07-05 16:41:09 +02:00
Piotr Dulikowski
18f43fa00e utils/exceptions: add try_catch
Introduces a utility function which allows obtaining a pointer to the
exception data held behind an std::exception_ptr if the data matches the
requested type. It can be used to implement manual but concise
try..catch chains.

The `try_catch` has the best performance when used with libstdc++ as it
uses the stdlib specific functions for simulating a try..catch without
having to actually throw. For other stdlibs, the implementation falls
back to a throw surrounded by an actual try..catch.
2022-07-05 16:41:09 +02:00
Nadav Har'El
a0ffbf3291 test/cql-pytest: fix test that started failing after error message change
Recently a change to Scylla's expression implementation changed the standard
error message copied from Cassandra:

    Cannot execute this query as it might involve data filtering and thus
    may have unpredictable performance. If you want to execute this query
    despite the performance unpredictability, use ALLOW FILTERING

In the special case where the filter is on the partition key, we changed
the message to:

    Only EQ and IN relation are supported on the partition key (unless you
    use the token() function or allow filtering)

We had a cql-pytest test translated from Cassandra's unit test that checked
the old message, and started to fail. Unfortunately nobody noticed because
a bug in test.py caused it to stop running these translated unit tests.

So in this patch, we trivially fix the test to pass again. Instead of
insisting on the old message, we check jsut for the string "allow
filtering", in lowercase or uppercase. After this patch, the tests
passes as expected on both Scylla and Cassandra.

Refs #10918 (this test failing is one of the failures reported there)
Refs #10962 (test.py stopped running this test)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10964
2022-07-05 15:24:20 +03:00
Takuya ASADA
ce87e15ecf scylla_prepare: fix Exception when SET_NIC_AND_DISKS=no and SET_CLOCKSOURCE=yes
We shouldn't call get_tune_mode() when NIC tuning is disabled.

fixes #10412

Closes #10959
2022-07-05 14:52:52 +03:00
Takuya ASADA
7501465b7c scylla_util.py: change debug log directory to /var/tmp/scylla
Current debug log is bit difficult to collect in CI, to find the debug log
we must know which script caused Exception.
Because the filename does not include prefix, and also specified
directory is shared with other programs.

To make things more easily, let's change debug log directory to /var/tmp/scylla.

Closes #10730
2022-07-05 14:49:00 +03:00
Avi Kivity
74b02b9719 Merge 'storage_service: track restore_replica_count' from Benny Halevy
This mini-series adds an _async_gate to storage_service that is closed on stop()
and it performs restore_replica_count under this gate so it can be orderly waited on in stop()

Fixes #10672

Closes #10922

* github.com:scylladb/scylla:
  storage_service: handle_state_removing: restore_replica_count under _async_gate
  storage_service: add async_gate for background work
2022-07-05 13:18:59 +03:00
Michał Chojnowski
152eff249c dbuild: fix --security-opt syntax
A recent change added `--security-opt label:disable` to the docker
options. There are examples of this syntax on the web, but podman
and docker manuals don't mention it and it doesn't work on my machine.

Fix it into `--security-opt label=disable`, as described by the manuals.

Closes #10965
2022-07-05 10:50:31 +03:00
Piotr Dulikowski
c1ac116eb9 utils: add abi/eh_ia64.hh
Adds a header for utility functions/structures, based on the Itanium ABI
for C++, necessary for us to inspect exceptions behind
std::exception_ptr without having to actually rethrow the exception.
2022-07-04 19:27:06 +02:00
Piotr Dulikowski
491cc2a8df storage_proxy: don't rethrow exceptions from replicas when accounting read stats
Now, make_{data,mutation_data,digest}_requests don't rethrow the
exception received from replicas when increasing the error count metric.
2022-07-04 19:27:06 +02:00
Piotr Dulikowski
da571ed93b message: get rid of throws in send_message{,_timeout,_abortable}
Now, those function don't rethrow existing or throw new exceptions.
2022-07-04 19:27:06 +02:00
Piotr Dulikowski
902f1b7cfe database/{query,query_mutations}: don't rethrow read semaphore exceptions
Now, read semaphore exceptions are propagated from database::query and
database::query_mutations without rethrowing them.
2022-07-04 19:26:02 +02:00
Avi Kivity
33fe28b0c5 Merge 'commitlog allocation/deletion/flush request rate counters + footprint projection' from Calle Wilund
Adds measuring the apparent delta vector of footprint added/removed within
the timer time slice, and potentially include this (if influx is greater
than data removed) in threshold calculation. The idea is to anticipate
crossing usage threshold within a time slice, so request a flush slightly
earlier, hoping this will give all involved more time to do their disk
work.

Obviously, this is very akin to just adjusting the threshold downwards,
but the slight difference is that we take actual transaction rate vs.
segment free rate into account, not just static footprint.

Note: this is a very simplistic version of this anticipation scheme,
we just use the "raw" delta for the timer slice.
A more sophisiticated approach would perhaps do either a lowpass
filtered rate (adjust over longer time), or a regression or whatnot.
But again, the default persiod of 10s is something of an eternity,
so maybe that is superfluous...

Closes #10651

* github.com:scylladb/scylla:
  commitlog: Add (internal) measurement of byte rates add/release/flush-req
  commitlog: Add counters for # bytes released/flush requested
  commitlog: Keep track of last flush high position to avoid double request
  commitlog: Fix counter descriptor language
2022-07-04 16:26:17 +03:00
Botond Dénes
553538392e Merge "Improve shutdown logging" from Pavel Emelyanov
"
On stop there's a rather long log-less gap in the middle of
storage_service::drain_on_shutdown(). This set adds log in
interesting places and while at it tosses the patched code.

refs: #10941
"

* 'br-shutdown-logging' of https://github.com/xemul/scylla:
  batchlog_manager: Add drain and stop logging
  batchlog_manager: Coroutinize drain and stop
  batchlog_manager: Drain it with shared future
  commitlog: Add shutdown message
  database: Move flushing logging
  compaction_manager: Add logging around drain
  compaction_manager: Coroutinize drain
  storage_service: Sanitize stop_transport()
2022-07-04 13:50:16 +03:00
Pavel Emelyanov
5a4e15f65d Add .mailmap
Google group started replacing sender email with the group email
recently. Here's the list of spoiled entries combined from seastar
and scylla repos

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220701160252.11967-1-xemul@scylladb.com>
2022-07-04 13:44:28 +03:00
Pavel Emelyanov
98ff779676 batchlog_manager: Add drain and stop logging
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-04 13:42:46 +03:00
Pavel Emelyanov
e2007cd317 batchlog_manager: Coroutinize drain and stop
This is not identical change, if drain() resolves with exception we end
up skipping the gate closing, but since it's stop why bother

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-04 13:42:46 +03:00
Pavel Emelyanov
8a03683671 batchlog_manager: Drain it with shared future
The .drain() method can be called from several places, each needs to
wait for its completion. Now this is achieved with the help of a gate,
but there's a simpler way

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-04 13:42:45 +03:00
Pavel Emelyanov
2e1ec36efd commitlog: Add shutdown message
It happens in database::drain(), we know when it starts after keyspaces
are flushed, now it's good to know when it completes

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-04 13:42:45 +03:00
Pavel Emelyanov
ea820e13b3 database: Move flushing logging
Now it happens before calling database::drain() but drain is not only
flushing it does lots of other things. More elaborated logging is better

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-04 13:42:45 +03:00
Avi Kivity
f949612620 Update tools/python3 submodule (install pip executables)
* tools/python3 3471634...e48dcc2 (1):
  > Support installing executables from PIP package
2022-07-04 13:02:51 +03:00
Nadav Har'El
09d0ca7c75 test/cql-pytest: make run-cassandra work on new systems...
with several Java versions

The test/cql-pytest/run-cassandra script runs our cql-pytest tests against
Cassandra. Today, Cassandra can only run correctly on Java 8 or 11
(see https://issues.apache.org/jira/browse/CASSANDRA-16895) but recent
Linux distributions have switched to newer versions of Java - e.g., on
my Fedora 36 installation, the default "java" is Java 17. Which can't
run Cassandra.

So what I do in this patch is to check if "java" has the right version,
and if it doesn't, it looks at several additional locations if it can
find a Java of the right version. By the way, we are sure that Java 8
must be installed because our install-dependencies.sh installs it.

After this patch, test/cql-pytest/run-cassandra resumes working on
Fedora 36.

Fixes #10946

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10947
2022-07-04 12:00:46 +02:00
Aleksandra Martyniuk
bf34589fc1 cql3: create tokens out of null values properly
Method reponsible for creating a token of given values is not meant to be
used with empty optionals. Thus, having requested a token of the columns
containing null values resulted with an exception being thrown. This kind
of behaviour was not compatible with the one applied in cassandra.

To fix this, before the computation of a token, it is checked whether
no null value is contained. If any value in the processed vector is null,
null value is returned.

Fixes:  #10594

Closes #10942
2022-07-04 10:42:23 +02:00
Avi Kivity
719724e4df scripts: pull_github_pr.sh: support recovering from a failed cherry-pick
If a single-patch pull request fails cherry-picking, it's still possible
to recover it (if it's a simple conflict). Give the maintainer the option
by opening a subshell and instructing them to either complete the cherry-pick
or abort it.

Closes #10949
2022-07-04 09:26:45 +03:00
Avi Kivity
2eedb9bf7c Update seastar submodule
* seastar 9c016aeebf...7d8d846b26 (16):
  > Merge 'coroutine: exception: retain exception_ptr type' from Benny Halevy
  > core: log in on_internal_error even when throwing
  > sched_group: Report the sched group that exceeded the limit
Fixes #8226.
  > Add .mailmap
  > prometheus: make the help string optional
  > core: lw_shared_ptr: allow defining `lw_shared_ptr<T>` class member without knowing the definition of `T`
  > ci: build and test in debug and dev modes
  > Merge 'Added summaries, remove empty, and aggregation to Prometheus' from Amnon Heiman
  > Merge 'net/tls: vec_push: call on_internal_error if _output_pending already failed' from Benny Halevy
Fixes #10127
  > Merge 'CI: build and test with both gcc and clang ' from Beni Peled
  > Merge "Initialize lowres_clock::_now earlier" from Pavel E
Ref #10743
  > reactor: don't count make_exception_future etc. in cpp_exceptions metric
  > file: Deprecate file lifetime hint calls
  > foreign_ptr: fix doc.
  > cmake: fix mention of FindLibUring.cmake in install target
  > semaphore: derive named_semaphore_aborted exception from semaphore_aborted
Fixes #10666.

Closes #10951
2022-07-04 00:24:47 +03:00
Avi Kivity
973d2a58d0 Merge 'docs: move docs to docs/dev folder' from David Garcia
In order to allow our Scylla OSS customers the ability to select a version for their documentation, we are migrating the Scylla docs content to the Scylla OSS repository. This PR covers the following points of the [Migration Plan](https://docs.google.com/document/d/15yBf39j15hgUVvjeuGR4MCbYeArqZrO1ir-z_1Urc6A/edit#):

1. Creates a subdirectory for dev docs: /docs/dev
2. Moves the existing dev doc content in the scylla repo to /docs/dev, but keep Alternator docs in /docs.
3. Flattens the structure in /docs/dev (remove the subfolders).
4. Adds redirects from `scylla.docs.scylladb.com/<version>/<document>` to `https://github.com/scylladb/scylla/blob/master/docs/dev/<document>.md`
5. Excludes publishing docs for /docs/devs.

1. Enter the docs folder with `cd docs`.
2. Run `make redirects`.
3. Enter the docs folder and run `make preview`. The docs should build without warnings.
4. Open http://127.0.0.1:5500 in your browser. You shoul donly see the alternator docs.
5. Open http://127.0.0.1:5500/stable/design-notes/IDL.html in your browser. It should redirect you to https://github.com/scylladb/scylla/blob/master/docs/dev/IDL.md and raise a 404 error since this PR is not merged yet.
6. Surf the `docs/dev` folder. It should have all the scylla project internal docs without subdirectories.

Closes #10873

* github.com:scylladb/scylla:
  Update docs/conf.py
  Update docs/dev/protocols.md
  Update docs/dev/README.md
  Update docs/dev/README.md
  Update docs/conf.py
  Fix broken links
  Remove source folder
  Add redirections
  Move dev docs to docs/dev
2022-07-03 20:37:11 +03:00
Wojciech Mitros
bfa3c0e734 test: move codes of UDFs compiled to WASM to test/resource
After compiling to WASM, UDFs become much larger than the
source code. When they're included in test_wasm.py, it
becomes difficult to navigate in the file. Moving them
to another place does not make understanding the test
scripts harder, because the source code is still included.

This problem will become even more severe when testing
UDFs using WASI.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #10934
2022-07-03 17:37:21 +03:00
Nadav Har'El
5f98b81cb3 dbuild: disable selinux instead of relabeling
By default, Docker uses SELinux to prevent malicious code in the container
from "escaping" and touching files outside the container: The container
is only allowed to touch files with a special SELinux label, which the
outside files simply do not have. However, this means that if you want
to "mount" outside files into the container, Docker needs to add the
special label to them. This is why one needs to use the ":z" option
when mounting an outside file inside docker - it asks docker to "relabel"
the directory to be usable in Docker.

But this relabeling process is slow and potentially harmful if done to
large directories such as your home directory, where you may theoretically
have SELinux labels for other reasons. The relabling is also unnecessary -
we don't really need the SELinux protection in dbuild. Dbuild was meant
to provide a common toolchain - it was never meant to protect the build
host from a malicious build script.

The alternative we use in this patch is "--security-opt label=disable".
This allows the container to access any file in the host filesystem,
but as usual - only if it's explicitly "mounted" into the container.
All ":z" we added in the past can be removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10945
2022-07-03 16:20:07 +03:00
Avi Kivity
c290976185 Merge 'cql3: Remove some restrictions classes' from Jan Ciołek
This PR removes some restrictions classes and replaces them with expression.

* `single_column_restriction` has been removed altogether.
* `partition_key_restrictions` field inside `statement_restrictions` has been replaced with `expression`

`clustering_key_restrictions` are not replaced yet, but this PR already has 30 commits so it's probably better to merge this before adding any more changes.
Luckily most of these commits are implementations of small helper functions.

`single_column_restriction` was pretty easy to remove. This class holds the `expression` that describes the restriction and `column_definition` of the restricted column.
It inherits from `restriction` - the base class of all restrictions.

I wasn't able to replace it with plain `expression` just yet, because a lot of times a `shared_ptr<single_column_restriction>` is being cast to `shared_ptr<restriction>`.
Instead I replaced all instances of `single_column_restriction` with `restriction`.
To decide if a `restriction` is a `single_column_restriction` we can use a helper method that works on expressions.
Same with acquiring the restricted `column_definition`.

This change has two advantages:
* One less restriction class -> moving towards 0
* Preparing towards one generic `restriction/expression` type and using functions to distinguish the type of expression that we're dealing with.

`partition_key_restrictions` is a class used to keep restrictions on the partition key inside `statement_restrictions`.
Removing it required two major steps.

First I had to implement taking all the binary operators and making sure that they are valid together.
Before the change this was the `merge_to` method. It ensures that for example there are no token and regular restrictions occurring at the same time.
This has been implemented as `statement_restrictions::add_restriction`.
It detects which case it's dealing with and mimics `merge_to` from the right restrictions class.

Then I implemented all methods of `partition_key_restrictions` but operating on plain `expressions`.
While doing that I was able to gradually shift the responsibility to the brand new functions.

Finally `partition_key_restrictions` wasn't used anywhere at all and I was able to remove it.

Here's the inheritance tree of all restriction classes for context:
![image](https://user-images.githubusercontent.com/36861778/176141470-f96f6189-e650-44c2-9648-2a840b4c89c0.png)

For now this is marked as a draft.
I just put all this together in a readable way and wanted to put it out for you to see.
I will have another look at the code and maybe do some improvements.

Closes #10910

* github.com:scylladb/scylla:
  cql3: Remove _new from  _new_partition_key_restrictions
  cql3: Remove _partition_key_restrictions from statement_restrictions
  cql3: Use expression for index restrictions
  cql3: expr: Add contains_multi_column_restriction
  cql3: Add expr::value_for
  cql3: Use the new restrictions map in another place
  cql3: use the new map in get_single_column_partition_key_restrictions
  cql3: Keep single column restrictions map inside statement restrictions
  cql3: Use expression instead of _partition_key_restrictions in the remaining code
  cql3: Replace partition_key_restrictions->has_supporting_index()
  cql3: Replace statement_restrictions->get_column_defs()
  cql3: Replace partition_key_restrictions->needs_filtering()
  cql3: Replace partition_key_restrictions->size()
  cql3: Replace partition_key_restrictions->is_all_eq()
  cql3: Replace parition_key_restriction->has_unrestricted_components()
  cql3: Replace parition_key_restrictions->empty()
  cql3: Keep restrictions as expressions inside statement_restrictions
  cql3: Handle single value INs inside prepare_binary_operator
  cql3: Add get_columns_in_commons
  cql3: expr: Add is_empty_restriction
  cql3: Replicate column sorting functionality using expressions
  cql3: Remove single_column_restriction class
  cql3: Replace uses of single_column_restriction with restriction
  cql3: expr: Add get_the_only_column
  cql3: expr: Add is_single_column_restriction
  cql3: expr: Add for_each_expression
  cql3: Remove some unsued methods
2022-07-03 16:11:25 +03:00
Jan Ciolek
f37094959d cql3: Remove _new from _new_partition_key_restrictions
_new_partition_key_restrictions was a temporary name
used during the transition from restrictions to expressions.

Now that restrictions aren't used anymore it can be changed
back to _partition_key_restrictions.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:11 +02:00
Jan Ciolek
854ffd7bd8 cql3: Remove _partition_key_restrictions from statement_restrictions
Now that all functionality of partition_key_restrictions
has been implemented using expressions we can remove
this field from statement_restrictions.

_new_partition_key_restrictions will be used for
everything instead.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:11 +02:00
Jan Ciolek
76bf75a9d3 cql3: Use expression for index restrictions
Restrictions that might be used by an index
are currently being kept as shared_ptr<restrictions>.

This stand in the way of replacing _parition_key_restrictions
with an expression as an expression can't be cast to
shared_ptr<restriction>.

Change shared_ptr<restriction> to expression everywhere
where necessary in index operations.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:11 +02:00
Jan Ciolek
83f27fc8c1 cql3: expr: Add contains_multi_column_restriction
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:11 +02:00
Jan Ciolek
fd0798c8a2 cql3: Add expr::value_for
value_for is a method from the restriction class
which finds the value for a given column.

Under the hood it makes use of possible_lhs_values.
It will be needed to implement some functionality
that was implemented using restrictions before.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:11 +02:00
Jan Ciolek
8c8a03aad1 cql3: Use the new restrictions map in another place
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:11 +02:00
Jan Ciolek
4026042cbc cql3: use the new map in get_single_column_partition_key_restrictions
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:11 +02:00
Jan Ciolek
3916bf1168 cql3: Keep single column restrictions map inside statement restrictions
Some parts of the code make use of a map keeping single column restrictions
for each partition key column. One of this places is inside do_filter,
so it could be a performance problem to create such a map from scratch
each time.

After adding all restrictions from the where clause the new
map is created and can be used for various purposes.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:10 +02:00
Jan Ciolek
1339ff1c79 cql3: Use expression instead of _partition_key_restrictions in the remaining code
There are still some places that use partition_key_restrictions
instead of _new_partition_key_restrictions in statement_restrictions.
Change them to use the new representation

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:10 +02:00
Jan Ciolek
0bb49e423a cql3: Replace partition_key_restrictions->has_supporting_index()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:10 +02:00
Jan Ciolek
418ed0b802 cql3: Replace statement_restrictions->get_column_defs()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:10 +02:00
Jan Ciolek
103f8e3d05 cql3: Replace partition_key_restrictions->needs_filtering()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:10 +02:00
Jan Ciolek
de99c0a0fa cql3: Replace partition_key_restrictions->size()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:10 +02:00
Jan Ciolek
16e94aaa91 cql3: Replace partition_key_restrictions->is_all_eq()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:09 +02:00
Jan Ciolek
4d376eb84f cql3: Replace parition_key_restriction->has_unrestricted_components()
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:09 +02:00
Jan Ciolek
7f620cfa29 cql3: Replace parition_key_restrictions->empty()
To remove partition_key_restrictions all of its
methods have to be implemented using the new expression
representation.

The first to go is empty() as it's easy to implement.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:09 +02:00
Jan Ciolek
e6e502e6e8 cql3: Keep restrictions as expressions inside statement_restrictions
Currently restrictions on partition, clustering and nonprimary columns
are kept inside special purpose restriction objects.

We want to remove all the restrictions classes so these objects
will be removed as well.
In the future each of these restrictions will be kept in
an expression.

Add new fields to statement_restrictions class which
will keep the right restrictions.

Currently restrictions from where clause are
added one by one using merge_to method of
the restrictions class.

This functionality will be replaced by statement_restrictions::add_restriction.
Functions for adding restrictions perform validation and
add new restrictions to the right field inside the class.

The checks that are done in add_*_restriction methods
correspond to the checks performed by merge_to
in respective restriction classes.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:09 +02:00
Jan Ciolek
9b6b1f69aa cql3: Handle single value INs inside prepare_binary_operator
Currently expr::to_restriction is the only place where
prepare_binary_operator is called.

In case of a single-value IN restriction like:
mycol IN (1)
this expression is converted to
mycol = 1
by expr::to_restriction.

Once restriction is removed expr::to_restriction
will be removed as well so its functionality has to
be moved somewhere else.

Move handling single value INs inside
prepare_binary_operator.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:09 +02:00
Jan Ciolek
24b0a61d51 cql3: Add get_columns_in_commons
Add a function that finds common columns
between two expressions.

It's used in error messages in the original
restrictions code so it must be included
in the new code as well for compatibility.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:09 +02:00
Jan Ciolek
177ba9b9db cql3: expr: Add is_empty_restriction
Add a function to check whether
an expression restricts anything at all.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:09 +02:00
Jan Ciolek
228b344d9c cql3: Replicate column sorting functionality using expressions
Restrictions code keeps restrictions for each column
in a map sorted by their position in the schema.
Then there are methods that allow to access
the restricted column in the correct order.

To replicate this in upcoming code
we need functions that implement this functionality.

The original comparator can be found in:
cql3/restrictions/single_column_restrictions.hh
For primary key columns this comparator compares their
positions in the schema.
For non-primary columns the position is assumed to
be clustering_key_size(), which seems pretty random.

To avoid passing the schema to the comparator
for nonprimary columns I just assume the
position is u32::max(). This seems to be
as good of a choice as clustering_key_size().
Orignally Cassandra used -1:
bc8a260471/src/java/org/apache/cassandra/config/ColumnDefinition.java (L79-L86)

We never end up comparing columns of different kind using this comparator anyway.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:28:41 +02:00
Pavel Emelyanov
af026e423e compaction_manager: Add logging around drain
Now we know when it starts and whe^w if it finishes

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-01 17:17:53 +03:00
Pavel Emelyanov
a9d6e5cfb6 compaction_manager: Coroutinize drain
It's short enough to fix indentation right at once

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-01 17:17:53 +03:00
Pavel Emelyanov
b5c4553a66 storage_service: Sanitize stop_transport()
It generates ignored future that can be avoided if using forwarding to
shared_future<>'s promise

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-01 17:17:53 +03:00
Jan Ciolek
e37ddd5b89 cql3: Remove single_column_restriction class
Now that all uses of this class have been
replaced by the generic restriction
the class is not used anywhere and can be removed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 15:53:19 +02:00
Jan Ciolek
3e3d2f939c cql3: Replace uses of single_column_restriction with restriction
single_column_restriction is a class used to represent
restrictions in a single column.

The class is very simple - it's basically an expression
with some additional information.

As a step towards removing all restriction classes
all uses of this class are replaced by uses of
the generic restriction class.

All functionality of this class has been implemented
using free standing functions operating on expressions.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 15:52:10 +02:00
Jan Ciolek
afc482f0a5 cql3: expr: Add get_the_only_column
Add a function that gets the only column
from a single column restriction expression.

The code would be very similiar to
is_single_column_restriction, so a new
function is introducted to reduce duplication.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 15:50:48 +02:00
Jan Ciolek
9c3b0299a1 cql3: expr: Add is_single_column_restriction
Add a function that checks whether an expression
contains restrictions on exactly one column.

This a "single_column_restriction"
in the same way that instances of
"class single_column_restriction" are.

It will be used later to distinguish cases
later once this class is removed

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 15:49:37 +02:00
Wojciech Mitros
4fd289a78c wasm: fix freeing in wasm UDFs using WASI
To call a UDF that is using WASI, we need to properly
configure the wasmtime instance that it will be called
on. The configuration was missing from udf_cache::load(),
so we add it here.

The free function does not return any value, so we should use
a calling method that does not expect any returns.
This patch adds such a method and uses it.

A test that did not pass without this fix and does pass after
is added.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #10935
2022-07-01 07:57:45 +02:00
Piotr Sarna
42f51b2f7b Merge 'alternator: use position-in-partition in paging...
cookie only when reading CQL tables' from Botond Dénes

Recently, we added full position-in-partition support to alternator's
paging cookie, so it can support stopping at arbitrary positions. This
support however is only really needed when tables have range tombstones
and alternator tables never have them. So to avoid having to make the
new fields in 'ExclusiveStartKey' reserved, we avoid filling these in
when reading an alternator table, as in this case it is safe to assume
the position is `after_key($clustring_key)`. We do include these new
members however when reading CQL tables through alternator. As this is
only supported for system tables, we can also be sure that the elaborate
names we used for these fields are enough to avoid naming clashes.

Fixes: https://github.com/scylladb/scylla/issues/10903

Closes #10920

* github.com:scylladb/scylla:
  alternator: use position-in-partition in paging cookie only when reading CQL tables
  alternator: make is_alternator_keyspace() a standalone method
2022-06-30 20:24:40 +02:00
Nadav Har'El
29a0a2d694 test/scylla-gdb: detect Scylla compiled without debugging information
test/scylla-gdb tests Scylla's gdb debugging tools, and cannot work if
Scylla was compiled without debug information (i.e, the "dev" build mode).
In the past, test/scylla-gdb/run detected this case and printed a clear error:

   Scylla executable was compiled without debugging information (-g)
   so cannot be used to test gdb. Please set SCYLLA environment variable.

Unfortunately, since recently this detection fails, because even when
Scylla is compiled without debug information we link into it a library
(libwasmtime.a) which has *some* debug information. As a result, instead
of one clear error message, we get all scylla-gdb tests running -
and each of them failing separately. This is ugly and unhelpful.

Each of the tests fail because our "gdb" test fixture tries to load
scylla-gdb.py and fails when the symbols it needs (e.g., "size_t")
cannot be found. So in this patch, we check once for the existance
of this symbol - and if missing we exit pytest instead of failing each
individual test.

Moreover, if loading scylla-gdb.py fails for some other unexpected
reason, let's exit the test as well, instead of failing each individual
test.

Fixes #10863.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10937
2022-06-30 20:22:50 +02:00
David Garcia
fc59ebc2c3 Update docs/conf.py 2022-06-30 19:12:08 +01:00
David Garcia
a1b990075e Update docs/dev/protocols.md 2022-06-30 19:11:09 +01:00
David Garcia
d555febdd3 Update docs/dev/README.md
Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
2022-06-30 19:10:41 +01:00
David Garcia
350abdd72d Update docs/dev/README.md
Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
2022-06-30 19:10:33 +01:00
David Garcia
66a77f74d2 Update docs/conf.py
Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
2022-06-30 19:10:23 +01:00
Botond Dénes
d80256f4dd dirty_memory_manager: move db ctor out-of-line
To facilitate further patching.
2022-06-30 17:26:18 +03:00
Botond Dénes
3a19412237 Merge 'Various improvements to perf_row_cache_upgrade test' from Tomasz Grabiec
Closes #10930

* github.com:scylladb/scylla:
  test: perf_row_cache_update: Flush std output after each line
  test: perf_row_cache_update: Drain background cleaner before starting the test
  test: perf_row_cache_update: Measure memtable filling time
  test: perf_row_cache_update: Respect preemption when applying mutations
  test: perf_row_cache_update: Drop unused pk variable
2022-06-30 15:57:29 +03:00
Botond Dénes
2b6eeadc07 alternator: use position-in-partition in paging cookie only when reading CQL tables
Recently, we added full position-in-partition support to alternator's
paging cookie, so it can support stopping at arbitrary positions. This
support however is only really needed when tables have range tombstones
and alternator tables never have them. So to avoid having to make the
new fields in 'ExclusiveStartKey' reserved, we avoid filling these in
when reading an alternator table, as in this case it is safe to assume
the position is `after_key($clustring_key)`. We do include these new
members however when reading CQL tables through alternator. As this is
only supported for system tables, we can also be sure that the elaborate
names we used for these fields are enough to avoid naming clashes.
The condition in the code implementing this is actually even more
general: it only includes the region/weight members when the position
differs from that of a normal alternator one.
2022-06-30 15:10:30 +03:00
Botond Dénes
52058ea974 alternator: make is_alternator_keyspace() a standalone method 2022-06-30 14:18:29 +03:00
Jan Ciolek
cb3b179945 cql3: expr: Add for_each_expression
for_each_expression is a function that
can be used to iterate over all expressions
inside an expression recursively and perform
some operation on each of them.

For example:
for_each_expression<column_vaue>(e, [](const column_value& cval) {std::cout << cval << '\n';});
Will print all column values in an expression

It's awkward to do this using recurse_until or find_in_expression
because these functions are meant for slightly different purposes.
Having a dedicated function for this purpose will make the code
cleaner and easier to understand.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-06-30 10:03:53 +02:00
Pavel Emelyanov
868c3be01f config: Tune the config option
The option is used, but is not implemented. If attaching implementation
to it right a once the compaction will slow down to 16MB/s on all nodes.
Make it zero (unbound) by default and mard live-updateable while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-30 09:55:52 +03:00
Tomasz Grabiec
8f3349b407 test: lib: flat_mutation_reader_assertion: Add trace-level logging of read fragments
Message-Id: <20220629153926.137824-1-tgrabiec@scylladb.com>
2022-06-30 08:43:30 +03:00
Tomasz Grabiec
6d753bd01c gdb: Make robust in case there is no global storage_proxy or database instance
Some unit tests don't initialze these.
Message-Id: <20220629152743.134296-1-tgrabiec@scylladb.com>
2022-06-30 08:41:57 +03:00
Nadav Har'El
8024da10f0 test/cql-pytest: avoid leaving behind temporary files
Before this patch, the test cql-pytest/test_tools.py left behind
a temporary file in /tmp. It used pytest's "tmp_path_factory" feature,
but it doesn't remove temporary files it creates.

This patch removes the temporary file when the fixture using it ends,
but moreover, it puts the temporary file not in /tmp but rather next
to Scylla's data directory. That directory will be eventually removed
entirely, so even if we accidentally leave a file there, it will
eventually be deleted.

Fixes #10924

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10929
2022-06-30 07:35:55 +03:00
Tomasz Grabiec
a6aef60b93 memtable: Fix missing range tombstones during reads under ceratin rare conditions
There is a bug introduced in e74c3c8 (4.6.0) which makes memtable
reader skip one a range tombstone for a certain pattern of deletions
and under certain sequence of events.

_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.

The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.

For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:

[a, c] @ t0
[a, b) @ t1, [b, d] @ t2

c > b

The proper sequence for such versions is (assuming d > c):

[a, b) @ t1,
[b, d] @ t2

Due to the bug, the reader will produce:

[a, b) @ t1,
[b, c] @ t0

The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).

The problem goes away once MVCC versions merge.

Fixes #10913
Fixes #10830

Closes #10914
2022-06-29 19:02:23 +03:00
Tomasz Grabiec
3a222c1db3 test: perf_row_cache_update: Flush std output after each line 2022-06-29 17:36:02 +02:00
Tomasz Grabiec
daf0f041be test: perf_row_cache_update: Drain background cleaner before starting the test 2022-06-29 17:36:02 +02:00
Tomasz Grabiec
5d4bd5d6d5 test: perf_row_cache_update: Measure memtable filling time 2022-06-29 17:36:02 +02:00
Tomasz Grabiec
9d9bf8c196 test: perf_row_cache_update: Respect preemption when applying mutations
Otherwise, once preemption is signalled, memtable::apply() will keep
creating MVCC snapshots, which will slow the test down.
2022-06-29 17:36:02 +02:00
Tomasz Grabiec
46a5a606c4 test: perf_row_cache_update: Drop unused pk variable 2022-06-29 17:36:02 +02:00
Pavel Emelyanov
85033ea6ae Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun
The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0.

They are mostly refactors which don't affect the behavior of the system, except one: the commit 4d439a16b3 causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because:
1. eventually, we want this to be the default behavior
2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary.

Closes #10864

* github.com:scylladb/scylla:
  service/raft: raft_group_registry: add assertions when fetching servers for groups
  service/raft: raft_group_registry: remove `_raft_support_listener`
  service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
  service/raft: raft_group0: move group 0 RPC handlers from `storage_service`
  service/raft: messaging: extract raft_addr/inet_addr conversion functions
  service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster`
  treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls
  test/boost: memtable_test: perform schema operations on shard 0
  test/boost: cdc_test: remove test_cdc_across_shards
  message: rename `send_message_abortable` to `send_message_cancellable`
  message: change parameter order in `send_message_oneway_timeout`
2022-06-29 16:51:54 +03:00
Pavel Emelyanov
b60f2c220b test_tools: Do not create type if it exists
There effectively are several test-cases in this test, each calls the
scylla_sstable() to prepare, thus each creates a type in the same scylla
instance. The 2nd attempt ends up with the "already exists" error:

E   cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="A user type of name cql_test_1656396925652.type1 already exists"

tests: unit(dev)
       https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1075/
fixes: #10872

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220628081459.12791-1-xemul@scylladb.com>
2022-06-29 14:31:57 +03:00
Calle Wilund
aab7794c31 commitlog_test: Change timeout handling to do abort()
Refs #10805

To help debug spurious failures, ensure to do abort() for debugger/core
ease.

Closes #10843.
2022-06-29 13:26:51 +03:00
Nadav Har'El
fc0243a43a Merge 'Implement a number of improvements in test.py' from Konstantin Osipov
A number of improvements in test.py as requested by maintainers:

* don't capture pytest output
* stick to the specific server in control connections
* support --log-level option and pass it to logging module
* when checking if CQL is up, ignore timeout errors
* no longer force schema migration when starting the server
* use test uname, not id, in log output
* improve logging of ScyllaServer
* log what cluster is used for a test
* extend xml output with logs

On the same token, remove mypy warnings and make linter pass on test.py, as well as add some type checking.

Fixes #10871
Fixes #10785

Closes #10902

* github.com:scylladb/scylla:
  test.py: extend xml output with logs
  test.py: log what cluster is used for a test
  test.py: improve logging of ScyllaServer
  test.py: use test uname, not id, in log output
  test.py: support --log-level option and pass it to logging module
  test.py: make ScyllaServer more reliable and fast
  test.py: don't capture pytest output
  test.py: add type annotations
  test.py: convert log_filename to pathlib
  test.py: please linter
  test.py: remove mypy warnings
2022-06-29 13:06:07 +03:00
Pavel Emelyanov
3a753068be Merge "Make permissions cache live updateable and add an API for resetting authorization cache" from Igor Ribeiro Barbosa Duarte
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass) having to restart
the service every time they make a change related to permissions or
prepared_statements cache (e.g. Adding a user and changing their permissions)
can become pretty annoying.
This patch series make permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable so that restarting the
service is not necessary anymore for these cases.
It also adds an API for flushing the cache to make it easier for users who
don't want to modify their permissions_cache config.

branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable
CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/
dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache

* https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable:
  loading_cache_test: Test loading_cache::reset and loading_cache::update_config
  api: Add API for resetting authorization cache
  authorization_cache: Make permissions cache and authorized prepared statements cache live updateable
  auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct
  utils/loading_cache.hh: Add update_config method
  utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh
  utils/loading_cache.hh: Add reset method
2022-06-29 11:14:13 +03:00
Benny Halevy
cb0b728ed1 storage_service: handle_state_removing: restore_replica_count under _async_gate
Track the background restore_replica_count fiber
so it be awaited on in stop() by closing the
_async_gate.

Fixes #10672

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-29 10:51:26 +03:00
Benny Halevy
1b1c02b243 storage_service: add async_gate for background work
To be used for tracking restore_replica_count
and waiting for it on stop().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-29 10:47:49 +03:00
Igor Ribeiro Barbosa Duarte
8cc2de5fe0 loading_cache_test: Test loading_cache::reset and loading_cache::update_config
Validate that the size of the cache is zero after calling the
reset method and that the config is being updated correctly
after calling update_config.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:58:06 -03:00
Igor Ribeiro Barbosa Duarte
a23c3d6338 api: Add API for resetting authorization cache
For cases where we have very high values set to permissions_cache validity and
update interval (E.g.: 1 day), whenever a change to permissions is made it's
necessary to update scylla config and decrease these values, since waiting for
all this time to pass wouldn't be viable.
This patch adds an API for resetting the authorization cache so that changing
the config won't be mandatory for these cases.

Usage:
    $ curl -X POST http://localhost:10000/authorization_cache/reset

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:58:06 -03:00
Igor Ribeiro Barbosa Duarte
b9051c79bc authorization_cache: Make permissions cache and authorized prepared statements cache live updateable
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass) having to restart
the service every time they make a change related to permissions or
prepared_statements cache(e.g.: Adding a user) can become pretty annoying.
This patch make permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable so that restarting the
service is not necessary anymore for these cases.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:58:06 -03:00
Igor Ribeiro Barbosa Duarte
c8c48a98fa auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct
This patch makes authorized_prepared_statements_cache acccept a config struct,
similarly to permissions_cache. This will make it easier to make this cache
live updateable on the next patch.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:57:52 -03:00
Igor Ribeiro Barbosa Duarte
d02cd5e8bc utils/loading_cache.hh: Add update_config method
This patch adds an update_config method in order to allow live updating the
config for permissions_cache. This method is going to be used in the next
patches after making permissions_cache config live updateable.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:46:58 -03:00
Igor Ribeiro Barbosa Duarte
667840a7eb utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh
This patch renames the permissions_cache_config struct to loading_cache_config
and moves it to utils/loading_cache.hh. This will make it easier to handle
config updates to the authorization caches on the next patches

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:46:22 -03:00
Nadav Har'El
630959bb77 Merge 'test.py async and schema changes' from Alecco
Change tests to use async mode and add helpers and tests for schema changes.

These test series will be expanded with topology changes.

Closes #10550

* github.com:scylladb/scylla:
  test.py topology: repro for issue #1207
  test.py: port fixture fails_without_raft
  test.py topology: table methods to add/remove index
  test.py topology: add/drop table column helpers
  test.py topology: insert sequential row
  test.py: remove deprecated test test_null
  test.py: managed random tables
  test.py: test_keyspace fixture async
  test.py: rename fixture test_keyspace to keyspace
  test.py topology: test with asyncio
2022-06-28 23:25:18 +03:00
Avi Kivity
37780c6521 Merge 'test: perf: allow testing timeouts in perf_simple_query' from Piotr Dulikowski
This PR adds necessary modifications to perf_simple_query so that it can be used to test performance of the timeout handling path. With an appropriate combination of flags, it is possible to consistently trigger timeouts on every operation.

The following flags are added:
- `--stop-on-error` - if true (which is the default), the test stops after encountering the first exception and reports it; otherwise it causes errors to be counted and reported at the end.
- `--timeout <x>` - allows to use `USE TIMEOUT <x>` in the benchmark query/statement.
- `--bypass-cache` - uses `BYPASS CACHE` in the benchmark query (relevant only to reads).

Examples:

```
./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --write
131023.65 tps ( 56.2 allocs/op,  13.2 tasks/op,   49784 insns/op,        0 errors)

./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --write --stop-on-error=false --timeout=0s
97163.73 tps ( 53.1 allocs/op,   5.1 tasks/op,   78687 insns/op,  1000000 errors)

./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000
154060.36 tps ( 63.1 allocs/op,  12.1 tasks/op,   42998 insns/op,        0 errors)

./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --stop-on-error=false --flush --bypass-cache --timeout=0s
30127.43 tps ( 48.2 allocs/op,  14.3 tasks/op,  312416 insns/op,  1000000 errors)
```

Refs: #2363

Closes #10899

* github.com:scylladb/scylla:
  test: perf: add bypass cache argument
  test: perf: add timeout argument
  test: perf: count errors and report the count in results
  test: perf: add stop-on-error argument
  test: perf: coroutinize run_worker()
  test: perf: fix crash on exception in time_parallel_ex
2022-06-28 19:28:22 +03:00
Tomasz Grabiec
3bb147ae95 db: mutation_cleaner: Enqueue new snapshots at the back
This fixes a quadratic behavior in case lots of snapshots with range
tombstones are queued for merging. Before the change, new snapshots
were inserted at the front, which is also where the worker looks
at. Merging a version has a linear component in complexity function
which depends on the number of range tombstones. If we merge snapshots
starting from the latest to oldest then the whole process becomes
quadratic because the version which is merged accumulates an
increasing amont of tombstones, ones which were already merged
before. We should instead merge starting from the oldest snapshots,
this way each tombstone is applied exactly once during merge.

This bug got wose after 4bd4aa2e88,
which makes merging tombstones more expensive.

Closes #10916
2022-06-28 18:29:29 +03:00
Nadav Har'El
a8b02f7965 test: set sanitizer options in run scripts
When the run scripts for tests of cql-pytest, alternator, redis, etc.,
run Scylla, they should set the UBSAN_OPTIONS and ASAN_OPTIONS so that
if the executable is built with sanitizers enabled, it will ignore false
positives that we know about, and fail on real errors.

The change in this patch affects all test/*/run scripts which use the
this shared Scylla-starting code. test.py already had the same settings,
and it affected the tests that it knows to run directly (unit tests,
cql-pytest, etc.).

Fixes #10904

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10915
2022-06-28 18:24:48 +03:00
Konstantin Osipov
d5d748ae86 test.py: extend xml output with logs
Add test and server logs, as well as the unidiff, to
XML output. This makes jenkins reports nicer.

While on it, debug & fix bugs in handling of flaky tests:

- the reset would reset a flaky test even after the last attempt
  fails, so it would be impossible to see what happened to it
- the args needed to be reset as well, since execution modifies
  them
- we would say that we're going to retry the flaky test when in
  fact it was the last attempt to run it and no more retries were
  planned
2022-06-28 18:22:01 +03:00
Konstantin Osipov
22d30abed8 test.py: log what cluster is used for a test 2022-06-28 18:22:01 +03:00
Konstantin Osipov
dbdfac7a0f test.py: improve logging of ScyllaServer 2022-06-28 18:22:01 +03:00
Konstantin Osipov
c4bee2860b test.py: use test uname, not id, in log output
Clarify "Test was cancelled" error message.
2022-06-28 18:22:01 +03:00
Konstantin Osipov
b502dd3c4e test.py: support --log-level option and pass it to logging module
scylla-python driver and scylla_server.py can be more verbose
at higher log levels, allow specifying the log level from the command
line.
2022-06-28 18:22:01 +03:00
Konstantin Osipov
ad7423649f test.py: make ScyllaServer more reliable and fast
1) Stick to the specific server in control connections.
It could happen that, when starting a cluster and checking
if a specific node is up, the check would actually execute
against an already running node. Prevent this from happening
by setting a white list connection balancing policy for control
connections.

2) When checking if CQL is up, ignore timeout errors
Scylla in debug mode can easily time out on a DDL query,
and the timeout error at start up would lead to the entire cluster
marked as broken. This is too harsh, allow timeouts at start.

3) No longer force schema migration when starting the server
By default, Raft is on, so the nodes are getting schema
through Raft leader. Schema migration significantly slows
down cluster start in debug mode (60 seconds -> 100 seconds),
and even though it was a great test that helped discover
several bugs in Scylla, it shouldn't be part of normal
cluster boot, so disable it.
2022-06-28 18:21:25 +03:00
Konstantin Osipov
867d5b4eda test.py: don't capture pytest output
Let print()s inside pytest tests go into pytest logs. This simplifies
debugging, especially if someone is not familiar with pytest.
2022-06-28 17:46:47 +03:00
Konstantin Osipov
fd3d08e560 test.py: add type annotations
Add type annotations where possible.
2022-06-28 17:46:47 +03:00
Konstantin Osipov
2470b1d888 test.py: convert log_filename to pathlib 2022-06-28 17:46:47 +03:00
Konstantin Osipov
20070e2d89 test.py: please linter
Rename the static method to not collied with a member variable
with the same name.
2022-06-28 17:46:46 +03:00
Konstantin Osipov
fc232099bb test.py: remove mypy warnings 2022-06-28 17:46:46 +03:00
David Garcia
b85843b9cc Fix broken links
Fix broken links
2022-06-28 15:19:36 +01:00
David Garcia
b87537c767 Remove source folder
Remove source folder

Remove source folder
2022-06-28 15:07:35 +01:00
Alejo Sanchez
c478a53d9c test.py topology: repro for issue #1207
Repro for bug in concurrent schema changes for many tables and indexing
involved.

Do alter tables by doing in parallel new table creation, alter a table
(_alter), and index other tables (_index).

Original repro had sets of 20 of those and slept for 20 seconds to
settle. This repro does it for Scylla with just 1 set and 1 second.

This issue goes away once Raft is enabled.

https://github.com/scylladb/scylla/issues/1207

Originally at https://issues.apache.org/jira/browse/CASSANDRA-10250

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:27 +02:00
Alejo Sanchez
4228bfef84 test.py: port fixture fails_without_raft
Port fails_without_raft to higher level conftest file for future use in
topology pytests.

While there, make it async.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:27 +02:00
Alejo Sanchez
e2cc35b768 test.py topology: table methods to add/remove index
Add helper methods to add and drop indexes on a given column.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:27 +02:00
Alejo Sanchez
d80857e26e test.py topology: add/drop table column helpers
Helper to add/drop a specified or random column.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:27 +02:00
Alejo Sanchez
e8e6a8e85a test.py topology: insert sequential row
For each table keep a counter and insert rows with sequential values
generated correspondingly by each column's type.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:27 +02:00
Alejo Sanchez
0624be6d58 test.py: remove deprecated test test_null
With test_schema there's no need for test_null.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:27 +02:00
Alejo Sanchez
ed140f98d8 test.py: managed random tables
Helpers to create keyspace and manange randomized tables.

Fixture drops all created tables still active after the test finishes.

Includes helper methods to verify schema consistency.

These helpers will be used in Raft schema changes tests coming later.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:27 +02:00
Alejo Sanchez
df1a032c04 test.py: test_keyspace fixture async
Make test_keyspace fixture async.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:26 +02:00
Alejo Sanchez
fda69a0773 test.py: rename fixture test_keyspace to keyspace
Name makes better sense as it's not a test but a fixture for tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:26 +02:00
Alejo Sanchez
00648342a6 test.py topology: test with asyncio
Run test async using a wrapper for Cassandra python driver's future.

The wrapper was suggested by a user and brought forward by @fruch.
It's based on https://stackoverflow.com/a/49351069 .

Redefine pytest event_loop fixture to avoid issues with fixtures with
scope bigger than function (like keyspace).
See https://github.com/pytest-dev/pytest-asyncio/issues/68

Convert sample test_null to async. More useful test cases will come
afterwards.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-28 15:07:26 +02:00
David Garcia
8e7ebea335 Merge remote-tracking branch 'upstream/master' into move-dev-docs 2022-06-28 11:02:38 +01:00
David Garcia
5adb5875f1 Add redirections 2022-06-28 09:39:14 +01:00
Jan Ciolek
1824878e9f cql3: Remove some unsued methods
They are removed because they are not used anywhere
and they contain code that would have to be modified
in the following commits.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-06-28 08:10:21 +02:00
Botond Dénes
6c818f8625 Merge 'sstables: generation_type tidy-up' from Michael Livshin
- Use `sstables::generation_type` in more places
- Enforce conceptual separation of `sstables::generation_type` and `int64_t`
- Fix `extremum_tracker` so that `sstables::generation_type` can be non-default-constructible

Fixes #10796.

Closes #10844

* github.com:scylladb/scylla:
  sstables: make generation_type an actual separate type
  sstables: use generation_type more soundly
  extremum_tracker: do not require default-constructible value types
2022-06-28 08:50:12 +03:00
Calle Wilund
688fd31e64 commitlog: Add counters for actual pending allocations + segment wait
Fixes #9367

The CL counters pending_allocations and requests_blocked_memory are
exposed in graphana (etc) and often referred to as metrics on whether
we are blocking on commit log. But they don't really show this, as
they only measure whether or not we are blocked on the memory bandwidth
semaphore that provides rate back pressure (fixed num bytes/s - sortof).

However, actual tasks in allocation or segment wait is not exposed, so
if we are blocked on disk IO or waiting for segments to become available,
we have no visible metrics.

While the "old" counters certainly are valid, I have yet to ever see them
be non-zero in modern life.

Closes #9368
2022-06-28 08:36:27 +03:00
Nadav Har'El
e22364dcc5 doc, alternator: split "experimental" features from "unimplemented" ones
Currently in docs/alternator/compatibility.md experimental features
and unimplemented features are bunched together under one heading
("unimplemented features"). In this patch we separate them into two
sections. This makes the "unimplemented features" section shorter,
and also allows us to link to the new "experimental features" section
separately.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10893
2022-06-28 08:08:50 +03:00
Tomasz Grabiec
1a9d1d380a Reads from cache lack preemption check when scanning over range tombstones
A scan over range tombstones will ignore preemption, which may cause
reactor stalls or read failure due to std::bad_alloc.

This is a regression introduced in
5e97fb9fc4.  _lower_bound_changed was
always set to false, which is later checked at preemption point and
inhibits yielding.

Closes #10900
2022-06-28 06:58:48 +03:00
Piotr Dulikowski
6c69606702 test: perf: add bypass cache argument
Adds the "--bypass-cache" argument which adds a "BYPASS CACHE" clause
to the query being run in the benchmark. It only affects the read mode.
2022-06-27 22:14:29 +02:00
Piotr Dulikowski
fdd0a4146f test: perf: add timeout argument
Adds the "--timeout" argument which allows specifying a timeout used in
all operations. It works by inserting "USING <timeout>" in appropriate
place in the query.

The flag is most useful when set to zero - with an appropriate
combination of other flags (flush, bypass cache) it guarantees that each
operation will time out and performance of the timeout handling logic
can be measured.
2022-06-27 22:14:29 +02:00
Piotr Dulikowski
b9250a43e3 test: perf: count errors and report the count in results
Now, exceptions encountered during the test are counted as errors, and
the error count is reported at the end of the test.
2022-06-27 22:14:29 +02:00
Piotr Dulikowski
21612f97b0 test: perf: add stop-on-error argument
Adds the "--stop-on-error" argument to perf_simple_query. When enabled
(and it is enabled by default), the benchmark will propagate exceptions
if any occur in the tested function. Otherwise, errors will be ignored.
2022-06-27 22:14:29 +02:00
Piotr Dulikowski
d3bc946859 test: perf: coroutinize run_worker()
Converts the executor::run_worker() method to a coroutine. This will
allow extending the function in further commits without having to
allocate continuations.
2022-06-27 22:14:29 +02:00
Piotr Dulikowski
33b22e78be test: perf: fix crash on exception in time_parallel_ex
The `time_parallel_ex` function creates a sharded<executor> and uses it
to run the benchmark on multiple shards in parallel. However, if the
benchmarking function throws an exception, the sharded<executor> will be
destroyed without being stopped, which triggers an assertion in
sharded<T> destructor.

This commit makes sure that the executor is stopped before being
destroyed by putting `exec.stop()` into a `seastar::defer`.
2022-06-27 22:14:29 +02:00
Avi Kivity
7b37e02aa7 Update seastar submodule
* seastar ff46af9ae0...9c016aeebf (8):
  > Merge "Handle overflow in token bucket replenisher" from Pavel E
Fixes #10743
Fixes #10846
  > abort_source: request_abort: restore legacy no-args method
  > configure.py: do not use distutils
  > configure.py: drop unused "import sys"
  > Revert "Use recv syscall instead of read in do_read_some()"
  > Use recv syscall instead of read in do_read_some()
  > Merge 'Add initial support for websocket protocol' from Andrzej Stalke
  > Merge 'abort_source: request_abort: allow passing exception to subscribers' from Benny Halevy

Closes #10898
2022-06-27 23:11:56 +03:00
Kamil Braun
ff4ecfa182 dht: boot_strapper: check if keyspace still exists in bootstrap
While we're iterating over the fetched keyspace names, some of these
keyspaces may get dropped. Handle that by checking if the keyspace still
exists.

Also, when retrieving the replication strategy from the keyspace, store
the pointer (which is an `lw_shared_ptr`) to the strategy to keep it
alive, in case the keyspace that was holding it gets dropped.

Closes #10861
2022-06-27 19:13:46 +02:00
Asias He
d3c6e72c69 repair: Allow abort repair jobs in early stage
Consider this:

- User starts a repair job with http api
- User aborts all repair
- The repair_info object for the repair job is created
- The repair job is not aborted

In this patch, the repair uuid is recorded before repair_info object is
created, so that repair can now abort repair jobs in the early stage.

Fixes #10384

Closes #10428
2022-06-27 16:39:36 +03:00
Pavel Emelyanov
f3841c1b45 exceptions: Define operator<< for exception_code
Otherwise cql_transport::additional_options_for_proto_ext() complains
about inability to format the enum class value

Introduced by efc3953c (transport: add rate_limit_error)
Fmt version 8.1.1-5.fc35, fresher one must have it out of the box

Fixes #10884

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220627052703.32024-1-xemul@scylladb.com>
2022-06-27 14:49:58 +03:00
Avi Kivity
3131cbea62 Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes
Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones.
This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters.

Refs: https://github.com/scylladb/scylla/issues/3672
Refs: https://github.com/scylladb/scylla/issues/7689
Refs: https://github.com/scylladb/scylla/issues/7933

Tests: manual upgrade test.
I wrote a data set with:
```
./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000
```
This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload:
```
./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100
```
I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs.

Closes #10829

* github.com:scylladb/scylla:
  query: have replica provide the last position
  idl/query: add last_position to query_result
  mutlishard_mutation_query: propagate compaction state to result builder
  multishard_mutation_query: defer creating result builder until needed
  querier: use full_position instead of ad-hoc struct
  querier: rely on compactor for position tracking
  mutation_compactor: add current_full_position() convenience accessor
  mutation_compactor: s/_last_clustering_pos/_last_pos/
  mutation_compactor: add state accessor to compact_mutation
  introduce full_position
  idl: move position_in_partition into own header
  service/paging: use position_in_partition instead of clustering_key for last row
  alternator/serialization: extract value object parsing logic
  service/pagers/query_pagers.cc: fix indentation
  position_in_partition: add to_string(partition_region) and parse_partition_region()
  mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
2022-06-27 12:23:21 +03:00
Benny Halevy
81fa1ce9a1 Revert 'Compact staging sstables'
This patch reverts the following patches merged in
78750c2e1a "Merge 'Compact staging sstables' from Benny Halevy"

> 597e415c38 "table: clone staging sstables into table dir"
> ce5bd505dc "view_update_generator: discover_staging_sstables: reindent"
> 59874b2837 "table: add get_staging_sstables"
> 7536dd7f00 "distributed_loader: populate table directory first"

The feature causes regressions seen with e.g.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/41/testReport/materialized_views_test/TestMaterializedViews/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split011___test_base_replica_repair/
```
AssertionError: Expected [[0, 0, 'a', 3.0]] from SELECT * FROM t_by_v WHERE v = 0, but got []
```

Where views aren't updated properly.
Apparently since `table::stream_view_replica_updates`
doesn't exclude the staging sstables anymore and
since they are cloned to the base table as new sstables
it seems to the view builder that no view updates are
required since there's no changes comparing to the base table.

Reopens #9559

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10890
2022-06-27 12:18:48 +03:00
Benny Halevy
8bccd5e9c5 compaction_manager: task: acquire_semaphore: handle abort_requested_exception
Change 8f39547d89 added
`handle_exception_type([] (const semaphore_aborted& e) {})`,
but it turned out that `named_semaphore_aborted` isn't
derived from `semaphore_aborted`, but rather from
`abort_requested_exception` so handle the base exception
instead.

Fixes #10666

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10881
2022-06-27 09:47:48 +03:00
Pavel Emelyanov
708d3a1ea4 cql-test: Initialize cluster port as integer
Otherwise it complains like this:

_________ ERROR at setup of test_allow_filtering_indexed_no_filtering __________

request = <SubRequest 'cql' for <Function test_allow_filtering_indexed_no_filtering>>

    @pytest.fixture(scope="session")
    def cql(request):
        [...<snip>...]
>       return cluster.connect()

test/cql-pytest/conftest.py:66:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cassandra/cluster.py:1708: in cassandra.cluster.Cluster.connect
    ???
cassandra/cluster.py:1765: in cassandra.cluster.Cluster._new_session
    ???
cassandra/cluster.py:2563: in cassandra.cluster.Session.__init__
    ???
cassandra/pool.py:203: in cassandra.pool.Host.__str__
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: %d format: a real number is required, not str

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-27 09:00:48 +03:00
Benny Halevy
751eceb2e6 types: time_point_to_string: use numeric formatting rather than chrono-format specifiers
As reported in #10867, newer versions of the fmt library
format %Y using 4-characters width, 0-padding the prefix
when needed, while older versions don't do that.

This change moves away from using %Y and friends
fmt specifiers to using explicit numeric-based formatting
conforming to ISO 8601 and making sure the year field
has at least 4 digits and is zero padded.  When
negative, the width is upped to 5 so it would show as -0001
rather than -001.

The unit test was updated respectively.

Fixes #10867

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10870
2022-06-27 08:28:56 +03:00
Benny Halevy
9c231ad0ce repair_reader: construct _reader_handle before _reader
Currently, the `_reader` member is explicitly
initialized with the result of the call to `make_reader`.
And `make_reader`, as a side effect, assigns a value
to the `_reader_handle` member.

Since C++ initializes class members sequentially,
in the order they are defined, the assignment to `_reader_handle`
in `make_reader()` happens before `_reader_handle` is initialized.

This patch fixes that by changing the definition order,
and consequently, the member initialization order
in the constructor so that `_reader_handle` will be (default-)initialized
before the call to `make_reader()`, avoiding the undefined behavior.

Fixes #10882

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10883
2022-06-26 20:17:47 +03:00
Amnon Heiman
6b9b76c919 main.cc: add trailing backslash to the API directories
The API uses the http server to serve two directories: the api_ui_dir
where the swagger-ui directory is found and the api_doc_dir where the
swagger definition files are found.

Internally, the API uses the httpd::directory_handler that append the
files it gets from the path to the base directory name.
A user can override the default configuration and set a directory name
that will not end with a backslash. This will result with files not
found.

This patch check if that backslash is missing, and if it is, adds it to
the API configuration.

Fixes #10700

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #10877
2022-06-26 20:05:37 +03:00
David Garcia
bb21c3c869 Move dev docs to docs/dev 2022-06-24 18:07:08 +01:00
Piotr Sarna
f2bb676d27 docs: mention python in debugging.md
Evaluating Python code from within gdb is priceless,
especially that all helper classes and functions sourced from
scylla-gdb.py can be used in there. This commit adds a paragraph
in debugging.md mentioning this tool.

Closes #10869
2022-06-24 15:16:43 +03:00
Pavel Emelyanov' via ScyllaDB development
a78af050fd cql: Constify select_statement restrictions
It is in fact immutable (both the pointer and the object it points
to), so is the pointer copy returned by get_restrictions() method,
so are those propagated to filtering stuff.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1028

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220624083351.24970-1-xemul@scylladb.com>
2022-06-24 12:27:36 +03:00
Nadav Har'El
905088ce7a cql: improve error message for static column in materialized view
Static columns are not currently allowed in a materialized view. If the
base table has a static column and one tries to create a view with a
"SELECT *", the following error message is printed today:

  Unable to include static column 'ColumnDefinition{name=s,
  type=org.apache.cassandra.db.marshal.Int32Type, kind=STATIC,
  componentIndex=null, droppedAt=-9223372036854775808}' which would
  be included by Materialized View SELECT * statement

It is completely unnecessary to include all these details about the
column definition - just its name would have sufficed. In other words,
we should print def.name_as_text(), not the entire def. This is what
other error messages in the same file do as well.

After this patch the error message becomes nicer and clearer:

  Unable to include static column 's' which would be included by
  Materialized View SELECT * statement

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10854
2022-06-24 11:19:33 +03:00
Botond Dénes
78750c2e1a Merge 'Compact staging sstables' from Benny Halevy
This series decouples the staging sstables from the table's sstable set.

The current behavior keeps the sstables in the staging directory until view building is done. They are readable as any other sstable, but fenced off from compaction, so they don't go away in the meanwhile.

Currently, when views are built, the sstables are moved into the main table directory where they will then be compacted normally.

The problem with this design is that the staging sstables are never compacted, in particular they won't get cleaned up or scrubbed.

The cleanup scenario open a backdoor for data resurrection when the staging sstables are moved after view building while possibly containing stale partitions (#9559) which will not be cleaned up until next time cleanup compaction is performed.

With this series, SSTables that are created in or moved to the staging sub-directory are "cloned" into the base table directory by hard-linking the components there and creating a new sstable object which loads the cloned files.

The former, in the staging directory is used solely for view building and is not added to the table's sstable set, while the latter, its clone, behaves like any other sstable and is added either to the regular or maintenance set and is read and compacted normally.

When view building is done, instead of moving the staging sstable into the table's base directory, it is simply unlinked.
If its "clone" wasn't compacted away yet, then it will just remain where it is, exactly like it would be after it was moved there in the present state of things.  If it was already compacted and no longer exists, then unlinking will then free its storage.

Note that snapshot is based on the sstables listed by the table, which do not include the staging sstables with this change.
But that shouldn't matter since even today, the sstables in the snapshot has no notion of "staging" directory and it is expected that the MV's are either updated view `nodetool refresh` if restoring sstables from snapshot using the uploads dir, or if restoring the whole table from backup - MV's are effectively expected to be rebuilt from scratch (they are not included in automatic snapshots anyway since we don't have snapshot-coherency across tables).

A fundamental infrastructure change was done to achieve that which is to change the sstable_list which was a std::unordered_set<shared_sstable> into a std::unordered_map<generation_type, shared_sstable> that keeps the shared_sstable objects indexed by generation number (that must be unique).  With this model, sstables are supposed to be searched by the generation number, not by their pointer, since when the staging sstable is clones, there will be 2 shared_sstable objects with the same generation (and different `dir()`) and we must distinguish between them.

Special care was taken to throw a runtime_error exception if when looking up a shared sstable and finding another one with the same generation, since they must never exist in the same sstable_map.

Fixes #9559

Closes #10657

* github.com:scylladb/scylla:
  table: clone staging sstables into table dir
  view_update_generator: discover_staging_sstables: reindent
  table: add get_staging_sstables
  view_update_generator: discover_staging_sstables: get shared table ptr earlier
  distributed_loader: populate table directory first
  sstables: time_series_sstable_set: insert: make exception safe
  sstables: move_to_new_dir: fix debug log message
2022-06-24 08:05:38 +03:00
Botond Dénes
1f4f8ba773 Merge 'compaction_manager: track if off-startegy compaction was performed in run_offstrategy_compaction' from Benny Halevy
This series moves the logic to not perform off-strategy compaction if the maintenance set is empty from the table layer down to the compaction_manager layer since it is the one that needs to make the decision.

With that compaction_manager::perform_offstrategy will return a future<bool> which resolves to true
iff off-strategy compaction was required and performed.

The sstable_compaction_test was adjusted and a new compaction_manager_for_testing class was added
to make sure the compaction manager is enabled when constructed (it wasn't so test_offstrategy_sstable_compaction didn't perform any off-strategy compactions!) and stopped before destroyed.

Closes #10848

* github.com:scylladb/scylla:
  table: perform_offstrategy_compaction: move off-strategy logic to compaction_manager
  compaction_manager: offstrategy_compaction_task: refactor log printouts
  test: sstable_compaction: compaction_manager_for_testing
2022-06-24 08:04:02 +03:00
Avi Kivity
dab56b82fa Merge 'Per-partition rate limiting' from Piotr Dulikowski
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.

This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:

```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
	'max_writes_per_second': 100,
	'max_reads_per_second': 200
};
```

Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead.

Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.

The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.

Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:

- Write mode:
  ```
  8f690fdd47 (PR base):
  129644.11 tps ( 56.2 allocs/op,  13.2 tasks/op,   49785 insns/op)
  This PR:
  125564.01 tps ( 56.2 allocs/op,  13.2 tasks/op,   49825 insns/op)
  ```
- Read mode:
  ```
  8f690fdd47 (PR base):
  150026.63 tps ( 63.1 allocs/op,  12.1 tasks/op,   42806 insns/op)
  This PR:
  151043.00 tps ( 63.1 allocs/op,  12.1 tasks/op,   43075 insns/op)
  ```

Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected

Fixes: #4703

Closes #9810

* github.com:scylladb/scylla:
  storage_proxy: metrics for per-partition rate limiting of reads
  storage_proxy: metrics for per-partition rate limiting of writes
  database: add stats for per partition rate limiting
  tests: add per_partition_rate_limit_test
  config: add add_per_partition_rate_limit_extension function for testing
  cf_prop_defs: guard per-partition rate limit with a feature
  query-request: add allow_limit flag
  storage_proxy: add allow rate limit flag to get_read_executor
  storage_proxy: resultize return type of get_read_executor
  storage_proxy: add per partition rate limit info to read RPC
  storage_proxy: add per partition rate limit info to query_result_local(_digest)
  storage_proxy: add allow rate limit flag to mutate/mutate_result
  storage_proxy: add allow rate limit flag to mutate_internal
  storage_proxy: add allow rate limit flag to mutate_begin
  storage_proxy: choose the right per partition rate limit info in write handler
  storage_proxy: resultize return types of write handler creation path
  storage_proxy: add per partition rate limit to mutation_holders
  storage_proxy: add per partition rate limit info to write RPC
  storage_proxy: add per partition rate limit info to mutate_locally
  database: apply per-partition rate limiting for reads/writes
  database: move and rename: classify_query -> classify_request
  schema: add per_partition_rate_limit schema extension
  db: add rate_limiter
  storage_proxy: propagate rate_limit_exception through read RPC
  gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
  storage_proxy: pass rate_limit_exception through write RPC
  replica: add rate_limit_exception and a simple serialization framework
  docs: design doc for per-partition rate limiting
  transport: add rate_limit_error
2022-06-24 01:32:13 +03:00
Kamil Braun
a3d2f54806 service/raft: raft_group_registry: add assertions when fetching servers for groups
Better than dereferencing null-pointers or null-opts.
2022-06-23 16:14:41 +02:00
Kamil Braun
bb58ee0b2e service/raft: raft_group_registry: remove _raft_support_listener
It did nothing.
It will be readded in `raft_group0` and it will do something, stay
tuned.

With this we can remove the `feature_service` reference from
`raft_group_registry`.
2022-06-23 16:14:41 +02:00
Kamil Braun
0f78d81573 service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
For better observability during testing or debugging.
2022-06-23 16:14:41 +02:00
Kamil Braun
8e907cbf57 service/raft: raft_group0: move group 0 RPC handlers from storage_service
And generate the boilerplate from IDL declarations.
Simplifies the code, and the code now resides where it belongs.
2022-06-23 16:14:41 +02:00
Kamil Braun
4f0feee43e service/raft: messaging: extract raft_addr/inet_addr conversion functions
Don't repeat yourself.
2022-06-23 16:14:41 +02:00
Kamil Braun
5da163e0b8 service: storage_service: initialize raft_group0 in main and pass a reference to join_cluster
`raft_group0` was constructed at the beginning of `join_cluster`, which
required passing references to 3 additional services to `join_cluster`
used only for that purpose (group 0 client, raft group registry, and
query processor).

Now we initialize `raft_group0` in main - like all other services - and
pass a reference to `join_cluster` so `storage_service` can store a
pointer to group 0.

We initialize `raft_group0` before we start listening for RPCs in
`messaging_service`. In a later commit we'll move the initialization
of group 0 related verbs to the constructor of `raft_group0` from
`storage_service`, so they will be initialized before we start
listening for RPCs.
2022-06-23 16:14:41 +02:00
Kamil Braun
4d439a16b3 treewide: remove unnecessary migration_manager::is_raft_enabled() calls
In schema_altering_statement: we will bounce statements to shard 0
whether Raft is enabled or not.

In migration_manager, when we're sending a group 0 snapshot: well, if
we're sending a group 0 snapshot, Raft must be enabled; the check is
redundant.
2022-06-23 16:14:41 +02:00
Kamil Braun
411231da75 test/boost: memtable_test: perform schema operations on shard 0
Will be a prerequisite with Raft enabled.
2022-06-23 16:14:41 +02:00
Kamil Braun
3be376f6c5 test/boost: cdc_test: remove test_cdc_across_shards
The test checked if creating a table with CDC enabled on shard other
than 0 would create the CDC log table as well; it was a regression test
for #5582. However we will soon bounce all schema change requests to
shard 0, so the test's purpose is gone.

I need to remove this test because `cquery_nofail` does not handle the
bouncing correctly: it silently accepts the bounce message, assumes that
the query was successful and returns. So after we change the code to
start bouncing all requests to shard 0, if a query was ran inside test
code using `cquery_nofail` on a shard different than 0 it would do
nothing and following queries executed on shard 0 would fail because they
depended on the effect of the aforementioned query.
2022-06-23 16:14:41 +02:00
Kamil Braun
c030d03893 message: rename send_message_abortable to send_message_cancellable
It's not possible to abort an RPC call entirely, since the remote part
continues running (if the message got out). Calling the provided abort
source does the following:
1. if the message is still in the outgoing queue, drop it,
2. resolve waiter callbacks exceptionally.

Using the word "cancellable" is more appropriate.

Also write a small comment at `send_message_cancellable`.
2022-06-23 16:14:41 +02:00
Kamil Braun
07fe3e4a99 message: change parameter order in send_message_oneway_timeout
Make it consistent with the other 'send message' functions.
Simplify code generation logic in idl-compiler.

Interestingly this function is not used anywhere so I didn't have to fix
any call sites.
2022-06-23 16:14:41 +02:00
Benny Halevy
597e415c38 table: clone staging sstables into table dir
clone staging sstables so their content may be compacted while
views are built.  When done, the hard-linked copy in the staging
subdirectory will be simply unlinked.

Fixes #9559

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
ce5bd505dc view_update_generator: discover_staging_sstables: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
59874b2837 table: add get_staging_sstables
We don't have to go over all sstables in the table to select the
staging sstables out of them, we can get it directly from the
_sstables_staging map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
b8b14d76b3 view_update_generator: discover_staging_sstables: get shared table ptr earlier
It's potentially a bit more efficient since
t.get_sstables is called only once, while
t.shared_from_this() is called per staging sstable.

Also, prepare for the following patches that modify
this function further.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
7536dd7f00 distributed_loader: populate table directory first
So we can clone staging sstables into it later
when populating the table from the staging_dir

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
cd68b04fbf sstables: time_series_sstable_set: insert: make exception safe
Need to erase the shared sstable from _sstables
if insertion to _sstables_reversed fails.

Fixes #10787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
9d41676116 sstables: move_to_new_dir: fix debug log message
Remove extraneous `old_dir` arg.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Raphael S. Carvalho' via ScyllaDB development
32600f60f3 scylla-gdb: Fix scylla_compaction_tasks
Make it account for all the changes done in the compaction manager
recently. 5.0 is not affected. So does not merit a backport.

(gdb) scylla compaction-tasks
        1 type=sstables::compaction_type::Reshard, state=compaction_manager::task::state::active, "keyspace1"."standard1"
Total: 1 instances of compaction_manager::task

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220621225600.20359-1-raphaelsc@scylladb.com>
2022-06-23 16:17:31 +03:00
Piotr Sarna
026f58f2a4 scylla-gdb: document scylla_shard
The command is quite straightforward, but it didn't offer
any documentation when calling `help scylla shard`, so it's
hereby added. As a small bonus, a more comprehensive message
is printed when the argument is not an integer.
Message-Id: <9b958a4befce1c7baa6f86504ab74b93840b37e9.1655984258.git.sarna@scylladb.com>
2022-06-23 15:24:36 +03:00
Piotr Sarna
280e54c3e6 scylla-gdb: add printing registers from seastar::thread
`scylla thread` command is extended with a non-intrusive
option for dumping saved registers from the jmp_buf structure
in an unmangled form.
It can later be useful, e.g. for peeking into thread's instruction
pointer or reasoning about its stack.

Example debugging session:
(gdb) scylla threads
[shard 1] (seastar::thread_context*) 0x6010000d9e00, stack: 0x601004f00000
[shard 1] (seastar::thread_context*) 0x6010000daf00, stack: 0x601004e00000

(gdb) scylla thread --print-regs 0x6010000d9e00
rbx: 0x601004f1fd00
rbp: 0x601004f1fc20
r12: 0x6010000d9e20
r13: 0x6010002a3190
r14: 0x601004f1fd08
r15: 0x6010000d9e10
rsp: 0x601004f1fbb0
rip: 0x2f0aea6

(gdb) disassemble 0x2f0aea6
Dump of assembler code for function _ZN7seastar12jmp_buf_link10switch_outEv:
   0x0000000002f0ae90 <+0>:	push   %rax
   0x0000000002f0ae91 <+1>:	mov    0xc8(%rdi),%rax
   0x0000000002f0ae98 <+8>:	mov    %rax,%fs:0xfffffffffffe5dc8
   0x0000000002f0aea1 <+17>:	call   0x30333d0 <_setjmp@plt>
   0x0000000002f0aea6 <+22>:	test   %eax,%eax
   0x0000000002f0aea8 <+24>:	je     0x2f0aeac <_ZN7seastar12jmp_buf_link10switch_outEv+28>
   0x0000000002f0aeaa <+26>:	pop    %rax
   0x0000000002f0aeab <+27>:	ret
   0x0000000002f0aeac <+28>:	mov    %fs:0xfffffffffffe5dc8,%rdi
   0x0000000002f0aeb5 <+37>:	mov    $0x1,%esi
   0x0000000002f0aeba <+42>:	call   0x30333c0 <longjmp@plt>
End of assembler dump.
Message-Id: <553c1ed76987776916d5261ed13866650e84df34.1655984258.git.sarna@scylladb.com>
2022-06-23 15:24:36 +03:00
Piotr Sarna
01d281442e test: extend view filtering test case
In order to cover more code paths, the test case
now places filtering on various combinations of base columns,
including both primary keys and regular columns.
It also makes the test scylla_only, as filtering is an extension
not supported in Cassandra right now.

Closes #10860
2022-06-23 14:19:41 +03:00
Botond Dénes
fd5f8f2275 query: have replica provide the last position
Use the recently introduced query-result facility to have the replica
set the position where the query should continue from. For now this is
the same as what the implicit position would have been previously (last
row in result), but it opens up the possibility to stop the query at a
dead row.
2022-06-23 13:36:24 +03:00
Botond Dénes
009d2fe2f7 idl/query: add last_position to query_result
To be used to allow the replica to specify the last position in the
stream, where the query was left off. Currently this is always
the same as the implicit position -- the last row in the result-set --
but this requires only stopping the read on a live row, which is a
requirement we want to lift: we want to be able to stop on a tombstone.
As tombstones are not included in the query result, we have to allow the
replica to overwrite the last seen position explicitly.
This patch introduces the new field in the query-result IDL but it is
not written to yet, nor is it read, that is left for the next patches.
2022-06-23 13:36:24 +03:00
Botond Dénes
7b6b7a49cd mutlishard_mutation_query: propagate compaction state to result builder
Not used in this patch, facilitates further patching.
2022-06-23 13:36:24 +03:00
Botond Dénes
738cb99c53 multishard_mutation_query: defer creating result builder until needed
Currently the result builder is created two frames above the method in
which actually needed. Push down a factory method instead and create it
where actually used. This allows us to pass it arguments that are
present only in the method which uses it.
2022-06-23 13:36:24 +03:00
Botond Dénes
5575f8a55a querier: use full_position instead of ad-hoc struct 2022-06-23 13:36:24 +03:00
Botond Dénes
58d53b66c1 querier: rely on compactor for position tracking
For some time now the compactor track its own position. The querier can
make use of this instead of duplicating this effort.
2022-06-23 13:36:24 +03:00
Botond Dénes
9beef08a1b mutation_compactor: add current_full_position() convenience accessor 2022-06-23 13:36:24 +03:00
Botond Dénes
a3cd235de2 mutation_compactor: s/_last_clustering_pos/_last_pos/
Generalize position tracking to track non-clustering positions too.
Also add an accessor for it.
2022-06-23 13:36:24 +03:00
Botond Dénes
5a6e807a1c mutation_compactor: add state accessor to compact_mutation 2022-06-23 13:36:24 +03:00
Botond Dénes
e0cf7cec27 introduce full_position
A simple struct containing a full position, including a partition key
and a position in partition. Two variants are introduced: an owning
version and a view. This is to replace all the ad-hoc structures
introduced for the same purpose: std::pair() and std::tuple() of
partition key and clustering key, and other similar small structs
scattered around the code.
This patch does not replace any of the above mentioned construcs with
the new full_position, it merely introduces it to enable incremental
standardization.
2022-06-23 13:36:24 +03:00
Botond Dénes
119be5d5db idl: move position_in_partition into own header
So it can be used without pulling in all of partition_checksum.idl.hh.
2022-06-23 13:36:24 +03:00
Botond Dénes
2b0bc11f2e service/paging: use position_in_partition instead of clustering_key for last row
The former allows for expressing more positions, like a position
before/after a clustering key. This practically enables the coordinator
side paging logic, for a query to be stopped at a tombstone (which can
have said positions).
2022-06-23 13:36:20 +03:00
Avi Kivity
3c33fe93df install-dependencies.sh: uprgade node_exporter to 1.3.1
New features and bugfixes.

Closes #10859
2022-06-23 11:47:13 +03:00
Botond Dénes
adabe3b5a3 alternator/serialization: extract value object parsing logic
To make it reusable by a method added by the next patch.
2022-06-23 11:33:18 +03:00
Botond Dénes
cb0146e372 service/pagers/query_pagers.cc: fix indentation
Broken since forever.
2022-06-23 11:19:55 +03:00
Botond Dénes
a7d467d794 position_in_partition: add to_string(partition_region) and parse_partition_region()
And rebase operator<<(partition_region) on top of the former.
2022-06-23 11:19:55 +03:00
Botond Dénes
ab0f3512c8 mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
Where its definition lives.
2022-06-23 11:19:55 +03:00
Nadav Har'El
51fbc89df3 util/chunked_vector: more complete comment
chunked_vector was headed by short comment which didn't really explain
why it exists and how and why it really differs from std::dequeue.
Moreover, it made the vague claim that it "limits" contiguous
allocations, which it really doesn't (at least not in the asymptotic
sense).

In this patch I wrote a much longer comment, which I hope will clearly
explain exactly what chunked_vector is, how it really differs in its
contiguous allocations from std::deque, and what it guarantees and
doesn't guarantee.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10857
2022-06-23 10:33:35 +03:00
Takuya ASADA
3a51e7820a scylla_cpuset_setup: stop deleting perftune.yaml and skip update cpuset.conf when same parameter specified
To make scylla setup scripts easier to handle in Ansible, stop deleting
perftune.yaml and detect cpuset.conf changes by mtime of the file.
Also, skip update cpuset.conf when same parameter specified.

Fixes #10121

Closes #10312
2022-06-23 10:28:36 +03:00
Botond Dénes
080ed590bf Merge "Obtain dc/rack from topology, not snitch" from Pavel Emelyanov
"
The way dc/rack info is maintained is very intricate.

The dc/rack strings originate at snitch, get propagated via gossiper,
get notified to storage service which, in turn, stores them into the
system keyspace and token metadata. Code that needs to get dc/rack
for a given endpoint calls snitch which tries to get the data from
gossiper and if failed goes and loads it from system keyspace cache.
Also there's "internal IP" thing hanging arond that loops messaging
service in both -- updating and getting the info.

The plan is to make topology (that currently sits on token metadata)
stay the only "source of truth" regarding the endpoints' dc/rack and
internal IP info. The dc/rack mappings are put into topology already,
but it cannot yet fully replace snitch for two reasons:

- it doesn't map internal IP to endpoint
- it doesn't get data stored in system keyspace

So what this patch set does is patches most of the dc/rack getters
to call topology methods. The topology is temporarily patched to
just call the respective snitch methods. This removes a big portion
of calls for global snitch instance.

After the set the places that still explicitly rely on snitch to
provide dc/rack are

- messaging service: needs internal IP knowledge on topology
- db/consistency_level: is all "global", needs heavier patching
- tests: just later
"

* 'br-get-dc-rack-from-topology-2' of https://github.com/xemul/scylla:
  proxy stats: Get rack/datacenter from topology
  proxy stats: Push topology arg to get_ep_stats
  api: Get rack/datacenter from topology
  hints: Remove snitch dependency
  hints: Get rack/datacenter from topology
  alternator: Get rack/datacenter from topology
  range_streamer: Get rack/datacenter from topology
  repair: Get rack/datacenter from topology
  view: Get rack/datacenter from topology
  storage_service: Get rack/datacenter from topology
  proxy: Get rack/datacenter from topology
  topology: Add get_rack/_datacenter methods
2022-06-23 10:01:36 +03:00
Benny Halevy
a65ed19edc table: perform_offstrategy_compaction: move off-strategy logic to compaction_manager
compaction_manager needs to decide about running off-strategy
compaction or not based on the maintenance_set, not partly
in table::trigger_offstrategy_compaction and part in
the compaction_manager layer as it is done today.

So move the logic down to performa_offstrategy
that now returns future<bool> to return true
iff it performed offstrategy compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 08:18:17 +03:00
Benny Halevy
9079c98db0 compaction_manager: offstrategy_compaction_task: refactor log printouts
Move logging from run_offstrategy_compaction to do_run
so that in the next patch we can skip run_offstrategy_compaction
if the maintenance set is empty (but still log it,
for the sake of dtests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 08:02:44 +03:00
Benny Halevy
34e9391587 test: sstable_compaction: compaction_manager_for_testing
Make the compaction manager for testing using
this class.

Makes sure to enable the compaction manager
and to stop it before it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 08:02:44 +03:00
Igor Ribeiro Barbosa Duarte
277f5a4009 utils/loading_cache.hh: Add reset method
This patch adds a reset method which is going to be used in the next patches
for updating the loadind_cache config and also to be able to flush the cache
without having to update scylla config

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-23 01:18:22 -03:00
Piotr Dulikowski
442901f14a storage_proxy: metrics for per-partition rate limiting of reads
Adds a metric "read_rate_limited" which indicates how many times a read
operation was rejected due to per-partition rate limiting. The metric
differentiates between reads rejected by the coordinator and reads
rejected by replicas.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
6e5d486970 storage_proxy: metrics for per-partition rate limiting of writes
Adds a metric "write_rate_limited" which indicates how many times a
write operation was rejected due to per-partition rate limiting. The
metric differentiates between writes rejected by the coordinator and
writes rejected by replicas.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
13a5022499 database: add stats for per partition rate limiting
Adds statistics which count how many times a replica has decided to
reject a write ("total_writes_rate_limited") or a read
("total_reads_rate_limited").
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
bc50163016 tests: add per_partition_rate_limit_test
Adds the per_partition_rate_limit_test.cc file. Currently, it only
contains a test which verifies that the feature correctly switches off
rate limiting for internal queries (!allow_limit || internal sg).
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
761a037afb config: add add_per_partition_rate_limit_extension function for testing
...and use it in cql_test_env to enable the per_partition_rate_limit
extension for all tests that use it.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
1a36029ab5 cf_prop_defs: guard per-partition rate limit with a feature
The per-partition rate limit feature requires all nodes in the cluster
to support it in order to work well. This commit adds a check which
disallows creating/altering tables with per-partition rate limit until
the node is sure that all nodes in the cluster support it.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
a7ad70600d query-request: add allow_limit flag
Adds allow_limit flag to the read_command. The flag decides whether rate
limiting of this operation is allowed.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
c691e94190 storage_proxy: add allow rate limit flag to get_read_executor
Adds a flag to get_read_executor which decides whether the read should
be rate limited or not. The read executors were modified to choose the
appropriate per partition rate limit info parameter and send it to the
replicas.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
3357066387 storage_proxy: resultize return type of get_read_executor
Now, get_read_executor is able to return coordinator exceptions without
throwing them. In an upcoming commit, it will start returning rate limit
exception in some cases and it is preferable to return them without
throwing.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
d3d9add219 storage_proxy: add per partition rate limit info to read RPC
Now, the read RPC accept the per partition rate limit info parameter. It
is passed on to query_result_local(_digest) methods.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
e8e8ada4b4 storage_proxy: add per partition rate limit info to query_result_local(_digest)
The query_result_local and query_result_local_digest methods were
updated to accept db::per_partition_rate_limit::info structure and pass
it on to database::accept.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
e6beab3106 storage_proxy: add allow rate limit flag to mutate/mutate_result
Now, mutate/mutate_result accept a flag which decides whether the write
should be rate limited or not.

The new parameter is mandatory and all call sites were updated.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
1f65c4e001 storage_proxy: add allow rate limit flag to mutate_internal
Now, mutate_internal accepts a flag which decides whether the write
should be rate limited or not.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
1e4e92ed8b storage_proxy: add allow rate limit flag to mutate_begin
Now, mutate_begin accepts a flag which decides whether given write
should be rate limited or not.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
76e95e7ae8 storage_proxy: choose the right per partition rate limit info in write handler
Now, write response handler calculates the appropriate rate limit info
parameter and passes it to the mutation holder.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
2a7ba76c3e storage_proxy: resultize return types of write handler creation path
The mutate_prepare and create_write_response_handler(_helper) functions
are modified to be able to return exceptions without throwing them. In
an upcoming commit, create_write_response_handler will sometimes return
rate limit exception, and it is preferable to return them without
throwing.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
3f88ecdea6 storage_proxy: add per partition rate limit to mutation_holders
Now, `apply_locally` and `apply_remotely` accept the per partition rate
limit info parameter.
2022-06-22 20:16:49 +02:00
Piotr Dulikowski
02469e0b15 storage_proxy: add per partition rate limit info to write RPC
Adds db::per_partition_rate_limit::info parameter to the write RPC. The
rate limit info controls the behavior of the rate limiter on the
replica.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
c06376b383 storage_proxy: add per partition rate limit info to mutate_locally
Now, mutate_locally accepts a parameter that controls the rate limiter
behavior on the replica.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
cc9a2ad41f database: apply per-partition rate limiting for reads/writes
Adds the `db::rate_limiter` to the `database` class and modifies the
`query` and `apply` methods so that they account the read/write
operations in the rate limiter and optionally reject them.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
ec635ba170 database: move and rename: classify_query -> classify_request
Moves the classify_query higher and renames it to classify_request. The
function will be reused in further commits to protect non-user queries
from accidentally being rate limited.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
dccb8a5729 schema: add per_partition_rate_limit schema extension
Adds the new `per_partition_rate_limit` schema extension. It has two
parameters: `max_writes_per_second` and `max_reads_per_second`.
In the future commits they will control how many operations of given
type are allowed for each partition in the given table.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
0fe8b55427 db: add rate_limiter
Introduces the rate_limiter, a replica-side data structure meant for
tracking the frequence with which each partition is being accessed
(separately for reads and writes) and deciding whether the request
should be accepted and processed further or rejected.

The limiter is implemented as a statically allocated hashmap which keeps
track of the frequency with which partitions are accessed. Its entries
are incremented when an operation is admitted and are decayed
exponentially over time.

If a partition is detected to be accessed more than its limit allows,
requests are rejected with a probability calculated in such a way that,
on average, the number of accepted requests is kept at the limit.

The structure currently weights a bit above 1MB and each shard is meant
to keep a separate instance. All operations are O(1), including the
periodic timer.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
2162bb9f3b storage_proxy: propagate rate_limit_exception through read RPC
This commit modifies the read RPC and the storage_proxy logic so that
the coordinator knows whether a read operation failed due to rate limit
being exceeded, and returns `exceptions::rate_limit_exception` if that
happens.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
000f417d23 gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
We would like to extend the read RPC to return an optional, second value
which indicates an exception - seastar type-erases exception on the RPC
handler boundary and we need to differentiate rate_limit_exception from
others. However, it may happen that a replica with an up-to-date version
of Scylla tries to return an exception in this way to a coordinator with
an old version and the coordinator will drop the error, thinking that
the request succeeded.

In order to protect from that, we introduce the
`TYPED_ERROR_IN_READ_RPC` feature. Only after it is enabled replicas
will start returning exceptions in the new way, and until then all
exceptions will be reported using seastar's type-erasure mechanism.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
51546b0609 storage_proxy: pass rate_limit_exception through write RPC
This commit modifies the storage_proxy logic so that the coordinator
knows whether a write operation failed due to rate limit being exceeded,
and returns `exceptions::rate_limit_exception` when that happens.
2022-06-22 20:16:48 +02:00
Piotr Dulikowski
621b7f35e2 replica: add rate_limit_exception and a simple serialization framework
Introduces `replica::rate_limit_exception` - an exceptions that is
supposed to be thrown/returned on the replica side when the request is
rejected due to the exceeding the per-partition rate limit.

Additionally, introduces the `exception_variant` type which allows to
transport the new exception over RPC while preserving the type
information. This will be useful in later commits, as the coordinator
will have to know whether a replica has failed due to rate limit being
exceeded or another kind of error.

The `exception_variant` currently can only either hold "other exception"
(std::monostate) or the aforementioned `rate_limit_exception`, but can
be extended in a backwards-compatible way in the future to be able to
hold more exceptions that need to be handled in a different way.
2022-06-22 20:07:58 +02:00
Piotr Dulikowski
a55d7ad46d docs: design doc for per-partition rate limiting 2022-06-22 20:07:58 +02:00
Piotr Dulikowski
efc3953c0a transport: add rate_limit_error
Adds a CQL protocol extension which introduces the rate_limit_error. The
new error code will be used to indicate that the operation failed due to
it exceeding the allowed per-partition rate limit.

The error code is supposed to be returned only if the corresponding CQL
extension is enabled by the client - if it's not enabled, then
Config_error will be returned in its stead.
2022-06-22 20:07:58 +02:00
Piotr Sarna
bc3a635c42 view: exclude using static columns in the view filter
The code which applied view filtering (i.e. a condition placed
on a view column, e.g. "WHERE v = 42") erroneously used a wildcard
selection, which also assumes that static columns are needed,
if the base table contains any such columns.
The filtering code currently assumes that no such columns are fetched,
so the selection is amended to only ask for regular columns
(primary key columns are sent anyway, because they are enabled
via slice options, so no need to ask for them explicitly).

Fixes #10851

Closes #10855
2022-06-22 15:55:45 +03:00
Pavel Emelyanov' via ScyllaDB development
b0b29edcd7 distributed-loader: Remove ensure_system_table_directories
It looks like the exactly same code is called few steps above via

distributed_loader::init_system_keyspace
 `- distributed_loader::populate_keyspace

While at it -- move the supervisor::notify("loading system sstables")
handing around in the more suitable location.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/981/

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220621165313.31284-1-xemul@scylladb.com>
2022-06-22 13:59:00 +03:00
Nadav Har'El
cf289ad538 Merge 'types: time_point_to_string: harden against out of range timestamps' from Benny Halevy
The time point is multiplied by an adjustment factor of 1000
for boost::posix_time::time_duration::ticks_per_second() = 1000000
when calling boost::posix_time::milliseconds(count)
and that may lead to integer overflow as reported
by the UndefinedBehaviorSanitizer.

See https://github.com/scylladb/scylla/issues/10830#issuecomment-1158899187

This change checks for possible overflow in advance and
prints the raw counter value in this case, along with
an explanation.

Refs #10830

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10831

* github.com:scylladb/scylla:
  test: types: add test cases for timestamp_type to_string format
  types: time_point_to_string: harden against out of range timestamps
2022-06-22 14:01:06 +03:00
Pavel Emelyanov
f0cafc35fd proxy stats: Get rack/datacenter from topology
The reference is already at hand. The get_ep_stats() calls another
helper that also maps endpoint to datacenter, but it can get the
obtained dc sstring via argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:27 +03:00
Pavel Emelyanov
8ffe249430 proxy stats: Push topology arg to get_ep_stats
The latter will need it to get dc info from. All the callers are either
storage proxy or have storage proxy pointer/reference to get topology
from.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:27 +03:00
Pavel Emelyanov
3ab7c9320c api: Get rack/datacenter from topology
The http_ctx already has token metadata on board, it's possible to get
topology from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:27 +03:00
Pavel Emelyanov
820be06ac1 hints: Remove snitch dependency
After previous patch hints manager class gets unused dependency on
snitch. While removing it it turns out that several unrelated places
get needed headers indirectly via host_filter.hh -> snitsh_base.hh
inclusion.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Pavel Emelyanov
9b6312687b hints: Get rack/datacenter from topology
The topology referecne is obtained from the proxy anchor pointer sitting
on manager.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Pavel Emelyanov
98a4d41e31 alternator: Get rack/datacenter from topology
It's needed in two places, both can get topology from the proxy's token
metadata.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Pavel Emelyanov
5e2fa32c8c range_streamer: Get rack/datacenter from topology
It's needed in source filter classes so range-streamer passes the
topology reference into its methods.

Nice side effect -- snitch header goes away from range-streamer one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Pavel Emelyanov
b28db0294c repair: Get rack/datacenter from topology
Repair gets token metadata from its local database reference. Not
perfect, repair should better have its own private token meta reference,
but it's OK for now.

The change obsoletes static get_local_dc helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Pavel Emelyanov
17128eb54b view: Get rack/datacenter from topology
The view code already gets token metadata from global proxy instance. Do
the same to get topology object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Pavel Emelyanov
894cbeacc5 storage_service: Get rack/datacenter from topology
Same as in previous patch -- storage service has token metadata to get
topology from.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Pavel Emelyanov
507db73586 proxy: Get rack/datacenter from topology
Proxy has shared token metadata from which it can get the topology.
This change obsoletes static get_local_dc() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Pavel Emelyanov
b6f7c8da8b topology: Add get_rack/_datacenter methods
For now they just forward the request to snitch. Once topology is
properly updated boot-time dc/rack info and knows internal IP
it will be able to serve request on its own.

For convenience overloads without arguments return dc/rack for
current node.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Avi Kivity
e499f45593 Update seastar submodule
* seastar 443e6a9b77...ff46af9ae0 (15):
  > rpc: Take care of client::send() future in send_helper
  > test: futures: add test_get_on_exceptional_promise
  > compile_commands.json generation in configure
  > condition-variable: use an empty loop for spinning CPU
  > byteorder: use boost::endian to do the conversion.
  > Merge "Replace RPC outgoing queue with continuation chain" from Pavel E
  > test_runner: use std::endl to ensure messages are flushed
  > memory: realloc: defer to malloc if ptr is null
  > cmake: require boost 1.73 for building with C++20
  > reactor: backend: io_uring: disable on old kernels if RAID devices exist
  > Move function in invoke_on_all
  > core/loop: drop unused parameters
  > net/api: add connected_socket::operator bool()
  > fix cpuset count is zero after shift
  > docker: add pandoc package

Closes #10845
2022-06-22 00:39:24 +03:00
Michael Livshin
d7c90b5239 sstables: make generation_type an actual separate type
Now that `generation_type` is used properly (at least in some places),
we turn to the compiler to help keep the generation/value separation
intact.

Fixes #10796.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-21 20:08:01 +03:00
Konstantin Osipov
c59a730c1e test: re-enable but mark as flaky cdc_with_lwt_test
Running flaky tests prevents regressions from sneaking which is possible if the test is disabled.

Closes #10832
2022-06-21 16:36:49 +03:00
Benny Halevy
e5b7ce4cb7 test: types: add test cases for timestamp_type to_string format
Following the previous patch that changed time_point_to_string
we should cement the different edge cases for the next
time this function changes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-21 14:11:35 +03:00
Avi Kivity
88f75f91ae cql3: grammar: move semantic check for incompatible UPDATEs to prepare() code
The grammar now checks that UPDATEs don't clash (for example,
updates to the same column). The checks are good, but the grammar
isn't the right place for them - better to concentrate all the checks
in the prepare() code so it's easy to see all the checks.

Move the checks to raw::update_statement::prepare_internal(). This
exposes that the checks are quadratic, so add a comment. It could be
fixed with a stable_sort() first, but that is left to later.

Closes #10820
2022-06-21 11:42:11 +02:00
Botond Dénes
2b62f67593 tools/scylla-types: escape {} chars in description
So fmt::format() doesn't interpret them as substitutions and doesn't
error-out because there is no argument for them.

Closes #10803
2022-06-21 11:58:13 +03:00
Michael Livshin
1e7360ef6d checksum_utils_test: supply valid input to crc32_combine()
If the len2 argument to crc32_combine() is zero, then the crc2
argument must also be zero.

fast_crc32_combine() explicitly checks for len2==0, in which case it
ignores crc2 (which is the same as if it were zero).

zlib's crc32_combine() used to have that check prior to version
1.2.12, but then lost it, making its necessary for callers to be more
careful.

Also add the len2==0 check to the dummy fast_crc32_combine()
implementation, because it delegates to zlib's.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #10731
2022-06-21 11:58:13 +03:00
Botond Dénes
5b50725c45 Merge 'partition_snapshot_row_cursor: avoid unnecessary row cloning in row()' from Michał Chojnowski
Due to implementation details, all `deletable_row`s used in `row()` are copied twice, even though the only need to be copied/applied once.
This is unnecessary work.

`perf_simple_query_g --enable-cache=1 --flush --smp 1 --duration 30`
Before:
median 158516.17 tps ( 64.1 allocs/op,  12.1 tasks/op,   45010 insns/op)
After:
median 164307.76 tps ( 62.1 allocs/op,  12.1 tasks/op,   43220 insns/op)

Closes #10509

* github.com:scylladb/scylla:
  partition_snapshot_row_cursor: construct the clustering_row directly in row()
  mutation_fragment: add a "from deletable_row" constructor to clustering_row
  mutation_fragment: pass the applied row by reference in clustering_row::apply()
2022-06-21 11:58:13 +03:00
Botond Dénes
121900e377 Merge "Sanitize compaction manager construction and stopping" from Pavel Emelyanov
"
In order to wire-in the compaction_throughput_mb_per_sec the compaction
creation and stopping will need to be patched. Right now both places are
quite hairy, this set coroutinizes stop() for simpler adding of stopping
bits, unifies all the compaction manager constructors and adds the
compaction_manager::config for simpler future extending.

As a side effect the backlog_controller class gets an "abstract" sched
group it controlls which in turn will facilitate seastar sched groups
unification some day.
"

* 'br-compaction-manager-start-stop-cleanup' of https://github.com/xemul/scylla:
  compaction_manager: Introduce compaction_manager::config
  backlog_controller: Generalize scheduling groups
  database: Keep compound flushing sched group
  compaction_manager: Swap groups and controller
  compaction_manager: Keep compaction_sg on board
  compaction_manager: Unify scheduling_group structures
  compaction_manager: Merge static/dynamic constructors
  compaction_manager: Coroutinuze really_do_stop()
  compaction_manager: Shuffle really_do_stop()
  compaction_manager: Remove try-catch around logger
2022-06-21 11:58:13 +03:00
Raphael S. Carvalho
aa667e590e sstable_set: Fix partitioned_sstable_set constructor
The sstable set param isn't being used anywhere, and it's also buggy
as sstable run list isn't being updated accordingly. so it could happen
that set contains sstables but run list is empty, introducing
inconsistency.

we're fortunate that the bug wasn't activated as it would've been
a hard one to catch. found this while auditting the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>
2022-06-21 11:58:13 +03:00
Benny Halevy
59acc58920 test: error_injection: test_inject_noop: do no rely on wall clock timing
This unit test may fail in debug mode
since .then may yield, even if inject returns
a ready future, so the wall clock timing has no basis.

As seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/932/testReport/junit/boost.error_injection_test.debug/test_boost_error_injection_test/test_inject_noop/
```
[Exception] - critical check wait_time.count() < sleep_msec.count() has failed [47 >= 10]
 == [File] - test/boost/error_injection_test.cc
 == [Line] -45
 ```

Instead, just verify that inject returns
a successful, ready future, when using
a non-existing errr injection name.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10842
2022-06-21 11:58:13 +03:00
Benny Halevy
87ee6c3722 types: time_point_to_string: harden against out of range timestamps
The time point is multiplied by an adjustment factor of 1000
for boost::posix_time::time_duration::ticks_per_second() = 1000000
when calling boost::posix_time::milliseconds(count).
That may lead to integer overflow as reported by the
UndefinedBehaviorSanitizer.

See https://github.com/scylladb/scylla/issues/10830#issuecomment-1158899187

This change uses gmtime_r to convert seconds since unix epoch
to std::tm and the fmt library to format the iso representation
of the time_point to avoid exceptions and undefined behavior.

gmtime_r may still detect an overflow "when the year does not fit into
an integer" (see ctime(3)).  In this case we return a backward
compatible representation of "{count} milliseconds (out of range)".

Refs #10830

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-21 08:08:57 +03:00
Michael Livshin
ab13127761 sstables: use generation_type more soundly
`generation_type` is (supposed to be) conceptually different from
`int64_t` (even if physically they are the same), but at present
Scylla code still largely treats them interchangeably.

In addition to using `generation_type` in more places, we
provide (no-op) `generation_value()` and `generation_from_value()`
operations to make the smoke-and-mirrors more believable.

The churn is considerable, but all mechanical.  To avoid even
more (way, way more) churn, unit test code is left untreated for
now, except where it uses the affected core APIs directly.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-20 19:37:31 +03:00
Michael Livshin
ef06f92631 extremum_tracker: do not require default-constructible value types
The requirement is just an unintended artifact of the implementation.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-20 19:37:31 +03:00
Calle Wilund
8b49718203 commitlog: Add (internal) measurement of byte rates add/release/flush-req
Adds measuring the apparent delta vector of footprint added/removed within
the timer time slice, and potentially include this (if influx is greater
than data removed) in threshold calculation. The idea is to anticipate
crossing usage threshold within a time slice, so request a flush slightly
earlier, hoping this will give all involved more time to do their disk
work.

Obviously, this is very akin to just adjusting the threshold downwards,
but the slight difference is that we take actual transaction rate vs.
segment free rate into account, not just static footprint.

Note: this is a very simplistic version of this anticipation scheme,
we just use the "raw" delta for the timer slice.
A more sophisiticated approach would perhaps do either a lowpass
filtered rate (adjust over longer time), or a regression or whatnot.
But again, the default period of 10s is something of an eternity,
so maybe that is superfluous...
2022-06-20 15:58:36 +00:00
Calle Wilund
6921210bf5 commitlog: Add counters for # bytes released/flush requested
Adds "bytes_released" and "bytes_flush_requested", representing
total bytes released from disk as a result of segment release
(as allocation bytes + overhead - not counting unused "waste"),
resp. total size we've requested flush callbacks to release data,
also counted as actual used bytes in segments we request be made
released.

These counters, together with bytes_written, should in ideal use
cases be at an equilibrium (actually equal), thus observing them
should give an idea on whether we are imbalanced in managing to
release bytes in same rate as they are allocated (i.e. transaction
rate).
2022-06-20 15:58:36 +00:00
Calle Wilund
336383c87e commitlog: Keep track of last flush high position to avoid double request
Apparent mismerge or something. We already have an unused "_flush_position",
intended to keep track of the last requested high rp.
Now actually update and use it. The latter to avoid sending requests for
segments/cf id:s we've already requested external flush of. Also enables
us to ensure we don't do double bookkeep here.
2022-06-20 15:58:26 +00:00
Calle Wilund
c904b3cf35 commitlog: Fix counter descriptor language
Remove superfluous "a"
2022-06-20 15:54:20 +00:00
Takuya ASADA
13caac7ae6 install.sh: install files with correct permission in strict umask setting
To avoid failing to run scripts in non-root user, we need to set
permission explicitly on executables.

Fixes #10752

Closes #10840
2022-06-20 17:52:03 +03:00
Avi Kivity
a8507a6d28 Merge 'docs/contribute/maintainer.md: expand with merging guidelines' from Botond Dénes
The current maintainer.md lacks any guidelines on what patches to accept/reject. Instead maintainers are expected to observe the unwritten rules as exercised by more senior maintainers, as well as use their own judgement or ask when in doubt. This has worked well as maintainers are all people who either worked at the company for a long time and hence had time to observe how things work, and/or have previous experience maintaining open-source projects. Nevertheless, many times I have wished we had a guideline I could glance at to make sure I considered all the angles and to make sure I did not forget some important unwritten rule.
This series attempts to concisely summarize these unwritten rules in the form of a checklist, without attempting to cover all exceptions and corner-cases. This should already be enough for a maintainer-in-doubt to be able to quickly go over the checklist and see if they forgot to check anything (especially when evaluating backports).

/cc @scylladb/scylla-maint

Closes #10806

* github.com:scylladb/scylla:
  docs/contribute/maintainer.md: add merging and backporting guidelines
  docs/contribute/CONTRIBUTING.md: add reference to review checklist:
  docs/contribute/review-checklist.md: add section about patch organization
  docs/contribute/maintainer.md: expand section on git submodule sync
2022-06-20 17:20:52 +03:00
Botond Dénes
c3e7c1cf59 tools/schema_loader: load_schemas(): add note about CDC table names
Explaining how the code determines what tables are CDC tables when
parsing schema statements.

Closes #10788
2022-06-20 17:16:33 +03:00
Michał Chojnowski
5570354f44 partition_snapshot_row_cursor: construct the clustering_row directly in row()
Currently row() creates an empty clustering_row, then applies deletable_rows
from the cursor to the empty clustering_row.
But the apply logic is unnecessary for the first apply(), and it's cheaper
to simply copy the row.
2022-06-20 15:45:19 +02:00
Michał Chojnowski
52c963b331 mutation_fragment: add a "from deletable_row" constructor to clustering_row
Currently, construction of clustering_row from deletable_row is done by
applying the deletable_row to an empty clustering_row.
Direct construction is a slightly cheaper alternative.
2022-06-20 15:45:19 +02:00
Michał Chojnowski
a061eb9e76 mutation_fragment: pass the applied row by reference in clustering_row::apply()
Currently, clustering_row::apply() takes deletable_row by reference, but
copies it before passing it to deletable_row::apply(). This is more expensive
than passing the reference down (by about 1800 instructions for
perf_simple_query rows).
2022-06-20 15:22:17 +02:00
Avi Kivity
f8d84e3aaf Update tools/java submodule (sync to Cassandra 3.11.3, deps update)
* tools/java d4133b54c9...de8289690e (1):
  > Merge 'Sync with Cassandra 3.11.13 and update a few dependencies (v2)' from Piotr Grabowski
2022-06-20 13:25:17 +03:00
Asias He
72797bf516 token_metadata: Shortcut zero leaving nodes case in calculate_pending_ranges_for_leaving
If there are zero leaving nodes, no need to calculate anything. This
saves time for calculating pending ranges in large clusters
significantly to avoid unnecessary calculation.

Refs #10337

Closes #10822
2022-06-20 13:19:58 +03:00
Piotr Sarna
bbbd1f4edd Merge 'alternator: make BatchGetItem group reads by partition'
from Nadav Har'El

This small series improves Alternator's BatchGetItem performance by
grouping requests to the same partition together (Fixes #10753) and also
improves error checking when the same item is requested more than once
(Fixes #10757).

Closes #10834

* github.com:scylladb/scylla:
  alternator: make BatchGetItem group reads by partition
  test/alternator: additional test for BatchGetItem
2022-06-20 10:07:19 +02:00
Raphael S. Carvalho
f15a6ce41a tests: Introduce optional RNG seed for boost suite
Today, if you want to reproduce a rare condition using the same RNG seed
reported, you cannot use test.py which provides useful infrastructure
and will have to run the tests manually instead.
So let's extend test.py to allow optional forwarding of RNG seed to
boost tests only, as other suites don't support the seed option.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220615223657.142110-1-raphaelsc@scylladb.com>
2022-06-20 07:19:08 +03:00
Nadav Har'El
3aca1ca572 alternator: make BatchGetItem group reads by partition
DynamoDB API's BatchGetItem invokes a number (up to 25) of read requests
in parallel, returning when all results are available. Alternator naively
implemented this by sending all read requests in parallel, no matter which
requests these were.

That implementation was inefficient when all the requests are to different
items (clustering rows) of the same partition. In a multi-node setup this
will end up sending 25 separate requests to the same remote node(s). Even
on a single-node setup, this may result in reading from disk more than
once, and even if the partition is cached - doing an O(logN) search in
each multiple times.

What we do in this patch, instead, is to group all the BatchGetItem
requests that aimed at the same partition into a single read request
asking for a (sorted) list of clustering keys. This is similar to an
"IN" request in CQL.

As an example of the performance benefit of this patch, I tried a
BatchGetItem request asking for 20 random items from a 10-million item
partition. I measured the latency of this request on a single-node
Scylla. Before this patch, I saw a latency of 17-21 ms (the lower number
is when the request is retried and the requested items are already in
the cache). After this patch, the latency is 10-14 ms. The performance
improvement on multi-node clusters are expected to be even higher.

Unfortunately the patch is less trivial than I hoped it would be,
because some of the old code was organized under the assumption that
each read request only returned one item (and if it failed, it means
only one item failed), so this part of the code had to be reorganized
(and, for making the code more readable, coroutinized).

An unintended benefit of the code reorganization is that it also gave
me an opportunity to fail an attempt to ask BatchGetItem the same
item more than once (issue #10757).

The patch also adds a few more corner cases in the tests, to be even
more sure that the code reorganization doesn't introduce a regression
in BatchGetItem.

Fixes #10753
Fixes #10757

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-19 14:47:57 +03:00
Pavel Emelyanov
85263b2d02 trace-state: Remove unused fields
... and one friendship declaration

tests: compilation

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220616094224.30676-1-xemul@scylladb.com>
2022-06-17 15:02:51 +03:00
Geoffrey Beausire
ee9841b138 Ensure gossip is enabled on all shards before starting the failure_detector_loop
Before it was possible for a race condition to happen where the failure_detector_loop is started before the gossiper._enabled is set to true on every shard.
This change ensure that _enabled is set to true before moving forward

Closes #10548
2022-06-17 14:10:45 +03:00
Avi Kivity
4d587e0c3d cql3: raw_value: deduplicate view() and to_view()
Commit e739f2b779 ("cql3: expr: make evaluate() return a
cql3::raw_value rather than an expr::constant") introduced
raw_value::view() as a synonym to raw_value::to_view() to reduce
churn. To fix this duplication, we now remove raw_value::to_view().

raw_value::to_view() was picked for removal because is has fewer
call sites, reducing churn again.

Closes #10819
2022-06-17 09:32:58 +02:00
Avi Kivity
19a6e69001 cql3: accept and type-check reused named bind variables
A named bind-variable can be reused:

    SELECT * FROM tab
    WHERE a = :var AND b = :var

Currently, the grammar just ignores the possibility and creates
a new variable with the same name. The new variable cannot be
referenced by name since the first one shadows it.

Catch variable reuse by maintaining a map from bind variable names
to indexed, and check that when reusing a bind variable the types
match.

A unit test is added.

Fixes #10810

Closes #10813
2022-06-17 09:09:49 +02:00
Konstantin Osipov
670b2562a1 lwt: Cassandrda compatibility when incarnating a row for UPDATE
When evaluating an LWT condition involving both static and non-static
cells, and matching no regular row, the static row must be used UNLESS
the IF condition is IF EXISTS/IF NOT EXISTS, in which case special rules
apply.

Before this fix, Scylla used to assume a row doesn't exist if there is
no matching primary key. In Cassandra, if there is a
non-empty static row in the partition, a regular row based
on the static row' cell values is created in this case, and then this
row is used to evaluate the condition.

This problem was reported as gh-10081.

The reason for Scylla behaviour before the patch was that when
implementing LWT I tried to converge Cassandra data model (or lack of
thereof) with a relational data model, and assumed a static row is a
"shared" portion of a regular row, i.e. a storage level concept intended
to save space, and doesn't have independent existence.
This was an oversimplification.

This patch fixes gh-10081, making Scylla semantics match the one of
Cassandra.

I will now list other known examples when a static row has an own
independent existence as part of a table, for cataloguing purposes.

SELECT * from a partition which has a partition key
and a static cell set returns 1 row. If later a regular row is added
to the partition, the SELECT would still return 1 row, i.e.
the static row will disappear, and a regular row will appear instead.

Another example showing a static row has an independent existence below:

CREATE TABLE t (p int, c int, s int static, PRIMARY KEY(p, c));
INSERT INTO t (p, c) VALUES(1, 1);
INSERT INTO t (p, s) VALUES(1, 1) IF NOT EXISTS;

In Cassandra (and Scylla), IF NOT EXISTS evaluates to TRUE, even though both
the regular row and the partition exist. But the static cells are not
set, and the insert only provides a partition key, so the database assumes the
insert is operating against a static row.

It would be wrong to assume that a static row exists when the partition
key exists:
INSERT INTO t (p, c, s) VALUES(1, 1, 1) IF NOT EXISTS;

 [applied] | p | c | s
 -----------+---+---+------
      False | 1 | 1 | null

evaluates to False, i.e. the regular row does exist when p and c exist.

Issue

CREATE TABLE t (p INT, c INT, r INT, s INT static, PRIMARY KEY(p, c))
INSERT INTO t (p, s) VALUES (1, 1);
UPDATE t SET s=2, r=1 WHERE p=1 AND c=1 IF s=1 and r=null;
- in this case, even though the regular row doesn't exist, the static
row does, and should be used for condition evaluation.

In other words, IF EXISTS/IF NOT EXISTS have contextual semantics.
They apply to the regular row if clustering key is used in the WHERE
clause, otherwise they apply to static row.

One analogy for static rows is that it is like a static member of C++ or
Java class. It's an attribute of the class (assuming class = partition),
which is accessible through every object of the class (object = regular
row). It is also present if there are no objects of the class, but the
class itself exists: i.e. a partition could have no regular rows, but
some static cells set, in this case it has a static row.

*Unlike C++/Java static class members* a static row is an optional
attribute of the partition. A partition may exist, but the static row
may be absent (e.g. no static cell is set). If the static row does exist,
all regular rows share its contents, *even if they do not exist*.
A regular row exists when its clustering key is present
in the table. A static row exists when at least one static cell is set.

Tests are updated because now when no matching row is found
for the update we show the value of the static row as the previous
value, instead of a non-matching clustering row.

Changes in v2:
- reworded the commit message
- added select tests

Closes #10711
2022-06-16 19:23:46 +03:00
Nadav Har'El
0be06e0bdf test/alternator: additional test for BatchGetItem
Our simple test for BatchGetItem on a table with sort keys still has
requests with just one sort key per partition, so if BatchGetItem has
a bug with requesting multiple sort keys from the same partition,
such bug won't be caught by the simple tests. So in this test we add a
test that does. This will be useful for the next patch, we are planning
to refactor BatchGetItem's handling of multiple sort keys in the same
partition - so it will be useful to have more regression tests.

The tests test_batch_get_item_large and test_batch_get_item_partial
would actually also catch such bugs, but they are more elaborate tests
and it's nice to have smaller tests more focused on checking specific
features.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-16 18:19:20 +03:00
Pavel Emelyanov
0c8abca75e compaction_manager: Introduce compaction_manager::config
This is to make it constructible in a way most other services are -- all
the "scalar" parameters are passed via a config.

With this it will be much shorter to add compaction bandwidth throttling
option by just extending the config itself, not the list of constructor
arguments (and all its callers).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
997a34bf8c backlog_controller: Generalize scheduling groups
Make struct scheduling_group be sub-class of the backlog controller. Its
new meaning is now -- the group under controller maintenance. Both
database and compaction manager derive their sched groups from this one.

This makes backlog controller construction simpler, prepares the ground
for sched groups unification in seastar and facilitates next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
12b2d6400d database: Keep compound flushing sched group
Similar to previous patch that made the same for compaction manager. The
newly introduced private scheduling_group class is temporary and will go
away in next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
0fef2e0273 compaction_manager: Swap groups and controller
To have groups initialized before controller. Makes next patch shorter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
fbb59fc920 compaction_manager: Keep compaction_sg on board
This is mainly to make next patch simpler. Also this makes the backlog
controller API smaller by removing its sg() method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
0662036d27 compaction_manager: Unify scheduling_group structures
There are two of them with identical content and meaning

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
41f1044d3c compaction_manager: Merge static/dynamic constructors
The only difference between those two are in the way backlog controller
is created. It's much simpler to have the controller construction logic
in compaction manager instead. Similar "trick" is used to construct
flush controller for the database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
2dbf0b5248 compaction_manager: Coroutinuze really_do_stop()
This way it's more compact and easier to extend.
Also it's small enough to fix indentation right at once.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
bbd9fc26cd compaction_manager: Shuffle really_do_stop()
Make it the future-returning method and setup the _stop_future in its
only caller. Makes next patch much simpler

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
b19b8c9e5b compaction_manager: Remove try-catch around logger
Logging functions are all noexcept already

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Israel Fruchter
d2ca2455db scripts/scylla_util.py: introduce back user/group arguments for out()
since #10467 remove the user/group parameters needed for the housekeeping
call, need to introuce them back

Fixes: #10804

Closes #10818
2022-06-16 13:50:17 +03:00
Petr Gusev
d606966597 cql3::column_condition.cc: fix _in_marker handling The commit scylladb@5dee55d introduced a regression: type of in_list_receiver was taken from receiver instead of value_spec as it was before. This regression was caught by dtest test_lwt_update_prepared_listlike_and_tuples. This commit reverts to original behavior and adds a specific boost-test for this scenario.
Fixes: #10821

Closes #10812
2022-06-16 10:57:12 +03:00
Botond Dénes
1718ed9e9f docs/contribute/maintainer.md: add merging and backporting guidelines 2022-06-16 10:29:26 +03:00
Botond Dénes
e9c9ca4a8a docs/contribute/CONTRIBUTING.md: add reference to review checklist:
It serves as a good resource for aspiring contributors to see what
reviewers will be looking for in submitted patches.
2022-06-16 10:29:26 +03:00
Botond Dénes
4542486f23 docs/contribute/review-checklist.md: add section about patch organization 2022-06-16 10:29:26 +03:00
Botond Dénes
25f4ad1543 docs/contribute/maintainer.md: expand section on git submodule sync
git submodule sync is only dangerous if one is using certain workflows.
Explain what the danger are and when it is safe to use.
2022-06-16 10:29:21 +03:00
Botond Dénes
2734c70803 Merge 'Batchlog replay for decommission fix and cleanup' from Asias He
This patch set
- adds log before and after batch log replay
- removes a duplicated call to trigger batch log replay
- removes obsoletes log

Closes #10800

* github.com:scylladb/scylla:
  storage_service: Remove obsolete log
  storage_service: Do not call do_batch_log_replay again in unbootstrap
  storage_service: Add log for start and stop of batchlog replay
2022-06-16 08:55:05 +03:00
Botond Dénes
0b80b5850f Merge 'allow view snapshots when automatic' from Michael Livshin
A pre-scrub view snapshot cannot be attributed to user error, so no
call to bail out.

Closes #10760.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #10783

* github.com:scylladb/scylla:
  api-doc: correct spelling
  allow pre-scrub snapshots of materialized views and secondary indices
2022-06-16 08:47:33 +03:00
Botond Dénes
4bd4aa2e88 Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.

Closes #10807

* github.com:scylladb/scylla:
  memtable: Add counters for tombstone compaction
  memtable, cache: Eagerly compact data with tombstones
  memtable: Subtract from flushed memory when cleaning
  mvcc: Introduce apply_resume to hold state for partition version merging
  test: mutation: Compare against compacted mutations
  compacting_reader: Drop irrelevant tombstones
  mutation_partition: Extract deletable_row::compact_and_expire()
  mvcc: Apply mutations in memtable with preemption enabled
  test: memtable: Make failed_flush_prevents_writes() immune to background merging
2022-06-15 18:12:42 +03:00
Nadav Har'El
665e8c1a23 test/cql-ptest: add tests for collection indexing
This patch adds an extensive array of tests for the Cassandra feature
that Scylla hasn't implemented yet (issues #2962, #8745, #10707) of
indexing the keys, values or entries of a collection column.

The goal of these tests is to explicitly exercise every corner case
I could think of by looking at the documentation of this feature and
considering its possible implementation - and as usual, making sure
that the tests actually pass on Cassandra.

These tests overlap some of the existing unit tests that we translated
from Cassandra, as well as some randomized tests that do not necessarily
cover the same edge cases as these tests cover.

All tests added in this patch pass on Cassandra, but currently fail
on Scylla due to the above issues.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10771
2022-06-15 16:10:36 +02:00
Michael Livshin
43f2c55c5d configure.py: speed up and simplify compdb generation
The most time-consuming part is invoking "ninja -t compdb", and there
is no need to repeat that for every mode.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #10733
2022-06-15 16:40:52 +03:00
Botond Dénes
6242f3fef8 Merge 'process_sstables_dir: close directory_lister on error' from Benny Halevy
`sstable_directory::process_sstables_dir` may hit an exception when calling `handle_component`.
In this case we currently destroy the `sstable_dir_lister` variable without closing the `directory_lister` first -
leading to terminate in `~directory_lister` as seen in #10697.

This mini-series handles this exception and always closes the `directory_lister`.

Add unit test to reproduce this issue.

Fixes #10697

Closes #10754

* github.com:scylladb/scylla:
  sstable_directory: process_sstable_dir: fixup indentation
  sstable_directory: process_sstable_dir: close directory_lister on error
2022-06-15 16:40:30 +03:00
Tomasz Grabiec
3bec1cc19f test: memtable: Make failed_flush_prevents_writes() immune to background merging
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.

Fix by triggering soft pressure on retries.

Fixes #10801
Refs #10793

(cherry picked from commit 0e78ad50ea)

Closes #10802
2022-06-15 14:33:19 +02:00
Mikołaj Sielużycki
db5b05948b compaction: Clarify comment.
Closes #10799
2022-06-15 15:09:44 +03:00
Avi Kivity
aa8f135f64 Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.

In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
  drops the keyspace and shuts down before compaction finishes. Dropping
  keyspace drops tables, and each dropped table is smp::count writes to
  system.local table with flush after write, which creates tens of
  thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
  during repair resulted in excessive flushing of small sstables which
  compaction couldn't keep up with.

In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).

Fixes https://github.com/scylladb/scylla/issues/4116

Changelog:
v5:
- added a nicer way of timing the stalls caused by waiting for flush
- added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups
- added comment why we trigger compaction before waiting for sstable count reduction
- removed unnecessary cv.signal from table::stop

v4:
- removed conversion of table::stop to coroutines. It's an orthogonal change and doesn't need to go into this patchset

v3:
- removed unnecessary change to scheduling groups from v2
- moved sstables_changed signalling to suggested place in table::stop
- added log how long the table flush was blocked for
- changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <=

v2:
- Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well.
- Converted table::stop to coroutines.
- Reordered commits so that test is committed after fix, so it doesn't trip up bisection.

Closes #10717

* github.com:scylladb/scylla:
  table: Add test where compaction doesn't keep up with flush rate.
  random_mutation_generator: Add option to specify ks_name and cf_name
  table: Prevent creating unbounded number of sstables
2022-06-15 14:51:08 +03:00
Benny Halevy
89c5e8413f sstable_directory: process_sstable_dir: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-15 13:56:11 +03:00
Benny Halevy
6cafd83e1c sstable_directory: process_sstable_dir: close directory_lister on error
Otherwise, if we don't consume all lister's entries,
~directory_lister terminates since the
directory_lister is destroyed without being closed.

Add unit test to reproduce this issue.

Fixes #10697

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-15 13:56:10 +03:00
Tomasz Grabiec
169025d9b4 memtable: Add counters for tombstone compaction 2022-06-15 11:30:25 +02:00
Tomasz Grabiec
94f9109bea memtable, cache: Eagerly compact data with tombstones
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.
2022-06-15 11:30:25 +02:00
Tomasz Grabiec
53026f3ba6 memtable: Subtract from flushed memory when cleaning
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.

This will start happening after tombstones are compacted with rows on
partition version merging.

This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
2022-06-15 11:30:25 +02:00
Tomasz Grabiec
cd523214a2 mvcc: Introduce apply_resume to hold state for partition version merging
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purley within versions. Consider applying a partition tombstone over
large number of rows.

This patch introduces apply_rows object to hold the necessary state to
make sure forward progress in case of preemption.

No change in behavior yet.
2022-06-15 11:30:01 +02:00
Tomasz Grabiec
02c92d5ea2 test: mutation: Compare against compacted mutations
Memtables and cache will compact eagerly, so tests should not expect
readers to produce exact mutations written, only those which are
equivalant after applying copmaction.
2022-06-15 11:30:01 +02:00
Tomasz Grabiec
570b76bc5b compacting_reader: Drop irrelevant tombstones
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.

Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
2022-06-15 11:30:01 +02:00
Tomasz Grabiec
44bb9d495b mutation_partition: Extract deletable_row::compact_and_expire() 2022-06-15 11:30:01 +02:00
Tomasz Grabiec
a4e96960b8 mvcc: Apply mutations in memtable with preemption enabled
Preerequisite for eagerly applying tombstones, which we want to be
preemptible. Before the patch, apply path to the memtable was not
preemptible.

Because merging can now be defered, we need to involve snapshots to
kick-off background merging in case of preemption. This requires us to
propagate region and cleaner objects, in order to create a snapshot.
2022-06-15 11:29:43 +02:00
Tomasz Grabiec
c682521ac7 test: memtable: Make failed_flush_prevents_writes() immune to background merging
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.

Fix by triggering soft pressure on retries.
2022-06-15 11:29:43 +02:00
Mikołaj Sielużycki
25407a7e41 table: Add test where compaction doesn't keep up with flush rate.
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
2022-06-15 10:57:28 +02:00
Mikołaj Sielużycki
b5684aa96d random_mutation_generator: Add option to specify ks_name and cf_name 2022-06-15 10:57:28 +02:00
Mikołaj Sielużycki
4cd42f97d0 table: Prevent creating unbounded number of sstables
If we reach a situation where flush rate exceeds compaction rate, we may
end up with arbitrarily large number of sstables on disk. If a read is
executed in such case, the amount of memory required is proportional to
the number of sstables for the given shard, which in extreme cases can
lead to OOM.

In the wild, this was observed in 2 scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
  drops the keyspace and shuts down before compaction finishes. Dropping
  keyspace drops tables, and each dropped table is smp::count writes to
  system.local table with flush after write, which creates tens of
  thousands of sstables. Bootstrap read from system.local will run OOM.
- A failure to agree on table schema (due to a code bug) between nodes
  during repair resulted in excessive flushing of small sstables which
  compaction couldn't keep up with.

In the unit test introduced in this patch series it can be proved that
even hard setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast producer, slow consumer problem, the
remaining solution is to block producer until the consumer catches up.
If there are too many table runs originating from memtable, we block the
current flush until the number of sstables is reduced (via ongoing
compaction or a truncate operation).
2022-06-15 10:57:28 +02:00
Pavel Emelyanov
9a88bc260c Merge 'various group0 start/stop issues' from Gleb
The series fixes a couple of crashes that were found during starting and
stopping Scylla with raft while doing ddl operations. Most of them
related to shutdown order between different components.

Also in scylla-dev gleb/group0-fixes-v1

CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/

* origin-dev/gleb/group0-fixes-v1:
  migration manager: remove unused code
  db/system_distributed_keyspace: do not announce empty schema
  main: stop raft before the migration manager
  storage_service: do not pass the raft group manager to storage_service constructor
  main: destroy the group0_client after stopping the group0
2022-06-15 11:44:03 +03:00
Michael Livshin
28d44ce6db api-doc: correct spelling
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-15 11:30:58 +03:00
Michael Livshin
aab4cd850c allow pre-scrub snapshots of materialized views and secondary indices
Previously, any attempt to take a materialized view or secondary index
snapshot was considered a mistake and caused the snapshot operation to
abort, with a suggestion to snapshot the base table instead.

But an automatic pre-scrub snapshot of a view cannot be attributed to
user error, so the operation should not be aborted in that case.

(It is an open question whether the more correct thing to do during
pre-scrub snapshot would be to silently ignore views.  Or perhaps they
should be ignored in all cases except when the user explicitly asks to
snapshot them, by name)

Closes #10760.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-15 11:30:58 +03:00
Avi Kivity
e739f2b779 cql3: expr: make evaluate() return a cql3::raw_value rather than an expr::constant
An expr::constant is an expression that happens to represent a constant,
so it's too heavyweight to be used for evaluation. Right now the extra
weight is just a type (which causes extra work by having to maintain
the shared_ptr reference count), but it will grow in the future to include
source location (for error reporting) and maybe other things.

Prior to e9b6171b5 ("Merge 'cql3: expr: unify left-hand-side and
right-hand-side of binary_operator prepares' from Avi Kivity"), we had
to use expr::constant since there was not enough type infomation in
expressions. But now every expression carries its type (in programming
language terms, expressions are now statically typed), so carrying types
in values is not needed.

So change evaluate() to return cql3::raw_value. The majority of the
patch just changes that. The rest deals with some fallout:

 - cql3::raw_value gains a view() helper to convert to a raw_value_view,
   and is_null_or_unset() to match with expr::constant and reduce further
   churn.
 - some helpers that worked on expr::constant and now receive a
   raw_value now need the type passed via an additional argument. The
   type is computed from the expression by the caller.
 - many type checks during expression evaluation were dropped. This is
   a consequence of static typing - we must trust the expression prepare
   phase to perform full type checking since values no longer carry type
   information.

Closes #10797
2022-06-15 08:47:24 +02:00
Avi Kivity
398a86698d Update tools/python3 submodule (/usr/lib/sysimage filtering)
* tools/python3 f725ec7...3471634 (1):
  > create-relocatable-package.py: filter out /usr/lib/sysimage
2022-06-15 09:27:06 +03:00
Avi Kivity
8f690fdd47 Update seastar submodule
* seastar 1424d34c93...443e6a9b77 (5):
  > reactor: re-raise fatal signals
Ref #9242
  > test: initialize _earliest_started and _latest_finished
  > reactor: add io_uring backend
  > semaphore: add semaphore_unit operator bool
  > Merge 'map reduce: save mapper' from Benny Halevy

io_uring is disabled since the frozen toolchain's liburing it too old.

Closes #10794
2022-06-15 08:36:08 +03:00
Asias He
6f4bfea994 storage_service: Remove obsolete log
In unbootstrap, we do not really stream hints here. Remove the log about it.
2022-06-15 08:28:06 +08:00
Avi Kivity
5129280f45 Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec"
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.

Fixes #10780
Reopens #652.
2022-06-14 18:06:22 +03:00
Benny Halevy
5bd2e0ccce test: memtable_test: failed_flush_prevents_writes: validate flush using min_memtable_timestamp
active_memtable().empty() becomes true once seal_active_memtable
succeeds with _memtables->add_memtable(), not when it is able
to flush the (once active) memtable.

In contrast, min_memtable_timestamp() returns api::max_timestamp
only if there is no data in any memtable.

Fixes #10793

Backport notes:
- Introduced in f6d9d6175f (currently in
branch-5.0)
- backport requires also 0e78ad50ea

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10798
2022-06-14 16:13:35 +03:00
Piotr Sarna
61ae0a46e3 Merge 'Three small fixes to Alternator's handling of GSIs...
and LSIs' from Nadav Har'El

This series includes three small fixes (and of course, tests) for
various edge cases of GSI and LSI handling in Alternator:

1. We add the IndexArn that were missing in DescribeTable for indexes
   (GSI and LSI)
2. We forbid the same name to be used for both GSI and LSI (allowing it
   was a bug, not a feature)
3. We improve the error handling when trying to tag a GSI or LSI, which
   is not currently allowed (it's also not allowed in DynamoDB).

Closes #10791

* github.com:scylladb/scylla:
  alternator: improve error handling when trying to tag a GSI or LSI
  alternator: forbid duplicate index (LSI and GSI) names
  alternator: add ARN for indexes (LSI and GSI)
2022-06-14 07:39:44 +02:00
Nadav Har'El
c0a09669c1 Merge 'cql3: expr: unify binary operator left-hand-side and right-hand-side evaluation' from Avi Kivity
The left-hand-side of a binary_operator is currently evaluated via
a get_value() function that receives the row values. On the other hand,
the right hand side is evaluated via evaluate(), which receives query_options
in order to resolve bind variables.

This series unifies the two paths into evaluate(), and standardizes the different
inputs into a new evaluation_inputs struct. The old hacks column_value_eval_bag
and column_maybe_subscripted are removed.

Closes #10782

* github.com:scylladb/scylla:
  cql3: expr: drop column_maybe_subscripted
  cql3: expr: possible_lhs_values(): open-code get_value_comparator()
  cql3: expr: rationalize lhs/rhs argument order
  cql3: expr: don't rely on grammar when comparing tuples
  cql3: expr: wire column_value and subscript to evaluate()
  cql3: get_value(subscript): remove gratuitous pointer
  cql3: expr: reindent get_value(subscript)
  cql3: expr: extract get_value(subscript) from get_value(column_maybe_subscripted)
  cql3: raw_value: add missing conversion from managed_bytes_opt&&
  cql3: prepare_expr: prepare subscript type
  cql3: expr: drop internal 'column_value_eval_bag'
  cql3: expr: change evalute() to accept evaluation_inputs
  cql3: expr: make evaluate(<expression subtype>) static
  cql3: expr: push is_satisfied_by regular and static column extraction to callers
  cql3: expr: convert is_satisfied_by() signature to evaluation_inputs
  cql3: expr: introduce evaluation_inputs
2022-06-13 23:07:57 +03:00
Nadav Har'El
e20233dab1 alternator: improve error handling when trying to tag a GSI or LSI
In issue #10786, we raised the idea of maybe allowing to tag (with
TagResource) GSIs and LSIs, not just base tables. However, currently,
neither DynamoDB nor Syclla allows it. So in this patch we add a
test that confirms this. And while at it, we fix Alternator to
return the same error message as DynamoDB in this case.

Refs #10786.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-13 18:14:42 +03:00
Nadav Har'El
8866c326de alternator: forbid duplicate index (LSI and GSI) names
Adding an LSI and GSI with the same name to the same Alternator table
should be forbidden - because if both exists only one of them (the GSI)
would actually be usable. DynamoDB also forbids such duplicate name.

So in this patch we add a test for this issue, and fix it.

Since the patch involves a few more uses of the IndexName string,
we also clean up its handling a bit, to use std::string_view instead
of the old-style std::string&.

Fixes #10789

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-13 18:14:42 +03:00
Nadav Har'El
00866a75d8 alternator: add ARN for indexes (LSI and GSI)
DynamoDB gives an ARN ("Amazon Resource Name") to LSIs and GSIs. These
look like BASEARN/index/INDEXNAME, where BASEARN is the ARN of the base
table, and INDEXNAME is the name of the LSI or the GSI.

These ARNs should be returned by DescribeTable as part of its
description of each index, and this patch adds that missing IndexArn
field.

The ARN we're adding here is hardly useful (e.g., as explained in
issue #10786, it can't be used to add tags to the index table),
but nevertheless should exist for compatibility with DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-13 18:14:42 +03:00
Benny Halevy
8f39547d89 compaction_manager: task: convert semaphore_aborted to compaction_stopped exception
Fixes #10666

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10686
2022-06-13 16:20:39 +03:00
Botond Dénes
b820aad3e0 Merge 'test/cql-pytest: skip another test on older, buggy, drivers' from Nadav Har'El
Older versions of the Python Cassandra driver had a bug where a single empty
page aborts a scan.

The test test_secondary_index.py::test_filter_and_limit uses filtering and deliberately
tiny pages, so it turns out that some of them are  empty, so the test breaks on buggy
versions of the driver, which cause the test to fail when run by developers who happen
to have old versions of the driver.

So in this small series we skip this test when running on a buggy version of the driver.

Fixes #10763

Closes #10766

* github.com:scylladb/scylla:
  test/cql-pytest: skip another test on older, buggy, drivers
  test/cql-pytest: de-duplicate code checking for an old buggy driver
2022-06-13 16:06:11 +03:00
Avi Kivity
8edb79ea80 Merge 'Reduce compaction serialization' from Mikołaj Sielużycki
update_history can take a long time compared to compaction, as a call
issued on shard S1 can be handled on shard S2. If the other shard is
under heavy load, we may unnecessarily block kicking off a new
compaction. Normally it isn't a problem, as compactions aren't super
frequent, but there were edge cases where the described behaviour caused
compaction to fail to keep up with excessive flushing, leading to too
many sstables on disk and OOM during a read.

There is no need to wait with next compaction until history is updated,
so release the weight earlier to remove unnecessary serialization.

Changelog:
v3:
- explicitly call deregister instead of moving the weight RAII object to release weight
- mark compaction as finished when sstables are compacted, without waiting for history to update
v2:
- Split the patches differently for easier review
- Rebased agains newer master, which contains fixes that failed the debug version of the test
- Removed the test, as it will be provided by [PR#10717](https://github.com/scylladb/scylla/pull/10717)

Closes #10507

* github.com:scylladb/scylla:
  compaction: Release compaction weight before updating history.
  compaction: Inline compact_sstables_and_update_history call.
  compaction: Extract compact_sstables function
  compaction: Rename compact_sstables to compact_sstables_and_update_history
  compaction: Extract update_history function
  compaction: Extract should_update_history function.
  compaction: Fetch start_size from compaction_result
  compaction: Add tracking start_size in compaction_result.
2022-06-13 16:04:20 +03:00
Takuya ASADA
5643c6de56 scylla_util.py: fix "systemctl is-active" causes error
On 48b6aec16a we mistakenly allowed
check=True on systemd_unit.is_active(), it should be check=False.
We check unit's status by "systemctl is-active" output string,
it returns "active" or "inactive".
But systemctl command returns non-zero status when it returning
"inactive", so we are getting Exception here.
To fix this, we need new option "ignore_error=True" for out(),
and use it in systemd_unit.is_active().

Fixes #10455

Closes #10467
2022-06-13 13:45:50 +03:00
Botond Dénes
43d23d797d scylla-gdb.py: make scylla-threads more flexible
Currently the scylla-threads command just lists all threads on all
shards. This is usually more than what one wants. This patch adds
support for listing threads only on the current shard (new default), on
a specific shard or all shards (old default).
Also, optionally the thread functor's vtable symbol can be added to the
listing. This makes the listing more informative but much more bloated
as well. Which is why it's opt-in.
Example listing (no vtable symbols):

    (gdb) scylla threads
    [shard 5] (seastar::thread_context*) 0x60100035c900, stack: 0x60100bd20000
    [shard 5] (seastar::thread_context*) 0x6010084a3800, stack: 0x601008c80000
    [shard 5] (seastar::thread_context*) 0x60100037c900, stack: 0x60100c640000
    [shard 5] (seastar::thread_context*) 0x60100035d200, stack: 0x60100d9e0000
    [shard 5] (seastar::thread_context*) 0x60100372d980, stack: 0x60100ad60000
    [shard 5] (seastar::thread_context*) 0x601000110d80, stack: 0x601009be0000
    [shard 5] (seastar::thread_context*) 0x6010084cd680, stack: 0x60100a160000
    [shard 5] (seastar::thread_context*) 0x6010000dc780, stack: 0x60100a2e0000
    [shard 5] (seastar::thread_context*) 0x6010084cca80, stack: 0x60100a1c0000
    [shard 5] (seastar::thread_context*) 0x6010084cc000, stack: 0x60100ab40000
    [shard 5] (seastar::thread_context*) 0x60100038ca80, stack: 0x601009860000
    [shard 5] (seastar::thread_context*) 0x60100037db00, stack: 0x60100a820000

Example listing with vtable symbols:
    (gdb) scylla threads -v
    [shard 5] (seastar::thread_context*) 0x60100035c900, stack: 0x60100bd20000 vtable: 0x478520 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}>(seastar::thread_attributes, sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x6010084a3800, stack: 0x601008c80000 vtable: 0x478520 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}>(seastar::thread_attributes, sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x60100037c900, stack: 0x60100c640000 vtable: 0x478520 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}>(seastar::thread_attributes, sstables::compaction::consume_without_gc_writer(std::chrono::time_point<gc_clock, std::chrono::duration<long, std::ratio<1l, 1l> > >)::{lambda(flat_mutation_reader_v2)#1}::operator()(flat_mutation_reader_v2)::{lambda()#1}&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x60100035d200, stack: 0x60100d9e0000 vtable: 0x4784f0 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_1>(seastar::thread_attributes, sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_1&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x60100372d980, stack: 0x60100ad60000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x601000110d80, stack: 0x601009be0000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x6010084cd680, stack: 0x60100a160000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x6010000dc780, stack: 0x60100a2e0000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x6010084cca80, stack: 0x60100a1c0000 vtable: 0x582ca8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::view::view_update_generator::start()::$_0>(seastar::thread_attributes, db::view::view_update_generator::start()::$_0&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x6010084cc000, stack: 0x60100ab40000 vtable: 0x5649a8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::manager::end_point_hints_manager::sender::start()::$_20>(seastar::thread_attributes, db::hints::manager::end_point_hints_manager::sender::start()::$_20&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x60100038ca80, stack: 0x601009860000 vtable: 0x8c6088 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<std::_Bind<void (seastar::tls::reloadable_credentials_base::reloading_builder::*(seastar::tls::reloadable_credentials_base::reloading_builder*))()>>(seastar::thread_attributes, std::_Bind<void (seastar::tls::reloadable_credentials_base::reloading_builder::*(seastar::tls::reloadable_credentials_base::reloading_builder*))()>&&)::{lambda()#1}>::s_vtable>
    [shard 5] (seastar::thread_context*) 0x60100037db00, stack: 0x60100a820000 vtable: 0x569bd8 <seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<db::hints::space_watchdog::start()::$_2>(seastar::thread_attributes, db::hints::space_watchdog::start()::$_2&&)::{lambda()#1}>::s_vtable>

Closes #10776
2022-06-13 13:05:27 +03:00
Avi Kivity
ee2420ff43 messaging: add boilerplate to rpc_protocol_impl.hh
License, copyright, #pragma once.

The copyright is set to 2021 since that was when the file was created.

Closes #10778
2022-06-13 07:29:32 +02:00
Piotr Sarna
ddf83f6ddc scripts: make pull_github_pr.sh more universally usable
After 93b765f655, our pull_github_pr.sh script tries to detect
a non-orthodox remote repo name, but it also adds an assumption
which breaks on some configurations (by some I mean mine).
Namely, the script tries to parse the repo name from the upstream
branch, assuming that current HEAD actually points to a branch,
which is not the way some users (by some I mean me) work with
remote repositories. Therefore, to make the script also work
with detached HEAD, it now has two fallback mechanisms:
1. If parsing @{upstream} failed, the script tries to parse
   master@{upstream}, under the assumption that the master branch
   was at least once used to track the remote repo.
2. If that fails, `origin/master` is used as last resort solution.

This patch allows some users (guess who) to get back to using
scripts/pull_github_pr.sh again without using a custom patched version.

Closes #10773
2022-06-13 08:15:40 +03:00
Avi Kivity
06a62b150d sstables: processing_result_generator: prefer standard coroutines over the technical specification with clang 14
Clang up to version 13 supports the coroutines technical specification
(in std::experimental). 15 and above support standard coroutines (in
namespace std). Clang 14 supports both, but with a warning for the
technical specification coroutines.

To avoid the warning, change the threshold for selecting standard
coroutines from clang 15 to clang 14. This follow seastar commit
070ab101e2.

Closes #10647
2022-06-12 20:05:28 +03:00
Avi Kivity
6d943e6cd0 cql3: expr: drop column_maybe_subscripted
column_maybe_subscripted is a variant<column_value*, subscript*> that
existed for two reasons:
  1. evaluation of subscripts and of columns took different paths.
  2. calculation of the type of column or column[sub] took different paths.

Now that all evaluations go through evaluate(), and the types are
present in the expression itself, there is no need for column_maybe_subscripted
and it is replaced with plain expressions.
2022-06-12 19:21:28 +03:00
Avi Kivity
2aa9199e9a cql3: expr: possible_lhs_values(): open-code get_value_comparator()
get_value_comparator() is going away soon, so open-code it here. It's
not doing much anyway.
2022-06-12 19:14:50 +03:00
Avi Kivity
b1c12073b1 cql3: expr: rationalize lhs/rhs argument order
Some functions accept the right-hand-side as the first argument
and the left-hand-side as the second argument. This is now confusing,
but at least safe-ish, as the arguments have different types. It's
going to become dangerous when we switch to expressions for both sides,
so let's rationalize it by always starting with lhs.

Some parameters were annotated with _lhs/_rhs when it was not clear.
2022-06-12 18:55:24 +03:00
Avi Kivity
9beac1df53 cql3: expr: don't rely on grammar when comparing tuples
The grammar only allows comparing tuples of clustering columns, which
are non-null, but let's not rely on that deep in expression evaluation
as it can be relaxed.
2022-06-12 18:41:03 +03:00
Avi Kivity
9a4f2a8cc3 cql3: expr: wire column_value and subscript to evaluate()
With everything standardized on evaluation_inputs(), it's a matter of calling
get_value().
2022-06-12 18:21:04 +03:00
Avi Kivity
30721fdc4a cql3: get_value(subscript): remove gratuitous pointer
While extracting get_value(subscript) we inherited a pointer due
to the calling convention, we can now remove it.
2022-06-12 18:18:59 +03:00
Avi Kivity
dd2fec9cb1 cql3: expr: reindent get_value(subscript)
Whitespace only change.
2022-06-12 18:04:12 +03:00
Avi Kivity
31b9e2a565 cql3: expr: extract get_value(subscript) from get_value(column_maybe_subscripted)
We wish to wire get_value(subscript) into evaluate (and get rid of
column_maybe_subscripted).
2022-06-12 18:03:03 +03:00
Avi Kivity
844756d4fe cql3: raw_value: add missing conversion from managed_bytes_opt&&
Conversion from const managed_bytes_opt& already exists.
2022-06-12 17:55:48 +03:00
Avi Kivity
248433d7e0 cql3: prepare_expr: prepare subscript type
The type of a subscript expression is the value comparator
of the expression (column) being subscripted, according to out
wierd naming.
2022-06-12 17:39:08 +03:00
Avi Kivity
b5287db8ea cql3: expr: drop internal 'column_value_eval_bag'
is_satisfied_by() used an internal column_value_eval_bag type that
was more awkwardly named (and more awkward to use due to more nesting)
than evaluation_inputs. Drop it and use evaluation_inputs throughout.

The thunk is_satisified_by(evaluation_inputs) that just called
is_satisified_by(column_value_eval_bag) is dropped.
2022-06-12 17:12:41 +03:00
Avi Kivity
55085906ca cql3: expr: change evalute() to accept evaluation_inputs
Currently, evaluate() accepts only query_options, which makes
it not useful to evaluate columns. As a result some callers
(column_condition) have to call it directly on the right-hand-side
of binary expressions instead of evaluating the binary expression
itself.

Change it to accept evaluation_input as a parameter, but keep
the old signature too, since it is called from many places that
don't have rows.
2022-06-12 16:51:42 +03:00
Avi Kivity
2ecdb219fb cql3: expr: make evaluate(<expression subtype>) static
They aren't called from anywhere outside expression.cc, and
we're playing with the signatures, so hide them to avoid
rebuilds.
2022-06-12 16:13:20 +03:00
Avi Kivity
c80999fab4 cql3: expr: push is_satisfied_by regular and static column extraction to callers
is_satisfied_by() rearranges the static and regular columns from
query::result_row_view form (which is a use-once iterator) to
std::vector<managed_bytes_opt> (which uses the standard value
representation, and allows random access which expression
evaluation needs). Doing it in is_saitisfied_by() means that it is
done every time an expression is evaluated, which is wasteful. It's
also done even if the expression doesn't need it at all.

Push it out to callers, which already eliminates some calls.

We still pass cql3::expr::selection, which is a layering violation,
but that is left to another time.

Note that in view.cc's check_if_matches(), we should have been
able to move static_and_regular_columns calculation outside the
loop. However, we get crashes if we do. This is likely due to a
preexisting bug (which the zero iterations loop avoids). However,
in selection.cc, we are able to avoid the computation when the code
claims it is only handling partition keys or clustering keys.
2022-06-12 16:12:41 +03:00
Avi Kivity
4b715226fe cql3: expr: convert is_satisfied_by() signature to evaluation_inputs
Callers are converted, but the internals are kept using the old
conventions until more APIs are converted.

Although the new API allows passing no query_options, the view code
keeps passing dummy query_options and improvement is left as a FIXME.
2022-06-12 12:53:44 +03:00
Avi Kivity
7a9b645d64 cql3: expr: introduce evaluation_inputs
An expression may refer to values provided externally: the partition
and clusterinng keys, the static and regular row (all providing
column values), and the query options (providing values for bind
variables). Currently, different evaluation functions
(evaluate(), get_value(), and is_satisfied_by()) receive different
subsets of these values.

As a first step towards unifying the various ways to evaluate an
expression, collect the parameters in a single structure. Since
different evaluation contexts have different subsets, make everything
optional (via a pointer). Note that callers are expected to verify
using the grammar or prepare phase that they don't refer to values
that are not provided.

The cql3::selection::selection parameter is provided to translate
from query::result_row_view to schema column indexes. This is pretty
bad since it means the translation needs to be done for every
evaluation and is therefore a candidate for removal, but is kept here
since that's how it's currently done.
2022-06-12 12:47:23 +03:00
Kamil Braun
e87ca733f0 Merge 'test.py: fix bugs, add support for flaky tests' from Konstantin Osipov
Marking a test as flaky allows to keep running it in CI rather than disable it when it's discovered that a test is flaky.
Flaky tests, if they fail, show up as flaky in the output, but don't fail the CI.
```
kostja@hulk:~/work/scylla/scylla$ ./test.py cdc_with --repeat=30 --verbose
Found 30 tests.
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[1/30]       cql     debug  [ FLKY ] cdc_with_lwt_test.2 9.36s
[2/30]       cql     debug  [ FLKY ] cdc_with_lwt_test.1 9.53s
[3/30]       cql     debug  [ PASS ] cdc_with_lwt_test.7 9.37s
[4/30]       cql     debug  [ PASS ] cdc_with_lwt_test.8 9.41s
[5/30]       cql     debug  [ PASS ] cdc_with_lwt_test.10 9.76s
[6/30]       cql     debug  [ FLKY ] cdc_with_lwt_test.9 9.71s
```

Closes #10721

* github.com:scylladb/scylla:
  test.py: add support for flaky tests
  test.py: make Test hierarchy resettable
  test.py: proper suite name in the log
  test.py: shutdown cassandra-python connection before exit
2022-06-10 19:00:36 +02:00
Konstantin Osipov
8036d19b84 test.py: add support for flaky tests
The idea is that a flaky test can be marked as flaky
rather than disabled to make sure it passes in CI.
This reduces chances of a regression being added
while the flakiness is being resolved and the number
of disabled tests doesn't grow.
2022-06-10 14:10:21 +03:00
Konstantin Osipov
4cf63efe6c test.py: make Test hierarchy resettable
Introduce reset() hierarchy, which is similar to __init__(),
i.e. allows to reset test execution state before retrying it.
Useful for retrying flaky tests.
2022-06-10 14:10:21 +03:00
Konstantin Osipov
2b92d96c87 test.py: proper suite name in the log
Use a nice suite name rather than an internal Python
object key in the log. Fixes a regression introduced
when addressing a style-related review remark.
2022-06-10 14:10:21 +03:00
Konstantin Osipov
950d606e38 test.py: shutdown cassandra-python connection before exit
Shutdown cassandra-python connections before exit, to avoid
warnings/exceptions at shutdown.

Cassandra-python runs a thread pool and if
connections are not shut down before exit, there could
be a warning that the thread pool is not destroyed
before exiting main.
2022-06-10 14:10:21 +03:00
Kamil Braun
082f9889b4 Merge 'tools/schema_loader: add support for CDC tables' from Botond Dénes
CDC tables use a custom partitioner, which is not reflected in schema dumps (`CREATE TABLE ...`) and currently it is not possible to fix this properly, as we have no syntax to set the partitioner for a table. To work around this, the schema loader determines whether a table is a cdc table based on its name (does it end with `_scylla_cdc_table`) and sets the partitioner manually if it is the case.
Fixes: https://github.com/scylladb/scylla/issues/9840

Closes #10774

* github.com:scylladb/scylla:
  tools/schema_loader: add support for CDC tables
  cdc/log.hh: expose is_log_name()
2022-06-10 13:04:38 +02:00
Kamil Braun
aeba88cc29 Merge 'test.py: fixes for connection handling' from Alecco
Change port type passed to Cassandra Python driver to int to avoid format errors in exceptions.

Manually shutdown connections to avoid reconnects after tests are done (required by upcoming async pytests).

Tests: (dev)

Closes #10722

* github.com:scylladb/scylla:
  test.py: shutdown connection manually
  test.py: fix port type passed to Cassandra driver
2022-06-10 11:40:47 +02:00
Botond Dénes
b3d6a182e4 tools/schema_loader: add support for CDC tables
CDC tables use a custom partitioner, which is not reflected in schema
dumps (`CREATE TABLE ...`) and currently it is not possible to fix this
properly, as we have no syntax to set the partitioner for a table.
To work around this, the schema loader determines whether a table is a
cdc table based on its name (does it end with `_scylla_cdc_table`) and
sets the partitioner manually if it is the case.
2022-06-10 10:57:55 +03:00
Botond Dénes
f8a8fe41d6 cdc/log.hh: expose is_log_name()
Allow outside code to use it to determine whether a table is cdc or not.
This is currently the most reliable method if the custom partitioner is
not set on the schema of the investigated table.
2022-06-10 10:57:12 +03:00
Botond Dénes
1c8c693ff7 Merge "Redefine Leveled compaction backlog" from Raphael S. Carvalho
"
This series is a consequence of the work started by:
"compaction: LCS: Fix inefficiency when pushing SSTables to higher levels"
9de7abdc80
"Redefine Compaction Backlog to tame compaction aggressiveness" d8833de3bb

The backlog definition for leveled is incorrectly built on the assumption that
the world must reach the state of zero amplification, i.e. everything in the
last level. The actual goal is space amplification of 1.1.

In reality, LCS just wants that for every level L, level L is fan_out=10 times
larger than L-1. See more in commit 9de7abdc80 which adjusts LCS to conform
to this goal.

If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should
return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1

But today, LCS calculates high backlog for the layout above, as it will only be
satisfied once everything is promoted to the maximum level. That's completely
disconnected from what the strategy actually wants. Therefore, a mismatch.

With today's definition, the backlog for any SSTable is:
    sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out

    where Lmax = maximum level,
    and fan_out = LCS' fan out which is 10 by default

That's essentially calculating the total cost for data in the SSTable to climb
up to the maximum level. Of course, if a SSTable is at the maximum level,
(Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero.

Take a look at this example:

If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G
   0.16G = LCS' default fragment size
   Maximum level (Lmax in formula) can be easily 3 as:
    log10 of (30G/0.16G=~187 sstables)) = ~2.27
    ~2.27 means that data has exceeded level 2 capacity and so needs 3 levels.

So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk
memory ratio), that's normalized backlog of ~15, which translates into
additional ~500 shares. That's halfway to full compaction speed.
With more files in higher levels, we can easily get to a normalized backlog
above 30, resulting in 1k shares.

The suboptimal backlog definition causes either table using LCS or coexisting
tables to run with more shares than needed, causing compaction to steal
resources, resulting in higher latency and reduced throughput.

To solve this problem, a new formula is used which will basically calculate
the amount of work needed to achieve the layout goal. We no longer want to
promote everything to the last level, but instead we'll incrementally calculate
the backlog in each level L, which is the amount of work needed such that the
next level L + 1 is at least fan_out times bigger.

Fixes #10583.

Results
=====

image:
https://user-images.githubusercontent.com/1409139/168713675-d5987d09-7011-417c-9f91-70831c069382.png

The patched version correctly clears the backlog, meaning that once LCS is
satisfied, backlog is 0. Therefore, next compaction either from this table or
another won't run unnecessarily aggressive.

p99 read and write latency have clearly improved. throughput is also more
stable.
"

* 'LCS_backlog_revamp' of https://github.com/raphaelsc/scylla:
  tests: sstable_compaction_test: Adjust controller unit test for LCS
  compaction: Redefine Leveled compaction backlog
2022-06-10 09:21:13 +03:00
Raphael S. Carvalho
079283193a tests: sstable_compaction_test: Adjust controller unit test for LCS
The controller unit test for LCS was only creating level 0 SSTables.
As level 0 falls back to STCS controller, it means that we weren't actually
testing LCS controller.
So let's adjust the unit test to account for LCS fan_out, which is 10
instead of 4, and also allow creation of SSTables on higher levels.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-06-09 14:21:40 -03:00
Raphael S. Carvalho
b27a1d88fe compaction: Redefine Leveled compaction backlog
The backlog definition for leveled is incorrectly built on the assumption that
the world must reach the state of zero amplification, i.e. everything in the
last level. The actual goal is space amplification of 1.1.

In reality, LCS just wants that for every level L, level L is fan_out=10 times
larger than L-1. See more in commit 9de7abdc80 which adjusts LCS to conform
to this goal.

If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should
return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1

But today, LCS calculates high backlog for the layout above, as it will only be
satisfied once everything is promoted to the maximum level. That's completely
disconnected from what the strategy actually wants. Therefore, a mismatch.

With today's definition, the backlog for any SSTable is:
    sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out

    where Lmax = maximum level,
    and fan_out = LCS' fan out which is 10 by default

That's essentially calculating the total cost for data in the SSTable to climb
up to the maximum level. Of course, if a SSTable is at the maximum level,
(Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero.

Take a look at this example:

If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G
   0.16G = LCS' default fragment size
   Maximum level (Lmax in formula) can be easily 3 as:
    log10 of (30G/0.16G=~187 sstables)) = ~2.27
    ~2.27 means that data has exceeded level 2 capacity and so needs 3 levels.

So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk
memory ratio), that's normalized backlog of ~15, which translates into
additional ~500 shares. That's halfway to full compaction speed.
With more files in higher levels, we can easily get to a normalized backlog
above 30, resulting in 1k shares.

The suboptimal backlog definition causes either table using LCS or coexisting
tables to run with more shares than needed, causing compaction to steal
resources, resulting in higher latency and reduced throughput.

To solve this problem, a new formula is used which will basically calculate
the amount of work needed to achieve the layout goal. We no longer want to
promote everything to the last level, but instead we'll incrementally calculate
the backlog in each level L, which is the amount of work needed such that the
next level L + 1 is at least fan_out times bigger.

Fixes #10583.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-06-09 14:21:40 -03:00
Asias He
e0aca10bcd streaming: Enable auto off strategy compaction trigger for all rbno ops
Since commit 3dc9a81d02 (repair: Repair
table by table internally), a table is always repaired one after
another. This means a table will be repaired in a continuous manner.

Unlike before a table will be repaired again after other tables have
finished the same range.

```
for range in ranges
  for table in tables
      repair(range, table)
```

The wait interval can be large so we can not utilize the assumption if
there is no repair traffic, the whole table is finished.

After commit 3dc9a81d02, we can utilize
the fact that a table is repaired continuously property and trigger off
strategy automatically when no repair traffic for a table is present.

This is especially useful for decommission operation with multiple
tables. Currently, we only notify the peer node the decommission is done
and ask the peer to trigger off strategy compaction. With this
patch, the peer node will trigger automatically after a table is
finished, reducing the number of temporary sstables on disk.

Refs #10462

Closes #10761
2022-06-09 17:10:14 +03:00
Avi Kivity
afc06f0017 messaging: forward-declare types in messaging_service.hh
messaging_service.hh is a switchboard - it includes many things,
and many things include it. Therefore, changes in the things it
includes affect many translation units.

Reduce the dependencies by forward-declaring as much as possible.
This isn't pretty, but it reduces compile time and recompilations.

Other headers adjusted as needed so everything (including
`ninja dev-headers`) still compile.

Closes #10755
2022-06-09 15:52:12 +03:00
Nadav Har'El
e9b6171b51 Merge 'cql3: expr: unify left-hand-side and right-hand-side of binary_operator prepares' from Avi Kivity
Currently, preparing the left-hand-side of a binary operator and the
right-hand-side use different code paths. The left-hand-side derives
the type of the expression from the expression itself, while the
right-hand-side imposes the type on the expression (allowing the types
of bind variables to be inferred).

This series unifies the two, by making the imposed type (the "receiver")
optional, and by allowing prepare to fail gracefully if we were not able
to infer the type. The old prepare_binop_lhs() is removed and replaced
with prepare_expression, already used for the right hand side.

There is one step remaining, and that is to replace prepare_binary_operator
with prepare_expression, but that is more involved and is left for a follow-up.

Closes #10709

* github.com:scylladb/scylla:
  cql3: expr: drop prepare_binop_lhs()
  cql3: expr: move implementation of prepare_binop_lhs() to try_prepare_expression()
  cql3: expr: use recursive descent when preparing subscripts
  cql3: expr: allow prepare of tuple_constructor with no receiver
  cql3: expr: drop no longer used printable_relation parameter from prepare_binop_lhs()
  cql3: expr: print only column name when failing to resolve column
  cql3: expr: pass schema to prepare_expression
  cql3: expr: prepare_binary_operator: drop unused argument ctx
  cql3: expr: stub type inference for prepare_expression
  cql3: expr: introduce type_of() to fetch the type of an expression
  cql3: expr: keep type information in casts
  cql3: expr: add type field to subscript, field_selection, and null expressions
  cql3: expr: cast: use data_type instead of cql3_type for the prepared form
2022-06-09 15:38:50 +03:00
Nadav Har'El
b2444b6e9f test/cql-pytest: skip another test on older, buggy, drivers
Older versions of the Python Cassandra driver had a bug, detected by
the driver_bug_1 fixture, where a single empty page aborts a scan.

The test test_secondary_index.py::test_filter_and_limit uses filtering
and deliberately tiny pages, so it turns out that some of them are
empty, so the test breaks on buggy versions of the driver, which causes
the test to fail when run by developers who happen to have old versions
of the driver.

So in this patch we use the driver_bug_1 fixture, to skip this test
when running on a buggy version of the driver.

Fixes #10763

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-09 14:37:45 +03:00
Nadav Har'El
c8a3d0758a test/cql-pytest: de-duplicate code checking for an old buggy driver
We have in test_filtering.py two tests which fail when running on an old
version of the Python driver which has a specific bug, so we skip those
tests if the buggy driver is installed.

But the code to check the driver version is duplicated twice, so in this
patch we move the version-checking-and-skipping code to a fixture, which
we can use twice.

The motivation is that in the next patch we will want to introduce a
third use of the same code - and a fixture is cleaner than a third
duplicate.

This patch is supposed to be code-movement only, without functional
changes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-06-09 14:23:40 +03:00
Gleb Natapov
2fa0519991 migration manager: remove unused code 2022-06-09 09:40:55 +03:00
Gleb Natapov
727a9071d8 db/system_distributed_keyspace: do not announce empty schema 2022-06-09 09:40:55 +03:00
Gleb Natapov
6e100d1ea3 main: stop raft before the migration manager
Since the group0 uses migration manager to apply commands we need to
stop raft before we stopping migration manager.
2022-06-09 09:40:55 +03:00
Gleb Natapov
70b7b2b4d6 storage_service: do not pass the raft group manager to storage_service constructor
Reduce the storage_service's dependency on the raft group manager. The
group manager is needed only during bootstrap and in an rpc handler, so
pass it to those functions directly.
2022-06-09 09:40:55 +03:00
Gleb Natapov
89fe305888 main: destroy the group0_client after stopping the group0
The group0_client uses the group0 internally and cannot be destroyed
until the group0 is stopped to guaranty no ongoing calls into it by the
group0_client.
2022-06-09 09:23:53 +03:00
Nadav Har'El
75c2bd78ae test/alternator: reproducer for GetBatchItem duplicate keys
It turns out that DynamoDB forbids requesting the same item more than
once in a GetBatchItem request. Trying to do it would obviously be a
waste, but DynamoDB outright refuses it - and Alternator currently
doesn't (refs #10757).

The test currently passes on DynamoDB and fails on Alternator, so it
is marked xfail.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10758
2022-06-09 07:04:50 +02:00
Asias He
05a8d382b6 storage_service: Do not call do_batch_log_replay again in unbootstrap
We have called do_batch_log_replay in the beginning of unbootstrap. No
need to call it again.
2022-06-09 10:07:33 +08:00
Asias He
29c3f10ee4 storage_service: Add log for start and stop of batchlog replay
It might take a long time to finish. Log batchlog replay event so we
know if batchlog replay goes wrong.

Refs #10756
2022-06-09 09:52:18 +08:00
Piotr Sarna
e5956fee8a Merge 'cql3: column_condition cleanups' from Avi Kivity
Cosmetic cleanups for column_condition.

Closes #10716

* github.com:scylladb/scylla:
  cql3: column_condition: deinline constructor
  cql3: column_condition: rename `column` member
2022-06-08 17:48:07 +02:00
Botond Dénes
f060129223 Merge 'compaction::setup: yield to prevent stalls with large number of sstables' from Benny Halevy
As seen in https://github.com/scylladb/scylla/issues/10738,
compaction::setup might stall when processing a large number of sstables.

Make it a coroutine and maybe_yield to prevent those stalls.

Closes #10750

* github.com:scylladb/scylla:
  compaction: setup: reserve space for _input_sstable_generations
  compaction: coroutinize setup and maybe yield
2022-06-08 14:54:46 +03:00
Botond Dénes
3d8cd72c97 Merge 'multishard_mutation_query: use coroutine::as_future' from Benny Halevy
This series converts try/catch blocks in coroutines for multishard_mutation_query to use coroutine::as_future to get and handle errors, reducing exception handling costs (that are expected on timeouts).

It was previously sent to the mailing list.
This version (v2) is just a rebase of the v1 series,
with one patch dropped as it was already merged to master independentally.

Closes #10727

* github.com:scylladb/scylla:
  multishard_mutation_query: do_query: couroutinize save_readers lambda
  multishard_mutation_query: do_query: prevent exceptions using coroutine::as_future
  multishard_mutation_query: read_page: prevent exceptions using coroutine::as_future
  multishard_mutation_query: save_readers: fixup indentation
  multishard_mutation_query: coroutinize save_readers
  multishard_mutation_query: lookup_readers: make noexcept
  multishard_mutation_query: optimize lookup_readers
2022-06-08 14:26:55 +03:00
Konstantin Osipov
29f8ba2c5e raft: add Raft design nodes to the docs
Closes #10504
2022-06-08 12:33:51 +02:00
Benny Halevy
593a192664 compaction: setup: reserve space for _input_sstable_generations
We know in advance the maximum number of
sstable generations to track, so reserve space for it
to prevent vector reallocation for large number of sstables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 10:18:24 +03:00
Benny Halevy
4fac6e0b27 compaction: coroutinize setup and maybe yield
To prevent reactor stalls with large number of sstables.

Fixes #10738

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 10:12:41 +03:00
Benny Halevy
5babc609c6 multishard_mutation_query: do_query: couroutinize save_readers lambda
To keep it simple. It is unlikely to throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 09:31:17 +03:00
Benny Halevy
921092955b multishard_mutation_query: do_query: prevent exceptions using coroutine::as_future
Optimize error handling by preventing
exception try/catch using coroutine::as_future.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 09:31:17 +03:00
Benny Halevy
7a76ba4038 multishard_mutation_query: read_page: prevent exceptions using coroutine::as_future
Optimize error handling by preventing
exception try/catch using coroutine::as_future
to get query::consume_page's result.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 09:31:15 +03:00
Benny Halevy
817a0f316a multishard_mutation_query: save_readers: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 09:23:14 +03:00
Benny Halevy
804d727b8b multishard_mutation_query: coroutinize save_readers
And use smp::invoke_on_all rather than a home-brewed
version of parallel_for_each over all shard ids.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 09:23:14 +03:00
Benny Halevy
22e5352cc2 multishard_mutation_query: lookup_readers: make noexcept
Sot it can be co_awaited efficiently using coroutine::as_future,
othwise, any exceptions will escape `as_future`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 09:23:14 +03:00
Benny Halevy
ea3935507e multishard_mutation_query: optimize lookup_readers
No need to call _db.invoke_on inside a parallel_for_each
loop over all shards.  Just use _db.invoke_on_all instead.

Besides that, there's no need for a .then continuation
for assigning the per-shard reader in _readers[shard].
It can be done by the functor running on each db shard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-08 09:23:14 +03:00
Botond Dénes
74587cb1b5 Merge 'Fix stalls during repair with rbno' from Asias He
Found with scylla --blocked-reactor-notify-ms 1 during replace operation with rbno turned on.
The stalls showed without this patch were gone after this path set.

Closes #10737

* github.com:scylladb/scylla:
  repair: Avoid stall in working_row_hashes
  repair: Avoid stall in apply_rows_on_master_in_thread
2022-06-08 06:41:51 +03:00
Avi Kivity
836cbf4e86 Update seastar submodule
* seastar 2be9677d6e...1424d34c93 (22):
  > Use tls socket to retrieve distinguished name
  > perftune.py: remove duplicates in 'append' parameters when we dump an options file
  > rpc: add an option for an asynchronous connection isolation function
  > Merge "Add more facilities to RPC tester" from Pavel E
  > json: wait for writing final characters of a json document
  > Revert "Use tls socket to retrieve distinguished name"
  > future.hh: drop unused parameters
  > core/scollected: initialize _buf explicitly
  > rpc: remove recursion in do_unmarshall()
  > coroutine: Fix generator clang compilation
  > core: Reduce the default blocked-reactor-notify-ms to 25ms
  > build: group "CMAKE_CXX_*" options together
  > doc: s/c++dialect/c++-standard/
  > test: coroutines: adjust coroutine generator test for gcc
  > Use tls socket to retrieve distinguished name
  > coroutine: add an async generator
  > net/api: s/server_socket::is_listening()/operator bool()/
  > net/api: let "server_socket::local_address()" always return an addr
  > tls_test: Remove unsupported prio string from test case
  > Merge 'abort_source: assert request_abort called exactly once' from Benny Halevy
  > coroutines/all: stop using std::aligned_union_t
  > coroutines/all: ensure the template argument deduction work with clang-15

Closes #10739
2022-06-07 21:54:47 +03:00
Benny Halevy
1daa7820c9 main: shutdown: do not abort on storage_io_error
Do not abort in defer_verbose_shutdown if the callback
throws storage_io_error, similar and in addition to
the system errors handling that was added in
132c9d5933

As seen in https://github.com/scylladb/scylla/issues/9573#issuecomment-1148238291

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10740
2022-06-07 16:55:08 +03:00
Mikołaj Sielużycki
4143878558 compaction: Release compaction weight before updating history.
update_history can take a long time compared to compaction, as a call
issued on shard S1 can be handled on shard S2. If the other shard is
under heavy load, we may unnecessarily block kicking off a new
compaction. Normally it isn't a problem, as compactions aren't super
frequent, but there were edge cases where the described behaviour caused
compaction to fail to keep up with excessive flushing, leading to too
many sstables on disk and OOM during a read.

There is no need to wait with next compaction until history is updated,
so release the weight earlier to remove unnecessary serialization.
Compaction is marked as finished as soon as sstables are compacted
(without waiting for history update).
2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki
5ce1fd1574 compaction: Inline compact_sstables_and_update_history call.
This commit introduces no functional changes and exists solely for
clarity of the change in the subsequent commit.
2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki
533552273a compaction: Extract compact_sstables function 2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki
33c5802957 compaction: Rename compact_sstables to compact_sstables_and_update_history 2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki
9572520d0d compaction: Extract update_history function 2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki
537819b7f8 compaction: Extract should_update_history function. 2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki
447bd8a2e0 compaction: Fetch start_size from compaction_result
The start size is calculated during compaction and returned from
sstables::compact_sstables, so there is no need to do it twice.
2022-06-07 12:55:28 +02:00
Mikołaj Sielużycki
2edf137f61 compaction: Add tracking start_size in compaction_result. 2022-06-07 12:55:28 +02:00
Petr Gusev
0450974057 cql3_type::raw_collection: handle unknown types first
The issue is about handling errors when the user specifies something strange instead of a type, e.g. CREATE TABLE try1 (a int PRIMARY KEY, b list<zzz>):
* the error message only talks about collections, while zzz could also be an UDT;
* the same error message is given even when zzz is not a valid collection or UDT name.
The first point has already been fixed, now Scylla says 'Non-frozen user types or collections are not allowed inside collections: list<zzz>'. This commit fixes the second.

Whether the type is a valid UDT or not is checked in cql3_type::raw_ut::prepare_internal, but 'non-frozen' check triggers first in cql3_type::raw_collection::prepare_internal, before we recursively get to the argument types of the collection. The patch reverses the order here, first thing we recurse and ensure that the collection argument types are valid, and only then we apply the collection checks. A side effect of this is that the error messages of the checks in raw_collection will include the keyspace name, because it will now be assigned in raw_ut::prepare_internal before them.

The patch affects the validation order, so in case of list<zzz<xxx>> the message could be different, but it doesn't seem to be possible according to the Cql grammar.

Examples:
create type ut2 (a int, b list<ut1>); --> error('Unknown type ks.ut1')

create type ut1 (a int);
create type ut2 (a int, b list<ut1>); --> error('Non-frozen user types or collections are not allowed inside collections: list<ks.ut1>')

create type ut2 (a int, b list<frozen<ut1>>); --> OK

Fixes: scylladb#3541

Closes #10726
2022-06-07 11:16:12 +02:00
Avi Kivity
1e7cece837 tools: toolchain: prepare: use buildah multi-arch build instead of bash hacks
In 69af7a830b ("tools: toolchain: prepare: build arch images in parallel"),
we added parallel image generation. But it turns out that buildah can
do this natively (with the --platform option to specify architectures
and --jobs parameter to allow parallelism). This is simpler and likely
has better error handling than an ad-hoc bash script, so switch to it.

Closes #10734
2022-06-07 11:51:13 +03:00
Asias He
f2c05e21ee repair: Avoid stall in working_row_hashes
Fix the following stall during repair:

```
Reactor stalled for 1 ms on shard 0. Backtrace:
[Backtrace #11]
{build/release/scylla} 0x4c6deb2: void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./
 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:772
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:791
{build/release/scylla} 0x4c6cb10: seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1366
{build/release/scylla} 0x4c6ddc0: seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1108
 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1125
 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1349
{build/release/scylla} 0x7f75551bfa1f: ?? ??:0
{build/release/scylla} 0x37abf12: repair_hash::operator<(repair_hash const&) const at ././repair/hash.hh:30
 (inlined by) std::less<repair_hash>::operator()(repair_hash const&, repair_hash const&) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_function.h:400
 (inlined by) bool absl::container_internal::key_compare_adapter<std::less<repair_hash>, repair_hash>::checked_compare::operator()<repair_hash, repair_hash, 0>(repair_hash const&, repair_hash const&) const at ./abseil/absl/containe
 (inlined by) absl::container_internal::SearchResult<int, false> absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash>, 256, false> >::binary_sear
 (inlined by) _ZNK4absl18container_internal10btree_nodeINS0_10set_paramsI11repair_hashSt4lessIS3_ESaIS3_ELi256ELb0EEEE13binary_searchIS3_NS0_19key_compare_adapterIS5_S3_E15checked_compareEEENS0_12SearchResultIiXsr23btree_is_key_com (inlined by) absl::container_internal::SearchResult<int, false> absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash>, 256, false> >::lower_bound
 (inlined by) absl::container_internal::SearchResult<absl::container_internal::btree_iterator<absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash (inlined by) std::pair<absl::container_internal::btree_iterator<absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash>, 256, false> >, repair_hash (inlined by) std::pair<absl::container_internal::btree_iterator<absl::container_internal::btree_node<absl::container_internal::set_params<repair_hash, std::less<repair_hash>, std::allocator<repair_hash>, 256, false> >, repair_hash
 (inlined by) operator() at ./repair/row_level.cc:896
 (inlined by) seastar::future<void> seastar::futurize<void>::invoke<repair_meta::working_row_hashes()::{lambda(absl::btree_set<repair_hash, std::less<repair_hash>, std::allocator<repair_hash> >&)#1}::operator()(absl::btree_set<repa
 (inlined by) auto seastar::futurize_invoke<repair_meta::working_row_hashes()::{lambda(absl::btree_set<repair_hash, std::less<repair_hash>, std::allocator<repair_hash> >&)#1}::operator()(absl::btree_set<repair_hash, std::less<repai{build/release/scylla} 0x37ac70f: seastar::internal::do_for_each_state<std::_List_iterator<repair_row>, repair_meta::working_row_hashes()::{lambda(absl::btree_set<repair_hash, std::less<repair_hash>, std::allocator<repair_hash> >&)
{build/release/scylla} 0x4c7ee64: seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2356
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2769
{build/release/scylla} 0x4c80247: seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2938
{build/release/scylla} 0x4c7f49c: seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2821
{build/release/scylla} 0x4c264d8: seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:265
{build/release/scylla} 0x4c259b1: seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:156
{build/release/scylla} 0xf5c16f: scylla_main(int, char**) at ./main.cc:535
{build/release/scylla} 0xf5999a: std::function<int (int, char**)>::operator()(int, char**) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:590
 (inlined by) main at ./main.cc:1575
{build/release/scylla} 0x27b74: ?? ??:0
{build/release/scylla} 0xf5892d: _start at ??:?
```

Found with scylla --blocked-reactor-notify-ms 1

Refs #10665
2022-06-07 16:04:50 +08:00
Asias He
45bcacf672 repair: Avoid stall in apply_rows_on_master_in_thread
Fix the following stall during repair:

```
Reactor stalled for 3 ms on shard 0. Backtrace:
[Backtrace #20]
{build/release/scylla} 0x4c6deb2: void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:772
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:791
{build/release/scylla} 0x4c6cb10: seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1366
{build/release/scylla} 0x4c6ddc0: seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1108
 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1125
 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1349
{build/release/scylla} 0x7f75551bfa1f: ?? ??:0
{build/release/scylla} 0x11293e9: std::default_delete<bytes_ostream::chunk>::operator()(bytes_ostream::chunk*) const at database.cc:?
 (inlined by) std::default_delete<bytes_ostream::chunk>::operator()(bytes_ostream::chunk*) const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:85
{build/release/scylla} 0x37b18e6: ~unique_ptr at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/unique_ptr.h:361
 (inlined by) ~bytes_ostream at ././bytes_ostream.hh:26
 (inlined by) ~frozen_mutation_fragment at ././frozen_mutation.hh:265
 (inlined by) std::_Optional_payload_base<frozen_mutation_fragment>::_M_destroy() at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/optional:260
 (inlined by) std::_Optional_payload_base<frozen_mutation_fragment>::_M_reset() at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/optional:280
 (inlined by) ~_Optional_payload at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/optional:401
 (inlined by) ~_Optional_base at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/optional:472
 (inlined by) ~repair_row at ././repair/row.hh:24
 (inlined by) void std::destroy_at<repair_row>(repair_row*) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_construct.h:88
 (inlined by) void std::allocator_traits<std::allocator<std::_List_node<repair_row> > >::destroy<repair_row>(std::allocator<std::_List_node<repair_row> >&, repair_row*) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/alloc_traits.h:537
 (inlined by) std::__cxx11::_List_base<repair_row, std::allocator<repair_row> >::_M_clear() at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/list.tcc:77
 (inlined by) ~_List_base at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_list.h:499
 (inlined by) repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int) at ./repair/row_level.cc:1273
{build/release/scylla} 0x37ad9dc: repair_meta::get_row_diff_source_op(seastar::bool_class<update_peer_row_hash_sets_tag>, gms::inet_address, unsigned int, seastar::rpc::sink<repair_hash_with_cmd>&, seastar::rpc::source<repair_row_on_wire_with_cmd>&) at ./repair/row_level.cc:1617
{build/release/scylla} 0x37a2982: repair_meta::get_row_diff_with_rpc_stream(absl::btree_set<repair_hash, std::less<repair_hash>, std::allocator<repair_hash> >, seastar::bool_class<needs_all_rows_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, gms::inet_address, unsigned int) at ./repair/row_level.cc:1683
```

Found with scylla --blocked-reactor-notify-ms 1

Refs #10665
2022-06-07 16:04:50 +08:00
Takuya ASADA
c82da0ea8e install-dependencies.sh: add scylla-api-client PIP package
Add scylla-api-client PIP package, and install into scylla-python3
package.

See scylladb/scylla-machine-image#340

Closes #10728

[avi: regenerate frozen toolchain]

Closes #10732
2022-06-07 09:43:50 +03:00
Takuya ASADA
ad2344a864 scylla_coredump_setup: support new format of Storage field
Storage field of "coredumpctl info" changed at systemd-v248, it added
"(present)" on the end of line when coredump file available.

Fixes #10669

Closes #10714
2022-06-07 02:21:32 +03:00
Avi Kivity
e0670f0bb5 Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.

Closes #10612

* github.com:scylladb/scylla:
  memtable: Add counters for tombstone compaction
  memtable, cache: Eagerly compact data with tombstones
  memtable: Subtract from flushed memory when cleaning
  mvcc: Introduce apply_resume to hold state for partition version merging
  test: mutation: Compare against compacted mutations
  compacting_reader: Drop irrelevant tombstones
  mutation_partition: Extract deletable_row::compact_and_expire()
  mvcc: Apply mutations in memtable with preemption enabled
  test: memtable: Make failed_flush_prevents_writes() immune to background merging
2022-06-07 02:17:09 +03:00
Tomasz Grabiec
0bc45f9666 memtable: Add counters for tombstone compaction 2022-06-06 19:25:41 +02:00
Tomasz Grabiec
beadd248e3 memtable, cache: Eagerly compact data with tombstones
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
9135d1fd1f memtable: Subtract from flushed memory when cleaning
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.

This will start happening after tombstones are compacted with rows on
partition version merging.

This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
989ef88e26 mvcc: Introduce apply_resume to hold state for partition version merging
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purley within versions. Consider applying a partition tombstone over
large number of rows.

This patch introduces apply_rows object to hold the necessary state to
make sure forward progress in case of preemption.

No change in behavior yet.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
374234cf76 test: mutation: Compare against compacted mutations
Memtables and cache will compact eagerly, so tests should not expect
readers to produce exact mutations written, only those which are
equivalant after applying copmaction.
2022-06-06 19:25:40 +02:00
Tomasz Grabiec
604e720706 compacting_reader: Drop irrelevant tombstones
The compacting reader created using make_compacting_reader() was not
dropping range_tombstone_change fragments which were shadowed by the
partition tombstones. As a result the output fragment stream was not
minimal.

Lack of this change would cause problems in unit tests later in the
series after the change which makes memtables lazily compact partition
versions. In test_reverse_reader_reads_in_native_reverse_order we
compare output of two readers, and assume that compacted streams are
the same. If compacting reader doesn't produce minimal output, then
the streams could differ if one of them went through the compaction in
the memtable (which is minimal).
2022-06-06 19:23:37 +02:00
Tomasz Grabiec
080c403d0b mutation_partition: Extract deletable_row::compact_and_expire() 2022-06-06 19:23:37 +02:00
Tomasz Grabiec
0e3c4fc641 mvcc: Apply mutations in memtable with preemption enabled
Preerequisite for eagerly applying tombstones, which we want to be
preemptible. Before the patch, apply path to the memtable was not
preemptible.

Because merging can now be defered, we need to involve snapshots to
kick-off background merging in case of preemption. This requires us to
propagate region and cleaner objects, in order to create a snapshot.
2022-06-06 19:23:37 +02:00
Tomasz Grabiec
0e78ad50ea test: memtable: Make failed_flush_prevents_writes() immune to background merging
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.

Fix by triggering soft pressure on retries.
2022-06-06 19:23:37 +02:00
Alejo Sanchez
98061c8960 test.py: shutdown connection manually
To prevent async scheduling issues of reconnection after tests are done,
manually close the connection after fixture ends.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-03 12:09:18 +02:00
Alejo Sanchez
17afcff228 test.py: fix port type passed to Cassandra driver
Port is expected to be int, not str. Using a str causes errors for
exception formatting.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-06-03 12:09:06 +02:00
Botond Dénes
605ee74c39 Merge 'sstables: save Scylla version & build id in metadata' from Michael Livshin
To provide a reasonably-definitive answer to "what exact version of
Scylla wrote this?".

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #10712

* github.com:scylladb/scylla:
  docs: document recently-added Scylla sstable metadata sections
  sstables: save Scylla version & build id in metadata
  scylla_sstable: generalize metadata visitor for disk_string
  build_id: cache the value
2022-06-03 07:49:51 +03:00
Botond Dénes
49215fcff7 Merge 'Remove flat_mutation_reader (v1)' from Michael Livshin
- Introduce a simpler substitute for `flat_mutation_reader`-resulting-from-a-downgrade that is adequate for the remaining uses but is _not_ a full-fledged reader (does not redirect all logic to an `::impl`, does not buffer, does not really have `::peek()`), so hopefully carries a smaller performance overhead. The name `mutation_fragment_v1_stream` is kind of a mouthful but it's the best I have
- (not tests) Use the above instead of `downgrade_to_v1()`
- Plug it in as another option in `mutation_source`, in and out
- (tests) Substitute deliberate uses of `downgrade_to_v1()` with `mutation_fragment_v1_stream()`
- (tests) Replace all the previously-overlooked occurrences of `mutation_source::make_reader()` with  `mutation_source::make_reader_v2()`, or with `mutation_source::make_fragment_v1_stream()` where deliberate or still required (see below)
- (tests) This series still leaves some tests with `mutation_fragment_v1_stream` (i.e. at v1) where not called for by the test logic per se, because another missing piece of work is figuring out how to properly feed `mutation_fragment_v2` (i.e. range tombstone changes) to `mutation_partition`.  While that is not done (and I think it's better to punt on it in this PR), we have to produce `mutation_fragment` instances in tests that `apply()` them to `mutation_partition`, thus we still use downgraded readers in those tests
- Remove the `flat_mutation_reader` class and things downstream of it

Fixes #10586

Closes #10654

* github.com:scylladb/scylla:
  fix "ninja dev-headers"
  flat_mutation_reader ist tot
  tests: downgrade_to_v1() -> mutation_fragment_v1_stream()
  tests: flat_reader_assertions: refactor out match_compacted_mutation()
  tests: ms.make_reader() -> ms.make_fragment_v1_stream()
  repair/row_level: mutation_fragment_v1_stream() instead of downgrade_to_v1()
  stream_transfer_task: mutation_fragment_v1_stream() instead of downgrade_to_v1()
  sstables_loader: mutation_fragment_v1_stream() instead of downgrade_to_v1()
  mutation_source: add ::make_fragment_v1_stream()
  introduce mutation_fragment_v1_stream
  tests: ms.make_reader() -> ms.make_reader_v2()
  tests: remove test_downgrade_to_v1_clear_buffer()
  mutation_source_test: fix indentation
  tests: remove some redundant calls to downgrade_to_v1()
  tests: remove some to-become-pointless ms.make_reader()-using tests
  tests: remove some to-become-pointless reader downgrade tests
2022-06-03 07:26:29 +03:00
Michael Livshin
9a541c7c58 docs: document recently-added Scylla sstable metadata sections
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-02 19:40:52 +03:00
Kamil Braun
72f629c2b6 test: cdc_enable_disable_test: remove non-determinism
The test sometimes fails because the order of rows in the SELECT results
depends on how stream IDs for the different partition keys get generated.
In some runs the stream ID for pk=1 may go before the stream ID for
pk=4, in some runs the other way.

The fix is to use the same partition key but different clustering keys
for the different rows.

Refs: #10601

Closes #10718
2022-06-02 19:40:07 +03:00
Botond Dénes
0a25a2bff3 sstables: validate_checksums(): more readable checksum mismatch messages
Replace:

    Compressed chunk checksum mismatch at chunk {}, offset {}, for chunk of size {}: expected={}, actual={}

With:

    Compressed chunk checksum mismatch at offset {}, for chunk #{} of size {}: expected={}, actual={}

This is a follow-up for #10693. Also bring the uncompressed chunk
checksum check messages up to date with the compressed one (which #10693
forgot to do).
Another change included is merging the advancement of the chunk index
with the iteration over the chunks, so we don't maintain two counters
(one in the iterator and an explicit one).

Closes #10715
2022-06-02 19:38:39 +03:00
Anna Stuchlik
a309c2a1b6 conf: update the description of the seeds parameter in scylla.yaml
Closes #10719
2022-06-02 18:45:11 +03:00
Avi Kivity
bfa8a8efb7 cql3: column_condition: deinline constructor
It will be easier to mangle it later out of the header.
2022-06-02 13:11:05 +03:00
Avi Kivity
4f7cbbb54c cql3: column_condition: rename column member
Prefix with underscore as a data member.
2022-06-02 13:09:54 +03:00
Michael Livshin
fc1b957367 sstables: save Scylla version & build id in metadata
To provide a reasonably-definitive answer to "what exact version of
Scylla wrote this?".

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-02 11:21:05 +03:00
Michael Livshin
b60bc8bb8a scylla_sstable: generalize metadata visitor for disk_string
Some metadata fields have interesting types, and some are just
strings.  There can be more than one string field, which the visitor
would not be able to distinguish from one another by type alone, so no
reason to make `scylla_metadata::sstable_origin` special.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-02 11:21:05 +03:00
Michael Livshin
80c9455413 build_id: cache the value
The CPU cost of iterating over the relevant ELF structures is probably
negligible (despite the amount of code involved), but there is no need
to keep the containing page mapped in RAM when it doesn't have to be.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-02 11:21:05 +03:00
Avi Kivity
f5062f4b5a Merge 'Use generation_type for SSTable ancestors' from Raphael "Raph" Carvalho
To avoid a discrepancy about underlying generation type once something other than integer is allowed for the sstable generation.
Also simplifies one generic writer interface for sealing sstable statistics.

Closes #10703

* github.com:scylladb/scylla:
  sstables: Use generation_type for compaction ancestors
  sstables: Make compaction ancestors optional when sealing statistics
2022-06-01 19:55:08 +03:00
Avi Kivity
7debf6780c cql3: expr: drop prepare_binop_lhs()
It is now just a thin wrapper around try_prepare_expression(), so
replace it with that.
2022-06-01 18:58:14 +03:00
Avi Kivity
76e0dc66e5 cql3: expr: move implementation of prepare_binop_lhs() to try_prepare_expression()
This unifies the left-hand-side and right-hand-side of expression preparation.
The contents of the visitor in prepare_binop_lhs() is moved to the
visitor in try_prepare_expression(). This usually replaces an
on_internal_error() branch.

An exception is tuple_constructor, which is valid in both the left-hand-side
and right-hand-side (e.g. WHERE (x, y) IN (?, ?, ?)). We previously
enhanced this case to support not having a a column_specification, so
we just delete the branch from prepare_binop_lhs.
2022-06-01 18:58:14 +03:00
Avi Kivity
046abc4323 cql3: expr: use recursive descent when preparing subscripts
When encountering a subscript as the left-hand-side of a binary operator,
we assume the subscripted value is a column and process it directly.

As a step towards de-specializing the left-hand-side of binary operators,
use recursive descent into prepare_binop_lhs() instead. This requires
generating a column_specification for arbitrary expressions, so we
add a column_specification_of() function for that. Currently it will
return a good representation for columns (the only input allowed by
the grammar) and a bad representation (the text representation of the
expression) for other expressions. We'll have to improve that when we
relax the grammar.
2022-06-01 18:58:12 +03:00
Avi Kivity
747a1dd244 cql3: expr: allow prepare of tuple_constructor with no receiver
Currently the only expression form that can appear on both the left
hand side of an expression and the right hand side is a tuple constructor,
so consequently it must support both modes of type processing - either
deriving the type from the expression, or imposing a type on the expression.
As an example, in

    WHERE (A, B) = (:a, :b)

the first tuple derives its type from the column types, while the
second tuple has the type of the first tuple imposed on it.

So, we adjust tuple_constructor_prepare_nontuple to support both forms.
This means allowing the receiver not to be present, and calculating the
tuple type if that is the case.
2022-06-01 18:48:55 +03:00
Avi Kivity
b1c8fd8fa5 cql3: expr: drop no longer used printable_relation parameter from prepare_binop_lhs()
Inching ever closer to unifying the two expression preparation variants.
2022-06-01 18:48:03 +03:00
Avi Kivity
4e0a089f3e cql3: expr: print only column name when failing to resolve column
resolve_column() is part of the prepare stage, and tries to
resolve a column name in a query against the table's columns.

If it fails, it prints the containing binary_expression as
context. However, that's unnecessary - the unresolved
column name is sufficient context. So print that.

The motivation is to unify preparation of binary_operator
left-hand-side and right-hand-side - prepare_expression()
doesn't have the extra parameter and it wouldn't make sense
to add it, as expressions might not be children of binary_operators.
2022-06-01 18:48:03 +03:00
Avi Kivity
9e213d979f cql3: expr: pass schema to prepare_expression
Currently prepare_expression is never used where a schema is needed -
it is called for the right-hand-side of binary operators (where we
don't accept columns) or for attributes like WRITETIME or TTL. But
when we unify expression preparation it will need to handle columns
too, and these need the schema to look up the column.

So pass the schema as a parameter. It is optional (a pointer) since
not all contexts will have a schema (for example CREATE AGGREGATE).
2022-06-01 18:48:03 +03:00
Avi Kivity
9a81285206 cql3: expr: prepare_binary_operator: drop unused argument ctx
This brings the calling convention closer to prepare_expression
so we can unify them.
2022-06-01 18:48:03 +03:00
Avi Kivity
9deabdfbf4 cql3: expr: stub type inference for prepare_expression
In CQL (and SQL) types flow in different directions in expression
components. In an expression

  A[:x] = :y

The type of A is known, the type of :x is derived from the type of A,
and the type of :y is derived from the type of A[:x].

Currently prepare_expression() only supports the second mode - an
expression's type is dictated by its caller via the column_specification
parameter. But this means it can only be used to evaluate the
right-hand-side of binary expressions, since the left-hand-side uses
the first mode, where the type is derived from the column, not
imposed by the caller.

To support both modes, make the column_specification parameter optional
(it is already a pointer so just accept null) and also make the returned
expression optional, to indicate failure to infer the type if the
column_specification was not given.

This patch only arranges for the new calling convention (as a new
try_prepare_expression call), it does not actually implement anything.
2022-06-01 18:48:03 +03:00
Avi Kivity
10aa6ddca3 cql3: expr: introduce type_of() to fetch the type of an expression
For most types, we just return the type field. A few expressions have
other methods to access the type, and some expressions cannot survive
prepare and so calling type_of() on them is illegal.
2022-06-01 18:47:58 +03:00
Avi Kivity
43a3c94532 cql3: expr: keep type information in casts
Currently, preparing a cast drops the cast completely (as the
types are verified to be binary compatibile). This means we lose
the casted-to type. Since we wish to keep type infomation, keep the
cast in the prepared expression tree (and therefore the casted-to
type).

Once we do that, we must extend evaluate() to support cast
expressions.
2022-06-01 18:46:55 +03:00
Avi Kivity
0a4a8c6b92 cql3: expr: add type field to subscript, field_selection, and null expressions
Almost all expressions either already have a type field or
have an O(1) way of reaching the type (for example, column_value
can access the type via its column_definition).

Add a type field to the few expression types that don't already
have it. Since prepare_expr() doesn't yet generate these expressions,
we don't have any place to populate it, so it remains null.
2022-06-01 18:45:56 +03:00
Avi Kivity
d984ea1b7a cql3: expr: cast: use data_type instead of cql3_type for the prepared form
A cast expression naturally includes a data type indicating what type
we are casting into. Right now the prepared form uses cql3_type.
Change it to data_type which is what other expressions use to reduce
friction. Since cql3_type is a thin wrapper around data_type, the
change is minimal.

The change propagates to selectable::with_cast, but again it is
minimal.
2022-06-01 12:19:53 +03:00
Michael Livshin
632b4e5a9a fix "ninja dev-headers"
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
029508b77c flat_mutation_reader ist tot
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
2a91323051 tests: downgrade_to_v1() -> mutation_fragment_v1_stream()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
eabe568d1c tests: flat_reader_assertions: refactor out match_compacted_mutation()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
a08ee649fc tests: ms.make_reader() -> ms.make_fragment_v1_stream()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
7a11a22cd6 repair/row_level: mutation_fragment_v1_stream() instead of downgrade_to_v1()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
8305ac26ca stream_transfer_task: mutation_fragment_v1_stream() instead of downgrade_to_v1()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
00bee4e0b3 sstables_loader: mutation_fragment_v1_stream() instead of downgrade_to_v1()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
00b2e7b2c5 mutation_source: add ::make_fragment_v1_stream()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
1b98692c8c introduce mutation_fragment_v1_stream
At this point, none of the remaining uses of
`flat_mutation_reader` (all of which are results of calling
`downgrade_to_v1()` anyway) actually need a full-featured flat
mutation reader with its own separate buffer etc.

`mutation_fragment_v1_stream` can only be constructed by wrapping a
`flat_mutation_reader_v2`, contains enough functionality for the
remaining consumers of `mutation_fragment_v1` sources and unit tests
and no more, and does not buffer.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
d137b32994 tests: ms.make_reader() -> ms.make_reader_v2()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
1a9e0ed73d tests: remove test_downgrade_to_v1_clear_buffer()
The projected limited replacement of downgraded v1 mutation reader
will not do its own buffering, so this test will be pointless.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
66ceb32612 mutation_source_test: fix indentation
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
b9ada78ec2 tests: remove some redundant calls to downgrade_to_v1()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
63a61ccaad tests: remove some to-become-pointless ms.make_reader()-using tests
mutation_source are going to be created only from v2 readers and the
::make_reader() method family is scheduled for removal.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
b288cc4f9f tests: remove some to-become-pointless reader downgrade tests
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Raphael S. Carvalho
2a7eb16c02 sstables: Use generation_type for compaction ancestors
Let's also use generation_type for compaction ancestors, so once we
support something other than integer for SSTable generation, we
won't have discrepancy about what the generation type is.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-05-31 15:28:02 -03:00
Raphael S. Carvalho
d36604703f sstables: Make compaction ancestors optional when sealing statistics
Compaction ancestors is only available in versions older than mx,
therefore we can make it optional in seal_statistics(). The motivation
is that mx writer will no longer call sstable::compaction_ancestors()
which return type will be soon changed to type generation_type, so the
returned value can be something other than an integer, e.g. uuid.
We could kill compaction_ancestors in seal_statistics interface, but
given that most generic write functions still work for older versions,
if there were still a writer for them, I decided to not do it now.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-05-31 15:26:03 -03:00
Calle Wilund
adda43edc7 CDC - do not remove log table on CDC disable
Fixes #10489

Killing the CDC log table on CDC disable is unhelpful in many ways,
partly because it can cause random exceptions on nodes trying to
do a CDC-enabled write at the same time as log table is dropped,
but also because it makes it impossible to collect data generated
before CDC was turned off, but which is not yet consumed.

Since data should be TTL:ed anyway, retaining the table should not
really add any overhead beyond the compaction to eventually clear
it. And user did set TTL=0 (disabled), then he is already responsible
for clearing out the data.

This also has the nice feature of meshing with the alternator streams
semantics.

Closes #10601
2022-05-31 19:07:07 +03:00
Konstantin Osipov
94a192a7aa Revert "test.py: temporarily disable raft"
This reverts commit 26128a222b.

The issue the commit depends on is fixed, so enable raft back.

Closes #10694
2022-05-31 14:39:26 +03:00
Avi Kivity
41b098f54e Udpate tools/jmx submodule (jackson dependency update)
* tools/jmx 53f7f55...fe351e8 (1):
  > Update jackson dependency
2022-05-31 13:46:46 +03:00
Mikołaj Sielużycki
bc18e97473 sstable_writer: Fix mutation order violation
The change
- adds a test which exposes a problem of a peculiar setup of
tombstones that trigger a mutation fragment stream validation exception
- fixes the problem

Applying tombstones in the order:

range_tombstone_change pos(ck1), after_all_prefixed, tombstone_timestamp=1
range_tombstone_change pos(ck2), before_all_prefixed, tombstone=NONE
range_tombstone_change pos(NONE), after_all_prefixed, tombstone=NONE

Leads to swapping the order of mutations when written and read from
disk via sstable writer. This is caused by conversion of
range_tombstone_change (in memory representation) to range tombstone
marker (on disk representation) and back.

When this mutation stream is written to disk, the range tombstone
markers type is calculated based on the relationship between
range_tombstone_changes. The RTC series as above produces markers
(start, end, start). When the last marker is loaded from disk, it's kind
gets incorrectly loaded as before_all_prefixed instead of
after_all_prefixed. This leads to incorrect order of mutations.

The solution is to skip writing a new range_tombstone_change with empty
tombstone if the last range_tombstone_change already has empty
tombstone. This is redundant information and can be safely removed,
while the logic of encoding RTCs as markers doesn't handle such
redundancy well.

Closes #10643
2022-05-31 13:39:48 +03:00
Piotr Sarna
7169e021e5 Merge 'cql3: support list subscripts in WHERE clause' from Avi Kivity
I noticed that `column_condition` (used in LWT `IF` clause) supports lists.
As part of the Grand Expression Unification we'll need to migrate that to
expressions, so we'll need to support list subscripts.

Use the opportunity to relax the normal filtering to allow filtering on
list subscripts: `WHERE my_list[:index] = :value`.

Closes #10645

* github.com:scylladb/scylla:
  test: cql-pytest: add test for list subscript filtering
  doc: document list subscripts usable in WHERE clause
  cql3: expr: drop restrictions on list subscripts
  cql3: expr: prepare_expr: support subscripted lists
  cql3: expressions: reindent get_value()
  cql3: expression: evaluate() support subscripting lists
2022-05-31 09:28:52 +02:00
Avi Kivity
4b53af0bd5 treewide: replace parallel_for_each with coroutine::parallel_for_each in coroutines
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurences (i.e. caller is a coroutine).

One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.

Closes #10699
2022-05-31 09:06:24 +03:00
Botond Dénes
02608bec9d Update tools/java submodule
* tools/java a4573759a2...d4133b54c9 (1):
  > removeNode: Remove other alias for --ignore-dead-nodes
2022-05-31 07:56:54 +03:00
Botond Dénes
660921eb22 Merge 'Two improvements to configure.py' from Nadav Har'El
This two-patch series makes two improvements to configure.py:

The first patch fixes, yet again, issue #4706 where interrupting ninja's rebuild of build.ninja can leave it without any build.ninja at all. The patch uses a different approach from the previous pull-request #10671 that aimed to solve the same problem.

The second patch makes the output of configure.py more reproducible, not resulting in a different random order every time. This is useful especially when debugging configure.py and wanting to check if anything changed in its output.

Closes #10696

* github.com:scylladb/scylla:
  configure.py: make build.ninja the same every time
  configure.py: don't delete build.ninja when rebuild is interrupted
2022-05-31 06:35:16 +03:00
Avi Kivity
248cdf0e34 test: cql-pytest: add test for list subscript filtering
Test match and mismatch, as well as out of bound cases.
2022-05-30 20:47:47 +03:00
Nadav Har'El
e85bd37c6e Update seastar submodule
* seastar 96bb3a1b8...2be9677d6 (37):
  > Merge 'stream_range_as_array: always close output stream' from Benny Halevy

Fixes #10592

  > net/api: add "server_socket::is_listening()"
  > src/net/proxy: remove unused variable
  > coroutine: parallel_for_each: relax contraints
  > native-stack: do not use 0 as ip address if !_dhcp
  > coroutine: fix a typo in comment
  > std-coroutine: include for LLVM-14
  > tutorial: use non-variadic version of when_all_succeed()
  > scripts: Fix build.sh to use new --c++-standard config option
  > core/thread: initialize work::pr and work::th explicitly
  > util/log-impl: remove "const" qualifier in return type
  > map_reduce: remove redundant move() in return statement
  > util: mark unused parameter with [[maybe_unused]]
  > drop unused parameters
  > build: use "20" for the default CMAKE_CXX_STANDARD
  > build: make CMAKE_CXX_STANDARD a string
  > utils: log: don't crash on allocation failure while extending log buffer
  > tests: unix_domain_test: fix thread/future confusion in client_round()
  > compat: do not use std::source_location if it is broken
  > build: use CMAKE_CXX_STANDARD instead of Seastar_CXX_DIALECT
  > Merge 'Add hello-world demo from tutorial' from Pavel
  > rpc_tester: Put client/server sides into correct sched groups
  > reactor_backend: Use _r reference, not engine() method
  > future.hh: #include std-compat.hh for SEASTAR_COROUTINES_ENABLED
  > Merge "Add more CPU-hog facilities to RPC-tester" from Pavel E
  > Merge "io: Enlighten queued_request" from Pavel E
  > Correct swapped AIO detection/setup calls
  > sharded: De-duplicate map-reduce overloads
  > file: don't trample on xfs flags when setting xfs size hint
  > Merge "Per-class IO bandwidth limits" from Pavel E
  > Merge 'sstring: fix format and optimize the performance of sstring::find().' from Jianyong Chen
  > reactor_backend: Mark reactor_backend_aio::poll() private
  > scripts/build.sh: Mind if not running on a terminal
  > test, rpc: Don't work with large buffers
  > test, futures: Don't expect ready future to resolve immediately
  > source_location compatibility: Fix an unused private field error when treat warning as errors
  > file: Remove try-catch around noexcept calls
2022-05-30 17:46:32 +03:00
Pavel Emelyanov
7f2837824e system_keyspace: Save coroutine's captured variable on stack
Currently it works, but the newer version of seastar's map_reduce()
is compiled in a way to trigger use-after-free on accessing captured
value.

tests: unit(dev), unit.alternator(debug on v1)

Fixes #10689

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220523095409.6078-1-xemul@scylladb.com>
2022-05-30 17:46:32 +03:00
Botond Dénes
3a943b23fb sstables: validate_checksums(): add chunk index to error message
When logging a failed checksum on a compressed chunk.
Currently, only the offset is logged, but the index of the chunk whose
checksum failed to validate is also interesting.

Closes #10693
2022-05-30 17:11:28 +03:00
Nadav Har'El
84e1fa0513 configure.py: make build.ninja the same every time
In several places, configure.py uses unsorted sets which results in
its output being in different order every time - both a different
order of targets, and a different order in dependencies of each
target.

This is both strange, and annoying when trying to debug configure.py
and trying to understand when, if at all, its output changes.

So in this patch, we use "sorted(...)" in the right places that
are needed to guarantee a fixed order. This fixed order is alphabetical,
but that's not the goal of this patch - the goal is to ensure a fixed
order.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-30 16:20:37 +03:00
Nadav Har'El
8db9e62de9 configure.py: don't delete build.ninja when rebuild is interrupted
In commit 9cc9facbea, I fixed issue #4706.
That issue about what happens when interrupting a rebuild of build.ninja
(which happens automatically when you run "ninja" after configure.py
changed). We don't want to leave behind a half-built build.ninja,
or leave it deleted.

The solution in that commit was for configure.py to build a temporary file
(build.ninja.tmp), and only as the very last step rename it build.ninja.

Unfortunately, since that time, we added more last steps after what
used to be that very last step :-(

If this new code running after the rename takes a noticable amount of
time, and if the user is unlucky enough to interrupt it during that
time, ninja will see a modified output file (build.ninja) and a failed
rule, and will delete the output file!

The solution is to move the rename out of configure.py. Instead, we
add a "--out=filename" option to configure.py which allows it to write
directly to a different file name, not build.ninja. When rebuilding
build.ninja, the rule will now call configure.py with "--out=build.ninja.new"
and then rename it back to build.ninja. Any failure or interrupt at any
stage of configure.py will leave build.ninja untouched, so ninja will
not delete it - it will just delete the temporary build.ninja.new.

Fixes #4706 (again)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-30 16:17:41 +03:00
Kamil Braun
78f81171ba Merge 'raft: test non-voters in randomized_nemesis_test' from Kamil Braun
We modify the `reconfigure` and `modify_config` APIs to take a vector of
<server_id, bool> pairs (instead of just a vector of server_ids), where
the bool indicates whether the server is a voter in the modified config.

The `reconfiguration` operation would previously shuffle the set of
servers and split it into two parts: members and non-members. Now it
partitions it into three parts: voters, non-voters, and non-members.

The PR also includes fixes for some liveness problems stumbled upon
during testing.

Closes #10640

* github.com:scylladb/scylla:
  test: raft: randomized_nemesis_test: include non-voters during reconfigurations
  raft: server: if `add_entry` with `wait_type::applied` successfully returns, ensure `state_machine::apply` is called for this entry
  raft: tracker: fix the definition of `voters()`
  raft: when printing `raft::server_address`, include `can_vote`
2022-05-30 15:06:35 +02:00
Raphael S. Carvalho
0307cdd2bf compaction: Fix incremental compaction logging
The messages only dumps the last sealed fragment, but it should dump
all the output fragments replacing the exhausted input ones.

Let's print origin of output fragments, so we can differ between
files with compaction and garbage-collection origin.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220524232232.119520-1-raphaelsc@scylladb.com>
2022-05-30 15:58:14 +03:00
Botond Dénes
d71c865344 Merge "Fix snitching on Azure" from Pavel Emelyanov
"
Azure snitch tries to replicate db/rack info from all shards to all
other shards. This may lead to use-after-free when shard A gets "this"
from shard B, starts copying its _dc field and the shard A destructs
its _dc from under B because it's receiving one from shard C.

Also polish replication code a little bit while at it.
"

* 'br-azure-snitch-serialize' of https://github.com/xemul/scylla:
  snitch: Use invoke_on_others() to replicate
  snitch: Merge set_my_dc and set_my_rack into one
  azure_snitch: Do nothing on non-io-cpu
2022-05-30 15:35:37 +03:00
Benny Halevy
32e79840ca tools: scylla-sstable: terminate error message with newline
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220523080747.2492640-1-bhalevy@scylladb.com>
2022-05-30 14:47:28 +03:00
Kamil Braun
6fc82be832 service: storage_service: remove get() call not in thread
Regression introduced by code movement in
89163a3be4.

Fixes #10679.

Closes #10680
2022-05-30 13:43:02 +03:00
Avi Kivity
8136f7bc4b doc: document list subscripts usable in WHERE clause 2022-05-30 13:29:49 +03:00
Avi Kivity
f9b3c6ddbd cql3: expr: drop restrictions on list subscripts
Restriction validation forbids lists (somewhat oddly, it talks about
indexes; validation should make a soft check about indexes (since it
can fall back to filtering) and a hard check about supported filtering
expressions), and enforces a map in another place. Remove the first
restriction and relax the second to allow lists as well as maps as
subscript operands.

Some validation messages are adjusted to reflect that lists are supported.
2022-05-30 13:29:49 +03:00
Avi Kivity
35e0474410 cql3: expr: prepare_expr: support subscripted lists
Infer the type of a list index as int32_type.

The error message when a non-subscriptable type is provided is
changed, so the corresponding test is changed too.
2022-05-30 13:29:49 +03:00
Avi Kivity
8d667e374b cql3: expressions: reindent get_value()
Whitespace-only change.
2022-05-30 13:29:49 +03:00
Avi Kivity
05388f7a2a cql3: expression: evaluate() support subscripting lists
We already support subscripting maps (for filtering WHERE m[3] = 6),
so adding list subscript support is easy. Most of the code is shared.
Differences are:
 - internal list representation is a vector of values, not of key/values
 - key type is int32_type, not defined by map
 - need to check index bounds
2022-05-30 13:29:49 +03:00
Piotr Sarna
5d59c841d0 Merge 'alternator: add Describe operations even if a feature...
is not supported' from Nadav Har'El

This small series implements the DescribeTimeToLive and
DescribeContinuousBackups operations in Alternator. Even if the
corresponding features aren't implemented, it can help applications that
we implement just the Describe operation that can say that this feature
is in fact currently disabled.

Fixes  #10660

Closes #10670

* github.com:scylladb/scylla:
  alternator: remove dead code
  alternator: implement DescribeContinuousBackups operation
  alternator: allow DescribeTimeToLive even without TTL enabled
2022-05-30 09:26:13 +02:00
Nadav Har'El
63132431e8 test/cql-pytest: reproducers for three scan bugs
This patch contains five tests which reproduce three old bugs in
Scylla's handling of multi-column restrictions like (c1,c2)<(1,2).

These old bugs are:

    Refs #64 (yes, a two-digit issue!)
    Refs #4244
    Refs #6200

The three github issues are closely intertwined, exposing the same
or similar bugs in our internal implementation, and I suspect that
eventually most of them could be fixed together.

In writing these tests, I carefully read all three issues and the
various failure scenarios described in them, tried to distill and
simplify the scenarios, and also consider various other broken
variants of the scenarios. The resulting tests are heavily commented,
explaining the motivation of each test and exactly which of the
above bugs it reproduces.

All five tests included in this patch pass on Cassandra and currently
fail on Scylla, so are marked "xfail".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10675
2022-05-30 09:25:00 +02:00
Kamil Braun
4c3678e2a0 gms: gossiper: fix direct_fd_pinger::_generation_number initialization
It's an `int64_t` that needs to be explicitly initialized, otherwise the
value is undefined.

This is probably the cause of #10639, although I'm not sure - I couldn't
reproduce it (the bug is dependent on how the binary is compiled, so
that's probably it). We'll see if it reproduces with this fix, and if
it will, close the issue.

Closes #10681
2022-05-29 13:08:09 +03:00
Avi Kivity
6bcbe3157a Merge "Turn table::config::sstables_manager* into table::sstables_manager&" from Pavel E
"
The table keeps references on sstables_ and compaction_ managers
(among other things), but the latter sits as a pointer on table's
config while the former -- as on-table direct reference.

This set unifies both by turning sstables manager on-config pointer
into on-table reference.

branch: https://github.com/xemul/scylla/tree/br-table-vs-sstables-manager
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/574/
"

* 'br-table-vs-sstables-manager' of https://github.com/xemul/scylla:
  tests: Remove sstables_manager& from column_family_test_config()
  table: Move sstables_manager from config onto table itself
  table, db, tests: Pass sstables_manager& into table constructor
2022-05-29 13:02:50 +03:00
Kamil Braun
ef7643d504 service: raft: raft_group0: don't call _abort_source.request_abort()
`raft_group0` does not own the source and is not responsible for calling
`request_abort`. The source comes from top-level `stop_signal` (see
main.cc) and that's where it's aborted.

Fixes #10668.

Closes #10678
2022-05-27 16:37:07 +02:00
Nadav Har'El
52362de3df test/cql-pytest: tests for assigning an empty string to non-string
In issues #7944 and #10625 it was noticed that by assigning an empty
string to a non-string type (int, date, etc.) using INSERT or
INSERT JSON, some combinations of the above can create "empty" values
while they should produce a clear error.

The tests added in this patch explore the different combinations of
types and insert modes, and reproduce several buggy cases in Scylla
(resulting in xfail'ing tests) as well as Cassandra.

We feared that there might be a way using those buggy statements to
create a partition with an empty key - something which used to kill
older versions of Scylla. But the tests show that this is not possible -
while a user can use the buggy statements to create an empty value,
Scylla refuses it when it is used as a single-column partition key.

Refs #10625
Refs #7944

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10628
2022-05-27 16:37:01 +02:00
Avi Kivity
c83393e819 messaging: do isolate default tenants
In 10dd08c9 ("messaging_service: supply and interpret rpc isolation_cookies",
4.2), we added a mechanism to perform rpc calls in remote scheduling groups
based on the connection identity (rather than the verb), so that
connection processing itself can run in the correct group (not just
verb processing), and so that one verb can run in different groups according
to need.

In 16d8cdadc ("messaging_service: introduce the tenant concept", 4.2), we
changed the way isolation cookies are sent:

 scheduling_group
 messaging_service::scheduling_group_for_verb(messaging_verb verb) const {
     return _scheduling_info_for_connection_index[get_rpc_client_idx(verb)].sched_group;
@@ -665,11 +694,14 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
     if (must_compress) {
         opts.compressor_factory = &compressor_factory;
     }
     opts.tcp_nodelay = must_tcp_nodelay;
     opts.reuseaddr = true;
-    opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    // We send cookies only for non-default statement tenant clients.
+    if (idx > 3) {
+        opts.isolation_cookie = _scheduling_info_for_connection_index[idx].isolation_cookie;
+    }

This effectively disables the mechanism for the default tenant. As a
result some verbs will be executed in whatever group the messaging
service listener was started in. This used to be the main group,
but in 554ab03 ("main: Run init_server and join_cluster inside
maintenance scheduling group", 4.5), this was change to the maintenance
group. As a result normal read/writes now compete with maintenance
operations, raising their latency significantly.

Fix by sending the isolation cookie for all connections. With this,
a 2-node cassandra-stress load has 99th percentile increase by just
3ms during repair, compared to 10ms+ before.

Fixes #9505.

Closes #10673
2022-05-27 16:36:57 +02:00
Michał Radwański
906cee7052 treewide: remove unqualified calls to std::move
clang 15 emits such a warning:

cql3/statements/raw/parsed_statement.cc:46:16: error: unqualified call to 'std::move' [-Werror,-Wunqualified-std-cast-call]
    , warnings(move(warnings))
               ^
               std::
cql3/statements/raw/parsed_statement.cc:52:101: error: unqualified call to 'std::move' [-Werror,-Wunqualified-std-cast-call]
    : prepared_statement(statement_, ctx.get_variable_specifications(), partition_key_bind_indices, move(warnings))
                                                                                                    ^
                                                                                                    std::

Closes #10656
2022-05-27 16:36:49 +02:00
Tomasz Grabiec
e9fbc0b6c5 Merge 'test.py: add cluster and new approval tests' from Konstantin Osipov
The purpose of this series is to introduce infrastructure
for managed scylla processes into test.py,
switch some existing suites to use test.py managed processes
and introduce cluster tests.

All of this is expected to make possible to test Raft topology
changes and schema changes using an easy to use and fast tool
such as test.py. In general this will allow testing Scylla clusters
from within the development test harness.

Branch URL: kostja/test.py.v5

Closes #10406

* github.com:scylladb/scylla:
  test: disable topology/test_null
  test.py: disable cdc_with_lwt_test it's flaky in debug mode
  test.py: workaround for a python bug
  test: cleanup (drop keyspace) in two cql tests to support --repeat
  test.py: respect --verbose even if output is a tty
  test: remove tools/cql_repl
  test.py: switch cql/ suite to pytest/tabular output
  test: remove a flaky test case
  test.py: implement CQL approval tests over pytest
  test.py: implement cql_repl
  test.py: add topology suite
  test.py: add common utility functions to test/pylib
  test.py: switch cql-pytest and rest_api suites to PythonTestSuite
  test.py: introduce PythonTest and PythonTestSuite
  test.py: use artifact registry
  test.py: temporarily disable raft
  test.py: (pylib) add Scylla Server and Artifact Registry
  test.py: (pylib) add Host Registry to track used server hosts
  test.py: (pylib) add a pool of scylla servers (or clusters)
2022-05-27 16:36:08 +02:00
Pavel Emelyanov
3dab0bfc8d tests: Remove sstables_manager& from column_family_test_config()
It's unused arg in there after last patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-27 16:37:21 +03:00
Pavel Emelyanov
490bf65e11 table: Move sstables_manager from config onto table itself
The manager reference is already available in constructor and thus
can be copied to on-table member.

The code that chooses the manager (user/system one) should be moved
from make_column_family_config() into add_column_family() method.

Once this happens, the get_sstables_manager() should be fixed to
return the reference from its new location. While at it -- mark the
method in question noexcept and add it's mutable overload.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-27 16:37:21 +03:00
Pavel Emelyanov
50e6810536 table, db, tests: Pass sstables_manager& into table constructor
In core code there's only one place that constructs table -- in
database.cc -- and this place currently has the sstables_manager pointer
sitting on table config (despite it's a pointer, it's always non-null).

All the tests always use the manager from one of _env's out there.

For now the new contructor arg is unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-27 16:27:44 +03:00
Kamil Braun
2dafd99e3b test: raft: randomized_nemesis_test: include non-voters during reconfigurations
We modify the `reconfigure` and `modify_config` APIs to take a vector of
<server_id, bool> pairs (instead of just a vector of server_ids), where
the bool indicates whether the server is a voter in the modified config.

The `reconfiguration` operation would previously shuffle the set of
servers and split it into two parts: members and non-members. Now it
partitions it into three parts: voters, non-voters, and non-members.
2022-05-27 12:06:18 +02:00
Kamil Braun
d128f65354 raft: server: if add_entry with wait_type::applied successfully returns, ensure state_machine::apply is called for this entry
Previously it could happen that `add_entry` returned successfully but
`state_machine::apply` was never called by the server for this entry,
even though `wait_type::applied` was used, if the server loaded
a snapshot that contained this entry in just the right moment. Some
clients may find this behavior surprising, even though we may argue that
it's not technically incorrect.

For example, the nemesis test assumed that if `add_entry` returned
successfully (with `wait_type::applied`), the local state machine
applied the entry; the test uses `apply` to obtain an output - the
result of the command - from the state machine.

It's not a problem to give a stronger guarantee, so we do it in this
commit. In the scenario where a snapshot causes Raft to skip over the
entry, `add_entry` will finish exceptionally with
`commit_status_unknown`.
2022-05-27 12:06:18 +02:00
Kamil Braun
f31f73b1e8 raft: tracker: fix the definition of voters()
The previous implementation was weird, and it's not even clear if
the C++ standard defined what the result would be (because it used
`std::unordered_set::insert(iterator, iterator)`, where the iterators
pointed to a sequence of elements with elements that already had
equivalent elements in the set; cppreference does not specify which
elements end up in the set in this case).

In any case, in testing, the resulting set did not give the desired
result: if the configuration was joint, and a server was a voter in
the previous config but a non-voter in the current one, it would
not be a member of this set. This would cause the server to not vote for
itself when it became a candidate, which could lead to cluster
unavailability.

The new definition is simple: a server belongs to `voters()` iff it is
a voter in current or previous configuration. This fixes the problem
described above.

Fixes #10618.
2022-05-27 12:06:18 +02:00
Kamil Braun
6e5d1f4784 raft: when printing raft::server_address, include can_vote
Makes debugging easier when configurations include non-voters.
2022-05-27 12:06:18 +02:00
Nadav Har'El
e363febeb1 test/cql-pytest: reproducer for bug in index+filter+limit
This patch adds reproducing tests for wrong handling of LIMIT in a query
which uses a secondary index *and* filtering, described in issue #10649.

In that case, Scylla incorrectly limits the number of rows found in the
index *before* the filtering, while it should limit the number of rows
*after* the filtering.

The tests in this patch (which xfail on Scylla, and pass on Cassandra)
go beyond the minimum required to reproduce this bug. It turns out that
there are different sub-cases of this problem that go through different
code paths, namely whether the base table has clustering keys or just
partition keys, and whether the overall LIMITed result spans more than
one page. So these tests attempt to also cover all these sub-cases.
Without all these test sub-cases, an incomplete and incorrect fix of this
bug may, by chance, cause the original test to succeed.

Refs #10649

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10658
2022-05-26 16:46:45 +02:00
Tomasz Grabiec
edd447ed32 Merge 'raft: test read_barriers in randomized_nemesis_test' from Kamil Braun
Introduce a new operation, `raft_read`, which calls `read_barrier`
on a server, reads the state of the server's state machine, and returns
that state.

Extend the generator in `basic_generator_test` to generate `raft_read`s.
Only do it if forwarding is enabled (although it may make sense to test
read barriers in non-forwarding scenario as well - we may think about it
and do it in a follow-up).

Check the consistency of the read results by comparing them with the model
and using the result to extend the model with any newly observed elements.

The patchset includes some fixes for correctness (#10578)
and liveness (handling aborts correctly).

Closes #10561

* github.com:scylladb/scylla:
  test: raft: randomized_nemesis_test: check consistency of reads
  test: raft: randomized_nemesis_test: perform linearizable reads using read_barriers
  test: raft: randomized_nemesis_test: add flags for disabling nemeses
  raft: server: in `abort()`, abort read barriers before waiting for rpc abort
  raft: server: handle aborts correctly in `read_barrier`
  raft: fsm: don't advance commit index further than match_idx during read_quorum
2022-05-26 16:46:35 +02:00
Nadav Har'El
62b6179c88 alternator: remove dead code
Remove the function make_keyspace_name() that was never used.

We *could* have used this function, but we didn't, and it had
had an inconvenient API. If we later want to de-duplicate the
several copies of "executor::KEYSPACE_NAME_PREFIX + table_name"
we have in the code, we can do it with a better API.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-26 15:21:14 +03:00
Nadav Har'El
d0ca09a925 alternator: implement DescribeContinuousBackups operation
Although we don't yet support the DynamoDB API's backup features (see
issue #5063), we can already implement the DescribeContinuousBackups
operation. It should just say that continuous backups, and point-in-time
restores, and disabled.

This will be useful for client code which tries to inquire about
continuous backups, even if not planning to use them in practice
(e.g., see issue #10660).

Refs #5063
Refs #10660

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-26 15:13:50 +03:00
Konstantin Osipov
7b3f9bb5fa test: disable topology/test_null
Scylla has a bug that only fires ones in a hundred runs in debug mode
when a schema change parallel to a topology change leads to a lost
keyspace and internal error. Disable the tests until Raft is enabled for
schema.
2022-05-26 14:09:58 +03:00
Nadav Har'El
8ecf1e306f alternator: allow DescribeTimeToLive even without TTL enabled
We still consider the TTL support in Alternator to be experimental, so we
don't want to allow a user to enable TTL on a table without turning on a
"--experimental-features" flag. However, there is no reason not to allow
the DescribeTimeToLive call when this experimental flag is off - this call
would simply reply with the truth - that the TTL feature is disabled for
the table!

This is important for client code (such as the Terraform module
described in issue #10660) which uses DescribeTimeToLive for
information, even when it never intends to actually enable TTL.

The patch is trivial - we simply remove the flag check in
DescribeTimeToLive, the code works just as before.

After this patch, the following test now works on Scylla without
experimental flags turned on:

    test/alternator/run test_ttl.py::test_describe_ttl_without_ttl

Refs #10660

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-26 10:55:36 +03:00
Konstantin Osipov
5e67e48f8b test.py: disable cdc_with_lwt_test it's flaky in debug mode
The test is flaky in debug mode, see issue #10661 for details.
2022-05-26 09:23:13 +03:00
Konstantin Osipov
ad01840117 test.py: workaround for a python bug
Workaround for a Python3 bug which prevents a correct
exception printout when asyncio is used with logging on.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
180cc0fc4d test: cleanup (drop keyspace) in two cql tests to support --repeat 2022-05-25 20:26:42 +03:00
Konstantin Osipov
dd633cdb21 test.py: respect --verbose even if output is a tty
If output is a not a tty, verbose is set automatically.

If the output is a tty, one has to request --verbose.

However, a part of test.py verbosity was ignoring --verbose
and looking only at the terminal type.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
9512e076af test: remove tools/cql_repl
Remove tools/cql_repl from the source, build targets and
use in test.py. Superseded by ApprovalTest and
test/pylib/cql_repl/cql_repl.py.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
b82c8447d7 test.py: switch cql/ suite to pytest/tabular output 2022-05-25 20:26:42 +03:00
Konstantin Osipov
c43736c53b test: remove a flaky test case
When we append an entry to a list with the same user-defined
timestamp, the behaviour is actually undefined. If the append
is processed by the same coordinator as the one that accepted
the existing entry, then it gets the same timeuuid as the list key,
and replaces (potentially) the existing list valiue. Then it
gets a timeuuid which maybe both larger and smaller than the existing
key's timeuuid, and then turns either to an append or a prepend.

The part of the timestamp responsible for the result is the shard
id's spoof node address implemented in scope of fixing Scylla's
timeuuid uniqueness. When the test was implemented all spoof node ids
where 0 on all shards and all coordinators. Later the difference
in behaviour was dormant because cql_repl would always execute
the append on the same shard.

We could fix Scylla to use a zero spoof node address in case a user
timestamp is supplied, but the purpose of this is unclear, it
may actually be to the contrary of the user's intent.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
405517bc1b test.py: implement CQL approval tests over pytest
Before this patch, approval tests (test/cql/*) were using a C++
application called cql_repl, which is a seastar app running
Scylla, reading commands from the standard input and producing
results in Json format in the standard output. The rationale for
this was to avoid running a standalone Scylla which could leak
more resources such as open sockets.

Now that other suites already start and stop Scylla servers, it
makes more sense to run CQL commands in approval tests against an
existing running server. It saves us from building a one more
binary and allows to better format the output. Specifically, we
would like to see Scylla output in tabular format in approval
tests, which is difficult to do when C++ formatting libraries
are used.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
c119087719 test.py: implement cql_repl
Implement a pytest which would run CQL commands against
a scylla server and pretty print server output.

Will be used in existing Approval tests in subsequent patches.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
5b9262f567 test.py: add topology suite 2022-05-25 20:26:42 +03:00
Konstantin Osipov
26fa9336d1 test.py: add common utility functions to test/pylib 2022-05-25 20:26:42 +03:00
Konstantin Osipov
1955f9168a test.py: switch cql-pytest and rest_api suites to PythonTestSuite
Manage scylla servers for rest_api and cql-pytest suites
using PythonTestSuite. The pool size determines the max
number of servers test.py would run concurrently per
suite. For tiny suites (rest_api) the cost of starting
the servers overweights the cost of running tests so keep
it at a minimum. cql-pytest cas dozens of tests, so run them
in 4 parallel tracks.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
bc719209ee test.py: introduce PythonTest and PythonTestSuite
PythonTest and PythonTestSuite allow to use test.py-managed
scylla servers (or clusters) to run pytest tests.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
1a74828834 test.py: use artifact registry
Track running tests in the suite.
Cleanup after each suite (after all tests
in the suite end).
Cleanup all artifacts before exit.  Don't drop server logs if
there is at least one failed test.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
26128a222b test.py: temporarily disable raft
Raft boot sporadically hangs in the master due to
issue #10355.
2022-05-25 20:26:42 +03:00
Konstantin Osipov
7c1c83320c test.py: (pylib) add Scylla Server and Artifact Registry
Allow starting clusters of Scylla servers. Chain up the next
server start to the end of the previous one, and set the next
server's seed to the previous server.

As a workaround for a race between token dissemination through
gossip and streaming, change schema version to force a gossip
round and make sure all tokens end up at the joining node in time.

Make sure scylla start is not race prone.

auth::standard_role_manager creates "cassandra" role in an async loop
auth::do_after_system_ready(), which retries role creation with an
exponential back-off. In other words, even after CQL port is up, Scylla
may still be initializing.

This race condition could lead to spurious errors during cluster
bootstrap or during a test under CI.

When the role is ready, queries begin to work, so rely on this "side
effect".

To start or stop servers, use a new class, ScyllaCluster,
which encapsulates multiple servers united into a cluster.
In it, validate that a test case cleans up after itself.
Additionally, swallow startup errors and throw them when
the test is actually used.
2022-05-25 20:26:37 +03:00
Kamil Braun
6268c63739 test: raft: randomized_nemesis_test: check consistency of reads
The test would perform `read_barrier`s but not check the correctness
of the reads: whether the state observed by a read is consistent with
the model and recent enough (in short, check linearizability).

This commit adds the correctness checks.
2022-05-25 15:00:19 +02:00
Kamil Braun
6b2b400143 test: raft: randomized_nemesis_test: perform linearizable reads using read_barriers
Introduce a new operation, `raft_read`, which calls `read_barrier`
on a server, reads the state of the server's state machine, and returns
that state.

Extend the generator in `basic_generator_test` to generate `raft_read`s.
Only do it if forwarding is enabled (although it may make sense to test
read barriers in non-forwarding scenario as well - we may think about it
and do it in a follow-up).

For now, we don't check the consistency of the results of the reads.
They do return the observed state, but we don't compare it yet with the
model. For now we simply issue the reads concurrently with other
operations to introduce some more chaos to the cluster and check
liveness and consistency of existing operations.
2022-05-25 15:00:19 +02:00
Kamil Braun
4ea5807862 test: raft: randomized_nemesis_test: add flags for disabling nemeses
Makes it easier to debug stuff.
2022-05-25 15:00:16 +02:00
Kamil Braun
c8237d405e raft: server: in abort(), abort read barriers before waiting for rpc abort
`rpc::abort` may need to wait until all read barriers finish, so abort
read barrier before waiting for `rpc::abort` to finish to avoid a
deadlock on shutdown.

`rpc::abort` is still called before the read barriers are aborted, only
waited for after. Calling it first prevents new read barriers from being
started by `rpc` (see `rpc::abort` comment).

Also prevent new read barriers from being started after abort starts
directly on a leader by checking the `_aborted` flag at the beginning
of `execute_read_barrier`.

Finally, use the opportunity to remove some compiler-dependent code.
2022-05-25 14:56:32 +02:00
Kamil Braun
86c5036353 raft: server: handle aborts correctly in read_barrier
The `wait_for_apply` function, called from `read_barrier`, didn't handle
aborts. Fix that.
2022-05-25 14:56:32 +02:00
Kamil Braun
1eb849c3d7 raft: fsm: don't advance commit index further than match_idx during read_quorum
It's not safe to advance the commit index further than match_idx since
beyond that point the follower's log may be outdated.

Fixes #10578.
2022-05-25 14:56:32 +02:00
Konstantin Osipov
1b5472f0be test.py: (pylib) add Host Registry to track used server hosts 2022-05-25 14:59:01 +03:00
Konstantin Osipov
2781c4fc66 test.py: (pylib) add a pool of scylla servers (or clusters) 2022-05-25 14:59:01 +03:00
Avi Kivity
19316dee94 Merge 'auto-scale promoted index' from Benny Halevy
Add column_index_auto_scale_threshold_in_kb to the configuration (defaults to 10MB).

When the promoted index (serialized) size gets to this
threshold, it's halved by merging each two adjacent blocks
into one and doubling the desired_block_size.

Fixes #4217

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10646

* github.com:scylladb/scylla:
  sstables: mx: add pi_auto_scale_events metric
  sstables: mx/writer: auto-scale promoted index
2022-05-25 10:27:19 +03:00
Tomasz Grabiec
f87274f66a sstable: partition_index_cache: Fix abort on bad_alloc during page loading
When entry loading fails and there is another request blocked on the
same page, attempt to erase the failed entry will abort because that
would violate entry_ptr guarantees, which is supposed to keep the
entry alive.

The fix in 92727ac36c was incomplete. It
only helped for the case of a single loader. This patch makes a more
general approach by relaxing the assert.

The assert manifested like this:

scylla: ./sstables/partition_index_cache.hh:71: sstables::partition_index_cache::entry::~entry(): Assertion `!is_referenced()' failed.

Fixes #10617

Closes #10653
2022-05-25 09:27:04 +03:00
Avi Kivity
e99b99e537 docs: unconfuse doc test checker wrt gdbinit file
The docs test dislike the gdbinit link because it refers out of
the source tree. Unconfuse the tests by removing the link. It's
sad, but the file is more easily used by referring to it rather
than viewing it, so give a hint about that too.

Closes #10650
2022-05-24 20:20:30 +03:00
Gleb Natapov
083b47cecb gossiper: replace ad-hoc guard with defer()
msg_proc_guard is a guard that makes sure _msg_processing is always
decreased. We can use regular defer() to achieve the same.

Message-Id: <YoZTQPbTMWAdCObs@scylladb.com>
2022-05-24 19:20:25 +03:00
Pavel Emelyanov
ed23e83207 Merge 'Coroutinize distributed loader' from Benny Halevy
Before touching any of this code for https://github.com/scylladb/scylla/issues/9559,
that requires a change when loading sstables from the staging subdirectory,
simplify it using coroutines.

Closes #10609

* github.com:scylladb/scylla:
  replica: distributed_loader: reindent populate_keyspace
  replica: distributed_loader: coroutinize populate_keyspace
  replica: distributed_loader: reindent handle_sstables_pending_delete
  replica: distributed_loader: coroutinize handle_sstables_pending_delete
  replica: distributed_loader: reindent cleanup_column_family_temp_sst_dirs
  replica: distributed_loader: coroutinize cleanup_column_family_temp_sst_dirs
  replica: distributed_loader: reindent make_sstables_available
  replica: distributed_loader: coroutinize make_sstables_available
  sstable_directory: parallel_for_each_restricted: keep func alive across calls
  replica: distributed_loader: reindent reshape
  replica: distributed_loader: coroutinize reshape
  replica: distributed_loader: coroutinize reshard
  replica: distributed_loader: reindent run_resharding_jobs
  replica: distributed_loader: coroutinize run_resharding_jobs
  replica: distributed_loader: reindent distribute_reshard_jobs
  replica: distributed_loader: coroutinize distribute_reshard_jobs
  replica: distributed_loader: reindent collect_all_shared_sstables
  replica: distributed_loader: coroutinize collect_all_shared_sstables
  replica: distributed_loader: reindent process_sstable_dir
  replica: distributed_loader: coroutinize process_sstable_dir
2022-05-24 18:01:13 +03:00
Kamil Braun
e8f9bca288 Merge 'raft: test modify_config API in randomized_nemesis_test' from Kamil Braun
Extend the reconfiguration nemesis to send `modify_config` requests as
well as `reconfigure` requests. It chooses one or the other with
probability 1/2.

Fix a bunch of problems that surfaced during testing.

Closes #10544

* github.com:scylladb/scylla:
  test: raft: randomized_nemesis_test: send `modify_config` requests in reconfiguration nemsesis
  test: raft: randomized_nemesis_test: fix `rpc` reply ID generation
  test: raft: randomized_nemesis_test: during bouncing call, allow a leader to reroute to itself
  test: raft: randomized_nemesis_test: handle timed_out_error from modify_config
  service: raft: rpc: don't call `execute...` functions after `abort()`
  raft: server: fix bad_variant_access in `modify_config`
2022-05-24 16:47:31 +02:00
Benny Halevy
33bad72fd2 sstables: mx: add pi_auto_scale_events metric
Counts the number of promoted index auto-scale events.

A large number of those, relative to `partition_writes`,
indicates that `column_index_size_in_kb` should be increased.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-24 13:32:39 +03:00
Benny Halevy
6677028212 sstables: mx/writer: auto-scale promoted index
Add column_index_auto_scale_threshold_in_kb to the configuration
(defaults to 10MB).

When the promoted index (serialized) size gets to this
threshold, it's halved by merging each two adjacent blocks
into one and doubling the desired_block_size.

Fixes #4217

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-24 13:32:35 +03:00
Kamil Braun
700a1fdd20 test: raft: randomized_nemesis_test: send modify_config requests in reconfiguration nemsesis
Extend the reconfiguration nemesis to send `modify_config` requests as
well as `reconfigure` requests. It chooses one or the other with
probability 1/2.
2022-05-24 11:39:08 +02:00
Kamil Braun
2222f095b3 test: raft: randomized_nemesis_test: fix rpc reply ID generation
When `rpc` wants to perform a two-way RPC call it sends a message
containing a `reply_id`. The other side will send the `reply_id` back
when answering, so the original side can match the response to the promise
corresponding to the future being waited on by the RPC caller.

Previously each instance of `rpc` generated reply IDs independently as
increasing integers starting from 0. The network delivers messages
based on Raft server IDs. A response message may thus be delievered not
to the original instance which invoked the RPC, but to a new instance
which uses the same Raft server ID (after we simulated a server
crash/stop and restart, creating a new server with the same ID that
reuses the previous instance's `persistence` instance but has a new `rpc`).
The new instance could have started a new RPC call using the same
`reply_id` as one currently being in-flight that was started by the
previous instance. The new instance could then receive and handle a
response that was intended for the previous instance, leading to weird
bugs.

Fix this by replacing the local reply ID counters by a global counter so
that every two-way RPC call gets a unique reply ID.
2022-05-24 11:39:08 +02:00
Kamil Braun
b9807f07e6 test: raft: randomized_nemesis_test: during bouncing call, allow a leader to reroute to itself
A server executing a `modify_config` call, even if it initially was a
leader and accepted the request, may end up throwing a `not_a_leader`
error, rerouting the caller to a new leader - but this new leader may be
that same server. This happens because `execute_modify_config`
translates certain errors that it considers transient (such as
`conf_change_in_progress`) into `not_a_leader{last_known_leader}`,
in attempt to notify the caller that they should retry the request; but
when this translation happens, the `last_known_leader` may be that same
server (it could have even lost leadership and then regained it back
while the request was being handled).

This is not strictly an error, and it should be safe for the client to
retry the request by sending it to the same server. The nemesis test
assumed that a server never returns `not_a_leader{itself}`; this commit
drops the assumption.

An alternative solution would be to extend the error types that are now
translated to `not_a_leader` so they include information about the last
known leader. This way the client does not lose information about the
original error and still gets a potential contact point for retry.
2022-05-24 11:36:51 +02:00
Kamil Braun
b33bc7a5d6 test: raft: randomized_nemesis_test: handle timed_out_error from modify_config
May be propagated from `rpc::send_modify_config` to the caller of
`modify_config`.
2022-05-24 11:36:51 +02:00
Kamil Braun
4767b163ef service: raft: rpc: don't call execute... functions after abort()
The functions are called from RPC when a follower forwards a request to
a leader (`add_entry`, `modify_config`, `read_barrier`). The call may be
attempted during shutdown. The Raft shutdown code cleans up data structures
created by those requests. Make sure that they are not updated
concurrently with shutdown. This can lead to problems such as using the
server object after it was aborted, or even destroyed.

After this change, the RPC implementation may wait for a `execute_modify_config`
call to finish before finishing abort. That call in turn may be stuck on
`wait_for_entry`. Thus the waiter may prevent RPC from aborting. Fix
this be moving the wait on the future returned from `_rpc->abort()` in
`server::abort()` until after waiters were destroyed.
2022-05-24 11:36:51 +02:00
Kamil Braun
5e06d0ad6f raft: server: fix bad_variant_access in modify_config
`modify_config` would call `execute_modify_config` or
`_rpc->send_modify_config`, which returned a reply of type
`add_entry_reply`. This is a variant of 3 options: `entry_id`,
`not_a_leader`, or `commit_status_unknown`. The code would check
for the `entry_id` option and otherwise assume that it was `not_a_leader`.
During nemesis testing however, the reply was sometimes
`commit_status_unknown`, which caused a `bad_variant_access` exception
during `std::get` call. Fix this.

There is a similar piece of code in `add_entry`, but there it should be
impossible to obtain `commit_status_unknown` even though the types don't
enforce it. Make it more explicit with a comment and an assertion.
2022-05-24 11:36:51 +02:00
Nadav Har'El
dc5c9321fe test/cql-pytest: have new_test_table() recycle table names
Scylla has a long-standing bug (issue #7620) where having many
tombstones in the schema table significantly slows down further
schema operations.

Many cql-pytest tests use new_test_table() to create a temporary test
table with a specific schema. Before this patch, each temporary table
was created with a random name, and deleted after the test. When
running many tests on the same Scylla server, this results in a lot
of tombstones in the schema tables, and really slow schema operations.
For example, look at home much time it takes to run the same test file
N times:

$ test/cql-pytest/run --count N test_filtering.py

 N=25 -  16 seconds (total time for the N repetitions)
 N=50 -  41 seconds
N=100 - 122 seconds

Notice how progressively slower each repetition is becoming - the
total test time should have been linear in N, but it isn't!

In this patch, we keep a cache of already-deleted table names (not the
tables, just their names!) so as to reuse the same name when we can
instead of inventing a new random name. With this patch, the performance
improvement after some repetitions is amazing (compare to the table above):

 N=25 - 14 seconds
 N=50 - 29 seconds
N=100 - 46 seconds

Note how the testing time is now more-or-less linear in the number of
repetitions, as expected.

The table-name recycling trick is the same trick I already used in the
past for the translated Cassandra tests (test/cql-pytest/cassandra_tests).
The problem was even more obvious there because those tests create a
lot of different tables. But the same problem also exists in cql-pytest
in general, so let's solve it here too.

Refs #7620

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10635
2022-05-24 11:32:25 +03:00
Asias He
1f8b529e08 range_streamer: Disable restream logic
Consider:
- n1 and n2 in the cluster
- n3 bootstraps to join
- n1 does not hear gossip update from n3 due to network issue
- n1 removes n3 from gossip and pending node list
- stream between n1 and n3 fails
- n1 and n3 network issue is fixed
- n3 retry the stream with n1
- n3 finishes the stream with n1
- n3 advertises normal to join the cluster

The problem is that n1 will not treat n3 as the pending node so writes
will not route to n3 once n1 removes n3.

Another problem is that when n1 gets normal gossip status update from
n3. The gossip listener will fail because n1 has removed n3 so n1 could
not find the host id for n3. This will cause n1 to abort.

To fix, disable the retry logic in range_streamer so that once a stream
with existing fails the bootstrap fails.

The downside is that we lose the ability to restream caused by temporary
network issue but since we have repair based node operation. We can use
it to resume the previous failed node operations.

Fixes: #9805

Closes #9806
2022-05-24 11:24:25 +03:00
Avi Kivity
21728dff6f Merge 'cql: Remove support for null inside lists of IN values' from Jan Ciołek
Currently we support queries like:
```cql
SELECT * FROM ks.tab WHERE p IN (1, 2, null, 4);
```
Nothing can be equal to null so this is equivalent to:
```cql
SELECT * FROM ks.tab WHERE p IN (1, 2, 4);
```

Cassandra doesn't support it at all.
```cql
> SELECT * FROM ks.tab WHERE p IN (1, 2, null, 4)
Error: DbError(Invalid, "Invalid null value in condition for column p")

> SELECT * FROM ks.tab WHERE p IN (1, 2, ?, 4) # ? is NULL
Error: DbError(Invalid, "Invalid null value in condition for column p")

> SELECT * FROM ks.tab WHERE p IN ? # ? is (1, 2, null, 4)
Error: DbError(Invalid, "Invalid null value in condition for column p")
```
It makes little sense to send a null inside list of IN values and supporting it is a bit cumbersome.

Supporting it causes trouble because internally the values are represented as a list, not a tuple, and lists can't contain nulls.
Because of that code requires exceptions because in this single case there can be a null inside of a collection.

This PR starts treating a llist of IN values the same as any other list and as result nulls are forbidden inside them.

In case of a null the message is the same as any other collection:
```
null is not supported inside collections
```
I'm not entirely happy about it - someone could be confused if they received this message after a query that didn't involve any collections.
The problem with making a prettier error message is that once again we would have to give `evaluate` additional information that it's now evaluating a list of IN values. And we would end up back with `evaluate_IN_list`

I think we could consider adding some kind of generic context to evaluate. The context would contain the whole expression and a mark on the part that we are currently evaluating. Then in case of error we could use this context and use it to create more helpful error messages, e.g. point to the part of the expression where a problem occured. But that's outside of the scope of this PR.

Fixes #10579

Closes #10620

* github.com:scylladb/scylla:
  cql: Add test for null in IN list
  cql: Forbid null in lists of IN values
2022-05-24 09:15:13 +03:00
Jan Ciolek
ff3205cf19 cql: Add test for null in IN list
Added tests that check for error
when null or unset appears in a list
of IN values.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-05-24 00:17:54 +02:00
Jan Ciolek
f9b1fc0b69 cql: Forbid null in lists of IN values
We used to allow nulls in lists of IN values,
i.e. a query like this would be valid:
SELECT * FROM tab WHERE pk IN (1, null, 2);

This is an old feature that isn't really used
and is already forbidden in Cassandra.

Additionally the current implementation
doesn't allow for nulls inside the list
if it's sent as a bound value.
So something like:
SELECT * FROM tab WHERE pk IN ?;
would throw an error if ? was (1, null, 2).
This is inconsistent.

Allowing it made writing code cumbersome because
this was the only case where having a null
inside of a collection was allowed.
Because of it there needed to be
separate code paths to handle regular lists
and lists of NULL values.

Forbidding it makes the code nicer and consistent
at the cost of a feature that isn't really
important.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-05-24 00:17:41 +02:00
Avi Kivity
e65b3ed50a Merge 'Allow trigger off strategy compaction early for node operations' from Asias He
This patch set adds two commits to allow trigger off strategy early for node operations.

*) repair: Repair table by table internally

This patch changes the way a repair job walks through tables and ranges
if multiple tables and ranges are requested by users.

Before:

```
for range in ranges
   for table in tables
       repair(range, table)
```

After:

```
for table in tables
    for range in ranges
       repair(range, table)
```

The motivation for this change is to allow off-strategy compaction to trigger
early, as soon as a table is finished. This allows to reduce the number of
temporary sstables on disk. For example, if there are 50 tables and 256 ranges
to repair, each range will generate one sstable. Before this change, there will
be 50 * 256 sstables on disk before off-strategy compaction triggers. After this
change, once a table is finished, off-strategy compaction can compact the 256
sstables. As a result, this would reduce the number of sstables by 50X.

This is very useful for repair based node operations since multiple ranges and
tables can be requested in a single repair job.

Refs: #10462

*) repair: Trigger off strategy compaction after all ranges of a table is repaired

When the repair reason is not repair, which means the repair reason is
node operations (bootstrap, replace and so on), a single repair job contains all
the ranges of a table that need to be repaired.

To trigger off strategy compaction early and reduce the number of
temporary sstable files on disk, we can trigger the compaction as soon
as a table is finished.

Refs: #10462

Closes #10551

* github.com:scylladb/scylla:
  repair: Trigger off strategy compaction after all ranges of a table is repaired
  repair: Repair table by table internally
2022-05-23 18:58:21 +03:00
Avi Kivity
3fadae74e7 Merge "Keep cluster-join code in one place" from Pavel E
"
There are several issues with it

- it's scattered between main() and storage_service methods
- yet another incarnation of it also sits in the cql-test-env
- the prepare_to_join() and join_token_ring() names are lying to readers,
  as sometimes node joins the ring in prepare- stage
- storage service has to carry several private fields to keep the state
  between prepare- and join- parts
- some storage service dependencies are only needed to satisfy joining,
  but since they cannot start early enough, they are pushed to storage
  service uninitialized "in the hope" that it won't use them until join

This patch puts joining steps in one place and enlightens storage service
not to carry unneeded dependencies/state onboard. And eliminates one more
usage of global proxy instance while at it.

branch: https://github.com/xemul/scylla/tree/br-merge-init-server-and-join-cluster
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/466/
refs: #2795
"

* 'br-merge-init-server-and-join-cluster' of https://github.com/xemul/scylla:
  storage_service: Remove global proxy call
  storage_service: Remove sys_dist_ks from storage_service dependencies
  storage_service: Remove cdc_gen_service from storage_service dependencies
  storage_service: Make _cdc_gen_id local variable
  storage_service: Make _bootstrap_tokens local variable
  storage_service: Merge prepare- and join- private members
  storage_service: Move some code up the file
  storage_service: Coroutinize join_token_ring
  storage_service: Fix indentation after previous patch
  storage_service: Execute its .bootstrap() into async()
  storage_service: Dont assume async context in mark_existing_views_as_built
  storage_service: Merge init-server and join-cluster
  main, storage_service: Move wait for gossip to settle
  main, storage_service: Move passive announce subscription
  main, storage_service: Move early group0 join call
2022-05-23 17:33:02 +03:00
Piotr Dulikowski
ead7bdd6f8 storage_proxy: remove unused overload of query_mutations_locally
An overload of storage_proxy::query_mutations_locally was declared in
a35136533d which takes a vector of
partition ranges as an argument, but it was never defined. This commit
removes the unused overload declaration.

Closes #10610
2022-05-23 16:20:51 +03:00
Avi Kivity
75e001fc6a cql3: grammar: unify production for bind variables
Since 9b49d27a8 ("cql3: expr: Remove shape_type from bind_variable"),
bind variables no longer remember their context (e.g. if they are
in a scalar or vector comparison, or if they are in an IN or
other relation. Exploit that my merging all of the productions that
generate a bind variables (that are now exactly equal) into a single
marker production.

Closes #10624
2022-05-23 17:41:16 +03:00
Pavel Emelyanov
d755fdc1f4 storage_service: Remove global proxy call
Storage service needs it to calculate schema version on join. The proxy
at this point can be passed as an argument to the joining helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
bc051387c5 storage_service: Remove sys_dist_ks from storage_service dependencies
The service in question is only needed join_cluster-time, no need to
keep it in the dependencies list. This also solves the dependency
trouble -- the distributed keyspace is sharded::start-ed after it's
passed to storage_service initialization.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
5a97ba7121 storage_service: Remove cdc_gen_service from storage_service dependencies
This service is only needed join-time, it's better to pass it as
argument to join_cluster(). This solves current reversed dependency
issuse -- the cdc_gen_svc is now started after it's passed to storage
service initialization.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
0c40b69411 storage_service: Make _cdc_gen_id local variable
Same as with _bootstrap_tokens -- this variable is only needed
throughout a single function invocation, so it doesn't have to be a
class member.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
80bd317292 storage_service: Make _bootstrap_tokens local variable
Now it's a member on storage_service, but it was such just to carry the
set of tokens between to subsequent calls. Now when all the joining
happens in one function, the set can become local variable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
3f6d3ea601 storage_service: Merge prepare- and join- private members
These two are the real code that does preparation and joining. They are
called in async() context by public storage_service methods that had
been merged recently, so this patch merges the internals.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
7ac73bb87f storage_service: Move some code up the file
No logic change, this is to keep join_token_ring next to
prepare_to_join so that the patch merging them becomes clean and small.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
f478c7f29c storage_service: Coroutinize join_token_ring
Next patch will merge this method with prepare_to_join() which is
already coroutinized. To make it happen -- coroutinize it in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
81e7de076e storage_service: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
3a16d3ee95 storage_service: Execute its .bootstrap() into async()
Next patches will coroutinize join_cluster(), so the .bootstrap() method
should return a future. It's worth coroutinizing it as well, but that's
a huge change, so for now -- keep it in its own explicit async().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
4a6bf57e8f storage_service: Dont assume async context in mark_existing_views_as_built
Next patches will coroutinize join_cluster(), this is a preparation to
that change.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
282cc070bc storage_service: Merge init-server and join-cluster
Now they always follow one another both in main and cql-test-env.
Also, despite the name, init-server does joins the cluster when it's
just a normal node restarting, so join-cluster is called when the
cluster is already joind. This merge make the function be named as
what it really does.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
b2b86b0c83 main, storage_service: Move wait for gossip to settle
And make cql-test-env configure to skip it not to slow down tests in
vain. Another side effect is that cql-test-env would trigger features
enabling at this point, but that's OK, they are enabled anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
b842167bcc main, storage_service: Move passive announce subscription
Storage service already has a vector of random subscription scope
holders, this becomes yet another one. This partially reverts
e4f35e2139, which's half-step backwards, but so far I've no better
ideas where to track that scope guard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Pavel Emelyanov
89163a3be4 main, storage_service: Move early group0 join call
It happens right after the prepare to join, moving it at the end of the
latter call doesn't change the code logic. A side effect -- this removes
a silly join_group0() one-line helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-23 12:55:30 +03:00
Nadav Har'El
02fb6f33fb test/*/run: improve signal handling in test runners
When a user runs a script and presses control-C, a SIGINT (signal 2)
gets sent to every process in the script's "process group". By default,
every subprocess started by a script joins the parent's process group.

Our test/*/run test-runner scripts typically start two processes: scylla
and pytest. If we keep them in the same process group, a control-C
would kill them in a random order and that is ugly - if Scylla is
killed before pytest, we'll see a few test failures before pytest
is finally killed. So the existing code put Scylla in its own process
group, and killed it on exit after killing pytest.

But there were a few inconsistencies in our implementation, leading
to some annoying behaviors:

1. Doing "kill -2" to the runner's process (not a control-C which sends
   a signal to the process group) caused scylla and pytest to be killed
   on exit. So far so good. But, we should kill their entire process
   groups, not just the one process. This is important when pytest starts
   its own subprocesses (as happens in cql-pytest/test_tools.py),
   otherwise they just remain running.
   We need to call pgkill() instead of kill(), but also we forgot
   to start a new process group for the pytest run - so this patch
   fixes it.

2. Our exit handler - which kills the subprocesses - only gets called
   on signals which Python catches, and this is only SIGINT. Killing
   the test runner with SIGTERM or SIGHUP before this patch caused
   the subprocesses to be left running. In this patch we also catch
   SIGTERM and SIGHUP, so our exit handler is also run in that case.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10629
2022-05-23 12:24:19 +03:00
Piotr Sarna
84833e495e Merge 'Additional cql-pytest tests related to WHERE expressions' from Nadav Har'El
This small series includes a few more CQL tests in the cql-pytest
framework.

The main patch is a translation of a unit test from Cassandra that
checks the behavior of restrictions (WHERE expressions, filtering) in
different cases.  It turns out that Cassandra didn't implement some
cases - for example filtering on unfrozen UDTs - but Scylla does
implement them. So in the translated test, the
checks-that-these-features-generate-an-error from Cassandra are
commented out, and this series also includes separate tests for these
Scylla-unique features to check that they actually work correctly and
not just that they exist.

Closes #10611

* github.com:scylladb/scylla:
  cql-pytest: translate Cassandra's tests for relations
  test/cql-pytest: add test for filtering UDTs
  test/cql-pytest: tests for IN restrictions and filtering
  test/cql-pytest: test more cases of overlapping restrictions
2022-05-23 11:10:24 +02:00
Avi Kivity
6a3494442e Update abseil submodule
* abseil f70eadad...9e408e05 (109):
  > Cord: workaround a GCC 12.1 bug that triggers a spurious warning
  > Change workaround for MSVC bug regarding compile-time initialization to trigger from MSC_VER 1910 to 1930. 1929 is the last _MSC_VER for Visual Studio 2019.
  > Don't default to the unscaled cycle clock on any Apple targets.
  > Use SSE instructions for prefetch when __builtin_prefetch is unavailable
  > Replace direct uses of __builtin_prefetch from SwissTable with the wrapper functions.
  > Cast away an unused variable to play nice with -Wunused-but-set-variable.
  > Use NullSafeStringView for const char* args to absl::StrCat, treating null pointers as "" Fixes #1167
  > raw_logging: Extract the inlined no-hook-registered behavior for LogPrefixHook to a default implementation.
  > absl: fix use-after-free in Mutex/CondVar
  > absl: fix live-lock in CondVar
  > Add a stress test for base_internal::ThreadIdentity reuse.
  > Improve compiler errors for mismatched ParsedFormat inputs.
  > Internal change
  > Fix an msan warning in cord_ringbuffer_test
  > Fix spelling error "charachter"
  > Document that Consume(Prefix|Suffix)() don't modify the input on failure
  > Fixes for C++20 support when not using std::optional.
  > raw_logging: Document that AbortHook's buffers live for as long as the process remains alive.
  > raw_logging: Rename SafeWriteToStderr to indicate what about it is safe (answer: it's async-signal-safe).
  > Correct the comment about the probe sequence.  It's (i/2 + i)/2 not (i/2 - i)/2.
  > Improve analysis of the number of extra `==` operations, which was overly complicated, slightly incorrect.
  > In btree, move rightmost_ into the CompressedTuple instead of root_.
  > raw_logging: Rename LogPrefixHook to reflect the other half of it's job (filtering by severity).
  > Don't construct/destroy object twice
  > Rename function_ref_benchmark.cc into more generic function_type_benchmark.cc, add missing includes
  > Fixed typo in `try_emplace` comment.
  > Fix a typo in a comment.
  > Adds ABSL_CONST_INIT to initializing declarations where it is missing
  > Automated visibility attribute cleanup.
  > Fix typo in absl/time/time.h
  > Fix typo: "a the condition" -> "a condition".
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix build with uclibc-ng (#1145)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Replace the implementation of the Mix function in arm64 back to 128bit multiplication (#1094)
  > Support for QNX (#1147)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Exclude unsupported x64 intrinsics from ARM64EC (#1135)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add NetBSD support (#1121)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Some trivial OpenBSD-related fixes (#1113)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add support of loongarch64 (#1110)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Disable ABSL_INTERNAL_ENABLE_FORMAT_CHECKER under VsCode/Intellisense (#1097)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > macos: support Apple Universal 2 builds (#1086)
  > cmake: make `random_mocking_bit_gen` library public. (#1084)
  > cmake: use target aliases from local Google Test checkout. (#1083)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > cmake: add ABSL_BUILD_TESTING option (#1057)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix googletest URL in CMakeLists.txt (#1062)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix Randen and PCG on Big Endian platforms (#1031)
  > Export of internal Abseil changes

Closes #10630
2022-05-22 23:46:33 +03:00
Avi Kivity
4afe2a24b9 cql3: grammar: make 'relation' return an expression rather than append to a vector
The 'relation' production is self-contained, except for its interface
to the rest of the grammar where it appends to a vector of expressions
(that happens to represent a conjunction of relations). Make it stand-
alone by returning an expression, and move the responsibility for
appending to an expression vector to the whereClause production (later
we can make it build a conjunction expression rather than a vector
of expressions, paving the way for more boolean operators).

Closes #10623
2022-05-22 22:16:12 +03:00
Avi Kivity
a6b554409b storage_proxy: stop using deprecated std::not1 and std::bind1st in cas(), get_paxos_participants(), and create_write_response_handler_helper()
Use equivalent std::not_fn and std::bind_front instead.

Closes #10622
2022-05-22 22:13:12 +03:00
Nadav Har'El
ac393a62a1 cql-pytest: translate Cassandra's tests for relations
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectSingleColumnRelationTest.java into our
cql-pytest framework.

This test file includes 23 tests for various types of SELECT operations
which involve relations, a.k.a expressions (i.e., WHERE).

All 23 tests pass on Cassandra. 3 of the tests fail on Scylla
reproducing 2 already known Scylla issues and three minor
previously-unknown issues:

Previously known issues:

Refs #2962:  Collection column indexing
Refs #10358: Comparison with UNSET_VALUE should produce an error

Three new (and minor) issue:

Refs #10577: Is max-clustering-key-restrictions-per-query too low?
Refs #10631: Invalid IN restriction is reported as a '=' restriction
Refs #10632: Column name printed in a strange way in error message

NOTE: Scylla supports some expressions which Cassandra does not. In some
cases the Cassandra unit test had checks that certain constructs are not
allowed, and I had to comment out such checks when the expression *does*
work in Scylla. But of course, in such cases, it is not enough to comment
out a check - we also need to verify that Scylla's unique behavior
is the correct one. For that, we will have separate cql-pytest test for
those features - they won't be in the translated Cassandra unit tests
(of course). For example, in this test I had to comment out a check
that filtering on *non*-frozen UDTs is not allowed. In a separate
patch which I'm sending in parallel, I added a new test -
test_filter_UDT_restriction_nonfrozen - which will verify that what
Scylla does in that case is the correct behavior.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-22 20:49:04 +03:00
Nadav Har'El
8af2c4aced test/cql-pytest: add test for filtering UDTs
Both Scylla and Cassandra support filtering on frozen UDTs, which are
compared using lexicographical order. This patch adds a test to verify
that the behavior here is the same - and indeed it is.

For *non*-frozen UDTs, Cassandra does not allow filtering on them (this
was decided in CASSANDRA-13247), but Scylla does. So we also add a test
on how non-frozen UDTs work - that passes on Scylla (and of course not
in Cassandra).

The two tests here - for frozen and non-frozen UDTs - are identical
(they just call the same function) - to ensure these two cases work the
same. This is important because we can't judge the correctness of the
non-frozen test by comparison to Cassandra - because it can't run there.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-22 20:20:15 +03:00
Nadav Har'El
1c355d773a test/cql-pytest: tests for IN restrictions and filtering
Cassandra only allows IN restrictions on primary key columns.
With filtering (ALLOW FILTERING), this limitation makes little
sense, and it turns out that Scylla does not have limitation.
So this patch adds a test that we support such queries *correctly*
(we can't compare to Cassandra because it doesn't implement this).

Another test checks IN restrictions on an *indexed* column
(without ALLOW FILTERING). We could have implemented this - the
indexed column behaves like a partition key - but we didn't,
so this test xfails. It also fails on Cassandra because as mentioned
above, Cassandra didn't implement IN except for primary key columns.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-22 20:20:15 +03:00
Nadav Har'El
bd71e6d962 test/cql-pytest: test more cases of overlapping restrictions
As noted by test_filtering.py::test_multiple_restrictions_on_same_column
Cassandra WHERE does not allow specifying two restrictions on the the
same column, but Scylla does allow it, and this test verifies that the
results are correct (conflicting restrictions would lead to no results,
but overlapping restrictions can return some results).

In this patch we add yet another example of multiple restrictions
on the same column that was seen in a Cassandra unit test - this time
one of the restrictions involves a IN. These patch helps confirm that
the expression evaluation is done correct (and, again, differently from
Cassandra - Cassandra results in an error in this case).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-22 20:20:15 +03:00
Alejo Sanchez
3904d3b96e install-dependencies.sh: add python3-pytest-asyncio
Add package pytest-asyncio for async pytest support.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #10616

[avi: regenerate frozen toolchain]
2022-05-22 17:46:56 +03:00
Pavel Emelyanov
1199c6e5da snitch: Use invoke_on_others() to replicate
The replication happens on all shards but current one. There's a special
helper in seastar for such cases

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-20 18:16:22 +03:00
Pavel Emelyanov
5ec87285f8 snitch: Merge set_my_dc and set_my_rack into one
These two are always used in pair.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-20 18:16:19 +03:00
Pavel Emelyanov
c6d0bc87d0 azure_snitch: Do nothing on non-io-cpu
All snitch drivers are supposed to snitch info on some shard and
replicate the dc/rack info across others. All, but azure really do so.
The azure one gets dc/rack on all shards, which's excessive but not
terrible, but when all shards start to replicate their data to all the
others, this may lead to use-after-frees.

fixes: #10494

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-20 18:15:57 +03:00
Benny Halevy
b3e2204fe6 replica: distributed_loader: reindent populate_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:16:41 +03:00
Benny Halevy
a3c1dc8cee replica: distributed_loader: coroutinize populate_keyspace
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:15:06 +03:00
Benny Halevy
5b038affae replica: distributed_loader: reindent handle_sstables_pending_delete
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
b8260c9983 replica: distributed_loader: coroutinize handle_sstables_pending_delete
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
48122d3006 replica: distributed_loader: reindent cleanup_column_family_temp_sst_dirs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
8ba10dba2d replica: distributed_loader: coroutinize cleanup_column_family_temp_sst_dirs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
5f4d20267d replica: distributed_loader: reindent make_sstables_available
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
b3ebbf35e2 replica: distributed_loader: coroutinize make_sstables_available
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
868cea21e0 sstable_directory: parallel_for_each_restricted: keep func alive across calls
Without that there's use-after-free when called from
distributed_loader::make_sstables_available where
func is turned into a coroutine and the shared_sstable parameter
is not explicitly copied and captured for the continuation
of sst->move_to_new_dir.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
b13f44ca61 replica: distributed_loader: reindent reshape
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
a1e663f225 replica: distributed_loader: coroutinize reshape
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
cf0d0a18a0 replica: distributed_loader: coroutinize reshard
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
e1ba285d52 replica: distributed_loader: reindent run_resharding_jobs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
29e51ed0cd replica: distributed_loader: coroutinize run_resharding_jobs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
b65d55cbbf replica: distributed_loader: reindent distribute_reshard_jobs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
3baa4d4946 replica: distributed_loader: coroutinize distribute_reshard_jobs
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
ba1eb7ab9c replica: distributed_loader: reindent collect_all_shared_sstables
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
84d528cd84 replica: distributed_loader: coroutinize collect_all_shared_sstables
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:53 +03:00
Benny Halevy
8080e98309 replica: distributed_loader: reindent process_sstable_dir
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:51 +03:00
Benny Halevy
33179c8647 replica: distributed_loader: coroutinize process_sstable_dir
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-20 17:07:00 +03:00
Botond Dénes
f873806c7c Merge 'Fix use-after-free when queue reader is destroyed before its handle' from Benny Halevy
The handle must not point at this reader implementation after it's destroyed.

This fixes use-after-free when the queue_reader_v2
is destroyed first as repair_writer_impl::_queue_reader,
before repair_writer_impl::_mq is destroyed.

The issue was introduced in 39205917a8
in the definition of `repair_writer_impl`.

Fixes #10528

While at it, fix also an ignored exceptional future seen in the test:
`repair_additional_test.py::TestRepairAdditional::test_repair_kill_3`

Closes #10591

* github.com:scylladb/scylla:
  mutation_readers: queue_reader_v2: detach from handle when destroyed
  messaging_service: do_make_sink_source: handle failed source future
2022-05-19 21:51:41 +03:00
Raphael S. Carvalho
b120cacdd1 compaction_manager: Allow off-strategy to proceed in parallel to in-strategy compactions
Off-strategy works on maintenance sstable set using maintenance
scheduling group, whereas "in-strategy" works on main sstable set
and uses compaction group.

Today, it can happen that off-strategy has to wait for an "in-strategy"
maintenance compaction, e.g. cleanup, to complete before getting
a chance to run. But that's not desired behavior as off-strategy uses
maintenance group, and its candidates don't add to the backlog that
influences "in-strategy" bandwidth. Therefore, "in-strategy" and
off-strategy should be decoupled, with off-strategy having its own
semaphore for guaranteeing serialization across tables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10595
2022-05-19 17:37:11 +03:00
Takuya ASADA
b6003989f9 scylla_setup: stop using sudo -u, use user/group parameter on subprocess module
To run scylla-housekeeping we currently use "sudo -u scylla <cmd>" to switch
scylla user, but it fails on some environment.
Since recent version of Python 3 supports to switch user on subprocess module,
let's use python native way and drop sudo.

Fixes #10483

Closes #10538
2022-05-19 17:21:35 +03:00
Avi Kivity
fcf292b5d1 Merge 'Add range deserializer' from Benny Halevy
Currently, the idl-generated deserialization code, e.g. mutation_partition_view::rows() deserializes and returns a complete utils::chunked_vector<deletable_row_view> . And that could be arbitrarily long.

To consume it gently, we don't need the whole vector in advance, but rather we can consume it one element at a time (and in a nested way for cells in a row in the future).

Use `range_deserializer` to consume range tombstones and rows one item at a time.

We may consider in the future also gently iterating over cells
in a row and then dipping into collection cells that might also
contain a large number of items.

Fixes #10558

Closes #10566

* github.com:scylladb/scylla:
  ser: use vector_deserializer by default for all idl vectors
  mutation_partition_view: do_accept_gently: use the range based deserializers
  idl-compiler: generate *_range methods using vector_deserializer
  serializer_impl: add vector_deserializer
  test: frozen_mutation_test: test_writing_and_reading_gently: log detailed error
2022-05-19 17:21:35 +03:00
Avi Kivity
5285ccbb12 Merge 'Add prune ghost rows statement' from Piotr Sarna
This series is split from another, bigger RFC series which provides
manual remedies to deal with inconsistencies between the base table
and its views. This part deals with ghost rows by providing a statement
which fetches view rows from a given range, then reads its corresponding
rows from the base table (cl=ALL), and finally removes rows which were
not present in the base table at all, qualifying them as ghost rows.
Motivations for introducing such a statement:
 * in case of detected inconsistencies, it can be used to fix
   materialized views without recreating them from scratch, which can
   take days and generates lots of throughput
 * a tool which periodically scrubs a materialized view can be easily
   created on top of this statement, especially that it's possible
   to remove ghost rows from a user-defined view token range;

This series comes with a unit test.

The reason for digging up this series is because it's still possible to end up with ghost rows in certain rather improbable scenarios, and we lack a way of fixing them without rebuilding the whole view. For instance, in case of a failed synchronous update to a local view, the user will be notified that the query failed, but a ghost row can be created nonetheless. The pruning statement introduced in this series would allow healing the failure locally, without rebuilding the whole view.

Tests: unit(dev)

Closes #10426

* github.com:scylladb/scylla:
  docs: add a paragraph on PRUNE MATERIALIZED VIEW statement
  service,test: add a test case for error during pruning
  tests: add ghost row deletion test case
  cql3: enable ghost row deletion via CQL
  cql3: add a statement for deleting ghost rows
  cql3: convert is_json statement parameter to enum
  pager: add ghost row deleting pager
  db,view: add delete ghost rows visitor
2022-05-19 17:21:35 +03:00
Gleb Natapov
c2ef390a52 service: raft: move group0 write path into a separate file
Writing into the group0 raft group on a client side involves locking
the state machine, choosing a state id and checking for its presence
after operation completes. The code that does it resides now in the
migration manager since the currently it is the only user of group0. In
the near future we will have more client for group0 and they all will
have to have the same logic, so the patch moves it to a separate class
raft_group0_client that any future user of group0 can use to write
into it.

Message-Id: <YoYAJwdTdbX+iCUn@scylladb.com>
2022-05-19 17:21:35 +03:00
Avi Kivity
78eccd8763 Merge "Remove sstable_format_slector::sync()" from Pavel E
"
There's an explicit barrier in main that waits for the sstable format
selector to finish selecting it by the time node start to join a cluter.
(Actually -- not quite, when restarting a normal node it joins cluster
in prepare_to_join()).

This explicit barrier is not needed, the sync point already exists in
the way features are enabled, the format-selector just needs to use it.

branch: https://github.com/xemul/scylla/tree/br-format-selector-sync
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/351/
refs: #2795
"

* 'br-format-selector-sync' of https://github.com/xemul/scylla:
  format-selector: Remove .sync() point
  format-selector: Coroutinize maybe_select_format()
  format-selector: Coroutinize simple methods
2022-05-19 17:21:35 +03:00
Avi Kivity
08ed4d7405 Merge 'scylla-gdb.py: add commands to dump sstables summary and index-cache' from Botond Dénes
This series adds two commands:
* scylla sstable-summary
* scylla sstable-index-cache

The former dumps the content of the sstable summary. This component is kept in memory in its entirety, so this can be easily done. The latter command dumps the content of the sstable index cache. This contains all the index-pages that are currently cached. The promoted index is not dumped yet and there is no indication of whether a given entry is in the LRU or not, but this already allows at seeing what pages are in the cache and what aren't.

Closes #10546

* github.com:scylladb/scylla:
  scylla-gdb.py: add scylla sstable-index-cache command
  scylla-gdb.py: add scylla sstable-summary command
  test/scylla-gdb: add sstable fixture
  scylla-gdb.py: make chunked_vector a proper container wrapper"
  scylla-gdb.py: make small_vector a proper container wrapper"
  scylla-gdb.py: add sstring container wrapper
  scylla-gdb.py: add chunked_managed_vector container wrapper
  scylla-gdb.py: add managed_vector container wrapper
  scylla-gdb.py: std_variant: add workaround for clang template bug
  scylla-gdb.py: add bplus_tree container wrapper
2022-05-19 17:21:35 +03:00
Pavel Emelyanov
3e53a0965c scylla-gdb: Handle new seastar/fair_queue layout
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220519094452.14538-1-xemul@scylladb.com>
2022-05-19 13:29:03 +03:00
Benny Halevy
9c5feb2781 mutation_readers: queue_reader_v2: detach from handle when destroyed
The handle must not point at this reader implementation
after it's destroyed.

This fixes use-after-free when the queue_reader_v2
is destroyed first as repair_writer_impl::_queue_reader,
before repair_writer_impl::_mq is destroyed.

The issue was introduced in 39205917a8
in the definition of `repair_writer_impl`.

Fixes #10528

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-19 11:48:03 +03:00
Benny Halevy
1308b45c58 messaging_service: do_make_sink_source: handle failed source future
I've stumbled upon this with version a2901a376d in debug mode
when testing repair_additional_test.py::TestRepairAdditional::test_repair_kill_3:

WARN  2022-05-17 07:26:12,581 [shard 0] seastar - Exceptional future ignored: seastar::rpc::closed_error (connection is closed), backtrace: 0x137c33d0 0x1ad14d0d 0x1ad149cd 0x1ad16fc3 0x1ad17e52 0x19d8a809 0x19d8ab6a 0x139165a9 0x17be0d21 0x17bdcfb0 0x17bf3611 0x17bf39f0 0x17bf3c62 0x17bf3958 0x17bf57d8 0x17bf5468 0x19efe44e 0x19f04ac6 0x19f09732 0x19f072a1 0x19cca281 0x19cc7de5 0x13859cbf 0x13d309d6 0x13d3090b 0x13d30775 0x1391364d 0x13858521 /lib64/libc.so.6+0x27b74 0x137774ad

Decode:
```
seastar::report_failed_future(seastar::future_state_base::any&&) at //./seastar/src/core/future.cc:218
seastar::future_state_base::any::check_failure() at //./seastar/include/seastar/core/future.hh:573
seastar::future_state<seastar::rpc::source<repair_row_on_wire_with_cmd> >::clear() at ././seastar/include/seastar/core/future.hh:615
~future_state at ././seastar/include/seastar/core/future.hh:620
 (inlined by) ~future at ././seastar/include/seastar/core/future.hh:1343
~ at ./message/messaging_service.cc:841
```

Looks like if sink.close() fails after source.failed() then source gets abandoned.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-19 11:47:38 +03:00
Avi Kivity
5c481973a3 sstables/processing_result_generator.hh: refine check for coroutine standard
We have a check for whether we can use standard coroutines (in namespace
std) or the technical specification (in std::experimental), but it doesn't
work since Clang doesn't report the correct standard version. Use a
compiler versionspecific check, inspired by Seastar's check.

This allows building with clang 14.

Closes #10603
2022-05-19 11:31:40 +03:00
Nadav Har'El
0040e9e7f4 Merge 'cql: Add proper validation for null and unset inside collections send as bound values' from Jan Ciołek
Let's say we have a query like:
```cql
INSERT INTO ks.t (list_column) VALUES (?);
```
And the driver sends a list with null inside as the bound value, something like `[1, 2, null, 4]`.

In such case we should throw `invalid_request_exception` because `nulls` are not allowed inside collections.

Currently when a query like this gets executed Scylla throws an ugly marshalling error.
This is because the validation code reads size of the next element, interprets it as an unsigned integer and tries to read this much.
In case of `null` element the size is `-1`, which when converted to unsigned `size_t` gives 18446744073709551615 and it fails to read this much.

This PR adds proper validation checks to make the error message better.

I also added some tests.
I originally tried to write them in python, but python driver really doesn't like sending invalid values.
Trying to send `[1, None, 2]` results in a list with empty value instead of null.
Trying to send `[1, UNSET_VALUE, 2]` Fails before query even leaves the driver.

Fixes #10580

Closes #10599

* github.com:scylladb/scylla:
  cql3: Add tests for null and unset inside collections
  cql3: Add null and unset checks in collection validation
2022-05-19 11:25:24 +03:00
Piotr Sarna
e54a4ebdcb docs: add a paragraph on PRUNE MATERIALIZED VIEW statement 2022-05-19 10:16:04 +02:00
Piotr Sarna
b8a36ff253 service,test: add a test case for error during pruning
The test case checks that errors which occur during materialized
view pruning are properly propagated back to the user.
2022-05-19 10:16:04 +02:00
Piotr Sarna
995468520e tests: add ghost row deletion test case
The tests checks if manually injected ghost rows are properly deleted
by the ghost row delete statement - and, that non-ghost regular rows
are left intact.
2022-05-19 10:16:03 +02:00
Piotr Sarna
be2ef862bd cql3: enable ghost row deletion via CQL
This commit allows accepting a CQL request to clear ghost rows
from a given view partition range. Currently its syntax is a purposely
convoluted mix of existing keywords, which makes sure that the statement
is never issued by mistake. Example runs:

 -- try deleting all ghost rows, effectively performs a paged full scan
PRUNE MATERIALIZED VIEW my_mv;

 -- try deleting ghost rows from a single view partition
PRUNE MATERIALIZED VIEW my_mv WHERE mv_pk = 3;

 -- try deleting ghost rows from a token range (effective full scans)
PRUNE MATERIALIZED VIEW my_mv WHERE TOKEN(mv_pk) > 7 AND TOKEN(mv_pk) < 42
2022-05-19 10:11:50 +02:00
Piotr Sarna
ec0a3bbbd4 cql3: add a statement for deleting ghost rows
In order to expose the API for deleting ghost rows from a view,
a CQL statement is created. It is loosely based on select_statement,
as its first step is to select view table rows.
2022-05-19 10:11:50 +02:00
Piotr Sarna
d74e25be67 cql3: convert is_json statement parameter to enum
Right now is_json is used to decide if the statement needs to be treated
in a special way. For two types (regular statement and JSON statement),
a boolean is enough, but this series extends it for two more types,
so the flag is converted to an enum.
2022-05-19 10:11:50 +02:00
Piotr Sarna
2c6e1a5409 pager: add ghost row deleting pager
The pager is based on ghost row deleting visitor - it simply
traverses each fetched page with it in order to delete
ghost rows from the view table.
2022-05-19 10:11:50 +02:00
Piotr Sarna
c3a9658535 db,view: add delete ghost rows visitor
The visitor is used to traverse view rows, and if it detects
a ghost row it qualifies it for deletion. Qualification is based
on a base table read with cl=ALL: if the corresponding row is not
present in the base table, it is considered a ghost.
2022-05-19 10:11:50 +02:00
cvybhu
7adc572ec6 cql3: Add tests for null and unset inside collections
Add a bunch of tests that test what happens
when there is a null or unset value inside collections.

They are not allowed so every such attempt
should end with invalid_request_exception
with proper message.

I had to write a new function for collection serialization.
I tried to use data_value and its methods, but it's impossible
to create a data_value that represents an unset value.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-19 00:15:17 +02:00
Benny Halevy
4e78b4de1b ser: use vector_deserializer by default for all idl vectors
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-18 19:24:18 +03:00
Benny Halevy
ea89a9aa56 mutation_partition_view: do_accept_gently: use the range based deserializers
Currently use the range_deserializer for range tombstones and rows.

We may consider in the future also gently iterating over cells
in a row and then dipping into collection cells that might also
contain a large number of items.

Fixes #10558

Test: frozen_mutation_test(dev, debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-18 19:24:18 +03:00
Benny Halevy
29632e739d idl-compiler: generate *_range methods using vector_deserializer
Generate code for *_range methods that return a
vector_deserializer rather than constructing the complete
vector of views.

This would be useful for streamed mutation unfreezing
in the following patch.

Later, we should just use vector_deserializer for all vectors.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-18 19:24:18 +03:00
Benny Halevy
5b902d9fd6 serializer_impl: add vector_deserializer
To be used for streaming through a serialized
vector, deserializing the items as we go
when dereferencing or incrementing the iterator.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-18 19:10:13 +03:00
Avi Kivity
e6fe9cc683 Merge 'Reapply: "disable_auto_compaction: stop ongoing compactions"' from Eliran Sinvani
Reapply: "disable_auto_compaction: stop ongoing compactions"
This is a reapplication of a former commit
4affa801a5 which was reverted by
8e8dc2c930.
This commit is a fixed version of the original where a call to the
compaction_manager constructor accidentally issued (`compaction_manager()`)
instead a call to retrieve a compaction manager reference
(`get_compaction_manager()`), we don't use this function because it
doesn't exist anymore - it existed at the time the patch was written
bu was removed in 9066224cf4 later on,
instead, we just use the private table member _compaction_manager which refs
the compaction manager.

The explanation for the bad effect is probably that a `this` pointer
capture down the call chain, resulted in a use after free which had
an unknown effect on the system. (memory corruption at startup).

Test: unit (dev,debug)
      write performance test as the one used to find the bug.
A screenshot of the performance test can be found at
https://github.com/scylladb/scylla/issues/10146/#issuecomment-1129578381

Fixes https://github.com/scylladb/scylla/issues/9313
Refs https://github.com/scylladb/scylla/issues/10146

For completeness, the original commit message was:
The api call disables new regular compaction jobs from starting
but it doesn't wait for ongoing compaction to stop and so it's
much less useful.

Returning after stopping regular compaction jobs and waiting
for them to stop guarantees that no regular compactions job are
running when nodetool disableautocompaction returns successfully.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #10597

* github.com:scylladb/scylla:
  compaction_manager: Make invoking the empty constructor more explicit
  Reapply: "disable_auto_compaction: stop ongoing compactions"
2022-05-18 18:33:12 +03:00
Eliran Sinvani
c5e5692a01 compaction_manager: Make invoking the empty constructor more explicit
The compaction manager's empty constructor is supposed to be invoked
only in testing environment, however, it is easy to invoke it by mistake
from production code.
Here we add a more verbose constructor and making the default compaction
private, the verbose compiler need to be invoked with a tag
for_testing_tag, this will ensure that this constructor will be invoked
only when intended.
The unit tests were changed according to this new paradigm.

Tests: unit (dev)

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-18 14:57:10 +03:00
Eliran Sinvani
c138981286 Reapply: "disable_auto_compaction: stop ongoing compactions"
This is a reapplication of a former commit
4affa801a5 which was reverted by
8e8dc2c930.
This commit is a fixed version of the original where a call to the
compaction_manager constructor accidentally issued (`compaction_manager()`)
instead a call to retrieve a compaction manager reference
(`get_compaction_manager()`), we don't use this function because it
doesn't exist anymore - it existed at the time the patch was written
bu was removed in 9066224cf4 later on,
instead, we just use the private table member _compaction_manager which refs
the compaction manager.

The explanation for the bad effect is probably that a `this` pointer
capture down the call chain, resulted in a use after free which had
an unknown effect on the system. (memory corruption at startup).

Test: unit (dev,debug)
      write performance test as the one used to find the bug.
A screenshot of the performance test can be found at
https://github.com/scylladb/scylla/issues/10146/#issuecomment-1129578381

Fixes #9313
Refs #10146

For completeness, the original commit message was:
The api call disables new regular compaction jobs from starting
but it doesn't wait for ongoing compaction to stop and so it's
much less useful.

Returning after stopping regular compaction jobs and waiting
for them to stop guarantees that no regular compactions job are
running when nodetool disableautocompaction returns successfully.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-18 14:57:10 +03:00
cvybhu
345e89756b cql3: Add null and unset checks in collection validation
Validating a collection should ensure that there
are no null or unset values inside the collection.

The validation already fails in case of such values,
but it does so in an ugly way.
Length of null and unset value is negative but is
cast to unsigned size_t. Then it tries to read
a really large value and fails with marshalling error.

The new checks are a better way to handle this.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-18 11:05:14 +02:00
Benny Halevy
9e1c76ea9e test: frozen_mutation_test: test_writing_and_reading_gently: log detailed error
The nested exception is also interesting
and boost doesn't print it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-18 08:21:54 +03:00
Botond Dénes
6fd1322bf3 test/boost/mutation_test: test_query_digest: use the same now everywhere
This test started failing sporadically of late. This failure is seen
quite often in CI tests but is very hard to reproduce locally. The
problem seems to be timing related, as the same seeds that fail in CI
don't fail locally. This patch is a speculative fix. The test has a
single time-related components: `gc_clock::now()`. This is invoked in 4
different places during a single iteration, giving ample opportunity for
off-by-one errors to appear. Although there is no solid proof for this
being the problem, this is a good candidate. This patch replaces all
those different invocations, with a single one per test: this value is
then propagated to all places that need it.

Fixes: #10554

Marking the patch as a fix for the issue, if the problem re-surfaces
after this patch we'll re-poen it.

Closes #10589
2022-05-17 16:25:04 +03:00
Takuya ASADA
883b97d8b2 dist/common/scripts: generate debug log when exception occurred
Using traceback_with_variables module, generate more detail traceback
with variables into debug log.
This will help fixing bugs which is hard to reproduce.

Closes #10472

[avi: regenerate frozen toolchain]
2022-05-17 13:18:27 +03:00
Raphael S. Carvalho
ca322fb7c2 compaction_manager: Quickly abort maintenance compaction waiting for its turn
Today, aborting a maintenance compaction like major, which is waiting for
its turn to run, can take lots of time because compaction manager will
only be able to bail out the task once it gets the "permit" from the
serialization mechanism, i.e. semaphore. Meaning that the command that
started the task will only complete after all this time waiting for
the "permit".

To allow a pending maintenance compaction to be quickly aborted, we
can use the abortable variant of get_units(). So when user submits an
abortion request, get_units() will be able to return earlier through
the abort exception.

Refs #10485.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10581
2022-05-17 13:14:51 +03:00
Avi Kivity
a8f9ed56fb Merge 'cql3: Replace relation class with expression' from Jan Ciołek
Before this change parser used to output instances of the `relation` class, which were later converted to `restrction`.
`relation` took care of initial processing such as preparing and some validation checks.

This PR aims to remove the `relation` class and perform it's functionality using only `expression`.
This is a step towards removing the legacy classes and converting all AST analysis to work on `expressions`.

Closes #10409

* github.com:scylladb/scylla:
  cql3: Remove scalar from bind_variable_scalar_prepare_expression
  cql3: expr: Remove shape_type from bind_variable
  cql3: Remove prepare_expression_multi_column
  cql3: Remove relation class
  cql3: Add more tests for expr::printer
  cql3: Make parser output expression for relations
  cql3: expr: add printer for expression
  cql3: expr: expr::to_restriction: Handle token relations
  cql3: expr: expr::to_restriction: Handle multi column relations
  cql3: expr: Add expr::to_restriction for single column relations
  cql3: expr: Add prepare_binary_operator
  cql3: expr: Change how prepare_expression handles bind_variable
  clq3: expr: Add columns to expr::token struct
  cql3: expr: Modify list_prepare_expression to handle lists of IN values
  cql3: expr: Add expr::as_if for non-const expressions
2022-05-17 12:51:54 +03:00
Takuya ASADA
00ce34c29b scylla_prepare: describe error more correctly
Currently our error message on scylla_prepare says "Exception occurred
while creating perftune.yaml", even perftune.yaml is already generated,
and error occurred after that.
To describe error more correctly, add another error message after
perftune.yaml generated.

see scylladb/scylla-enterprise#2201

Closes #10575
2022-05-16 20:05:58 +03:00
cvybhu
21453ac9a4 cql3: Remove scalar from bind_variable_scalar_prepare_expression
There is now only one function to prepare bind_variable,
so we can remove 'scalar' from its name.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
9b49d27a8d cql3: expr: Remove shape_type from bind_variable
shape_type was used in prepare_expression to differentiate
between a few cases and create the correct receivers.
This was used by the relation class.

Now creating the correct receiver has been delegated to the caller
of prepare_expression and all bind_variables can be handled
in the same simple way.

shape_type is not needed anymore.

Not having it is better because it simplifies things.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
c0fc82d4be cql3: Remove prepare_expression_multi_column
This function was used by multi_column_relation.hh,
but now it isn't needed anymore.

The only way to prepare a bind_variable is now the standard prepare_expression.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
d85f680df3 cql3: Remove relation class
Functionality of the relation class has been replaced by
expr::to_restriction.

Relation and all classes deriving from it can now be removed.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
575c4bd76b cql3: Add more tests for expr::printer
Now that parser outputs expressions
it's much easier to check whether
expression printer works correctly.

We can prepare a bunch of strings
which will be parsed and then printed
back to string.
Then we can compare those strings.

It's much easier than creating
expresions to print manually.

The only downside is that this tests
only unprepared version of expression,
so instead of column_value there will
be unresolved identifier, insted of constant
untyped_constant etc.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
51cdbdeacb cql3: Make parser output expression for relations
Parser used to output the where clause as a vector of relations,
but now we can change it to a vector of expressions.

Cql.g needs to be modified to output expressions instead
of relations.

The WHERE clause is kept in a few places in the code that
need to be changed to vector<expression>.

Finally relation->to_restriction is replaced by expr::to_restriction
and the expressions are converted to restrictions where required.

The relation class isn't used anywhere now and can be removed.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
Michał Sala
f6bdc4d694 cql3: expr: add printer for expression
expression::printer is used to print CQL expressions
in a pretty way that allows them to be parsed back
to the same representation.

There is a bunch of things that need to be changed when
compared to the current implementation of opreatorr<<(expression)
to output something parsable.

column names should be printed without 'unresolved_identifier()'
and sometimes they need to be quoted to perserve case sensitivity.

I needed to write new code for printing constant values
because the current one did debug printing
(e.g. a set was printed as '1; 2; 3').

A list of IN values should be printed inside () intead of [],
but because it is internally represented as a list it is
by default printed with [].
To fix this a temporary tuple_constructor is created and printed.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
c4f846dbc8 cql3: expr: expr::to_restriction: Handle token relations
Implement converting token relations to expressions.

The code is mostly tekken from functions in token_relation.hh,
because we are replicating functionliaty of the functions called
token_relation::new_XX_restrictions.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
5fc5012f9b cql3: expr: expr::to_restriction: Handle multi column relations
Implement converting multi column relations to expressions.

The code is mostly taken from functions in multi_column_relation.hh,
because we are replicating functionality of the functions called
multi_column_relation::new_XX_restriction.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:58 +02:00
cvybhu
89950e02b5 cql3: expr: Add expr::to_restriction for single column relations
Add a function that will be used to convert expressions
received from the parser to restrictions.

Currently parser creates relations with expressions inside
and then those relations are converted to restrictions.

Once this function is implemented we will be able to skip
creating relations altogether and convert straight from
expression to restriction. This will allow us to remove
the relation class.

Further functionality will be implemented in the following commits.
This commit implements converting single column relations to expressions.

The code is mostly taken from functions in single_column_relation.hh,
because we are replicating functionality of the functions called
single_column_relation::new_XX_restriction.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:17:57 +02:00
cvybhu
3e5e5c4a17 cql3: expr: Add prepare_binary_operator
Add a function that allows to prepare
a binary_operator received from the parser.

It resolves columns on the LHS, calculates type of LHS,
and prepares RHS with the correct type.

It will be used by expr::to_restriction.

Some basic type checks are performed, but more throughout
checks will be required in expr::to_restriction to fully
validate a relation.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:15:37 +02:00
cvybhu
5dee55d433 cql3: expr: Change how prepare_expression handles bind_variable
The situation with preparing bind_variable is a bit strange,
there are four shapes of bind variables and receiver behaviour
is not in line with other types.

To prepare a bind_variable for a list of IN values for an int column
the current code requires us to pass a receiver of type int.
This is counterintuitive, to prepare a string we pass
a receiver with string type, so to prepare list<int> we should
pass a receiver of type list<int>, not just int.

This commit changes the behaviour in two ways:
- Shape of bind_variable doesn't matter anymore
- The bind_variable gets the receiver passed to prepare_expression,
  no more list<receiver> magic.

Other variants of bind_variable_x_prepare_expression are not removed yet
because they are needed by prepare_expression_mutlti_column.
They will be removed later, along with bind_variable::shape_type.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:15:36 +02:00
cvybhu
be6e741b6c clq3: expr: Add columns to expr::token struct
The expr::token struct is created when something
like token(p1, p2) occurs in the WHERE clause.

Currently expr::token doesn't keep columns passed
as arguemnts to the token function.

They weren't needed because token() validation
was done inside token_relation.

Now that we want to use only expressions
we need to have columns inside the token struct
and validate that those are the correct columns.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:03:11 +02:00
cvybhu
b99aae7d41 cql3: expr: Modify list_prepare_expression to handle lists of IN values
The standard CQL list type doesn't allow for nulls inside the collection.

However lists of IN values are the exception where bind nullsare allowed,
for example in restrictions like: p IN (1, 2, null)

To be able to use list_prepare_expression with lists of IN values
a flag is added to specify whether nulls should be allowed.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:03:11 +02:00
cvybhu
2b5818697a cql3: expr: Add expr::as_if for non-const expressions
expr::as_if is our wrapper for std::get_if.

There was a version for const expression*,
but there weren't one for mutable expression*.

Add the mutable version,
it will be needed in the following commits.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>
2022-05-16 18:03:11 +02:00
Pavel Emelyanov
f81f1c7ef7 format-selector: Remove .sync() point
The feature listener callbacks are waited upon to finish in the
middle of the cluster joining process. I particular -- before
actually joining the cluster the format should have being selected.
For that there's a .sync() method that locks the semaphore thus
making sure that any update is finished and it's called right after
the wait_for_gossip_to_settle() finishes.

However, features are enabled inside the wait_for_gossip_to_settle()
in a seastar::async() context that's also waited upon to finish. This
waiting makes it possible for any feature listener to .get() any of
its futures that should be resolved until gossip is settled.

Said that, the format selection barrier can be moved -- instead of
waiting on the semaphore, the respective part of the selection code
can be .get()-ed (it all runs in async context). One thing to care
about -- the remainder should continue running with the gate held.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-16 14:14:14 +03:00
Pavel Emelyanov
7fee50f1e3 format-selector: Coroutinize maybe_select_format()
This method is run when a feature is enabled. It's a bit trickier than
the others, also there are two methods actually, that are merged into
one by this patch. By and large most of the care is about the _sel
gate and _sem semaphore.

The gate protects the whole selection code from the selector being freed
from underneath it on stop. The semaphore is only needed to keep two
different format selections from each other -- each update the system
keyspace, local variable and replica::database instance on all shards.
In the end there's a gossiper update, but it happens outside of the
semaphore.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-16 14:13:59 +03:00
Pavel Emelyanov
93df88aac4 format-selector: Coroutinize simple methods
These all are just straightfowrard usage of co_await's around the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-16 14:13:59 +03:00
Nadav Har'El
a2b9a17927 test/cql-pytest: add test for default clustering order of SELECT
Our documentation for SELECT,
https://docs.scylladb.com/getting-started/dml/#ordering-results
says that:

  "The ORDER BY clause lets you select the order of the returned results.
   It takes as argument a list of column names along with the order for
   the column (ASC for ascendant and DESC for descendant,
   **omitting the order being equivalent to ASC**)."

The test in this patch confirms that the last emphasized line is not
accurate - The default order for SELECT is the default order of the table
being read - NOT always ascending order. If the table was created with
descending WITH CLUSTERING ORDER BY, then a SELECT not specifying
an ORDER BY will get this descending order by default.

The test passes on both Scylla and Cassandra, demonstrating that this
behavior is expected and correct - regardless of what our docs say.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220515115030.775813-1-nyh@scylladb.com>
2022-05-16 11:52:02 +02:00
Avi Kivity
45f75ef595 service/memory_limiter.hh: correct license
When split from service/storage_service.hh (4ca2ae13) it accidentally
changed license. Change it back (since it does not contain
Apache derived code, constrain it to AGPL-3.0-or-later).

Closes #10572
2022-05-16 10:01:06 +03:00
Benny Halevy
8a6f8c622d sstables: writer: pass bytes_ostream by reference
The bytes_stream param is passed by value from `write_promoted_index`
(since 0d8463aba5)
causing an uneeded copy.

This can lead to OOM if the promoted index is extremely large.

Pass the bytes_ostream by reference instead to prevent this copy.

Fixes #10569

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10570
2022-05-15 17:53:38 +03:00
Avi Kivity
528ab5a502 treewide: change metric calls from make_derive to make_counter
make_derive was recently deprecated in favor of make_counter, so
make the change throughput the codebase.

Closes #10564
2022-05-14 12:53:55 +02:00
Piotr Sarna
768b5f3f29 utils: mark loading_cache::shrink as noexcept
Current code already assumes (correctly), that shrink() does not
throw, otherwise we risk leaking memory allocated in get_ptr():

```
 ts_value_lru_entry* new_lru_entry = Alloc().template allocate_object<ts_value_lru_entry>();

 // Remove the least recently used items if map is too big.
 shrink();
```

Let's be explicit and mark shrink() and a few helper methods
that it uses as noexcept. Ultimately they are all noexcept anyway,
because polymorphic allocator's deallocation routines don't throw,
and neither do boost intrusive list iterators.

Closes #10565
2022-05-13 18:28:58 +03:00
Nadav Har'El
c51a41a885 test/cql-pytest: test for multiple restrictions on same column
It turns out that there is a difference between how Scylla and Cassandra
handle multiple restrictions on the same column - for example "WHERE
c = 0 and c >0". Cassandra treats all such cases as invalid queries,
whereas Scylla *allows* them.

This test demonstrates this difference (it is marked "scylla_only"
because it's a Scylla-only feature), and also verifies that the results
of such queries on Scylla are correct - i.e., if the two restrictions
conflict the result is empty, and if the two restrictions overlap,
the result can be non-empty.

The test passes, verifying that although Scylla differs from Cassandra
on this, its behavior is correct.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220512165107.644932-1-nyh@scylladb.com>
2022-05-13 08:43:57 +03:00
Avi Kivity
0dd5f02022 build: enable ABSL_PROPAGATE_CXX_STD
Recently Abseil started to ask to enable ABSL_PROPAGATE_CXX_STD,
warning that it will do so itself in the future. Do so, and
specify that we use C++20 to avoid inconsistencies.

Closes #10563
2022-05-13 07:12:03 +02:00
Avi Kivity
5937b1fa23 treewide: remove empty comments in top-of-files
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.

Closes #10562
2022-05-13 07:11:58 +02:00
Botond Dénes
1f0d3d57eb Merge 'Convert (almost) all uses of flat_mutation_reader_assertions to v2' from Michael Livshin
"Almost" because 2 uses of the v1 asserter remain (as they are deliberate).

Closes #10518

* github.com:scylladb/scylla:
  tests: remove obsolete utility functions
  tests: less trivial flat_reader_assertions{,_v2} conversions
  tests: trivial flat_reader_assertions{,_v2} conversions
  flat_mutation_reader_assertions_v2: improve range tombstone support
2022-05-13 08:04:20 +03:00
Eliran Sinvani
8e8dc2c930 Revert "table: disable_auto_compaction: stop ongoing compactions"
This reverts commit 4affa801a5.
In issue #10146 a write throughput drop of ~50% was reported, after
bisect it was found that the change that caused it was adding some
code to the table::disable_auto_compaction which stops ongoing
compactions and returning a future that resolves once all the  compaction
tasks for a table, if any, were terminated. It turns out that this function
is used only at startup (and in REST api calls which are not used in the test)
in the distributed loader just before resharding and loading of
the sstable data. It is then reanabled after the resharding and loading
is done.
For still unknown reason, adding the extra logic of stopping ongoing
compactions made the write throughput drop to 50%.
Strangely enough this extra logic **should** (still unvalidated) not
have any side effects since no compactions for a table are supposed to
be running prior to loading it.
This regains the performance but also undo a change which eventually
should get in once we find the actual culprit.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #10559

Reopens #9313.
2022-05-12 18:51:25 +03:00
Benny Halevy
333f6c5ec9 mutation_partition_view: do_accept_gently: keep clustering_row key on stack
We're hitting a unit test failure as in
https://jenkins.scylladb.com/view/master/job/scylla-master/job/build/1010/artifact/testlog/aarch64_dev/frozen_mutation_test.test_writing_and_reading_gently.918.log
```
unknown location(0): fatal error: in "test_writing_and_reading_gently": std::_Nested_exception<std::runtime_error>: frozen_mutation::unfreeze_gently(): failed unfreezing mutation pk{00801806000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000} of ks.cf
```

on aarch64 clang 12.0.1, in release and dev modes (but not debug).

This turned out to be a miscompilation
in `position_in_partition_view::for_key(cr.key())`
that returns a position_in_partition_view of the
clustering_key_prefix rvalue that cr.key() returns.

The latter is lost on aarch64 in release mode.
Keeping the key on the stack allows to safely pass
a view to it.

Fixes #10555

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-12 16:26:07 +03:00
Piotr Sarna
eb6f4cc839 Merge 'dependencies: add rust' from Wojciech Mitros
The main reason for adding rust dependency to scylla is the
wasmtime library, which is written in rust. Although there
exist c++ bindings, they don't expose all of its features,
so we want to do that ourselves using rust's cxx.

The patch also includes an example rust source to be used in
c++, and its example use in tests/boost/rust_test.

The usage of wasmtime has been slightly modified to avoid
duplicate symbol errors, but as a result of adding a Rust
dependency, it is going to be removed from `configure.py`
completely anyway

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #10341

* github.com:scylladb/scylla:
  docs: document rust
  tests: add rust example
2022-05-12 15:24:58 +02:00
Michael Livshin
00ed4ac74c batchlog_manager: warn when a batch fails to replay
Only for reasons other than "no such KS", i.e. when the failure is
presumed transient and the batch in question is not deleted from
batchlog and will be retried in the future.

(Would info be more appropriate here than warning?)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #10556
2022-05-12 13:34:03 +03:00
Asias He
7a38b806be repair: Trigger off strategy compaction after all ranges of a table is repaired
When the repair reason is not repair, which means the repair reason is
node operations (bootstrap, replace and so on), a single repair job contains all
the ranges of a table that need to be repaired.

To trigger off strategy compaction early and reduce the number of
temporary sstable files on disk, we can trigger the compaction as soon
as a table is finished.

Refs: #10462
2022-05-12 10:46:11 +08:00
Asias He
3dc9a81d02 repair: Repair table by table internally
This patch changes the way a repair job walks through tables and ranges
if multiple tables and ranges are requested by users.

Before:

```
for range in ranges
   for table in tables
       repair(range, table)
```

After:

```
for table in tables
    for range in ranges
       repair(range, table)
```

The motivation for this change is to allow off-strategy compaction to trigger
early, as soon as a table is finished. This allows to reduce the number of
temporary sstables on disk. For example, if there are 50 tables and 256 ranges
to repair, each range will generate one sstable. Before this change, there will
be 50 * 256 sstables on disk before off-strategy compaction triggers. After this
change, once a table is finished, off-strategy compaction can compact the 256
sstables. As a result, this would reduce the number of sstables by 50X.

This is very useful for repair based node operations since multiple ranges and
tables can be requested in a single repair job.

Refs: #10462
2022-05-12 10:46:11 +08:00
Wojciech Mitros
cb5d054a67 docs: document rust
Using Rust in Scylla is not intuitive, the doc explains the entire
process of adding new Rust source files to Scylla. What happens
during compilation is also explained.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-05-11 16:49:31 +02:00
Wojciech Mitros
4ad012cb6a tests: add rust example
The patch includes an example rust source to be used in
c++, and its example use in tests/boost/rust_test.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-05-11 16:49:31 +02:00
Botond Dénes
c3eddab976 scylla-gdb.py: add scylla sstable-index-cache command
Print currently cached index pages for the given sstable.
Example output:

    (gdb) scylla sstable-index-cache $sst
    [0]: [
      { key: 63617373616e647261, token: 356242581507269238, position: 0 }
    ]
    Total: 1 page(s) (1 loaded, 0 loading)
2022-05-11 16:17:40 +03:00
Botond Dénes
2f3f07881b scylla-gdb.py: add scylla sstable-summary command
Print content of sstable summary. Example output:

    (gdb) scylla sstable-summary $sst
    header: {min_index_interval = 128, size = 1, memory_size = 21, sampling_level = 128, size_at_full_sampling = 0}
    first_key: 63617373616e647261
    last_key: 63617373616e647261
    [0]: {
      token: 356242581507269238,
      key: 63617373616e647261,
      position: 0}
2022-05-11 16:16:24 +03:00
Botond Dénes
7af107b39d test/scylla-gdb: add sstable fixture 2022-05-11 16:16:05 +03:00
Benny Halevy
4a5842787e memtable_list: clear_and_add: let caller clear the old memtables
As a follow up on b8263e550a,
make clear_and_add synchronous yet again, and just return
the swapped list of memtables so that the caller (table::clear)
can clear them gently.

Refs https://github.com/scylladb/scylla/pull/10424#discussion_r867455056

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10540
2022-05-11 14:46:30 +02:00
Botond Dénes
eeca9f24e8 Merge 'Docs: improve debugging.md' from Benny Halevy
This series update debugging.md with:
- add an example .gdbinit file
- update recommendation for finding the relocatable packages using a build-id on http://backtrace.scylladb.com/

Closes #10492

* github.com:scylladb/scylla:
  docs: debugging.md: update instructions regarding backtrace.scylladb.com
  docs: debugging.md: add a sample gdbinit file
2022-05-11 14:46:30 +02:00
Nadav Har'El
b2450886d7 Merge 'Debugging.md relocatable package updates' from Botond Dénes
Drop the section about non-relocatable packages. They are not a thing anymore.
Also tweaked the instructions for launching the toolchain container.

Closes #10539

* github.com:scylladb/scylla:
  docs/debugging.md: adjust instructions for using the toolchain
  docs/debugging.md: drop section about handling binaries from non-relocatable packages
2022-05-11 14:46:30 +02:00
Takuya ASADA
a9dfe5a8f4 scylla_sysconfig_setup: handle >=32CPUs correctly
Seems like 59adf05 has a bug, the regex pattern only handles first
32CPUs cpuset pattern, and ignores rest.
We should extend regex pattern to handle all CPUs.

Fixes #10523

Closes #10524
2022-05-11 14:46:30 +02:00
Nadav Har'El
043b1c7f89 Update seastar submodule. Unfortunately, also requires two changes
to Scylla itself to make it still compile - see below

* seastar 5e863627...96bb3a1b (18):
  > install-dependencies: add rocky as a supported distro
  > circleci: relax docker limits to allow running with new toolchain
  > core: memory: Add memory::free_memory() also in Debug mode
  > build: bump up zlib to 1.2.12
  > cmake: add FindValgrind.cmake
  > Merge 'seastar-addr2line: support sct syslogs' from Benny Halevy
  > rpc: lower log level for 'failed to connect' errors
  > scripts: Build validation
  > perftune.py: remove rx_queue_count from mode condition.
  > memory: add attributes to memalign for compatibility with glibc 2.35
  > condition-variable: Fix timeout "when" potentially not killing timer
  > Merge "tests: perf: measure coroutines performance" from Benny
  > Merge: Refine COUNTER metrics
  > Revert "Merge: Refine COUNTER metrics"
  > reactor: document intentional bitwise-on-bool op in smp_pollfn::poll()
  > Merge: Refine COUNTER metrics
  > SLES: additionally check irqbalance.service under /usr/lib
  > rpc_tester: job_cpu: mark virtual methods override

Changes to Scylla also included in this merge:

1. api: Don't export DERIVEs (Pavel Emelyanov)

Newer seastar doesn't have DERIVE metrics, but does have REAL_COUNTER
one. Teach the collectd getter the change.

(for the record: I don't understand how this endpoing works at all,
there's a HISTOGRAM metrics out there that would be attempted to get
exposed with the v.ui() call which's totally wrong)

2. test: use linux_perf_events.{cc,hh} from Seastar

Seastar now has linux_perf_events.{cc,hh}. Remove Scylla's version
of the same files and use Seastar's. Without this change, Scylla
fails to compile when some source files end up including both
versions and seeing double definitions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-11 14:46:30 +02:00
Piotr Sarna
209c2f5d99 sstables: define generation_type for sstables
No functional changes intended - this series is quite verbose,
but after it's in, it should be considerably easier to change
the type of SSTable generations to something else - e.g. a string
or timeUUID.

Closes #10533
2022-05-11 14:46:30 +02:00
Beni Peled
3abe4a2696 Adjust scripts/pull_github_pr.sh to check tests status
Closes #10263

Closes #10264
2022-05-11 14:46:30 +02:00
Botond Dénes
7501a075bd sstables/index_reader: push down eof() check to advance_to(index_bound&, dht::ring_position_view)
Commit e8f3d7dd13 added eof() checks to public partition-level
advance_to() methods, to ensure we do not attempt to re-read the last
page of the index when at eof(). It was noted however that this check
would be safer in advance_to(index_bound&, dht::ring_position_view)
because that is the method that all these higher-level methods end up
calling. Placing the check there would guarantee safety for all such
operations. This path does exactly that: it pushes down the check to
said method. One change needed for this to work is to check eof on the
bound that is currently advanced, instead of unconditionally checking
the lower bound.

Closes #10531
2022-05-11 14:46:30 +02:00
Asias He
77b1db475c locator: Do not enforce public ip address for broadcast_rpc_address
Reported by Felipe Cardeneti:

- Create a 2-node Scylla cluster w/ Ec2MultiRegionSnitch
- Check system.peers table

Scylla (uses public address)
```
cqlsh> select peer,data_center,host_id,preferred_ip,rack,rpc_address,schema_version from system.peers;

peer          | data_center | host_id                              | preferred_ip  | rack | rpc_address   | schema_version
---------------+-------------+--------------------------------------+---------------+------+---------------+--------------------------------------
18.216.98.219 |   us-east-2 | d9443741-a12e-4bbb-91ce-9931cece589c | 172.31.43.122 |   2c | 18.216.98.219 | 95c3fca5-c463-3aba-98c6-1c0b3fac5b58

(1 rows)
```

Cassandra (uses local address):
```
cqlsh> SELECT peer,data_center,host_id,preferred_ip,rack,rpc_address,schema_version from system.peers;

peer          | data_center | host_id                              | preferred_ip  | rack       | rpc_address   | schema_version
---------------+-------------+--------------------------------------+---------------+------------+---------------+--------------------------------------
52.15.104.255 |   us-east-2 | 42c0b717-775f-4998-a420-0388fe8b4e70 | 172.31.42.126 | us-east-2c | 172.31.42.126 | 2207c2a9-f598-3971-986b-2926e09e239d

(1 rows)
```

Config diff:
```
cassandra.yaml:rpc_address: 0.0.0.0
cassandra.yaml:broadcast_rpc_address: 172.31.42.126
/etc/scylla/scylla.yaml:broadcast_rpc_address: 172.31.42.126
/etc/scylla/scylla.yaml:rpc_address: 0.0.0.0
```

After this patch, if broadcast_rpc_address is unset, Ec2MultiRegionSnitch
will use the public ip address to set broadcast_rpc_address. If
broadcast_rpc_address is set, Ec2MultiRegionSnitch will not modify it.

Fixes #10236

Closes #10519
2022-05-11 14:46:30 +02:00
Tomasz Grabiec
f703e8ded5 Merge 'New failure detector for Raft' from Kamil Braun
We introduce a new service that performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrinked. To learn about liveness of endpoints, user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds it's immediately marked as alive.

Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger` is
responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.

Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.

The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.

We modify the randomized nemesis test to use the new service.
The service is sharded, but for simplicity of implementation in the test
we implement rpcs and sleeps by routing the requests to shard 0, where
logical timers and network live. rpcs are using the existing simulated
network and clock using the existing logical timers.

We also integrate the service with production code. There,
`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.

Production `clock` is a simple implementation which uses
`std::chrono::steady_clock` and `seastar::sleep_until` underneath.
Translating `steady_clock` durations to `direct_fd::clock` durations happens
by taking the number of ticks.

We connect the group 0 raft server rpc implementation to the new service,
so that when servers are added or removed from the the group 0 configuration,
corresponding endpoints are added to the direct failure detector service.
Thus the set of detected endpoints will be equal to the group 0 configuration.

On each shard, we register a listener for the service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.

---

v6:
- remove `_alive_start_index`. Instead, keep a map of `bool`s to track liveness of each endpoint. See the code for details (`listeners_liveness` struct and its usage in `ping_fiber()`, `notify_fiber()`, `add/remove_worker`, `add/remove_listener`). The diff is easy to read: f617aeca62..d4b225437c

v5:
- renamed `rpc` to `pinger`
- replaced `bool` with `enum class endpoint_update` (with values `added` and `removed`) in `_endpoint_updates`
- replaced `unsigned` with `shard_id`
- fixed definition of `threshold(size_t n)` (it didn't use `n`, but `_alive_start`; fortunately all uses passed `_alive_start` as `n` so the bug wouldn't affect the behavior)
- improve `_num_workers` assertions
- signal `_alive_start_changed` only when `_alive_start` indeed changed
- renamed `{_marked}_alive_start` to `{_marked}_alive_start_index`

v4:
- rearrange ping_fiber(). Remove the loop at the end of the big `while`
  which was timing out listeners (after the sleep). Instead:
    - rely on the loop before the sleep for timing out listeners
    - before calling ping(), check if there is a timed out listener,
      if so abandon the ping, immediately proceed to the timing-out-listeners
      loop, and then immediately proceed to the next iteration of the big `while`
      (without sleeping)
- inline send_mark_dead() and send_mark_alive(); each was used in
  exactly one place after the rearrangement
- when marking alive, instead of repeatedly doing `--_alive_start` and
  signalling the condition variable, just do `_alive_start = 0` and signal
  the condition variable once
- fix the condition for stopping `endpoint_worker::notify_fiber()`: before, it was
  `_as.abort_requested()`, now it is `_as.abort_requested() && _alive_start == _fd._listeners.size()`.
  Indeed, we want to wait for the stopping code (`destroy_worker()`)
  to set `_alive_start = _fd._listeners.size()` before `notify_fiber()`
  finishes so `notify_fiber()` can send the final `mark_dead`
  notifications for this endpoint. There was a race before where
  `notify_fiber()` could finish before it sent those notifications
  (because it finished as soon as it noticed `_as.abort_requested()`)
- fix some waits in the unit test; they depended on particular ordering
  of tasks by the Scylla reactor, the test could sometimes hang in debug
  mode which randomizes task order
- fix `rpc::ping()` in randomized_nemesis_test so it doesn't give an
  exceptional discarded future in some cases

v3:
- fix a race in failure_detector::stop(): we must first wait for _destroy_subscriptions fiber to finish on all shards, only then we can set _impl to nullptr on any shard
- invoke_abortable_on was moved from randomized_nemesis_test to raft/helpers
- add a unit test (second patch)

v2:
- rename `direct_fd` namespace to `direct_failure_detector`
- move gms/direct_failure_detector.{cc,hh} to direct_failure_detector/failure_detector.{cc,hh}
- cleaned license comments
- removed _mark_queue for sending notifications from ping_fiber() to notify_fiber(). Instead:
    - _listeners is now a boost::container::flat_multimap (previously it was std::multimap)
    - _alive_start is no longer an iterator to _listeners, but an index (size_t)
    - _mark_queue was replaced with a second index to _listeners, _marked_alive_start, together with a condition variable, _alive_start_changed
    - ping_fiber() signals _alive_start_changed when it changes _alive_start
    - notify_fiber() waits on _alive_start_changed. When it wakes up, it compares _marked_alive_start to _alive_start, sends notifications to listeners appropriately, and updates _marked_alive_start
- replacing _mark_queue with index + condition variable allowed some better exception specifications: send_mark_alive and send_mark_dead are now noexcept, ping_fiber() is specified to not return exceptional futures other than sleep_aborted which can only happen when we destroy the worker (previously, ping_fiber() could silently stop due to exception happening when we insert to _mark_queue - it could probably only be bad_alloc, but still)
- _shard_workers is now unordered_map<endpoint_id, endpoint_worker> instead of unordered_map<endpoint_id, unique_ptr<endpoint_worker>> (after learning how to construct map values in place - using either `emplace`+`forward_as_tuple` or `try_emplace`)
- `failure_detector::impl::add_endpoint` now gives strong exception guarantee: if an exception is thrown, no state changes
- same for `failure_detector::impl::remove_endpoint`
- `failure_detector::impl::create_worker` now uses `on_internal_error` when it detects that there is a worker for this endpoint already - thanks to the strong exception guarantees of `add_endpoint` and `remove_endpoint` this should never happen
- comment at _num_workers definition why we maintain this statistic (to pick a shard with smallest number of workers)
- remove unnecessary `if (_as.abort_requested())` in `ping_fiber()`
- in ping_fiber(), after a ping, we send notifications to listeners which we know will time-out before the next ping starts. Before, we would sleep until the threshold is actually passed by the clock. Now we send it immediately - we know ahead of time that the listener will time-out and we can notify it immediately.
- due to above, comment at `register_listener` was adjusted, with the following note added: "Note: the `mark_dead` notification may be sent earlier if we know ahead of time that `threshold` will be crossed before the next `ping()` can start."
- `register_listener` now takes a `listener&`, not `listener*`
- at `register_listener` comment why we allow different thresholds (second to last paragraph)
- at `register_listener` mention that listeners can be registered on any shard (last paragraph)
- add protected destructors to rpc, clock, listener, and mention that these objects are not owned/destroyed by `failure_detector`.
- replaced _endpoint_queue (seastar::queue<pair<endpoint_id, bool>>) with unordered_map<endpoint_id, bool> + condition variable. When user calls add/remove_endpoint, an entry is inserted to this map, or existing entry is updated, and the condition variable is signaled. update_endpoint_fiber() waits on the condition variable, performs the add/remove operation, and removes entries from this map. Compared to the previous solution:
    - the new solution has at most one entry for a given endpoint, so the number of entries is bounded by the number of different endpoints (so in the main Scylla use case, by the number of different nodes that ever exist); the previous solution could in theory have a backlog of unprocessed events, with updates for a given endpoint appearing multiple times in the queue at once
    - when the add/remove operation fails in update_endpoint_fiber(), we don't remove the entry from the map so the operation can be retried later. Previously we would always remove the entry from the queue so it doesn't grow too big in presence of failures.
    - when the add/remove operation fails in update_endpoint_fiber(), we sleep for 10*ping_period before retrying. Note that this codepath should not be reached in practice, it can basically only happen on bad_alloc
- commented that `clock::sleep_until` should signalize aborts using `sleep_aborted`
- `clock::now()` is `noexcept`
- `add/remove_endpoint` can be called after `stop()`, they just won't do anything in that case. Reason: next item
- in randomized_nemesis_test, stop failure detector before raft server (it was the other way before), so it stops using server's RPC before server is aborted. Before, the log was spammed with errors from failure detector because failure detector was getting gate_closed_exceptions from the RPC when the server was stopped. A side effect is that the raft server may continue adding/removing endpoints when the failure detector is stopped, which is fine due to above item
- randomized_nemesis_test: direct_fd_clock::sleep_until translates abort_requested_exception to sleep_aborted (so sleep_until satisfies the interface specification)
- message/rpc_protocol_impl: send_message_abortable: if abort_source::subscribe returns null, immediately throw abort_requested_exception (before we would send the message out and not react to an abort if it happened before we were called)
- rebase

Closes #10437

* github.com:scylladb/scylla:
  service: raft: remove `raft_gossip_failure_detector`
  service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness
  service: raft: add/remove direct failure detector endpoints on group 0 configuration changes
  main: start direct failure detector service
  messaging_service: abortable version of `send_gossip_echo`
  message: abortable version of `send_message`
  test: raft: randomized_nemesis_test: remove old failure_detector
  test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector`
  test: raft: randomized_nemesis_test: ping all shards on each tick
  test: unit test for new failure detector service
  direct_failure_detector: introduce new failure detector service
2022-05-11 14:46:27 +02:00
Botond Dénes
a4ed7186c0 scylla-gdb.py: make chunked_vector a proper container wrapper"
Add missing __len__() and __iter__() methods.
2022-05-11 15:11:33 +03:00
Botond Dénes
2f785ee8f0 scylla-gdb.py: make small_vector a proper container wrapper"
Add missing __len__() and __iter__() methods.
2022-05-11 15:11:12 +03:00
Botond Dénes
f66077da4b scylla-gdb.py: add sstring container wrapper 2022-05-11 15:10:49 +03:00
Botond Dénes
afa6cfb42a scylla-gdb.py: add chunked_managed_vector container wrapper 2022-05-11 15:10:39 +03:00
Botond Dénes
8048b15c57 scylla-gdb.py: add managed_vector container wrapper 2022-05-11 15:10:17 +03:00
Botond Dénes
2b40ed663e scylla-gdb.py: std_variant: add workaround for clang template bug
Clang and GDB don't see eye to eye on the template arguments of
std::variants. Executables generated by clang are known to yield 0
template arguments when one queries them via the GDB python API.
This patch adds a workaround to the std_variant wrapper, allowing a
caller who knows the type to brute-force getting the variant member with
the known type.
2022-05-11 15:10:11 +03:00
Botond Dénes
f975841702 scylla-gdb.py: add bplus_tree container wrapper 2022-05-11 15:10:03 +03:00
Benny Halevy
4a40e34577 docs: debugging.md: update instructions regarding backtrace.scylladb.com
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-11 10:23:09 +03:00
Benny Halevy
97b002e13e docs: debugging.md: add a sample gdbinit file
This gdbinit contains recommended settings
commonly useful for debugging scylla core dumps.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-11 10:23:08 +03:00
Botond Dénes
f725779957 docs/debugging.md: adjust instructions for using the toolchain
The attached volume doesn't need to be relabeled anymore (`:z` not
needed at the end of the volume attach instructions). This also allows
dropping the `sudo` from the invocation.
2022-05-11 08:25:29 +03:00
Botond Dénes
264db30ca5 docs/debugging.md: drop section about handling binaries from non-relocatable packages
All our releases ship with relocatable packages now. This section is
obsolete (thankfully).
2022-05-11 08:20:14 +03:00
Michael Livshin
1e690e6773 tests: remove obsolete utility functions
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-10 22:10:40 +03:00
Michael Livshin
864882253a tests: less trivial flat_reader_assertions{,_v2} conversions
Dealing with the handful of tests that check range tombstones in
interesting ways and need more than search-and-replace.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-10 22:10:40 +03:00
Michael Livshin
3cc2343775 tests: trivial flat_reader_assertions{,_v2} conversions
(Which entails temporary cut-and-pasting some utility functions)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-10 22:10:40 +03:00
Michael Livshin
51cf84e8c9 flat_mutation_reader_assertions_v2: improve range tombstone support
* Track the active range tombstone.

* Add `may_produce_tombstones()`.

* Flesh out `produces_row_with_key()`.

* Add more trace logs.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-10 22:10:40 +03:00
Nadav Har'El
2c39c4c284 Merge 'Handle errors during snapshot' from Benny Halevy
This series refactors `table::snapshot` and moves the responsibility
to flush the table before taking the snapshot to the caller.

`flush_on_all` and `snapshot_on_all` helpers are added to replica::database
(by making it a peering_sharded_service) and upper layers,
including api and snapshot-ctl now call it instead of calling cf.snapshot directly.

With that, error are handed in table::snapshot and propagated
back to the callers.

Failure to allocate the `snapshot_manager` object is fatal,
similar to failure to allocate a continuation, since we can't
coordinate across the shards without it.

Test: unit(dev), rest_api(debug)

Fixes #10500

Closes #10513

* github.com:scylladb/scylla:
  table: snapshot: handle errors
  table: snapshot: get rid of skip_flush param
  database: truncate: skip flush when taking snapshot
  test: rest_api: storage_service: verify_snapshot_details: add truncate
  database: snapshot_on_all: flush before snapshot if needed
  table: make snapshot method private
  database: add snapshot_on_all
  snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
  snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot
  api: storage_service: increase visibility of snapshot ops in the log
  api: storage_service: coroutinize take_snapshot and del_snapshot
  api: storage_service: take_snapshot: improve api help messages
  test: rest_api: storage_service: add test_storage_service_snapshot
  database: add flush_on_all variants
  test: rest_api: add test_storage_service_flush
2022-05-10 10:52:10 +03:00
Benny Halevy
1d39d803af table: snapshot: handle errors
Turn table::snapshot into a coroutine,
catch exceptions, and return them to the caller.

Make sure that coordination across shards
would not break even if any of the shards hits
an error, by always signaling semaphores other
shards wait on.

All errors except for failing to allocate
the snapshot_manager objects are caught
and propagated back.

Failing to allocate the snapshot_manager is fatal
similar to failing to allocate a continuation
since we can't coordinate across the shards without it,
so abort that fails.

Fixes #10500

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
9e69089306 table: snapshot: get rid of skip_flush param
Now that all callers flush on their own
before calling table::snapshot.

Refs #10500

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
31881273a1 database: truncate: skip flush when taking snapshot
database::truncate already flushes the table
on auto_snapshot so there is never a reason
to flush it again in table::snapshot.

Note that cf.can_flush() is false only if memtables
are empty so there nothing to flush or there is
is no seal_immediate_fn and then table::snapshot
wouldn't be able to flush either.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
fc79787863 test: rest_api: storage_service: verify_snapshot_details: add truncate
Truncate the test table and verify that
the 'live' snapshot size is now non-zero.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
46c950fb31 database: snapshot_on_all: flush before snapshot if needed
flush_on_all shards before taking the snapshot if !skip_flush
so we can get rid of flushing in table::snapshot.

Refs #10500

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
33bd52921e table: make snapshot method private
Only callable by database.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
e1d58d4422 database: add snapshot_on_all
And move the logic from snapshot-ctl down to the
replica::database layer.

A following patch will move the flush phase
from the replica::table::snapshot layer
out to the caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
aa127a2dbb snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
Detecting a secondary index by checking for a dot
in the table name is wrong as tables generated by Alternator
may contain a dot in their name.

Instead detect bot hmaterialized view and secondary indexes
using the schema()->is_view() method.

Fixes #10526

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:44:52 +03:00
Benny Halevy
1fbcdbd2e8 snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot
There is no functional change in this patch.
Only refactoring of the code.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:16:39 +03:00
Benny Halevy
01b1e54e22 api: storage_service: increase visibility of snapshot ops in the log
snapshot operations over the api are rare
but they contain significant state on disk in the
form of sstables hard-linked to the snapshot directories.

Also, we've seen snapshot operations hang in the field,
requiring a core dump to analyse the issue,
while there were no records in the log indicating
when previous snapshot operations were last executed.

This change promotes logging to info level
when take_snapshot and del_snapshot start,
and logs errors if in case they fail.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:15:46 +03:00
Benny Halevy
b9d972d029 api: storage_service: coroutinize take_snapshot and del_snapshot
Before making any further changes in them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:02:52 +03:00
Benny Halevy
10b86ee5bd api: storage_service: take_snapshot: improve api help messages
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:02:47 +03:00
Benny Halevy
e95ecbbea6 test: rest_api: storage_service: add test_storage_service_snapshot
Test the snapshot operations via the rest api.

Added test/rest_api/rest_util.py with
new_test_snapshot that creates a new test snapshot
and automagically deletes it when the `with` block
if exited, similar to new_test_keyspace and new_test_table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 09:56:44 +03:00
Benny Halevy
5b4eb44795 database: add flush_on_all variants
Use by api layer.

Will be used in a later patch to flush
on all shards before taking a snapshot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 09:56:44 +03:00
Benny Halevy
05c7f4b832 test: rest_api: add test_storage_service_flush
Add a basic rest_api test for keyspace_flush.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 09:56:44 +03:00
Nadav Har'El
1c6163d51f Merge 'cql3: expr: allow bind markers in collection literals' from Michał Sala
Allowing bind markers in collection literals is a change which causes minor differences in behavior between Scylla and Cassandra. Despite such an undesirable effect, I think allowing them is a good idea because it makes [refactoring work made by cvybhu](https://github.com/scylladb/scylla/pull/10409) easier - 469d03f8c2.

Also, making Scylla accept a superset of valid Cassandra cql expressions does not make us less compatible (maybe apart from test suit compatibility).

Closes #10457

* github.com:scylladb/scylla:
  test/boost: cql_query_test: allow bound variables in test_list_of_tuples_with_bound_var
  test/boost: cql_query_test: test bound variables in collection literals
  cql3: expr: do not allow unset values inside collections
  cql3: expr: prepare_expr: allow bind markers in collection literals
2022-05-09 19:15:22 +03:00
Botond Dénes
fd27fbfe64 Merge "Add user types carrier helper" from Pavel Emelyanov
"
There's a cql_type_parser::parse() method that needs to get user
types for a keyspace by its name. For this it uses the global
storage proxy instance as a place to get database from. This set
introduces an abstract user_types_storage helper object that's
responsible in providing the user types for the caller.

This helper, in turn, is provided to the parse() method by the
database itself or by the schema_ctxt object that needs parse()
to unfreeze schemas and doesn't have database at those times.

This removes one more get_storage_proxy() call.
"

* 'br-user-types-storage' of https://github.com/xemul/scylla:
  cql_type_parser: Require user_types_storage& in parse()
  schame_tables: Add db/ctxt args here and there
  user_types: Carry storage on database and schema_ctxt
  data_dictionary: Introduce user types storage
2022-05-09 17:38:52 +03:00
Nadav Har'El
ca700bf417 scripts/pull_github_pr.sh: clean up after failed cherry-pick
When pull_github_pr.sh uses git cherry-pick to merge a single-patch
pull request, this cherry-pick can fail. A typical example is trying
to merge a patch that has actually already been merged in the past,
so cherry-pick reports that the patch, after conflict resolution,
is empty.

When cherry-pick fails, it leaves the working directory in an annoying
mid-cherry-pick state, and today the user needs to manually call
"git cherry-pick --abort" to return to the normal state. The script
should it automatically - so this is what we do in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-09 17:23:34 +03:00
Pavel Emelyanov
598ce8111d repair: Handle discarded stopping future
When repair_meta stops it does so in the background and reports back
a shared future into whose shared promise peer it resolves that
background activity. There's a shorter way to forward a future result
into another, even shared, promise. And this method doesn't need to
discard a future.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/253

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-09 17:23:12 +03:00
Pavel Emelyanov
3b4af86ad9 proxy (and suddenly redis): Don't check latency_counter.is_start()
The lcs at those places are explicitly start()ed beforehand. The
is_start() check is necessary when using the latency_counter with a
histogram that may or may not start the counter (this is the case
in several class table methods).

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-09 17:20:41 +03:00
Raphael S. Carvalho
48e3117ebc compaction: move propagate_replacement() into private namespace
propagate_replacement() is an internal function that shouldn't be in
the public interface. No one besides an unit test for incremental
compaction needs it. In the future, I want to revisit incremental
compaction unit test to stop using it and only rely on public
interfaces

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220506171647.81063-1-raphaelsc@scylladb.com>
2022-05-09 16:49:50 +03:00
Kamil Braun
0f7a1179c8 service: raft: remove raft_gossip_failure_detector
It's no longer used, having been replaced
by the direct_failure_detector listener.
2022-05-09 15:31:19 +02:00
Kamil Braun
295aec2633 service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness
On each shard, we register a listener for the new direct failure detector service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.

There is some complexity in between, because we need to translate
direct_failure_detector endpoint_ids to inet_addresses and raft::server_ids
to inet_addreses, but all building blocks are already there.
2022-05-09 15:31:19 +02:00
Kamil Braun
7e4bb68061 service: raft: add/remove direct failure detector endpoints on group 0 configuration changes
We connect the group 0 raft server rpc implementation to the new direct
failure detector service, so that when servers are added or removed from
the the group 0 configuration, corresponding endpoints are added to the
direct failure detector service. Thus the set of detected endpoints will
be equal to the group 0 configuration.

This causes the failure detector service to start pinging endpoints,
but no listeners are registered yet. The following commit changes that.
2022-05-09 15:31:19 +02:00
Kamil Braun
38f65e5a2e main: start direct failure detector service
We add the new direct failure detector to the list of services started
in the Scylla process.

To start the service, we need an implementation of `pinger` and `clock`.

`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.

`clock` is a simple implementation which uses `std::chrono::steady_clock`
and `seastar::sleep_until` underneath. Translating `steady_clock`
durations to `direct_failure_detector::clock` durations happens by taking
the number of ticks.

The service is currently not used, just initialized; no endpoints are
added and no listeners are registered yet, but the following commits
change that.
2022-05-09 13:14:42 +02:00
Kamil Braun
9551256e81 messaging_service: abortable version of send_gossip_echo
Use the new `send_message_abortable` function to implement an abortable
version of `send_gossip_echo`.

These echo messages will be used for direct failure detection.
2022-05-09 13:14:41 +02:00
Kamil Braun
f2548fc3fa message: abortable version of send_message
I want to be able to timeout `send_message`, but not through the
existing `send_message_timeout` API which forces me to use a particular
clock/duration/timepoint type. Introduce a more general
`send_message_abortable` API which gets an `abort_source&`, subscribes
to it, and uses the `rpc::cancellable` interface to cancel the RPC on
abort.

The function is 90% copy-pasta from `send_message{_timeout}`, only the
abort part is new.
2022-05-09 13:14:41 +02:00
Kamil Braun
c15f3a9698 test: raft: randomized_nemesis_test: remove old failure_detector
No longer used.
Split from the previous commit for a better diff.
2022-05-09 13:14:41 +02:00
Kamil Braun
915d329f1f test: raft: randomized_nemesis_test: use direct_failure_detector::failure_detector
Until now the nemesis test used its own failure detector implementation
which used one-way heartbeats.

Switch it to use the new direct failure detection service, which will
also be used in production code. Integrating it does require some work
however as we need to implement the `pinger` and `clock` interfaces
for the failure detector.

The service is sharded, but for simplicity of implementation we
implement rpcs and sleeps by routing the requests to shard 0, where
logical timers and network live.
2022-05-09 13:14:41 +02:00
Kamil Braun
e5fc0681d9 test: raft: randomized_nemesis_test: ping all shards on each tick
Right now the test is running entirely on shard 0, but we want to
introduce a sharded service to the test. The initial naive attempt of
doing that failed because the test would time out (reach the tick limit)
before any work distributed to other shards could even start. The
solution in this commit solves that by synchronizing the shards on each
tick.

When the test is ran with smp=1, the behavior is as before.
2022-05-09 13:14:41 +02:00
Kamil Braun
e4f85cf425 test: unit test for new failure detector service 2022-05-09 13:14:41 +02:00
Kamil Braun
666e5a414d direct_failure_detector: introduce new failure detector service
The new service performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrinked. To learn about liveness of endpoints, user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds it's immediately marked as alive.

Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger`
is responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.

Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.

The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.

Endpoints can be added or removed only through the shard 0 instance of
the service and shard 0 is responsible for coordinating the endpoint
workers. Listeners can be registered on any shard.
2022-05-09 13:14:40 +02:00
David Garcia
3e0f81180e docs: disable link checker
Closes #10434
2022-05-09 12:45:28 +02:00
Avi Kivity
81af9342f1 Merge "Simplify gossiper state map API" from Pavel E
"
There's a enpoint->state map member of the gossiper class. First
ugly thing about it is that the member is public.

Next, there's a whole bunch of helpers around that map that export
various bits of information from it. All of those helpers reshard
to shard-0 to read from the state mape ignoring the fact that the
map is replicated on all shards internally. Also, some of those
helpers effectively duplicate each other for no real gain. Finally,
most of them are specific to api/ code, and open-coding them often
makes api/ handlers shorter and simpler.

This set removes the unused, api-only or trivial state map accessors
and marks the state map itself private (underscore prefix included).

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/233/
"

* 'br-gossiper-sanitize-api-2' of https://github.com/xemul/scylla:
  gossiper: Add underscores to new private members
  code: Indentation fix after previous patch
  gossiper, code: Relax get_up/down/all_counters() helpers
  api: Fix indentation after previous patch
  gossiper, api: Remove get_arrival_samples()
  gossiper, api: Remove get/set phi convict threshold helpers
  gossiper, api: Move get_simple_states() into API code
  gossiper: In-line std::optional<> get_endpoint_state_for_endpoint() overload
  gossiper, api: Remove get_endpoint_state() helpers
  gossiper: Make state and locks maps private
  gossiper: Remove dead code
2022-05-08 22:56:23 +03:00
Avi Kivity
94f677b790 Merge 'sstables/index_reader: short-circuit fast-forward-to when at EOF' from Botond Dénes
Attempting to call advance_to() on the index, after it is positioned at EOF, can result in an assert failure, because the operation results in an attempt to move backwards in the index-file (to read the last index page, which was already read). This only happens if the index cache entry belonging to the last index page is evicted, otherwise the advance operation just looks-up said entry and returns it. To prevent this, we add an early return conditioned on eof() to all the partition-level advance-to methods.
A regression unit test reproducing the above described crash is also added.

Fixes: #10403

Closes #10491

* github.com:scylladb/scylla:
  sstables/index_reader: short-circuit fast-forward-to when at EOF
  test/lib/random_schema: add a simpler overload for fixed partition count
2022-05-08 14:17:40 +03:00
Juliusz Stasiewicz
603dd72f9e CQL: Replace assert by exception on invalid auth opcode
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.

Fixes #10487

Closes #10503
2022-05-08 11:33:58 +03:00
Michał Chojnowski
fb1a9e97c9 cql3: restrictions: statement_restrictions: pass arguments to std::bind_front by reference
Fix an accidental copy of query_options in range_or_slice_eq_null.

Closes #10511
2022-05-08 11:32:53 +03:00
Avi Kivity
1ecb87b7a8 Merge 'Harden table truncate' from Benny Halevy
This series fixes a few issue on the table truncate path:
- "memtable_list: safely futurize clear_and_add"
  - reinstates an async version of table::clear_and_add, just safe against #10421
- a unit test reproducing #10421 was added to make sure the new version is indeed safe.
- "table: clear: serialize with ongoing flush" fixes #10423
- a unit test reproducing #10423 was added

Fixes #10281
Fixes #10423

Test: unit(dev), database_test. test_truncate_without_snapshot_during_{writes,flushes} (debug)

Closes #10424

* github.com:scylladb/scylla:
  test: database_test: add test_truncate_without_snapshot_during_writes
  memtable_list: safely futurize clear_and_add
  table: clear: serialize with ongoing flush
2022-05-08 11:30:21 +03:00
Avi Kivity
287c01ab4d Merge ' sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()' from Michał Chojnowski
primitive_consumer::read_bytes() destroys and creates a vector for every value it reads.
This happens for every cell.
We can save a bit of work by reusing the vector.

Closes #10512

* github.com:scylladb/scylla:
  sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()
  utils: fragmented_temporary_buffer: add release()
2022-05-08 11:26:31 +03:00
Raphael S. Carvalho
8e99d3912e compaction: LCS: don't write to disengaged optional on compaction completion
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup

Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().

_last_compacted_keys is used by regular for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.

Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.

To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.

compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.

Fixes #10378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10508
2022-05-08 11:23:13 +03:00
Raphael S. Carvalho
5682393693 compaction: Fix use-after-move when retrying maintenance compaction
SSTable was moved into descriptor, so on failure, it couldn't be used
without resulting in a segfault. Fix it by not moving sst, and changing
signature to make it explicit we don't want to move the content.

Fixes #10505.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10506
2022-05-08 11:16:55 +03:00
Michał Chojnowski
ddc535a4a2 sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()
read_bytes destroys and creates a vector for every value it reads.
This happens for every cell.
We can save a bit of work by reusing the vector.
2022-05-07 13:04:16 +02:00
Michał Chojnowski
8cfbe9c9c1 utils: fragmented_temporary_buffer: add release()
Add a release() method to fragmented_temporary_buffer.
This method releases the underlying vector to allow for its reuse.
2022-05-07 13:04:16 +02:00
Pavel Emelyanov
9d364f19dc gossiper: Add underscores to new private members
The state map and guarding locks were moved to private and now should have a _ prefix

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 11:32:03 +03:00
Pavel Emelyanov
334d3434e7 code: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
5ac28a29d3 gossiper, code: Relax get_up/down/all_counters() helpers
These helpers count elements in the endpoint state map. It makes sense
to keep them in gossiper API, but it's worth removing the wrappers that
do invoke_on(0). This makes code shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
5f53799ffb api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
0ef33b71ba gossiper, api: Remove get_arrival_samples()
It's empty too, but the API-side conversion probably has some value for
the future, so keep it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
37d392c772 gossiper, api: Remove get/set phi convict threshold helpers
These are empty anyway. API caller can place return stubs itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
ad786d6b4d gossiper, api: Move get_simple_states() into API code
The API method in question just tries to scan the state map. There's no
need in doing invoke_on(0) and in a separate helper method in gossiper,
the creation of the json return value can happen in the API handler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
49dd6b5371 gossiper: In-line std::optional<> get_endpoint_state_for_endpoint() overload
The method helps updating enpoint state in handle_major_state_change by
returning a copy of an endpoint state that's kept while the map's entry
is being replaced with the new state. It can be replaced with a shorter
code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
f278d84cfe gossiper, api: Remove get_endpoint_state() helpers
There are two of them -- one to do invoke_on(0) the other one to get the
needed data. The former one is not needed -- the scanned endpoint state
map is replicated accross shards and is the same everywhere. The latter
is not needed, because there's only one user of it -- the API -- which
can work with the existing gossiper API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
0aea43a245 gossiper: Make state and locks maps private
Locks are not needed outside gossiper, state map is sometimes read from,
but there a const getter for such cases. Both methods now desrve the
underbar prefix, but it doesn't come with this short patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
690b21aa4d gossiper: Remove dead code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Botond Dénes
9623589c77 Merge 'Futurize data_read_resolver::resolve and to_data_query_result' from Benny Halevy
This series futurizes two synchronous functions used for data reconciliation:
`data_read_resolver::resolve` and `to_data_query_result` and does so
by introducing lower-level asynchronous infrastructure:
`mutation_partition_view::accept_gently`,
`frozen_mutation::unfreeze_gently` and `frozen_mutation::consume_gently`,
and `mutation::consume_gently`.

This trades some cycles on this cold path to prevent known reactor stalls.

Fixes #2361
Fixes #10038

Closes #10482

* github.com:scylladb/scylla:
  mutation: add consume_gently
  frozen_mutation: add consume_gently
  query: coroutinize to_data_query_result
  frozen_mutation: add unfreeze_gently
  mutation_partition_view: add accept_gently methods
  storage_proxy: futurize data_read_resolver::resolve
2022-05-06 10:23:02 +03:00
Botond Dénes
e8f3d7dd13 sstables/index_reader: short-circuit fast-forward-to when at EOF
Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
2022-05-05 14:42:37 +03:00
Botond Dénes
98f3d516a2 test/lib/random_schema: add a simpler overload for fixed partition count
Some tests want to generate a fixed amount of random partitions, make
their life easier.
2022-05-05 14:33:37 +03:00
Piotr Sarna
eeec502aee Merge 'gms: feature_service: reduce boilerplate to add a cluster feature' from Avi Kivity
Currently, adding a cluster feature requires editing several files and
repeating the new feature name several times. This series reduces
the boilerplate to a single line (for non-experimental features), and
perhaps three for experimental features.

Closes #10488

* github.com:scylladb/scylla:
  gms: feature_service: remove variable/helper function duplication
  gms: feature: make `operator bool` implicit
  gms: feature_service: remove feature variable duplication in enable()
  gms: feature_service: remove feature variable declaration/definition duplication
  gms: features: de-quadruplicate active feature names
  gms: features: de-quadruplicate deprecated feature names
  gms: feature_service: avoid duplicating feature names when listing known features
2022-05-05 12:43:15 +02:00
Benny Halevy
ca1b616092 mutation: add consume_gently
Allow yielding when consuming a mutation,
and use in to_data_query_result.

Fixes #10038

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Benny Halevy
09fb2c983a frozen_mutation: add consume_gently
Allow yielding when consuming a frozen_mutation,
and use in to_data_query_result.

Refs #10038

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Benny Halevy
c9612855c7 query: coroutinize to_data_query_result
Reduce stalls by maybe yielding in-between partitions,
and by awaiting unfreeze_gently where possible.

Refs #10038

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Benny Halevy
e12454f175 frozen_mutation: add unfreeze_gently
And use in data_read_resolver::resolve

Fixes #2361

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Benny Halevy
4963eb73b5 mutation_partition_view: add accept_gently methods
Allow yielding when consuming mutation_partition_view.
To be used in later patches by a new unfreeze_gently function
and frozen_mutation::consume.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Benny Halevy
f02c25f2c3 storage_proxy: futurize data_read_resolver::resolve
Allow yielding in data_read_resolver::resolve to
prevent reactor stalls.

TODO: unfreeze_gently, to prevent stalls due
to large partitions.

Refs #2361

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Pavel Emelyanov
0f698910e8 cql_type_parser: Require user_types_storage& in parse()
Right now to get user types the method in question gets global proxy
instance to get database from it and then peek a keyspace, its metadata
and, finally, the user types. There's also a safety check for proxy not
being initialized, which happens in tests.

Instead of messing with the proxy, the parse() method now accepts the
user_types_storage reference from which it gets the types. All the
callers already have the needed storage at hand -- in most of the cases
it's one shared between the database and schema_ctxt. In case of tests
is's a dummy storage, in case of schema-loader it's its local one.

The get_column_mapping() is special -- it doesn't expect any user-types
to be parsed and passes "" keyspace into it, neither it has db/ctxt to
get types storage from, so it can safely use the dummy one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-05 13:11:18 +03:00
Pavel Emelyanov
44f38d4de2 schame_tables: Add db/ctxt args here and there
This is to have them in places that call cql_type_parser::parse.
Pure churn reduction for the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-05 13:11:18 +03:00
Pavel Emelyanov
2104d90dd0 user_types: Carry storage on database and schema_ctxt
The user types storage is needed in cql_type_parser::parse which is in
turn called with either replica::database or scema_ctxt at hand.

To facilitate the former case replica::database has its own user types
storage created in database constructor.

The latter case is a bit trickier. In many cases the ctxt is created as
a temporary object and the database is available at those places. Also
the ctxt object lives on the schema_registry instance which doesn't have
database nearby. However, that ctxt lifetime is the same as the registry
instance one and when it's created there's a database at hand (it's the
database constructor that calls schema_registry.init() passing "this"
into it). Thus, the solution is to make database's user types storage be
a shared pointer that's shared between database itself and all the ctxts
out there including the one that lives on schema_registry instance.

When database goes away it .deactivate()s its user types storage so that
any ctxts that may share it stay on the safe side and don't use database
after free. This part will go away when the schema_registry will be
deglobalized.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-05 13:06:04 +03:00
Pavel Emelyanov
860dbab474 data_dictionary: Introduce user types storage
The interface in question will be used by cql type parser to get user
types. There are already three possible implementations of it:

- dummy, when no user types are in use (e.g. tests)
- schema-loader one, which gets user types from keyspaces that are
  collected on its implementation of the database
- replica::database one, which does the same, but uses the real
  database instance and that will be shared between scema_ctxts

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-05 09:44:26 +03:00
Avi Kivity
19ab3edd77 gms: feature_service: remove variable/helper function duplication
Each feature has a private variable and a public accessor. Since the
accessor effectively makes the variable public, avoid the intermediary
and make the variable public directly.

To ease mechanical translation, the variable name is chosen as
the function name (without the cluster_supports_ prefix).

References throughout the codebase are adjusted.
2022-05-04 18:59:56 +03:00
Avi Kivity
435b46cd52 gms: feature: make operator bool implicit
Features are usually used as booleans, so forcing allowing them
to implicitly decay to bool is not a mistake. In fact a bunch
of helper functions exist to cast feature variables to bool.

Prepare to reduce this boilerplate by allowing automatic conversion
to bool.
2022-05-04 18:58:24 +03:00
Avi Kivity
81ad595f61 gms: feature_service: remove feature variable duplication in enable()
We have a list of all feature variables in enable(), but the list
is also available programatically in _registered_features, so use
that instead.
2022-05-04 18:44:28 +03:00
Avi Kivity
f0f4759163 gms: feature_service: remove feature variable declaration/definition duplication
Feature variables are both declared and defined. Make that happen in one
place, reducing boilerplate.
2022-05-04 18:24:56 +03:00
Avi Kivity
0f95258577 gms: features: de-quadruplicate active feature names
Active feature names are present four or five times in the code:
a delaration in feature.hh, a definition and initialization (two copies)
in feature_service.cc, a use in feature_service.cc, and a possible
reference in feature_service.cc if the feature is conditionally enabled.

Switch to just one copy or two, using the "foo"sv operator (and "foo"s)
to generate a string_view (string) as before.

Note that a few features had different external and C++ names; we
preserve the external name.

This patch does cause literal strings to be present in two places,
making them vulnerable to misspellings. But since feature names
are immutable, there is little risk that one will change without
the other.
2022-05-04 18:12:53 +03:00
Avi Kivity
980b109adb gms: features: de-quadruplicate deprecated feature names
Deprecated features are unused, but are present four times in the code:
a delaration in feature.hh, a definition and initialization (two copies)
in feature_service.cc, and a use in feature_service.cc. Switch to just
one copy, using the "foo"sv operator to generate a string_view as before.

Note that a few features had different external and C++ names; we
preserve the external name.
2022-05-04 17:54:05 +03:00
Avi Kivity
ebe5ce2870 gms: feature_service: avoid duplicating feature names when listing known features
We already have the registered features in a data structure, collect them
from there instead of repeating.
2022-05-04 16:19:42 +03:00
Calle Wilund
78350a7e1b cdc: Ensure columns removed from log table are registered as dropped
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.

Should probably be backported to all CDC enabled versions.

Fixes #10473
Closes #10474
2022-05-04 14:19:39 +02:00
Michał Radwański
29e09a3292 db/config: command line arguments logger_stdout_timestamps and logger_ostream_type are no longer ignored
Closes #10452
2022-05-04 14:40:52 +03:00
Pavel Emelyanov
063d26bc9e system_keyspace/config: Swallow string->value cast exception
When updating an updateable value via CQL the new value comes as a
string that's then boost::lexical_cast-ed to the desired value. If the
cast throws the respective exception is printed in logs which is very
likely uncalled for.

fixes: #10394
tests: manual

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220503142942.8145-1-xemul@scylladb.com>
2022-05-04 08:35:12 +03:00
Pavel Emelyanov
b26a3da584 gossiper: Coroutinize wait_for_gossip_to_settle()
Looks notably shorter this way

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220422093000.24407-1-xemul@scylladb.com>
2022-05-03 15:58:04 +03:00
Botond Dénes
4440d4b41a Merge "De-globalize gossiper" from Pavel Emelyanov
"
- Alternator gets gossiper for its proxy dependency

- Forward service method that takes global gossiper can re-use
  proxy method (forward -> proxy reference is already there)

- Table code is patched to require gossiper argument

- Snitch gets a dependency reference on snitch_ptr and some extra
  care for snitch driver vs snitch-ptr interaction and gossip test

- Cql test env should carry gossiper reference on-board

- Few places can re-use the existing local gossiper reference

- Scylla-gdb needs to get gossiper from debug namespace and needs
  _not_ to get feature service from gossiper
"

* 'br-gossiper-deglobal-2' of https://github.com/xemul/scylla:
  code: De-globalize gossiper
  scylla-gdb, main: Get feature service without gossiper help
  test: Use cql-test-env gossiper
  cql test env: Keep gossiper reference on board
  code: Use gossiper reference where possible
  snitch: Use local gossiper in drivers
  snitch: Keep gossiper reference
  test: Remove snitch from manual gossip test
  gossiper: Use container() instead of the global pointer
  main, cql_test_env: Start snitch later
  snitch: Move snitch_base::get_endpoint_info()
  forward service: Re-use proxy's helper with duplicated code
  table: Don't use global gossiper
  alternator: Don't use global gossiper
2022-05-03 15:56:07 +03:00
Nadav Har'El
6fb762630b cql-pytest: translate Cassandra's tests for SELECT operations
This is a translation of Cassandra's CQL unit test source file
    validation/operations/SelectTest.java into our our cql-pytest framework.

    This large test file includes 78 tests for various types of SELECT
    operations. Four additional tests require UDF in Java syntax,
    and were skipped.

    All 78 tests pass on Cassandra. 25 of the tests fail on Scylla
    reproducing 3 already known Scylla issues and 8 previously-unknown
    issues:

    Previously known issues:

      Refs #2962:  Collection column indexing
      Refs #4244:  Add support for mixing token, multi- and single-column
                   restrictions
      Refs #8627:  Cleanly reject updates with indexed values where
                   value > 64k

    Newly-discovered issues:

      Refs #10354: SELECT DISTINCT should allow filter on static columns,
                   not just partition keys
      Refs #10357: Spurious static row returned from query with filtering,
                   despite not matching filter
      Refs #10358: Comparison with UNSET_VALUE should produce an error
      Refs #10359: "CONTAINS NULL" and "CONTAINS KEY NULL" restrictions
                   should match nothing
      Refs #10361: Null or UNSET_VALUE subscript should generate an
                   invalid request error
      Refs #10366: Enforce Key-length limits during SELECT
      Refs #10443: SELECT with IN and ORDER BY orders rows per partition
                   instead of for the entire response
      Refs #10448: The CQL token() function should validate its parameters

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10449
2022-05-03 11:45:05 +03:00
Pavel Emelyanov
e80adbade3 code: De-globalize gossiper
No code uses global gossiper instance, it can be removed. The main and
cql-test-env code now have their own real local instances.

This change also requires adding the debug:: pointer and fixing the
scylle-gdb.py to find the correct global location.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
89ee15b05b scylla-gdb, main: Get feature service without gossiper help
This is needed not to mess with removed global gossiper in the next
patch. Other than this, it's better to access services by their own
debug:: pointers, not via under-the-good dependencies chains.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
b25fc29801 test: Use cql-test-env gossiper
There's yet another -test-env -- the alternator- one -- which needs
gossiper. It now uses global reference, but can grab gossiper reference
from the cql-test-env which partitipates in initialization.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
b0544ba7bd cql test env: Keep gossiper reference on board
The reference is already available at the env initialization, but it's
not kept on the env instance itself. Will be used by the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
4bea0b7491 code: Use gossiper reference where possible
Some places in the code has function-local gossiper reference but
continue to use global instance. Re-use the local reference (it's going
to become sharded<> instance soon).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
e502047c74 snitch: Use local gossiper in drivers
Each driver has a pointer to this shard snitch_ptr which, in turn, has
the reference on gossiper. This lets drivers stop using the global
gossiper instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
38c77d0d85 snitch: Keep gossiper reference
The reference is put on the snitch_ptr because this is the sharded<>
thing and because gossiper reference is the same for different snitch
drivers. Also, getting gossiper from snitch_ptr by driver will look
simpler than getting it from any base class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
52fc4d6b22 test: Remove snitch from manual gossip test
It's not in use out there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
7a0ca3fedc gossiper: Use container() instead of the global pointer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
2d32c47d0d main, cql_test_env: Start snitch later
Snitch depends on gossiper and system keyspace, so it needs to be
started after those two do.

fixes #10402

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:32 +03:00
Pavel Emelyanov
f85e12ffa5 snitch: Move snitch_base::get_endpoint_info()
This method is only needed by production_snitch_base inheritants

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:34:52 +03:00
Pavel Emelyanov
282a1880a5 forward service: Re-use proxy's helper with duplicated code
The get_live_endpoints matches the same method on the proxy side. Since
the forward service carries proxy reference, it can use its method
(which needs to be made public for that sake).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:34:51 +03:00
Pavel Emelyanov
11c99fc41b table: Don't use global gossiper
The table::get_hit_rate needs gossiper to get hitrates state from.
There's no way to carry gossiper reference on the table itself, so it's
up to the callers of that method to provide it. Fortunately, there's
only one caller -- the proxy -- but the call chain to carry the
reference it not very short ... oh, well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:33:08 +03:00
Pavel Emelyanov
7a5c2cdbe6 alternator: Don't use global gossiper
There's proxy at hand which can provide local gossiper reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:33:07 +03:00
Avi Kivity
a2901a376d Merge 'Coroutinize some storage_service member functions' from Pavel Solodovnikov
These trivial changes are mostly intended to reduce the use of `seastar::async`.

Closes #10416

* github.com:scylladb/scylla:
  service: storage_service: coroutinize `start_gossiping()`
  service: storage_service: coroutinize `node_ops_cmd_heartbeat_updater()`
  service: storage_service: coroutinize `node_ops_abort_thread()`
  service: storage_service: coroutinize `node_ops_abort()`
  service: storage_service: coroutinize `node_ops_done()`
  service: storage_service: coroutinize `node_ops_update_heartbeat()`
  service: storage_service: coroutinize `force_remove_completion()`
  service: storage_service: coroutinize `start_leaving()`
  service: storage_service: coroutinize `start_sys_dist_ks()`
  service: storage_service: coroutinize `prepare_to_join()`
  service: storage_service: coroutinize `removenode_add_ranges()`
  service: storage_service: coroutinize `unbootstrap()`
  service: storage_service: coroutinize `get_changed_ranges_for_leaving()`
2022-05-02 12:59:36 +03:00
Botond Dénes
53c66fe24a Merge "Make LCS reshape and major more efficient by picking the ideal output level" from Raphael S. Carvalho
"
Today, both operations are picking the highest level as the ideal level for
placing the output, but the size of input should be used instead.

The formula for calculating the ideal level is:
    ceil(log base(fan_out) of (total_input_size / max_fragment_size))

        where fan_out = 10 by default,
        total_input_size = total size of input data and
        max_fragment_size = maximum size for fragment (160M by default)

    such that 20 fragments will be placed at level 2, as level 1
    capacity is 10 fragments only.

By placing the output in the incorrect level, tons of backlog will be generated
for LCS because it will either have to promote or demote fragments until the
levels are properly balanced.
"

* 'optimize_lcs_major_and_reshape/v2' of https://github.com/raphaelsc/scylla:
  compaction: LCS: avoid needless work post major compaction completion
  compaction: LCS: avoid needless work post reshape completion
  compaction: LCS: extract calculation of ideal level for input
  compaction: LCS: Fix off-by-one in formula used to calculate ideal level
2022-05-02 10:16:09 +03:00
Pavel Solodovnikov
47834313d8 repair: avoid infinite recursion on stringifying unknown node_ops_cmd
Cast the cmd representation to underlying type and
avoid infinite recursion in the `operator <<(node_ops_cmd)`.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20220430180701.1012190-1-pa.solodovnikov@scylladb.com>
2022-05-02 10:08:34 +03:00
Avi Kivity
5169ce40ef Merge 'loading_cache: force minimum size of unprivileged ' from Piotr Grabowski
This series enforces a minimum size of the unprivileged section when
performing `shrink()` operation.

When the cache is shrunk, we still drop entries first from unprivileged
section (as before this commit), however, if this section is already small
(smaller than `max_size / 2`), we will drop entries from the privileged
section.

This is necessary, as before this change the unprivileged section could
be starved. For example if the cache could store at most 50 entries and
there are 49 entries in privileged section, after adding 5 entries (that would
go to unprivileged section) 4 of them would get evicted and only the 5th one
would stay. This caused problems with BATCH statements where all
prepared statements in the batch have to stay in cache at the same time
for the batch to correctly execute.

To correctly check if the unprivileged section might get too small after
dropping an entry, `_current_size` variable, which tracked the overall size
of cache, is changed to two variables: `_unprivileged_section_size` and
`_privileged_section_size`, tracking section sizes separately.

New tests are added to check this new behavior and bookkeeping of the section
sizes. A test is added, that sets up a CQL environment with a very small
prepared statement cache, reproduces issue in #10440 and stresses the cache.

Fixes #10440.

Closes #10456

* github.com:scylladb/scylla:
  loading_cache_test: test prepared stmts cache
  loading_cache: force minimum size of unprivileged
  loading_cache: extract dropping entries to lambdas
  loading_cache: separately track size of sections
  loading_cache: fix typo in 'privileged'
2022-05-01 19:36:35 +03:00
Avi Kivity
325eb9b4d2 Merge 'compaction: get rid of reader v1' from Benny Halevy
This series gets rid of the remaining usage of flat_mutation_reader v1 in compaction

Test: sstable_compaction_test

Closes #10454

* github.com:scylladb/scylla:
  compaction: sanitize headers from flat_mutation_reader v1
  flat_mutation_reader: get rid of class filter
  compaction: cleanup_compaction: make_partition_filter: return flat_mutation_reader_v2::filter
2022-05-01 19:29:10 +03:00
Avi Kivity
ab72dbc93e Merge 'Make internal statements caching more coherent' from Eliran Sinvani
There was some doubts about which internal prepared statements are cached and which aren't.
In addition some queries that should have been cached (IMO), weren't. This PR adds some verbosity
to the caching enabling parameter as well as adding caching to some queries.
As a followup I would suggest to have internal queries as a compile time strings that have a compile time hash, this
will make the cache lookup not be dependent on the query textual length as it is today, this makes sense given that the
queries are static even today.

Closes #10465

Fixes #10335.

* github.com:scylladb/scylla:
  internal queries: add caching to some queries
  query_processor: remove default internal query caching behavior
  query_processor: make execute_internal caching parameter more verbose
2022-05-01 19:27:06 +03:00
Eliran Sinvani
a16b4e407d internal queries: add caching to some queries
Some of the internal queries didn't have caching enabled even though
there are chances of the query executing in large bursts or relatively
often, example of the former is `default_authorized::authorize` and for
the later is `system_distributed_keyspace::get_service_levels`.

Fixes #10335

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-01 13:30:02 +03:00
Pavel Solodovnikov
1031a9fa09 service: storage_service: coroutinize start_gossiping()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-05-01 12:12:30 +03:00
Pavel Solodovnikov
4af27ca653 service: storage_service: coroutinize node_ops_cmd_heartbeat_updater()
Also, pass `node_ops_cmd` by value to get rid of lifetime issues
when converting to coroutine.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-05-01 12:07:36 +03:00
Eliran Sinvani
e0c7178e75 query_processor: remove default internal query caching behavior
When executing internal queries, it is important that the developer
will decide if to cache the query internally or not since internal
queries are cached indefinitely. Also important is that the programmer
will be aware if caching is going to happen or not.
The code contained two "groups" of `query_processor::execute_internal`,
one group has caching by default and the other doesn't.
Here we add overloads to eliminate default values for caching behaviour,
forcing an explicit parameter for the caching values.
All the call sites were changed to reflect the original caching default
that was there.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-01 08:33:55 +03:00
Eliran Sinvani
38b7ebf526 query_processor: make execute_internal caching parameter more verbose
`execute_internal` has a parameter to indicate if caching a prepared
statement is needed for a specific call. However this parameter was a
boolean so it was easy to miss it's meaning in the various call sites.
This replaces the parameter type to a more verbose one so it is clear
from the call site what decision was made.
2022-05-01 08:33:55 +03:00
Avi Kivity
7f1e368e92 Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.

Fixes: #10450

Closes #10451

* github.com:scylladb/scylla:
  replica/database: drop_column_family(): drop querier cache entries after waiting for ops
  replica/database: finish coroutinizing drop_column_family()
  replica/database: make remove(const column_family&) private
2022-04-29 22:06:51 +03:00
Tomasz Grabiec
dbef83af71 Merge 'raft: fix startup hangs' from Kamil Braun
Fix hangs on Scylla node startup with Raft enabled that were caused by:
- a deadlock when enabling the USES_RAFT feature,
- a non-voter server forgetting who the leader is and not being able to forward a `modify_config` entry to become a voter.

Read the commit messages for details.

Fixes: #10379
Refs: #10355

Closes #10380

* github.com:scylladb/scylla:
  raft: actively search for a leader if it is not known for a tick duration
  raft: server: return immediately from `wait_for_leader` if leader is known
  service: raft: don't support/advertise USES_RAFT feature
2022-04-29 19:47:10 +02:00
Piotr Grabowski
6537dc6126 loading_cache_test: test prepared stmts cache
Add a new test that sets up a CQL environment with a very small prepared 
statements cache. The test reproduces a scenario described in #10440,
where a privileged section of prepared statement cache gets large
and that could possibly starve the unprivileged section, making it
impossible to execute BATCH statements. Additionally, at the end of the 
test, prepared statements/"simulated batches" with prepared statements 
are executed a random number of times, stressing the cache.

To create a CQL environment with small prepared cache, cql_test_config
is extended to allow setting custom memory_config value.
2022-04-29 19:22:55 +02:00
Piotr Grabowski
3f2224a47f loading_cache: force minimum size of unprivileged
This patch enforces a minimum size of unprivileged section when
performing shrink() operation.

When the cache is shrank, we still drop entries first from unprivileged
section (as before this commit), however if this section is already small
(smaller than max_size / 2), we will drop entries from the privileged
section.

For example if the cache could store at most 50 entries and there are 49 
entries in privileged section, after adding 5 entries (that would go to 
unprivileged section) 4 of them would get evicted and only the 5th one 
would stay. This caused problems with BATCH statements where all 
prepared statements in the batch have to stay in cache at the same time 
for the batch to correctly execute.

New tests are added to check this behavior and bookkeeping of section
sizes. 

Fixes #10440.
2022-04-29 19:19:04 +02:00
Piotr Grabowski
06612ddf1c loading_cache: extract dropping entries to lambdas
Extract the logic of dropping an entry from privileged/unprivileged
sections to a separate named local lambdas.
2022-04-29 19:19:03 +02:00
Piotr Grabowski
bebc4c8147 loading_cache: separately track size of sections
This patch splits _current_size variable, which tracked the overall size
of cache, to two variables: _unprivileged_section_size and 
_privileged_section_size.

Their sum is equal to the old _current_size, but now you can get the
size of each section separately.

lru_entry's cache_size() is replaced with owning_section_size() which 
references in which counter the size of lru_entry is currently stored.
2022-04-29 19:19:03 +02:00
Michał Sala
35e02858b2 test/boost: cql_query_test: allow bound variables in test_list_of_tuples_with_bound_var 2022-04-29 10:46:57 +02:00
Michał Sala
ed377933a1 test/boost: cql_query_test: test bound variables in collection literals 2022-04-29 10:46:51 +02:00
Raphael S. Carvalho
736c96cc6f compaction: LCS: avoid needless work post major compaction completion
That's done by picking the ideal level for the input, such
that LCS won't have to either promote or demote data, because
the output level is not the best candidate for having the
size of the output data.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-28 20:19:28 -03:00
Raphael S. Carvalho
bea551ea14 compaction: LCS: avoid needless work post reshape completion
That's done by picking the ideal level for reshape input, such
that LCS won't have to either promote or demote data, because
the output level is not the best candidate for having the
size of the output data.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-28 20:19:28 -03:00
Raphael S. Carvalho
b25e53a845 compaction: LCS: extract calculation of ideal level for input
ideal level is calculated as:
	ceil(log base10 of ((input_size + max_fragment_size - 1) / max_fragment_size))

such that 20 fragments will be placed at level 2, as level 1
capacity is 10 fragments only.

The goal of extracting it is that the formula will be useful for
major in addition to reshape.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-28 20:19:04 -03:00
Raphael S. Carvalho
e82c02fae6 compaction: LCS: Fix off-by-one in formula used to calculate ideal level
To calculate ideal level, we use the formula:
	log 10 (input_size / max_fragment_size)

input_size / max_fragment_size is calculating number of fragments.

the problem is that the calculation can miss the last fragment, so
wrong level may be picked if last fragment would cause the target
level to exceed its capacity.

To fix it, let's tweak the formula to:
	log 10 ((input_size + max_fragment_size - 1) / max_fragment_size)

such that the actual # of fragments will be calculated.

If wrong level is picked, it can cause unnecessary writeamp as,
LCS will later have to promote data into the next level.

Problem spotted by Benny Halevy.

Fixes #10458.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-28 20:09:04 -03:00
Michał Sala
69bc60ef24 cql3: expr: do not allow unset values inside collections
Semantic of unset values inside collections is undefined.
Previous behavior of transforming list with unset value into unset value
was removed, because I couldn't find a reason for its existence.
2022-04-28 19:33:09 +02:00
Avi Kivity
aee94b7176 Merge "Convert remaining mutation sources to v2" from Botond
"
After the recent conversion of the row-cache, two v1 mutation sources
remained: the memtable and the kl sstable reader.
This series converts both to a native v2 implementation. The conversion
is shallow: both continue to read and process the underlying (v1) data
in v1, the fragments are converted to v2 right before being pushed to
the reader's buffer. This conversion is simple, surgical and low-risk.
It is also better than the upgrade_to_v2() used previously.

Following this, the remaining v1 reader implementations are removed,
with the exception of the downgrade_to_v1(), which is the only one left
at this point. Removing this requires converting all mutation sinks to
accept a v2 stream.

upgrade_to_v2() is now not used in any production code. It is still
needed to properly test downgrade_to_v1() (which is till used), so we
can't remove it yet. Instead it hidden as a private method of
mutation_source. This still allows for the above mentioned testing to
continue, while preventing anyone from being tempted to introduce new
usage.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/191
"

* 'convert-remaining-v1-mutation-sources/v2' of https://github.com/denesb/scylla:
  readers: make upgrade_to_v2() private
  test/lib/mutation_source_test: remove upgrade_to_v2 tests
  readers: remove v1 forwardable reader
  readers: remove v1 empty_reader
  readers: remove v1 delegating_reader
  sstables/kl: make reader impl v2 native
  sstables/kl: return v2 reader from factory methods
  sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
  partition_snapshot_reader: convert implementation to native v2
  mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()
2022-04-28 20:31:23 +03:00
Michał Sala
4766e25d6e cql3: expr: prepare_expr: allow bind markers in collection literals
It's easier to allow them then not to do so.
2022-04-28 19:31:09 +02:00
Piotr Grabowski
fe9b62bc99 loading_cache: fix typo in 'privileged'
Fix typo from 'priviledged' to 'privileged'.
2022-04-28 17:51:26 +02:00
Benny Halevy
78d6f6a519 compaction: sanitize headers from flat_mutation_reader v1
flat_mutation_reader make_scrubbing_reader no longer exists
and there is no need to include flat_mutation_reader.hh
nor forward declare the class.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-28 17:23:04 +03:00
Benny Halevy
4c35de962f flat_mutation_reader: get rid of class filter
It is no longer in use.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-28 17:19:20 +03:00
Benny Halevy
f634b6d3be compaction: cleanup_compaction: make_partition_filter: return flat_mutation_reader_v2::filter
We filter only on the parittion key, so it doesn't matter,
but we want to get rid of flat_mutation_reader v1.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-28 17:16:47 +03:00
Nadav Har'El
ad7cc71748 Merge 'sstables: Fix deletion of partial SSTables' from Raphael "Raph" Carvalho
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.

After commit e5fc4b6, partial sst cannot be deleted because it is
incorrectly assuming all files being deleted unconditionally has
TOC, but that's not true for partial files that need to be removed.
The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.
Let's fix this by taking into account temp TOC for partial files.

Fixes #10410.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10411

* github.com:scylladb/scylla:
  sstables: Fix deletion of partial SSTables
  sstables: Fix fsync_directory()
  sstables: Rename dirname() to a more descriptive name
2022-04-28 16:35:46 +03:00
Nadav Har'El
01bd858b6b cql: fix error message that refers to tuple instead of UDT
Slice restrictions on the "duration" type are not allowed, and also if
we have a collection, tuple or UDT of durations. We made an effort to
print helpful messages on the specific case encountered, such as "Slice
restrictions are not supported on UDTs containing duration".

But the if()s were reverse, meaning that a UDT - which is also a tuple -
will be reported as a tuple instead of UDT as we intended (and as Cassandra
reports it).

The wrong message was reproduced in the unit test translated from
Cassandra, select_test.py::testFilteringOnUdtContainingDurations

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220428071807.1769157-1-nyh@scylladb.com>
2022-04-28 14:16:58 +03:00
Botond Dénes
178c271bf4 readers: make upgrade_to_v2() private
The only user is the tests of downgrade_to_v1(), which uses it through
mutation source. To avoid any new users popping up, we make it a private
method of the latter. In the process the pass-through optimization is
dropped, it is not needed for tests anyway.
2022-04-28 14:12:24 +03:00
Botond Dénes
272da51f80 test/lib/mutation_source_test: remove upgrade_to_v2 tests
We don't have any upgrade_to_v2() left in production code, so no need to
keep testing it. Removing it from this test paves the way for removing
it for good (not in this series).
2022-04-28 14:12:24 +03:00
Botond Dénes
7420fb9411 readers: remove v1 forwardable reader
No users.
2022-04-28 14:12:24 +03:00
Botond Dénes
f527956cdb readers: remove v1 empty_reader
The only user is row level repair: it is replaced with
downgrade_to_v1(make_empty_flat_reader_v2()). The row level reader has
lots of downgrade_to_v1() calls, we will deal with these later all at
once.
Another use is the empty mutation source, this is trivially converted to
use the v2 variant.
2022-04-28 14:12:24 +03:00
Botond Dénes
ea37e9c04e readers: remove v1 delegating_reader
The only user is a test, which is hereby converted to use the v2
delegating reader.
2022-04-28 14:12:24 +03:00
Botond Dénes
70d019116f sstables/kl: make reader impl v2 native
The conversion is shallow: the meat of the logic remains v1, fragments
are converted to v2 right before being pushed into the buffer. This
approach is simple, surgical and is still better then a full
upgrade_to_v2().
2022-04-28 14:12:24 +03:00
Botond Dénes
a22b02c801 sstables/kl: return v2 reader from factory methods
This just moves the upgrade_to_v2() calls to the other side of said
factory methods, preparing the ground for converting the kl reader impl
to a native v2 one.
2022-04-28 14:12:24 +03:00
Botond Dénes
4b222e7f37 sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
Its only user is in said file, so that is a better place for it.
2022-04-28 14:12:24 +03:00
Botond Dénes
4f77e74bd4 partition_snapshot_reader: convert implementation to native v2
The underlying mutation representation is still v1, so the
implementation still has to do conversion. This happens right above the
lsa reader component.
2022-04-28 14:12:12 +03:00
Botond Dénes
9c7455825b mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage() 2022-04-28 14:11:51 +03:00
Botond Dénes
024ceec61e replica/database: drop_column_family(): drop querier cache entries after waiting for ops
Reads (part of operations) running concurrent to `drop_column_family()`
can create querier cache entries while we wait for them to finish in
`await_pending_ops()`. Move the cache entry eviction to after this, to
ensure such entries are also cleaned up before destroying the table
object.
This moves the `_querier_cache.evict_all_for_table()` from
`database::remove()` to `database::drop_column_family()`. With that the
former doesn't have to return `future<>` anymore. While at it (changing
the signature) also rename `column_family` -> `table`.

Also add a regression unit test.
2022-04-28 13:40:13 +03:00
Botond Dénes
4c17da9996 replica/database: finish coroutinizing drop_column_family()
Said method was already coroutinized, but only halfway, possibly because
of the difficulty in expressing `finally()` with coroutines. We now have
`coroutines::as_future()` which makes this easier, so finish the job.
2022-04-28 13:40:13 +03:00
Botond Dénes
9b7550f845 replica/database: make remove(const column_family&) private
It has no external users. And it shouldn't have either, tables should be
removed via drop_column_family().
2022-04-28 13:40:08 +03:00
Avi Kivity
de0ee13f45 schema_tables: forward-declare user_function and user_aggerates
These bring in wasm.hh (though they really shouldn't) and make
everyone suffer. Forward declare instead and add missing includes
where needed.

Closes #10444
2022-04-28 07:22:02 +03:00
Botond Dénes
2c08468fcb Merge 'Make headers self-contained' from Avi Kivity
Minor fixlets to make `ninja dev-headers` pass.

Closes #10445

* github.com:scylladb/scylla:
  readers/from_mutations_v2.hh: make self-contained
  data_dictionary/storage_options.hh: make self-contained
2022-04-28 07:20:10 +03:00
Avi Kivity
a9812166cd replica, partition_snapshot_reader, keys: replace boost::any with std::any
Reduce #include load by standardizing on std::any.

In keys.cc, we just drop the unneeded include.

One instance of boost::any remains in config_file, due to a tie-in with
other boost components.

Closes #10441
2022-04-28 07:18:53 +03:00
Avi Kivity
3a81cb7cc3 readers/from_mutations_v2.hh: make self-contained
Due to an inline function, we need the definition of
flat_mutation_reader_v2.hh, so include it.
2022-04-27 15:55:16 +03:00
Avi Kivity
28406c2c56 data_dictionary/storage_options.hh: make self-contained
Add "seastarx.hh" so sstring works (rather than seastar::sstring).
2022-04-27 15:54:32 +03:00
Avi Kivity
333fdcb3f5 Update tools/java submodule (fix NodeProbe: Malformed IPv6 address at index)
* tools/java 9bc83b7a32...a4573759a2 (1):
  > CASSANDRA-17581 fix NodeProbe: Malformed IPv6 address at index

Fixes #10442.
2022-04-27 14:51:47 +03:00
Benny Halevy
e88871f4ec replica: database: move shard_of implementation to mutation layer
We don't need the database to determine the shard of the mutation,
only its schema. So move the implementation to the respecive
definitions of mutation and frozen_mutation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10430
2022-04-27 14:40:24 +03:00
Nadav Har'El
f6ce7891a5 test/alternator: add test for key length limits
DynamoDB limits partition-key length to 2048 bytes and sort-key length
to 1024 bytes. Alternator currently has no such limits officially, but
if a user tries a key length of over 64 KB, the result will be an
"internal server error" as Alternator runs into Scylla's low-level key
length limit of 64 KB.

In this patch we add (mostly xfailing) tests confirming all the above
observations. The tests include extensive comments on what they are
testing and why. Some of these tests (specifically, the ones checking
what happens above 64 KB) should pass once Alternator is fixed. Other
tests - requiring that the limits be exactly what they are in DynamoDB -
may either not pass or change in the future, depending on what we decide
the limits should be in Alternator.

Refs #10347

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #10438
2022-04-26 18:09:19 +02:00
Raphael S. Carvalho
791403e4bb sstables: Fix deletion of partial SSTables
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.

After commit e5fc4b6, partial sst cannot be deleted because deletion
procedure is incorrectly assuming all SSTs being deleted unconditionally
have TOC, but partial SSTs only have TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.

The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.

This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().

Fixes #10410.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 11:00:27 -03:00
Raphael S. Carvalho
0be44b1035 sstables: Fix fsync_directory()
fsync_directory() is broken because it's unconditionally performing
fsync on parent directory, not on the directory that it was called
with. To fix, let's remove wrong parent_path() usage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 11:00:27 -03:00
Raphael S. Carvalho
ca8f5dcdb7 sstables: Rename dirname() to a more descriptive name
dirname() is confusing because if it's called on a directory, parent
path is retrieved. By renaming it to parent_path(), it's clearer
what the function will do exactly.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 11:00:27 -03:00
Avi Kivity
582802825a treewide: use system-#include (angle brackets) for seastar
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.

Also fix cases of #include which reached out to seastar's directory
tree directly, via #include "seastar/include/sesatar/..." to
just refer to <seastar/...>.

Closes #10433
2022-04-26 14:46:42 +03:00
Takuya ASADA
48b6aec16a scripts: use "out()" function for all capture_output subprocesses
On acaf0bb we applied out() just for perftune.py because we had issue #10390
with this script.
But the issue can happen with other commands too, let's apply it to all
commands which uses capture_output.

related #10390

Closes #10414
2022-04-26 13:56:52 +03:00
Benny Halevy
01f41630a5 compaction: time_window_compaction_strategy: reset estimated_remaining_tasks when running out of candidates
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.

Refs #10418
(to be closed when the fix reaches branch-4.6)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10419
2022-04-26 11:26:48 +03:00
Benny Halevy
055141fc2e multishard_mutation_query: do_query: stop ctx if lookup_readers fails
lookup_readers might fail after populating some readers
and those better be closed before returning the exception.

Fixes #10351

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10425
2022-04-26 11:11:52 +03:00
Botond Dénes
bf1b6ced3c Merge "Make storage_service::bootstrap less if-y" from Pavel Emelyanov
"
The method in question performs node bootstrap in several different
modes
(regular, replacing, rnbo) and several subsequent if-else branches just
duplicate each-other. This set merges them making the code easier to
read.
"

* 'br-less-branchy-bootstrap' of https://github.com/xemul/scylla:
  storage_service: Remove pointless check in replace-bootstrap
  storage_service: Generalize wait for range setup
  storage_service: Merge common if-else branches in bootstrap
  storage_service: Move tables bootstrap-ON upwards
2022-04-26 10:58:30 +03:00
Raphael S. Carvalho
d79fb9a12f docs: Update compaction controller doc
The doc is being updated to reflect the changes in the commit
d8833de3bb ("Redefine Compaction Backlog to tame
compaction aggressiveness").

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 10:50:45 +03:00
Benny Halevy
c2f0d75d96 test: database_test: add test_truncate_without_snapshot_during_writes
Reproduces https://github.com/scylladb/scylla/issues/10421
with 2325c566d9
(memtable_list: futurize clear_and_add)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-26 07:25:30 +03:00
Benny Halevy
b8263e550a memtable_list: safely futurize clear_and_add
Following a4be927e23
that reverted 2325c566d9
due to #10421, this patch reintroduces an async version
of memtable_list::clear_and_add that calls clear_gently
safely after replacing the _memtables vector with a new one
so that writes and flushes can continue in he foreground
while the old memtables are cleared.

Fixes #10281

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-26 07:25:28 +03:00
Benny Halevy
aae532a96b table: clear: serialize with ongoing flush
Get all flush permits to serialize with any
ongoing flushes and preventing further flushes
during table::clear, in particular calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.

Fixes #10423

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-25 18:57:07 +03:00
Gleb Natapov
7f26a8eef5 raft: actively search for a leader if it is not known for a tick duration
For a follower to forward requests to a leader the leader must be known.
But there may be a situation where a follower does not learn about
a leader for a while. This may happen when a node becomes a follower while its
log is up-to-date and there are no new entries submitted to raft. In such
case the leader will send nothing to the follower and the only way to
learn about the current leader is to get a message from it. Until a new
entry is added to the raft's log a follower that does not know who the
leader is will not be able to add entries. Kind of a deadlock. Note that
the problem is specific to our implementation where failure detection is
done by an outside module. In vanilla raft a leader sends messages to
all followers periodically, so essentially it is never idle.

The patch solves this by broadcasting specially crafted append reject to all
nodes in the cluster on a tick in case a leader is not known. The leader
responds to this message with an empty append request which will cause the
node to learn about the leader. For optimisation purposes the patch
sends the broadcast only in case there is actually an operation that
waits for leader to be known.

Fixes #10379
2022-04-25 14:51:22 +02:00
Kamil Braun
5308a7d7a3 raft: server: return immediately from wait_for_leader if leader is known
`wait_for_leader` may be called when leader is known. There's nothing to
wait for in this case.
2022-04-25 12:59:55 +02:00
Benny Halevy
db676e9e4a replica: database: apply: make sure the schema is synced or throw internal error
Currently an exception is thrown in the apply stage
when the schema is not synced, but it is too late
since returning an error doesn't pinpoint which code
path was using an unsync'ed schema so move the check
earlier, before _apply_stage is called.

We need to make sure the schema is synced earlier
when the mutation is applied so call on_internal_error
to generate a backtrace in testing and still throw
an error in production.

Typically storage_proxy::mutate_locally implicitly
ensures the schema is synced by making a global_schema_ptr
for it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220424110057.3957597-1-bhalevy@scylladb.com>
2022-04-25 12:18:47 +02:00
Pavel Solodovnikov
654e6726d1 service: storage_service: coroutinize node_ops_abort_thread()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:11:20 +03:00
Pavel Solodovnikov
b27c989e62 service: storage_service: coroutinize node_ops_abort()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:11:14 +03:00
Pavel Solodovnikov
f7e84c6138 service: storage_service: coroutinize node_ops_done()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:11:08 +03:00
Pavel Solodovnikov
6936dbea49 service: storage_service: coroutinize node_ops_update_heartbeat()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:11:04 +03:00
Pavel Solodovnikov
1c03d01927 service: storage_service: coroutinize force_remove_completion()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:10:58 +03:00
Pavel Solodovnikov
fc1dfb0ae1 service: storage_service: coroutinize start_leaving()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:10:54 +03:00
Pavel Solodovnikov
0a3a7534d6 service: storage_service: coroutinize start_sys_dist_ks()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:10:49 +03:00
Pavel Solodovnikov
15ea74e41f service: storage_service: coroutinize prepare_to_join()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:10:43 +03:00
Pavel Solodovnikov
c739fad5d6 service: storage_service: coroutinize removenode_add_ranges()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:10:05 +03:00
Pavel Solodovnikov
e392fdda96 service: storage_service: coroutinize unbootstrap()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:09:56 +03:00
Pavel Solodovnikov
8fa7f47a74 service: storage_service: coroutinize get_changed_ranges_for_leaving()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:09:04 +03:00
Benny Halevy
bcd35af7cf replica: table: generate_and_propagate_view_updates: pass mutation to make_flat_mutation_reader_from_mutations_v2
With f5ef687acd
we can consume the single mutation directly,
so there's n need to pass it as a vector of size 1.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220424103826.3930895-1-bhalevy@scylladb.com>
2022-04-24 22:19:19 +03:00
Avi Kivity
728479a6ea Merge 'Fix map subscript crashes when map or subscript is null' from Nadav Har'El
In the filtering expression "WHERE m[?] = 2", our implementation was buggy when either the map, or the subscript, was NULL (and also when the latter was an UNSET_VALUE). Our code ended up dereferencing null objects, yielding bizarre errors when we were lucky, or crashes when we were less lucky - see examples of both in issues #10361, #10399, #10401. The existing test `test_null.py::test_map_subscript_null` reproduced all these bugs sporadically.

In this series we improve the test to reproduce the separate bugs separately, and also reproduce additional problems (like the UNSET_VALUE). We then **define** both `m[NULL]` and `NULL[2]` to result in NULL instead of the existing undefined (and buggy, and crashing) behavior. This new definition is consistent with our usual SQL-inspired tradition that NULL "wins" in expressions - e.g., `NULL < 2` is also defined as resulting in NULL.

However, this decision differs from Cassandra, where `m[NULL]` is considered an error but `NULL[2]` is allowed. We believe that making `m[NULL]` be a NULL instead of an error is more consistent, and moreover - necessary if we ever want to support more complicate expressions like `m[a]`, where the column `a` can be NULL for some rows and non-NULL for others, and it doesn't make sense to return an "invalid query" error in the middle of the scan.

Fixes #10361
Fixes #10399
Fixes #10401

Closes #10420

* github.com:scylladb/scylla:
  expressions: don't dereference invalid map subscript in filter
  expressions: fix invalid dereference in map subscript evaluation
  test/cql-pytest: improve tests for map subscripts and nulls
2022-04-24 21:16:10 +03:00
Avi Kivity
a4be927e23 Revert "memtable_list: futurize clear_and_add"
This reverts commit 2325c566d9. It
causes a use-after-free of a memtable.

Fixes #10421.
2022-04-24 21:09:48 +03:00
Asias He
953af38281 streaming: Allow drop table during streaming
Currently, if a table is dropped during streaming, the streaming would
fail with no_such_column_family error.

Since the table is dropped anyway, it makes more sense to ignore the
streaming result of the dropped table, whether it is successful or
failed.

This allows users to drop tables during node operations, e.g., bootstrap
or decommission a node.

This is especially useful for the cloud users where it is hard to
coordinate between a node operation by admin and user cql change.

This patch also fixes a possible user after free issue by not passing
the table reference object around.

Fixes #10395

Closes #10396
2022-04-24 17:43:20 +03:00
Tzach Livyatan
607ccf0393 Update doc project name to scylla dev
Closes #10342
2022-04-24 17:40:54 +03:00
Nadav Har'El
fbb2a41246 expressions: don't dereference invalid map subscript in filter
If we have the filter expression "WHERE m[?] = 2", the existing code
simply assumed that the subscript is an object of the right type.
However, while it should indeed be the right type (we already have code
that verifies that), there are two more options: It can also be a NULL,
or an UNSET_VALUE. Either of these cases causes the existing code to
dereference a non-object as an object, leading to bizarre errors (as
in issue #10361) or even crashes (as in issue #10399).

Cassandra returns a invalid request error in these cases: "Unsupported
unset map key for column m" or "Unsupported null map key for column m".
We decided to do things differently:

 * For NULL, we consider m[NULL] to result in NULL - instead of an error.
   This behavior is more consistent with other expressions that contain
   null - for example NULL[2] and NULL<2 both result in NULL as well.
   Moreover, if in the future we allow more complex expressions, such
   as m[a] (where a is a column), we can find the subscript to be null
   for some rows and non-null for other rows - and throwing an "invalid
   query" in the middle of the filtering doesn't make sense.

 * For UNSET_VALUE, we do consider this an error like Cassandra, and use
   the same error message as Cassandra. However, the current implementation
   checks for this error only when the expression is evaluated - not
   before. It means that if the scan is empty before the filtering, the
   error will not be reported and we'll silently return an empty result
   set. We currently consider this ok, but we can also change this in the
   future by binding the expression only once (today we do it on every
   evaluation) and validating it once after this binding.

Fixes #10361
Fixes #10399

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-04-24 16:05:34 +03:00
Nadav Har'El
808a93d29b expressions: fix invalid dereference in map subscript evaluation
When we have an filter such as "WHERE m[2] = 3" (where m is a map
column), if a row had a null value for m, our expression evaluation
code incorrectly dereferences an unset optional, and continued
processing the result of this dereference which resulted in undefined
behavior - sometimes we were lucky enough to get "marshaling error"
but other times Scylla crashed.

The fix is trivial - just check before dereferencing the optional value
of the map. We return null in that case, which means that we consider
the result of null[2] to be null. I think this is a reasonable approach
and fits our overall approach of making null dominate expressions (e.g.,
the value of "null < 2" is also null).

The test test_filtering.py::test_filtering_null_map_with_subscript,
which used to frequently fail with marshaling errors or crashes, now
passes every time so its "xfail" mark is removed.

Fixes #10417

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-04-24 14:58:56 +03:00
Nadav Har'El
189b8845fe test/cql-pytest: improve tests for map subscripts and nulls
The test test_null.py::test_map_subscript_null turned out to reproduce
multiple bugs related to using map subscripts in filtering expressions.
One was issue #10361 (m[null] resulted in a bizarre error) or #10399
(m[null] resulted in a crash), and a different issue was #10401 (m[2]
resulted in a bizarre error or a crash if m itself was null). Moreover,
the same test uncovered different bugs depending how it was run - alone
or with other tests - because it was using a shared table.

In this patch we introduce two separate tests in test_filtering.py
which are designed to reproduce these separate bugs instead of mixing
them into one test. The new tests also cover a few more corners which
the previous test (which focused on nulls) missed - such as UNSET_VALUE.

The two new tests (and the old test_map_subscript_null) pass on
Cassandra so still assume that the Cassandra behavior - that m[null]
should be an error - is the correct behavior. We may want to change
the desired behavior (e.g., to decide that m[null] be null, not an
error), and change the tests accordingly later - but for now the
tests follow Cassandra's behavior exactly, and pass on Cassandra
and fail on Scylla (so are marked xfail).

The bugs reproduced by these tests involve randomness or reading
uninitialized memory, so these tests sometimes pass, sometimes fail,
and sometimes even crash (as reported in #10399 and #10401). So to
reproduce these bugs run the tests multiple times. For example:

    test/cql-pytest/run --count 100 --runxfail
        test_filtering.py::test_filtering_null_map_with_subscript

Refs #10361
Refs #10399
Refs #10401

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-04-24 13:26:26 +03:00
Avi Kivity
8624718983 Merge "row_cache: update reader implementations to v2" from Botond
"
cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
This means there is still conversion ingoing when reading from cache and
when populating, but when reading from underlying, the stream can now be
passed through as-is without conversions.
Also, any future v2 related changes to the in-memory storage will now be
limited to the cache reader implementation itself.

In the process, the non-forwarding reader, whose only user is the cache,
is also converted to v2.
"

Performance results reported by Botond:

"
build/release/test/perf/perf_simple_query -c1 -m2G --flush --
duration=20

BEFORE
median 130421.76 tps ( 71.1 allocs/op,  12.1 tasks/op,   47462
insns/op)
median absolute deviation: 319.64
maximum: 131028.33
minimum: 127502.55

AFTER
median 133297.41 tps ( 64.1 allocs/op,  12.2 tasks/op,   45406
insns/op)
median absolute deviation: 2964.24
maximum: 137581.56
minimum: 123739.4

Getting rid of those upgrade/downgrade was good for allocs and ops.
Curiously there is a 0.1 rise in number of tasks though.
"

* 'row-cache-readers-v2/v1' of https://github.com/denesb/scylla:
  row_cache: update reader implementations to v2
  range_tombstone_change_generator: flush(): add end_of_range
  readers/nonforwardable: convert to v2
  read_context: fix indentation
  read_context: coroutinize move_to_next_partition()
  row_cache: cache_entry::read(): return v2 reader
  row_cache: return v2 readers from make_reader*()
  readers/delegating_v2: s/make_delegating_reader_v2/make_delegating_reader/
2022-04-23 19:10:43 +03:00
Botond Dénes
5e97fb9fc4 row_cache: update reader implementations to v2
cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
2022-04-21 14:57:04 +03:00
Botond Dénes
5cc5fd4d23 range_tombstone_change_generator: flush(): add end_of_range
Allowing to flush all range tombstone changes, including those that have
a position equal to the passed in upper bound, when finishing off a
read-range, e.g. a clustering range from a slice.
2022-04-21 14:37:10 +03:00
Botond Dénes
7626beb729 readers/nonforwardable: convert to v2
It has a single user, the row cache, which for now has to
upgrade/downgrade around the nonforwardable reader, but this will go
away in the next patches when the row cache readers are converted to v2
proper.
2022-04-21 14:34:00 +03:00
Botond Dénes
b061acb668 Merge 'Remove queue reader v1' from Mikołaj Sielużycki
The patchset embeds the mutation_fragment upgrading logic from v1 to v2 into the mutation_fragment_queue. This way the mutation fragments coming to the mutation_fragment_queue can be v1, but the underlying query_reader receives mutation_fragment_v2, eliminating the last usage of query_reader (v1). The last commit removes query_reader, query_reader_handle and associated factory functions.

tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test)

Closes #10371

* github.com:scylladb/scylla:
  readers: Remove queue_reader v1 and associated code.
  repair: Make mutation_fragment_queue internally upgrade fragments to v2
  repair: Make mutation_fragment_queue::impl a seastar::shared_ptr
2022-04-21 12:34:48 +03:00
Mikołaj Sielużycki
f74fd0dd80 readers: Remove queue_reader v1 and associated code. 2022-04-20 17:56:34 +02:00
Mikołaj Sielużycki
339b60e5b0 repair: Make mutation_fragment_queue internally upgrade fragments to v2 2022-04-20 17:55:58 +02:00
Mikołaj Sielużycki
eeb2b458de repair: Make mutation_fragment_queue::impl a seastar::shared_ptr
It makes mutation_fragment_queue copyable and makes the pointer to
pending mutation fragments in next commit stable. This allows moving the
mutation_fragment_queue without breaking the underlying
upgrading_consumer.
2022-04-20 17:51:58 +02:00
Botond Dénes
46481264e9 read_context: fix indentation
Broken by the previous patch (patches actually -- it was half-indent on
half-indent before that).
2022-04-20 10:59:09 +03:00
Botond Dénes
28f90728a3 read_context: coroutinize move_to_next_partition()
Makes the code more readable and the impending v2 transition less noisy.
2022-04-20 10:59:09 +03:00
Botond Dénes
2a0d7e8a1d row_cache: cache_entry::read(): return v2 reader
Push the conversion down one level. Soon we will make cache flat
mutation reader a v2 reader, this keeps the related noise separate.
2022-04-20 10:59:09 +03:00
Botond Dénes
0b035c9099 row_cache: return v2 readers from make_reader*()
And adjust callers. The factory functions just sprinkle upgrade_to_v2()
on returned readers for now.
One test in row_cache_test.cc had to be disabled, because the upgrade to
v2 wrapper we now have over cache readers doesn't allow it to directly
control the reader's buffer size and so the test fails. There is a FIXME
left in the test code and the test will be re-enabled once a native v2
reader implementation allows us to get rid of the upgrade wrapper.
2022-04-20 10:59:09 +03:00
Botond Dénes
c3c71b3aa5 readers/delegating_v2: s/make_delegating_reader_v2/make_delegating_reader/
The argument type (v1 or v2 reader) is enough to disambiguate and
overloading the v1 method makes a transition to v2 more seamless.
2022-04-20 10:59:09 +03:00
Nadav Har'El
cc40685c28 test/cql-pytest: add test for filtering with IN restriction
It turns out that Cassandra does not allow IN restrictions together with
filtering, except, curiously, when the restriction is on a clustering key.
There is no real reason for this limitation - the error message even says
it is not *yet* supported.

Scylla, on the other hand, does support this case. Of course it's not
enough that we support it - we need to support it correctly... But we don't
have a full regression test that this support is correct - in
filtering_test.cc we test it with clustering and regular columns - but not
partition key columns.

So this patch adds a simple cql-pytest test that this sort of filtering
works in Scylla correctly for partition, clustering and regular columns
(and also confirms that these cases don't work, yet, on Cassandra).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220420075553.1008062-1-nyh@scylladb.com>
2022-04-20 09:56:22 +02:00
Konstantin Osipov
a3b790b413 test.py: add a dependency on python3-aiohttp and tabulate
Satisfy the build system requirements.

[avi: regenerate frozen toolchain]
2022-04-19 18:22:50 +03:00
Konstantin Osipov
097fbc7c5d .gitignore: ignore mypy_cache, the python lint cache 2022-04-19 16:48:47 +03:00
Pavel Emelyanov
41392a59bb storage_service: Remove pointless check in replace-bootstrap
The method in question is called in the branch where the replace address
is checked to be present, no need in extra explicit check.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-19 13:27:52 +03:00
Pavel Emelyanov
49481b1a21 storage_service: Generalize wait for range setup
Both the if is_replacing()/else branches call gossiper wating method as
their first steps. Can be done once.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-19 13:27:52 +03:00
Pavel Emelyanov
d213e6ffd1 storage_service: Merge common if-else branches in bootstrap
There are three modes in there -- bootstrap, b.s. with RBNO and b.s. for
replacing. All three are checked two times in a row, but can be done
once.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-19 13:27:52 +03:00
Pavel Emelyanov
b0df3a32b4 storage_service: Move tables bootstrap-ON upwards
This call just places a boolean flag on all. It won't hurt if it lasts
while the node is performing pre-bootstrap checks, but it allows making
the whole method less branchy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-19 13:27:52 +03:00
Avi Kivity
469bca5369 storage_proxy: coroutinize mutate_locally (vector overload)
The do_with() means we have an unconditional allocation, so we can
justify the coroutine's allocation (replacing it). Meanwhile,
coroutine::parallel_for_each() reduces an allocation if mutate_locally()
blocks.

Closes #10387
2022-04-19 10:59:16 +03:00
Botond Dénes
3051fc3cbc Merge 'Fix some errors and issues found by gcc 12' from Avi Kivity
gcc 12 checks some things that clang doesn't, resulting in compile errors.
This series fixes some of theses issues, but still builds (and tests) with clang.

Unfortunately, we still don't have a clean gcc build due to an outstanding bug [1].

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056

Closes #10386

* github.com:scylladb/scylla:
  build: disable warnings that cause false-positive errors with gcc 12
  utils: result_loop: remove invalid and incorrect constraint
  service: forward_service: avoid using deprecated std::bind1st and std::not1
  repair: explicityl ignore tombstone gc update response
  treewide: abort() after switch in formatters
  db: view: explicitly ignore unused result
  compaction: leveled_compaction_strategy: avoid compares between signed and unsigned
  compaction_manager: compaction_reenabler: disambiguate compaction_state
  api: avoid function specialization in req_param
  alternator: ttl: avoid specializing class templates in non-namespace scope
  alternator: executor: fix signed/unsigned comparison in is_big()
2022-04-19 10:25:38 +03:00
Botond Dénes
4d972b8d31 Merge 'storage_proxy: convert rpc handlers from lambdas to member functions' from Avi Kivity
Currently, rpc handlers are all lambdas inside
storage_proxy::init_messaging_service(). This means any stack trace
refers to storage_proxy::init_messaging_service::lambda#n instead of
a meaningful function name, and it makes init_messaging_service()
very intimidating.

Fix that by moving all such lambdas to regular member functions. The
first two patches remove unnecessary captures to make it easy; the
final patch coverts the lambdas to member functions.

Closes #10388

* github.com:scylladb/scylla:
  storage_proxy: convert rpc handlers from lambdas to member functions
  storage_proxy: don't capture messaging_service in server callbacks
  storage_proxy: don't capture migration_manager in server callbacks
2022-04-19 08:20:49 +03:00
Takuya ASADA
acaf0bb88a scripts: print perftune.py error message when capture_output=True
We currently does not able to get any error message from subprocess when we specified capture_output=True on subprocess.run().
This is because CalledProcessError does not print stdout/stderr when it raised, and we don't catch the exception, we just let python to cause Traceback.
Result of that, we only able to know exit status and failed command but
not able to get stdout/stderr.

This is problematic especially working on perftune.py bug, since the
script should caused Traceback but we never able to see it.

To resolve this, add wrapper function "out()" for capture output, and
print stdout/stderr with error message inside the function.

Fixes #10390

Closes #10391
2022-04-18 14:06:51 +03:00
Avi Kivity
27093d32d1 Merge 'gms: gossiper: coroutinize apply_state functions' from Pavel Solodovnikov
Mostly trivial conversions to coroutines in the gossiper to facilitate code readability.

Closes #10389

* github.com:scylladb/scylla:
  gms: gossiper: coroutinize `apply_state_locally`
  gms: gossiper: coroutinize `apply_state_locally_without_listener_notification`
  gms: gossiper: coroutinize `do_apply_state_locally`
  gms: gossiper: coroutinize `apply_new_states`
2022-04-18 13:48:07 +03:00
Avi Kivity
7129ddfa67 build: disable warnings that cause false-positive errors with gcc 12
gcc 12 generates some incorrect warnings (that we treat as errors).
Silence them so we can build.
2022-04-18 12:27:18 +03:00
Avi Kivity
160bbb00dd utils: result_loop: remove invalid and incorrect constraint
Checking a concept in a requires-expression requires an additional
requires keyword. Moreover, the constraint is incorrect (at least
all callers pass a T, not a result<T>), so remove it.

Found by gcc 12.
2022-04-18 12:27:18 +03:00
Avi Kivity
e55f5fab53 service: forward_service: avoid using deprecated std::bind1st and std::not1
Switch to newer alterantives std::bind_front, std::not_fn.
2022-04-18 12:27:18 +03:00
Avi Kivity
5da586271f repair: explicityl ignore tombstone gc update response
The response struct is empty and we have nothing to do with it. Cast
it to void to avoid a gcc warning.
2022-04-18 12:27:18 +03:00
Avi Kivity
1e1c0226a6 treewide: abort() after switch in formatters
It is typical in switch statements to select on an enum type and
rely on the compliler to complain if an enum value was missed. But
gcc isn't satisified since the enum could have a value outside the
declared list. Call abort() in this impossible situation to pacify
it.
2022-04-18 12:27:18 +03:00
Avi Kivity
a1df583dea db: view: explicitly ignore unused result
Otherwise, gcc complains.
2022-04-18 12:27:18 +03:00
Avi Kivity
eb436ac940 compaction: leveled_compaction_strategy: avoid compares between signed and unsigned
These can overflow. Here, there is no such risk, but switch to unsigned to
avoid the warning.
2022-04-18 12:27:18 +03:00
Avi Kivity
fa7172fcad compaction_manager: compaction_reenabler: disambiguate compaction_state
compaciton_state is used both as a type and a function, which gcc
does not like. Disambiguate by fully qualifying the type name.
2022-04-18 12:27:18 +03:00
Avi Kivity
de6631656c api: avoid function specialization in req_param
Function specializations are not allowed (you're supposed to use
overloads), but clang appears to allow them.

Here, we can't use an overload since the type doesn't appear in the
parameter list. Use a constraint instead.
2022-04-18 12:27:18 +03:00
Avi Kivity
40beb48176 alternator: ttl: avoid specializing class templates in non-namespace scope
The C++ standard disallows class template specialization in non-namespace
scopes. Clang apparently allows it as an extension.

Fix by not using a template - there are just two specializations and
no generic implementation. Use regular classes and std::conditional_t
to choose between the two.
2022-04-18 12:27:18 +03:00
Avi Kivity
b5e8e32c01 alternator: executor: fix signed/unsigned comparison in is_big()
Signed/unsigned comparisons are subject to C promotion rules. In is_big()
in this case the comparison is safe, but gcc warns. Use a cast to silence
the warning.

The sign/unsigned mix and int/size_t size differences still look bad, it
would be good to revisit this code, but that is left for another patch.
2022-04-18 12:23:18 +03:00
Piotr Sarna
fea18943cd schema_tables: drop leftover change to system_schema.keyspaces
Series 59d56a3fd7 introduced
an accidental backward incompatible regression by adding
a column to system_schema.keyspaces and then not even using
it for anything. It's a leftover from the original hackathon
implementation and should never reach master in the first place.
Fortunately, the series isn't part of any stable release yet.

Fixes #10376
Tests: manual, verifying that the system_schema.keyspaces table
no longer contains the extraneous column.

Closes #10377
2022-04-18 12:00:43 +03:00
Avi Kivity
36aee57978 storage_proxy: convert rpc handlers from lambdas to member functions
Currently, rpc handlers are all lambdas inside
storage_proxy::init_messaging_service(). This means any stack trace
refers to storage_proxy::init_messaging_service::lambda#n instead of
a meaningful function name, and it makes init_messaging_service()
very intimidating.

Fix that by moving all such lambdas to regular member functions.
This is easy now that they don't capture anything except `this`,
which we provide during registration via std::bind_front().

A few #includes and forward declarations had to be added to
storage_proxy.hh. This is unfortunate, but can only be solved
by splitting storage_proxy into a client part and a server part.
2022-04-17 19:03:06 +03:00
Avi Kivity
f7e8109b16 storage_proxy: don't capture messaging_service in server callbacks
We'd like to make the server callbacks member functions, rather
than lambdas, so we need to eliminate their captures. This patch
eliminats 'ms' by referringn to the already existing member '_messaging'
instead.
2022-04-17 17:55:05 +03:00
Avi Kivity
4cac2eb43e storage_proxy: don't capture migration_manager in server callbacks
We'd like to make the server callbacks member functions, rather
than lambdas, so we need to eliminate their captures. This patch
eliminates 'mm' by making it a member variable and capturing 'this'
instead. In one case 'mm' was used by a handle_write() intermediate
lambda so we have to make that non-static and capture it too.

uninit_messaging_service() clears the member variable to preserve
the same lifetime 'mm' had before, in case that's important.
2022-04-17 17:54:51 +03:00
Avi Kivity
86dfe75268 Update seastar submodule
* seastar acf7e3523b...5e86362704 (10):
  > Merge "Respect taskset-configured cpumask" from Pavel E
Ref #9505.
  > rpc_tester: Run CPU hogs on server side too
  > std-coroutine: include <coroutine> for LLVM-15
  > Revert "Merge "tests: perf: measure coroutines performance" from Benny"
  > test: perf_tests: remove [[gnu::always_inline]] attribute from coroutine perf tests
  > Merge "tests: perf: measure coroutines performance" from Benny
  > Merge "Extend RPC tester" from Pavel E
  > rpc: Mark connection trivial getters const noexcept
  > seastar-addr2line: Allow use of llvm-addr2line as the command
  > file: append_challenged_posix_file: Serialize allocate() to not block concurrent reads or writes
2022-04-17 17:11:31 +03:00
Pavel Solodovnikov
b25c4fee01 gms: gossiper: coroutinize apply_state_locally
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:51:18 +03:00
Pavel Solodovnikov
746f1179eb gms: gossiper: coroutinize apply_state_locally_without_listener_notification
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:38:33 +03:00
Pavel Solodovnikov
b7322c3f5d gms: gossiper: coroutinize do_apply_state_locally
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:29:26 +03:00
Pavel Solodovnikov
c48dcf607a gms: gossiper: coroutinize apply_new_states
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:28:42 +03:00
Kamil Braun
b1b22f2c2b service: raft: don't support/advertise USES_RAFT feature
The code would advertise the USES_RAFT feature when the SUPPORTS_RAFT
feature was enabled through a listener registered on the SUPPORTS_RAFT
feature.

This would cause a deadlock:
1. `gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`
   locks the gossiper (it's called for the first time from sstables
   format selector).
2. The function calls `on_change` listeners.
3. One of the listeners is the one for SUPPORTS_RAFT.
4. The listener calls
   `gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`.
5. This tries to lock the gossiper.

In turn, depending on timing, this could hang the startup procedure,
which calls `add_local_application_state` multiple times at various
points, trying to take the lock inside gossiper.

This prevents us from testing raft / group 0, new schema change
procedures that use group 0, etc.

For now, simply remove the code that advertises the USES_RAFT feature.
Right now the feature has no other effect on the system than just
becoming enabled. In fact, it's possible that we don't need this second
feature at all (SUPPORTS_RAFT may be enough), but that's
work-in-progress. If needed, it will be easy to bring the enabling code
back (in a fixed form that doesn't cause a deadlock). We don't remove
the feature definitions yet just in case.

Refs: #10355
2022-04-15 16:08:25 +02:00
Avi Kivity
4a5082bfc8 main: fix discarded future during prometheus start sequence
Probably not triggerable since it will be a while before we
recognize a signal to exit. But a FIXME is a FIXME.

Closes #10374
2022-04-15 16:40:31 +03:00
Avi Kivity
d90415434e main: wait for memory_threshold_guard start
We start the memory threshold guard (that enables large memory allocation
warnings post-boot) but don't wait for it. I can't imagine it can hurt,
but it does carry a FIXME label.

Closes #10375
2022-04-15 16:37:47 +03:00
Botond Dénes
75786c42cb Merge 'Add repair unit tests/v1' from Mikołaj Sielużycki
This patch series splits up parts of repair pipeline to allow unit testing
various bits of code without having to run full dtest suite. The reason why
repair pipeline has no unit tests is that by definition repair requires multiple
nodes, while unit test environment works only for a single node.

However, it is possible to explicitly define interfaces between various parts of the
pipeline, inject dependencies and test them individually. This patch series is focused
on taking repair_rows_on_wire (frozen mutation representation of changes coming from
another node) and flushing them to an sstable.

The commits are split into the following parts:
- pulling out classes to separate headers so that they can be included (potentially indirectly) from the test,
- pulling out repair_meta::to_repair_rows_list and part of repair_meta::flush_rows_in_working_row_buf so that they can be tested,
- refactoring repair_writer so that the actual writing logic can be injected as dependency,
- creating the unit test.

tests: unit(dev), dtest(incremental_repair_test, read_repair_test, repair_additional_test, repair_test)

Closes #10345

* github.com:scylladb/scylla:
  repair: Add unit test for flushing repair_rows_on_wire to disk.
  repair: Extract mutation_fragment_queue and repair_writer::impl interfaces.
  repair: Make parts of repair_writer interface private.
  repair: Rename inputs to flush_rows.
  repair: Make repair_meta::flush_rows a free function.
  repair: Split flush_rows_in_working_row_buf to two functions and make one static.
  repair: Rename inputs to to_repair_rows_list.
  repair: Make to_repair_rows_list a free function.
  repair: Make repair_meta::to_repair_rows_list a static function
  repair: Fix indentation in repair_writer.
  repair: Move repair_writer to separate header.
  repair: Move repair_row to a separate header.
  repair: Move repair_sync_boundary to a separate header.
  repair: Move decorated_key_with_hash to separate header.
  repair: Move row_repair hashing logic to separate class and file.
2022-04-14 18:17:03 +03:00
Kamil Braun
41f5b7e69e Merge branch 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla into next
* 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla:
  main: allow joining raft group0 before waiting for gossiper to settle
  service: raft_group0: make `join_group0` re-entrant
  service: storage_service: add `join_group0` method
  raft_group_registry: update gossiper state only on shard 0
  raft: don't update gossiper state if raft is enabled early or not enabled at all
  gms: feature_service: add `cluster_uses_raft_mgmt` accessor method
  db: system_keyspace: add `bootstrap_needed()` method
  db: system_keyspace: mark getter methods for bootstrap state as "const"
2022-04-14 16:42:20 +02:00
Botond Dénes
737cc798ca Merge "Add flat_mutation_reader_from_mutation_v2" from Benny Halevy
"
Optimize consuming from a single partition.

This gives us significant improvement with single, small mutations,
as shown with perf_mutation_readers, compared to the vector-based
flat_mutation_reader_from_mutations_v2.

These are expected to be common on the write path,
and can be optimized for view building.

results from: perf_mutation_readers -c1 --random-seed=840478750
(userspace cpu-frequency governer, 2.2GHz)

test                                      iterations      median         mad         min         max

Before:
combined.one_row                              720118   825.668ns     1.020ns   824.648ns   827.750ns

After:
combined.one_mutation                         881482   751.157ns     0.397ns   750.211ns   751.912ns
combined.one_row                              843270   756.553ns     0.303ns   755.889ns   757.911ns

The grand plan is to follow up
with make_flat_mutation_reader_from_frozen_mutation_v2
so that we can read directly from either a mutation
or frozen_mutation without having to unfreeze it e.g. in
table::push_view_replica_updates.

Test: unit(dev)
Perf: perf_mutation_readers(release)
"

* tag 'flat_mutation_reader_from_mutation-v3' of https://github.com/bhalevy/scylla:
  perf: perf_mutation_readers: add one_mutation case
  test: mutation_query_test: make make_source static
  mutation readers: refactor make_flat_mutation_reader_from_mutation*_v2
  mutation readers: add make_flat_mutation_reader_from_mutation_v2
  readers: delete slice_mutation.hh
  test: flat_mutation_reader_test: mock_consumer: add debug logging
  test: flat_mutation_reader_test: mock_consumer: make depth counter signed
2022-04-14 17:23:21 +03:00
Botond Dénes
fa75d58cf0 Merge "Make snitch start/stop code look classical" from Pavel Emelyanov
"
There's a generic way to start-stop services in scylla, that includes
5 "actions" (some are optional and/or implicit though)

  service_config cfg = ...
  sharded<service>.start(cfg)
  service.invoke_on_all(&service::start)
  service.invoke_on_all(&service::shutdown)
  service.invoke_on_all(&servuce::stop)
  sharded<service>.stop()

and most of the service out there conforms to that scheme. Not snitch
(spoiler: and not tracing), for which there's a couple of helpers that
do all that magic behind the scenes, "configuring" snitch is done with
the help of overloaded constructors. The latter is extra complicated
with the need to register snitch drivers in class-registry for each
constructor overload. Also there's an external shards synchronization
on stop.

This set brings snitch start/stop code to the described standard: the
create/stop helpers are removed, creation acceps the config structure,
per-shard start/stop (snitch has no drain for now) happens in the
simple invoke-on-all manner.

The intended side effect of this change is the ability to add explicit
dependencies to snitch (in the future, not in this set).

tests: unit(dev)
"

* 'br-snitch-config' of https://github.com/xemul/scylla:
  snitch: Remove create_snitch/stop_snitch
  snitch: Simplify stop (and pause_io)
  snitch: Move io_is_stopped to property-file driver
  snitch: Remove init_snitch_obj()
  snitch: Move instance creation into snitch_ptr constructor
  snitch: Make config-based construction of all drivers
  snitch: Declare snitch_ptr peering and rework container() method
  snitch: Introduce container() method
2022-04-14 16:56:32 +03:00
Pavel Solodovnikov
d4b717afa7 main: allow joining raft group0 before waiting for gossiper to settle
A node can join group0 without waiting for gossiper if
it is either a fresh node, or it's an existing node, which
is already part of some group0 (i.e. have `group0_id` persisted
in system tables).

In that case the second `join_group0()` call inside the
`storage_service::join_token_ring` will be a no-op.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-14 12:20:50 +03:00
Benny Halevy
f5ef687acd perf: perf_mutation_readers: add one_mutation case
Measure performance of the single-mutation reader:
make_flat_mutation_reader_from_mutation_v2.

Comparable to the `one_row` case that consumes the
single mutation using the multi-mutatio reader:
make_flat_mutation_reader_from_mutations_v2

perf_mutation_readers shows ~20-30% improvement of
make_flat_mutation_reader_from_mutation_v2
the same single mutation, just given as a single-item vector
to make_flat_mutation_reader_from_mutations_v2.

test                                      iterations      median         mad         min         max

Before:
combined.one_row                              720118   825.668ns     1.020ns   824.648ns   827.750ns

After:
combined.one_mutation                         881482   751.157ns     0.397ns   750.211ns   751.912ns
combined.one_row                              843270   756.553ns     0.303ns   755.889ns   757.911ns

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-14 11:39:05 +03:00
Benny Halevy
a4b69fe7b6 test: mutation_query_test: make make_source static
No need for it to be public.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-14 11:15:19 +03:00
Benny Halevy
ddb5166b82 mutation readers: refactor make_flat_mutation_reader_from_mutation*_v2
Extract the common parts of the single mutation reader
and the vector-based variant into mutation_reader_base
and reuse from both readers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-14 11:15:17 +03:00
Benny Halevy
e85241d5b6 mutation readers: add make_flat_mutation_reader_from_mutation_v2
Optimize reading from a single partition.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-14 11:14:43 +03:00
Benny Halevy
394eb1271d readers: delete slice_mutation.hh
slice_mutations() is currently used only by readers/mutation_readers.cc
so there's no need to expose it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-14 08:41:31 +03:00
Benny Halevy
ee2c7948f3 test: flat_mutation_reader_test: mock_consumer: add debug logging
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-14 08:41:31 +03:00
Benny Halevy
38cdfca824 test: flat_mutation_reader_test: mock_consumer: make depth counter signed
We want to return stop_iteration::yes once we crossed
the initial depth threshold, with an unsigned depth counter,
it might wraparound and look > 1.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-14 08:41:31 +03:00
Tomasz Grabiec
d293bd9579 Merge "Enable forwarding in raft randomized_nemesis_test" from Kamil
The test will now, with probability 1/2, enable forwarding of entries by
followers to leaders. This is possible thanks to the new abort_source&
APIs which we use to ensure that no operations are running on servers
before we destroy them.

Some adjustments were required to the server abort procedure in order to
prevent rare hangs (see first patch). We also translate some low-level
exceptions coming from seastar primitives to high-level Raft API
exceptions (second patch).

* kbr/nemesis-enable-fd-v1:
  test: raft: randomized_nemesis_test: enable entry forwarding
  test: raft: randomized_nemesis_test: increase logging level on some rare operations
  raft: server: translate abort_requested_exception to raft::request_aborted
  raft: fsm: when stopping, become follower to reject new requests
2022-04-13 18:40:23 +02:00
Piotr Sarna
61057446f7 Merge 'forward_service: retry failed forwarder call' from Michał Sala
This pull request adds support for retrying failed forwarder calls
(currently used to parallelize `select count(*) from ...` queries).
Failed-to-forward sub-queries will be executed locally (on a
super-coordinator). This local execution is meant as a fallback for a
forward_requests that could not be sent to its destined coordinator
(e.g. due gossiper not reacting fast enough). Local execution was chosen
as the safest one - it does not require sending data to another
coordinator.

Due to problems with misscompilations, some parts of the
`forward_service` were uncoroutinized.

Fixes: #10131

Closes #10329

* github.com:scylladb/scylla:
  forward_service: uncoroutinize dispatch method
  forward_service: uncoroutinize retrying_dispatcher
  forward_service: rety a failed forwarder call
  forward_service: copy arguments/captured vars to local variables
2022-04-13 09:41:35 +02:00
Nadav Har'El
6cafffe281 test/cql-pytest: reproduce internal server error on null subscript
The restriction "WHERE m[NULL] = 2" should result in an invalid request
error, but currently results in an ugly internal server error.
This test reproduces it, and since the bug is still in the code - is
marked as xfail.

Refs #10361

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220412134118.829671-1-nyh@scylladb.com>
2022-04-13 08:49:48 +03:00
Nadav Har'El
ae0e1574dc test/cql-pytest: reproducer for CONTAINS NULL bug
This is a reproducer for issue #10359 that a "CONTAINS NULL" and
"CONTAINS KEY NULL" restrictions should not match any set, but currently
do match non-empty or all sets.

The tests currently fail on Scylla, so marked xfail. They also fails on
Cassandra because Cassandra considers such a request an error, which
we consider a mistake (see #4776) - so the tests are marked "cassandra_bug".

Refs #10359.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220412130914.823646-1-nyh@scylladb.com>
2022-04-13 08:49:23 +03:00
Nadav Har'El
5d87ead9f1 test/cql-pytest: add more tests comparing against NULL
We already have a test showing that WHERE v=NULL ALLOW FILTERING is
allowed in Scylla (unlike Cassandra), and matches nothing. Here
we add two further tests that confirm that:

1. Not only is v=NULL allowed - v<NULL, v<=NULL, and so on, is also
   allowed and matches nothing.

2. The ALLOW FILTERING is required in in those requests. Without it,
   both Scylla and Cassandra generate the same "ALLOW FILTERING is
   required" error.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220411214503.770413-1-nyh@scylladb.com>
2022-04-13 08:48:55 +03:00
Avi Kivity
987e6533d2 transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.

Fix by updating the error codes.

A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.

Fixes #5610.

Closes #10362
2022-04-12 19:19:52 +03:00
Avi Kivity
8aec146dec Merge "Remove qctx from repair" from Pavel E
"
Repair code keeps its history in system keyspace and uses the qctx
global thing to update and query it. This set replaces the qctx with
the explicit reference on the system_keyspace object.

tests: unit(dev), dtest.repair_test(dev)
"

* 'br-repair-vs-qctx' of https://github.com/xemul/scylla:
  repair, system_keyspace: Query repair_history with a helper
  repair: Update loader code to use system_keyspace entry
  repair, system_keyspace: Update repair_history with a helper
  repair: Keep system keyspace reference
2022-04-12 17:08:41 +03:00
Tomasz Grabiec
0c365818c3 utils/chunked_managed_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no user impact.

Fixes #10364.

Message-Id: <20220411224741.644113-1-tgrabiec@scylladb.com>
2022-04-12 16:37:11 +03:00
Tomasz Grabiec
01eeb33c6e utils/chunked_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no known user impact.

Fixes #10363.

Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
2022-04-12 16:35:17 +03:00
Avi Kivity
546ee814dd Merge 'schema_tables, sstables: return instead of throwing' from Piotr Sarna
This miniseries rewrites a few unnecessary throws into forwarding the exception directly. It's partially possible thanks to the new `co_await coroutine::return_exception` mechanism which allows returning from a coroutine early, without explicitly calling co_return (d5843f6e88).

Closes #10360

* github.com:scylladb/scylla:
  sstables: : remove unnecessary throws
  schema_tables: remove unnecessary throws
2022-04-12 15:18:14 +03:00
Piotr Sarna
bce2933d99 sstables: : remove unnecessary throws
Throws are translated to passing the exceptions directly.
2022-04-12 13:09:54 +02:00
Piotr Sarna
91f130bd9c schema_tables: remove unnecessary throws
Throws are translated to passing the exception directly.
2022-04-12 13:09:27 +02:00
Pavel Emelyanov
05eb9c9416 repair, system_keyspace: Query repair_history with a helper
Querying the table is now done with the help of qctx directly. This
patch replaces it with a querying helper that calls the consumer
function with the entry struct as the argument.

After this change repair code can stop including query_context and
mess with untyped_result_set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-12 14:04:21 +03:00
Pavel Emelyanov
59f4aa0934 repair: Update loader code to use system_keyspace entry
Patch the history entry loader to use the recently introduced
history entry. This is just to reduce the churn in the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-12 13:59:55 +03:00
Pavel Emelyanov
9940016e05 repair, system_keyspace: Update repair_history with a helper
Current code works directly on the qctx which is not nice. Instead,
make it use the system keyspace reference. To make it work, the patch
adds a helper method and introduces a helper struct for the table
entry. This struct will also be used to query the table (next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-12 13:57:57 +03:00
Pavel Emelyanov
e501ebd6c2 repair: Keep system keyspace reference
Repair updates (and queries on start) the system.repair_history table
and thus depends on the system_keyspace object

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-12 13:57:08 +03:00
Avi Kivity
aa7c6dfaa9 Merge 'Commitlog: refactor file handling - simplify file management + make bookkeep safer' from Calle Wilund
Adds a "named_file" wrapper type in commitlog, encapsulating file and disk size, the latter being updated automatically on write/truncate/allocate/delete operations. Use this instead of loose vars in segments, and also in recycle/delete lists.

Having the data propagate with the objects means we can dispose of re-reading sizes from disk, which in turn means we know what "our" view of the file sizes is when we try to delete/recycle them -> we can bookkeep accurately (from our view point) without having to resort to the rather horrible recalculation of disk footprint.

This series also drops non-recycled segment handling, since it is not used anywhere, and just makes things harder.
It also adds a parameter to set flush threshold.
These two first patches could be broken out into separate PR:s if need be.

Closes #10084

* github.com:scylladb/scylla:
  commitlog: Fold named_file continuations into caller coroutine frame
  commitlog: Use named named_file objects in delete/dispose/recycle lists
  commitlog: Use named_file size tracking instead of segment var
  commitlog: Use named_file in segment
  commitlog: Add "named_file" file wrapping type
  commitlog: Make flush threshold a config parameter
  commitlog: kill non-recycled segment management
2022-04-12 11:28:36 +03:00
Raphael S. Carvalho
f05ae92849 compaction: move compaction::enable_garbage_collected_sstable_writer() into protected namespace
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220411181322.192830-2-raphaelsc@scylladb.com>
2022-04-12 11:21:18 +03:00
Raphael S. Carvalho
3741e7fb6d compaction: LCS: kill unused bootstrapping code
With off-strategy, we no longer need LCS explicitly switching to STCS
mode, and even without off-strategy, the dynamic fan-in approach
in compaction manager will cause LCS to automatically switch to
STCS under heavy write load.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220411181322.192830-1-raphaelsc@scylladb.com>
2022-04-12 11:21:18 +03:00
Mikołaj Sielużycki
b16e12f3a1 repair: Add unit test for flushing repair_rows_on_wire to disk.
The unit test executes a simplified repair scenario by:
- producing a random stream of mutation mutation_fragments,
- convering them to repair_rows_on_wire,
- convering them to list of repair_rows using the conversion logic
  extracted in previous commits from repair_meta,
- flushing the rows to an sstable using the logic extracted in previous
  commits from repair_meta,
- comparing the sstable contents with the originally produced mutation
  fragments.

The test checks only the flushing part and is not concerned with any
other piece of the repair pipeline.
2022-04-12 09:22:10 +02:00
Mikołaj Sielużycki
39205917a8 repair: Extract mutation_fragment_queue and repair_writer::impl interfaces. 2022-04-12 09:22:03 +02:00
Mikołaj Sielużycki
a52126d861 repair: Make parts of repair_writer interface private. 2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki
826e0e9d8a repair: Rename inputs to flush_rows. 2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki
4dd32064a3 repair: Make repair_meta::flush_rows a free function. 2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki
046e8c31db repair: Split flush_rows_in_working_row_buf to two functions and make one static.
It allows pulling out the logic of writing internal representation
of repair mutations to disk. This in turn is needed to unit test
this functionality without spinning up clusters, which significantly
improves developer iteration time.
2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki
ca53a7fcc9 repair: Rename inputs to to_repair_rows_list. 2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki
c7a7680c7d repair: Make to_repair_rows_list a free function. 2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki
69fc74ffbe repair: Make repair_meta::to_repair_rows_list a static function
It allows pulling out the logic of convering on-the-wire representation
of repair mutations to an internal representation used later for
flushing repair mutations to disk. This in turn is needed to unit test
the functionality without spinning up clusters, which significantly
improves developer iteration time.
2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki
4ba48e5739 repair: Fix indentation in repair_writer. 2022-04-12 09:20:14 +02:00
Mikołaj Sielużycki
3ff738db6b repair: Move repair_writer to separate header. 2022-04-12 09:20:03 +02:00
Mikołaj Sielużycki
04986e8c8e repair: Move repair_row to a separate header. 2022-04-12 08:50:34 +02:00
Mikołaj Sielużycki
7b0cbdeac5 repair: Move repair_sync_boundary to a separate header. 2022-04-12 08:50:34 +02:00
Mikołaj Sielużycki
f9c75952ea repair: Move decorated_key_with_hash to separate header. 2022-04-12 08:50:34 +02:00
Mikołaj Sielużycki
0fa703de3e repair: Move row_repair hashing logic to separate class and file. 2022-04-12 08:50:34 +02:00
Calle Wilund
0e2a3e02ae commitlog: Fold named_file continuations into caller coroutine frame
Saves a continuation. That matters very little. But...
Uses a special awaiter type on returns from the "then(...)"-wrapping
named_file methods (which use a then([...update]) to keep internal
size counters up-to-date, making the continuation instead a stored func
into the returned awaiter, executed on successul resume of the caller
co_await.
2022-04-11 16:34:00 +00:00
Calle Wilund
ed8f0df105 commitlog: Use named named_file objects in delete/dispose/recycle lists
Changes delete/close queue, as well as deletetion queue into one, using
named_file objects + marker. Recycle list now also contains said named
file type.

This removes the need to re-eval file sizes on disk when deleting etc,
which in turn means we can dispose of recalculate_footprint on errors,
thus making things simpler and safer.
2022-04-11 16:34:00 +00:00
Calle Wilund
cdd4066006 commitlog: Use named_file size tracking instead of segment var
I.e. "auto-keep-track" of disk footprint
2022-04-11 16:34:00 +00:00
Calle Wilund
320c49e8d3 commitlog: Use named_file in segment
Uses named_file instead of file+string in segments.
Does not do anything particularly useful with it.
2022-04-11 16:34:00 +00:00
Calle Wilund
97bf7b1fc8 commitlog: Add "named_file" file wrapping type
For keeping track of file, name and size, even across
close/rename/delete.
2022-04-11 16:34:00 +00:00
Calle Wilund
7dd7760e8d commitlog: Make flush threshold a config parameter 2022-04-11 16:34:00 +00:00
Calle Wilund
d478896d46 commitlog: kill non-recycled segment management
It has been default for a while now. Makes no sense to not do it.
Even hints can use it (even if it makes no difference there)
2022-04-11 16:34:00 +00:00
Raphael S. Carvalho
8427ec056c gms: gossiper: don't duplicate knowledge of minimum time for gossip to settle
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220409022435.58070-2-raphaelsc@scylladb.com>
2022-04-11 19:19:02 +03:00
cvybhu
5c199cad45 cql3: expr: possible_lhs_values: Handle subscript
This commit makes subscript an invalid argument to possible_lhs_values.
Previously this function simply ignored subscripts
and behaved as if it was called on the subscripted column
without a subscript.

This behaviour is unexpected and potentially
dangerous so it would be better to forbid
passing subscript to possible_lhs_values entirely.

Trying to handle subscript correctly is impossible
without refactoring the whole function.
The first argument is a column for which we would
like to know the possible values.
What are possible values of a subscripted column c where c[0] = 1?
All lists that have 1 on 0th position?

If we wanted to handle this nicely we would have to
change the arguments.
Such refectoring is best left until the time
when this functionality is actually needed,
right now it's hard to predict what interface
will be needed then.

Signed-off-by: cvybhu <jan.ciolek@scylladb.com>

Closes #10228
2022-04-11 19:05:09 +03:00
Gleb Natapov
a3e8ae0979 storage_proxy: fix silencing of remote read errors
Filtering remote rpc errors based on exception type did not work because
the remote errors were reported as std::runtime_error and all rpc
exceptions inherit from it. New rpc propagates remote errors using
special type rpc::remote_verb_error now, so we can filter on that
instead.

Fixes #10339

Message-Id: <YlQYV5G6GksDytGp@scylladb.com>
2022-04-11 18:53:25 +03:00
Botond Dénes
08bcbd25e7 Merge 'toolchain: speed up prepare' from Avi Kivity
This series speeds up tools/toolchain/prepare in a few ways:
 - builds images in parallel
 - allows running on any arch as host
 - reduces work in building the image
 - removes unneeded layers

Closes #10348

* github.com:scylladb/scylla:
  tools: toolchain: prepare: sqush intermediate container layers
  tools: toolchain: update container image first thing
  tools: toolchain: prepare: build arch images in parallel
  tools: toolchain: prepare: aloow running on non-x86
2022-04-11 15:47:10 +03:00
Avi Kivity
fda99de15b Update seastar submodule
* seastar 05cdfc2d30...acf7e3523b (3):
  > http reply: avoid copying content
  > rpc: deliver remote verb exceptions as rpc::remote_verb_error instead of std::runtime_error
  > rpc: drop unneeded code
2022-04-11 15:12:43 +03:00
Pavel Emelyanov
828a951886 snitch: Remove create_snitch/stop_snitch
After previous patches both, create_snitch() and stop_snitch() no look
like the classica sharded service start/stop sequence. Finally both
helpers can be removed and the rest of the user can just call start/stop
on locally obtained sharded references.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 14:43:25 +03:00
Pavel Emelyanov
20e623f16d snitch: Simplify stop (and pause_io)
Both first stop/pause snitch driver on io-ing shard, then proceed with
the rest. This sequence is pretty pointless and here's why.

The only non-trivial stop()/pause_io() method out there is in the
property-file snitch driver. In it, both methods check if the current
shard is the io-ing one, if no -- return back the resolved future, if
yes -- go ahead and stop/pause some IO. With this, for all shards but
io-ing one there's no point in starting after io-ing one is stopped,
they all can start (and finish) in parallel.

So what this patch does is just removes the pre-stop/pause kicking of
the io-ing shard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 14:43:23 +03:00
Pavel Emelyanov
2e42578dc8 snitch: Move io_is_stopped to property-file driver
This whole engine is only used by that driver, there's no point in it
sitting on the base class

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 14:43:20 +03:00
Pavel Emelyanov
28ecdc66ad snitch: Remove init_snitch_obj()
Now it's just a wrapper around sharded<snitch_ptr>::start()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 14:43:16 +03:00
Pavel Emelyanov
b3eaae629e snitch: Move instance creation into snitch_ptr constructor
Current API to create snitch is not like other services -- there's a
dedicated helper that does sharded<>.start() + invoke_on_all(&start)
calls. These helpers complicate do-globalization of snitch and rework
of services start-stop sequence, things get simpler if snitch uses
the same start-stop API as all the others. The first step towards this
change is moving the non-waiting parts of snitch initialization code
from init_snitch_obj() into snitch_ptr constructor.

A note on this change: after patch #2 the snitch_ptr<->driver linkage
connects local objects with each other, not container() of any. This
is important, because connecting container() would be impossible inside
constructor, as the container pointer is initialized by seastar _after_
the service constructor itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 14:38:35 +03:00
Pavel Emelyanov
633746b87d snitch: Make config-based construction of all drivers
Currently snitch drivers register themselves in class-registry with all
sorts of construction options possible. All those different constuctors
are in fact "config options".

When later snitch will declare its dependencies (gossiper and system
keyspace), it will require patching all this registrations, which's very
inconvenient.

This patch introduces the snitch_config struct and replaces all the
snitch constructors with the snitch_driver(snitch_config cfg) one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 14:38:34 +03:00
Pavel Emelyanov
fa59ccb89d snitch: Declare snitch_ptr peering and rework container() method
This patch makes the snitch base class reference local snitch_ptr, not
its sharded<> container and, respectively, makes the base container()
method return _backreference->container() instead.

The motivation of this change is, again, in the next patch, which will
move snitch_ptr<->driver_object linkage into snitch_ptr constructor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 14:38:32 +03:00
Pavel Emelyanov
552a08ecd0 snitch: Introduce container() method
Some snitch drivers want the peering_sharded_service::container()
functionality, but they can't directly use it, because the driver
class is in fact the pimplification behind the sharded<snitch_ptr>
service. To overcome this there's a _my_distributed pointer on the
driver base class that points back to sharded<snitch_ptr> object.

This patch replaces the direct _my_distributed usage with the
container() method that does it and also asserts that the pointer
in question is initialized (some drivers already do it, some don't).

Other than making the code more peering_sharded_service-like, this
patch allows changing _my_distributed into _backreference that
points to this shard's snitch_ptr, see next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 14:38:27 +03:00
Botond Dénes
270aba0f51 Merge "Abort database stopping barriers on exception" by Pavel Emelyanov
"
The database::shutdown() and ::drain() methods are called inside the
invoke_on_all()s synchronizing with each other via the cross-shard
_stop_barrier.

If either shard throws in between all others may get stuck waiting for
the barrier to collect all arrivals. To fix it the throwing shard
should wake up others, resolving the wait somehow.

The fix is actually patch #4, the first and the second are the abort()
method for the barrier itself.

Fixes: #10304

tests: unit(dev), manual
"

* 'br-barrier-exception-2' of https://github.com/xemul/scylla:
  database: Abort barriers on exception
  database: Coroutinize close_tables
  test: Add test for cross_shard_barrier::abort()
  cross-shard-barrier: Add .abort() method
2022-04-11 13:48:43 +03:00
Pavel Emelyanov
f63f1c3d69 database: Abort barriers on exception
The database::shutdown() and ::drain() methods are called inside the
container().invoke_on_all() and synchronize with each other via the
cross-shard _stop_barrier. If either shard throws in between all others
may get stuck waiting for the barrier to collect all arrivals.

The fix is to abort the barrier on exception thus making all the
shards sitting in shutdown or drain to bail out with exceptions too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-11 13:47:02 +03:00
Piotr Sarna
6d937f26ba Update seastar submodule
* seastar 2a2a1305...05cdfc2d (5):
  > Revert "core: reactor: fix a typo in `smp_pollfn::poll()`"
  > core: reactor: fix a typo in `smp_pollfn::poll()`
  > coroutine/exception: make it work with co_await
  > perftune.py: arfs: allow toggling on/off and allow auto-detection
  > coroutine: introduce as_future
2022-04-11 12:18:10 +02:00
Nadav Har'El
d9ec5ed46c test/cql-pytest: add test for blobAsInt() et al for various blob lengths
Recently I added a test that verified that blobAsInt() accepts a zero-
byte blob and return an "empty" integer. I was asked by one of the
reviewers - what happens if we try to pass a *three* byte blob to
blobAsInt()? Here is a new test that demonstrates that the answer is:
Besides the 0-byte blob, blobAsInt() only allows a 4-byte blob. Trying
3 or 5 bytes will result in an invalid query error being returned.

The test passes on both Cassandra and Scylla, confirming their behavior
is the same. The test checks all fixed-sized integer types - int (4
bytes), bigint (8 bytes), smallint (2 bytes) and tinyint (1 byte).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220411093803.651881-1-nyh@scylladb.com>
2022-04-11 12:44:22 +03:00
Raphael S. Carvalho
5cc46b3691 compaction: STCS: kill unused avg_size()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220408184419.100827-3-raphaelsc@scylladb.com>
2022-04-11 11:24:07 +03:00
Raphael S. Carvalho
6ab570d115 compaction: STCS: only proceed to trim bucket if interesting
In practice, a bucket that needs trimming will be interesting, but
this could be made clearer in the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220408184419.100827-2-raphaelsc@scylladb.com>
2022-04-11 11:24:07 +03:00
Raphael S. Carvalho
4f6003d335 compaction: STCS: simplify most_interesting_bucket()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220408184419.100827-1-raphaelsc@scylladb.com>
2022-04-11 11:24:07 +03:00
Nadav Har'El
84143c2ee5 alternator: implement Select option of Query and Scan
This patch implements the previously-unimplemented Select option of the
Query and Scan operators.

The most interesting use case of this option is Select=COUNT which means
we should only count the items, without returning their actual content.
But there are actually four different Select settings: COUNT,
ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, and ALL_PROJECTED_ATTRIBUTES.

Five previously-failing tests now pass, and their xfail mark is removed:

 *  test_query.py::test_query_select
 *  test_scan.py::test_scan_select
 *  test_query_filter.py::test_query_filter_and_select_count
 *  test_filter_expression.py::test_filter_expression_and_select_count
 *  test_gsi.py::test_gsi_query_select_1

These tests cover many different cases of successes and errors, including
combination of Select and other options. E.g., combining Select=COUNT
with filtering requires us to get the parts of the items needed for the
filtering function - even if we don't need to return them to the user
at the end.

Because we do not yet support GSI/LSI projection (issue #5036), the
support for ALL_PROJECTED_ATTRIBUTES is a bit simpler than it will need
to be in the future, but we can only finish that after #5036 is done.

Fixes #5058.

The most intrusive part of this patch is a change from attrs_to_get -
a map of top-level attributes that a read needs to fetch - to an
optional<attrs_to_get>. This change is needed because we also need
to support the case that we want to read no attributes (Select=COUNT),
and attrs_to_get.empty() used to mean that we want to read *all*
attributes, not no attributes. After this patch, an unset
optional<attrs_to_get> means read *all* attributes, a set but empty
attrs_to_get means read *no* attributes, and a set and non-empty
attrs_to_get means read those specific attributes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-2-nyh@scylladb.com>
2022-04-11 10:04:32 +02:00
Nadav Har'El
9c1ebdceea alternator: forbid empty AttributesToGet
In DynamoDB one can retrieve only a subset of the attributes using the
AttributesToGet or ProjectionExpression paramters to read requests.
Neither allows an empty list of attributes - if you don't want any
attributes, you should use Select=COUNT instead.

Currently we correctly refuse an empty ProjectionExpression - and have
a test for it:
test_projection_expression.py::test_projection_expression_toplevel_syntax

However, Alternator is missing the same empty-forbidding logic for
AttributesToGet. An empty AttributesToGet is currently allowed, and
basically says "retrieve everything", which is sort of unexpected.

So this patch adds the missing logic, and the missing test (actually
two tests for the same thing - one using GetItem and the other Query).

Fixes #10332

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405113700.9768-1-nyh@scylladb.com>
2022-04-11 10:21:02 +03:00
Nadav Har'El
86d01542de test/alternator: test another example of nested function calls
In the existing test we noticed that list_append(if_not_exists(...))
is allowed, but list_append(list_append(...)) is not. I wasn't sure
whether if_not_exists(if_not_exists(..)) will be allowed - and this
test verifies that it is - it works on both Scylla and DynamoDB, and
gives the same results on both.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220407122729.155648-1-nyh@scylladb.com>
2022-04-11 09:56:02 +03:00
Nadav Har'El
3456cbcfcf test/cql-pytest: split test_null.py into test_null and test_empty
We had in test_null.py a mixture of tests for null values and the
"null" CQL keyword - and tests for empty values. Null and empty
values are *not* the same thing, and there is no reason to keep the
tests for the two things in the same file and further confuse these
two distinct concepts.

This patch just moves code from test_null.py into a new test_empty.py -
there are no functional changes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220407090348.137583-2-nyh@scylladb.com>
2022-04-11 09:54:54 +03:00
Nadav Har'El
cf79d84efa test/cql-pytest: add regression test for "empty" integer
In https://github.com/scylladb/scylla-rust-driver/issues/278 we noted
that beyond the concept of a null integer value (which has size -1),
there is also an empty integer value (size 0). This patch adds a test
that it works as expected. And we see that it does - Scylla stores such
a value fine, and the Python driver retrieves it the same as a null
(arguably, this is fine - the important point is to see that we don't
get a crash or an error).

The test passes - I just added it as a regression test for the future.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220407090348.137583-1-nyh@scylladb.com>
2022-04-11 09:54:53 +03:00
Avi Kivity
65720bcfd1 tools: toolchain: prepare: sqush intermediate container layers
Without this, the image contains awkward container layers with one
file (from the ADD commands). It's not a disaster, just pointless.
2022-04-10 19:00:36 +03:00
Avi Kivity
4bc5f1ba98 tools: toolchain: update container image first thing
Otherwise, rpm dependency resolution starts by installing an older
version of gcc (to satisfy an older preinstalled libgcc dependency),
then updates it. After the change, we install the updated gcc in
the first place.
2022-04-10 18:48:07 +03:00
Avi Kivity
69af7a830b tools: toolchain: prepare: build arch images in parallel
To speed up the build, run each arch in parallel, using bash's
awkward job control.
2022-04-10 18:45:08 +03:00
Avi Kivity
39ccd744de tools: toolchain: prepare: aloow running on non-x86
`prepare` builds a multiarch image using qemu emulation. It turns
out that aarch64 emulation is slowest (due to emulating pointer
authentication) so it makes sense to run it on an aarch64 host. To do
that, we need only to adjust the check for qemu installation.

Unfortunately, docker arch names and Linux arch names are different,
so we have to add an ungainly translation, but otherwise it is a
simple loop.
2022-04-10 18:17:00 +03:00
Avi Kivity
59d56a3fd7 Merge 'Add keyspace storage options' from Piotr Sarna
This series is part of the shared storage project.

The STORAGE option is designed to hold a map of options
used for customizing storage for given keyspace.
The option is kept in a system_schema.scylla_keyspaces table.

This option is guarded with a schema feature, because it's kept in a new schema table: `system_schema.scylla_keyspaces`.

Example of the contents of the new table:
```cql
cassandra@cqlsh> select * from system_schema.scylla_keyspaces;

 keyspace_name | storage_options                                | storage_type
---------------+------------------------------------------------+--------------
           ksx | {'bucket': '/tmp/xx', 'endpoint': 'localhost'} |           S3
```
Native storage options are not kept in the table, as this format doesn't hold any extra options and it would therefore just be a waste of storage.

Closes #10144

* github.com:scylladb/scylla:
  test: regenerate schema_change_test for storage options case
  test: improve output of schema_change_test regeneration
  docs: add a paragraph on keyspace storage options
  test: add test cases for keyspace storage options
  database,cql3: add STORAGE option to keyspaces
  db: add keyspace-storage-options experimental feature
  db,schema_tables: add scylla_keyspaces table
  db,gms: add SCYLLA_KEYSPACE schema feature
  db,gms: add KEYSPACE_STORAGE_OPTIONS feature
2022-04-10 17:23:56 +03:00
Avi Kivity
379892142d Merge 'Coroutinize view_update_builder::build_some' from Benny Halevy
Simplify view_update_builder::build_some by turning it into a coroutine,
and make view_updates::move_to async (also using a coroutine) so it may yield in-between building the updates, since freezing each mutation can be cpu intensive and preparing many updates synchronously may cause reactor stalls.

Test: unit(dev)
DTest: materialized_views_test.py(dev)

Closes #10344

* github.com:scylladb/scylla:
  db: view_updates: coroutinize move_to
  db: view_update_builder: build_some: maybe yield between updates
  db: view_update_builder: build_some: fixup indentation
  db: view_update_builder: coroutinize build_some
2022-04-10 16:13:58 +03:00
Raphael S. Carvalho
7b1589cb3d tests: chunked_managed_vector_test: Test correctness when crossing chunk boundary
While reviewing "utils/chunked_managed_vector: Fix corruption in case there is more
than one chunk", I was worried that there could be a correctness issue
when pop_back() pops off the first element of the last chunk, but turns
out I made an off-by-one error in my theory. Anyway, I wrote a unit test
to verify my assumption and I found worth submitting it upstream.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220408133555.12397-2-raphaelsc@scylladb.com>
2022-04-08 16:44:16 +02:00
Raphael S. Carvalho
2c11673246 utils/chunked_managed_vector: expose max_chunk_capacity()
That's useful for tests which want to verify correctness when the
vector is performing operations across the chunk boundary.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220408133555.12397-1-raphaelsc@scylladb.com>
2022-04-08 16:44:00 +02:00
Benny Halevy
6454c8d67f db: view_updates: coroutinize move_to
And allow yielding in-between freezing each update mutation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-08 11:29:25 +03:00
Benny Halevy
0e570d6ffa db: view_update_builder: build_some: maybe yield between updates
`update.move_to` freezes the mutation

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-08 11:22:41 +03:00
Benny Halevy
243ba2e976 db: view_update_builder: build_some: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-08 11:21:42 +03:00
Benny Halevy
3e376155ef db: view_update_builder: coroutinize build_some
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-08 11:20:35 +03:00
Piotr Sarna
151d8f7c58 test: regenerate schema_change_test for storage options case
Keyspace storage options series adds a new schema table:
system_schema.scylla_keyspaces. The regenerated cases ensure
that this new table is taken into account when the schema feature
is available.
2022-04-08 09:17:01 +02:00
Piotr Sarna
4705a5fa42 test: improve output of schema_change_test regeneration
Schema change test operates on pre-generated sstables, and sometimes
this set of sstables needs to be regenerated. In order to make the
regeneration process more ergonomic, the output is now directly
copyable as valid C++ representation of UUIDs.
2022-04-08 09:17:01 +02:00
Piotr Sarna
20de52d96c docs: add a paragraph on keyspace storage options
A new CQL extension: allowing to specify keyspace storage options,
is now described in our design notes.
2022-04-08 09:17:01 +02:00
Piotr Sarna
97c9729487 test: add test cases for keyspace storage options
The test cases check if it's possible to set and/or alter
storage options for keyspaces with CQL, and whether the changes
are reflected in the schema tables.
2022-04-08 09:17:01 +02:00
Piotr Sarna
58529591a9 database,cql3: add STORAGE option to keyspaces
The STORAGE option is designed to hold a map of options
used for customizing storage for given keyspace.
The option is kept in a system_schema.scylla_keyspaces table.
The option is only available if the whole cluster is aware
of it - guarded by a cluster feature.

Example of the table contents:
```
cassandra@cqlsh> select * from system_schema.scylla_keyspaces;

 keyspace_name | storage_options                                | storage_type
---------------+------------------------------------------------+--------------
           ksx | {'bucket': '/tmp/xx', 'endpoint': 'localhost'} |           S3
```
2022-04-08 09:17:01 +02:00
Piotr Sarna
3272b4826f db: add keyspace-storage-options experimental feature
Specifying non-standard keyspace options is experimental, so it's
going to be protected by a configuration flag.
2022-04-08 09:17:01 +02:00
Piotr Sarna
7f02b188b7 db,schema_tables: add scylla_keyspaces table
The table holds scylla-specific information on keyspaces.
The first columns include storage_type and storage_options,
which will be used later to store storage information.
2022-04-08 09:17:00 +02:00
Piotr Sarna
120980ac8e db,gms: add SCYLLA_KEYSPACE schema feature
This schema feature will be used to guard the upcoming
system_schema.scylla_keyspaces schema table.
2022-04-08 09:17:00 +02:00
Piotr Sarna
567c0d0368 db,gms: add KEYSPACE_STORAGE_OPTIONS feature
The feature represents the ability to store storage options
in keyspace metadata: represented as a map of options,
e.g. storage type, bucket, authentication details, etc.
2022-04-08 09:17:00 +02:00
Tomasz Grabiec
41fe01ecff utils/chunked_managed_vector: Fix corruption in case there is more than one chunk
If reserve() allocates more than one chunk, push_back() should not
work with the last chunk. This can result in items being pushed to the
wrong chunk, breaking internal invariants.

Also, pop_back() should not work with the last chunk. This breaks when
there is more than one chunk.

Currently, the container is only used in the sstable partition index
cache.

Manifests by crashes in sstable reader which touch sstables which have
partition index pages with more than 1638 partition entries.

Introduced in 78e5b9fd85 (4.6.0)

Fixes #10290

Message-Id: <20220407174023.527059-1-tgrabiec@scylladb.com>
2022-04-07 21:26:35 +03:00
Benny Halevy
40ad057b6c database: delete db_apply_executor forward declaration
The class is long gone, since version 3.0.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220407094632.2647967-1-bhalevy@scylladb.com>
2022-04-07 17:11:38 +03:00
Pavel Solodovnikov
293c5f39ee service: raft_group0: make join_group0 re-entrant
Detect if we have already finished joining group0 before
and do nothing in that case.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:36:40 +03:00
Pavel Solodovnikov
057a12e213 service: storage_service: add join_group0 method
Just delegates work to `service::raft_group0::join_group0()`
so that it can be used in `main` to activate raft group0
early in some cases (before waiting for gossiper to settle).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:36:33 +03:00
Pavel Solodovnikov
0d5e2157e1 raft_group_registry: update gossiper state only on shard 0
Since `gossiper::add_local_application_state` is not
safe to call concurrently from multiple shards (which
will cause a deadlock inside the method), call this
only on shard 0 in `_raft_support_listener`.

This fixes sporadic hangs when starting a fresh node in an
empty cluster where node hangs during startup.

Tests: unit(dev), manual

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:33:40 +03:00
Pavel Solodovnikov
7903d2afa8 raft: don't update gossiper state if raft is enabled early or not enabled at all
There is a listener in the `raft_group_registry`,
which makes the gossiper to re-publish supported
features app state to the cluster.

We don't need to do this in case `USES_RAFT_CLUSTER_MANAGEMENT`
feature is enabled before the usual time, i.e. before the
gossiper settles. So, short-circuit the listener logic in
that case and do nothing.

Also, don't do anything if raft group registry is not enabled
at all, this is just a generic safeguard.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:31:29 +03:00
Pavel Solodovnikov
ccb59ba6c7 gms: feature_service: add cluster_uses_raft_mgmt accessor method
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:30:21 +03:00
Wojciech Mitros
97408078a1 dependencies: add rust
The main reason for adding rust dependency to scylla is the
wasmtime library, which is written in rust. Although there
exist c++ bindings, they don't expose all of its features,
so we want to do that ourselves using rust's cxx.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
[avi: update toolchain]
[avi: remove example, saving for a follow-on]
2022-04-07 12:26:05 +03:00
Botond Dénes
ad075b27a4 test/lib/mutation_diff: s/colordiff/diff/
Colordiff is problematic when writing the diff into a file for later
examination. Use regular diff instead. One can still get syntax
highlighting by writing the output into `.diff` file (which most editors
will recognize).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220407080944.324108-1-bdenes@scylladb.com>
2022-04-07 12:07:24 +03:00
Michael Livshin
da7c7fd3dc delete code of the unused normalizing_reader class
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220406161107.2376568-3-michael.livshin@scylladb.com>
2022-04-07 09:29:41 +03:00
Michael Livshin
d8598d048a enormous_table_reader: inherit from flat_mutation_reader_v2::impl
(completely mechanical change)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220406161107.2376568-2-michael.livshin@scylladb.com>
2022-04-07 09:29:41 +03:00
Michael Livshin
702ad7447a enormous_table_reader: remove the duplicate _schema field
flat_mutation_reader{,_v2}::impl already contains one, which makes for
very exciting debugging experience (and no, clang does not mind at
all).

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220406161107.2376568-1-michael.livshin@scylladb.com>
2022-04-07 09:29:41 +03:00
Pavel Emelyanov
9066224cf4 table: Don't export compaction manager reference
There's a public call on replica::table to get back the compaction
manager reference. It's not needed, actually. The users of the call are
distributed loader which already has database at hand, and a test that
creates itw own instance of compaction manager for its testing tables
and thus also has it available.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220406171351.3050-1-xemul@scylladb.com>
2022-04-07 09:27:45 +03:00
Pavel Emelyanov
2cab2a32b8 database: Coroutinize close_tables
To make next patch a bit simpler

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-06 18:43:32 +03:00
Pavel Emelyanov
401c0edea2 test: Add test for cross_shard_barrier::abort()
The tests runs a loop of arrivals each of which can randomly
throw before arriving. As the result the test expects all shards
to resolve into exception in the same phase.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-06 18:21:59 +03:00
Pavel Emelyanov
8d7a7cbe21 cross-shard-barrier: Add .abort() method
The method makes all the .arrive_and_wait()s in the current phase
to resolve with barrier_aborted_exception() exceptional future.

The barrier turns into a broken state and is not supposed to serve
any subsequence arrivals anyhow reasonably.

The .abort() method is re-entrable in two senses. The first is that
more than one shard can abort a barrier, which is pretty natural.
The second is that the exception-safety fuses like that imply that
if the arrive_and_wait() resolves into exception the caller will try
to abort() the barrier as well, even though the phase would be over.
This case is also "supported".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-06 18:21:59 +03:00
Botond Dénes
18be2e9faf Merge "Remove gossiper->snitch kicking" from Pavel Emelyanov
"
Gossiper calls snitch->gossiper_starting() when being enabled. This
generates a dependency loop -- snitch needs gossiper to gossip its
states and get DC/RACK, gossiper needs snitch to do this kick.

This set removes this notification. The new approach is to kick the
snitch to gossip its states in the same places where gossiper is
enabled() so that only the snitch->gossiper dependency remains.

As a side effect the set ditches a bunch of references to global
snitch instance.

tests: unit(dev)
"

* 'br-snitch-gossiper-starting' of https://github.com/xemul/scylla:
  snitch: Remove gossiper_starting()
  snitch: Remove gossip_snitch_info()
  property-file snitch: Re-gossip states with the help of .get_app_states()
  property-file snitch: Reload state in .start()
  ec2 multi-region snitch: Register helper in .start()
  snitch, storage service: Gossip snitch info once
  snitch: Introduce get_app_states() method
  property-file snitch: Use _my_distributed to re-shard
  storage service: Shuffle snitch name gossiping
2022-04-06 17:41:36 +03:00
Piotr Sarna
2683b54402 Merge 'CQL3: Optional FINALFUNC and INITCOND for UDA' from Michał Jadwiszczak
Makes final function and initial condition to be optional while
creating UDA. No final function means UDA returns final state
and default initial condition is `null`.
Both items were optional in cql's grammar but they were treated as required in code.

Additionally I've added check if state function returns state.

Fixes #10324

Closes #10331

* github.com:scylladb/scylla:
  CQL3: check sfunc return type in UDA
  cql-pytest: UDA no final_func/initcond tests
  cql3: allow no final_func and no initcond in UDA
2022-04-06 16:04:47 +02:00
Michael Livshin
a90e02c302 skeleton_reader: inherit from flat_mutation_reader_v2::impl
(completely mechanical change)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220406122912.2248111-1-michael.livshin@scylladb.com>
2022-04-06 16:55:54 +03:00
Michael Livshin
6001a0fef1 multi_partition_reader: inherit from flat_mutation_reader_v2::impl
(completely mechanical change)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220406122122.2246058-1-michael.livshin@scylladb.com>
2022-04-06 16:55:07 +03:00
Michał Sala
28970389bc forward_service: uncoroutinize dispatch method
Done to mitigate potential misscompilations.
2022-04-06 15:01:31 +02:00
Michał Sala
edc32a7118 forward_service: uncoroutinize retrying_dispatcher
Done to mitigate potential misscompilations.
2022-04-06 14:52:59 +02:00
Michał Sala
59ff51c824 forward_service: rety a failed forwarder call
Failed-to-forward sub-queries will be executed locally (on a
super-coordinator). This local execution is meant as a fallback for
forward_requests that could not be sent to its destined coordinator
(e.g. due gossiper not reacting fast enough). Local execution was chosen
as the safest one - it does not require sending data to another
coordinator.
2022-04-06 14:44:55 +02:00
Benny Halevy
17358ac2a0 cmake: CMakeLists.txt: rename flat_mutation_reader.cc to readers/mutation_readers.cc
It was moved in 31d84a254c00b36dc2576e06ee288e28a13238195.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220406110512.3731011-3-bhalevy@scylladb.com>
2022-04-06 14:10:34 +03:00
Benny Halevy
4b3d0643a8 cmake: CMakeLists.txt: remove conncetion_notifier.cc
It was removed in 3aa05f7f03.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220406110512.3731011-2-bhalevy@scylladb.com>
2022-04-06 14:10:33 +03:00
Benny Halevy
8d95e12ecd cmake: CMakeLists.txt: update source paths
Those were moved to subdirectories.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220406110512.3731011-1-bhalevy@scylladb.com>
2022-04-06 14:10:32 +03:00
Avi Kivity
82733aeadb Merge 'Perf: Add extended template version of timed_perf + use in CL perf' from Calle Wilund
Adds sub-template for time_parallel with templated result type + optional per-iteration post-process func. Idea is that Res may be a subtype of perf_result, with additional stats, initiated on init, and post-process  function can fix up and apply stats -> we can add stats to result.

Then uses this mighty construct to add some IO stats to CL perf.

Closes #10334

* github.com:scylladb/scylla:
  perf_commitlog: Add bytes + bytes written stats
  perf: Add aio_writes mixin for perf_results
  test/perf/perf.hh: Make templated version of test routine to allow extended stats
2022-04-06 12:52:53 +03:00
Nadav Har'El
0f3cd6ad18 test/cql-pytest: fix fails_without_raft tests on Cassandra
We had a Python typo ("false" instead of "False") which prevented
tests with the fails_without_raft marker for running on Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405170337.36321-1-nyh@scylladb.com>
2022-04-06 11:20:25 +03:00
Jadw1
b560286ffe CQL3: check sfunc return type in UDA
Thre return type of state function is now checked while creating UDA.
Appropriate test added to cql-pytest.
2022-04-06 09:25:17 +02:00
Jadw1
977d6ac8b0 cql-pytest: UDA no final_func/initcond tests
Cql-pytests to check if UDA works properly without final function
or initial condition.
2022-04-06 09:25:12 +02:00
Jadw1
c921efd1b3 cql3: allow no final_func and no initcond in UDA
Makes final function and initial condition to be optional while
creating UDA. No final function means UDA returns final state
and defeult initial condition is `null`.

Fixes: #10324
2022-04-06 09:08:50 +02:00
Kamil Braun
424411ee5f test: raft: randomized_nemesis_test: enable entry forwarding
The test will now, with probability 1/2, enable forwarding of entries by
followers to leaders. This is possible thanks to the new abort_source&
APIs which we use to ensure that no operations are running on servers
before we destroy them.
2022-04-05 19:29:26 +02:00
Nadav Har'El
cfe04e6437 test/cql-pytest: nicer error message if a test can't find nodetool
When testing Scylla, cql-pytest does *not* need an external nodetool
command - it uses the REST API instead because it is much faster and
there is no need to install anything. However, if cql-pytest is run
against Cassandra, the tests do want to use the "nodetool" utility and
want to know what it is. The tests use either the NODETOOL environment
variable, or if that doesn't exist, look for "nodetool" in the path.

If nodetool wasn't found in that way, before this patch, we got an ugly
error message with long irrelevant Python backtraces. It wasn't easy
to understand that what actually happened was that the user forgot
to set the NODETOOL environment variable.

This patch cleans up this error handling. Now, if nodetool cannot be
found, every test that tries to run nodetool will report just a one-
line error message, clearly explaining what went wrong and how to
fix it:

        Error: Can't find nodetool. Please set the NODETOOL
        environment variable to the path of the nodetool utility.

To reiterate, when testing Scylla, nodetool is *not* needed even after
this patch. These errors will not happen even if you don't have the
nodetool utility. You only need nodetool if you plan to test Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220405171835.43992-1-nyh@scylladb.com>
2022-04-05 20:29:02 +03:00
Kamil Braun
f31c61c7c9 test: raft: randomized_nemesis_test: increase logging level on some rare operations
Increase the logging level on the few operations which happen at the end
of the test but make debugging a bit easier if the test hangs for some
reason.
2022-04-05 19:19:59 +02:00
Kamil Braun
ad3141d3e0 raft: server: translate abort_requested_exception to raft::request_aborted
The `wait_for_leader` function would throw a low-level
`abort_requested_aborted` exception from seastar::shared_promise.
Translate it to the high-level raft::request_aborted so we can reduce
the number of different exception types which cross the Raft API
boundary.

Also, add comments on Raft API functions about the exception thrown when
requests are aborted.
2022-04-05 19:18:53 +02:00
Kamil Braun
7da586b912 raft: fsm: when stopping, become follower to reject new requests
After enabling add_entry forwarding in randomized_nemesis_test, the test
would sometimes hang on _rpc->abort() call due to add_entry messages
from followers which waited on log_limiter_semaphore on the leader
preventing _rpc from finishing the abort; the log_limter_semaphore would
not get unblocked because the part of the server was already stopped.

Prevent log_limiter_semaphore from being waited on when stopping the
server by becoming a follower in fsm::stop.
2022-04-05 19:11:44 +02:00
Calle Wilund
af28fb6d94 perf_commitlog: Add bytes + bytes written stats
Used extended perf_result used with aio_writes + aio_write_bytes to
include some IO stats for the benchmark.
2022-04-05 13:43:57 +00:00
Calle Wilund
5b60a6cf7c perf: Add aio_writes mixin for perf_results
Can be used with time_parallel_ex. Adds measurements for aio writes/aio written bytes.
2022-04-05 13:42:36 +00:00
Calle Wilund
12ab34a3d9 test/perf/perf.hh: Make templated version of test routine to allow extended stats
Adds sub-template for time_parallel with templated result type + optional
per-iteration post-process func. Idea is that Res may be a subtype of
perf_result, with additional stats, initiated on init, and post-process
function can fix up and apply stats -> we can add stats to result.
2022-04-05 13:30:42 +00:00
Avi Kivity
0d5fd526a5 Merge "tools/scylla-sstable alternative schema load method for system tables" from Botond
"
Examining sstables of system tables is quite a common task. Having to
dump the schemas of such tables into a schema.cql is annoying knowing
that these schemas are readily available in scylla, as they are
hardcoded. This mini-series adds a method to make use of this fact, by
adding a new option: `--system-schema`, which takes the name of a system
table and looks up its schema.

Tests: unit(dev)
"

* 'scylla-sstable-system-schema/v1' of https://github.com/denesb/scylla:
  tools/scylla-sstable: add alternative schema load method for system tables
  tools/schema_loader: add load_system_schema()
  db/system_distributed_keyspace: add all tables methods
  tools/scylla-sstable: reorganize main help text
2022-04-05 15:48:29 +03:00
Avi Kivity
6cfc1d6f6a Update seastar submodule
* seastar 798ec50701...2a2a13058e (2):
  > condition_variable: Add "has_waiters()" accessor + test
  > Merge "RPC tester" from Pavel E
2022-04-05 13:47:51 +03:00
Gleb Natapov
7bf557332f storage_service: remove maybe from maybe_start_sys_dist_ks
There is nothing "maybe" about it now.

Message-Id: <Ykv/bj8MvKh0UU23@scylladb.com>
2022-04-05 12:49:56 +03:00
Benny Halevy
abbf5de68c frozen_mutation: introduce consume method
Allowing to consume the frozen_mutation directly
to a stream rather than unfreezing it first
and then consuming the unfrozen mutation.

Streaming directly from the frozen_mutation
saves both cpu and memory, and will make it
easier to be made async as a follow, to allow
yielding, e.g. between rows.

This is used today only in to_data_query_result
which is invoked on the read-repair path.

Refs #10038
Fixes #10021

Test: unit(release)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220405055807.1834494-1-bhalevy@scylladb.com>
2022-04-05 10:51:21 +03:00
Nadav Har'El
67e0590bbc alternator: remove old TODO (with test verifying it)
We had an old TODO in the Alternator "Scan" operation code which
suggested that we may need to do something to limit the size of pages
when a row limit ("Limit") isn't given.

But we do already have a built-in limit on page sizes (1 MB),
so this TODO isn't needed and can be removed.

But I also wanted to make sure we have a test that this limit works:

We already had a test that this 1 MB limit works for a single-partition
Query (test_query.py::test_query_reverse_longish - tested both forward
and reversed queries). In this patch I add a similar test for a whole-
table Scan. It turns out that although page size is limited in this case
as well, it's not exactly 1 MB... For small tables can even reach 3 MB.
I consider this "good enough" and that we can drop the TODO, but also
opened issue #10327 to document this surprising (for me) finding.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220404145240.354198-1-nyh@scylladb.com>
2022-04-05 09:23:23 +03:00
Nadav Har'El
56936d3c16 test/alternator: add reproducers for scan of long string of tombstones
This patch adds two xfailing tests for issue #7933. That issue is about
what Scan or Query paging does when encountering a very long string of
consecutive tombstones (partition or row tombstones). Ideally, in that
case the scan could stop on one of these tombstones after already
processing too many. But as these two tests demonstrate, the scan can't
stop in the middle of a long string of tombstones - and as a result
retrieving a single page can take an unbounded amount of time, which is
wrong.

Currently the tests are marked `@veryslow` (they each take more than a
minute) because they each create a huge number of tombstones to
demonstrate a huge amount of work for a single page. When we fix
issue #7933 and have a much smaller limit on the number of tombstones
processed in a single page, we can hopefully make these tests much
shorter and remove the `@veryslow` tag. The `@veryslow` tags means
that although these tests can be used manually (with `--runveryslow`)
they will not yet be run as part of the usual regression tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220403070706.250147-1-nyh@scylladb.com>
2022-04-05 09:11:38 +03:00
Raphael S. Carvalho
840500fc4d compaction: Make cleanup for Leveled strategy bucket-aware
Bucket awareness in cleanup was introduced in a69d98c3d0.
STCS and TWCS already support it, and now LCS will receive it.

The goal of bucket awareness is to reduce writeamp in cleanup,
therefore reducing operation time. Additionally, garbage collection
becomes more efficient as shadowed data can now be potentially
compacted with the data that shadows it, assuming they're on
the same level.

The implementation for LCS is simple. Will reuse the procedure
for STCS for returning jobs in level 0. And one job will be
returned for each non-empty level > 0. What allows us to do it
is our incremental selection approach used in compaction,
that sets a limit on memory usage and disk space requirement.

Fixes #10097.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220331173417.211257-1-raphaelsc@scylladb.com>
2022-04-05 09:10:21 +03:00
Benny Halevy
2d80057617 range_tombstone_list: insert_from: correct rev.update range_tombstone in not overlapping case
2nd std::move(start) looks like a typo
in fe2fa3f20d.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220404124741.1775076-1-bhalevy@scylladb.com>
2022-04-04 22:26:29 +02:00
Tomasz Grabiec
0a3aba36e6 Merge 'range_tombstone_change_generator: flush: emit closing range_tombstone_change' from Benny Halevy
When the highest tombstone is open ended, we must
emit a closing range_tombstone_change at
position_in_partition::after_all_clustered_rows().

Since all consumers need to do it, implement the logic
in the range_tombstone_change_generator itself.

It turned out that mutation::consume doesn't do that,
hence this series, and 5a09e5234ef4e1ee673bc7fca481defbbb2c0384 in particular,
fix the issue.

Change 028b2a8cdfdc12721b2be23d175cbc756d2507de exposes the issue
by generating a richer set of random range_tombstone that include open-ended
range tombstones.

Fixes #10316

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10317

* github.com:scylladb/scylla:
  test: random_mutation_generator: make more interesting range tombstones
  reader: upgrading_consumer: let range_tombstone_change_generator emit last closing change
  range_tombstone_change_generator: flush: emit end_position when upper limit is after all clustered rows
  range_tombstone_change_generator: flush: use tri_compare rather than less
  range_tombstone_change_generator: flush: return early if empty
2022-04-04 19:07:45 +02:00
Michał Sala
e170961b4d forward_service: copy arguments/captured vars to local variables
Copying captured variables into local variables (that live in a
coroutine's frame) is a mitigation of suspected lifetime issues.
Arguments of forward_service::dispatch are also copied (to prevent
potential undefined behavior or miss-compilation triggered by
referencing the arguments in a capture list of a lambda that produces a
coroutine).
2022-04-04 16:58:08 +02:00
Benny Halevy
b3e2bbe5bd test: random_mutation_generator: make more interesting range tombstones
Include also singular prefix and semi-bounded range tombstones.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-04 17:34:49 +03:00
Piotr Grabowski
63fa5ac915 generic_server.hh: add missing include
Add missing include of "<list>" which caused compile errors on GCC:

In file included from generic_server.cc:9:
generic_server.hh:91:10: error: ‘list’ in namespace ‘std’ does not name a template type
   91 |     std::list<gentle_iterator> _gentle_iterators;
      |          ^~~~
generic_server.hh:19:1: note: ‘std::list’ is defined in header ‘<list>’; did you forget to ‘#include <list>’?
   18 | #include <seastar/net/tls.hh>
  +++ |+#include <list>
   19 |

Note that there are some GCC compilation problems still left apart from
this one.

Closes #10328
2022-04-04 17:31:55 +03:00
Lukasz Sojka
5727f196e3 Add big batch logs tests
Tests for warning and error lines in logfile when user executes
big batch (above preconfigured thresholds in scylla.yaml).

Signed-off-by: Lukasz Sojka <lukasz.sojka@scylladb.com>

Closes #10232
2022-04-04 17:25:13 +03:00
Takuya ASADA
f95a531407 docker: run scylla as root
Previous versions of Docker image runs scylla as root, but cb19048
accidently modified it to scylla user.
To keep compatibility we need to revert this to root.

Fixes #10261

Closes #10325
2022-04-04 17:25:13 +03:00
Pavel Emelyanov
9fdb49c86a Merge 'fix hang on shutdown while ddl query is running and there is no quorum' from Gleb
A node that runs DDL query while its cluster does not have a quorum
cannot be shutdown since the query is not abortable. The series makes it
abortable and also fixes the order in which components are shutdown to
avoid the deadlock.

* gleb/raft_shutdown_v4 of git@github.com:scylladb/scylla-dev.git:
  migration_manager: drain migration manager before stopping protocol servers on shutdown
  migration_manager: pass abort source to raft primitives
  storage_proxy: relax some read error reporting
2022-04-04 17:25:13 +03:00
Benny Halevy
002be743f6 reader: upgrading_consumer: let range_tombstone_change_generator emit last closing change
When flushing range tombstones up to
position_in_partition::after_all_clustered_rows(),
the range_tombstone_change_generator now emits
the closing range_tombstone_change, so there's
no need for the upgrading_consumer to do so too.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-04 17:00:53 +03:00
Benny Halevy
cd171f309c range_tombstone_change_generator: flush: emit end_position when upper limit is after all clustered rows
When the highest tombstone is open ended, we must
emit a closing range_tombstone_change at
position_in_partition::after_all_clustered_rows().

Since all consumers need to do it, implement the logic
int the range_tombstone_change_generator itself.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-04 17:00:53 +03:00
Benny Halevy
2c5a6b3894 range_tombstone_change_generator: flush: use tri_compare rather than less
less is already using tri_compare internally,
and we'll use tri_compare for equality in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-04 17:00:53 +03:00
Benny Halevy
18a80a98b8 range_tombstone_change_generator: flush: return early if empty
Optimize the common, empty case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-04 17:00:53 +03:00
Takuya ASADA
41edc045d9 docker: revert scylla-server.conf service name change
We changed supervisor service name at cb19048, but this breaks
compatibility with scylla-operator.
To fix the issue we need to revert the service name to previous one.

Fixes #10269

Closes #10323
2022-04-03 19:18:18 +03:00
Avi Kivity
538fabc05d Update tools/java submodule
* tools/java ac5d6c840d...9bc83b7a32 (1):
  > fix metadata printing for ME sstables
2022-04-03 18:33:04 +03:00
Alexey Kartashov
d86c3a8061 dist/docker: fix incorrect locale value
Docker build script contains an incorrect locale specification for LC_ALL setting,
this commit fixes that.

Fixes #10310

Closes #10321
2022-04-03 14:24:54 +03:00
David Garcia
934beb6e20 docs: update theme 1.2.1
Related issue scylladb/sphinx-scylladb-theme#395

ScyllaDB Sphinx Theme 1.2 is now released partying_face

We’ve added automatic checks for broken links and introduced numerous UI updates.

You can read more about all notable changes here.

Closes #10313
2022-04-03 13:45:07 +03:00
Nadav Har'El
0a67c87438 Update seastar submodule
* seastar 44389842...798ec507 (4):
  > CONTRIBUTING: update to note that pull requests are accepted (#1036)
  > semaphore: improve documentation of timeout and abort errors
  > condition_variable: fix cv.signal with active "when" wait would switch fiber
  > abortable_fifo: stop dereferencing null pointers

Fixes #10319 with "abortable_fifo: stop dereferencing null pointers".
2022-04-03 13:41:41 +03:00
Benny Halevy
5ca73019dd shard_reader_v2: do_fill_buffer: reserve buffer space ahead
To prevent unneeded reallocations, just reserve the
pre-known number of entries before pushing them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220402130847.625085-2-bhalevy@scylladb.com>
2022-04-03 11:28:32 +03:00
Benny Halevy
8ab57aa4ab shard_reader_v2: do_fill_buffer: maybe yield when copying result
Prevent a reactor stall with e.g. large number of range tombstones.

Fixes #10314

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220402130847.625085-1-bhalevy@scylladb.com>
2022-04-03 11:05:14 +03:00
Tomasz Grabiec
9c96a37143 Merge "raft: nemesis test: use abort_source for time-outs" from Kamil
When a Raft API call such as `add_entry`, `set_configuration` or
`modify_config` takes too long, we need to time-out. There was no way to
abort these calls previously so we would do that by discarding the futures.
Recently the APIs were extended with `abort_source` parameters. Use this.

Also improve debuggability if the functions throw an exception type that
we don't expect. Previously if they did, a cryptic assert would fail
somewhere deep in the generator code, making the problem hard to debug.

Also collect some statistics in the test about the number of successful
and failed ops. I used it to manually check whether there was a
difference in how often operations fail with using the out timeout
method and the new timeout method (there doesn't seem to be any).

* kbr/nemesis-abort-source:
  test: raft: randomized_nemesis_test: on timeout, abort calls instead of discarding them
  raft: server: translate semaphore_aborted to request_aborted
  test: raft: logical_timer: add abortable version of `sleep_until`
  test: raft: randomized_nemesis_test: collect statistics on successful and failed ops
2022-04-01 16:25:23 +02:00
Pavel Emelyanov
886a275192 Merge 'replica/table: remove v1 reader factory methods' from Botond
Only users are internal and tests.

Tests: unit(dev)

* replica-table-remove-make-reader-v1/v2 of github.com/denesb/scylla.git
  replica/table: remove v1 reader factory methods
  tests: move away from table::make_reader()
  replica/table: add short make_reader_v2() variant:
2022-04-01 13:57:10 +03:00
Botond Dénes
9338affb8e replica/table: remove v1 reader factory methods 2022-04-01 13:52:08 +03:00
Botond Dénes
c8ea0715e9 tests: move away from table::make_reader()
Use v2 equivalents instead.
2022-04-01 13:39:26 +03:00
Botond Dénes
5aa97ccf0d replica/table: add short make_reader_v2() variant: 2022-04-01 13:39:26 +03:00
Pavel Emelyanov
05a32328fc snitch: Remove gossiper_starting()
No longer used

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:09 +03:00
Pavel Emelyanov
41332e183a snitch: Remove gossip_snitch_info()
No longer in use

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:09 +03:00
Pavel Emelyanov
38b0ee9822 property-file snitch: Re-gossip states with the help of .get_app_states()
This is the last place that still uses gossip_snitch_info(). It
can be reworked to use the get_app_states(), then the former
helper can be removed.

Another motivation for this is to stop using the _gossiper_started
boolean from the base class. This, in turn, will allow to remove
the whole gossiper_starting() notification altogether.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:09 +03:00
Pavel Emelyanov
6f71baa472 property-file snitch: Reload state in .start()
In its .start() helper the property-file driver does everything but
registers the reconnectable helper (like the ec2 m.r. one from the
previous patch did). Similarly to ec2 m.r. snitch this one can also
register its helper in .start(), before gossiper_starting() is called.

One thing to care about in this driver is that some tests start this
snitch without starting gossiper, thus an extra protection against
not initialized gossiper is needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:09 +03:00
Pavel Emelyanov
2400c87e74 ec2 multi-region snitch: Register helper in .start()
This driver registers reconnectable helper in it gossiper_starting()
callback. It can be done earlier -- in the snitch .start() one, as
gossiper doesn't notify listeners until its started for real (event
its shardow round doesn't kick them).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:05 +03:00
Pavel Emelyanov
f9af6fb430 snitch, storage service: Gossip snitch info once
Nowadays snitch states are put into gossiper via .gossiper_starting()
call by gossiper. This, in turn, happens in two places -- on node
ring join code and on re-enabling gossiper via the API call.

The former can be performed by the ring joining code with the help of
recently introduced snitch.get_app_states() helper.

The latter call is in fact not needed. Re-gossiped are DC, RACK and
for some drivers the INTERNAL_IP states that don't change throughout
snitch lifetime and are preserved in the gossiper pre-loaded states.

Thus, once the snitch states are applied by storage service ring join
code, the respective states udpate can be removed from the snitch
gossiper_starting() implementations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:05 +03:00
Pavel Emelyanov
4853959903 snitch: Introduce get_app_states() method
This virtual method returns back the list of app states that snitch
drivers need to gossip around. The exact implementation copies the
gossip_snitch_info() logic of the respective drivers and is unused.
Next patches will make use of it (spoiler: the latter method will be
removed after that).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:05 +03:00
Pavel Emelyanov
028bb84b0f property-file snitch: Use _my_distributed to re-shard
The driver in question wants to execute some of its actions on shard 0
and it calls smp::invoke(0, ...) for this. The invoked lambda thus needs
to refer to global snitch instance.

There's nicer and shorter way of re-sharding for snith drivers -- the
sharded<snith_ptr>* _my_distributed field on the base class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:05 +03:00
Pavel Emelyanov
b8e876681d storage service: Shuffle snitch name gossiping
No functional changes, just have the local snitch reference in
the ring joining code. This simplifies next patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:05 +03:00
Botond Dénes
a325d3434a Merge "make_slicing_filtering_reader(): return flat mutation reader v2" from Michael Livshin
"
Tests: unit(dev)
"

* 'slicing-filtering-v2' of https://github.com/cmm/scylla:
  make_slicing_filtering_reader(): return flat mutation reader v2
  mutation_readers: refactor generic partition slicing logic
2022-04-01 11:08:25 +03:00
Nadav Har'El
758f8f01d7 test/alternator: turn REST API finding into a fixture
In test_tracing.py and util.py, we already have three duplicates of code
which looks for the Scylla REST API. We'll soon want to add even more uses
of this REST API, so it's good time to add a single fixture, "rest_api",
which can be use in all tests that need the Scylla REST API instead of
duplicating the same code.

A test using the "rest_api" fixture will be skipped if the server isn't
Scylla, or its port 10000 is not available or not responsive.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220331195337.64352-1-nyh@scylladb.com>
2022-04-01 10:51:59 +03:00
Raphael S. Carvalho
61c67105d2 compaction_manager: move internal stop functions into private namespace
They don't belong to public interface.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220331202255.237688-1-raphaelsc@scylladb.com>
2022-04-01 10:50:27 +03:00
Botond Dénes
e19cf6760a tools/scylla-sstable: add alternative schema load method for system tables
By providing its name via the new `--system-schema` option. The schema
will be loaded from the internal hardcoded definition.
2022-04-01 10:12:33 +03:00
Botond Dénes
095bb0d992 tools/schema_loader: add load_system_schema()
Allowing to load (or rather lookup) system schemas by name.
2022-04-01 10:10:31 +03:00
Botond Dénes
53b00ecefe db/system_distributed_keyspace: add all tables methods
Add methods to get the schema of all distributed and distribyted
everywhere tables respectively.
2022-04-01 10:10:31 +03:00
Botond Dénes
be788140ff tools/scylla-sstable: reorganize main help text
Currently the main help is a big wall of text. This makes it hard to
quickly jump to the section of interest. This patch reorganizes it into
clear sections, each with a title. Sections are now also ordered
according to the part they reference in the command-line.
This should make it easier for answers to questions regarding a certain
topic to be quickly found, without having to read a lot of text.
2022-04-01 10:10:31 +03:00
Michael Livshin
830aa041a8 make_slicing_filtering_reader(): return flat mutation reader v2
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-03-31 19:59:53 +03:00
Michael Livshin
aac51be0cc mutation_readers: refactor generic partition slicing logic
There are at least 1 actual and 1 potential users for it; this
change converts the existing one.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-03-31 19:59:53 +03:00
Avi Kivity
af07519928 Merge "Remove reader from mutations v1" from Botond
"
First migrate all users to the v2 variant, all of which are tests.
However, to be able to properly migrate all tests off it, a v2 variant
of the restricted reader is also needed. All restricted reader users are
then migrated to the freshly introduced v2 variant and the v1 variant is
removed.
Users include:
* replica::table::make_reader_v2()
* streaming_virtual_table::as_mutation_source()
* sstables::make_reader()
* tests

This allows us to get rid of a bunch of conversions on the query path,
which was mostly v2 already.

With a few tests we did kick the can down the road by wrapping the v2
reader in `downgrade_to_v1()`, but this series is long enough already.

Tests: unit(dev), unit(boost/flat_mutation_reader_test:debug)
"

* 'remove-reader-from-mutations-v1/v3' of https://github.com/denesb/scylla:
  readers: remove now unused v1 reader from mutations
  test: move away from v1 reader from mutations
  test/boost/mutation_reader_test: use fragment_scatterer
  test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh
  test/boost: mutation_fragment_test: refactor fragment_scatterer
  readers: remove now unused v1 reversing reader
  test/boost/flat_mutation_reader_test: convert to v2
  frozen_mutation: fragment_and_freeze(): convert to v2
  frozen_mutation: coroutinize fragment_and_freeze()
  readers: migrate away from v1 reversing reader
  db/virtual_table: use v2 variant of reversing and forwardable readers
  replica/table: use v2 variant of reversing reader
  sstables/sstable: remove unused make_crawling_reader_v1()
  sstables/sstable: remove make_reader_v1()
  readers: add v2 variant of reversing reader
  readers/reversing: remove FIXME
  readers: reader from mutations: use mutation's own schema when slicing
2022-03-31 13:29:11 +03:00
Avi Kivity
5fc093ad42 Merge 'wasm: manage memory using exports from the client' from Wojciech Mitros
This patch adds importing the `malloc` and `free` method from the wasm client, and using them for allocating wasm memory for UDF arguments and freeing its result. When the methods are not exported, the old behaviour is used instead. To make that possible, this patch also includes a fix to the usage of pages in wasm memory (methods `size` and `grow`) that were used for allocating memory for arguments until now. (The source codes  for the examples didn't work on my machine in their original form, so when updating paging I've also added small unrelated modifications)

Tests:unit(dev)

Closes #10234

* github.com:scylladb/scylla:
  wasm: add wasm ABI version 2
  wasm: add WASI handling
  wasm: add documentation
  wasm: add _scylla_abi export for specifying abi for wasm udfs
  wasm: update ABI for passing parameters to wasm UDFs
  wasm: move common code to a separate function
  wasm: use wasm pages for wasm memory
2022-03-31 12:33:55 +03:00
Wojciech Mitros
56c5459c50 wasm: add null handling for wasm udf
As the name suggests, for UDFs defined as RETURNS NULL ON NULL
INPUT, we sometimes want to return nulls. However, currently
we do not return nulls. Instead, we fail on the null check in
init_arg_visitor. Fix by adding null handling before passing
arguments, same as in lua.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #10298
2022-03-31 12:27:38 +03:00
Piotr Sarna
c0fd53a9d7 cql3: fix qualifying restrictions with IN for indexing
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which lead to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.

Fixes #10300

Closes #10302
2022-03-31 11:04:17 +03:00
Nadav Har'El
3eaafbbdf7 test/cql-pytest: mark a test failing on Cassandra with cassandra_bug
We have a test for the LIKE restriction with ALLOW FILTERING.

Cassandra does not yet support this combination (it only supports LIKE
with SASI indexes), so this test fails on Cassandra, suggesting either
the test is wrong, or Cassandra is wrong. In this case, Cassandra is
wrong - they have an issue requesting this to be fixed -
https://issues.apache.org/jira/browse/CASSANDRA-17198, and even an
implementation which is being reviewed.

So let's mark this test with "cassandra_bug", meaning it is expected
to fail (xfail) when running against Cassandra. When CASSANDRA-17198
is fixed, we can remove the cassandra_bug mark.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220330211734.4103691-1-nyh@scylladb.com>
2022-03-31 09:47:44 +02:00
Botond Dénes
7d49afe78b readers: remove now unused v1 reader from mutations 2022-03-31 10:36:26 +03:00
Botond Dénes
fd69add579 test: move away from v1 reader from mutations
Use the v2 variant instead.
2022-03-31 10:36:23 +03:00
Botond Dénes
2e00ff314d test/boost/mutation_reader_test: use fragment_scatterer
Instead of the open-coded equivalent the test currently has.
2022-03-31 10:25:45 +03:00
Botond Dénes
feecc19d5b test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh
We want to use it in test/boost/mutation_reader_test.cc too.
2022-03-31 10:25:45 +03:00
Botond Dénes
226f01162e test/boost: mutation_fragment_test: refactor fragment_scatterer
Instead of taking an output parameter in the constructor, take just the
desired number of mutations to build and return the mutation list from
`consume_end_of_stream()`.
2022-03-31 10:25:45 +03:00
Botond Dénes
b8f0ab3b98 readers: remove now unused v1 reversing reader 2022-03-31 10:04:45 +03:00
Botond Dénes
56e3c6add6 test/boost/flat_mutation_reader_test: convert to v2 2022-03-31 10:04:29 +03:00
Gleb Natapov
c17a03727c migration_manager: drain migration manager before stopping protocol servers on shutdown
When protocol servers are stopping they wait for all active queries to
complete, but DDL queries use migration manager internally, so if they
hang there protocol servers will not be able to stop since migration
manager is drained afterwords. The patch moves the migration manager
draining before protocol servers stoppage.

Since after the patch migration managers is drained before messaging
service is stopped we need to make sure that no rpc request triggers new
migration manager requests. We do it by making sure that any attempt to
issue such a request after aborted will return abort_requested_exception.
2022-03-31 10:00:29 +03:00
Gleb Natapov
e52abdca30 migration_manager: pass abort source to raft primitives
We want to be able to abort raft operations on migration manager drain.
MM already has an abort source that is signaled on drain, so all that is
left is to pass it to raft calls.
2022-03-31 10:00:29 +03:00
Gleb Natapov
1409b885a0 storage_proxy: relax some read error reporting
Silence request_aborted read error since it is expected to happen suring
shutdown and report remote rpc errors as warnings instead of errors since
if they are indeed server they should be handled by the rpc client, but
OTOH some non critical errors do expect to happen during shutdown.
2022-03-31 10:00:29 +03:00
Botond Dénes
2e634883d9 frozen_mutation: fragment_and_freeze(): convert to v2 2022-03-31 09:57:48 +03:00
Botond Dénes
7f3986ed1a frozen_mutation: coroutinize fragment_and_freeze() 2022-03-31 09:57:48 +03:00
Botond Dénes
fc27b6b7ed readers: migrate away from v1 reversing reader
The only internal user is the v1 make reader from mutations, we use a
downgrade/upgrade to be able to use the v2 reversing reader there. This
is ugly but the v1 reader from mutations is going away soon too, so not
a real problem.
2022-03-31 09:57:48 +03:00
Botond Dénes
eb125b98eb db/virtual_table: use v2 variant of reversing and forwardable readers 2022-03-31 09:57:48 +03:00
Botond Dénes
c10d7bf9f8 replica/table: use v2 variant of reversing reader 2022-03-31 09:57:48 +03:00
Botond Dénes
3b67c25e49 sstables/sstable: remove unused make_crawling_reader_v1() 2022-03-31 09:57:48 +03:00
Botond Dénes
219cb881a4 sstables/sstable: remove make_reader_v1()
No external users, only used internally, by make_reader(), who delegates
cases currently unsupported by v2 to it. The code needed from
make_reader_v1() is inlined into make_reader() and the former is
removed.
2022-03-31 09:57:48 +03:00
Botond Dénes
470dc0d013 readers: add v2 variant of reversing reader
The v2 format allows for a much simpler reversing mechanism since
clustering fragments can simply be reversed as they are read. Fragments
are directly pushed in the reader's buffer eliminating a separate move
phase.
Existing reverse reader unit tests are converted to test the v2 one.
2022-03-31 09:57:48 +03:00
Botond Dénes
06bea6ae6b readers/reversing: remove FIXME 2022-03-31 09:57:48 +03:00
Botond Dénes
c38a7963b1 readers: reader from mutations: use mutation's own schema when slicing
Instead of the schema that is used for the reader. The schema of
individual mutations might be different (albeit compatible) and in debug
mode this can trigger an assert in mutation partition.
2022-03-31 09:57:48 +03:00
Piotr Sarna
4a3890b79c cql3: disambiguate an error message for indexes and IN clause
A user pointed out a misleading error message produced when
an indexed column is queried along with an IN relation
on the partition key. The message suggests that such queries are
not supported, but they are supported - just without indexing.
In particular, with ALLOW FILTERING, such queries are perfectly
fine.

Closes #10299
2022-03-31 07:04:00 +03:00
Piotr Sarna
85e95a8cc3 cql3: fix misleading error message for service level timeouts
The error message incorrectly stated that the timeout value cannot
be longer than 24h, but it can - the actual restriction is that the
value cannot be expressed in units like days or months, which was done
in order to significantly simplify the parsing routines (and the fact
that timeouts counted in days are not expected to be common).

Fixes #10286

Closes #10294
2022-03-31 07:04:00 +03:00
Raphael S. Carvalho
20a1ef3bee compaction_backlog_tracker: Raise logging level to error when disabling tracker on exception
If exception is caught while updating backlog tracker, the backlog
tracker will be disabled for the underlying table, potentially
causing compaction to fall behind.
That being said, let's raise the log level to error, to give it
its due importance and allow tests to detect the problem.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220330151421.49054-1-raphaelsc@scylladb.com>
2022-03-31 07:04:00 +03:00
Wojciech Mitros
8a9d55d3a1 wasm: add wasm ABI version 2
Because the only available version of wasm ABI did not allow
freeing any allocated memory, a new version of the ABI is
introduced. In this version, the host is required to export
_scylla_malloc and _scylla_free methods, which are later used
for the memory management.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-03-30 20:49:35 +02:00
Wojciech Mitros
5b60cd1eab wasm: add WASI handling
One of the issues that comes with compiling programs to WebAssembly
is the lack of a default implementation of a memory allocator. As
a result, the only available solutions to the need of memory allocation
are growing the wasm memory for each new allocated segment, or
implementing one's own memory allocator. To avoid both of these
approaches, for many languages, the user may compile a program to
a WASI target. By doing so, the compiler adds default implementations
of malloc and free methods, and the user can use them for dynamic
memory management.
This patch enables executing programs compiled with WASI by enabling
it in the wasmtime runtime.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-03-30 19:44:36 +02:00
Wojciech Mitros
1f81e05d52 wasm: add documentation
The ABI of wasm UDFs changed since the last time the documentation
was written, so it's being update in this patch.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-03-30 19:44:30 +02:00
Pavel Solodovnikov
3e2a42fcf0 db: system_keyspace: add bootstrap_needed() method
The method checks that bootstrap state is equal to
`NEEDS_BOOTSTRAP`. This will be used later to check
if we are in the state of "fresh" start (i.e. starting
a node from scratch).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-30 20:41:35 +03:00
Pavel Solodovnikov
c69c47b2bb db: system_keyspace: mark getter methods for bootstrap state as "const"
The `bootstrap_complete()`, `bootstrap_in_progress()`,
`was_decommissioned()` and `get_bootstrap_state()` don't
modify internal state, so eligible to be marked as `const`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-30 20:40:02 +03:00
Wojciech Mitros
93b082e8d9 wasm: add _scylla_abi export for specifying abi for wasm udfs
Different languages may require different ABIs for passing
parameters, etc. This patch adds a requirement for all wasm
UDFs to export an _scylla_abi symbol, that is an 32-bit integer
with a value specifying the ABI version.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-03-30 19:37:11 +02:00
Wojciech Mitros
a7ee3ccf52 wasm: update ABI for passing parameters to wasm UDFs
WebAssembly uses 32-bit address space, while also
having 64-bit integers as it native types. As a result,
when passing size of an object in memory and its address,
it can be combined into one 64-bit value. As a bonus,
if the object is null, we can signal it by passing -1 as
its size.

This patch implements handling of this new ABI and adjusts
expamples in test_wasm.py.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-03-30 17:13:25 +02:00
Wojciech Mitros
7fd81e6dae wasm: move common code to a separate function
Both init_nullable_arg_visitor and, in case
of abstract_type, init_arg_visitor were
the same method with one difference. The
common part was moved to init_abstract_arg,
and the difference remained in the operator()
method.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-03-30 17:13:22 +02:00
Wojciech Mitros
62761a7cf3 wasm: use wasm pages for wasm memory
The memory.grow and memory.size wasm methods return
the memory size in pages, and memory.size takes its
argument in the number of pages. A WebAssembly page
has a size of 64KiB, so during memory allocation
we have to divide our desired size in bytes by page
size and round up. Similarly, when reading memory
size we need to multiply the result by 64KiB to
get the size in bytes.

The change affects current naive allocator for
arguments when calling wasm UDFs and the examples
in wasm_test.py - both commented code and compiled
wasm in text representation.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-03-30 17:13:13 +02:00
Avi Kivity
caa4cddebf Update tools/java submodule
* tools/java b1e09c8b8f...ac5d6c840d (1):
  > Merge 'Sync with Cassandra 3.11.12' from Michael Livshin
2022-03-30 17:03:55 +03:00
Avi Kivity
9cb43f7029 Merge "Split mutation_reader.hh" from Botond
"
Following up on the recent split of flat_mutation_reader.hh and friends,
this series applies the same treatment to mutation_reader.hh. Each
readers gets its own header, while definitions are moved into
readers/mutation_readers.cc. There are two exceptions to this: the
combined and multishard reader families each make up more than 1K SLOC,
so these get their own source file, to avoid a SLOC explosion in
mutation_readers.cc.

This series is almost completely mechanical, moving code and patching
inclusion sites.

Tests: unit(dev)
"

* 'mutation-reader-hh-split/v1' of https://github.com/denesb/scylla:
  readers: merge fmr_logger and mrlog
  tree: remove now empty mutation_reader.{hh,cc}
  tree: remove mutation_reader.hh include
  mutation_reader: move mrlog (mutation reader logger) to readers/
  mutation_reader: move compacting reader into readers/
  mutation_reader: move queue reader to readers/
  mutation_reader: move mutation source into readers/
  mutation_reader: move slicing filtering reader into readers/
  mutation_reader: move filtering reader into readers/
  readers: move multishard reader & friends to reader/multishard.cc
  mutation_reader: remove unused remote_fill_buffer_result
  readers: move combined reader into readers/
2022-03-30 16:33:20 +03:00
Botond Dénes
3289df2e74 readers: merge fmr_logger and mrlog
By folding the former to the latter. Now that all the readers are nicely
co-located in the same folder, no point in having two distinct logger
for them.
2022-03-30 15:44:08 +03:00
Botond Dénes
c9e30b9a6c tree: remove now empty mutation_reader.{hh,cc} 2022-03-30 15:42:51 +03:00
Botond Dénes
b029bd3db7 tree: remove mutation_reader.hh include
In most files it was unused. We should move these to the patch which
moved out the last interesting reader from mutation_reader.hh (and added
the corresponding new header include) but its probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide reader
concurrency semaphore and some other misc reader related definitions.
2022-03-30 15:42:51 +03:00
Botond Dénes
20c9e556e1 mutation_reader: move mrlog (mutation reader logger) to readers/ 2022-03-30 15:42:51 +03:00
Botond Dénes
b7954138ac mutation_reader: move compacting reader into readers/ 2022-03-30 15:42:51 +03:00
Botond Dénes
11c378a175 mutation_reader: move queue reader to readers/ 2022-03-30 15:42:51 +03:00
Botond Dénes
11109f4c45 mutation_reader: move mutation source into readers/ 2022-03-30 15:42:51 +03:00
Botond Dénes
7eae66efe0 mutation_reader: move slicing filtering reader into readers/
Only the declaration has to be moved, the definition is already in
readers/mutation_readers.cc.
2022-03-30 15:42:51 +03:00
Botond Dénes
f24f2f726a mutation_reader: move filtering reader into readers/ 2022-03-30 15:42:51 +03:00
Botond Dénes
d0ea895671 readers: move multishard reader & friends to reader/multishard.cc
Since the multishard reader family weighs more than 1K SLOC, it gets
its own .cc file.
2022-03-30 15:42:51 +03:00
Botond Dénes
3505ef8a49 mutation_reader: remove unused remote_fill_buffer_result 2022-03-30 15:42:51 +03:00
Botond Dénes
f8015d9c26 readers: move combined reader into readers/
Since the combined reader family weighs more than 1K SLOC, it gets its
own .cc file.
2022-03-30 15:42:51 +03:00
Botond Dénes
0c3d4091a4 Merge "Make TWCS' cleanup bucket aware" from Raphael S. Carvalho
"
Quoting patch 3/4:
"This continues the work in a69d98c3d0,
by implementing the cleanup method in TWCS to make it bucket aware.
Till now, the default impl was used which cleanups on file at a
time, starting from the smallest.

The cleanup strategy for TWCS is simple. It's simply calling the
size tiered cleanup method for each bucket, so there will be
one job for each tier in each window.

The next strategies to receive this improvement are LCS and ICS
(the latter one being only available in enterprise).

Refs #10097."

** Simply put, the goal is to reduce writeamp when performing cleanup
on a TWCS table, therefore reducing the operation time. **

tests: unit(dev).
"

* 'twcs_cleanup_bucket_aware/v1' of https://github.com/raphaelsc/scylla:
  tests: sstable_compaction_test: Add test for TWCS' bucket-aware cleanup
  compaction: TWCS: Implement cleanup method for bucket awareness
  compaction: TWCS: change get_buckets() signature to work with const qualified functions
  compaction_strategy: get_cleanup_compaction_jobs: accept candidates by value
2022-03-30 11:45:28 +03:00
Pavel Emelyanov
edd0481b38 Merge 'Scrub compaction: prevent mishandling of range tombstone changes' from Botond
With v2 having individual bounds of range tombstone as separate
fragments, out-of-order fragments become more difficult to handle,
especially in the presence of active range tombstone.
Scrub in both SKIP and SEGREGATE mode closes the partition on
seeing the first invalid fragment (SEGREAGE re-opens it immediately).
If there is an active range tombstone, scrub now also has to take care
of closing said tombstone when closing the partition. In a normal stream
it could just use the last position-in-partition to create a closing
bound. But when out-of-order fragments are on the table this is not
possible: the closing bound may be found later in the stream, with a
position smaller than that of the current position-in-partition.
To prevent extending range tombstone changes like that, Scrub now aborts
the compaction on the first invalid fragment seen *inside* an active
range tombstone.
Fixing a v2 stream with range tombstone changes is definitely possible,
but non-trivial, so we defer it until there is demand for it.

This series also makes the mutation fragment stream validator check for
open range tombstones on partition-end and adds a comprehensive
test-suite for the validator.

Fixes: #10168

Tests: unit(dev)

* scrub-rtc-handling-fix/v2 of github.com/denesb/scylla.git:
  compaction/compaction: abort scrub when attempting to rectify stream with active tombstone
  test/boost/mutation_test: add test for mutation_fragment_stream_validator
  mutation_fragment_stream_validator: validate range tombstone changes
2022-03-30 11:42:52 +03:00
Pavel Emelyanov
ba6d2ecc6f api: Remove unused argument from set_tables_autocompaction helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220329093113.5953-1-xemul@scylladb.com>
2022-03-30 11:42:52 +03:00
Raphael S. Carvalho
177a8e8259 compaction_manager: allow sstable to be moved into rewrite_sstable()
Caller was already trying to move sstable, but rewrite_sstable() signature
was incorrect.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220329022149.250655-1-raphaelsc@scylladb.com>
2022-03-30 11:42:52 +03:00
Kamil Braun
475a408792 test: raft: randomized_nemesis_test: on timeout, abort calls instead of discarding them
When a Raft API call such as `add_entry`, `set_configuration` or
`modify_config` takes too long, we need to time-out. There was no way to
abort these calls previously so we would do that by discarding the futures.
Recently the APIs were extended with `abort_source` parameters. Use this.

Also improve debuggability if the functions throw an exception type that
we don't expect. Previously if they did, a cryptic assert would fail
somewhere deep in the generator code, making the problem hard to debug.
2022-03-29 18:25:26 +02:00
Kamil Braun
0f0d75fd66 raft: server: translate semaphore_aborted to request_aborted 2022-03-29 15:10:29 +02:00
Kamil Braun
8d4dd53c25 test: raft: logical_timer: add abortable version of sleep_until
`sleep_until(time_point tp, abort_source& as)` will sleep until `tp`, or
until `as.request_abort()` is called, whatever comes first.
2022-03-29 15:10:29 +02:00
Kamil Braun
6f9dcd784c test: raft: randomized_nemesis_test: collect statistics on successful and failed ops 2022-03-29 15:10:29 +02:00
Raphael S. Carvalho
a1fd9c1ee8 tests: sstable_compaction_test: Add test for TWCS' bucket-aware cleanup
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-29 09:52:11 -03:00
Raphael S. Carvalho
568bb40127 compaction: TWCS: Implement cleanup method for bucket awareness
This continues the work in a69d98c3d0,
by implementing the cleanup method in TWCS to make it bucket aware.
Till now, the default impl was used which cleanups on file at a
time, starting from the smallest.

The cleanup strategy for TWCS is simple. It's simply calling the
size tiered cleanup method for each bucket, so there will be
one job for each tier in each window.

The next strategies to receive this improvement are LCS and ICS
(the latter one being only available in enterprise).

Refs #10097.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-29 09:52:06 -03:00
Raphael S. Carvalho
8f4c04c38a compaction: TWCS: change get_buckets() signature to work with const qualified functions
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-29 09:49:14 -03:00
Raphael S. Carvalho
2a9bfa3e3f compaction_strategy: get_cleanup_compaction_jobs: accept candidates by value
Then caller can decide whether to copy or move candidate set into the
function. cleanup_sstables_compaction_task can move candidates as
it's no longer needed once it retrieves all descriptors.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-29 09:49:13 -03:00
Botond Dénes
2ae0e0093e compaction/compaction: abort scrub when attempting to rectify stream with active tombstone 2022-03-29 13:19:05 +03:00
Botond Dénes
316ff9eb86 test/boost/mutation_test: add test for mutation_fragment_stream_validator 2022-03-29 13:19:05 +03:00
Botond Dénes
ef90783007 mutation_fragment_stream_validator: validate range tombstone changes 2022-03-29 13:19:05 +03:00
Botond Dénes
bb4472d5ab Update seastar submodule
* seastar 7a05527b...44389842 (1):
  > core/circular_buffer: fix iterator comparison for wrapped-around indexes
2022-03-29 09:08:25 +03:00
Takuya ASADA
59adf05951 scylla_sysconfig_setup: avoid perse error on perftune.py --get-cpu-mask
Currently, we just passes entire output of perftune.py when getting CPU
mask from the script, but it may cause parse error since the script may
also print warning message.

To avoid that, we need to extract CPU mask from the output.

Fixes #10082

Closes #10107
2022-03-28 16:31:14 +03:00
Nadav Har'El
024ecd45a2 test/rest_api: add reproducer for REST API JSON encoding bug
This patch adds a reproducer for the JSON encoding in issue #9061.
The bug was already fixed (it was a Seastar bug, and Seastar was
updated in commit 5d4213e1b8), but
I verified that the test fails before that patch - and passes today.

It is useful to have such a test for regressions, as well as for
testing backports.

Unfortunately, the test isn't pretty. The test uses the toppartitions
API, which instead of having a "start" and "stop" request has a single
synchronous "start for a given duration" request, and we need to run
it with some fixed duration (we took 1 second), and in parallel, one
request.

Refs #9061.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220323180855.3307931-1-nyh@scylladb.com>
2022-03-28 15:28:45 +03:00
Nadav Har'El
7f89c8b3e3 alternator: clean error shutdown in case of TLS misconfigration
The way our boot-time service "controllers" are written, if a
controller's start_server() finds an error and throws, it cannot
the caller (main.cc) to call stop_server(), and must clean up
resources already created (e.g., sharded services) before returning
or risk crashes on assertion failures.

This patch fixes such a mistake in Alternator's initialization.
As noted in issue #10025, if the Alternator TLS configuration is
broken - especially the certificate or key files are missing -
Scylla would crash on an assertion failure, instead of reporting
the error as expected. Before this patch such a misconfiguration
will result in the unintelligible:

<alternator::server>::~sharded() [Service = alternator::server]:
Assertion `_instances.empty()' failed. Aborting on shard 0.

After this patch we get the right error message:

ERROR 2022-03-21 15:25:07,553 [shard 0] init - Startup failed:
std::_Nested_exception<std::runtime_error> (Failed to set up Alternator
TLS credentials): std::_Nested_exception<std::runtime_error> (Could not
read certificate file conf/scylla.crt): std::filesystem::__cxx11::
filesystem_error (error system:2, filesystem error: open failed:
No such file or directory [conf/scylla.crt])

Arguably this error message is a bit ugly, so I opened
https://github.com/scylladb/seastar/issues/1029, but at least it says
exactly what the error is.

Fixes #10025

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220321133323.3150939-1-nyh@scylladb.com>
2022-03-28 15:26:42 +03:00
Avi Kivity
e88dc51894 Update seastar submodule
* seastar 3b4e42d74e...7a05527b6f (4):
  > Revert "core/circular_buffer: fix iterator comparison for wrapped-around indexes"
  > seastar-cpu-map.sh: switch from pidof to pgrep
Fixes #10238
  > abort-source: mark `request_abort()` noexcept
  > core/circular_buffer: fix iterator comparison for wrapped-around indexes
2022-03-28 12:37:10 +03:00
Nadav Har'El
d8c0680585 test/alternator: add regression test for old ALL_NEW bug
In commit 964500e47a, in the middle of
a larger series, I fixed a small Alternator bug that I found while working
on that series. The bug was that the ReturnValues=ALL_NEW feature moved out
the read previous_item, which breaks operations that need previous_item,
e.g., an ADD operation. Unfortunately, we never had a regression test for
this fix bug, so in this patch I add one.

This bug was re-discovered on an old branch by a user, at which point
I noticed that we don't have a test for it - so I want to add it now,
even though the bug itself is long gone from Scylla master.

I verified that the new test indeed fails on old versions of Scylla
before the aforementioned commit, and passes when backporting only that
commit.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220327074928.3608576-1-nyh@scylladb.com>
2022-03-28 08:40:28 +02:00
Benny Halevy
2325c566d9 memtable_list: futurize clear_and_add
Allow yielding to fix a reactor stall from table::clear.

Fixes #10281

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220327141259.213688-1-bhalevy@scylladb.com>
2022-03-27 17:25:43 +03:00
Avi Kivity
3c2271af52 Merge "De-globalize system keyspace local cache" from Pavel E
"
There's a static global sharded<local_cache> variable in system keyspace
the keeps several bits on board that other subsystems need to get from
the system keyspace, but what to have it in future<>-less manner.

Some time ago the system_keyspace became a classical sharded<> service
that references the qctx and the local cache. This set removes the global
cache variable and makes its instances be unique_ptr's sitting on the
system keyspace instances.

The biggest obstacle on this route is the local_host_id that was cached,
but at some point was copied onto db::config to simplify getting the value
from sstables manager (there's no system keyspace at hand there at all).
So the first thing this set does is removes the cached host_id and makes
all the users get it from the db::config.

(There's a BUG with config copy of host id  -- replace node doesn't
 update it. This set also fixes this place)

De-globalizing the cache is the prerequisite for untangling the snitch-
-messaging-gossiper-system_keyspace knot. Currently cache is initialized
too late -- when main calls system_keyspace.start() on all shards -- but
before this time messaging should already have access to it to store
its preferred IP mappings.

tests: unit(dev), dtest.simple_boot_shutdown(dev)
"

* 'br-trade-local-hostid-for-global-cache' of https://github.com/xemul/scylla:
  system_keyspace: Make set_local_host_id non-static
  system_keyspace: Make load_local_host_id non-static
  system_keyspace: Remove global cache instance
  system_keyspace: Make it peering service
  system_keyspace,snitch: Make load_dc_rack_info non-static
  system_keyspace,cdc,storage_service: Make bootstrap manipulations non-static
  system_keyspace: Coroutinize set_bootstrap_state
  gossiper: Add system keyspace dependency
  cdc_generation_service: Add system keyspace dependency
  system_keyspace: Remove local host id from local cache
  storage_service: Update config.host_id on replace
  storage_service: Indentation fix after previous patch
  storage_service: Coroutinize prepare_replacement_info()
  system_distributed_keyspace: Indentation fix after previous patch
  code,system_keyspace: Relax system_keyspace::load_local_host_id() usage
  code,system_keyspace: Remove system_keyspace::get_local_host_id()
2022-03-27 17:19:24 +03:00
Avi Kivity
f476bd3a80 Merge "tools: cut schema loader free of replica::database" from Botond
"
By way of having an implementation of `data_dictionary` and using that.
The schema loader only needs a database to parse cql3 statements, which
are all coordinator-side objects and hence been largely migrated to use
data dictionary instead.
A few hard-dependencies on replica:: objects were found and resolved:
* index::secondary_index_manager
* tombstone_gc

The former was migrated to use `data_dictionary::table` instead of
`replica::table`. This in turn requires disentangling
`replica::data_dictionary_impl` from `replica::database`, as currently
the former can only really be used by the latter.

What all of this achieves us is that we no longer have to instantiate a
`replica::database` object in `tools::load_schema()`. We want to use the
standard allocator in tools, which means they cannot use LSA memory at
all. Database on the other hand creates memtable and row-cache instances
so it had to go.

Refs: #9882

Tests: unit(dev, schema_loader_test:debug,
cql-pytest/test_tools.py:debug)
"

* 'tools-schema-loader-database-impl/v2' of https://github.com/denesb/scylla:
  tools/schema_loader: use own data dictionary impl
  tombstone_gc: switch to using data dictionary
  index/secondary_index_manager: switch to using data dictionary
  replica/table: add as_data_dictionary()
  replica: disentangle data_dictionary_impl from database
  replica: move data_dictionary_impl into own header
2022-03-27 17:01:05 +03:00
Pavel Emelyanov
baedf1e4de system_keyspace: Remove unused argument from maybe_write_in_user_memory
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220325152521.25215-1-xemul@scylladb.com>
2022-03-27 16:43:53 +03:00
Nadav Har'El
653f2df28f alternator: fix JSON escaping of error responses
In the DynamoDB API, error responses are in JSON format with specific
fields ("__type" and "message" in the x-amz-json-1.0 format currently
used). Alternator tried to be clever and build the string representation
of this JSON itself, instead of using RapidJSON. But this optimization
was a mistake - if the error message contains characters that need
escaping (such as double quotes and newlines), they weren't escaped,
and the resulting JSON was malformed. When the client library boto3
read this malformed JSON it got confused, cosidered the entire error
response to be a string, which resulted in an ugly error message.

The fix is easy - just build the JSON output as usual with RapidJSON
instead of trying to optimize using string operation.

The patch also includes two tests reproducing this bug and checking its
fix. The first test uses boto3 and shows it got confused on the type
of error (not understanding that it is a ValidationException). The
second test bypasses boto3 and shows exactly where the bug happens -
the response is an unparsable JSON.

Fixes #10278

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220327132705.3707979-1-nyh@scylladb.com>
2022-03-27 16:32:36 +03:00
Takuya ASADA
bdefea7c82 docker: enable --log-to-stdout which mistakenly disabled
Since our Docker image moved to Ubuntu, we mistakenly copy
dist/docker/etc/sysconfig/scylla-server to /etc/sysconfig, which is not
used in Ubuntu (it should be /etc/default).
So /etc/default/scylla-server is just default configuration of
scylla-server .deb package, --log-to-stdout is 0, same as normal installation.

We don't want keep the duplicated configuration file anyway,
so let's drop dist/docker/etc/sysconfig/scylla-server and configure
/etc/default/scylla-server in build_docker.sh.

Fixes #10270

Closes #10280
2022-03-27 14:50:10 +03:00
Avi Kivity
1feec08c2d Revert "api: storage_service: force_keyspace_compaction: compact one table at a time"
This reverts commit 37dc31c429. There is no
reason to suppose compacting different tables concurently on different shards
reduces space requirements, apart from non-deterministically pausing
random shards.

However, when data is badly distributed and there are many tables, it will
slow down major compaction considerably. Consider a case where there are
100 tables, each with a 2GB large partition on some shard. This extra
200GB will be compacted on just one shard. With compation rate of 40 MB/s,
this adds more than an hour to the process. With the existing code, these
compactions would overlap if the badly distributed data was not all in one
shard.

It is also counter to tablets, where data is not equally ditributed on
purpose.

Closes #10246
2022-03-25 19:24:50 +03:00
Botond Dénes
a69d98c3d0 Merge "Improve efficiency of cleanup compaction by making it bucket aware" from Raphael S. Carvalho
"
Cleanup compaction works by rewriting all sstables that need clean up, one at
a time.
This approach can cause bad write amplification because the output data is
being made incrementally available for regular compaction.
Cleanup is a long operation on large data sets, and while it's happening,
new data can be written to buckets, triggering regular compaction.
Cleanup fighting for resources with regular compaction is a known problem.
With cleanup adding one file at a time to buckets, regular may require multiple
rounds to compact the data in a given bucket B, producing bad writeamp.

To fix this problem, cleanup will be made bucket aware. As each compaction
strategy has its own definition of bucket, strategies will implement their
own method to retrieve cleanup jobs. The method will be implemented such that
all files in a bucket B will be cleaned up together, and on completion,
they'll be made available for regular at once.

For STCS / ICS, a bucket is a size tier.
For TWCS, a bucket is a window.
For LCS, a bucket is a level.

In this way, writeamp problem is fixed as regular won't have to perform
multiple rounds to compact the data in a given bucket. Additionally, cleanup
will now be able to deduplicate data and will become way more efficient at
garbage collecting expired data.

The space requirement shouldn't be an issue, as compacting an entire bucket
happens during regular compaction anyway.
With leveled strategy, compacting an entire level is also not a problem because
files in a level L don't overlap and therefore incremental compaction is
employed to limit the space requirement.

By the time being, only STCS cleanup was made bucket aware. The others will be
using a default method, where one file is cleaned up at a time. Making cleanup
of other strategies bucket aware is relatively easy now and will be done soon.

Refs #10097.
"

* 'cleanup-compaction-revamp/v3' of https://github.com/raphaelsc/scylla:
  test: sstable_compaction_test: Add test for strategy cleanup method
  compaction: STCS: Implement cleanup strategy
  compaction_manager: Wire cleanup task into the strategy cleanup method
  compaction_strategy: Allow strategies to define their own cleanup strategy
  compaction: Introduce compaction_descriptor::sstables_size
  compaction: Move decision of garbage collection from strategy to task type
2022-03-25 16:30:28 +02:00
Raphael S. Carvalho
5312526e5e test: sstable_compaction_test: Add test for strategy cleanup method
Stresses default and STCS implementations of cleanup method

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-25 11:23:29 -03:00
Raphael S. Carvalho
84101dec9e compaction: STCS: Implement cleanup strategy
This implements cleanup strategy for STCS. It will return one descriptor
for each size tier. If a given tier has more than max_threshold
elements, more than 1 job will be returned for that tier. Token
contiguity is preserved by sorting elements of a tier by token.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-25 11:23:29 -03:00
Raphael S. Carvalho
c7826aa910 compaction_manager: Wire cleanup task into the strategy cleanup method
As the cleanup process can now be driven by the compaction strategy,
let's move cleanup into a new task type that uses the new
compaction_strategy::get_cleanup_compaction_jobs().

By the time being all strategies are using the default method that
returns one descriptor for each sstable that needs clean up.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-25 11:23:26 -03:00
Mikołaj Sielużycki
6f1b6da68a compile: Fix headers so that *-headers targets compile cleanly.
Closes #10273
2022-03-25 16:19:26 +02:00
Botond Dénes
9e1a642e1f tools/schema_loader: use own data dictionary impl
And pass it to the cql3 layer when parsing statements. This allows the
schema loader to cut itself from replica::database, using a local, much
simpler database implementation. This not only makes the code much
simpler but also opens up the way to using the standard allocator in
tools. The real database uses LSA which is incompatible with the
standard allocator (in release builds that is).
2022-03-25 14:40:44 +02:00
Pavel Emelyanov
f326d359de system_keyspace: Make set_local_host_id non-static
The callers are system_keyspace.load_local_host_id and storage service.
The former is non-static since previous patch, the latter has its own
sys.ks. reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Emelyanov
26a12ac056 system_keyspace: Make load_local_host_id non-static
The only caller is main(), the sharded<> sys_ks is started at
this point.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Emelyanov
725ab5eea1 system_keyspace: Remove global cache instance
No users of this variable left, all the code relies on system_keyspace
"this" to get it. Respectively, the cache can be a unique_ptr<> on the
system_keyspace instance and the global sharded variable can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Emelyanov
4cb7c48243 system_keyspace: Make it peering service
And remove a bunch of (_local)?_cache.invoke_on_all() calls. This
is the preparation for removing the global cache instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Emelyanov
021c026482 system_keyspace,snitch: Make load_dc_rack_info non-static
It's snitch code that needs it. It now takes messaging service
from gossiper, so it can do the same with system keyspace. This
change removes one user of the global sys.ks. cache instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Emelyanov
c2cf4e3536 system_keyspace,cdc,storage_service: Make bootstrap manipulations non-static
The users of get_/set_bootstrap_sate and aux helpers are CDC and
storage service. Both have local system_keyspace references and can
just use them. This removes some users of global system ks. cache
and the qctx thing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Emelyanov
798caf2ef8 system_keyspace: Coroutinize set_bootstrap_state
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Emelyanov
3da5f6ac30 gossiper: Add system keyspace dependency
The gossiper reads peer features from system keyspace. Also the snitch
code needs system keyspace, and since now it gets all its dependencies
from gossiper (will be fixed some day, but not now), it will do the same
for sys.ks.. Thus it's worth having gossiper->system_keyspace explicit
dependency.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Botond Dénes
f501bc8d54 tombstone_gc: switch to using data dictionary
But only on the surface, the only internal function needing the database
(`needs_repair_before_gc()`) still gets a real database because the
replication factor cannot be obtained from the data dictionary
currently. Although this might not look like an improvement, it is
enough to avoid a `real_database()` call for tables that don't have
tombstone gc mode set to repair.
2022-03-25 13:17:58 +02:00
Pavel Emelyanov
62417577ab cdc_generation_service: Add system keyspace dependency
The service uses system keyspace to, e.g., manage the generation id,
thus it depends on the system_keyspace instance and deserves the
explicit reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 13:39:32 +03:00
Pavel Emelyanov
a9cbdee82e system_keyspace: Remove local host id from local cache
This value is now write-only, all readers had been patched
to use db::config copy

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 13:39:32 +03:00
Pavel Emelyanov
ec8be45259 storage_service: Update config.host_id on replace
The config.host_id value is loaded early on start, but when the
storage service prepares to join the cluster to replace a node,
it will change that value (with the host id of the target). This
change only affect the system keyspace, but not the config copy
which is a BUG.

fixes: #10243

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 13:39:32 +03:00
Pavel Emelyanov
3e7a21aa24 storage_service: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 13:39:29 +03:00
Pavel Emelyanov
b3503aaf6a storage_service: Coroutinize prepare_replacement_info()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 13:26:14 +03:00
Pavel Emelyanov
fa4d4beaf1 system_distributed_keyspace: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 13:25:55 +03:00
Pavel Emelyanov
b8d3048104 code,system_keyspace: Relax system_keyspace::load_local_host_id() usage
The method is nowadays called from several places:
- API
- sys.dist.ks. (to udpate view building info)
- storage service prepare_to_join()
- set up in main

They all, but the last, can use db::config cached value, because
it's loaded earlier than any of them (but the last -- that's the
loading part itself).

Once patched, the load_local_host_id() can avoid checking the cache
for that value -- it will not be there for sure.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 13:23:30 +03:00
Pavel Emelyanov
965d2a0a4f code,system_keyspace: Remove system_keyspace::get_local_host_id()
The host id is cached on db::config object that's available in
all the places that need it. This allows removing the method in
question from the system_keyspace and not caring that anyone that
needs host_id would have to depend on system_keyspace instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 13:21:59 +03:00
Botond Dénes
9a44c26d7e index/secondary_index_manager: switch to using data dictionary
Instead of directly using replica::table.
2022-03-25 11:44:31 +02:00
Botond Dénes
eff941d22c replica/table: add as_data_dictionary()
To allow converting table instances to data_dictionary::table.
2022-03-25 11:44:31 +02:00
Botond Dénes
4f2d900c9f replica: disentangle data_dictionary_impl from database
Make it a standalone class, instead of private subclass of database.
Unfriend database and instead make wrap/unwrap methods public, so anyone
can use them.
2022-03-25 11:44:31 +02:00
Botond Dénes
421d4411f8 replica: move data_dictionary_impl into own header
As a first step towards disentangling it from database and allowing it
to be used by other classes (like table) too.
2022-03-25 11:44:31 +02:00
Yaron Kaikov
5ef1b49cb8 docs/conf.py:update scylla-4.6 to latest
Now that Scylla 4.6 is out, it;s the latest release available

Closes: https://github.com/scylladb/scylla/issues/10266

Closes #10268
2022-03-24 14:18:10 +02:00
Nadav Har'El
06fdce82aa test/*/run: clearer error message on wrong SCYLLA setting
The test runners cql-pytest/run et al. try to automatically find the
last-compile Scylla executable, but this decision can be overriden by
the SCYLLA environment variable. If the user sets by mistake SCYLLA to
something which is not a valid path of an executable, the result was a
long and obscure Python stack trace.

So after this patch, if SCYLLA points to something which is not an
executable, a clear error is produced immediately, directing the user
to set it this variable to a correct executable

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220323164427.3301828-1-nyh@scylladb.com>
2022-03-23 19:01:17 +02:00
Jadw1
1438be5311 CQL3: Bloom filter efficacy test
Added CQL pytest to check bloom filter efficacy by
inserting `N` rows and reading `M` non-existing keys.

Added `bloom_filter_false_positives` method to Python `nodetool`
module. Method gets fp number by calling Scylla's API.

Fixes: #1055

Closes #10186
2022-03-23 16:51:50 +02:00
Avi Kivity
c89117e5f7 Update seastar submodule
* seastar 4e42a6019...3b4e42d74 (20):
  > condition-variable: Fix broadcast to handle "when(pred)" usage
  > condition-variable: Adjust when(pred) evaulation to match wait(pred) + tests
  > coroutine: parallel_for_each: set_callback: destroy future after move
  > Merge "Calculate max IO lengths as lengths" from Pavel E
  > future: fix make_exception_future(const exception_ptr&&)
  > condition-variable: Fix when(<pred>) functions + add test
  > Merge "semaphore: add abortable get_units" from Benny
  > condition_variable: restore move-constructability.
  > coroutine: add a non-preemptive co_await flavor
  > future: set_callback: consume future input
  > coroutines: add parallel_for_each
  > Merge "Add more rl-iosched test cases" from Pavel E
  > Revert "condition-variable: Fix when(<pred>) functions + add test"
  > condition-variable: Fix when(<pred>) functions + add test
  > reactor: fix buildability with SEASTAR_NO_EXCEPTION_HACK
  > Revert "core: memory: Replace overlooked uses of cpu_mem with get_cpu_mem()"
  > core: memory: Add a shortcut for stats().free_memory
  > lw_shared_ptr: rename shared_ptr_no_esft to lw_shared_ptr_no_esft
  > core: memory: Replace overlooked uses of cpu_mem with get_cpu_mem()
  > include/seastar/core: define iterator without std::iterator<>
2022-03-23 12:10:16 +02:00
Raphael S. Carvalho
44e9e10414 compaction_strategy: Allow strategies to define their own cleanup strategy
Today, all compaction strategies will clean up their files using the
incremental approach of one sstable being rewritten at a time.

Turns out that's not the best approach performance wise. Let's take
STCS for example. As cleanup finishes rewriting one file, the output
file is placed into the sstable set. Regular now can compact that
file with another that was already there (e.g. produced by flush after
cleanup started). Inefficient compactions like this can keep happening
as cleanup incrementally places output file into the candidate list
for regular.

This method will allow strategies to clean up their files in batches.
For example, STCS can clean up all files in smallest tiers in single
round, allowing the output data to be added at once. So next compaction
rounds can be more efficient in terms of writeamp. Another benefit is
that deduplication and GC can happen more efficiently.

The drawback is the space requirement, as we no longer compact one file
a a time. However, the impact is minimized by cleaning up the smallest
tier first. With leveled strategy for example, even though 90% of data
is in highest level, the space requirement is not a problem because
we can apply the incremental compaction on its behalf. The same applies
to ICS. With STCS, the requirement is the size of the tier being
compacted, but that's already expected by its users anyway.

By the time being, all strategies have it unimplemented. so they still
use the old behavior where files are rewritten on at a time.
This will allow us to incrementally implement the cleanup method for
all compaction strategies.

Refs #10097.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-23 00:04:03 -03:00
Pavel Emelyanov
b0494b897d scylla-gdb: Support lw_shared_ptr_no_esft
Seastar commit fcb2b901 renamed the struct by adding it the lw_ prefix.

tests: unit.gdb(release)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220322151328.23203-1-xemul@scylladb.com>
2022-03-22 17:19:39 +02:00
Calle Wilund
56c383ba8e test/perf/perf_commitlog: Add a small commitlog throughput test
Based on perf_simple_query, just bashes data into CL using
normal distribution min/max data chunk size, allowing direct
freeing of segments, _but_ delayed by a normal dist as well,
to "simulate" secondary delay in data persistance.

Needs more stuff.

Some baseline measurements on master:

--min-flush-delay-in-ms 10 --max-flush-delay-in-ms 200
--commitlog-use-hard-size-limit true
--commitlog-total-space-in-mb 10000 --min-data-size 160 --max-data-size 1024
--smp1

median 2065648.59 tps (  1.1 allocs/op,   0.0 tasks/op,    1482 insns/op)
median absolute deviation: 48752.44
maximum: 2161987.06
minimum: 1984267.90

--min-data-size 256 --max-data-size 16384

median 269385.25 tps (  2.2 allocs/op,   0.7 tasks/op,    3244 insns/op)
median absolute deviation: 15719.13
maximum: 323574.43
minimum: 228206.28

--min-data-size 4096 --max-data-size 61440

median 67734.22 tps (  6.4 allocs/op,   2.9 tasks/op,    9153 insns/op)
median absolute deviation: 2070.93
maximum: 82833.17
minimum: 61473.57

--min-data-size 61440 --max-data-size 1843200

median 2281.37 tps ( 79.7 allocs/op,  43.5 tasks/op,  202963 insns/op)
median absolute deviation: 128.87
maximum: 3143.84
minimum: 2140.80

--min-data-size 368640 --max-data-size 6144000

median 679.76 tps (225.5 allocs/op, 116.3 tasks/op,  662700 insns/op)
median absolute deviation: 39.30
maximum: 1148.95
minimum: 586.86

Actual throughput obviously meaningless, as it is run on my slow
machine, but IPS might be relevant.
Note that transaction throughput plummets as we increase median data
sizes above ~200k, since we then more or less always end up replacing
buffers in every call.

Closes #10230
2022-03-22 15:18:25 +02:00
Pavel Emelyanov
cb4fe65a78 scripts: Allow specifying submodule branch to refresh from
There's a script to automate fetching submodule changes. However, this
script alays fetches remote master branch, which's not always the case.
For example, for branch-5.0/next-5.0 pair the correct scylla-seastar
branch would be the branch-5.0 one, not master.

With this change updating a submodule from a custom branch would be like

   refresh-submodules.sh <submodule>:<branch>

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220322093623.15748-1-xemul@scylladb.com>
2022-03-22 15:18:25 +02:00
Pavel Emelyanov
22d131d40a Revert "scripts: Detect remote branch to fetch submodules from"
This reverts commit 87df37792c.

Scylla branches are not mapped to seastar branches 1-1, so getting
the upstream scylla branch doesn't point to the correct seastar one.
2022-03-22 15:18:25 +02:00
Avi Kivity
72c6859c25 Merge "readers: get rid of v1 mutation from fragments" from Botond
"
The only real user is view building, which is converted to v2 and then
the v1 version of the mutation from fragments reader is removed.

Tests: unit(dev, release)
"

* 'v2-only-from-fragments-mutations/v1' of https://github.com/denesb/scylla:
  readers: remove now unused v1 reader from fragments
  test/boost: flat_mutation_reader_test: remove reader from fragments test
  replica/table: migrate generate_and_propagate_view_updates() to v2
  replica/table: migrate populate_views() to v2
  db/view: convert view_update_builder interface to v2
  db/view: migrate view_update_builder to v2
2022-03-22 15:18:25 +02:00
Benny Halevy
9871c757e0 HACKING: refer to backtrace.scylladb.com
Fixes #10252

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10253
2022-03-22 15:18:25 +02:00
Raphael S. Carvalho
25be958ab9 compaction: Introduce compaction_descriptor::sstables_size
This method can be reused in manager, and will be useful for upcoming
cleanup task.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-21 12:55:10 -03:00
Raphael S. Carvalho
c25d8f6770 compaction: Move decision of garbage collection from strategy to task type
For compaction to be able to purge expired data, like tombstones, a
sstable set snapshot is set in the compaction descriptor.

That's a decision that belongs to task type. For example, all regular
compaction enable GC, whereas scrub for example doesn't for safety
reasons.

The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and also adds lots of boilerplate to the code, making it hard to
understand and work with.

As sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.

can_purge_tombstones, used previously by rewrite task only, is being
reused for communicating GC intention into task::compact_sstables().

The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, it could happen that the snapshot
would be discarded if the compaction attempt had to be postponed,
which is a waste of cpu cycles.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-21 12:14:04 -03:00
Botond Dénes
dacc3c08a9 readers/upgrading_consumer: workaround for aarch64 miscompilation
On aarch64 the `std::move(mf)` seems to be reordered w.r.t.
`flush_tombstones()` in certain circumstances. These circumstances
are not clear yet, but while further investigation happens, this patch
makes the tests pass on aarch64, unclogging the promotion pipeline.

Refs: #10248
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220321122209.71685-1-bdenes@scylladb.com>
2022-03-21 15:07:24 +02:00
Avi Kivity
585c0841c3 Merge 'sstables: enable read ahead for the partition index reader' from Wojciech Mitros
Currently, when advancing one of `index_reader`'s bounds, we're creating a new `index_consume_entry_context` with a new underlying file `input_stream` for each new page.

For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing.

This patch adds a `index_consume_entry_context` to each of `index_reader`'s bounds, so that for each new page, the same file `input_stream` is used.
As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the `input_stream`'s read aheads, decreasing the number of blocking reads and increasing the throughput of the `index_reader`.

Additionally, we're reusing the `index_consumer` for all pages, calling `index_consumer::prepare` when we need to increase the size of  the `_entries` `chunked_managed_vector`.

A big difference can be seen when we're reading the entire table, frequently skipping a few rows; which we can test using perf_fast_forward:

Before:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
-> 1       0         0.899447            4   1000000    1111794      12284    1113248    1096537      975.5    972     124356       1       0        0        0        0        0        0        0  12032202   29103    8967 100.0%
-> 1       1         1.805811            4    500000     276884        907     278214     275977     3655.8   3654     135084    2688       0     3161     4548     5935        0        0        0   7225100  140466   27010  75.6%
-> 1       8         0.927339            4    111112     119818        357     120465     119461     3654.0   3654     135084    2685       0     2133     4548     6963        0        0        0   1749663  107922   57502  50.2%
-> 1       16        0.790630            4     58824      74401        782      74617      73497     3654.0   3654     135084    2695       0     1975     4548     7121        0        0        0   1019189  109349   90832  42.7%
-> 1       32        0.717235            4     30304      42251        243      42266      41975     3654.0   3654     135084    2689       0     1871     4548     7225        0        0        0    619876  109199  156751  37.3%
-> 1       64        0.681624            4     15385      22571        244      22815      22286     3654.0   3654     135084    2685       0     1870     4548     7226        0        0        0    407671  105798  285688  34.0%
-> 1       256       0.630439            4      3892       6173         24       6214       6150     3549.0   3549     135116    2581       0     1313     3927     6505        0        0        0    232541  100803 1022454  29.1%
-> 1       1024      0.313303            4       976       3115        219       3126       2766     1956.0   1956     130608     986       0        0      987     1962        0        0        0     81165   41385 1724979  29.1%
-> 1       4096      0.083688            4       245       2928         85       3012       2134      738.8    737      17212     492     244        0      247      491        0        0        0     30500   19406 1999263  24.6%
-> 64      1         1.509011            4    984616     652491       2746     660930     649745     3673.5   3654     135084    2687       0     4507     4548     4589        0        0        0  11075882  117074   13157  68.9%
-> 64      8         1.424147            4    888896     624160       4446     625675     617713     3654.0   3654     135084    2691       0     4248     4548     4848        0        0        0  10019098  117383   13700  66.5%
-> 64      16        1.343276            4    800000     595559       5834     605880     589725     3654.0   3654     135084    2698       0     3989     4548     5107        0        0        0   9043830  124022   14206  64.9%
-> 64      32        1.249721            4    666688     533469       5056     536638     526212     3654.0   3654     135084    2688       0     3616     4548     5480        0        0        0   7570848  123043   15377  60.9%
-> 64      64        1.154549            4    500032     433097      10215     443312     415001     3654.0   3654     135084    2703       0     3161     4548     5935        0        0        0   5718758  110657   17787  53.2%
-> 64      256       1.005309            4    200000     198944       1179     199338     196989     3935.0   3935     137216    2966       0      690     4048     5592        0        0        0   2398359  110510   27855  51.3%
-> 64      1024      0.441913            4     58880     133239       8094     135471     120467     2161.0   2161     131820    1190       0        0     1192     1848        0        0        0    725092   45449   33740  59.7%
-> 64      4096      0.124826            4     15424     123564       5958     126814      95101      795.5    794      17400     553     240        0      312      482        0        0        0    199943   20869   46621  41.9%
```
After:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
-> 1       0         0.917468            4   1000000    1089956       1422    1091378    1073112      975.5    972     124356       1       0        0        0        0        0        0        0  12032761   29721    8972 100.0%
-> 1       1         1.311446            4    500000     381259       3212     384470     377238     1087.0   1083     138420       2       0     4445     4548     4651        0        0        0   7096216   55681   20869 100.0%
-> 1       8         0.467975            4    111112     237432       1446     239372     235985     1121.2   1119     143124       9       0     4344     4548     4752        0        0        0   1619944   23502   28844  98.7%
-> 1       16        0.337085            4     58824     174508       3410     178451     171099     1117.5   1120     143276      11       0     4319     4548     4777        0        0        0    883692   19152   37460  96.8%
-> 1       32        0.262798            4     30304     115313       1222     116535     112400     1070.2   1066     135620     166      26     4354     4548     4742        0        0        0    483185   18856   54275  94.9%
-> 1       64        0.283954            4     15385      54181        531      56177      53650     2022.5   2040     137036     319      19     4351     4548     4745        0        0        0    292766   32998  102276  84.9%
-> 1       256       0.207020            4      3892      18800        575      19105      17520     1315.5   1334     136072     418      24     3703     3927     4115        0        0        0    118400   27427  292146  82.1%
-> 1       1024      0.164396            4       976       5937         57       5993       5842     1208.2   1195     135384     568      14      932      987     1030        0        0        0     62999   27554  503559  70.0%
-> 1       4096      0.085079            4       245       2880        108       2987       2714      635.8    634      26468     248     246      233      247      258        0        0        0     31264   12872 1546404  37.4%
-> 64      1         1.073331            4    984616     917346       7614     923983     909314     1812.2   1824     136792      11      20     4544     4548     4552        0        0        0  10971661   54538    9919  99.6%
-> 64      8         1.024389            4    888896     867733       6327     870429     845215     3027.2   3072     138212      31       0     4523     4548     4573        0        0        0   9933078   68059   10050  99.5%
-> 64      16        0.978754            4    800000     817366       7802     827665     809564     3012.2   3008     139884      39       0     4486     4548     4610        0        0        0   8947041   64050   10302  98.1%
-> 64      32        0.837266            4    666688     796267      10312     806579     785370     2275.8   2266     139672      29       0     4465     4548     4631        0        0        0   7458644   50754   10564  97.8%
-> 64      64        0.645627            4    500032     774490       4713     779203     768432     1136.8   1137     145428       8       0     4438     4548     4658        0        0        0   5593168   29982   10938  98.4%
-> 64      256       0.386192            4    200000     517877      22509     544067     495368     1134.8   1136     145300     109       0     2135     4048     4147        0        0        0   2270291   22840   13682  94.5%
-> 64      1024      0.238617            4     58880     246755      55856     305110     190899     1176.0   1118     135324     451      13      625     1192     1223        0        0        0    701262   24418   17323  71.1%
-> 64      4096      0.133340            4     15424     115674      14837     117978      99072      974.0    961      27132     366     347       99      312      383        0        0        0    209595   20657   43096  50.4%
```
For single partition reads, the index_reader is modified to behave in practically the same way, as before the change (not reading ahead past the page with the partition).
For example, a single partition read from a table with 10 rows per partition performs a single 6KB read from the index file, and the same read is performed before the change (as can be seen in traces below). If we enabled read aheads in that case, we would perform 2 16KB reads.
Relevant traces:
Before:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:22:25.847362 | 127.0.0.1 |            148 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:22:25.900996 | 127.0.0.1 |          53782 | 127.0.0.1
```
After:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:19:37.380033 | 127.0.0.1 |            149 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:19:37.433662 | 127.0.0.1 |          53777 | 127.0.0.1
```
Tests: unit(dev)

Closes #9063

* github.com:scylladb/scylla:
  sstables: index_reader: optimize single partition reads
  sstables: use read-aheads in the index reader
  sstables: index_reader: remove unused members from index reader context
2022-03-21 13:47:28 +02:00
Nadav Har'El
f76f6dbccb secondary index: avoid special characters in default index names
In CQL, table names are limited to so-called word characters (letters,
numbers and underscores), but column names don't have such a limitation.
When we create a secondary index, its default name is constructed from
the column name - so can contain problematic characters. It can include
even the "/" character. The problem is that the index name is then used,
like a table name, to create a directory with that name.

The test included in this patch demonstrates that before this patch, this
can be misused to create subdirectories anywhere in the filesystem, or to
crash Scylla when it fails to create a directory (which it considers an
unrecoverable I/O error).

In this patch we do what Cassandra does - remove all non-word
characters from the indexed column name before constructing the default
index name. In the included test - which can run on both Scylla and
Cassandra - we verify that the constructed index name is the same as
in Cassandra, which is useful to know (e.g., because knowing the index
name is needed to DROP the index).

Also, this patch adds a second line of defense against the security problem
described above: It is now an error to create a schema with a slash or
null (the two characters not allowed in Unix filenames) in the keyspace
or table names. So if the first line of defense (CQL checking the validity
of its commands) fails, we'll have that second line of defense. I verified
that if I revert the default-index-name fix, the second line of defense
kicks in, and the index creation is aborted and cannot create files in
the wrong place to crash Scylla.

Fixes #3403

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220320162543.3091121-1-nyh@scylladb.com>
2022-03-20 18:33:48 +02:00
Takuya ASADA
59c72d5d60 scylla_prepare: print Traceback with current user-friendly messages
On e1b15ba, we introduce user-friendly error message when Exception
occured while generating perftune.yaml.
However, it becomes difficult to investigate bugs since we dropped
traceback.
To resolve this problem, let's print both traceback and user-friendly
messages.

Related #10050

Closes #10140
2022-03-20 16:55:18 +02:00
Michał Chojnowski
f422e18906 cql3: restrictions: statement_restrictions: avoid an unnecessary vector copy
A minor optimization.

Closes #10231
2022-03-20 15:40:46 +02:00
Tomasz Grabiec
cd5fec8a23 Merge "raft: re-advertise gossiper features when raft feature support changes" from Pavel
Prior to the change, `USES_RAFT_CLUSTER_MANAGEMENT` feature wasn't
properly advertised upon enabling `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
raft feature.

This small series consists of 3 parts to fix the handling of supported
features for raft:
1. Move subscription for `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` to the
   `raft_group_registry`.
2. Update `system.local#supported_features` directly in the
   `feature_service::support()` method.
3. Re-advertise gossiper state for `SUPPORTED_FEATURES` gossiper
   value in the support callback within `raft_group_registry`.

* manmanson/track_supported_set_recalculation_v7:
  raft: re-advertise gossiper features when raft feature support changes
  raft: move tracking `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` feature to raft
  gms: feature_service: update `system.local#supported_features` when feature support changes
  test: cql_test_env: enable features in a `seastar::thread`
2022-03-18 12:34:17 +01:00
Pavel Solodovnikov
ebc2178ea5 raft: re-advertise gossiper features when raft feature support changes
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:29 +03:00
Pavel Solodovnikov
011942dcce raft: move tracking SUPPORTS_RAFT_CLUSTER_MANAGEMENT feature to raft
Move the listener from feature service to the `raft_group_registry`.

Enable support for the `USES_RAFT_CLUSTER_MANAGEMENT`
feature when the former is enabled.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:25 +03:00
Pavel Solodovnikov
7ea4d44508 gms: feature_service: update system.local#supported_features when feature support changes
Also, change the signature of `support()` method to return
`future<>` since it's now a coroutine. Adjust existing call sites.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:21 +03:00
Pavel Solodovnikov
724ea7aa38 test: cql_test_env: enable features in a seastar::thread
Each feature can have an associated `when_enabled` callback
registered, which is assumed to run in the thread context,
so wrap the `enable()` call in a seastar thread.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:15 +03:00
Avi Kivity
aab052c0d5 Merge 'replica/database: truncate: temporarily disable compaction on table and views before flush' from Benny Halevy
Flushing the base table triggers view building
and corresponding compactions on the view tables.

Temporarily disable compaction on both the base
table and all its view before flush and snapshot
since those flushed sstables are about to be truncated
anyway right after the snapshot is taken.

This should make truncate go faster.

In the process, this series also embeds `database::truncate_views`
into `truncate` and coroutinizes both

Refs #6309

Test: unit(dev)

Closes #10203

* github.com:scylladb/scylla:
  replica/database: truncate: fixup indentation
  replica/database: truncate: temporarily disable compaction on table and views before flush
  replica/database: truncate: coroutinize per-view logic
  replica/database: open-code truncate_view in truncate
  replica/database: truncate: coroutinize run_with_compaction_disabled lambda
  replica/database: coroutinize truncate
  compaction_manager: add disable_compaction method
2022-03-17 17:24:20 +02:00
Pavel Emelyanov
87df37792c scripts: Detect remote branch to fetch submodules from
There's a script to automate fetching submodule changes. However, this
script alays fetches remote master branch, which's not always the case.
The correct branch can be detected by checking the current remote
tracking scylla branch which should coincide with the submodule one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220317085018.11529-1-xemul@scylladb.com>
2022-03-17 12:21:29 +02:00
Avi Kivity
77f330f393 Merge "readers: retire v1 generating reader implementation" from Botond
"
The generating reader is a reader which converts a functor returning
mutation fragments to a mutation reader.
We currently have 2 generating reader implementations: one operating
with a v1 functor and one with a v2 one. This patch-set converts the v1
functor based one to a v2 reader, by adapting the v1 functor to a v2
functor and reusing the v2 reader implementation.
Tests are also added to both variants.

Tests: unit(dev)
"

* 'generating-reader-v2/v1' of https://github.com/denesb/scylla:
  test/boost: mutation_reader_test: add tests for generating reader
  test: export squash_mutations() into lib/mutation_source_test.hh
  readers: add next partition adaptor
  readers: implement generating_reader from v1 generator via adaptor
  readers: upgrade_to_v2(): reimplement in terms of upgrading_consumer
  readers: add upgrading_consumer
  readers: generating_reader: use noncopyable_function<>
  readers: merge generating.hh into generating_v2.hh
  readers/generating.hh: return v2 reader from make_generating_reader()
2022-03-17 12:19:25 +02:00
Botond Dénes
d15999a58e readers: remove now unused v1 reader from fragments 2022-03-17 11:03:16 +02:00
Botond Dénes
3ea1240fb9 test/boost: flat_mutation_reader_test: remove reader from fragments test 2022-03-17 11:03:16 +02:00
Botond Dénes
e12c543d3f replica/table: migrate generate_and_propagate_view_updates() to v2 2022-03-17 10:51:25 +02:00
Botond Dénes
4b9219a209 replica/table: migrate populate_views() to v2 2022-03-17 10:51:05 +02:00
Botond Dénes
909be0b9d7 db/view: convert view_update_builder interface to v2
The constructor and the make_ factory method now take v2 readers.
Immediate users are patched, with conversions if needed.
2022-03-17 10:50:50 +02:00
Botond Dénes
0740019e4d db/view: migrate view_update_builder to v2
To avoid noise, the interface is left as v1 and inbound readers are
converted in the constructor.
2022-03-17 10:47:55 +02:00
Botond Dénes
c450508954 Merge "Introduce sharded<system_keyspace> instance" from Pavel Emelyanov
"
Making the system-keyspace into a standard sharded instance will
help to fix several dependency knots.

First, the global qctx and local-cache both will be moved onto the
sys-ks, all their users will be patched to depend on system-keyspace.
Now it's not quite so, but we're moving towards this state.

Second, snitch instance now sits in the middle of another dependency
loop. To untie one the preferred ip and dc/rack info should be
moved onto system keyspace altogether (now it's scattered over several
places). The sys-ks thus needs to be a sharded service with some
state.

This set makes system-keyspace sharded instance, equipps it with all
the dependencies it needs and passes it as dependency into storage
service, migration manager and API. This helps eliminating a good
portion of global qctx/cache usage and prepares the ground for snitch
rework.

tests: unit(dev)
       v1: unit(debug), dtest.simple_boot_shutdown(dev)
"

* 'br-sharded-system-keyspace-instance-2' of https://github.com/xemul/scylla: (25 commits)
  system_keyspace: Make load_host_ids non-static
  system_keyspace: Make load_tokens non-static
  system_keyspace: Make remove_endpoint and update_tokens non-static
  system_keyspace: Coroutinize update_tokens
  system_keyspace: Coroutinize remove_endpoint
  system_keyspace: Make update_cached_values non-static
  system_keyspace: Coroutinuze update_peer_info
  system_keyspace: Make update_schema_version non-static
  schema_tables: Add sharded<system_keyspace> argument to update_schema_version_and_announce
  replica: Push sharded<system_keyspace> down to parse_system_tables
  api: Carry sharded<system_keyspace> reference along
  storage_service: Keep sharded<system_keyspace> reference
  migration_manager: Keep sharded<system_keyspace> reference
  system_keyspace: Remove temporary qp variable
  system_keyspace: Make get_preferred_ips non-static
  system_keyspace: Make cache_truncation_record non-static
  system_keyspace: Make check_health non-static
  system_keyspace: Make build_bootstrap_info non-static
  system_keyspace: Make build_dc_rack_info non-static
  system_keyspace: Make setup_version non-static
  ...
2022-03-17 08:16:29 +02:00
Botond Dénes
9f95042c2b test/boost: mutation_reader_test: add tests for generating reader 2022-03-17 08:08:01 +02:00
Botond Dénes
4243cd395d test: export squash_mutations() into lib/mutation_source_test.hh
This method used to be a static one in
boost/flat_mutation_reader_test.cc. Turns out it is useful for other
tests based on the mutation source test suite, so move it into the
header of the latter to make it accessible.
2022-03-17 08:08:01 +02:00
Botond Dénes
9e3d8cb06f readers: add next partition adaptor
Provides a wrapper with a `next_partition()` implementation for readers
that can't have one. Mainly for testing purposes.
2022-03-17 08:08:01 +02:00
Botond Dénes
3594f836fc readers: implement generating_reader from v1 generator via adaptor
Adaptor converts the
`noncopyable_function<future<mutation_fragment_opt>>` to the v2
equivalent, so we can have a single generating reader implementation.
The adaptor uses the upgrading_consumer reusable upgrade component to
implement the actual upgrade.
2022-03-17 08:08:01 +02:00
Botond Dénes
47b806393b readers: upgrade_to_v2(): reimplement in terms of upgrading_consumer
Use the reusable upgrading_consumer introduced in the previous patch as
the v2 upgrade implementation.
2022-03-17 08:08:01 +02:00
Botond Dénes
ffeeb83edf readers: add upgrading_consumer
Upgrading a v1 stream to a v2 one is a common task that currently
requires duplicating the upgrade logic in all components that wan to do
this. This patch extract the upgrade logic from `upgrade_to_v2()` into a
reusable component to promote code reuse.
2022-03-17 08:08:01 +02:00
Botond Dénes
fcf15fda94 readers: generating_reader: use noncopyable_function<>
std::function<> requires the functor it wraps to be copyable, which is
an unnecessarily strict requirement. To relax this, we use
noncopyable_function<> instead. Since the former seems to lack some
disambiguation magic of the latter, we add `_v1` and `_v2` postfixes to
manually disambiguate.
2022-03-17 06:53:44 +02:00
Botond Dénes
35bbd54946 readers: merge generating.hh into generating_v2.hh
Both variants return a v2 reader and we are going to keep both for a
time to come.
2022-03-17 06:52:28 +02:00
Botond Dénes
7844ff9912 readers/generating.hh: return v2 reader from make_generating_reader()
For now, the (v1) reader is just upgraded to v2 behind the scenes.
2022-03-17 06:51:20 +02:00
Gleb Natapov
a1604aa388 raft: make raft requests abortable
This patch adds an ability to pass abort_source to raft request APIs (
add_entry, modify_config) to make them abortable. A request issuer not
always want to wait for a request to complete. For instance because a
client disconnected or because it no longer interested in waiting
because of a timeout. After this patch it can now abort waiting for such
requests through an abort source. Note that aborting a request only
aborts the wait for it to complete, it does not mean that the request
will not be eventually executed.

Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
2022-03-16 18:38:01 +01:00
Benny Halevy
a1d0f089c8 replica: distributed_database: populate_column_family: trigger offstrategy compaction only for the base directory
In https://github.com/scylladb/scylla/issues/10218
we see off-strategy compaction happening on a table
during the initial phases of
`distributed_loader::populate_column_family`.

It is caused by triggering offtrategy compaction
too early, when sstables are populated from the staging
directory in a144d30162.

We need to trigger offstrategy compaction only of the base
table directory, never the staging or quarantine dirs.

Fixes #10218

Test: unit(dev)
DTest: materialized_views_test.py::TestInterruptBuildProcess

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220316152812.3344634-1-bhalevy@scylladb.com>
2022-03-16 18:57:00 +02:00
Botond Dénes
0ea6dddc00 test/boost/mutation_reader_test: remove unused puppet_reader
All users use puppet_reader_v2 now.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220316135525.211753-1-bdenes@scylladb.com>
2022-03-16 18:57:00 +02:00
Botond Dénes
afc824a109 test/boost/flat_mutation_reader_test: test_flat_mutation_reader_consume_single_partition: make ckrange conditionally inclusive
Depending on the bound weight of the position of the last fragment we
expect to read. Currently the range is unconditionally exclusive, which
might lead to an artificial difference between the read and expected
data, due to a fragment being possibly omitted.

Fixes #10229.

Tests: unit(boost/flat_mutation_reader_test:test_flat_mutation_reader_consume_single_partition)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220304133515.74586-1-bdenes@scylladb.com>
2022-03-16 18:57:00 +02:00
Avi Kivity
975b0c0b03 Merge "tools/scylla-sstable: add validate-checksums and decompress" from Botond
"
This patchset adds two new operations to scylla-sstable:
* validate-checksums - helps identifying whether an sstable is intact or
  not, but checking the digest and the per-chunk checksums against the
  data on disk.
* decompress - helps when one wants to manually examine the content of a
  compressed sstable.

Refs: #497

Tests: unit(dev)
"

* 'scylla-sstable-validate-checksums-decompress/v3' of https://github.com/denesb/scylla:
  tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/
  tools/scylla-sstable: add decompress operation
  tools/scylla-sstables: add validate-checksums operation
  sstables/sstable: add validate_checksums()
  sstables/sstable: add raw_stream option to data_stream()
  sstables/sstable: make data_stream() and data_read() public
  utils/exceptions: add maybe_rethrow_exception()
2022-03-16 18:56:48 +02:00
Avi Kivity
c4a992564b Merge "Assorted fixes for Fedora 36 build" from Pavel S
"
This mini-series contains a few trivial fixes to be able
to build scylla on Fedora 36 Pre-Release, which will
soon enter "Beta" state.

It's mostly fixes due to some changes to external dependencies,
e.g. boost.outcome and libfmt.

Tests: unit(dev)
"

* 'fc36_build_fixes_v1' of https://github.com/ManManson/scylla:
  schema: fix build issues with libstdc++ 12
  treewide: fix compilation issues with fmtlib 8.1.0+
  utils/result.hh: add missing header includes for boost.outcome
2022-03-16 18:56:02 +02:00
Pavel Emelyanov
e8ba395fea system_keyspace: Make load_host_ids non-static
Same as previous patch -- just use the reference from storage service

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
28204cd83d system_keyspace: Make load_tokens non-static
Called from storage service that has system-keyspace instances

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
8f977814bc system_keyspace: Make remove_endpoint and update_tokens non-static
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
3f0f94b081 system_keyspace: Coroutinize update_tokens
While at it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
7b2f142e2d system_keyspace: Coroutinize remove_endpoint
Not to capture 'this' all over the method in next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
7d0d5642c0 system_keyspace: Make update_cached_values non-static
The update_table() helper template too. And the update_peer_info as
well. It can stop using global qctx and cache after that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
5a1f7193b0 system_keyspace: Coroutinuze update_peer_info
Not to carry 'this' over captures in next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
c15359165d system_keyspace: Make update_schema_version non-static
It's called from two places -- .setup() and schema_tables code. Both
have the instance hanging around, so the method can be de-marked
static and set free from global qctx

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
b80d5f8900 schema_tables: Add sharded<system_keyspace> argument to update_schema_version_and_announce
All its (indirect) callers had been patched to have it, now it's
possible to have the argument in it. Next patch will make use of it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
009c449cc3 replica: Push sharded<system_keyspace> down to parse_system_tables
The method needs to call merge_schema() that will need system keyspace
instance at hand. The parse_s._t. method is boot-time one, pushing the
main-local instance through it is fine

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
bd4beeeebe api: Carry sharded<system_keyspace> reference along
There's an APi call to recalculate schema version that needs
the system_keyspace instance at hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
f18a80852e storage_service: Keep sharded<system_keyspace> reference
Storage service uses system keyspace on boot heavily

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
42e733bdf7 migration_manager: Keep sharded<system_keyspace> reference
The main target here is system_keyspace::update_schema_version() which
is now static, but needs to have system_keyspace at "this". Migration
manager is one of the places that calls that method indirectly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
835f39e0ba system_keyspace: Remove temporary qp variable
After the previous patches it's possible to clean local stack
of .setup() a little bit

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
ece4448ea9 system_keyspace: Make get_preferred_ips non-static
And mark the method private while at it, because it is such

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
a095e8d92d system_keyspace: Make cache_truncation_record non-static
This one is a bit more tricky that its four preceeders. The qctx's
qp().execute_cql() is replaced with qp().execute_internal() for
symmetry with the rest. Without data args it's the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
f54473427d system_keyspace: Make check_health non-static
Yet another same step. Drop static keyword and patch out globals.
Get config.cluster_name from _db while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
5944d5f663 system_keyspace: Make build_bootstrap_info non-static
The same -- drop static and forget global qctx and cache

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
08174eb868 system_keyspace: Make build_dc_rack_info non-static
Same here -- remove static and patch out global qctx and cache.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
66beaad1e5 system_keyspace: Make setup_version non-static
Just remove static mark and stop using global qctx.
Grab config from _db instead of argument while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
00a345c4d8 system_keyspace: Copy execute_internal() from query context
Before patching system_keyspace methods to use query processor from
its instance, the respective call is needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
f4c185c30e system_keyspace: Make setup method non-static
It's called only on start and actively uses both qctx and local
cache. Next patches will fix the whole setup code to stop using
global qctx/cache.

For now setup invocation is left in its place, but it must really
happen in start() method. More patching is needed to make it work.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 13:57:59 +03:00
Pavel Emelyanov
b761e558b5 system_keyspace: Keep local_cache reference on board
For now it's a reference, but all users of the cache will be
eventually switched into using system_keyspace.

In cql-test-env cache starting happens earlier than it was
before, but that's OK, it just initializes empty instances.
In main cache starts at the same time as before patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 13:57:59 +03:00
Pavel Emelyanov
1bcb6c13a5 system_keyspace: Move minimal_setup into start
Start happens at exactly the same place. One thing to take care
of is that it happens on all shards.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 13:57:59 +03:00
Pavel Emelyanov
7ef69b8189 system_keyspace: Make sharded object
The db::system_keyspace was made a class some time ago, time to create
a standard sharded<> object out of it. It needs query processor and
database. None of those depensencies is started early enough, so the
object for now starts in two steps -- early instances creation and
late start.

The instances will carry qctx and local_cache on board and all the
services that need those two will depend on system-keyspace. Its start
happens at exactly the same place where system_keyspace::setup happens
thus any service that will use system_keyspace will be on the same
safe side as it is now.

In the further future the system_keyspace will be equpped with its
own query processor backed by local replica database instance, instead
of the whole storage proxy as it is now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 13:57:59 +03:00
Pavel Solodovnikov
9a5aae654f schema: fix build issues with libstdc++ 12
Switch from using `std::map::insert` to `std::map::emplace`
in the `get_sharder()` function, since we are constructing
a temporary value anyway.

Also, use `std::make_pair` instead of initializer list because
for some reason Clang 13 w/ libstdc++ 12 argues about not
being able to find a suitable overload.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-16 12:34:22 +03:00
Pavel Solodovnikov
95c8d65949 treewide: fix compilation issues with fmtlib 8.1.0+
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.

A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-16 12:31:50 +03:00
Pavel Solodovnikov
95dc534d0c utils/result.hh: add missing header includes for boost.outcome
Looks like internal boost.outcome headers don't include some
of needed dependencies, so do that manually in our headers.

For some reason it worked before, but started to fail when
building on Fedora 36 setup.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-16 12:28:47 +03:00
Raphael S. Carvalho
0cc717ee86 compaction_manager: Retrieve and register files in rewrite_sstables() atomically
The atomicity was lost in commit a2a5e530f0.

Registration of compacting SSTables now happens in rewrite_sstables_compaction_task
ctor, but that's risky because a regular compaction could pick those
same files if run_with_compaction_disabled() defers after the callback
passed to it returns, and before run__w__c__d() caller has a chance to
run. The deferring point is very much possible, because submit()
(submits a regular job) is called when run__w__c__d() reenables compaction
internally.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220315182857.121479-1-raphaelsc@scylladb.com>
2022-03-16 09:58:16 +02:00
Raphael S. Carvalho
58e520ab1d compaction: Move run_off_strategy_compaction() into compaction manager
Compaction manager is calling back the table to run off-strategy compaction,
but the logic clearly belongs to manager which should perform the
operation independently and only call table to update its state with the
result.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220315174504.107926-2-raphaelsc@scylladb.com>
2022-03-16 09:55:52 +02:00
Raphael S. Carvalho
1bae803a8b table: Add maintenance_sstable_set()
Let's expose maintenance set, to allow the implementation of
off-strategy compaction to be moved into the compaction manager.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220315174504.107926-1-raphaelsc@scylladb.com>
2022-03-16 09:55:51 +02:00
Jan Ciolek
2e7009f427 cql3: expr: is_supported_by: Return false for subscripted values
is_supported_by checks whether a given restriction
can be supported by some index.

Currently when a subscripted value, e.g `m[1]` is encountered,
we ignore the fact that there is a subscript and ask
whether an index can support the `m` itself.

This looks like unintentional behaviour leftover
from the times when column_value had a sub field,
which could be easily forgotten about.

Scylla doesn't support indexes on collection elements at all,
so simply returning false there seems like a good idea.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #10227
2022-03-15 20:19:33 +02:00
Mikołaj Sielużycki
a8cb7bf677 readers: Make result_collector use queue reader handle v2.
Transitively modifies streaming_virtual_table::as_mutation_source,
along with the tests.

Closes #10223
2022-03-15 17:02:28 +02:00
Botond Dénes
15a44805ba tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/
It more accurately describes what the role of this flag actually is.
2022-03-15 14:52:20 +02:00
Botond Dénes
d10700a83d tools/scylla-sstable: add decompress operation
Useful when one wants to manually check the content of a compressed
sstable.
2022-03-15 14:52:20 +02:00
Botond Dénes
488d145dc8 tools/scylla-sstables: add validate-checksums operation
Useful for determining whether sstables have been corrupted by factors
outside of scylla, e.g. the I/O subsystem.
2022-03-15 14:52:20 +02:00
Botond Dénes
ddf9dee9d8 sstables/sstable: add validate_checksums()
Sstables have two kind of checksums: per-chunk checksums and
full-checksum (digest) calculated over the entire content of Data.db.

The full-checksum (digest) is stored in Digest.crc
(component_type::Digest).

When compression is used, the per-chunk checksum is stored directly
inside Data.db, after each compressed chunk. These are validated on
read, when decompressing the respective chunks.
When no compression is used, the per-chunk checksum is stored separately
in CRC.db (component_type::CRC). Chunk size is defined and stored in said
component as well.

In both compressed and uncompressed sstables, checksums are calculated
on the data that is actually written to disk, so in case of compressed
data, on the compressed data.

This method validates both the full checksum and the per-chunk checksum
for the entire Data.db.
2022-03-15 14:52:15 +02:00
Botond Dénes
bf335c9e7a sstables/sstable: add raw_stream option to data_stream()
Optionally provide access to the underlying data as-is, without
decompression.
2022-03-15 14:47:27 +02:00
Botond Dénes
9bc80b42cd sstables/sstable: make data_stream() and data_read() public 2022-03-15 14:42:45 +02:00
Botond Dénes
7a75862570 utils/exceptions: add maybe_rethrow_exception()
Helps with the common coroutine exception-handling idiom:

    std::exception_ptr ex;
    try {
        ...
    } catch (...) {
        ex = std::current_exception();
    }

    // release resource(s)
    maybe_rethrow_exception(std::move(ex));

    return result;
2022-03-15 14:42:45 +02:00
Botond Dénes
61028ad718 evicatble_reader: avoid preemption pitfall around waiting for readmission
Permits have to wait for re-admission after having been evicted. This
happens via `reader_permit::maybe_wait_readmission()`. The user of this
method -- the evictable reader -- uses it to re-wait admission when the
underlying reader was evicted. There is one tricky scenario however,
when the underlying reader is created for the first time. When the
evictable reader is part of a multishard query stack, the created reader
might in fact be a resumed, saved one. These readers are kept in an
inactive state until actually resumed. The evictable reader shares it
permit with the to-be-resumed reader so it can check whether it has been
evicted while saved and needs to wait readmission before being resumed.
In this flow it is critical that there is no preemption point between
this check and actually resuming the reader, because if there is, the
reader might end up actually recreated, without having waited for
readmission first.
To help avoid this situation, the existing `maybe_wait_readmission()` is
split into two methods:
* `bool reader_permit::needs_readmission()`
* `future<> reader_permit::wait_for_readmission()`

The evictable reader can now ensure there is no preemption point between
`needs_readmission()` and resuming the reader.

Fixes: #10187

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220315105851.170364-1-bdenes@scylladb.com>
2022-03-15 14:37:22 +02:00
Avi Kivity
39504dea61 Merge "Convert result builders to v2" from Botond
"
Namely the query result writer and the reconcilable result builder, used
for building results for regular queries and mutation queries (used in
read repair) respectively.
With this, there are no users left for the v1 output of the compactor,
so we remove that, making the compactor v2 all-the-way (and simpler).
This means that for regular queries, a downgrade phase is eliminated
completely, as regular queries don't store range tombstone in their
result, so no need to convert them.

Tests: unit(dev, release, debug)
"

* 'result-builders-v2/v1' of https://github.com/denesb/scylla:
  reconcilable_result_builder: remove v1 support
  query_result_builder: remove v1 support
  mutation_compactor: drop v1 related code-paths
  mutation_compactor: drop v1 support altogether from the API
  tree: migrate to the v2 consumer APIs
  test/boost/mutation_test: remove v1 specific test code
  querier: switch to v2 compactor output
  reconcilable_result_builder: add v2 support
  query_result_writer: add v2 support
  query_result_builder: make consume(range_tombstone) noop
2022-03-15 14:32:58 +02:00
Benny Halevy
70e1fdb0c8 replica/database: truncate: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-15 14:02:35 +02:00
Benny Halevy
5ca45b5c32 replica/database: truncate: temporarily disable compaction on table and views before flush
Flushing the base table triggers view building
and corresponding compactions on the view tables.

Temporarily disable compaction on both the base
table and all its view before flush and snapshot
since those flushed sstables are about to be truncated
anyway right after the snapshot is taken.

This should make truncate go faster.

Refs #6309

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-15 14:02:29 +02:00
Mikołaj Sielużycki
7ce0d380d4 readers: Update tests to use make_queue_reader_v2.
Closes #10220
2022-03-15 13:56:50 +02:00
Benny Halevy
20fcf05586 replica/database: truncate: coroutinize per-view logic
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-15 12:51:20 +02:00
Benny Halevy
c6d72a1814 replica/database: open-code truncate_view in truncate
truncate-views is called only internally from database::truncate.

Next step will be to disable compactions on the base
table and view before flush and snapshot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-15 12:47:46 +02:00
Benny Halevy
613e65e5d0 replica/database: truncate: coroutinize run_with_compaction_disabled lambda
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-15 12:47:13 +02:00
Benny Halevy
0318f48110 replica/database: coroutinize truncate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-15 12:46:41 +02:00
Piotr Dulikowski
5d7b2c6515 utils/result_try: prevent exceptions from being caught multiple times
The `result_try` and `result_futurize_try` are supposed to handle both
failed results and exceptions in a way similar to a try..catch block.
In order to catch exceptions, the metaprogramming machinery invokes the
fallible code inside a stack of try..catch blocks, each one of them
handling one exception. This is done instead of creating a single
try..catch block, as to my knowledge it is not possible to create
a try..catch block with the number of "catch" clauses depending on a
variadic template parameter pack.

Unfortunately, a "try" with multiple "catches" is not functionally
equivalent to a "try block stack". Consider the following code:

    try {
        try {
            return execute_try_block();
        } catch (const derived_exception&) {
            // 1
        }
    } catch (const base_exception&) {
        // 2
    }

If `execute_try_block` throws `derived_exception` and the (1) catch
handler rethrows this exception, it will also be handled in (2), which
is not the same behavior as if the try..catch stack was "flat".

This causes wrong behavior in `result_try` and `result_futurize_try`.
The following snippet has the same, wrong behavior as the previous one:

    return utils::result_try([&] {
        return execute_try_block();
    },  utils::result_catch<derived_exception>([&] (const auto&& ex) {
        // 1
    }), utils::result_catch<base_exception>([&] (const auto&& ex) {
        // 2
    });

This commit fixes the problem by adding a boolean flag which is set just
before a catch handler is executed. If another catch handler is
accidentally matched due to exception rethrow, the catch handler is
skipped and exception is automatically rethrown.

Tests: unit(dev, debug)

Fixes: #10211

Closes #10216
2022-03-15 11:42:42 +02:00
Benny Halevy
e5538cf52e test: mutation_write_test: test_timestamp_based_splitting_mutation_writer: no need to downgrade reader to v1
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220315083425.2786228-2-bhalevy@scylladb.com>
2022-03-15 11:41:11 +02:00
Benny Halevy
90edddd7e3 everywhere: use make_flat_mutation_reader_from_mutations_v2
Rather than upgrade_to_v2(make_flat_mutation_reader_from_mutations)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220315083425.2786228-1-bhalevy@scylladb.com>
2022-03-15 11:41:10 +02:00
Benny Halevy
297a37f640 compaction_manager: add disable_compaction method
Returns a RAII class compaction_reenabler
that conditionally reenables compaction
for the given table when destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-15 11:00:49 +02:00
Nadav Har'El
189ff5414f test/cql-pytest: implement test_tools.py without run-script cooperation
In commit afab1a97c6, we added
test_tools.py - tests for the various tools embedded in the Scylla
executable. These tests need to know where the Scylla executable is,
and also where its sstables are stored. For this, the commit added two
test parameters - "--scylla-path" and "--workdir" - with which the
"run" script communicated this knowledge to the test.

However, that implementation meant that these tests only work if the
test was run via the test/cql-pytest/run script - they won't work if
the user ran Scylla/pytest manually, or through some other script not
passing these options.

This patch drops the "--scylla-path" and "--workdir" parameters, and
instead the test figures out this information on its own:

1. To find the Scylla executable, we begin by looking (using the
   local_process_id(cql) function from the previous patch) for a
   local process which listens to our CQL connection, and then find
   the executable's path using /proc.

2. To find the Scylla data directory (which is what we really need, not
   workdir which is just a shortcut to set all directories!), we
   retrieve this configuration from the system.config table through CQL.

I tested that test_tools.py now works not only through test/cql-pytest/run
but also if I run Scylla manually and then run "pytest test_tools.py"
without any extra parameters.

Fixes #10209

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220314151125.2737815-2-nyh@scylladb.com>
2022-03-14 20:25:22 +02:00
Nadav Har'El
8ed0909cc3 test/cql-pytest: add mechanism and example of testing Scylla log messages
Generally, cql-pytest tests do not, and *should not* rely on looking up
messages in the Scylla log file: Relying on such messages makes it
impossible to run the same test against Cassandra or even a remotely-
installed Scylla, and the tests tend to break when logging (which is not
considered part of our API) changes. Moreover, usually what our dtests
achieve by looking at the log - e.g., figuring out when some event has
happened - can be achieved through official CQL APIs, and this is what
normal users do anyway (users don't normally dig through the log to
figure out when their operation completed).

However, sometimes we do want to write a test to confirm that during a
certain operation, a certain log message gets written to Scylla's log.
A desire to do this was raised by @fruch and @soyacz, so in this patch
I provide a mechanism to do this, and a trivial example - which checks
that a "Creating ..." message appears on the log whenever a table is
created, and "Dropping ..." when the table is deleted.

As is explained in detail in patches in the comment, Scylla's log file
is found automatically, without relying on Scylla's runner (such as
the script test/cql-pytest/run) communicating to the test where the log
file is. If the log file can't be found - e.g., we're testing a remote
Scylla, or if this isn't Scylla, the tests are skipped.

I would like all logfile-testing tests to be in the same file,
test_logs.py. As I explained above, I think it is a mistake for general
tests to check the log file just because they can. I think that the only
tests that should use the log file are tests deliberately written to
check what gets logged - and those can be collected in the same file.

As part of this patch, we add the utility function local_process_id(cql)
to find (if we can) the local process which listens to the connection
"cql". This utility function will later be useful in more places - for
example test_tools.py needs to find Scylla's executable.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220314151125.2737815-1-nyh@scylladb.com>
2022-03-14 20:25:20 +02:00
Lukasz Sojka
c65f1c3b47 test/cql-pytest: add warnings test
cql client should return warnings when batch exceedes certain size.
This test verifies if response contains them.

Test covers issue: https://github.com/scylladb/scylla/issues/10196

Signed-off-by: Lukasz Sojka <lukasz.sojka@scylladb.com>

Closes #10197
2022-03-14 19:49:06 +02:00
Benny Halevy
37dc31c429 api: storage_service: force_keyspace_compaction: compact one table at a time
To make major compaction more resilient to low-
disk space conditions, 342bfbd65a
sorted the tables based on their live disk space used.

However, each shard still makes progress in its own pace.

This change serializes major compaction between tables
so we still compact in parallel on all shards, but one
(distributed) table at a time.

As a follow-up, we can consider serializing even at the single shard
level when disk space is critically low, so we can't even risk
parallel compaction across all shards.

Refs scylladb/scylla-dtest#2653

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220313153814.2203660-1-bhalevy@scylladb.com>
2022-03-14 15:39:23 +02:00
Raphael S. Carvalho
1a2332a0ba compaction: Move release_exhausted out of the compaction descriptor
With compact_sstables() now living in compaction_manager::task,
release_exhausted no longer has to live inside compaction_descriptor,
which is a good direction because implementation detail is being
removed from the interface.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311023410.250149-2-raphaelsc@scylladb.com>
2022-03-14 15:39:23 +02:00
Raphael S. Carvalho
fce9d869b4 compaction: Move table::compact_sstables() into compaction manager
Table submits compaction request into manager, which in turn calls
back table to run the compaction when the time has come, i.e.:
table -> compaction manager -> table -> execute compaction

But manager should not rely on table to run compaction, as compaction
execution procedure sits one layer below the manager and should be
accessed directly by it, i.e:

table -> compaction manager -> execute compaction

This makes code easier to understand and update_compaction_history()
can now be noop for unit tests using table_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311023410.250149-1-raphaelsc@scylladb.com>
2022-03-14 15:39:23 +02:00
Botond Dénes
964d9e033d Merge "raft_group_registry: drain_on_shutdown" from Benny Halevy
"
This series hardens raft_group_registry::stop_servers
and uses it to drain_on_shutdown, called before
the database is stopped in cql_test_env.
(Not needed for main).

raft_group_registry deferred_stop is introduced right after
the service is started to make sure it's properly stopped
even if there's an exception at any point while starting.

Test: unit(dev)
"

* tag 'raft_group_registry-drain_on_shutdown-v1' of https://github.com/bhalevy/scylla:
  cql_test_env: raft_group_registry::drain_on_shutdown before stopping the database
  raft_group_registry: harden stop_servers
  raft_group_registry: delete unused _shutdown_gate
2022-03-14 14:22:46 +02:00
Avi Kivity
e7fb71020b Merge 'replica: Optimize empty_flat_reader out of the hot path' from Michał Chojnowski
When row_cache::make_reader() and memtable::make_flat_reader() see that the query result is empty, they return empty_flat_reader, which is a trivial implementation of flat_mutation_reader.
Even though empty_flat_reader doesn't do anything meaningful, it still needs to be created, handled in merging_reader and destroyed. Turns out this is costly.

This patch series replaces hot path uses of empty_flat_reader with an empty optional.

Performance effects:

`perf_simple_query --smp 1`
TPS: 138k -> 168k
allocs/op: 80.2 -> 71.1
insns/op: 49.9k -> 45.1k

`perf_simple_query --smp 1 --enable-cache=1 --flush`
TPS: 125k -> 150k
allocs/op: 79.2 -> 71.1
insns/op: 51.7k -> 47.2k

For a cassandra-stress benchmark (localhost, 100% cache reads) this translates to a TPS increase from ~42k to ~48k per hyperthread.

Note that this optimization is effective for single-partition reads where the queried partition is only in cache/sstables or only in memtables. Other queries (e.g. where the partition is in both cache in memtables and needs to be merged) are unaffected.

Closes #10204

* github.com:scylladb/scylla:
  replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader()
  row_cache: Add row_cache::make_reader_opt()
  replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader()
  memtable: Add memtable::make_flat_reader_opt()

[avi: adjust #include for readers/ split]
2022-03-14 14:07:00 +02:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Benny Halevy
26b1be0b8f test: lib: random_mutation_generator: accept optional random seed
Provide an easy way to instrument a particular test case to use
a given random number seed (that's curretly already printed to
the test log).

Refs #5349

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210907114537.3464004-1-bhalevy@scylladb.com>
2022-03-14 13:09:36 +02:00
Michał Chojnowski
83efb508d6 replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader()
The former is significantly cheaper when there is nothing to be read.
2022-03-14 12:02:49 +01:00
Michał Chojnowski
6c6519a909 row_cache: Add row_cache::make_reader_opt() 2022-03-14 12:02:49 +01:00
Michał Chojnowski
f211ef9d71 replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader()
The former is significantly cheaper when there is nothing to be read.
2022-03-14 12:02:49 +01:00
Michał Chojnowski
218f2b6e98 memtable: Add memtable::make_flat_reader_opt()
When there is nothing to read, make_flat_reader() returns an empty (no-op)
reader. But it turns out that constructing, combining and destroying that
empty reader is quite costly.
As an optimization, add an alternative version which returns an empty optional
instead.
2022-03-14 12:02:49 +01:00
Benny Halevy
8481852c91 cql_test_env: raft_group_registry::drain_on_shutdown before stopping the
database

We're currently stopping raft_gr before
shutting the database down, but we fail to do that if
anything goes wrong before that, e.g. if
distributed_loader::init_non_system_keyspaces fails.

This change splits drain_on_shutdown out of stop()
to stop the raft groups before the database is stopped
and does the rest in a deferred_stop placed right
after the rafr_gr registry is strated.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-14 11:49:44 +02:00
Benny Halevy
ac307d6a62 raft_group_registry: harden stop_servers
stop_servers should never fail since it's called on
the shutdown path.

Use a local gate in stop_servers() to wait on all
background raft group server aborts.

Also, handle theoretical exceptions from server::abort()
to guarantee success.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-14 11:49:44 +02:00
Benny Halevy
ab30feb71d raft_group_registry: delete unused _shutdown_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-14 11:49:44 +02:00
Piotr Dulikowski
2415a1d169 abstract_read_resolver: bring back cancelling timeout timer on read failure
Recent PR #10092 (propagating read timeouts on coordinator without
throwing) accidentally removed a line which cancelled
`abstract_read_resolver`'s `_timeout` timer after a read failure.
Because of that, it might happen that after a read failure the timer is
triggered and the `_done_promise` is set twice which triggers an assert
in seastar.

This commit brings back the line which cancels the timeout timer.

Fixes: #10193

Closes #10206
2022-03-14 09:43:32 +01:00
Nadav Har'El
383aa326cc cql-pytest: translate Cassandra's tests for BATCH operations
This is a translation of Cassandra's CQL unit test source file
validation/operations/BatchTest.java into our our cql-pytest framework.

This test file includes 13 tests for various types of BATCH operations.
All tests pass on Scylla - no known or new bugs were reproduced.

Two of the tests involve very slow testing of TTLs, so after verifying
they work I marked them "skip" for now (we can always turn them on later,
perhaps after reducing the length or number of the sleeps).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220313121634.2611423-1-nyh@scylladb.com>
2022-03-14 09:43:02 +01:00
Piotr Sarna
83ec505fab cql3: add tracing indexed aggregate queries
Commit 1c99ed6ced added tracing logs
about the index chosen for the query, but aggregate queries have
a separate code path, which wasn't taken into account.
After this patch, tracing for aggregate queries also includes
this additional information.

Closes #10195
2022-03-11 15:27:03 +02:00
Raphael S. Carvalho
67a7b7a3f4 compaction: rename interrupt() to a descriptive name
interrupt() makes it sound like it's interrupting the compaction, but it's
actually called *on* interrupt, to handle the interrupt scenario.
Let's rename it to on_interrupt().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311000128.189840-1-raphaelsc@scylladb.com>
2022-03-11 10:16:34 +02:00
Botond Dénes
0632114a9b reconcilable_result_builder: remove v1 support
Amounts to making the range tombstone consume() overload private. It is
still used internally to consume the downgraded (from v2) range
tombstones.
2022-03-11 09:24:46 +02:00
Botond Dénes
21584262be query_result_builder: remove v1 support
Amounts to dropping (the noop) range tombstone consume() overload.
2022-03-11 09:24:17 +02:00
Botond Dénes
279682056d mutation_compactor: drop v1 related code-paths 2022-03-11 09:24:05 +02:00
Botond Dénes
924ff6a503 mutation_compactor: drop v1 support altogether from the API
Fully mechanical change. Drop all v1 types, template types. Internal
code is left unchanged, will be made v2 only in the next patch.
2022-03-11 09:24:05 +02:00
Botond Dénes
87ac2e9ab0 tree: migrate to the v2 consumer APIs 2022-03-11 09:24:05 +02:00
Botond Dénes
eacdfb2cb7 test/boost/mutation_test: remove v1 specific test code
From test_compactor_range_tombstone_spanning_many_pages, preparing for
the retirement of the v1 output of the compactor.
2022-03-11 09:24:05 +02:00
Botond Dénes
0b5217052d querier: switch to v2 compactor output
The change is mostly mechanical: update all compactor instances to the
_v2 variant and update all call-sites, of which there is not that many.
As a consequence of this patch, queries -- both single-partition and
range-scans -- now do the v2->v1 conversion in the consumers, instead of
in the compactor.
2022-03-11 09:24:05 +02:00
Botond Dénes
4629f7d7b5 reconcilable_result_builder: add v2 support
Add a `consume()` overload for range tombstone changes and convert them
internally to range tombstones, as the underlying reconcilable result
is still v1.
2022-03-11 09:24:05 +02:00
Botond Dénes
728c14549f query_result_writer: add v2 support
Add a consume() overload which takes a range tombstone change and drops
it just like the existing range tombstone overload does: query results
don't care about range tombstones.
2022-03-11 09:22:14 +02:00
Botond Dénes
d61f934c50 query_result_builder: make consume(range_tombstone) noop
The downstream consumer (mutation_querier) already ignores range
tombstones, so no point forwarding them to it. This makes adding v2
support easier too as range tombstone changes can be similarly dropped.
2022-03-11 08:39:12 +02:00
Michał Sala
c8413631af forward_service: change implicit lambda capture list to explicit one
Changing the capture list of a lambda in
forward_service::execute_on_this_shard from [&] to an explicit one
enables grater readability and prevents potential bugs.

Closes #10191
2022-03-10 17:30:06 +02:00
Botond Dénes
e9ba8ad43a Merge "Configure gossiper the "classical" way" from Pavel Emelyanov
The services' configuration should be performed with the help of
service-specific config that's filled by the service creator. This
is not the case for gossiper that grabs the db::config and keeps
reference on it throughout its lifetime.

This set brings the gossiper configuration to the described form
by putting the needed config bits onto gossip_config (that already
exists and is partially used for gossiper configuration). And two
live-updateable options need extra care.

tests: unit(dev), dtest.simple_boot_shutdown(dev)

* 'br-gossiper-no-db-config' of https://github.com/xemul/scylla:
  gossiper: Remove db::config reference from gossiper
  gossiper: Keep live-updateable options on gossiper
  gossiper: Keep immutable options on gossip_config
2022-03-10 16:35:41 +02:00
Botond Dénes
ab440e1a07 mutation_writer: drop now unused v1 variants of bucket_writer feed_writer()
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220302145945.189607-2-bdenes@scylladb.com>
2022-03-10 15:20:07 +02:00
Botond Dénes
108d921fc9 mutation_writer: partition_based_splitting_writer: convert implementation to v2
Although its API was long converted to v2, its implementation stayed v1
because the memtable and mutation API were still v1. Now that the
memtable flush returns a v2 reader we can have a second look at
converting this. While the mutation API still uses v1, this can easily
be worked around by using going through `mutation_rebuilder_v2`.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220302145945.189607-1-bdenes@scylladb.com>
2022-03-10 15:20:07 +02:00
Botond Dénes
7e0b51ff23 Merge 'Overhaul compaction_manager::task' from Benny Halevy
The series overhauls the compaction_manager::task design and implementation
by properly layering the functionality between the compaction_manager
that deals with generic task execution, and the per-task business logic that is defined
in a set of classes derived from the generic task class.

While at it, the series introduces `task::state` and a set of helper functions to manage it
to prevent leaks in the statistics, fixing #9974.

Two more stats counter were exposed: `completed_tasks` and a new `postponed_tasks`.

Test: sstable_compaction_test
Dtest: compaction_test.py compaction_additional_test.py

Fixes #9974

Closes #10122

* github.com:scylladb/scylla:
  compaction_manager: use coroutine::switch_to
  compaction_manager::task: drop _compaction_running
  compaction_manager: move per-type logic to derived task
  compaction_manager: task: add state enum
  compaction_manager: task: add maybe_retry
  compaction_manager: reevaluate_postponed_compactions: mark as noexcept
  compaction_manager: define derived task types
  compaction_manager: register_metrics: expose postponed_compactions
  compaction_manager: register_metrics: expose failed_compactions
  compaction_manager: register_metrics: expose _stats.completed_tasks
  compaction: add documentation for compaction_type to string conversions
  compaction: expose to_string(compaction_type)
  compaction_manager: task: standardize task description in log messages
  compaction_manager: refactor can_proceed
  compaction_manager: pass compaction_manager& to task ctor
  compaction_manager: use shared_ptr<task> rather than lw_shared_ptr
  compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once
  compaction_manager: use compaction_state::lock only to synchronize major and regular compaction
2022-03-10 13:33:56 +02:00
Benny Halevy
5e1fda7e1d compaction_manager: use coroutine::switch_to
Saving an allocation for running the functor
as a task in the switched-to scheduling group.

Also, switch to the desired scheduling group at
the beginning of the task so that the higher level logic,
like getting the list of sstables to compact
will be performed under the desired scheduling group,
not only the compaction code itself.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 12:20:01 +02:00
Benny Halevy
8c66916652 compaction_manager::task: drop _compaction_running
Replace the _compaction_running boolean member
by calculating _state == state::active
now that setup_new_compaction switches state to
`active`

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 12:20:01 +02:00
Benny Halevy
a2a5e530f0 compaction_manager: move per-type logic to derived task
Move the business logic into the task specific classes.
Separating initialization during task construction,
from the compaction_done task, moved into
a do_run() method, and in some cases moving
a lambda function that was called per table (as in
rewrite_sstables) into a private method of the
derived class.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 12:20:01 +02:00
Benny Halevy
2e6ce43a97 compaction_manager: task: add state enum
Add an enum class representing the task state machine
and a switch_state function to transition between the states
and update the corresponding compaction_manager stats counters.

Refs #9974

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 12:19:59 +02:00
Mikołaj Sielużycki
5920349357 row_cache: Make row_cache reader from sstables compacting.
Reading data from sstables without compacting first puts
unnecessary pressure on the cache. The mutation streams
need to be resolved anyway before passing to subsequent
consumers, so it's better to do it as close to the
source as possible.

Fixes: #3568

Closes #10188
2022-03-10 11:40:10 +02:00
Benny Halevy
9c59d66b7e compaction_manager: task: add maybe_retry
Replacing and combining compaction_manager methods:
maybe_stop_on_error and put_task_to_sleep.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 11:35:37 +02:00
Benny Halevy
ee32be3aa5 compaction_manager: reevaluate_postponed_compactions: mark as noexcept
To simplify error handling in following patches
that will coroutinize task logic.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 11:35:37 +02:00
Benny Halevy
72162ed653 compaction_manager: define derived task types
Turn task into a class, defining a clear hierarchy
of private, protected, and public methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 11:35:35 +02:00
Avi Kivity
2f967f84d4 Merge "Migrate sstable writer to v2" from Botond
"
This patch-set converts the sstable writer to v2, then prepares the
ground for users actually being able to use the v2 variant. Finally it
converts all users to do so and then decommissions the v1 variant.
For users to be able to use the v2 writer API, we first have to add a v2
output to the compactor first, as some users write to sstables via the
compactor.

Tests: unit(dev, release)
"

* 'sstable-writer-v2/v2' of https://github.com/denesb/scylla:
  sstables/sstable: remove now unused v1 write_components() variant
  mutation_compactor: remove now unused compact_for_compaction
  test/boost/mutation_test: migrate to compact_for_mutation_v2
  streaming: migrate to v2 variant of sstable writer API
  memtable-sstable: migrate to v2 variant of sstable writer API
  test: migrate to the v2 variant of the sstable writer API
  sstables/sstable: expose v2 variant of write_components()
  sstables: convert mx writer to v2
  sstables/metadata_collector: use position_in_partition for min/max keys
  test/boost/mutation_test: test_compactor_range_tombstone_spanning_many_pages extend to check v2 output too
  mutation_reader: convert compacting reader v2
  mutation_compactor: add v2 output
  mutation_compactor: make _last_clustering_pos track last input
  range_tombstone_change: add set_tombstone()
  test/lib/mutation_source_test: log name of each run_mutation_source()
2022-03-10 09:45:57 +02:00
Botond Dénes
2e0610e459 sstables/sstable: remove now unused v1 write_components() variant
Supplanted by the v2 variant.
2022-03-10 09:16:33 +02:00
Botond Dénes
4e97477281 mutation_compactor: remove now unused compact_for_compaction 2022-03-10 09:16:33 +02:00
Botond Dénes
32e9809e9c test/boost/mutation_test: migrate to compact_for_mutation_v2 2022-03-10 09:16:33 +02:00
Botond Dénes
06e6bb6ec9 streaming: migrate to v2 variant of sstable writer API 2022-03-10 09:16:33 +02:00
Botond Dénes
d8fec08468 memtable-sstable: migrate to v2 variant of sstable writer API 2022-03-10 09:16:33 +02:00
Botond Dénes
959483a2dc test: migrate to the v2 variant of the sstable writer API 2022-03-10 09:16:33 +02:00
Benny Halevy
37694422dc compaction_manager: register_metrics: expose postponed_compactions
Provide a metric counting the number of tables
with postponed compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:18 +02:00
Benny Halevy
089d4442d8 compaction_manager: register_metrics: expose failed_compactions
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:18 +02:00
Benny Halevy
8081f951d0 compaction_manager: register_metrics: expose _stats.completed_tasks
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:18 +02:00
Benny Halevy
ffc314d506 compaction: add documentation for compaction_type to string conversions
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:18 +02:00
Benny Halevy
28a74a2e90 compaction: expose to_string(compaction_type)
To be used in the next patch to generate
a string dscription from the compaction_type.

In theory, we could use compaction_name()
btu the latter returns the compaction type
in all-upper case and that is very different from
what we print to the log today.  The all-upper
strings are used for the api layer, e.g. to
stop tasks of a particular compaction type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:18 +02:00
Benny Halevy
20a8609392 compaction_manager: task: standardize task description in log messages
Define task::describe and use it via operator<<
to print the task metadata to the log in a standard way.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:18 +02:00
Benny Halevy
59863b317f compaction_manager: refactor can_proceed
Move the task-internal parts of can_proceed
to a respective compaction_manager::task method,
preparing for turning it into a class with
a proper hierarchy of access to private members.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:17 +02:00
Benny Halevy
33b2731a4a compaction_manager: pass compaction_manager& to task ctor
And use it to get the compaction state of the
table to compact.

It will be used in a later patch
to manage the task state from task
methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:17 +02:00
Benny Halevy
20067b1050 compaction_manager: use shared_ptr<task> rather than lw_shared_ptr
Prepare for defining per compaction type tasks
derived from compaction_manager::task.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:17 +02:00
Benny Halevy
cb2403e917 compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once
Like all other maintenance operations, acquire the _maintenance_ops_sem
once for the whole task, rather than for each sstable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:17 +02:00
Benny Halevy
d0f693a517 compaction_manager: use compaction_state::lock only to synchronize major and regular compaction
Maintenance operations like cleanup, upgrade, reshape, and reshard
are serialized serialized with major compaction using the _maintenance_ops_sem
and they need no further synchronization with regular compaction
by acquiring the per-table read lock..

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-10 08:39:17 +02:00
Botond Dénes
fed5b73147 sstables/sstable: expose v2 variant of write_components()
In parallel to the existing v1 one. In the next patches we start
migrating users to the v2 variant incrementally and finally remove the
v1 variant.
2022-03-10 07:03:49 +02:00
Botond Dénes
105bf8888a sstables: convert mx writer to v2
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);

(1) directly exposes the writer type, so we have to update all users of
it (there is not that many) in this same patch. We defer updating
users of (2) to a follow-up commits.
2022-03-10 07:03:49 +02:00
Botond Dénes
11adb404c6 sstables/metadata_collector: use position_in_partition for min/max keys
Instead of naked clustering keys. Working with the latter is dangerous
because it cannot accurately represent the entire clustering domain: it
cannot represent positions between (before/after) keys. For this reason
the metadata collector had a separate update_min_max_components()
overload for range tombstones because the positions of these cannot be
represented by clustering keys alone.
Moving to position_in_partition solves this problem and it is now enough
to have a single overload with position_in_partition_view. This is also
more future proof as it will work with range tombstone changes without
any additional changes.
2022-03-10 07:03:49 +02:00
Botond Dénes
2057db54ab test/boost/mutation_test: test_compactor_range_tombstone_spanning_many_pages extend to check v2 output too 2022-03-10 07:03:49 +02:00
Botond Dénes
7a37e30310 mutation_reader: convert compacting reader v2
Its input was already a v2 reader, now itself is also a v2 reader.
With this commit, compaction.cc is finally v2 all-the-way.
2022-03-10 07:03:46 +02:00
Botond Dénes
ad435dcf57 mutation_compactor: add v2 output
The output version is selected via compactor_output_format, which is a
template parameter of `compact_mutation_state` and all downstream types.
This is to ensure a compaction state created to emit a v2 stream will
not be accidentally used with a v1 consumer.
When using a v2 output, the current active tombstone has to be tracked
separately for the regular and for the gc consumer (if any), so that
each can be closed properly on EOS. The current effective tombstone is
tracked separately from these two. The reason is that purged tombstones
are still applied to data, but are not emitted to the regular consumer.
2022-03-10 06:46:46 +02:00
Botond Dénes
1ccaeb2a1a mutation_compactor: make _last_clustering_pos track last input
Instead of updating _last_clustering_pos whenever a clustering fragment
is pushed to the consumers, we now update it whenever a clustering
fragment enters the compactor. Not only is this much more robust, but it
also makes more sense. Just because a range tombstone is purged (and
therefore the consumer doesn't see it), it still moves the logical
clustering position in the stream. Also, tracking the input side avoids
any ambiguity related to cases where we have two consumers (regular + gc
consumer).
2022-03-10 06:46:46 +02:00
Botond Dénes
b2b6f03a5d range_tombstone_change: add set_tombstone() 2022-03-10 06:46:46 +02:00
Botond Dénes
6544da342a test/lib/mutation_source_test: log name of each run_mutation_source()
Although we have a log in run_mutation_reader_tests(), it is useful to
know where it was called from, when trying to find the test scenario
that failed.
2022-03-10 06:46:46 +02:00
Avi Kivity
a4756334ce Merge "tools/scylla-types: improve documentation" from Botond
"
Add per-action help content for each action. Main description now points
to these for more details.
"

* 'scylla-types-improvements/v1' of https://github.com/denesb/scylla:
  tools/types: update main description
  tools/scylla-types: per-action help content
  tools/scylla-types: description: remove -- from action listing
  tools/scylla-types: use fmt::print() instead of std::cout <<
2022-03-09 19:37:00 +02:00
Avi Kivity
e1c326a5ba Merge "Convert multishard writer to v2" from Botond
"
Also convert the foreign_reader used by it in the process.

Tests: unit(dev)
"

* 'multishard-writer-v2/v1' of https://github.com/denesb/scylla:
  mutation_writer/multishard_writer: remove now unused v1 factory overloads
  test/boost/mutation_writer_test: test the v2 variant of distribute_reader_and_consume_on_shards()
  flat_mutation_reader: add v2 variant of make_generating_reader()
  mutation_reader: multishard_writer: migrate implementation to v2
  mutation_reader: convert foreign_reader to v2
  streaming/consumer: convert to v2
  mutation_writer/multishard_writer: add v2 variant of distribute_reader_and_consume_on_shards()
2022-03-09 19:28:05 +02:00
Avi Kivity
fbb3904b67 Merge 'query: transform asserts into on_internal_error in forward_result::merge' from Michał Sala
It was a suggestion from @psarna, done to get more info about the abort from #10174.

Closes #10185

* github.com:scylladb/scylla:
  query: do not assert in `operator<<(ostream&, const forward_result::printer&)`
  query: transform asserts into on_internal_error in forward_result::merge
2022-03-09 16:30:32 +02:00
Tomasz Grabiec
8fa704972f loading_cache: Make invalidation take immediate effect
There are two issues with current implementation of remove/remove_if:

  1) If it happens concurrently with get_ptr(), the latter may still
  populate the cache using value obtained from before remove() was
  called. remove() is used to invalidate caches, e.g. the prepared
  statements cache, and the expected semantic is that values
  calculated from before remove() should not be present in the cache
  after invalidation.

  2) As long as there is any active pointer to the cached value
  (obtained by get_ptr()), the old value from before remove() will be
  still accessible and returned by get_ptr(). This can make remove()
  have no effect indefinitely if there is persistent use of the cache.

One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinaotr will fail
to process the value with the following exception:

  User Defined Type value contained too many fields (expected 5, got 6)

The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.

The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
avaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.

Fixes #10117

Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
2022-03-09 16:13:07 +02:00
Michał Sala
538cff651e query: do not assert in operator<<(ostream&, const forward_result::printer&)
Printing invalid forward_result should not cause Scylla to stop.
2022-03-09 14:58:11 +01:00
Michał Sala
51362e4e5e query: transform asserts into on_internal_error in forward_result::merge
It was done to show more context in case of forward_result::merge
arguments size mismatch and also to prevent aborts caused by another
nodes sending malformed data.
2022-03-09 14:58:11 +01:00
Nadav Har'El
397dd64dea test/cql-pytest: avoid "run" warnings caused by pytest bug
This patch gets rid annoying pytest configuration warnings when running
test/cql-pytest/run. These started to happen after commit
afab1a97c6, due to a pytest bug:

In that commit, we added new "--scylla-path" and "--workdir" parameters
to our pytest tests, and test/cql-pytest/run started passing them,
and test/cql-pytest/run sometest runs pytest as:

    pytest --host something --workdir somedir --scylla-path somepath sometest

Pytest wants to find a configuration file (pytest.ini or tox.ini) in the
directory where the tests live, but its logic to find that directory is
buggy: It (_pytest/config/findpaths.py::determine_setup()) looks at
the command line for directory names, and looks for config files in
these directories or any of their parents. It ignores parameters
beginning with "-", but in our case the various arguments like
"--scylla-path" are each followed by another option, and this one is
not ignored! So instead of looking for the config file in sometest's
parent directories (and finding test/cql-pytest/pytest.ini), pytest
sees the directory given after "scylla-path", and finds the completely
irrelevant tox.ini there - and uses that, which (depending what you
have installed) can generate warnings.

The solution is to change the run script to use "--scylla-path=..."
as one parameter instead of "--scylla-path ..." as two parameters.
When it's just one parameter, the pytest determine_setup() logic skips
it entirely, and finds just the actual test directory.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220309132726.2311721-1-nyh@scylladb.com>
2022-03-09 15:37:08 +02:00
Nadav Har'El
733672fc54 Merge 'types: fix is_string for reversed types' from Piotr Sarna
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.

Fixes #10183

Closes #10181

* github.com:scylladb/scylla:
  test: add a case for LIKE operator on a descending order column
  types: fix is_string for reversed types
2022-03-09 10:04:42 +02:00
Piotr Sarna
05b66102e9 test: add a case for LIKE operator on a descending order column
This case is a regression test for issue #10181, where it turned out
that a clustering column with descending order is not properly
recognized as a string.
This test case used to fail with:
cassandra.InvalidRequest:
  Error from server: code=2200 [Invalid query]
  message="LIKE is allowed only on string types, which b is not"
...until it got fixed by the previous commit.
2022-03-09 08:56:22 +01:00
Piotr Sarna
0a068cddb1 types: fix is_string for reversed types
Checking if the type is string is subtly broken for reversed types,
and these types will not be recognized as strings, even though they are.
As a result, if somebody creates a column with DESC order and then
tries to use operator LIKE on it, it will fail because the type
would not be recognized as a string.
2022-03-09 08:18:33 +01:00
Benny Halevy
11ea2ffc3c compaction_manager: rewrite_sstables: do not acquire table write lock
Since regular compaction may run in parallel no lock
is required per-table.

We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88045.
But it can be removed entirely in master in a follow-up patch.

This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in #10060, and
possibly #10166.

Fixes #10175

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10177
2022-03-09 09:13:46 +02:00
Nadav Har'El
c8152e78d7 Merge 'CQL3: fromJson accepts string as bool' from Jadw1
The problem was incompatibility with cassandra, which accepts bool
as a string in `fromJson()` UDF. The difference between Cassandra and
Scylla now is Scylla accepts whitespaces around word in string,
Cassandra don't. Both are case insensitive.

Fixes: https://github.com/scylladb/scylla/issues/7915

Closes #10134

* github.com:scylladb/scylla:
  CQL3/pytest: Updating test_json
  CQL3: fromJson accepts string as bool
2022-03-08 16:27:40 +02:00
Avi Kivity
1622995900 Merge 'Allow empty partition keys in views' from Nadav Har'El
Cassandra generally does not allow empty strings as partition keys (note, by the way, that empty strings are allowed as clustering keys, as well as in individual components of a compound partition key).
However, Cassandra does allow empty strings in _regular_ columns - and those regular columns can be indexed by a secondary index, or become an empty partition-key column in a materialized view. As noted in issues #9375 and #9364 and verified in a few xfailing cql-pytest tests, Scylla didn't allow these cases - and this patch series fixes that.

Before the last patch in this series finally enables empty-string partition keys in materialized views, we first need to solve a couple of bugs in our code related to handling empty partition keys:

The first patch fixes issue #10178 - a bug in `key_view::tri_compare()`  where comparing two empty keys returned a random result instead of "equal".

The second patch fixes issue #9352: our tokenizer has an inconsistency where for an empty string key, two variants of the same function return different results:

1. One variant `murmur3_partitioner::get_token(bytes_view key)` returned `minimum_token()` for the empty string.
2. Another variant `murmur3_partitioner::get_token(const schema& s, partition_key_view key)` did not have this special case, and called the normal hash-function calculation on the empty string (the resulting token is 0).

Variant 2 was an unintentional bug, because Cassandra always does what variant does 1. So the "obvious" fix here would be to fix variant 2 to do what variant 1 does. Nevertheless, we decided to do the opposite: Change variant 1 to match variant 2. The reasoning is as follows:

The `minimum_token()` is `token{token::kind::before_all_keys, 0 }` - it's not a real token. Since we intend in this patch allow real data to exist with the empty key, we need this real data to have a real token. For example, this token needs to be located on the token ring (so the empty-key partition will have replicas) and also belong to one of the shards, and it's not clear that `minimum_token()` will be handled correctly in this context.

After changing the token of the empty string to 0, we note that some places in the code assume that `dht::decorated_key(dh
t::minimum_token(), partition_key::make_empty())` is a legal decorated key. However, as far as I can tell, none of these places actually assume that the partition-key part (the `make_empty()`) really matches the token - this decorated key is only used to start an iteration (ignoring this key itself) or to indicate a non-existent key (in modern code `std::optional` should be used for that).

While normally changing the token of a key is a big faux-pas, which can result in old data no longer being readable, in this case this change is safe because:

1. Scylla previously disallowed empty partition keys (in both base tables and views), so we cannot have had such a partition key saved in any sstable.
3. Cassandra does allow empty partition keys in _views_ and _secondary indexes_, but we do not support migrating sstables of those into Scylla - users are expected to only migrate the base table and then re-create the view or index. So however Cassandra writes those empty-key partitions, we don't care.

The third patch finally fixes the materialized views implementation to not drop view rows with an empty-string partition key (#9375). This means we basically revert commit ec8960df45 - which fixed #3262 by disallowing empty partition keys in views, whereas this patch fixes the same problem by handling the empty partition keys correctly.

The fix for the secondary index bug (#9364) comes "for free" because it is based on materialized views.

We already had xfailing test cases for empty strings in materialized views and indexes, and after this series they begin to pass so the "xfail" mark is removed. The series also adds additional test cases that validate additional corner cases discovered during the debugging.

Fixes #9352
Fixes #9364
Fixes #9375
Fixes #10178

Closes #10170

* github.com:scylladb/scylla:
  compound_compat.hh: add missing methods of iterator
  materialized views: allow empty strings in views and indexes
  murmur3: fix inconsistent token for empty partition key
  compound_compat.hh: fix bug iterating on empty singular key
2022-03-08 15:55:55 +02:00
Nadav Har'El
674d3a5a84 compound_compat.hh: add missing methods of iterator
While debugging legacy_compound_view, I noticed that it cannot be used
as a C++20 std::ranges::input_range because it is missing some trivial
methods. So let's fix this, and make the life of future developers a
little bit easier.

The two trivial methods we need to implement:

1. A postfix increment operator. We already had a prefix increment
   operator, but the C++20 concept weakly_iterable also needs postfix.

2. By mistake (this will be corrected in https://wg21.link/P2325R3),
   weakly_iterable also required the default_initialized concept, so
   our iterator type also needs a default constructor.
   We'll never actually use this silly constructor, and when this C++20
   standard mistake is corrected, we can remove this constructor.

After this patch, a legacy_compound_view is accepted for the C++20
ranges::input_range concept.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-08 15:37:03 +02:00
Nadav Har'El
ef43531fb6 materialized views: allow empty strings in views and indexes
Although Cassandra generally does not allow empty strings as partition
keys (note they are allowed as clustering keys!), it *does* allow empty
strings in regular columns to be indexed by a secondary index, or to
become an empty partition-key column in a materialized view. As noted in
issues #9375 and #9364 and verified in a few xfailing cql-pytest tests,
Scylla didn't allow these cases - and this patch fixes that.

The patch mostly *removes* unnecessary code: In one place, code
prevented an sstable with an empty partition key from being written.
Another piece of removed code was a function is_partition_key_empty()
which the materialized-view code used to check whether the view's
row will end up with an empty partition key, which was supposedly
forbidden. But in fact, should have been allowed like they are allowed
in Cassandra and required for the secondary-index implementation, and
the entire function wasn't necessary.

Note that the removed function is_partition_key_empty() was *NOT* required
for the "IS NOT NULL" feature of materialized views - this continues to
work as expected after this patch, and we add another test to confirm it.
Being null and being an empty string are two different things.

This patch also removes a part of a unit test which enshrined the
wrong behavior.

After this patch we are left with one interesting difference from
Cassandra: Though Cassandra allows a user to create a view row with an
empty-string partition key, and this row is fully visible in when
scanning the view, this row can *not* be queried individually because
"WHERE v=''" is forbidden when v is the partition key (of the view).
Scylla does not reproduce this anomaly - and such point query does work
in Scylla after this patch. We add a new test to check this case, and mark
it "cassandra_bug", i.e., it's a Cassandra behavior which we consider
wrong and don't want to emulate.

This patch relies on #9352 and #10178 having been fixed in previous patches,
otherwise the WHERE v='' does not work when reading from sstables.
We add to the already existing tests we had for empty materialized-views
keys a lookup with WHERE v='' which failed before fixing those two issues.

Fixes #9364
Fixes #9375

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-08 15:34:26 +02:00
Nadav Har'El
bc4d0fd5ad murmur3: fix inconsistent token for empty partition key
Traditionally in Scylla and in Cassandra, an empty partition key is mapped
to minimum_token() instead of the empty key's usual hash function (0).
The reasons for this are unknown (to me), but one possibility is that
having one known key that maps to the minimal token is useful for
various iterations.

In murmur3_partitioner.cc we have two variants of the token calculation
function - the first is get_token(bytes_view) and the second is
get_token(schema, partition_key_view). The first includes that empty-
key special case, but the second was missing this special case!

As Kamil first noted in #9352, the second variant is used when looking
up partitions in the index file - so if a partition with an empty-string
key is saved under one token, it will be looked up under a different
token and not found. I reproduced exactly this problem when fixing
issues #9364 and #9375 (empty-string keys in materialized views and
indexes) - where a partition with an empty key was visible in a
full-table scan but couldn't be found by looking up its key because of
the wrong index lookup.

I also tried an alternative fix - changing both implementations to return
minimum_token (and not 0) for the empty key. But this is undesirable -
minimum_token is not supposed to be a valid token, so the tokenizer and
sharder may not return a valid replica or shard for it, so we shouldn't
store data under such token. We also have have code (such as an increasing-
key sanity check in the flat mutation reader) which assumes that
no real key in the data can be minimum_token, and our plan is to start
allowing data with an empty key (at least for materialized views).

This patch does not risk a backward-incompatible disk format changes
for two reasons:

1. In the current Scylla, there was no valid case where an empty partition
   key may appear. CQL and Thrift forbid such keys, and materialized-views
   and indexes also (incorrectly - see #9364, #9375) drop such rows.
2. Although Cassandra *does* allow empty partition keys, they is only
   allowed in materialized views and indexes - and we don't support reading
   materialized views generated by Cassandra (the user must re-generate
   them in Scylla).

When #9364 and #9375 will be fixed by the next patch, empty partition keys
will start appearing in Scylla (in materialized views and in the
materialized view backing a secondary index), and this fix will become
important.

Fixes #9352
Refs #9364
Refs #9375

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-08 14:15:03 +02:00
Nadav Har'El
f8807e24f4 compound_compat.hh: fix bug iterating on empty singular key
When iterating over a compound key with legacy_compound_view<>, when the
key is "singular" (i.e., a single column) we need to iterate over just the
component's actual bytes - without the two length bytes or end-of-component
byte. In particular, when the component is an *empty string*, the iteration
should return zero bytes. In other words, we should have begin() == end().

Unfortunately, this is not what happened - for an empty singular key, the
iterator returned for begin() was slightly different from end() - so
code using this iterator would not know there is nothing to iterate.
So in this patch we fix begin() and end() to return the same thing
if we have an empty singular key.

The bug in legacy_compound_view<> (which we fix here) caused a bug in
sstables::key_view::tri_compare(const schema& s, partition_key_view other),
causing it to return wrong results when comparing two empty keys. As a
result we were unable to retrieve a partition with an empty key from the
sstable index. So this patch is necessary to fix support for
empty-string keys in sstables (part of issue #9375).

This patch also includes a unit-test for this bug. We test it in the
context of sstables::key_view::tri_compare(), where it was first
discovered, and also test the legacy_compound_view itself. The included
test used to fail in both places before this patch, and pass after it.

Fixes #10178
Refs #9375

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-08 14:14:18 +02:00
Botond Dénes
f2060ac03b Merge "Remove sub-mode booleans from storage service" from Pavel Emelyanov
"
There's a _operation_mode enum sitting on storage_service that indicates the
top-level state of the scylla node. Next to it there's a bunch of booleans
that define (and duplicate) some sub-modes. These booleans just make the code
more obscure and complicated. This set removes all those booleans and patches
all the relevant checks/calls/methods to rely only on the operation mode.
Also, the switching between modes is simplified down to some bare minimum.

tests: unit(dev) dtest.simple_boot_shutdown(dev) manual(dev)

Manual test included start-stop, nodetool enablegossip, disablegosip and drain
commands, scylla-cly is_initialized and is_joined calls

As noticed in v2, this set changes the log messages that are checked by
dtests. The fix for dtest, that's compatible with both -- current scylla and
this patchset -- is already in dtest master.
"

* 'br-remove-bools-from-storage-service-3-rebase' of https://github.com/xemul/scylla:
  storage_service: Relax operation modes switch
  storage_service: Remove _ms_stopped
  storage_service: Remove _is_bootstrap_mode
  storage_service: Remove _initialized and is_initialized()
  storage_service: Remove _joined and is_joined()
  storage_service: Replace is_starting() with get_operation_mode()
  storage_service: Make get_operation_mode() return mode itself
  storage_service: Relax repeating set_mode-s
2022-03-07 15:27:03 +02:00
Avi Kivity
e2f3e9791b Update seastar submodule
* seastar 1d81c8e5aa...4e42a60199 (11):
  > condition_variable: Add coroutine-only "when" operation folding waiter into parent frame
  > condition_variable: Add simple test
  > condition_variable: Make std::chrono timeout operations templated
  > condition_variable: Remove semaphore usage, keep internal wait queue
  > condition_variable: Add concept checks to predicated wait methods
  > io_queue: Don't let preemption overlap requests
  > io_queue: Pending needs to keep capacity instead of ticket
  > io_queue: Extend grab_capacity() return codes
  > scattered_message: allow appending temporary buffers directly
  > util: file: include reactor.hh
  > tests: coroutines_test: Check scheduling group set with switch_to() is inherited and restored
2022-03-07 14:30:52 +02:00
Avi Kivity
8ab20bae68 Merge 'prepared_statements: Invalidate batch statement too' from Eliran Sinvani
It seams that batch prepared statements always return false for
depends_on_keyspace and depends_on_column_family, this in turn
renders the removal criteria from the cache to always be false
which result by the queries not being evicted.
Here we change the functions to return the true state meaning,
they will return true if any of the sub queries is dependant upon
the keyspace or column family.

In this fix we first make the API more coherent and then use this new API to implement
the batch statement's dependency test.
Fixes #10129

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #10132

* github.com:scylladb/scylla:
  prepared_statements: Invalidate batch statement too
  cql3 statements: Change dependency test API to express better it's purpose
2022-03-07 14:00:05 +02:00
Pavel Emelyanov
190385551c storage_service: Relax operation modes switch
The set_mode() tries to combine mode switching and extended logging,
but there are no places left that do need this flexibility. It's
simpler and nicer to make set_mode() _just_ switch the mode and
log some generic "entering ... mode" message.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
0941098b39 storage_service: Remove _ms_stopped
This boolean protects do_stop_ms from re-entrability. However, this
method is only called from stop_transport() which handles re-entring
itself, so the _ms_stopped can be just removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
74212286f8 storage_service: Remove _is_bootstrap_mode
This "state" is the sub-state of the STARTING mode that's activated
when the storage_service::bootstrap() is called. Instead of the
separate boolean the new mode can be used. To stop it from reverting
the BOOTSTRAP mode back to JOINING some calls to set_mode() should
be converted into regular logging.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
dbaca825ec storage_service: Remove _initialized and is_initialized()
This bit is hairy. First, it indicates that the storage service
entered the init_server() method. But, once the node is up and
running it also indicates whether the gossiper is enabled or not
via the APi call.

To rely on the operation mode, first, the NONE mode is introduced
at which the server starts. Then in init_server() is switches to
STARTING.

Second change is to stop using the bit in enable/disable gossiper
API call, instead -- check the gossiper.is_enabled() itself.

To keep the is_initialized API call compatible, when the operation
mode is NORMAL it would return true/false according to the status
of the gossiper. This change is simple because storage service API
handlers already have the gossiper instance hanging around.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
ffbfa3b542 storage_service: Remove _joined and is_joined()
The is_joined() status can be get with get_operation_mode(). Since
it indicates that the operation mode is JOINING, NORMAL or anything
above, the operation mode the enum class should be shuffled to get
the simple >= comparison.

Another needed change is to set mode few steps earlier than it
happens now to cover the non-bootstrap startup case.

And the third change is to partially revert the d49aa7ab that made
the .is_joined() method be future-less. Nowadays the is_joined() is
called only from the API which is happy with being future-full in
all other storage service state checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
ca03fd3145 storage_service: Replace is_starting() with get_operation_mode()
This is trivial change, since the only user is in API and the
get_operation_mode + mode values are at hand.

One thing to pay attention to -- the new method checks the mode to
be <= STARTING, not for equality. Now this is equivalent change,
but next patch will introduce NONE mode that should be reported
as is_starting() too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
c385fe7d79 storage_service: Make get_operation_mode() return mode itself
Now it reports back formatted mode. For future convenience it's
needed to return the raw value, all the more so the mode enum class
is already public.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
968b07052d storage_service: Relax repeating set_mode-s
In several places the call to set_mode(...) is used as a (format-less)
replecement for regular logging. Mode doesn't really change there, because
it had been changed before. Patch all those places to use regular logging,
next patches will make full use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Benny Halevy
c7de2e0682 compaction: log info message when interrupting compaction
Info messages are logged when compaction jobs start and finish
but there is no message logged when the job is interrupted, e.g.
when stopped by the compaction_manager.

Refs scylladb/scylla-dtest#2468

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-07 11:43:58 +02:00
Botond Dénes
f1b2ff1722 Merge 'service: storage_service: announce new CDC generation immediately with RBNO' from Kamil Braun
When a new CDC generation is created (during bootstrap or otherwise), it
is assigned a timestamp. The timestamp must be propagated as soon as
possible, so all live nodes can learn about the generation before their
clocks reach the generation's timestamp. The propagation mechanism for
generation timestamps is gossip.

When bootstrap RBNO was enabled this was not the case: the generation
timestamp was inserted into gossiper state too late, after the repair
phase finished. Fix this.

Also remove an obsolete comment.

Fixes https://github.com/scylladb/scylla/issues/10149.

Closes #10154

* github.com:scylladb/scylla:
  service: storage_service: announce new CDC generation immediately with RBNO
  service: storage_service: fix indentation
2022-03-07 11:28:00 +02:00
Benny Halevy
a085ef74ff atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger ttl, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.

Fixes #10173

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
2022-03-07 11:05:30 +02:00
Avi Kivity
9359a2caad Merge 'cql3: expr: Replace column_value::sub with subscript struct in expression' from Jan Ciołek
Currently a subscripted column is expressed using the struct `column_value`:
```c++
/// A column, optionally subscripted by a value (eg, c1 or c2['abc']).
struct column_value {
    const column_definition* col;
    std::optional<expression> sub; ///< If present, this LHS is col[sub], otherwise just col.
}
 ```

 It would be better to have a generic AST node for expressing arbitrary subscripted values:
 ```c++
 /// A subscripted value, eg list_colum[2], val[sub]
struct subscript {
    expression val;
    expression sub;
};
```

The `subscript` struct would allow us to express more, for example:
* subscripted `column_identifier`, not only `column_definition` (needed to get rid of `relation` class)
* nested subscripts: `col[1][2]`

Adding `subscript` to `expression` variant immediately would require to implement all `expr::visit` handlers immediately in the same commit, so I took a different approach. At first the struct is just there and visit handlers are implemented one by one in advance, then at the end `subscript` is added to the `expression`. This way all the new code can be neatly divided into commits and everything is still bisectable.

There were a few cases where the existing behaviour seemed to make little sense, but I didn't change it to keep the PR focused on refactoring. I left a `FIXME` comments there and I will submit separate patches to fix them.

Closes #10139

* github.com:scylladb/scylla:
  cql3: expr: Remove sub from column_value
  cql3: Create a subscript in single_column_relation
  cql3: expr: Add subscript to expression
  cql3: Handle subscript in multi_column_range_accumulator
  cql3: Handle subscript in selectable_process_selection
  cql3: expr: Handle subscript in test_assignment
  cql3: expr: Handle subscript in prepare_expression
  cql3: Handle subscript in prepare_selectable
  cql3: expr: Handle subscript in extract_clustering_prefix_restrictions
  cql3: expr: Handle subscript in extract_partition_range
  cql3: expr: Handle subscript in fill_prepare_context
  cql3: expr: Handle subscript in evaluate
  cql3: expr: Handle subscript in extract_single_column_restrictions_for_column
  cql3: expr: Handle subscript in search_and_replace
  cql3: expr: Handle subscript in recurse_until
  cql3: expr: Implement operator<< for subscript
  cql3: expr: Handle subscript in possible_lhs_values
  cql3: expr: Handle subscript in is_supported_by
  cql3: expr: Handle subscript in is_satisifed_by
  cql3: expr: Remove unused attribute
  cql3: expr: Use column_maybe_subscripted in is_one_of()
  cql3: expr: Use column_maybe_subscripted in limits()
  cql3: expr: Use column_maybe_subscripted in equal()
  cql3: expr: add get_subscripted_column(column_maybe_subscripted)
  cql3: expr: Add as_column_maybe_subscripted
  cql3: expr: Make get_value_comparator work with column_maybe_subscripted
  cql3: expr: Make get_value work with column_maybe_subscripted
  cql3: expr: Add column_maybe_subscripted
  cql3: expr: Add get_subscripted_column
  cql3: expr: Add subscript struct
2022-03-06 19:03:38 +02:00
Gleb Natapov
108e7fcc4e raft: enter candidate state immediately when starting a singleton cluster
When a node starts it does not immediately becomes a candidate since it
waits to learn about already existing leader and randomize the time it
becomes a candidate to prevent dueling candidates if several nodes are
started simultaneously.

If a cluster consist of only one node there is no point in waiting
before becoming a candidate though because two cases above cannot
happen. This patch checks that the node belongs to a singleton cluster
where the node itself is the only voting member and becomes candidate
immediately. This reduces the starting time of a single node cluster
which are often used in testing.

Message-Id: <YiCbQXx8LPlRQssC@scylladb.com>
2022-03-04 20:30:52 +01:00
Kamil Braun
1c5ab5d80c test: raft: randomized_nemesis_test: when setting up clusters, only create the first server with singleton configuration
When setting up clusters in regression tests, a bunch of servers were
created, each starting with a singleton configuration containing itself.
This is wrong: servers joining to an existing cluster should be started
with an empty configuration.

It 'worked' because the first server, which we wait for to become a leader
before creating the other servers, managed to override the logs and
configurations of other servers before they became leaders in their
configurations.

But if we want to change the logic so that servers in single-server clusters
elect themselves as leaders immediately, things start to break. So fix
the bug.
Message-Id: <20220303100344.6932-1-kbraun@scylladb.com>
2022-03-04 20:29:19 +01:00
Jadw1
213dace26e CQL3/pytest: Updating test_json
Referring to issue #7915, cassandra also works with unprepared
statement. There was missing `fromJson()`, the test was inserting
string into boolean column.
2022-03-04 14:18:42 +01:00
Jadw1
1902dbc9ff CQL3: fromJson accepts string as bool
The problem was incompatibility with cassandra, which accepts bool
as a string in `fromJson()` UDF. The difference between Cassandra and
Scylla now is Scylla accepts whitespaces around word in string,
Cassandra don't. Both are case insensitive.

Fixes: #7915
2022-03-04 14:18:34 +01:00
Benny Halevy
eff5076dd5 sstables: close_files: auto-remove temporary sstable directory
If the sstable is marked for deletion, e.g. when
writing the sstable fails for any reason before it's
sealed, make sure to remove the sstable's temporary
directory, if present, besides the sstables files.

This condition is benign as these empty temp dirs
are removed when scylla starts up, but the do accumulate
and we better remove them too.

Fixes #9522

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302161827.2448980-1-bhalevy@scylladb.com>
2022-03-03 16:13:03 +02:00
Michael Livshin
0caa21079d sstables: refrain from throwing on host id mismatch
This makes host id mismatch cause a warning and stop being fatal,
to un-break node replacement dtests.

Should be revisited if/when the underlying problem (double setting of
local host id on a replacing node) is fixed.

Refs #10148

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220303085049.186259-1-michael.livshin@scylladb.com>
2022-03-03 15:53:19 +02:00
Benny Halevy
a57c087c89 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttl:s.

The problem with that is that the cells' hashes are different and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.

This may be triggered by e.g. the spark migrator that computes the ttl
based on the expiry time by subtracting the expiry time from the current
time to produce a respective ttl.

If the cell is migrated multiple times at different times, it will generate
cells that the same expiry (by design) but have different ttl values.

Fixes #10156

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
2022-03-03 15:27:16 +02:00
Piotr Grabowski
d3673f2b29 types/map.hh: add missing const qualifiers
Add missing const qualifiers in serialize_to_bytes and
serialize_to_managed_bytes. Lack of those qualifiers caused GCC
compilation error:

./types/map.hh: In instantiation of ‘static bytes map_type_impl::serialize_to_bytes(const Range&) [with Range = std::map<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false>, serialized_compare>; bytes = seastar::basic_sstring<signed char, unsigned int, 31, false>]’:
cql3/type_json.cc:138:45:   required from here
./types/map.hh:72:41: error: loop variable ‘elem’ of type ‘const std::pair<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >&’ binds to a temporary constructed from type ‘const std::pair<const seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >’ [-Werror=range-loop-construct]
   72 |     for (const std::pair<bytes, bytes>& elem : map_range) {
      |                                         ^~~~
./types/map.hh:72:41: note: use non-reference type ‘const std::pair<seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >’ to make the copy explicit or ‘const std::pair<const seastar::basic_sstring<signed char, unsigned int, 31, false>, seastar::basic_sstring<signed char, unsigned int, 31, false> >&’ to prevent copying

Adding those const qualifiers there is correct, as the definition of
those functions specifies that the range is of
std::pair<const bytes, bytes> elements, not std::pair<bytes, bytes>
(before the change):

requires std::convertible_to<std::ranges::range_value_t<Range>,
                             std::pair<const bytes, bytes>>

Note that there are some GCC compilation problems still left apart
from this one.

Closes #10157
2022-03-03 14:24:05 +02:00
Benny Halevy
d43da5d6dc atomic_cell: compare_atomic_cell_for_merge: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-2-bhalevy@scylladb.com>
2022-03-03 14:13:14 +02:00
Benny Halevy
be865a29b8 atomic_cell: compare_atomic_cell_for_merge: simplify expiry/deltion_time comparison
No need to check first the the cells' expiry is different
or that deletion_time is different before comparing them
with `<=>`.

If they are the same the function returns std::strong_ordering::equal
anyhow and that is the same as `<=>` comparing identical values.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302113833.2308533-1-bhalevy@scylladb.com>
2022-03-03 14:12:44 +02:00
Nadav Har'El
3d0bd523b5 Merge 'CQL3: fromJson out of range integer cause as error' from Jadw1
Passing integer which exceeds corresponding type's bounds to
`fromJson()` was causing silent overflow, e.g. inserting
`fromJson('2147483648')` to `int` coulmn stored `-2147483648`.

Now, this will cause marshal_exception. All integer types are testing agains their bounds.

Tests referring issue https://github.com/scylladb/scylla/issues/7914 in `test/cql-pytest/cassandra_tests/validation/entities/json_test.py` won't pass because the expected error's messages differ from the thrown ones. I was wondering what the message should be, because expected messages in tests aren't consistent, for instance:
- bigint overflow expects `Expected a bigint value, but got a` message
- short overflow expects `Unable to make short from` message

For now the message is `Value {} out of bound`.

Fixes: https://github.com/scylladb/scylla/issues/7914

Closes #10145

* github.com:scylladb/scylla:
  CQL3/pytest: Updating test_json
  CQL3: fromJson out of range integer cause as error
2022-03-03 13:46:16 +02:00
Piotr Grabowski
0544973b15 utils/rjson.cc: ignore buggy GCC warning
When compiling utils/rjson.cc on GCC, the compilation triggers the
following warning (which becomes a compilation error):

utils/rjson.cc: In function ‘seastar::future<> rjson::print(const value&, seastar::output_stream<char>&, size_t)’:
utils/rjson.cc:239:15: error: typedef ‘using Ch = char’ locally defined but not used [-Werror=unused-local-typedefs]
  239 |         using Ch = char;
      |               ^~

This warning is a false positive. 'using Ch' is actually used internally
by rapidjson::Writer. This is a known GCC bug
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61596), which has not been
fixed since 2014.

I disabled this warning only locally as other code is not affected by
this warning and no other code already disables this warning.

Note that there are some GCC compilation problems still left apart
from this one.

Closes #10158
2022-03-02 19:10:58 +02:00
Pavel Emelyanov
6a154305d7 gossiper: Remove db::config reference from gossiper
Also const-ify the db::config reference argument and std::move
the gossip_config argument while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Pavel Emelyanov
0c24087007 gossiper: Keep live-updateable options on gossiper
These options need to have updateable_value<> instance referencing
them from gossiper itself. The updateable_value<> is shard-aware in
the sense that it should be constructed on correct shard. This patch
does this -- the db::config reference is carried all the way down
to the gossiper constructor, then each instance gets its shard-local
construction of the updateable_value<>s.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Pavel Emelyanov
271ceb57b9 gossiper: Keep immutable options on gossip_config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00
Piotr Grabowski
e99f487d31 raft: add missing include
Add missing include of "<experimental/source_location>" which caused
compile errors on GCC:

In file included from raft/fsm.hh:12,
                 from raft/fsm.cc:8:
raft/raft.hh:251:30: error: ‘std::experimental’ has not been declared
  251 |     state_machine_error(std::experimental::source_location l = std::experimental::source_location::current())
      |                              ^~~~~~~~~~~~
raft/raft.hh:251:59: error: expected ‘)’ before ‘l’
  251 |     state_machine_error(std::experimental::source_location l = std::experimental::source_location::current())
      |                        ~                                  ^~

Note that there are some GCC compilation problems still left apart from
this one.

Closes #10155
2022-03-02 16:33:43 +01:00
Jadw1
742efc4992 CQL3/pytest: Updating test_json
Added test for bigint overflow.
2022-03-02 15:36:09 +01:00
Kamil Braun
09357c784f service: storage_service: announce new CDC generation immediately with RBNO
When a new CDC generation is created (during bootstrap or otherwise), it
is assigned a timestamp. The timestamp must be propagated as soon as
possible, so all live nodes can learn about the generation before their
clocks reach the generation's timestamp. The propagation mechanism for
generation timestamps is gossip.

When bootstrap RBNO was enabled this was not the case: the generation
timestamp was inserted into gossiper state too late, after the repair
phase finished. Fix this.

Also remove an obsolete comment.

Fixes #10149.
2022-03-02 14:55:49 +01:00
Benny Halevy
3b5ba5c1a9 compaction_manager: stop_tasks: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302081547.2205813-3-bhalevy@scylladb.com>
2022-03-02 15:44:10 +02:00
Benny Halevy
95cf4c1c6f compaction_manager: coroutinize stop_tasks
Simplify the function by implementing it as a coroutine,
ensuring the input vector, holding the shared task ptrs, is
kept alive throughout the lifetime of the function
(instead of using do_with to achieve that)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302081547.2205813-2-bhalevy@scylladb.com>
2022-03-02 15:44:10 +02:00
Benny Halevy
d1d3c620b2 compaction_manager: embed task_stop into stop_tasks
task_stop is called exclusively from stop_tasks,
Now that stop_tasks calls task::stop() directly,
there is no need for this separation, so open-code
task_stop in stop_tasks, using coroutines.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302081547.2205813-1-bhalevy@scylladb.com>
2022-03-02 15:44:10 +02:00
Kamil Braun
c8e3d5f69d service: storage_service: fix indentation 2022-03-02 14:43:08 +01:00
Benny Halevy
0764e511bb compaction_manager: perform_offstrategy: run_offstrategy_compaction in maintenance scheduling group
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.

However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.

Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.

The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.

Fixes #10151

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
2022-03-02 15:36:28 +02:00
Jadw1
0fd7ffb8c1 CQL3: fromJson out of range integer cause as error
Passing integer which exceeds corresponding type's bounds to
`fromJson()` was causing silent overflow, e.g. inserting
`fromJson('2147483648')` to `int` coulmn stored `-2147483648`.

Now, this will cause marshal_exception with value out of bound
message. Also, all integer types are testing agains their bounds.

Fixes: #7914
2022-03-02 14:30:03 +01:00
Botond Dénes
92eb02c301 Merge "Sanitize join_token_ring pre-bootstrap waiter" from Pavel Emelyanov
"
The set puts the code in question into a helper, coroutinizes it, removes
some code duplication, improves a corner case and relaxes logging.

tests: unit(dev), dtest.simple_boot_shutdown(v1, dev)
"

* 'br-join-ring-wait-sanitize-2' of https://github.com/xemul/scylla:
  storage_service: De-bloat waiting logs
  storage_service: Indentation fix after previous changes
  storage_service: Negate loop breaking check
  storage_service: Fix off-by-one-second waiting
  storage_service: Pack schema waiting loop
  storage_service: Out-line schema waiting code
  storage_service: Make int delay be std::chrono::milliseconds
2022-03-02 15:14:53 +02:00
Mikołaj Sielużycki
f4c57cbe87 memtable: Convert partition_snapshot_flat_reader to v2.
This is a facade change only, the make_partition_snapshot_flat_reader
function calls upgrade_to_v2 internally.

Closes #10152
2022-03-02 15:07:36 +02:00
Pavel Emelyanov
bb1c4adb7c storage_service: De-bloat waiting logs
First thing is that logging can be done with logger methods,
not with set_mode() because the mode is already set at this
place.

Second thing is that pre-update_pending_ranges logs are excessive,
as the update_pending_ranges logs its progress itself.

Third is that post-logging is also exsessive -- there are more
logs after those lines.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:55:30 +03:00
Pavel Emelyanov
cb0d298cc4 storage_service: Indentation fix after previous changes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:55:30 +03:00
Pavel Emelyanov
829ffe630b storage_service: Negate loop breaking check
In simple words turn

   while {
       if (continue) {
           do_something
       } else {
           break
       }
   }

into

    while {
        if (!continue) {
            break;
        }

        do_something
    }

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:55:30 +03:00
Pavel Emelyanov
463aa66b75 storage_service: Fix off-by-one-second waiting
The waiting loop needs to abort once a minute passes and does
it in one second steps. However, the expiration check happens
after sleep, which effectively throws this last second away.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:55:30 +03:00
Pavel Emelyanov
60b53732e5 storage_service: Pack schema waiting loop
The newly created method looks like this

    wait_for_schema_agreement
    update_pending_ranges

    while (consistent_range_movement) {
        pause
        wait_for_schema_agreement
        update_pending_range
    }

This patch packs the wait_for_schema_agreement+update_pending_range
pairs into a single loop.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:55:30 +03:00
Pavel Emelyanov
d5b75a24a5 storage_service: Out-line schema waiting code
And coroutinize while moving. No other changes.

While the code in question runs in a thread context and can enjoy
synchronous .get() calls, it's still better if it doesn't make any
assumptions about its environment. The ring joining code is changing
and new intermediate helpers should better be on the safe side from
the very beginning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:53:22 +03:00
Pavel Emelyanov
3ea7539d27 storage_service: Make int delay be std::chrono::milliseconds
It's milliseconds and is converted back and forth in join_token_ring().
Having a chrono type for it makes things (mostly code reading) simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:51:47 +03:00
Benny Halevy
c6e0245f87 compaction_manager: get rid of the disable method
It is unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302080632.2183782-1-bhalevy@scylladb.com>
2022-03-02 11:13:39 +03:00
Nadav Har'El
fa7a302130 cross-tree: split coordinator_result from exceptions.hh
Recently, coordinator_result was introduced as an alternative for
exceptions. It was placed in the main "exceptions/exceptions.hh" header,
which virtually every single source file in Scylla includes.
But unfortunately, it brings in some heavy header files and templates,
leading to a lot of wasted build time - ClangBuildAnalyzer measured that
we include exceptions.hh in 323 source files, taking almost two seconds
each on average.

In this patch, we split the coordinator_result feature into a separate
header file, "exceptions/coordinator_result", and only the few places
which need it include the header file. Unfortunately, some of these
few places are themselves header, so the new header file ends up being
included in 100 source files - but 100 is still much less than 323 and
perhaps we can reduce this number 100 later.

After this patch, the total Scylla object-file size is reduced by 6.5%
(the object size is a proxy for build time, which I didn't directly
measure). ClangBuildAnalyzer reports that now each of the 323 includes
of exceptions.hh only takes 80ms, coordinator_result.hh is only included
100 times, and virtually all the cost to include it comes from Boost's
result.hh (400ms per inclusion).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220228204323.1427012-1-nyh@scylladb.com>
2022-03-02 10:12:57 +02:00
Raphael S. Carvalho
2dba0670ad compaction: Fix time_window_backlog_tracker::replace_sstables()
Introduced in commit: ddd693c6d7

We're not emplacing newer windows in the tracker, causing
std::out_of_range when replacing sstables for windows.

Let's fix the logic and add an unit test to cover this.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220301194944.95096-1-raphaelsc@scylladb.com>
2022-03-02 10:08:40 +02:00
Botond Dénes
b2061688a5 mutation_writer/multishard_writer: remove now unused v1 factory overloads 2022-03-02 09:58:38 +02:00
Botond Dénes
70e95a9cf7 test/boost/mutation_writer_test: test the v2 variant of distribute_reader_and_consume_on_shards()
The underlying implementation behind the v1 and v2 variants if said
methods is the same, but we want to move to using the v2 variant in the
test as the v1 variant is going away soon.
2022-03-02 09:57:24 +02:00
Botond Dénes
7a119080ee flat_mutation_reader: add v2 variant of make_generating_reader() 2022-03-02 09:56:50 +02:00
Botond Dénes
bbf8e26a3a mutation_reader: multishard_writer: migrate implementation to v2 2022-03-02 09:56:10 +02:00
Botond Dénes
cdf7e74da8 mutation_reader: convert foreign_reader to v2 2022-03-02 09:55:38 +02:00
Botond Dénes
ad1b157452 streaming/consumer: convert to v2
At least on the API level, internally there are still conversions, but
these are going to be sorted out in the next patches too.
2022-03-02 09:55:09 +02:00
Benny Halevy
1e15caa158 compaction_manager: setup_new_compaction: allow setting output_run_identifier
Currently the output_run_identifier is assigned right
after the calling setup_new_compaction.
Move setting the uuid to setup_new_compaction to simplify
the flow.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083643.1845096-1-bhalevy@scylladb.com>
2022-03-02 09:50:59 +02:00
Michael Livshin
a389cc520b system_keyspace, sstable: log local host id in key places
Specifically: when it is generated, when it is loaded from
`system.local`, and when there is a mismatch during sstable
validation; in the latter case log the in-sstable host id also.

Refs #10148

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220301123925.257766-1-michael.livshin@scylladb.com>
2022-03-02 09:49:37 +02:00
Benny Halevy
c9e06f1246 compaction_manager: task: get rid of the stopping member
Instead, rely solely on compaction_data.abort source
that is task::stop now uses to stop the task.

This makes task stopping permanent, so it can't be undone
(as used to be the case where task_stop
set stopping to false after waiting for compaction_done,
to allow rerite_sstables's task to be created before
calling run_with_compaction_disabled, and start
running after it - which is no longer the case)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083535.1844829-1-bhalevy@scylladb.com>
2022-03-01 16:46:09 +02:00
Benny Halevy
222389e0f5 compaction_manager: rewrite_sstables: retrieve sstable with compaction disabled before making task
Currently, rewrite_sstables retrieved the sstables under
run_with_compaction_disabled, *after* it's created a task for itself.

This makes little sense as this task have not started running yet
and therefore does not need to be stopped by
run_with_compaction_disabled.

This is currently worked around by setting task->stopping = false
in task_stop().

This change just moves task create in rewrite_sstables till
after the sstables are retrieved and the deferred cleanup
of _stats.pending_tasks till after it's first adjusted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083409.1844500-1-bhalevy@scylladb.com>
2022-03-01 16:45:33 +02:00
Nadav Har'El
7cf2e5ee5c Merge 'directory_lister: drop abort method and simplify close semantics' from Benny Halevy
This series contains:
- lister: move to utils
  - tidy up the clutter in the root dir
Based on Avi's feedback to `[PATCH 1/1] utils: directory_lister: close: always abort queue` that was sent to the mailing list:
  - directory_lister: drop abort method
  - lister: do not require get after close to fail
- test: lister_test: test_directory_lister_close simplify indentation
  - cosmetic cleanup

Closes #10142

* github.com:scylladb/scylla:
  test: lister_test: test_directory_lister_close simplify indentation
  lister: do not require get after close to fail
  directory_lister: drop abort method
  lister: move to utils
2022-03-01 16:23:47 +02:00
Botond Dénes
cfa3910509 Merge 'Memtable - scanning and flush readers now implement flat_mutation_reader_v2::impl' from Michael Livshin
This PR consists of two changes.

The first fixes the flat_mutation_reader and flat_mutation_reader_v2, so that they can be destructed without being closed (if no action has been initiated). This has been discussed in the referenced issue.

The second one changes scanning and flush readers so that they implement the second version of the API.

It also contains unit test fixes, dealing with flat mutation reader assertions (where the v1 asserter failed to consume range tombstones intelligently enough in some flows) and several sstable_3_x tests (where sstables that contain range tombstones were expected to be byte-by-byte equivalent to a reference, aside from semantic validation).

Fixes #9065.

Closes #9669

* github.com:scylladb/scylla:
  flat_reader_assertions: do not accumulate out-of-range tombstones
  flat_reader_assertions: refactor resetting accumulated tombstone lists
  flat_mutation_reader_test: fix "test_flat_mutation_reader_consume_single_partition"
  memtable::make_flush_reader(): return flat_mutation_reader_v2
  memtable::make_flat_reader(): return flat_mutation_reader_v2
  flat_mutation_reader_v2: add consume_partitions()
  introduce the MutationConsumer concept
  mutation_source: clone shortcut constructors for flat_mutation_reader_v2
  flat_mutation_reader_v2: add delegating_reader_v2
  memtable: upgrade scanning_reader and flush_reader to v2
  flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO.
  tests: stop comparing sstables with range tombstones to C* reference
  tests: flat_reader_assertions: improve range tombstone checking
2022-02-28 17:23:20 +02:00
Michael Livshin
fb6c79015a flat_reader_assertions: do not accumulate out-of-range tombstones
Also remove the incorrect difference in range tombstone checking
behavior between `produces_range_tombstone()` and `produces(const
range_tombstone&)` by having both turn on checking.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
9fa4d9a2bb flat_reader_assertions: refactor resetting accumulated tombstone lists
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
2221aeff0e flat_mutation_reader_test: fix "test_flat_mutation_reader_consume_single_partition"
Since `flat_reader_assertions::produces(const range_tombstone&,...)`
records the range tombstone for checking, be sure to explicitly pass
in a clustering range that does not extend beyond the mock-read part
of the mutation.

Also (provisionally) change the assertion method to accept clustering
ranges.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
34ed752885 memtable::make_flush_reader(): return flat_mutation_reader_v2
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
9bacce4359 memtable::make_flat_reader(): return flat_mutation_reader_v2
This is just a facade change.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
8da28d0902 flat_mutation_reader_v2: add consume_partitions()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
ce8f34f5a0 introduce the MutationConsumer concept
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
68cfb6261f mutation_source: clone shortcut constructors for flat_mutation_reader_v2
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
fbbe27051e flat_mutation_reader_v2: add delegating_reader_v2
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michał Radwański
2a3bd40c69 memtable: upgrade scanning_reader and flush_reader to v2
This change is a part of effort to migrate existing readers from old API
to the new one. The corresponding make_flush_reader and
make_flat_reader functions still return flat_mutation_reader.
2022-02-28 17:11:54 +02:00
Michał Radwański
9ada63a9cb flat_mutation_reader: allow destructing readers which are not closed and didn't initiate any IO.
In functions such as upgrade_to_v2 (excerpt below), if the constructor
of transforming_reader throws, r needs to be destroyed, however it
hasn't been closed. However, if a reader didn't start any operations, it
is safe to destruct such a reader. This issue can potentially manifest
itself in many more readers and might be hard to track down. This commit
adds a bool indicating whether a close is anticipated, thus avoiding
errors in the destructor.

Code excerpt:
flat_mutation_reader_v2 upgrade_to_v2(flat_mutation_reader r) {
    class transforming_reader : public flat_mutation_reader_v2::impl {
        // ...
    };
    return make_flat_mutation_reader_v2<transforming_reader>(std::move(r));
}

Fixes #9065.
2022-02-28 17:11:54 +02:00
Michael Livshin
67c3c31a6e tests: stop comparing sstables with range tombstones to C* reference
As flat mutation reader {up,down}grades get added to the write path,
comparing range-tombstone-containing (at least) sstables byte-by-byte
to a reference is starting to seem like a fool's errand.

* When a flat mutation reader is {up,down}graded, information may get
  lost while splitting range tombstones.  Making those splits revertable
  should in theory be possible but would surely make {up,down}graders
  slower and more complex, and may also possibly entail adding
  information to in-memory representation of range tombstones and
  range rombstone changes.  Such investment for the sake of 7 unit tests
  does not seem wise, given that the plan is to get rid of reader
  {up,down}grade logic once the move to flat mutation reader v2 is
  completed.

* All affected tests also validate their written sstables
  semantically.

* At least some of the offending reference sstables are not
  "canonical" wrt range tombstones to begin with -- they contain range
  tombstones that overlap with clustering rows.  The fact that Scylla
  does not "canonicalize" those in some way seems purely incidental.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
2337d48b41 tests: flat_reader_assertions: improve range tombstone checking
`produces_range_tombstone()` is smart enough to not just try to read
one range tombstone from the input and compare it to the passed
reference, but to read as many range tombstones as the reader is
looking at (including none) using `may_produce_tombstones()` and
record those appropriately.

When `produces(const schema&, const mutation_fragment&)` is passed a
range tombstone as the second argument, it does not do anything
special -- it just reads one fragment, disregards it (!), and applies
its second argument to both "expected" and "encountered" range
tombstone lists.  The right thing here is to use the same logic as
`produces_range_tombstone()`; upcoming memtable-related reader
changes (which result in more split range tombstones) cause some unit
tests to fail without fixing this.

Refactor the relevant logic into a private method (`apply_rt()`) and
use that in both places.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Nadav Har'El
f84094320d exceptions: de-inline exception constructors
The header file "exceptions/exceptions.hh" and the exception types in it
is used by virtually every source file in Scylla, so excessive includes
and templated code generation in this header could slow down the build
considerably.

Before this patch, all of the exceptions' constructors were inline in
exceptions.hh, so source file using one of these exceptions will need
to recompile the code, which is fairly heavy, using the fmt templates
for various types. According to ClangBuildAnalyzer, 323 source files
needed to materialize prepare_message<db::consistency_level,int&,int&>,
taking 0.3 seconds each.

So this patch moves the exception constructors from the header file
exceptions.hh to the source file exceptions.cc. The header file no longer
uses fmt.

Unfortunately, the actual build-time savings from this patch is tiny -
around 0.1%... It turns out that most of the prepare_message<>
compilation time comes from fmt compilation time, and since virtually
all source files use fmt for other header reasons (intentionally or
through other headers), no compilation time can be saved. Nevertheless,
I hope that as we proceed with more cleanups like this and eliminate
more unnecessary code-generation-in-headers, we'll start seeing build
time drop.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-28 14:47:41 +02:00
Botond Dénes
f8da0a8d1e Merge "Conceptualize some static assertions" From Pavel Emelyanov
"
Some templates put constraints onto the involved types with the help of
static assertions. Having them in form of concepts is much better.

tests: unit(dev)
"

* 'br-static-assert-to-concept' of https://github.com/xemul/scylla:
  sstables: Remove excessive type-match assertions
  mutation_reader: Sanitize invocable asserion and concept
  code: Convert is_future result_of assertions into invoke_result concept
  code: Convert is_same+result_of assertions into invocable concepts
  code: Convert nothrow construction assertions into concepts
  code: Convert is_integral assertions to concepts
2022-02-28 13:58:01 +02:00
Nadav Har'El
b650ff5808 test/cql-pytest: test another corner-case of scientific-notation integers
In a previous patch, we added a test for the case of Scylla trying to
assign the JSON value 1e6 into an integer - which should be allowed
because 1e6 is indeed a whole number, in the range of int.

We already fixed that in commit efe7456f0a,
but this patch adds another test which demonstrates that an even more
esoteric problem remains:

If we are reading a JSON value into a bigint (CQL's 64-bit integer),
*and* if the number is between 2^53 and 2^63-1 *and* if the number
is written using scientific notation, e.g., 922337203685477580.7e1
(which is 2^63-1), then the bigint is set incorrectly, with some
digits being lost. The problem is that RapidJSON reads this integer
into the "double" type, which only keeps 53 significant bits.

Because this is an open issue (#10137), the test included here is
marked as expected failure (xfail). The test is also known to
fail in Cassandra - which doesn't allow scientific notation for
JSON integers at all despite the JSON standard - so the test is
also marked "cassandra_bug".

Refs #10137

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-28 13:52:56 +02:00
Benny Halevy
1768aae603 compaction_manager: rewrite_sstables: construct compacting_sstable_registration with compaction_manager&
Rather than using a std::optional<compacting_sstable_registration>
for lazy construction, construct the object early
and call register_compacting when the sstables to register
are available.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-28 13:52:03 +02:00
Benny Halevy
1584c50710 compaction_manager: compacting_sstable_registration: keep a compaction_manager&
Rather than a compaction_manager* so that in the next
patch it could be constructed with just that and
the caller can call register_compacting when
it has the sstables to register ready.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-28 13:52:03 +02:00
Benny Halevy
c008fb137b compaction_manager: use unordered_set for compacting sstables registration
It is more efficient than using a vector as the interface.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-28 13:52:03 +02:00
Benny Halevy
9c89c2df37 test: lister_test: test_directory_lister_close simplify indentation
There's no need anymore for an indented block
to destroy tnhe directory_lister since the other
sub-case was deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-28 13:00:03 +02:00
Benny Halevy
41d097ef47 lister: do not require get after close to fail
Currently, the lister test expected get() to
always fail after close(), but it unexpectedly
succeeded if get() was never called before close,
as seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/4587/artifact/testlog/x86_64_debug/lister_test.test_directory_lister_close.4001.log
```
random-seed=1475104835
Generated 719 dir entries
Getting 565 dir entries
Closing directory_lister
Getting 0 dir entries
Closing directory_lister
test/boost/lister_test.cc(190): fatal error: in "test_directory_lister_close": exception std::exception expected but not raised
```

This change relaxes this requirement to keep
close() simple, based on Avi's feedback:

> The user should call close(), and not do it while get() is running, and
> that's it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-28 12:59:08 +02:00
Benny Halevy
00327bfae3 directory_lister: drop abort method
Based on Avi's feedback:
> We generally have a public abort() only if we depend on an external
> event (like data from a tcp socket) that we don't control. But here
> there are no such external events. So why have a public abort() at all?

If needed in the future, we can consider adding
get(abort_source&) to allow aborting get() via
an external event.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-28 12:52:47 +02:00
Benny Halevy
ebbbf1e687 lister: move to utils
There's nothing specific to scylla in the lister
classes, they could (and maybe should) be part of
the seastar library.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-28 12:36:03 +02:00
Botond Dénes
d27259ca5b mutation_writer/multishard_writer: add v2 variant of distribute_reader_and_consume_on_shards()
Just the factory function itself. The underlying machinery stays v1 for
now. Behind the scenes the v2 variant still invokes the v1 one, with the
necessary conversions.
This allows migrating users to the v2 interface, migrating the machinery
later.
2022-02-28 10:48:08 +02:00
Jan Ciolek
e086201420 cql3: expr: Remove sub from column_value
column_value::sub has been replaced by the subscript struct
everywhere, so we can finally remove it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 22:02:39 +01:00
Jan Ciolek
b80f9e6cf8 cql3: Create a subscript in single_column_relation
When `val[sub]` is parsed, it used to be the case
that column_value with a sub field was created.

Now this has been changed to creating a subscript struct.

This is the only place where a subscripted value can be created.

All the code regarding subscripts now operates using only the
subscript struct, so we will be able to remove column_value::sub soon.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 22:02:39 +01:00
Jan Ciolek
cf6e81e731 cql3: expr: Add subscript to expression
All handlers for subscript have finally been implemented
and subscript can now be added to expression without
any trouble.

All the commented out code that waited for this moment
can now be uncommented.
Every such piece of code had a `TODO(subscript)` note
and by grepping this phrase we can make sure that
we didn't forget any of them.

Right now there is two ways to express a subscripted
column - either by a column_value with a sub field
or by using a subscript struct.

The grammar still uses the old column_value way,
but column_value.sub will be removed soon
and everything will move to the subscript struct.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 22:02:29 +01:00
Jan Ciolek
0a7636b2d4 cql3: Handle subscript in multi_column_range_accumulator
Same case as column_value.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
818e3544bb cql3: Handle subscript in selectable_process_selection
Selected values can't subscripted, the grammar in Cql.g doesn't allow it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
ec6f93d0c7 cql3: expr: Handle subscript in test_assignment
test_assignment can't be passed a column_value,
so a subscript won't work as well.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
ab89fc316b cql3: expr: Handle subscript in prepare_expression
column_value can't be prepared, so subscript can't be prepared as well.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
1a653f8f36 cql3: Handle subscript in prepare_selectable
Selected values can't subscripted, the grammar in Cql.g doesn't allow it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
ef2acddcb9 cql3: expr: Handle subscript in extract_clustering_prefix_restrictions
extract_clustering_prefix_restrictions collects restrictions
on clustering key columns.

In case we encounter col[sub] we treat it as a restriction on col
and add it to the result.

This seems to make some sense and is in line with the current behaviour
which doesn't check whether a column is subscripted at all.

The code has been copied from column_value& handler.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
ec57c2516e cql3: expr: Handle subscript in extract_partition_range
extract_parition_range collects restrictions on partition key columns.

In case we encounter col[sub] we treat it as a restriction on col
and add it to the result.

This seems to make some sense and is in line with the current behaviour
which doesn't check whether a column is subscripted at all.

The code has been copied from column_value& handler.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
c39498537c cql3: expr: Handle subscript in fill_prepare_context
fill_prepare_context collects useful information about
the expression involved in query restrictions.

We should collect this information from subscript as well,
just like we do from column_value and its sub.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
811685ad6a cql3: expr: Handle subscript in evaluate
A column_value can't be evaluated,
so a subscripted column can't evaluated be as well.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
db8990436a cql3: expr: Handle subscript in extract_single_column_restrictions_for_column
extract_single_column_restrictions_for_column finds all restrictions
for a column and puts them in a vector.

In case we encounter col[sub] we treat it as a restriction on col
and add it to the result.

This seems to make some sense and is in line with the current behaviour
which doesn't check whether a column is subscripted at all.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
2eaa39e1c8 cql3: expr: Handle subscript in search_and_replace
Prepare a handler for subscript in search_and_replace.
Some of the code must be commented out for now
because subscript hasn't been added to expression yet.

It will uncommented later.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
e2d983f659 cql3: expr: Handle subscript in recurse_until
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
2d4174dc46 cql3: expr: Implement operator<< for subscript
expression can be printed using operator<<.
We need to handle subscript there.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
02c3b78e25 cql3: expr: Handle subscript in possible_lhs_values
possible_lhs_values returns set of possible values
for a column given some restrictions.

Current behaviour in case of a subscripted column
is to just ignore the subscript and treat
the restriction as if it were on just the column.

This seems wrong, or at least confusing,
but I won't change it in this patch to preserve the existing behaviour.

Trying to change this to something more reasonable
breaks other code which assumes that possible_lhs_values
returns a list of values.
(See partition_ranges_from_EQs() in cql3/restrictions/statement_restrictions.cc)

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
07fbf74a97 cql3: expr: Handle subscript in is_supported_by
is_supported_by checks whether the given expression
is supported by some index.

The current behaviour seems wrong, but I kept
it to avoid making changes in a refactor PR.

Scylla doesn't have indexes on map entries yet,
so for a subscript the answer is always no.
I think we should just return false there.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
fb59f488df cql3: expr: Handle subscript in is_satisifed_by
For the most part subscript can be handled
in the same way as column_value.

column_value has a sub argument and all
called functions evaluate lhs value using
get_value() which is prepared to handle
subscripted columns.

These functions now take column_maybe_subscripted
so we can pass &subscript to them without a problem.

The difference is in CONTAINS, CONTAINS_KEY and LIKE.

contains() and contains_key() throw an exception
when the passed column has a subscript, so now
we just throw an exception immediately.

like() doesn't have a check for subscripted value,
but from reading its code it's clear that
it's not ready to handle such values,
so an exception is now thrown as well.
It shouldn't break any tests because when one tries
to perform a query like:

`select * from t where m[0] like '%' allow filtering;`

an exception is throw somewhere earlier in the code.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
1edaa3ef0d cql3: expr: Remove unused attribute
Functions that were previously marked as unused to make the code
compile are now used and we can remove the markings.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
cf839807ac cql3: expr: Use column_maybe_subscripted in is_one_of()
is_one_of() used to take column_value which could be subscripted as an argument.
column_value.sub will be removed so this function needs to take column_maybe_subscripted now.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
75c8b2ec6c cql3: expr: Use column_maybe_subscripted in limits()
limits() used to take column_value which could be subscripted as an argument.
column_value.sub will be removed so this function needs to take column_maybe_subscripted now.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
bc8c298be3 cql3: expr: Use column_maybe_subscripted in equal()
equal() used to take column_value which could be subscripted as an argument.
column_value.sub will be removed so this function needs to take column_maybe_subscripted now.

To get lhs value the code uses get_value() which is ready to handle subscripted columns.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
ca423a455e cql3: expr: add get_subscripted_column(column_maybe_subscripted)
Add a function that extracts the column_value
from column_maybe_subscripted.

There were already overloads for expression and subscript,
but this one will be needed as well.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
6d42ff580d cql3: expr: Add as_column_maybe_subscripted
Add a convenience function that allows to convert
a reference to expression to column_maybe_subscripted.

It will be useful in a moment.

For now part of it must be commented out
because subscript is not in the expression variant yet.

It will be uncommented once subscript is finally added
to expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
d577af0f0c cql3: expr: Make get_value_comparator work with column_maybe_subscripted
There is get_value_comparator(column_value) but soon
we will also need get_value_comparator(column_maybe_subscripted).

Implement it by copying code from get_value_comparator(column_value).

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
a8287a158a cql3: expr: Make get_value work with column_maybe_subscripted
There is a get_value(column_value), but soon we will also
need get_value(column_maybe_subscripted).

Implement get_value(column_maybe_subscripted) by checking
whether the argument is a column_value or subscript
and calling the right code.

Code for handling the subscript case is copied from
get_value(column_value) where sub has value.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
feee6e4ffb cql3: expr: Add column_maybe_subscripted
column_maybe_subscripted is a variant that
can be either a column_value or a subscript.

It will be used as an argument to functions
which used to take column_value.

Right now column_value has a sub field,
but this will be removed soon once
the subscript struct takes over.

Changing the argument type is a smaller change
than rewriting all these functions, although
if they were rewritten the resulting code
would probably be nicer.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:41 +01:00
Jan Ciolek
4d7438d30a cql3: expr: Add get_subscripted_column
Even though the new subscript allows for subscripting anything,
the only thing that is really allowed to be subscripted is a column.

Add a utility function that extracts the column_value
from an expression with is a column_value or subscript.

It will came in handy in the following commits.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 21:56:30 +01:00
Benny Halevy
132c9d5933 main: shutdown: do not abort on certain system errors
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.

The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.

This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind.  Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
2022-02-27 16:26:48 +02:00
Nadav Har'El
5df6e56fbf Update seastar submodule
* seastar 2849a8a8...1d81c8e5 (3):
  > Merge "make semaphore and shared_promise abortable" from Gleb
  > Fix io_tester.cc compilation with clang
  > Revert "Merge "make semaphore and shared_promise abortable" from Gleb"
2022-02-27 13:00:41 +02:00
Eliran Sinvani
4eb0398457 prepared_statements: Invalidate batch statement too
It seams that batch prepared statements always return false for
depends_on, this in turn renders the removal criteria from the
prepared statements cache to always be false which result by the
queries not being evicted.
Here we change the function to return the true state meaning,
they will return true if one of the sub queries is dependant
upon the keyspace and/ or column family.

Fixes #10129

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-02-27 11:48:03 +02:00
Eliran Sinvani
bf50dbd35b cql3 statements: Change dependency test API to express better it's
purpose

Cql statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The former, took as a parameter only a table
name, which makes no sense. There could be multiple tables with the same
name each in a different keyspace and it doesn't make sense to
generalize the test - i.e to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls to one - depends on that takes a
keyspace name and optionally also a table name, that way every logical
dependency tests that makes sense is supported by a single API call.
2022-02-27 11:48:03 +02:00
Jan Ciolek
a5bcd4f7f2 cql3: expr: Add subscript struct
Add a struct called subscript, which will be used in expression
variant to represent subscripted values e.g col[x], val[sub].
It will replace the sub field of column_value.
Having a separate struct in AST for this purpose
is cleaner and allows to express subscripting
values other than column_value.

It is not added to the expression variant yet, because
that would require immediately implementing all visitors.

The following commits will implement individual visitors
and then subscript will finally be added to expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-27 01:32:59 +01:00
MaciekCisowski
439001b8c2 service_level_controller: fix small typo in exception message
Closes #10136
2022-02-26 22:23:26 +02:00
Tomasz Grabiec
7719f4cd91 Merge "Group 0 discovery: persist and restore peers" from Kamil
We add a `peers()` method to `discovery` which returns the peers
discovered until now (including seeds). The caller of functions which
return an output -- `tick` or `request` -- is responsible for persisting
`peers()` before returning the output of `tick`/`request` (e.g. before
sending the response produced by `request` back). The user of
`discovery` is also responsible for restoring previously persisted peers
when constructing `discovery` again after a restart (e.g. if we
previously crashed in the middle of the algorithm).

The `persistent_discovery` class is a wrapper around `discovery` which
does exactly that.

For storage we use a simple local table.

A simple bugfix is also included in the first patch.

* kbr/discovery-persist-v3:
  service: raft: raft_group0: persist discovered peers and restore on restart
  db: system_keyspace: introduce discovery table
  service: raft: discovery: rename `get_output` to `tick`
  service: raft: discovery: stop returning peer_list from `request` after becoming leader
2022-02-25 17:23:08 +01:00
Avi Kivity
ff2cd72766 Merge 'utils: cached_file: Fix alloc-dealloc mismatch during eviction' from Tomasz Grabiec
cached_page::on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback installed by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happily deallocate these, thinking that they these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.

The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.

The series also adds two safety checks to LSA to catch such problems earlier.

Fixes #10056

\cc @slivne @bhalevy

Closes #10130

* github.com:scylladb/scylla:
  lsa: Abort when trying to free a standard allocator object not allocated through the region
  lsa: Abort when _non_lsa_memory_in_use goes negative
  tests: utils: cached_file: Validate occupancy after eviction
  test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
  utils: cached_file: Fix alloc-dealloc mismatch during eviction
2022-02-25 18:19:04 +02:00
Botond Dénes
daf0f7cee5 tools/types: update main description
Remove examples and instead point user to action-specific help for more
information about specific actions.
2022-02-25 15:02:07 +02:00
Botond Dénes
af19d5ccf1 tools/scylla-types: per-action help content
Just like scylla-sstable, have a separate --help content for reach
action. The existing description is shortened and is demoted to summary:
this now only appears in the listing in the main description.
2022-02-25 15:01:02 +02:00
Botond Dénes
629a5c3ed6 tools/scylla-types: description: remove -- from action listing
Actions are commands, not switches now, update the listing in the
description accordingly.
2022-02-25 15:00:47 +02:00
Botond Dénes
05bd6b2bce tools/scylla-types: use fmt::print() instead of std::cout <<
`std::cout <<` makes for very hard-to-read (and hard-to-write) code.
Replace with `fmt::print()`.
2022-02-25 15:00:21 +02:00
Botond Dénes
d8833de3bb Merge "Redefine Compaction Backlog to tame compaction aggressiveness" From Raphael S. Carvalho
"
Problem statement
=================
Today, compaction can act much more aggressive than it really has to, because
the strategy and its definition of backlog are completely decoupled.

The backlog definition for size-tiered, which is inherited by all
strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the
world must reach the state of zero amplification. But that's unrealistic and
goes against the intent amplification defined by the compaction strategy.
For example, size tiered is a write oriented strategy which allows for extra
space amplification for compaction to keep up with the high write rate.

It can be seen today, in many deployments, that compaction shares is either
close to 1000, or even stuck at 1000, even though there's nothing to be done,
i.e. the compaction strategy is completely satisfied.
When there's a single sstable per tier, for example.
This means that whenever a new compaction job kicks in, it will act much more
aggressive because of the high shares, caused by false backlog of the existing
tables. This translates into higher P99 latencies and reduced throughput.

Solution
========
This problem can be fixed, as proposed in the document "Fixing compaction
aggressiveness due to suboptimal definition of zero backlog by controller" [1],
by removing backlog of tiers that don't have to be compacted now, like a tier
that has a single file. That's about coupling the strategy goal with the
backlog definition. So once strategy becomes satisfied, so will the controller.

Low-efficiency compaction, like compacting 2 files only or cross-tier, only
happens when system is under little load and can proceed at a slower pace.
Once efficient jobs show up, ongoing compactions, even if inefficient, will get
more shares (as efficient jobs add to the backlog) so compaction won't fall
behind.

With this approach, throughput and latency is improved as cpu time is no longer
stolen (unnecessarily) from the foreground requests.

[1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ

Results
=======
Test sequentially populates 3 tables and then run a mixed workload on them,
where disk:memory ratio (usage) reaches ~30:1 at the peak.

Please find graphs here:
https://user-images.githubusercontent.com/1409139/153687219-32368a35-ac63-461b-a362-64dbe8449a00.png

1) Patched version started at ~01:30
2) On population phase, throughput increase and lower P99 write latency can be
clearly observed.
3) On mixed phase, throughput increase and lower P99 write and read latency can
also be clearly observed.
4) Compaction CPU time sometimes reach ~100% because of the delay between each
loader.
5) On unpatched version, it can be seen that backlog keeps growing even when
though strategies become satisfied, so compaction is using much more CPU time
in comparison. Patched version correctly clears the backlog.

Can also be found at:
github.com/raphaelsc/scylla.git compaction-controller-v5

tests: UNIT(dev, debug).
"

* 'compaction-controller-v5' of https://github.com/raphaelsc/scylla:
  tests: Add compaction controller test
  test/lib/sstable_utils: Set bytes_on_disk for fake SSTables
  compaction/size_tiered_backlog_tracker.hh: Use unsigned type for inflight component
  compaction: Redefine compaction backlog to tame compaction aggressiveness
  compaction_backlog_tracker: Batch changes through a new replacement interface
  table: Disable backlog tracker when stopping table
  compaction_backlog_tracker: make disable() public
  compaction_backlog_tracker: Clear tracker state when disabled
  compaction: Add normalized backlog metric
  compaction: make size_tiered_compaction_strategy static
2022-02-25 09:21:08 +02:00
Pavel Emelyanov
40078a6f8c types.hh: Nitpick on <=> usage
tri_compare_opt can avoid casting bool to int for spaceshipping
int - int <=> 0 looks nicer and shorter as int <=> int
data_type::compare from serialized_tri_compare already returns strong_ordering

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220224125556.13138-1-xemul@scylladb.com>
2022-02-25 07:26:11 +02:00
Nadav Har'El
c26230943b alternator ttl: add metrics
This patch adds metrics to the Alternator TTL feature (aka the "expiration
service").

I put these metrics deliberately in their own object in ttl.{hh,cc}, and
also with their own prefix ("expiration_*") - and *not* together with the
rest of the Alternator metrics (alternator/stats.{hh,cc}). This is
because later we may want to use the expiration service not only in
Alternator but also in CQL - to support per-item expiration with CDC
events also in CQL. So the implementation of this feature should not be
too tangled with that of Alternator.

The patch currently adds four metrics, and opens the path to easily add
more in the future. The metrics added now are:

1. scylla_expiration_scan_passes:  The number of scan passes over the
       entire table. We expect this to grow by 1 every
       alternator_ttl_period_in_seconds seconds.

2. scylla_expiration_scan_table: The number of table scans. In each scan
       pass, we scan all the tables that have the Alternator TTL feature
       enabled. Each scan of each table is counted by this counter.

3. scylla_expiration_items_deleted: Counts the number of items that
       the expiration service expired (deleted). Please remember that
       each item is considered for expiration - and then expired - on
       only one node, so each expired item is counted only once - not
       RF times.

4. scylla_expiration_secondary_ranges_scanned: If this counter is
       incremented, it means this node took over some other node's
       expiration scanning duties while the other node was down.

This patch also includes a couple of unrelated comment fixes.

I tested the new metrics manually - they aren't yet tested by the
Alternator test suite because I couldn't make up my mind if such
tests would belong in test_ttl.py or test_metrics.py :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220224092419.1132655-1-nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Asias He
ec59f7a079 repair: Do not flush hints and batchlog if tombstone_gc_mode is not repair
The flush of hints and batchlog are needed only for the table with
tombstone_gc_mode set to repair mode. We should skip the flush if the
tombstone_gc_mode is not repair mode.

Fixes #10004

Closes #10124
2022-02-25 07:26:11 +02:00
Nadav Har'El
d1b4cbfbc3 test/cql-pytest: add reproducer for LWT bug with static-column conditions
This patch adds a reproducing test for issue #10081. That issue is about
a conditional (LWT) UPDATE operation that chose a non-existent row via WHERE,
and its condition refers to both static and regular columns: In that case,
the code incorrectly assumes that because it didn't read any row, all columns
are null - and forgets that the static column is *not* null.

The test, test_lwt.py::test_lwt_missing_row_with_static
passes on Cassandra but fails on Scylla, so is marked xfail.

Refs #10081

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220215215243.660087-1-nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Avi Kivity
8f2bc838af Update seastar submodule
* seastar ea6a6820ed...2849a8a8ba (1):
  > Merge "make semaphore and shared_promise abortable" from Gleb

include fixup from Gleb added.
2022-02-25 07:26:11 +02:00
Benny Halevy
e2894bc762 compaction_manager: task: use plain UUID
Now that a null uuid is defined to be logically false
there's no need to use an optional UUID.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-25 07:26:11 +02:00
Nadav Har'El
db7b11cfc4 alternator: make TTL expiration scanner bypass cache
The background scan for expired Alternator items (the TTL feature)
should bypass the cache to avoid poluting it with the entire content
of the table being scanned.

I tested that the flag added in this patch really works by adding a printout
to the code in table.cc which creates the reader. Although we do have a
metric for uses of BYPASS CACHE, unfortunately this metric counts usage
of "BYPASS CACHE" in CQL statements - and not does not account the low-
level calls that we use in the ttl scanner.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Nadav Har'El
e06b5d9306 alternator: updated compatibility.md about TTL feature
The document docs/alternator/compatibility.md suggested that Alternator
does not support the TTL feature at all. The real situation is more
optimistic - this feature is supported, but as experimental feature.
So let's update compatibility.md with the real status of this feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Nadav Har'El
49a8164fb7 alternator: add configurable scan period to TTL expiration
Before this patch, the experimental TTL (expiration time) feature in
Alternator scans tables for expiration in a tight loop - starting the
next scan one second after the previous one completed.

In this patch we introduce a new configuration option,
alternator_ttl_period_in_seconds, which determines how frequently
to start the scan. The default is 24 hours - meaning that the next
scan is started 24 hours after the previous one started.

The tests (test/alternator/run) change this configuration back to one
second, so that expiration tests finish as quickly as possible.

Please note that the scan is *not* slowed down to fill this 24 hours -
if it finishes in one hour, it will then sleep for 23 hours. Additional
work would be needed to slow down the scan to not finish too quickly.
One idea not yet implemented is to move the expiration service from
the "maintenance" scheduling group which it uses today to a new
scheduling group, and modifying the number of shares that this group
gets.

Another thing worth noting about the configurable period (which defaults
to 24 hours) is that when TTL is enabled on an Alternator table, it can
take that amount of time until its scan starts and items start expiring
from it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Tomasz Grabiec
1d75a8c843 lsa: Abort when trying to free a standard allocator object not
allocated through the region

It indicates alloc-dealloc mismatch, and can cause other problems in
the systems like unable to reclaim memory. We want to catch this at
the deallocation site to be able to quickly indentify the offender.

Misbehavior of this sort can cause fake OOMs due to underflow of
_non_lsa_memory_in_use. When it underflows enough,
shard_segment_pool.total_memory() will become 0 and memory reclamation
will stop doing anything.

Refs #10056
2022-02-25 01:42:15 +01:00
Tomasz Grabiec
9dd4153c16 lsa: Abort when _non_lsa_memory_in_use goes negative
It indicates alloc-dealloc mismatch, and can cause other problems in
the systems like unable to reclaim memory. Catch early.

Refs #10056
2022-02-25 01:42:15 +01:00
Tomasz Grabiec
ca09a72597 tests: utils: cached_file: Validate occupancy after eviction
Reproducer for #10056

Catches alloc-dealloc mismatch leading to the underflow of
_non_lsa_memory_in_use.
2022-02-25 01:42:15 +01:00
Tomasz Grabiec
b0d5bb334c test: sstable_partition_index_cache_test: Fix alloc-dealloc mismatch
The test was allocating entries in the standard allocator, but they
are evicted in the LSA allocator context.

Fix by allocating under LSA.
2022-02-25 01:42:15 +01:00
Raphael S. Carvalho
2a7939ee4d tests: Add compaction controller test
There's no automated test for controller, it's time to have one.
Let's start with a basic one that verifies the assumption that
perfectly compacted tiers should produce 0 backlog.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 18:57:45 -03:00
Raphael S. Carvalho
96cfe7d530 test/lib/sstable_utils: Set bytes_on_disk for fake SSTables
Not precise, as bytes_on_disk accounts for all components, but good enough
for testing purposes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 18:57:45 -03:00
Raphael S. Carvalho
a8caa67937 compaction/size_tiered_backlog_tracker.hh: Use unsigned type for inflight component
For describing data size, we use unsigned types.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 18:57:45 -03:00
Raphael S. Carvalho
1d9f53c881 compaction: Redefine compaction backlog to tame compaction aggressiveness
Today, compaction can act much more aggressive than it really has to, because
the strategy and its definition of backlog are completely decoupled.

The backlog definition for size-tiered, which is inherited by all
strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the
world must reach the state of zero amplification. But that's unrealistic and
goes against the intent amplification defined by the compaction strategy.
For example, size tiered is a write oriented strategy which allows for extra
space amplification for compaction to keep up with the high write rate.

It can be seen today, in many deployments, that compaction shares is either
close to 1000, or even stuck at 1000, even though there's nothing to be done,
i.e. the compaction strategy is completely satisfied.
When there's a single sstable per tier, for example.
This means that whenever a new compaction job kicks in, it will act much more
aggressive because of the high shares, caused by false backlog of the existing
tables. This translates into higher P99 latencies and reduced throughput.

Solution
========
This problem can be fixed, as proposed in the document "Fixing compaction
aggressiveness due to suboptimal definition of zero backlog by controller" [1],
by removing backlog of tiers that don't have to be compacted now, like a tier
that has a single file. That's about coupling the strategy goal with the
backlog definition. So once strategy becomes satisfied, so will the controller.

Low-efficiency compaction, like compacting 2 files only or cross-tier, only
happens when system is under little load and can proceed at a slower pace.
Once efficient jobs show up, ongoing compactions, even if inefficient, will get
more shares (as efficient jobs add to the backlog) so compaction won't fall
behind.

With this approach, throughput and latency is improved as cpu time is no longer
stolen (unnecessarily) from the foreground requests.

[1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ

Fixes #4588.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 18:57:38 -03:00
Raphael S. Carvalho
ddd693c6d7 compaction_backlog_tracker: Batch changes through a new replacement interface
This new interface allows table to communicate multiple changes in the
SSTable set with a single call, which is useful on compaction completion
for example.
With this new interface, the size tiered backlog tracker will be able to
know when compaction completed, which will allow it to recompute tiers
and their backlog contribution, if any. Without it, tiered tracker
would have to recompute tiers for every change, which would be terribly
expensive.
The old remove/add interface are being removed in favor of the new one.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 15:34:16 -03:00
Pavel Emelyanov
3f884fbdd7 sstables: Remove excessive type-match assertions
The primitive_consumer method templates overcomplicate the
declaration of the fact that one of the method arguments is
the sub-type of a template argument

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:49:20 +03:00
Pavel Emelyanov
b1843e50de mutation_reader: Sanitize invocable asserion and concept
There are both in the filtering_reader template, leave only
the concept and convert it into one-line invocable check

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:48:37 +03:00
Pavel Emelyanov
ffbf19ee3c code: Convert is_future result_of assertions into invoke_result concept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:47:32 +03:00
Pavel Emelyanov
645896335d code: Convert is_same+result_of assertions into invocable concepts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:46:10 +03:00
Pavel Emelyanov
063da81ab7 code: Convert nothrow construction assertions into concepts
The small_vector also has N>0 constraint that's also converted

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:44:50 +03:00
Pavel Emelyanov
b8401f2ddd code: Convert is_integral assertions to concepts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:44:29 +03:00
Raphael S. Carvalho
84d843697b table: Disable backlog tracker when stopping table
Backlog tracker is managed by compaction strategy, and we'd like to
have it disabled in table::stop(), to make sure that all state is
cleared. For example, a reference to a shared sstable, in the
tracker implementation, could prevent the sstable manager from being
stopped as it relies on all sstables managed by it being closed
first. By calling tracker's disable() method, table::stop() will
guarantee that state is cleared by completion.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 13:41:05 -03:00
Raphael S. Carvalho
26350c8591 compaction_backlog_tracker: make disable() public
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 13:40:50 -03:00
Raphael S. Carvalho
c15e055612 compaction_backlog_tracker: Clear tracker state when disabled
If the tracker is disabled, we never get to access the underlying
implementation anymore. It makes sense to clear _impl on
disable(). So table::stop() can call its backlog tracker's disable
method, clearing all its state. This is important for clean
shutdown, as any sstable in tracker state may cause sstable
manager to hang when being stopped.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 13:40:39 -03:00
Raphael S. Carvalho
a70ce7ecb3 compaction: Add normalized backlog metric
Normalized backlog metric is important for understanding the controller
behavior as the controller acts on normalized backlog for yielding an
output, not the raw backlog value in bytes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 13:40:33 -03:00
Raphael S. Carvalho
89eb563c94 compaction: make size_tiered_compaction_strategy static
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 13:40:29 -03:00
Tomasz Grabiec
e68cf55514 utils: cached_file: Fix alloc-dealloc mismatch during eviction
on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback instaled by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happilly deallocate these, thinking that they these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.

The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.

Fixes #10056
2022-02-23 18:38:05 +01:00
Asias He
680195564d repair: Unify repair uuid report in the log
More and more places are using the repair[uuid]: format for logging
repair jobs with the uuid. Convert more places to use the new format to
unify the log format.

This makes it easier to grep a specific repair job in the log.

Closes #10125
2022-02-23 09:13:12 +02:00
Avi Kivity
cbba80914d memtable: move to replica module and namespace
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.

Memtables are also used outside the replica, in two places:
 - in some virtual tables; this is also in some way inside the replica,
   (virtual readers are installed at the replica level, not the
   cooordinator), so I don't consider it a layering violation
 - in many sstable unit tests, as a convenient way to create sstables
   with known input. This is a layering violation.

We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).

Test: unit (dev)

Closes #10120
2022-02-23 09:05:16 +02:00
Avi Kivity
5d4213e1b8 Update seastar submodule
* seastar c18cc5dc68...ea6a6820ed (7):
  > Merge 'json/formatter: Escape strings' from Juliusz Stasiewicz
Fixes #9061.
  > Merge "Export IO rate-limiter tokens metrics" from Pavel E
  > Merge "Fix block device configuration and RWF_NOWAIT support" from Pavel E
  > code: Sanitize fs magic values usage
  > Fix build error with c++17
  > coroutine: introduce coroutine::switch_to
  > Broader catching of bpo command-line parsing errors in app-template
2022-02-22 20:58:25 +03:00
Avi Kivity
75fb45df1b Merge 'Propagate CQL coordinator timeouts and failures for reads' from Piotr Dulikowski
This PR propagates the read coordinator logic so that read timeout and read failure exceptions are propagated without throwing on the coordinator side.

This PR is only concerned with exceptions which were originally thrown by the coordinator (in read resolvers). Exceptions propagated through RPC and RPC timeouts will still throw, although those exceptions will be caught and converted into exceptions-as-values by read resolvers.

This is a continuation of work started in #10014.

Results of `perf_simple_query --smp 1 --operations-per-shard 1000000` (read workload), compared with merge base (10880fb0a7):

```
BEFORE:
125085.13 tps ( 80.2 allocs/op,  12.2 tasks/op,   49010 insns/op)
125645.88 tps ( 80.2 allocs/op,  12.2 tasks/op,   49008 insns/op)
126148.85 tps ( 80.2 allocs/op,  12.2 tasks/op,   49005 insns/op)
126044.40 tps ( 80.2 allocs/op,  12.2 tasks/op,   49005 insns/op)
125799.75 tps ( 80.2 allocs/op,  12.2 tasks/op,   49003 insns/op)

AFTER:
127557.21 tps ( 80.2 allocs/op,  12.2 tasks/op,   49197 insns/op)
127835.98 tps ( 80.2 allocs/op,  12.2 tasks/op,   49198 insns/op)
127749.81 tps ( 80.2 allocs/op,  12.2 tasks/op,   49202 insns/op)
128941.17 tps ( 80.2 allocs/op,  12.2 tasks/op,   49192 insns/op)
129276.15 tps ( 80.2 allocs/op,  12.2 tasks/op,   49182 insns/op)
```

The PR does not introduce additional allocations on the read happy-path. The number of instructions used grows by about 200 insns/op. The increase in TPS is probably just a measurement error.

Closes #10092

* github.com:scylladb/scylla:
  indexed_table_select_statement: return some exceptions as exception messages
  result_combinators: add result_wrap_unpack
  select_statement: return exceptions as errors in execute_without_checking_exception_message
  select_statement: return exceptions without throwing in do_execute
  select_statement: implement execute_without_checking_exception_message
  select_statement: introduce helpers for working with failed results
  query_pager: resultify relevant methods
  storage_proxy: resultify (do_)query
  storage_proxy: resultify query_singular
  storage_proxy: propagate failed results through query_partition_key_range
  storage_proxy: resultify query_partition_key_range_concurrent
  storage_proxy: modify handle_read_error to also handle exception containers
  abstract_read_executor: return result from execute()
  abstract_read_executor: return and handle result from has_cl()
  storage_proxy: resultify handling errors from read-repair
  abstract_read_executor::reconcile: resultise handling of data_resolver->done()
  abstract_read_executor::execute: resultify handling of data_resolver->done()
  result_combinators: add result_discard_value
  abstract_read_executor: resultify _result_promise
  abstract_read_executor: return result from done()
  abstract_read_resolver: fail promises by passing exception as value
  abstract_read_resolver: resultify promises
  exceptions: make it possible to return read_{timeout,failure}_exception as value
  result_try: add as_inner/clone_inner to handle types
  result_try: relax ConvertWithTo constraint
  exception_container: switch impl to std::shared_ptr and make copyable
  result_loop: add result_repeat
  result_loop: add result_do_until
  result_loop: add result_map_reduce
  utils/result: add utilities for checking/creating rebindable results
2022-02-22 20:58:25 +03:00
Nadav Har'El
eec39e1258 Merge 'api: keyspace_scrub: validate params' from Benny Halevy
Refs #10087

Add validation of all params for the keyspace_scrub api.
The validation method is generic and should be used by all apis eventually,
but I'm leaving that as follow-up work.

While at it, fixed the exception types thrown on invalid `scrub_mode` or `quarantine_mode` values from `std::runtime_error` to `httpd::bad_param_exception` so to generate the `bad_request` http status.

And added unit tests to verify that, and the handling of an unknown parameter.

Test: unit(dev)
DTest: nodetool_additional_test.py::TestNodetool::{test_scrub_with_one_node_expect_data_loss,test_scrub_with_multi_nodes_expect_data_rebuild,test_scrub_sstable_with_invalid_fragment,test_scrub_ks_sstable_with_invalid_fragment,test_scrub_segregate_sstable_with_invalid_fragment,test_scrub_segregate_ks_sstable_with_invalid_fragment}

Closes #10090

* github.com:scylladb/scylla:
  api: storage_service: scrub: validate parameters
  api: storage_service: refactor parse_tables
  api: storage_service: refactor validate_keyspace
  test: rest_api: add test_storage_service_keyspace_scrub tests
  api: storage_service: scrub: throw httpd::bad_param_exception for invalid param values
2022-02-22 20:58:25 +03:00
Nadav Har'El
364bd00136 test/cql-pytest: confirm that table names cannot include non-Latin letters
In CQL table names must be composed only of letters, digits, or underscores,
but some Cassandra documentation is unclear whether these "letters" refer only
to the Latin alphabet, or maybe UTF-8 names composed of letters in other
alphabets should be allowed too.

This patch adds a test that confirms that both Scylla and Cassandra only
accept the Latin alphabet in table names, and for example UTF-8 names
with French or Hebrew letters are rejected.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220222134220.972413-1-nyh@scylladb.com>
2022-02-22 20:58:25 +03:00
Nadav Har'El
1a940a1003 test/cql-pytest: remove "xfail" mark from scientific-notation tests that now pass
After issue #10100 was fixed, the two tests reproducing it now pass,
so remove their "xfail" marker.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220222131809.970592-1-nyh@scylladb.com>
2022-02-22 20:58:25 +03:00
Nadav Har'El
be84a8def3 Merge 'Allow integers in scientific format in INSERT JSON ' from Piotr Grabowski
Add support for specifing integers in scientific format (for example 1.234e8) in INSERT JSON statement:

```
INSERT INTO table JSON '{"int_column": 1e7}';
```

Before the JSON parsing library was switched to RapidJSON from JsonCpp, this statement used to work correctly, because JsonCpp transparently casts double to integer value.

Inserting a floating-point number ending with .0 is allowed, as the fractional part is zero. Non-zero fractional part (for example 12.34) is disallowed. A new test is added to test all those behaviors.

This behavior differs from Cassandra, which disallows those types of numbers (1e7, 123.0 and 12.34), however some users rely on that behavior and JSON specification itself does not distinct between floating-point numbers and integer numbers (only a single "number" type is defined).

This PR also fixes two minor issues I noticed while looking at the code: wrong blob validation and missing `IsString()` checks that could result in assertion error.

Fixes #10100
Fixes #10114
Fixes #10115

Closes #10101

* github.com:scylladb/scylla:
  type_json: support integers in scientific format
  type_json: add missing IsString() checks
  type_json: fix wrong blob JSON validation
2022-02-22 20:58:25 +03:00
Botond Dénes
3aa05f7f03 Merge "Make system.clients table virtual" from Pavel Emelyanov
"
The table lists connected clients. For this the clients are
stored in real table when they connect, update their statuses
when needed and remove^w tombstone themselves when they
disconnect. On start the whole table is cleared.

This looks weird. Here's another approach (inspired by the
hackathon project) that makes this table a pure virtual one.
The schema is preserved so is the data returned.

The benefits of doing it virtual are

- no on-disk updates while processing clients
- no potentially failing updates on non-failing disconnect
- less usage of the global qctx thing
- less calls to global storage_proxy
- simpler support for thrift and alternator clients (today's
  table implementation doesn't track them)
- the need to make virtual tables reg/unreg dynamic

branch: https://github.com/xemul/scylla/tree/br-clients-virtual-table-4
tests: manual(dev), unit(dev)

The manual test used 80-shards node and 1M connections from
1k different IP addresses.
"

* 'br-clients-virtual-table-4' of https://github.com/xemul/scylla:
  test: Add cql-pytest sanity test for system.clients table
  client_data: Sanitize connection_notifier
  transport: Indentation fix after previous patch
  code: Remove old on-disk version of system.clients table
  system_keyspace: Add clients_v virtual table
  protocol_server: Add get_client_data call
  transport: Track client state for real
  transport: Add stringifiers to client_data class
  generic_server: Gentle iterator
  generic_server: Type alias
  docs: Add system.clients description
2022-02-22 20:58:25 +03:00
Piotr Dulikowski
ddf049738d indexed_table_select_statement: return some exceptions as exception messages
Adjusts the indexed_table_select_statement so that it uses the
result-aware methods in storage_proxy and propagates failed results as
result_message::exception.
2022-02-22 16:25:21 +01:00
Piotr Dulikowski
091b20019b result_combinators: add result_wrap_unpack
Adds a helper combinator utils::result_wrap_unpack which, in contrast to
utils::result_wrap, uses futurize_apply instead of futurize_invoke to
call the wrapped callable.

In short, if utils::result_wrap is used to adapt code like this:

    f.then([] {})
      ->
    f_result.then(utils::result_wrap([] {}))

Then utils::result_wrap_unpack works for the following case:

    f.then_unpack([] (arg1, arg2) {})
      ->
    f_result.then(utils::result_wrap_unpack([] (arg1, arg2) {}))
2022-02-22 16:25:21 +01:00
Piotr Dulikowski
c5bcfee28f select_statement: return exceptions as errors in execute_without_checking_exception_message
Modifies the remaining logic of execute_without... (apart from the
do_execute call) so that the result-aware versions of storage_proxy's
methods are called and failed results are converted to
result_message::exception.
2022-02-22 16:25:21 +01:00
Piotr Dulikowski
5106c60cd0 select_statement: return exceptions without throwing in do_execute
Modifies do_execute so that it uses the result-aware versions of the
query_pager's methods and returns them as result_message::exception.
2022-02-22 16:25:21 +01:00
Piotr Dulikowski
3a4d3f3175 select_statement: implement execute_without_checking_exception_message
The select_statement will be able to propagate coordinator failures
without throwing, so it's important to override the default
implementations of execute and excecute_without... so that the first
calls the latter and not the other way around.
2022-02-22 16:25:21 +01:00
Piotr Dulikowski
df7668797b select_statement: introduce helpers for working with failed results
Adds:

- Includes for result-related helper methods (to be used in later
  commits),
- Alias for coordinator_result,
- The wrap_result_to_error_message function - a bit similar to
  utils::result_wrap. Adapts a callable T -> shared_ptr<result_message>
  to take result<T> -> shared_ptr<result_message>. If the result is
  failed, it converts it into result_message::exception and returns.
2022-02-22 16:25:21 +01:00
Piotr Dulikowski
c96c8e4813 query_pager: resultify relevant methods
Now, the relevant methods of all query pagers properly propagate failed
results.
2022-02-22 16:25:21 +01:00
Piotr Dulikowski
e5922e650e storage_proxy: resultify (do_)query
Adjusts do_query so that it propagates and returns failed results. The
query_result method is added which is result-aware, and the old query
method was changed to call query_result.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
e39c5b6eba storage_proxy: resultify query_singular
Now, query_singular propagates and returns failed results without
rethrowing them.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
2f5f746ae2 storage_proxy: propagate failed results through query_partition_key_range
Now, query_partition_key_range propagates the failed result from
query_partition_key_range_concurrent.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
608032b2b5 storage_proxy: resultify query_partition_key_range_concurrent
Now, query_partition_key_range_concurrent propagates and returns
exceptions as values, if possible.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
10923d9d58 storage_proxy: modify handle_read_error to also handle exception containers
Now, storage_proxy::handle_read_error can work with both exception
containers and exception_ptrs.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
89fe804a1a abstract_read_executor: return result from execute() 2022-02-22 16:08:52 +01:00
Piotr Dulikowski
15fa5e30f5 abstract_read_executor: return and handle result from has_cl()
The has_cl() method is changed to return a future with a result. The
result returned from has_cl() is handled without throwing.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
68b5b84fbe storage_proxy: resultify handling errors from read-repair
Now, failed results returned from read-repair are handled without
throwing.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
dd860c70ce abstract_read_executor::reconcile: resultise handling of data_resolver->done()
Now, the logic of handling exceptions returned in reconcile() from
data_resolver->done() was changed so that the failed result does not
need to be converted to an exceptional future.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
5accfd8dae abstract_read_executor::execute: resultify handling of data_resolver->done()
Now, the logic of handling exceptions returned in execute() from
data_resolver->done() was changed so that the failed result does not
need to be converted to an exceptional future.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
a304fbfed3 result_combinators: add result_discard_value
Adds a utils::result_discard_value, which is an alternative to
future::discard_result which just ignores the "success" value of the
provided result and does not ignore the exception.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
ee2c4725c3 abstract_read_executor: resultify _result_promise
Adjusts the type of _result_promise so that it holds a result.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
e7f960d041 abstract_read_executor: return result from done() 2022-02-22 16:08:52 +01:00
Piotr Dulikowski
28d562ddf6 abstract_read_resolver: fail promises by passing exception as value
Now, on read timeouts and failures, _cl_promise and _done_promise is set
to a failed result instead of an exceptional promise.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
5438973c9d abstract_read_resolver: resultify promises
Changes the types of _done_promise and _cl_promise so that they hold a
result.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
4c58683102 exceptions: make it possible to return read_{timeout,failure}_exception as value
Adds read_timeout_exception and read_failure exception to the list of
exceptions supported by the coordinator_exception_container.

Those exceptions are not yet returned-as-value anywhere, but they will
be in the commits that follow.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
1e1f5b4a48 result_try: add as_inner/clone_inner to handle types
Adds two methods to result_try's exception handles:

- as_inner: returns a {l,r}-value reference either to the exception
  container, or the exception_ptr. This allows to use them in operations
  which work on both types, e.g. logging.
- clone_inner: returns a copy of the underlying exception container or
  exception ptr.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
1ed416906e result_try: relax ConvertWithTo constraint
Currently, the catch handlers in result_futurize_try are required to
return a future, although they are always being called with
seastar::futurize_invoke, so if their result is not future it could be
converted to one anyway. This commit relaxes the ConvertsWithTo
constraint in order to allow this conversion.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
e87cf08591 exception_container: switch impl to std::shared_ptr and make copyable
The exception_container is supposed to be a cheaper, but possibly harder
to use alternative to std::exception_ptr. Before this commit, the
exception was kept behind foreign_ptr<std::unique_ptr<>> so that moving
the container is very cheap. However, the original std::exception_ptr
supports copying in a thread-safe manner, and it turns out that some of
the read coordinator logic intentionally copies the pointer in order to
be able to fail two different promises with the same exception.

The pointer type is changed to std::shared_ptr. Although it uses atomics
for reference counting, this is also probably what std::exception_ptr
does, so the performance should not be worse. The exception stored
inside the container is immutable, so this allows for a non-throwing
implementation of copying.

To encourage moves instead of copying, the copy constructor is deleted
and instead the `clone()` method should be used if it is really
necessary.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
7afea88dfc result_loop: add result_repeat
Adds a result-aware counterpart to seastar::repeat. The new function
does not base on seastar::repeat, but rather is a rewrite of the
original (using a coroutine instead of an open-coded task). The main
consequence of using a coroutine is that exceptions from AsyncAction
need to be thrown once more.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
32cbc89779 result_loop: add result_do_until
Adds a result-aware counterpart to seastar::do_until. The new function
does not base on seastar::do_until, but rather is a rewrite of the
original (using a coroutine instead of an open-coded task). The main
consequence of using a coroutine is that exceptions from StopCondition
or AsyncAction need to be thrown once more.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
4f0a98a829 result_loop: add result_map_reduce
Adds result-aware counterparts to all seastar::map_reduce overloads.

Fortunately, it was possible to implement the functions by basing them
on seastar::map_reduce and get the same number of allocation. The only
exception happens when reducer::get() returns a non-ready future, which
doesn't seem to happen on the read coordinator path.
2022-02-22 16:08:52 +01:00
Piotr Dulikowski
b3a0480439 utils/result: add utilities for checking/creating rebindable results
Adds:

- ResultRebindableTo<L, R>: concept which is satisfied by a pair of
  results which do not necessarily share the same value, but have the
  same error and policy types; a failed result L can be converted to a
  failed result R.
- rebind_result<T, R>: given a value type T and another result R,
  returns a result which can hold T as value and both the same error and
  policy as R.
2022-02-22 16:08:45 +01:00
Piotr Grabowski
efe7456f0a type_json: support integers in scientific format
Add support for specifing integers in scientific format (for example
1.234e8) in INSERT JSON statement:

INSERT INTO table JSON '{"int_column": 1e7}';

Inserting a floating-point number ending with .0 is allowed, as
the fractional part is zero. Non-zero fractional part (for example
12.34) is disallowed. A new test is added to test all those behaviors.

Before the JSON parsing library was switched to RapidJSON from JsonCpp,
this statement used to work correctly, because JsonCpp transparently
casts double to integer value.

This behavior differs from Cassandra, which disallows those types of
numbers (1e7, 123.0 and 12.34).

Fix typo in if condition: "if (value.GetUint64())" to
"if (value.IsUint64())".

Fixes #10100
2022-02-22 12:55:38 +01:00
Avi Kivity
d1a394fd97 loading_cache: fix indentation of timestamped_val and two nested type aliases
timestamped_val (and two other type aliases) are nested inside loading_cache,
but indented as if they were top-level names. Adjust the indent to
avoid confusion.

Closes #10118
2022-02-22 12:20:36 +02:00
Botond Dénes
2afacf9609 mutation_reader: drop now unused v1 multishard_combining_reader and friends
Friends: shard_reader and evictable_reader. All these have been
supplanted by their respective v2 variants.

Tests: unit(dev)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220222071925.223718-1-bdenes@scylladb.com>
2022-02-22 10:51:08 +03:00
Pavel Emelyanov
dfb980e5f5 Merge 'compaction_manager: allow stopping sleeping tasks' from Benny Halevy
Use exponential_backoff_retry::retry(abort_source&)
when sleeping between retries and request abort
when the task is stopped.

Fixes #10112

Test: unit(dev)

Closes #10113

* github.com:scylladb/scylla:
  compaction_manager: allow stopping sleeping tasks
  compaction_manager: task: add make_compaction_stopped_exception
  compaction_manager: task: refactor stop
2022-02-22 10:39:47 +03:00
Wojciech Mitros
7f590a3686 sstables: index_reader: optimize single partition reads
All entries from a single partition can be found in a
single summary page.
Because of that, in cases when we know we want to read
only one partition, we can limit the underyling file
input_stream to the range of the page.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-02-22 02:16:52 +01:00
Wojciech Mitros
c81992c665 sstables: use read-aheads in the index reader
Currently, when advancing one of index_reader's bounds,
we're creating a new index_consume_entry_context with a new
underlying file input_stream for each new page.

For either bound, the streams can be reused, because
the indexes of pages that we are reading are never
decreasing.

This patch adds a index_consume_entry_context to each of
index_reader's bounds, so that for each new page, the same
file input_stream is used.
As a result, when reading consecutive pages, the reads that
follow the first one can be satisfied by the input_stream's
read aheads, decreasing the number of blocking reads and
increasing the throughput of the index_reader.

Fixes #2388

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-02-22 01:51:33 +01:00
Benny Halevy
57f97046a7 compaction_manager: allow stopping sleeping tasks
Use exponential_backoff_retry::retry(abort_source&)
when sleeping between retries and request abort
when the task is stopped.

Fixes #10112

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-21 21:01:56 +02:00
Benny Halevy
f21b985872 compaction_manager: task: add make_compaction_stopped_exception
Provide a function to make a sstables::compaction_stopped_exception
based on the information in the stopped task.

To be reused by the next patch that will
also throw this exception from the retry sleep path,
when the task is stopped.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-21 18:09:49 +02:00
Benny Halevy
91514c20ec compaction_manager: task: refactor stop
Refactor compaction_manager::task::stop
out of compaction_manager::task_stop.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-21 18:04:06 +02:00
Piotr Grabowski
649ab70936 type_json: add missing IsString() checks
Add missing IsString() checks to parsing date, time, uuid and inet
types by introducing validated_to_string_view function which checks
whether the value is of string type and otherwise throws 
marshal_exception. 

Without this check, a malformed input to those types would result in 
nasty ServerError with RapidJSON assertion instead of marshal_exception
with detail about the problem.

Add new tests checking passing non-string values for those types.

Fixes #10115
2022-02-21 16:58:13 +01:00
Piotr Grabowski
f8b67c9bd1 type_json: fix wrong blob JSON validation
Fixes wrong condition for validating whether a JSON string representing
blob value is valid. Previously, strings such as "6" or "0392fa" would
pass the validation, even though they are too short or don't start with
"0x". Add those test cases to json_cql_query_test.cc.

Fixes #10114
2022-02-21 16:58:12 +01:00
Botond Dénes
10880fb0a7 tools/scylla-sstable: fix description template
Quote '{' and '}' used in CQL example, so format doesn't try to
interpret it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220221140652.173015-1-bdenes@scylladb.com>
2022-02-21 17:14:41 +02:00
Nadav Har'El
7181a6757a test/cql-pytest: add a couple of tests for static columns
This patch adds two tests for two interesting edge cases in the behavior
of static columns in Scylla. We already have a lot of tests for static
columns in other frameworks (C++ unit tests, cql and dtest), but the two
cases here are issues where specifically we weren't sure how Cassandra
behaves in those cases - and this can most easily be checked in the
test/cql-pytest framework.

The first test, test_static_not_selected, is a reproducer for issue #10091.
This issue was reported by a user @aohotnik, who was surprised by the
fact that Scylla returns empty values, instead of nothing, when selecting
regular columns of a non-existent row if the partition has a static
column set. The test demonstrates a difference between Scylla and
Cassandra, so it is marked "xfail" - it passes on Cassandra and fails on
Scylla. If later we decide that both Scylla's and Cassandra's behaviours
are reasonable and both can be considered "correct", we can change this
test to except Scylla's result as well and it will beging to pass.

The second test, test_missing_row_with_static, shows that SELECT of a
non-existent row returns nothing - even if the partition has a static
column. The behavior in this case is identical in Scylla and Cassandra,
so this test passes. This contrasts with the analogous situation in LWT
UPDATE from issue #10081, where the IF condition is expected to see the
static column value.

Refs #10081
Refs #10091

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220220120418.831540-1-nyh@scylladb.com>
2022-02-21 16:04:57 +02:00
Avi Kivity
adc08d0ab9 Merge "Drop v1 input support for mutation compactor" from Botond
"
Currently the mutation compactor supports v1 and v2 output and has a v1
output. The next step is to add a v2 output but this would lead to a
full conversion matrix which we want to avoid. So in preparation we drop
the v1 input support. Most inputs were already v2, but there were some
notable exceptions: tests, the compacting reader and the multishard
query code. The former two was a simple mechanical update but the latter
required some further work because it turned out the v2 version of
evictable reader wasn't used yet and thus it managed to hide some bugs
and dropped features. While at it, we migrate all evictable and
multishard reader users to the v2 variant of the respective readers and
drop the v1 variant completely.
With this the road is open to a v2 compactor output and therefore to a
v2 sstable writer.

Tests: unit(dev, release), dtest(paging_additional_test.py)
"

* 'compact-mutation-v2-only-input/v5' of https://github.com/denesb/scylla:
  test/lib/test_utils: return OK from check() variants
  repair/row_level: use evictable reader v2
  db/view/view_updating_consumer: migrate to v2
  test/boost/mutation_reader_test: add v2 specific evictable reader tests
  test: migrate to evictable reader v2 and multishard combining reader v2
  compact_mutation: drop support for v1 input
  test: pass v2 input to mutation_compaction
  test/boost/mutation_test: simplify test_compaction_data_stream_split test
  mutation_partition: do_compact(): do drop row tombstones covered by higher order tombstones
  multishard_mutation_query: migrate to v2
  mutation_fragment_v2: range_tombstone_change: add memory_usage()
  evictable_reader_v2: terminate active range tombstones on reader recreation
  evictable_reader_v2: restore handling of non-monotonically increasing positions
  evictable_reader_v2: simplify handling of reader recreation
  mutation: counter_write_query: use v2 reader
  mutation: migrate consume() to v2
  mutation_fragment_v2,flat_mutation_reader_v2: mirror v1 concept organization
  mutation_reader: compacting_reader: require a v2 input reader
  db/view/view_builder: use v2 reader
  test/lib/flat_mutation_reader_assertions: adjust has_monotonic_positions() to v2 spec
2022-02-21 14:32:55 +02:00
Botond Dénes
841b982e51 test/lib/test_utils: return OK from check() variants
The various require() and check() methods in test_utils.hh were
introduced to replace BOOST_REQUIRE() and BOOST_CHECK() respectively in
multi-shard concurrent tests, specifically those in
tests/boost/multishard_mutation_query_test.cc.
This was done literally, just replacing BOOST_REQUIRE() with require()
and BOOST_CHECK() with check(). The problem is that check() is missing a
feature BOOST_CHECK() had: while BOOST_CHECK() doesn't cause an
immediate test failure, just logging an error if the condition fails, it
remembers this failure and will fail the test in the end. check() did
not have this feature and this caused potential errors to just be logged
while the test could still pass fine, causing false-positive tests
passes. This patch fixes this by returning a [[nodiscard]] bool from the
check() methods. The caller can & these together over all calls to
check() methods and manually fail the test in the end. We choose this
method over a hidden global (like BOOST_CHECK() does) for simplicity
sake.
2022-02-21 12:29:25 +02:00
Botond Dénes
4aa9b90ba9 repair/row_level: use evictable reader v2 2022-02-21 12:29:24 +02:00
Botond Dénes
05c48ee0cc db/view/view_updating_consumer: migrate to v2
Not a completely mechanical transition. The consumer has to generate its
mutation via a mutation_rebuilder_v2 as mutation fragment v2 cannot be
applied to mutations directly yet.
2022-02-21 12:29:24 +02:00
Botond Dénes
014a23bf2a test/boost/mutation_reader_test: add v2 specific evictable reader tests
One is a reincarnation of the recently removed
test_multishard_combining_reader_non_strictly_monotonic_positions. The
latter was actually targeting the evictable reader but through the
multishard reader, probably for historic reasons (evictable reader was
part of the multishard reader family).
The other one checks that active range tombstones changes are properly
terminated when the partition ends abruptly after recreating the reader.
2022-02-21 12:29:24 +02:00
Botond Dénes
e3c618beba test: migrate to evictable reader v2 and multishard combining reader v2
All reads are now using the v2 version of these readers, test them
instead of the old v1.
2022-02-21 12:29:24 +02:00
Botond Dénes
f1e9e3b3b7 compact_mutation: drop support for v1 input 2022-02-21 12:29:24 +02:00
Botond Dénes
284ed9154f test: pass v2 input to mutation_compaction 2022-02-21 12:29:24 +02:00
Botond Dénes
dec4e5659b test/boost/mutation_test: simplify test_compaction_data_stream_split test
This test has very elaborate infrastructure essentially duplicating
mutation, mutation::apply() and mutation::operator==. Drop all this
extra code and use mutations directly instead. This makes migrating the
test to v2 easier.
2022-02-21 12:29:24 +02:00
Botond Dénes
2941803da0 mutation_partition: do_compact(): do drop row tombstones covered by higher order tombstones
The comment on the public methods calling said method promises to do so
but doesn't actually follows through. This patch fixes this for row
tombstones, to mirror the behaviour of the mutation compactor. This is
especially important for tests that compare mutations compacted with
different methods.
2022-02-21 12:29:24 +02:00
Botond Dénes
f2e2b84038 multishard_mutation_query: migrate to v2
Mostly mechanical transformation. The main difference is in the detached
compaction state, from which we now get the range tombstone change,
instead of the range tombstone list. The code around this is a bit
awkward, will become simpler when compactor drops v1 support.
2022-02-21 12:29:24 +02:00
Botond Dénes
b330cba792 mutation_fragment_v2: range_tombstone_change: add memory_usage() 2022-02-21 12:29:24 +02:00
Botond Dénes
9e48237b86 evictable_reader_v2: terminate active range tombstones on reader recreation
Reader recreation messes with the continuity of the mutation fragment
stream because it breaks snapshot isolation. We cannot guarantee that a
range tombstone or even the partition started before will continue after
too. So we have to make sure to wrap up all loose threads when
recreating the reader. We already close uncontinued partitions. This
commit also takes care of closing any range tombstone started by
unconditionally emitting a null range tombstone. This is legal to do,
even if no range tombstone was in effect.
2022-02-21 12:29:24 +02:00
Botond Dénes
6db08ddeb2 evictable_reader_v2: restore handling of non-monotonically increasing positions
We thought that unlike v1, v2 will not need this. But it does.
Handled similarly to how v1 did it: we ensure each buffer represents
forward progress, when the last fragment in the buffer is a range
tombstone change:
* Ensure the content of the buffer represents progress w.r.t.
  _next_position_in_partition, thus ensuring the next time we recreate
  the reader it will continue from a later position.
* Continue reading until the next (peeked) fragment has a strictly
  larger position.

The code is just much nicer because it uses coroutines.
2022-02-21 12:29:24 +02:00
Botond Dénes
498d03836b evictable_reader_v2: simplify handling of reader recreation
The evictable reader has a handful of flags dictating what to do after
the reader is recreated: what to validate, what to drop, etc. We
actually need a single flag telling us if the reader was recreated or
not, all other things can be derived from existing fields.
This patch does exactly that. Furthermore it folds do_fill_buffer() into
fill_buffer() and replaces the awkward to use `should_drop_fragment()`
with `examine_first_fragments()`, which does a much better job of
encapsulating all validation and fragment dropping logic.
This code reorganization also fixes two bugs introduced by the v2
conversion:
* The loop in `do_fill_buffer()` could become infinite in certain
  circumstances due to a difference between the v1 and v2 versions of
  `is_end_of_stream()`.
* The position of the first non-dropped fragment is was not validated
  (this was integrated into the range tombstone trimming which was
  thrown out by the conversion).
2022-02-21 12:29:24 +02:00
Botond Dénes
d4ac473f7d mutation: counter_write_query: use v2 reader 2022-02-21 12:27:55 +02:00
Botond Dénes
fcda35d08e mutation: migrate consume() to v2
The underlying mutation format is still v1, so consume() ends up doing
an online conversion. This allows converting all downstream code to v2,
leaving the conversion close to the code that is yet to be migrated to
v2 native: the mutation itself.
2022-02-21 12:27:55 +02:00
Botond Dénes
1fa6537a2f mutation_fragment_v2,flat_mutation_reader_v2: mirror v1 concept organization
Currently all concepts are in mutation_fragment_v2.hh and
flat_mutation_reader_v2.hh. Organize concepts similar to how the v1 ones
are: move high-level consume concepts into
mutation_consumer_concepts.hh.
2022-02-21 12:27:55 +02:00
Botond Dénes
fb0e0ec7c1 mutation_reader: compacting_reader: require a v2 input reader
Before we add a v2 output option to the compactor, we want to get rid of
all the v1 inputs to make it simpler. This means that for a while the
compacting reader will be in a strange place of having a v2 input and a
v1 output. Hopefully, not for long.
2022-02-21 12:27:55 +02:00
Botond Dénes
45b36d91c6 db/view/view_builder: use v2 reader 2022-02-21 12:27:55 +02:00
Botond Dénes
bba20f5cce test/lib/flat_mutation_reader_assertions: adjust has_monotonic_positions() to v2 spec
The v2 spec allows for non-strictly monotonically increasing positions,
but has_monotonic_positions() tried to enforce it. Relax the check so it
conforms to the spec.
2022-02-21 12:27:55 +02:00
Benny Halevy
f5259e048c test: sstable_compaction_test: stop compaction manager and test table using deferred action
To make sure they are properly stopped also on exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220220120939.2362590-2-bhalevy@scylladb.com>
2022-02-21 12:06:32 +02:00
Benny Halevy
9a308bc496 test: lib: register_compaction: do not allow null table
Require to pass the table to be compacted so
register_compaction finds the real compaction state
rather than making a bogus one.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220220120939.2362590-1-bhalevy@scylladb.com>
2022-02-21 12:06:32 +02:00
Yaron Kaikov
23bb0761bf SCYLLA-VERSION-GEN:set release-version value length
Noticed this issue during my debug sessions while building Scylla on x86 and Arm (https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/build/670/artifact/)

from x86 log (https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/build/670/artifact/output-build-x86_64.txt)
```
[883/2823] cd tools/python3 && ./reloc/build_reloc.sh --version $(<../../build/SCYLLA-PRODUCT-FILE)-$(<../../build/SCYLLA-VERSION-FILE)-$(<../../build/SCYLLA-RELEASE-FILE) --nodeps --packages "python3-pyyaml python3-urwid python3-pyparsing python3-requests python3-pyudev python3-setuptools python3-psutil python3-distro python3-click python3-six" --pip-packages "scylla-driver geomet"
5.1.dev-0.20220209.23da2b58796
```
from arm log (https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/build/670/artifact/output-build-aarch64.txt)
```
[244/2823] cd tools/python3 && ./reloc/build_reloc.sh --version $(<../../build/SCYLLA-PRODUCT-FILE)-$(<../../build/SCYLLA-VERSION-FILE)-$(<../../build/SCYLLA-RELEASE-FILE) --nodeps --packages "python3-pyyaml python3-urwid python3-pyparsing python3-requests python3-pyudev python3-setuptools python3-psutil python3-distro python3-click python3-six" --pip-packages "scylla-driver geomet"
5.1.dev-0.20220209.23da2b587
```

Related to git config parameter core.abbrev which is not defined so default is set for auto (based: https://git-scm.com/docs/git-config#Documentation/git-config.txt-coreabbrev)

Fixes: https://github.com/scylladb/scylla/issues/10108

Closes #10109
2022-02-21 13:28:04 +02:00
Pavel Emelyanov
49c5d5b7e8 Merge 'lister: add directory_lister' from Benny Halevy
directory_lister provides a simpler interface compared to lister.

After creating the directory_lister,
its async get() method should be called repeatedly,
returning a std::optional<directory_entry> each call,
until it returns a disengaged entry or an error.

This is especially suitable for coroutines
as demonstrated in the unit tests that were added.

For example:
```c++
        auto dl = directory_lister(path);
        while (auto de = co_await dl.get()) {
            co_await process(*de);
        }
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #9835

* github.com:scylladb/scylla:
  sstable_directory: process_sstable_dir: use directory_lister
  sstable_directory: process_sstable_dir: fixup indentation
  sstable_directory: coroutinize process_sstable_dir
  lister: add directory_lister
2022-02-21 12:24:28 +03:00
Nadav Har'El
4349514064 test/alternator: add smaller reproducer for Limit-less reverse query
The regression test we have for Alternator's issue #9487 (where a reverse
query without a Limit given was broken into 100MB pages instead of the
expected 1MB) is test_query.py::test_query_reverse_long. But this is a
very long test requiring a 100MB partition, and because of its slowness
isn't run by default.

This patch adds another version of that test, test_query_reverse_longish,
which reproduces the same issue #9487 with a partition 50 times shorter
(2MB) so it only takes a fraction of a second and can be enabled by
default. It also requires much less network traffic which is important
when running these tests non-locally.

We leave the original test test_query_reverse_long behind, it can be
still useful to stress Scylla even beyond the 100MB boundary, but it
remains in @veryslow mode so won't run in default test runs.

Refs #9487
Refs #7586

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220220161905.852994-1-nyh@scylladb.com>
2022-02-21 09:12:16 +01:00
Avi Kivity
8eb5d6ed31 frozen_schema: avoid allocating contiguous memory
A frozen schema can be quite large (in #10071 we measured 500 bytes per
column, and there can be thousands of columns in extreme tables). This
can cause large contiguous allocations and therefor memory stalls or
even failures to allocate.

Switch to bytes_ostream as the internal representation. Fortunately
frozen_schema is internally implemented as bytes_ostream, so the
change is minimal.

Ref #10071.

Test: unit (dev)

Closes #10105
2022-02-21 01:39:02 +01:00
Amnon Heiman
c764f0d0f8 gms/gossiper.cc: Add gauge for live and unreachable nodes
this patch adds two gauges:
scylla_gossip_live - how many live nodes the gossiper sees
scylla_gossip_unreachable - how many nodes the gossiper tries to connect
to but cannot.

Both metrics are reported once per node (i.e., per node, not per shard) it
gives visibility to how a specific node sees the cluster.

For example, a split-brain 6 nodes cluster (3 and 3). Each node would
report that it sees 2 nodes, but the monitoring system would see that
there are, in fact, 6 nodes.

Example of two nodes cluster, both running:
``
scylla_gossip_live{shard="0"} 1.000000
scylla_gossip_unreachable{shard="0"} 0.000000
``

Example of two nodes cluster, one is down:
``
scylla_gossip_live{shard="0"} 0.000000
scylla_gossip_unreachable{shard="0"} 1.000000
``

Fixes #10102

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #10103

[avi: remove whitespace change and correct spelling]
2022-02-20 19:42:58 +02:00
Wojciech Mitros
0a1500acd2 sstables: index_reader: remove unused members from index reader context
The _file_name and _index_file fields in index_consume_entry_context
are no longer used anywhere in the class (_file_name isn't even set,
and _index_file was previously used when creating a promoted_index,
which doesn't store the file object anymore)
2022-02-20 16:24:27 +01:00
Nadav Har'El
6476a64185 test/cql-pytest: reproducer for JSON scientific-notation integer problem
The JSON standard specifies numbers without making a distinction of what
is "an integer" and what is "floating point". The value 1e6 is a valid
number, and although it is customary in C that 1e6 is a floating-point
constant, as a JSON constant there is nothing inherently "non-integer" about
it - it is a whole number. This is why I believe CQL commands such as

    CREATE TABLE t(pk int PRIMARY KEY, v int);
    INSERT INTO t JSON '{"pk": 1, "v": 1e6}';

should be allowed, as 1e6 is a whole number and fits in the range of
Scylla's int.

The included tests show that, unfortunately, 1e6 is *not* currently
allowed to be assigned to an integer. The test currently fail on both
Scylla and Cassandra - and we believe this failure to be a bug in both,
so the test is marked with xfail (known to fail) and cassandra-bug
(known failure on Cassandra considered to be a bug).

Refs #10100

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220220141602.843783-1-nyh@scylladb.com>
2022-02-20 17:01:22 +02:00
Nadav Har'El
d3ac9a5790 Merge 'cql3: expr: Fix expr::visit so that it works with references' from Jan Ciołek
There is a bug in `expr::visit`. When trying to return a reference from a visitor it actually returns a reference to some temporary location.
So trying to do something like:
```c++
const expression e = new_bind_variable(123);

const bind_variable& ref = visit(overloaded_functor {
    [](const bind_variable& bv) -> const bind_variable& { return bv; },
    [](const auto&) -> const bind_variable& { throw std::runtime_error("Unreachable"); }
}, e);

std::cout << ref << std::endl;
 ```
 Would actually print a random stack location instead of the value inside of `e`.
 Additionally trying to return a non-const reference doesn't compile.

 Current implementation of `expr::visit` is:
 ```c++
 auto visit(invocable_on_expression auto&& visitor, const expression& e) {
    return std::visit(visitor, e._v->v);
}
 ```

For reference, `std::visit` looks like this:
 ```c++
template<typename _Res, typename _Visitor, typename... _Variants>
constexpr _Res
visit(_Visitor&& __visitor, _Variants&&... __variants)
{
  return std::__do_visit<_Res>(std::forward<_Visitor>(__visitor),
                               std::forward<_Variants>(__variants)...);
}
 ```

 The problem is that `auto` can evaluate to `int` or `float`, but not to `int&`.
 It has now been changed to `decltype(auto)`, which is able to express references.
 I also added a missing `std::forward` on the visitor argument.

The new version looks like this:
 ```c++
template <invocable_on_expression Visitor>
decltype(auto) visit(Visitor&& visitor, const expression& e) {
    return std::visit(std::forward<Visitor>(visitor), e._v->v);
}
```

I added some tests of `expr::visit` in `boost/expr_test`, but sadly they are not as throughout as they could be, Ideally I could return a refernce from `std::visit` and `expr::visit` and then check that they both point to the same address in memory.
I can't do this because it would require to access a private field of `expression`.
Some test pass before the fix, even though they shouldn't, but I'm not sure how to make them better without making field of expression public.

 I played around with some code, it can be found here: https://github.com/cvybhu/attached-files/blob/main/visit/visit_playground.cpp

Closes #10073

* github.com:scylladb/scylla:
  cql3: expr: Add a test to show that std::forward is needed in expr::visit
  cql3: expr: add std::forward in expr::visit
  cql3: expr: Add tests for expr::visit
  cql3: expr: Fix expr::visit so that it works with references
2022-02-20 12:09:57 +02:00
Jan Ciolek
353ab8f438 cql3: expr: Add a test to show that std::forward is needed in expr::visit
Adds a test with a vistior that can only be used as a rvalue.
Without std::forward in expr::visit this test doesn't compile.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-18 14:19:49 +01:00
Jan Ciolek
7234cc851c cql3: expr: add std::forward in expr::visit
expr::visit was missing std::forward on the visitor.
In cases where the visitor was passed as an rvalue it wouldn't
be properly forwarded to std::visit.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-18 14:19:49 +01:00
Jan Ciolek
46367eec55 cql3: expr: Add tests for expr::visit
Add tests for new expr::visit to ensure that it is working correctly.

expr::visit had a hidden bug where trying to return a reference
actually returned a reference to freed location on the stack,
so now there are tests to ensure that everything works.

Sadly the test `expr_visit_const_ref` also passes
before the fix, but at lest expr_visit_ref doesn't compile before the fix.
It would be better to test this by taking references returned
by std::visit and expr::visit and checking that they point
to the same address in memory, but I can't do this
because I would have to access private field of expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-18 14:16:55 +01:00
Pavel Emelyanov
9c06897ec3 test: Add cql-pytest sanity test for system.clients table
Check that SELECT {columns} FROM system.clients returns back only local
connection of cql type (because there are no others during the test).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 15:02:26 +03:00
Pavel Emelyanov
de6c60c1c9 client_data: Sanitize connection_notifier
Now the connection_notifier is all gone, only the client_data bits are left.
To keep it consistent -- rename the files.

Also, while at it, brush up the header dependencies and remove the not
really used constexprs for client states.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 15:02:26 +03:00
Pavel Emelyanov
d63ba87266 transport: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 15:02:26 +03:00
Pavel Emelyanov
971c431a23 code: Remove old on-disk version of system.clients table
This includes most of the connection_notifier stuff as well as
the auxiliary code from system_keyspace.cc and a bunch of
updating calls from the client state changing.

Other than less code and less disk updates on clients connection
paths, this removes one usage of the nasty global qctx thing.

Since the system.clients goes away rename the system.clients_v
here too so the table is always present out there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 15:02:26 +03:00
Pavel Emelyanov
0c9ed01716 system_keyspace: Add clients_v virtual table
This table mirrors the existing clients one but temporarily
has its own name. The schema is the same as in system.clients.

The table gets client_data's from the registered protocol
servers, which in turn are obtained from the storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 15:02:26 +03:00
Pavel Emelyanov
7bc697ec99 protocol_server: Add get_client_data call
The call returns a chunked_vector with client_data's. For now
only the native transport implements it, others return empty
vector.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 14:25:08 +03:00
Pavel Emelyanov
0046cdc6cb transport: Track client state for real
Right now when the client state changes the respective update is
performed on the system.clients table. While doing it some bits
from this state are lost from the in-memory structures. For the
sake of exporting this information we need to track whether the
connected client goes authenticating or is already ready.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 14:25:08 +03:00
Pavel Emelyanov
00ce9b1c36 transport: Add stringifiers to client_data class
There are two fields on the client_data that are not mapped to
string with the help of standard fmt library. Add two methods
that turn client state and type into strings.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 14:25:08 +03:00
Pavel Emelyanov
f035313b16 generic_server: Gentle iterator
Add the ability to iterate over the list of connections in a "gentle"
manner, i.e. -- preempting the loop when required.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 14:25:08 +03:00
Pavel Emelyanov
661c12066b generic_server: Type alias
For simpler future patching

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 14:25:07 +03:00
Pavel Emelyanov
d586805054 docs: Add system.clients description
There's a document that sums up the tables from system keyspace and
its missing the clients table. This set is going to reimplement the
table keeping the schema intact, so it's good time to document it
right at the beginning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-18 14:25:07 +03:00
Nadav Har'El
f292d3d679 alternator: make schema modifications in CreateTable atomic
The Alternator CreateTable operation currently performs several schema-
changing operations separately - one by one: It creates a keyspace,
a table in that keyspace and possibly also multiple views, and it sets
tags on the table. A consequence of this is that concurrent CreateTable
and DeleteTable operations (for example) can result in unexpected errors
or inconsistent states - for example CreateTable wants to create the
table in the keyspace it just created, but a concurrent DeleteTable
deleted it. We have two issues about this problem (#6391 and #9868)
and three tests (test_table.py::test_concurrent_create_and_delete_table)
reproducing it.

In this patch we fix these problems by switching to the modern Scylla
schema-changing API: Instead of doing several schema-changing
operations one by one, we create a vector of schema mutation performing
all these operations - and then perform all these mutations together.

When the experimental Raft-based schema modifications is enabled, this
completely solves the races, and the tests begin to pass. However, if
the experimental Raft mode is not enabled, these tests continue to fail
because there is still no locking while applying the different schema
mutations (not even on a single node). So I put a special fixture
"fails_without_raft" on these tests - which means that the tests
xfail if run without raft, and expected to pass when run on Raft.

Indeed, after this patch
test/alternator/run --raft test_table.py::test_concurrent_create_and_delete_table

shows three passing tests (they also pass if we drastically improve the
number of iterations), while
test/alternator/run test_table.py::test_concurrent_create_and_delete_table

shows three xfailing tests.

All other Alternator tests pass as before with this patch, verifying
that the handling of new tables, new views, tags, and CDC log tables,
all happen correctly even after this patch.

A note about the implementation: Before this patch, the CreateTable code
used high-level functions like prepare_new_column_family_announcement().
These high-level functions become unusable if we write multiple schema
operations to one list of mutations, because for example this function
validates that the keyspace had already been created - when it hasn't
and that's the whole point. So instead we had to use lower-level
function like add_table_or_view_to_schema_mutation() and
before_create_column_family(). However, despite being lower level,
these functions were public so I think it's reasonable to use them,
and we probably have no other alternative.

Fixes #6391
Fixes #9868

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-18 09:03:52 +02:00
Nadav Har'El
46120ca4f4 Merge 'tools/scylla-sstable: change output of dump commands to JSON' from Botond Dénes
Replacing the previous text output with the exception of the dump-data
command. The text output was supposed to be human-friendly but it is not
really human friendlier than a well formatted JSON, the latter having
the additional advantage of being machine friendly too. Although the
text output already exists, having just one output format makes the code
much simpler and easier to maintain so we chose not to pay the higher
maintenance price for a format that is not expected to see much (if any)
use.
Although the JSON written by the tool is not formatted, it can easily be
formatted by e.g. piping it through `jq`. The latter also allows lookup
of specific field(s).
The JSON schema of each command is documented in the --help output of
the respective command (e.g. scylla sstable data-dump --help) .

We keep the text output of the dump-data command as this is using
scylla's built-in printer that we also use in logging and tests. Some
people might be used to this format, so leave it in: the code already
exists for it and lives in scylla core, so we don't need to maintain it
separately. The default output-format of dump-data is now JSON.

A smoke test suite is added for the dump commands too. The tests only
check that some output is present and that it is valid JSON.

Refs: #9882

Tests: unit(dev)

Also on: https://github.com/denesb/scylla.git scylla-sstable-json/v2

Changelog

v3:
* Rebase on recent master (which has the required seastar fixes for
  debug tests)

v2:
* Document the JSON schema of each command.
* Use the SAX-style API of rapidjson to generate streaming JSON, instead
  of hand-generating it.

Closes #10074

* github.com:scylladb/scylla:
  test/cql-pytest: add tests for scylla-sstable's dump commands
  test/cql-pytest: prepare for tool tests
  tools/schema_loader: auto-create the keyspace for all statements
  tools/scylla-sstable: change output of dump-scylla-metadata to json
  tools/scylla-sstable: change output of dump-statistics to json
  tools/scylla-sstable: change output of dump-summary to json
  tools/scylla-sstable: change output of dump-compression-info to json
  tools/scylla-sstable: change output of dump-index to json
  tools/scylla-sstable: add json support in --dump-data
  tools/scylla-sstable: add json_writer
  tools/scylla-sstable: use fmt::print in --dump-data
  tools/scylla-sstable: prepare --dump-data for multiple output formats
2022-02-18 07:52:19 +02:00
Jan Ciolek
8676f60724 cql3: expr: Fix expr::visit so that it works with references
expr::visit had a bug where if we wanted to return
a reference in the visitor, the reference would be
to a temporary stack location instead of the passed
argument.

So trying to do something like this:
```
    const bind_variable& ref = visit(overloaded_functor {
    [](const bind_variable& bv) -> const bind_variable& { return bv; },
    [](const auto&) -> const bind_variable& { ... }
    }, e);
std::cout << ref << std::endl;
```

Would actually print a random location on stack instead
of valid value inside of e.

Additionally trying to return a non-const reference
doesn't even compile.

The problem was that the return type of expr::visit
was defined as `auto`, which can be `int`, but not `int&`.
This has been changed to `decltype(auto)` which can be both `int` and `int&`

New version of `expr::visit` works for `const expression&` and `expression&`
no matter what the visitor returns.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-17 17:29:28 +01:00
Botond Dénes
1e038b40cf test/cql-pytest: add tests for scylla-sstable's dump commands
The tests are smoke-tests: they mostly check that scylla doesn't crash
while dumping and it produces *some* output. When dumping json, the test
checks that it is valid json.
2022-02-17 15:24:24 +02:00
Botond Dénes
afab1a97c6 test/cql-pytest: prepare for tool tests
We want to add tool tests. These tests will have to invoke scylla
executable (as tools are hosted by the latter) and they want access to
the scylla data directories. Propagate the scylla path and data
directory used from `run` into the test suite via pytest request
parameters.
2022-02-17 15:24:24 +02:00
Botond Dénes
96082631c8 tools/schema_loader: auto-create the keyspace for all statements
Currently the keyspace is only auto-created for create type statements.
However the keyspace is needed even without UDTs being involved: for
example if the table contains a collection type. So auto-create the
keyspace unconditionally before preparing the first statement.

Also add a test-case with a create table statement which requires the
keyspace to be present at prepare time.
2022-02-17 15:24:24 +02:00
Botond Dénes
59ce247164 tools/scylla-sstable: change output of dump-scylla-metadata to json 2022-02-17 15:24:24 +02:00
Botond Dénes
2a7ed8212f tools/scylla-sstable: change output of dump-statistics to json 2022-02-17 15:24:24 +02:00
Botond Dénes
a617e66878 tools/scylla-sstable: change output of dump-summary to json 2022-02-17 14:17:11 +02:00
Botond Dénes
fb6b7c8036 tools/scylla-sstable: change output of dump-compression-info to json 2022-02-17 14:17:11 +02:00
Botond Dénes
f5c6d7e12e tools/scylla-sstable: change output of dump-index to json 2022-02-17 14:17:11 +02:00
Botond Dénes
bdbbda29c1 tools/scylla-sstable: add json support in --dump-data
But keep the old text output-format too. One can switch between the two
with the --output-format flag, which defaults to "json".
2022-02-17 14:17:11 +02:00
Botond Dénes
03bbf1b362 tools/scylla-sstable: add json_writer
Wrapping a rapidjson::Writer<> and mirrors the latter's API, providing
more convenient overloads for the Key() and String() methods, as well as
providing some extra, scylla-sstable specific methods too.
2022-02-17 14:17:11 +02:00
Botond Dénes
72f27c8782 tools/scylla-sstable: use fmt::print in --dump-data
The rest of the code is standardizing on fmt::print(), bring the code
for --dump-data in line.
2022-02-17 14:17:11 +02:00
Botond Dénes
ba2a61b2bc tools/scylla-sstable: prepare --dump-data for multiple output formats
Extract the actual dumping code into a separate class, which also
implements sstable_consumer interface. The dumping consumer now just
forwards calls to actual dumper through the abstract consumer interface,
allowing different concrete dumpers to be instantiated.
2022-02-17 14:17:11 +02:00
Piotr Dulikowski
adfd9d2f7a abstract_read_resolver::fail_request: make non-virtual
This method is not overrided by any of the derived classes, so it does
not need to be virtual.

(cherry picked from commit b7fb93dc46531bca8db535301a069df52991f9d9)
2022-02-17 12:34:37 +02:00
Michael Livshin
f8d4bafa5a to_string.hh: include <map>
The code uses `std::map`, so it should include the definition
explicitly.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-17 08:53:48 +02:00
Michael Livshin
a657dc9787 scylla-gdb.py: set source language to c++
When you interrupt a process in gdb using Ctrl-C or attach gdb to a
running process, usually gdb will show the current frame as
`syscall()` (no source information).  But in some less usual setups
gdb may happen to know that `syscall()` is implemented in assembly,
and even knows which line is current in which assembly file.

An unfortunate effect of gdb knowing that the current frame's source
language is assembly is that since assembly is not C++, gdb's
expression parser switches to "auto" while in the `syscall()` stack
frame.  And in the "auto" language explicit C++ global namespace
references like "::debug::the_database" are not syntactically valid,
which renders much of scylla-gdb.py unusable unless you remember to go
up the call stack before doing anything.

But since scylla-gdb.py is there to help debug Scylla, and Scylla is
written in C++, we can just set gdb source language to "c++" and avoid
the problem.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220216235301.1206341-1-michael.livshin@scylladb.com>
2022-02-17 08:43:59 +02:00
Botond Dénes
948bc359c2 Merge "ME sstable format support" from Michael Livshin
"
This series implements support for the ME sstable format (introduced
in C* 3.11.11).

Tests: unit(dev)
"

* tag 'me-sstable-format-v5' of https://github.com/cmm/scylla:
  sstables: validate originating host id
  sstable: add is_uploaded() predicate
  config: make the ME sstable format default
  scylla-gdb.py: recognize ME sstables
  sstables: store originating host id in stats metadata
  system_keyspace: cache local host id before flushing
  database_test: ensure host id continuity
  sstables_manager: add get_local_host_id() method and support
  sstables_manager: formalize inheritability
  system_keyspace, main: load (or create) local host id earlier
  sstable_3_x_test: test ME sstable format too
  add "ME_SSTABLE" cluster feature
  add "sstable_format" config
  add support for the ME sstable format
  scylla-sstable: add ability to dump optionals and utils::UUID
  sstables: add ability to write and parse optionals
  globalize sstables::write(..., utils::UUID)
2022-02-16 18:28:16 +02:00
Michael Livshin
79bf79ebd3 sstables: validate originating host id
Add an additional sstable validation step to check that originating
host id matches the local host id.

This is only done for ME-and-up sstables, which do not come from
upload/, and when the local host id is known.

When local host id is unknown, check that the sstable belongs to a
system keyspace, i.e. whether it is plausible that Scylla is still
booting up and hasn't loaded/generated the local host id yet.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
3511d7cd21 sstable: add is_uploaded() predicate
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
3bf1e137fc config: make the ME sstable format default
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0ca58096cf scylla-gdb.py: recognize ME sstables
Also use the opportunity to unify two closely-related lists into a
dictionary.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
dd4e330cc5 sstables: store originating host id in stats metadata
With this change, ME sstables start carrying their originating host
id, which makes ME format feature-complete so it can be made default.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0ccd56e036 system_keyspace: cache local host id before flushing
Later in this series the ME sstable format is made default, which
means that `system.local` will likely be written as ME.

Since, in ME, originating host id is a part of sstable stats metadata,
the local host id needs to either already be cached by the time
`system.local` is flushed, or to somehow be special-case-ignored when
flushing `system.local`.

The former (done here) is optimistic (cache before flush), but the
alternative would be an abstraction violation and would also cost a
little time upon each sstable write.

(Cache-before-flush could be undone by catching any exceptions during
flush and un-caching, but inability to `co_await` in catch clauses
makes the code look rather awkward.  And there is no need to bother
because bootstrap failures should be fatal anyway)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
d8cc535297 database_test: ensure host id continuity
The "populate_from_quarantine_works" test case creates sstables with
one db config, then reads them with another.  Ensure that both configs
have the same host id so the sstables pass validation.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
3fef604075 sstables_manager: add get_local_host_id() method and support
Since ME sstable format includes originating host id in stats
metadata, local host id needs to be made available for writing and
validation.

Both Scylla server (where local host id comes from the `system.local`
table) and unit tests (where it is fabricated) must be accomodated.
Regardless of how the host id is obtained, it is stored in the db
config instance and accessed through `sstables_manager`.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0895188851 sstables_manager: formalize inheritability
The class is already inherited from in tests (along with overriding a
non-virtual method), so this seems to be called for.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
7d2af177eb system_keyspace, main: load (or create) local host id earlier
We want it to be cached before any sstable is written, so do it right
after system_keyspace::minimal_setup().

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
387c882dc7 sstable_3_x_test: test ME sstable format too
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
d370558279 add "ME_SSTABLE" cluster feature
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0b1447c702 add "sstable_format" config
Initialize it to "md" until ME format support is
complete (i.e. storing originating host id in sstable stats metadata
is implemented), so at present there is no observable change by
default.

Also declare "enable_sstables_md_format" unused -- the idea, going
forward, being that only "sstable_format" controls the written sstable
file format and that no more per-format enablement config options
shall be added.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
c96708d262 add support for the ME sstable format
The ME format has been introduced in Cassandra 3.11.11:

11952fae77/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java (L123)
d84c6e9810

It adds originating host id to sstable metadata in support of fixing
loss of commit log data when moving sstables between nodes:

https://issues.apache.org/jira/browse/CASSANDRA-16619

In Scylla:

* The supported way to ingest sstables is via upload/, where stored
  commit log replay position should be disregarded (but see
  https://github.com/scylladb/scylla/issues/10080).

* A later commit in this series implements originating host id
  validation for native ME sstables.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
3712a82ca7 scylla-sstable: add ability to dump optionals and utils::UUID
Needed for the ME sstable format.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
26bae0cd39 sstables: add ability to write and parse optionals
(that is, instances of `std::optional`).

The ME sstable format includes optional originating host id in stats
metadata.  We know how to write and parse uuids, but not how to write
and parse optionals.

The format is (used by C* in this case, and also happens to be
consistent with how booleans are serialized): first a boolean
indicating whether the contents are present (0 or 1, as a byte), then
the contents (if any).

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:23 +02:00
Michael Livshin
c00d272b16 globalize sstables::write(..., utils::UUID)
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:23 +02:00
Benny Halevy
5a63026932 api: storage_service: scrub: validate parameters
Validate all parameters, rejecting unsupported parameters.

Refs #10087

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-16 17:01:46 +02:00
Benny Halevy
16afde46e7 api: storage_service: refactor parse_tables
Prepare for string-based parsing and validation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-16 16:53:18 +02:00
Benny Halevy
cce6810615 api: storage_service: refactor validate_keyspace
Prepare for string-based validation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-16 16:53:18 +02:00
Benny Halevy
eef131ea10 test: rest_api: add test_storage_service_keyspace_scrub tests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-16 16:53:16 +02:00
Benny Halevy
fc2e9abeba api: storage_service: scrub: throw httpd::bad_param_exception for invalid param values
Throwing std::runtime_error results in
http status 500 (internal_server_error), but the problem
is with the request parameters, nt with the server.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-16 15:39:17 +02:00
Benny Halevy
b7b0c19fdc test: uuid: cement the assumption that default and null uuid are equal
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220216081623.830627-2-bhalevy@scylladb.com>
2022-02-16 10:19:47 +02:00
Benny Halevy
489e50ef3a utils: uuid: make operator bool explicit
Following up on 69fcc053bb

To prevent unintentional implicit conversions
e.g. to a number.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220216081623.830627-1-bhalevy@scylladb.com>
2022-02-16 10:19:47 +02:00
Piotr Dulikowski
742f2abfd8 exception_container: do not throw in accept
This commit changes the behavior of `exception_container::accept`. Now,
instead of throwing an `utils::bad_exception_container_access` exception
when the container is empty, the provided visitor is invoked with that
exception instead. There are two reasons for this change:

- The exception_container is supposed to allow handling exceptions
  without using the costly C++'s exception runtime. Although an empty
  container is an edge case, I think it the new behavior is more aligned
  with the class' purpose. The old behavior can be simulated by
  providing a visitor which throws when called with bad access
  exception.

- The new behavior fixes a bug in `result_try`/`result_futurize_try`.
  Before the change, if the `try` block returned a failed result with an
  empty exception container, a bad access exception would either be
  thrown or returned as an exceptional future without being handled by
  the `catch` clauses. Although nobody is supposed to return such
  result<>s on purpose, a moved out result can be returned by accident
  and it's important for the exception handling logic to be correct in
  such a situation.

Tests: unit(dev)

Closes #10086
2022-02-16 10:06:10 +02:00
Nadav Har'El
7be3129458 cdc: don't need current keyspace to create the log table
CDC registers to the table-creation hook (before_create_column_family)
to add a second table - the CDC log table - to the same keyspace.
The handler function (on_before_update_column_family() in cdc/log.cc)
wants to retrieve the keyspace's definition, but that does NOT WORK if
we create the keyspace and table in one operation (which is exactly what
we intend to do in Alternator to solve issue #9868) - because at the
time of the hook, the keyspace does not yet exist in the schema.

It turns out that on_before_update_column_family() does not REALLY need
the keyspace. It needed it to pass it on to make_create_table_mutations()
but that function doesn't use the keyspace parameter passed to it! All
it needs is the keyspace's name - which is in the schema anyway and
doesn't need to be looked up.

So in this patch we fix make_create_table_mutations() to not require the
unused keyspace parameter - and fix the CDC code not to look for the
keyspace that is no longer needed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220215162342.622509-1-nyh@scylladb.com>
2022-02-16 08:38:56 +02:00
Benny Halevy
69fcc053bb utils: uuid: add null_uuid
and respective bool predecate and operator
and unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215113438.473400-1-bhalevy@scylladb.com>
2022-02-15 18:02:54 +02:00
Avi Kivity
817d1aade8 tools: toolchain: regenerate with libstdc++-11.2.1-9.fc34.x86_64 2022-02-15 18:02:54 +02:00
Benny Halevy
3e20fee070 cql3: result_set: remove std::ref from comperator&
Applying std::ref on `RowComparator& cmp` hits the
following compilation error on Fedora 34 with
libstdc++-devel-11.2.1-9.fc34.x86_64

```
FAILED: build/dev/cql3/statements/select_statement.o
clang++ -MD -MT build/dev/cql3/statements/select_statement.o -MF build/dev/cql3/statements/select_statement.o.d -I/home/bhalevy/dev/scylla/seastar/include -I/home/bhalevy/dev/scylla/build/dev/seastar/gen/include -std=gnu++20 -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Werror=unused-result -fstack-clash-protection -DSEASTAR_API_LEVEL=6 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_TYPE_ERASE_MORE -DFMT_LOCALE -DFMT_SHARED -I/usr/include/p11-kit-1  -DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION -O2 -DSCYLLA_ENABLE_WASMTIME -iquote. -iquote build/dev/gen --std=gnu++20  -ffile-prefix-map=/home/bhalevy/dev/scylla=.  -march=westmere -DBOOST_TEST_DYN_LINK   -Iabseil -fvisibility=hidden  -Wall -Werror -Wno-mismatched-tags -Wno-tautological-compare -Wno-parentheses-equality -Wno-c++11-narrowing -Wno-sometimes-uninitialized -Wno-return-stack-address -Wno-missing-braces -Wno-unused-lambda-capture -Wno-overflow -Wno-noexcept-type -Wno-error=cpp -Wno-ignored-attributes -Wno-overloaded-virtual -Wno-unused-command-line-argument -Wno-defaulted-function-deleted -Wno-redeclared-class-member -Wno-unsupported-friend -Wno-unused-variable -Wno-delete-non-abstract-non-virtual-dtor -Wno-braced-scalar-init -Wno-implicit-int-float-conversion -Wno-delete-abstract-non-virtual-dtor -Wno-uninitialized-const-reference -Wno-psabi -Wno-narrowing -Wno-array-bounds -Wno-nonnull -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DHAVE_LZ4_COMPRESS_DEFAULT  -c -o build/dev/cql3/statements/select_statement.o cql3/statements/select_statement.cc
In file included from cql3/statements/select_statement.cc:14:
In file included from ./cql3/statements/select_statement.hh:16:
In file included from ./cql3/statements/raw/select_statement.hh:16:
In file included from ./cql3/statements/raw/cf_statement.hh:16:
In file included from ./cql3/cf_name.hh:16:
In file included from ./cql3/keyspace_element_name.hh:16:
In file included from /home/bhalevy/dev/scylla/seastar/include/seastar/core/sstring.hh:25:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/algorithm:74:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/pstl/glue_algorithm_defs.h:13:
In file included from /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/functional:58:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: error: exception specification of 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' uses itself
                = decltype(reference_wrapper::_S_fun(std::declval<_Up>()))>
                                                     ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:319:40: note: in instantiation of exception specification for 'function<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' requested here
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/refwrap.h:321:2: note: in instantiation of default argument for 'reference_wrapper<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, void>' required here
        reference_wrapper(_Up&& __uref)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1017:57: note: while substituting deduced template arguments into function template 'reference_wrapper' [with _Up = __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, $1 = (no value), $2 = (no value)]
      = __bool_constant<__is_nothrow_constructible(_Tp, _Args...)>;
                                                        ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:1023:14: note: in instantiation of template type alias '__is_nothrow_constructible_impl' requested here
    : public __is_nothrow_constructible_impl<_Tp, _Args...>::type
             ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/type_traits:153:14: note: in instantiation of template class 'std::is_nothrow_constructible<__gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
    : public conditional<_B1::value, _B2, _B1>::type
             ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:298:11: note: (skipping 8 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
          return __and_<typename _Base::_Local_storage,
                 ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1933:13: note: in instantiation of function template specialization 'std::__partial_sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
              std::__partial_sort(__first, __last, __last, __comp);
                   ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:1954:9: note: in instantiation of function template specialization 'std::__introsort_loop<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, long, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
          std::__introsort_loop(__first, __last,
               ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_algo.h:4875:12: note: in instantiation of function template specialization 'std::__sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, __gnu_cxx::__ops::_Iter_comp_iter<std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>>' requested here
      std::__sort(__first, __last, __gnu_cxx::__ops::__iter_comp_iter(__comp));
           ^
./cql3/result_set.hh:168:14: note: in instantiation of function template specialization 'std::sort<utils::chunked_vector<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>, 131072>::iterator_type<std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>>>, std::reference_wrapper<const std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>>' requested here
        std::sort(_rows.begin(), _rows.end(), std::ref(cmp));
             ^
cql3/statements/select_statement.cc:773:21: note: in instantiation of function template specialization 'cql3::result_set::sort<std::function<bool (const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &, const std::vector<std::optional<seastar::basic_sstring<signed char, unsigned int, 31, false>>> &)>>' requested here
                rs->sort(_ordering_comparator);
                    ^
1 error generated.
ninja: build stopped: subcommand failed.
```

Fixes #10079.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215071955.316895-3-bhalevy@scylladb.com>
2022-02-15 10:57:23 +02:00
Benny Halevy
41b5c266db cql3: result_set: add concept for RowComparator
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215071955.316895-2-bhalevy@scylladb.com>
2022-02-15 10:57:19 +02:00
Benny Halevy
ee59b851b4 cql3: result_set: define internal types
Define the types for column, row, and vector of rows
and reuse correspondingly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215071955.316895-1-bhalevy@scylladb.com>
2022-02-15 10:57:18 +02:00
Michael Livshin
04c1286a94 Add "me" sstables for the multi-format tests
Prerequisite for the "ME sstable format support" series (which has been
posted to the mailing list) -- to be merged or rejected together with
that.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #9939
2022-02-15 09:24:09 +02:00
Benny Halevy
67580c0855 sstables: get rid of remove_sstable_with_temp_toc
It is unused since e40aa042a7
(version 4.2)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220214140029.1513522-2-bhalevy@scylladb.com>
2022-02-14 18:57:40 +02:00
Benny Halevy
e5fc4b6f5d sstables: coroutinize remove_by_toc_name
Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220214140029.1513522-1-bhalevy@scylladb.com>
2022-02-14 18:57:39 +02:00
Nadav Har'El
4e3038b57f alternator: add FIXME for schema changes requiring a loop
In commit a664ac7ba5, the Alternator
schema-modifying code (e.g., delete_table()) was reorganized to support
the new Raft-based schema modifications. Schema modifications now work
with an "optimistic locking" approach: We retrieve the current schema
version id ("group0_guard"), reads the current schema and verifies it
can do the changes it wants to do, and then does them with
mm.announce(group0_guard) - which will fail if the schema version is not
current because some other concurrent modification beat us in the race.

This means that we need to do this whole read-modify-write (group0_guard,
checking the schema, creating mutations, calling mm.announce()) in a
*retry loop*. We have such a loop in the CQL code but it's missing in the
Alternator code. In this patch we don't add the loop yet, but add FIXMEs
to remind us where it's missing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220214154435.544125-1-nyh@scylladb.com>
2022-02-14 18:24:16 +02:00
Nadav Har'El
212c321c55 test/alternator: add reproducers for non-atomic table creation
We add reproducing tests for two known Alternator issues, #6391 and #9868,
which involve the non-atomicity of table creation. Creating a table
currently involves multiple steps - creating a keyspace, a table,
materialized views, and tags. If some of these steps succeed and some
fail, we get an InternalServerError and potentially leave behind some
half-built table.

Both issues will be solved by making better use of the new Raft-based
capabilities of making multiple modifications to the schema atomically,
but this patch doesn't fix the problem - it just proves it exist.

The new tests involve two threads - one repeatedly trying to create a
table with a GSI or with tags - and the other thread repeatedly trying
to delete the same table under its feet. Both bugs are reproduced almost
immediately.

Note that like all test/alternator tests, the new tests are usually run on
just one node. So when we fix the bug and these tests begin to pass,
it will not be a proof that concurrent schema modification works safely
on *different* nodes. To prove that, we will also need a multi-node test.
However, this test can prove that we used Raft-based schema modification
correctly - and if we assume that the Raft-based schema modification
feature is itself correct, then we can be sure that CreateTable will be
correct also across multiple nodes. Although it won't hurt to check it
directly.

Refs #6391
Refs #9868

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207223100.207074-1-nyh@scylladb.com>
2022-02-14 18:21:21 +02:00
Benny Halevy
244df07771 large_data_handler: use only basename to identify the sstable
SSTables may be created in one directory (e.g. staging)
and be removed from another directory (base table dir,
or quarantine if scrub moved them there), so identify
the sstable by its unique component basename rather than
the full path.

Fixes #10075

Test: unit(dev)
DTest: wide_rows_test.py (w/ https://github.com/scylladb/scylla-dtest/pull/2606)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220214131923.1468870-1-bhalevy@scylladb.com>
2022-02-14 17:57:49 +02:00
Benny Halevy
19ea228cf8 replica: table: coroutinize move_sstables_from_staging
Test: unit(dev)
DTest: test_drop_mv_during_base_table_writes

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220214102911.1314022-1-bhalevy@scylladb.com>
2022-02-14 17:52:27 +02:00
Benny Halevy
8f417b8021 sstable: coroutinize seal_sstable
Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220214105214.1337361-1-bhalevy@scylladb.com>
2022-02-14 17:49:52 +02:00
Benny Halevy
c75e63e480 sstable: coroutinize move_to_new_dir
Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220214154403.1590022-1-bhalevy@scylladb.com>
2022-02-14 17:47:09 +02:00
Benny Halevy
b131f94fc3 large_data_handler: maybe_delete_large_data_entries: data_size is unused
Since 64a4ffc579

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220214115258.1354372-1-bhalevy@scylladb.com>
2022-02-14 13:58:44 +02:00
Kamil Braun
4147ef94b0 service: raft: raft_group0: persist discovered peers and restore on restart
We add a `peers()` method to `discovery` which returns the peers
discovered until now (including seeds). The caller of functions which
return an output -- `tick` or `request` -- is responsible for persisting
`peers()` before returning the output of `tick`/`request` (e.g. before
sending the response produced by `request` back). The user of
`discovery` is also responsible for restoring previously persisted peers
when constructing `discovery` again after a restart (e.g. if we
previously crashed in the middle of the algorithm).

The `persistent_discovery` class is a wrapper around `discovery` which
does exactly that.
2022-02-14 12:05:18 +01:00
Kamil Braun
5dbf86fa29 db: system_keyspace: introduce discovery table
This table will be used to persist the list of peers discovered by the
`discovery` algorithm that is used for creating Raft group 0 when
bootstrapping a fresh cluster.
2022-02-14 12:05:18 +01:00
Kamil Braun
02d4087c6e service: raft: discovery: rename get_output to tick
The name `get_output` suggests that this is the only way to get output
from `discovery`. But there is a second public API: `request`, which also
provides us with a different kind of output.

Rename it to `tick`, which describes what the API is used for:
periodically ticking the discovery state machine in order to make
progress.
2022-02-14 12:04:37 +01:00
Kamil Braun
586ef8fc23 service: raft: discovery: stop returning peer_list from request after becoming leader
In `raft_group0::discover_group0`, when we detect that we became a
leader, we destroy the `discovery` object, create a group 0, and respond
with the group 0 information to all further requests.

However there is a small time window after becoming a leader but before
destroying the `discovery` object where we still answer to discovery
requests by returning peer lists, without informing the requester that
we become a leader.

This is unsafe, and the algorithm specification does not allow this. For
example, consider the seed graph 0 --> 1. It satisfies the property
required by the algorithm, i.e. that there exists a vertex reachable
from every other vertex. Now `1` can become a leader before `0` contacts it.
When `0` contacts `1`, it should learn from `1` that `1` created a group 0, so
`0` does not become a leader itself and create another group 0. However,
with the current implementation, it may happen that `0` contacts `1` and
receives a peer list (instead of group 0 information), and also becomes
a leader because it has the smallest ID, so we end up with two group 0s.

The correct thing to do is to stop returning peer lists to requests
immediately after becoming a leader. This is what we fix in this commit.
2022-02-14 12:04:37 +01:00
Benny Halevy
795d4a0bad batchlog_manager: batchlog_replay_loop: ignore broken_semaphore if abort_requested
drain() breaks _sem, causing do_batch_log_replay to throw broken_semaphore.
Ignore this error in batchlog_replay_loop as it's expected on shutdown.

https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/1073/testReport/junit/thrift_tests/TestCompactStorageThriftAccesses/test_get/
```
E           AssertionError: Unexpected errors found: [('node1', ['ERROR 2022-02-14 06:55:44,263 [shard 0] batchlog_manager - Exception in batch replay: seastar::broken_semaphore (Semaphore broken)'])]
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220214090607.1213740-1-bhalevy@scylladb.com>
2022-02-14 11:34:16 +02:00
Raphael S. Carvalho
a9427f150a Revert "sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME"
This reverts commit 4c05e5f966.

Moving cleanup to maintenance group made its operation time up to
10x slower than previous release. It's a blocker to 4.6 release,
so let's revert it until we figure this all out.

Probably this happens because maintenance group is fixed at a
relatively small constant, and cleanup may be incrementally
generating backlog for regular compaction, where the former is
fighting for resources against the latter.

Fixes #10060.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220213184306.91585-1-raphaelsc@scylladb.com>
2022-02-13 21:48:20 +02:00
Avi Kivity
13cf66d3ef Revert "schema_registry: Increase grace period for schema version cache"
This reverts commit 23da2b5879. It causes
the node to quickly run out of memory when many schema changes are made
within a small time window.

Fixes #10071.
2022-02-13 19:38:24 +02:00
Avi Kivity
7cc43f8aa8 Merge 'utils: add result_try and result_futurize_try' from Piotr Dulikowski
Adds `utils::result_try` and `utils::result_futurize_try` - functions which allow to convert existing try..catch blocks into a version which handles C++ exceptions, failed results with exception containers and, depending on the function variant, exceptional futures using the same exception handling logic.

For example, you can convert the following try..catch block:

    try {
        return a_function_that_may_throw();
    } catch (const my_exception& ex) {
        return 123;
    } catch (...) {
        throw;
    }

...to this:

    return utils::result_try([&] {
        return a_function_that_may_throw_or_return_a_failed_result();
    },  utils::result_catch<my_exception>([&] (const Ex&) {
        return 123;
    }), utils::result_catch_dots([&] (auto&& handle) {
        return handle.into_result();
    });

Similarly, `utils::result_futurize_try` can be used to migrate `then_wrapped` or `f.handle_exception()` constructs.

As an example of the usability of the new constructs, two places in the current code which need to simultaneously handle exceptions and failed results are converted to use `result_try` and `result_futurize_try`.

Results of `perf_simple_query --smp 1 --operations-per-shard 1000000 --write`:

```
127041.61 tps ( 67.2 allocs/op,  14.2 tasks/op,   52422 insns/op)
126958.60 tps ( 67.2 allocs/op,  14.2 tasks/op,   52409 insns/op)
127088.37 tps ( 67.2 allocs/op,  14.2 tasks/op,   52411 insns/op)
127560.84 tps ( 67.2 allocs/op,  14.2 tasks/op,   52424 insns/op)
127826.61 tps ( 67.2 allocs/op,  14.2 tasks/op,   52406 insns/op)

126801.02 tps ( 67.2 allocs/op,  14.2 tasks/op,   52420 insns/op)
125371.51 tps ( 67.2 allocs/op,  14.2 tasks/op,   52425 insns/op)
126498.51 tps ( 67.2 allocs/op,  14.2 tasks/op,   52427 insns/op)
126359.41 tps ( 67.2 allocs/op,  14.2 tasks/op,   52423 insns/op)
126298.27 tps ( 67.2 allocs/op,  14.2 tasks/op,   52410 insns/op)
```

The number of tasks and allocations is unchanged. The number of instructions per operations seems similar, it may have increased slightly (by 10-20) but it's hard to tell for sure because of the noisiness of the results.

Tests: unit(dev)

Closes #10045

* github.com:scylladb/scylla:
  transport: use result_try in process_request_one
  storage_proxy: use result_futurize_try in mutate_end
  storage_proxy: temporarily throw exception from result in mutate_end
  utils: add result_try and result_futurize_try
2022-02-13 19:38:13 +02:00
Avi Kivity
45bdb57b05 Update seastar submodule
* seastar 299c9474d...c18cc5dc6 (5):
  > log: Fix silencer to be shard-local and logger-global
Fixes #9784.
  > Fix alloc-dealloc-mismatch error in DPDK mode
  > Fix stack buffer overflow when using native network inteface
  > test: coroutines: test_scheduling_group: fixup indentation
  > test: coroutines: test_scheduling_group: destroy temporary scheduling_group when done
2022-02-13 16:47:25 +02:00
Avi Kivity
6572b297a2 treewide: clean up stray license blurbs
After the mechanical change in fcb8d040e8
("treewide: use Software Package Data Exchange (SPDX) license identifiers"),
a few stray license blurbs or fragments thereof remain. In two cases
these were extra blurbs in code generators intended for the generated code,
in others they were just missed by the script.

Clean them up, adding an SPDX license identifier where needed.

Closes #10072
2022-02-13 14:16:16 +02:00
Avi Kivity
6b380121e0 Merge 'utils/result: optimize result_parallel_for_each' from Piotr Dulikowski
This PR rewrites the `utils::result_parallel_for_each`'s implementation to resemble the original `seastar::parallel_for_each` more closely instead of using the less efficient `seastar::map_reduce`. It uses less tasks and allocations now, as demonstrated in the results from the `perf_result_query` benchmark, attached at the end of the cover letter.

The main drawback of the new implementation is that it needs to rethrow exceptions propagated as exceptional futures from the parallel sub-invocations. Contrary to the original `seastar::parallel_for_each` which uses a custom task to collect results, the new `utils::result_parallel_for_each` uses a coroutine and there doesn't currently seem to be a way to co_await for a future and inspect its state without either rethrowing or handling it in then_wrapped (which allocates a continuation). Fortunately, rethrowing is not needed for exceptions returned in failed result<>, which are already intended to be a more performant alternative to regular exceptions.

As a bonus, definitions from `utils/result.hh` are now split across three different headers in order to improve (re)compilation times.

Results from `perf_simple_query --smp 1 --operations-per-shard 1000000 --write` (before vs. after):

```
126872.54 tps ( 67.2 allocs/op,  14.2 tasks/op,   52404 insns/op)
126532.13 tps ( 67.2 allocs/op,  14.2 tasks/op,   52408 insns/op)
126864.99 tps ( 67.2 allocs/op,  14.2 tasks/op,   52428 insns/op)
127073.10 tps ( 67.2 allocs/op,  14.2 tasks/op,   52404 insns/op)
126895.85 tps ( 67.2 allocs/op,  14.2 tasks/op,   52411 insns/op)

127894.02 tps ( 66.2 allocs/op,  13.2 tasks/op,   52036 insns/op)
127671.51 tps ( 66.2 allocs/op,  13.2 tasks/op,   52042 insns/op)
127541.42 tps ( 66.2 allocs/op,  13.2 tasks/op,   52044 insns/op)
127409.10 tps ( 66.2 allocs/op,  13.2 tasks/op,   52052 insns/op)
127831.30 tps ( 66.2 allocs/op,  13.2 tasks/op,   52043 insns/op)
```

Test: unit(dev, debug)

Closes #10053

* github.com:scylladb/scylla:
  utils/result: optimize result_parallel_for_each
  utils/result: split into `combinators` and `loop` file
2022-02-13 12:04:40 +02:00
Avi Kivity
52e707f978 Merge 'gms: gossiper: coroutinize code (continued)' from Pavel Solodovnikov
This series continues the effort of https://github.com/scylladb/scylla/pull/9844 to reduce `seastar::async` usage and coroutinize in the gossiper code.

There are mostly trivial conversions from using `.get()` to `co_await`, where appropriate, as well, as elimination of `seastar::async()` wrappers.

A few more functions are not yet converted, though (e.g. `apply_new_states`, `do_apply_state_locally`, `apply_state_locally`, `apply_state_locally_without_listener_notification`, maybe a few others, as well).

The motivation is to be able to call every public API function of `gossiper` class without requiring `seastar::async` context.

Tests: unit(debug, dev), dtest (topology-related tests)

Closes #10032

* github.com:scylladb/scylla:
  gms: gossiper: coroutinize `wait_for_gossip`
  gms: gossiper: coroutinize `advertise_token_removed`
  gms: gossiper: coroutinize `advertise_removing`
  gms: gossiper: don't wrap `convict` calls into `seastar::async`
  gms: gossiper: coroutinize `handle_major_state_change`
  gms: gossiper: coroutinize `handle_shutdown_msg`
  gms: gossiper: coroutinize `mark_as_shutdown` and `convict`
  gms: gossiper: remove comment about requiring thread context in `mark_alive`
  gms: gossiper: don't use `seastar::async` in `mark_alive`
  gms: gossiper: coroutinize `do_on_change_notifications`
  gms: gossiper: coroutinize `do_before_change_notifications`
  gms: gossiper: coroutinize `real_mark_alive`
  gms: gossiper: coroutinize `mark_dead`
2022-02-13 11:51:44 +02:00
Piotr Dulikowski
dd3284ec38 utils/result: optimize result_parallel_for_each
It now resembles the original parallel_for_each more, but uses a
coroutine instead of a custom `task` to collect not-ready futures.
Although the usage of a coroutine saves on allocations, the drawback is
that there is currently no way to co_await on a future and handle its
exception without throwing or without unconditionally allocating a
then_wrapped or handle_exception continuation - so it introduces a
rethrow.

Furthermore, now failed results and exceptions are treated as equals.
Previously, in case one parallel invocation returned failed result and
another returned an exception, the exception would always be returned.
Now, the failed result/exception of the invocation with the lowest index
is always preferred, regardless of the failure type.

The reimplementation manages to save about 350-400 instructions, one
task and one allocation in the perf_simple_query benchmark in write
mode.

Results from `perf_simple_query --smp 1 --operations-per-shard 1000000
--write` (before vs. after):

```
126872.54 tps ( 67.2 allocs/op,  14.2 tasks/op,   52404 insns/op)
126532.13 tps ( 67.2 allocs/op,  14.2 tasks/op,   52408 insns/op)
126864.99 tps ( 67.2 allocs/op,  14.2 tasks/op,   52428 insns/op)
127073.10 tps ( 67.2 allocs/op,  14.2 tasks/op,   52404 insns/op)
126895.85 tps ( 67.2 allocs/op,  14.2 tasks/op,   52411 insns/op)

127894.02 tps ( 66.2 allocs/op,  13.2 tasks/op,   52036 insns/op)
127671.51 tps ( 66.2 allocs/op,  13.2 tasks/op,   52042 insns/op)
127541.42 tps ( 66.2 allocs/op,  13.2 tasks/op,   52044 insns/op)
127409.10 tps ( 66.2 allocs/op,  13.2 tasks/op,   52052 insns/op)
127831.30 tps ( 66.2 allocs/op,  13.2 tasks/op,   52043 insns/op)
```

Test: unit(dev), unit(result_utils_test, debug)
2022-02-10 18:19:08 +01:00
Piotr Dulikowski
6abeec6299 utils/result: split into combinators and loop file
Segregates result utilities into:

- result.hh - basic definitions related to results with exception
  containers,
- result_combinators.hh - combinators for working with results in
  conjunction with futures,
- result_loop.hh - loop-like combinators, currently has only
  result_parallel_for_each.

The motivation for the split is:

1. In headers, usually only result.hh will be needed, so no need to
   force most .cc files to compile definitions from other files,
2. Less files need to be recompiled when a combinator is added to
   result_combinators or result_loop.

As a bonus, `result_with_exception` was moved from `utils::internal` to
just `utils`.
2022-02-10 18:19:05 +01:00
Piotr Dulikowski
049564bd2d transport: use result_try in process_request_one
Adapts the exception handling logic in process_request_one so that it
uses utils::result_try to handle both C++ exceptions and failed results
in a unified way.
2022-02-10 17:35:32 +01:00
Piotr Dulikowski
98bde8d6d2 storage_proxy: use result_futurize_try in mutate_end
Adapts the mutate_end exception handling logic so that it uses the new
utils::result_futurize_try function to handle both exceptional futures
and failed results in an unified way.
2022-02-10 17:35:32 +01:00
Piotr Dulikowski
d5d24a5140 storage_proxy: temporarily throw exception from result in mutate_end
Temporarily removes the logic which handles failed results in a
non-throwing way. Exceptions from failed results are thrown and handled
in try..catch.

The reason for this change is that it makes the following commit, which
migrates the whole try..catch block to utils::result_futurize_try much
nicer. The next commit will also bring back the non-throwing handling of
the failed result.
2022-02-10 17:35:32 +01:00
Piotr Dulikowski
8d52ceca50 utils: add result_try and result_futurize_try
Adds result_try and result_futurize_try - functions which allow to
convert existing try..catch blocks into a version which handles C++
exceptions, failed results with exception containers and, depending on
the function variant, exceptional futures.
2022-02-10 17:35:32 +01:00
Juliusz Stasiewicz
00a6fda7b9 tracing: Trace slow queries on replicas wrt. parent's clock
Secondary tracing sessions used to compute the execution time
from the point of their `begin()`-ning, not the parent session's
`begin()`. As a result, replica reported a slow query if it
exceeded the entire threshold *on that replica* too.

This change augments `trace_info` with the TS of parent's session
starting point, to be used as a reference on replicas.

Fixes #9403

Closes #10005
2022-02-10 12:03:53 +01:00
Pavel Solodovnikov
e892170c86 raft: add raft tables to extra_durable_tables list
`system.raft`, `system.raft_snapshots` and `system.raft_config`
were missing from the `extra_durable_tables` list, so that
`set_wait_for_sync_to_commitlog(true)` was not enabled when
the tables were re-created via `create_table_from_mutations`.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20220210073418.484843-1-pa.solodovnikov@scylladb.com>
2022-02-10 11:47:41 +02:00
Botond Dénes
ef34c10a94 main: run scylla main to when there are no arguments
main() has some logic to select the main function it will delegate to
based on argv[1]. The intent is that when the value of argv[1] suggest
that the user did not specify a specific app to run, we default to
"server" (scylla proper).
This logic currently breaks down when there are no arguments at all: in
this case the following error is printed and scylla refuses to start:

    error: unrecognized first argument: expected it to be "server", a regular command-line argument or a valid tool name (see `scylla --list-tools`), but got

Fix this by checking for empty argv[1] and defaulting to "server" in
that case.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220210092125.293682-1-bdenes@scylladb.com>
2022-02-10 11:47:20 +02:00
Benny Halevy
c8cf545fdc sstable_directory: process_sstable_dir: use directory_lister
Simplify the implementation by using directory_lister get()
rather than lister::scan_dir.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-10 11:41:50 +02:00
Benny Halevy
6b59c5bccd sstable_directory: process_sstable_dir: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-10 11:41:50 +02:00
Benny Halevy
8b654afc1c sstable_directory: coroutinize process_sstable_dir
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-10 11:41:50 +02:00
Benny Halevy
207174c692 lister: add directory_lister
directory_lister provides a simpler interface
compared to lister.

After creating the directory_lister,
its async get() method should be called repeatedly,
returning a std::optional<directory_entry> each call,
until it returns a disengaged entry or an error.

This is especially suitable for coroutines
as demonstrated in the unit tests that were added.

For example:

    auto dl = directory_lister(path);
    while (auto de = co_await dl.get()) {
        co_await process(*de);
    }

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-10 11:41:50 +02:00
Avi Kivity
2e2b54254c Merge 'docs: update theme 1.1' from David Garcia
Related issue https://github.com/scylladb/sphinx-scylladb-theme/issues/310

ScyllaDB Sphinx Theme 1.1 is now released 🥳

We’ve made a number of updates to update all our dependencies to the latest version and introduced new directives you can use to write great docs.

You can read more about all notable changes [here](https://sphinx-theme.scylladb.com/master/upgrade/CHANGELOG.html#february-2022).

Before,  the theme installed [poetry 1.1.x](https://python-poetry.org/) as a dependency to manage Python dependencies. However, ``poetry 1.2.x`` changed the installation method. Therefore, we've decided to [#307 Make poetry a prerequisite](https://github.com/scylladb/sphinx-scylladb-theme/issues/307) so that you can decide to install the poetry version you prefer.

To preview the docs locally, you should uninstall the previous version of poetry. Then, install the latest version:

1. Uninstall Poetry 1.1.x.

```
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | POETRY_UNINSTALL=1 python -
```

2. Install Poetry 1.2.x. For detailed instructions, see [Poetry installation](https://python-poetry.org/docs/master/#installation).

1. Clone this PR. For more information, see [Cloning pull requests locally](https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally).

2. Uninstall poetry 1.1 and install poetry 1.2. For more information, see **Breaking changes** notice above.

3. Enter the docs folder, and run:

```
make preview
````

4. Open http://127.0.0.1:5500/ with your favorite browser. The doc should render without errors, and the version should be Sphinx Theme version (see the footer) must be ``1.1.x``:

![image](https://user-images.githubusercontent.com/9107969/152107446-52b167d8-c607-4431-a7a4-92579153d024.png)

Closes #10054

* github.com:scylladb/scylla:
  Add missing lexer
  docs: update theme 1.1
2022-02-10 11:14:02 +02:00
Botond Dénes
54b27a6dec Update seastar submodule
* seastar d27bf8b5...299c9474 (1):
  > core/app_template: print debug warning to std::cerr
2022-02-10 09:51:41 +02:00
Nadav Har'El
4937270803 test/alternator: add option to run with Raft-based schema changes
This patch adds a "--raft" option to test/alternator/run to enable the
experimental Raft-based schema changes ("--experimental-features=raft")
when running Scylla for the tests. This is the same option we added to
test/cql-pytest/run in a previous patch.

Note that we still don't have any Alternator tests that pass or fail
differently in these two modes - these will probably come later as we
fix issues #9868 and #6391. But in order to work on fixing those issues
we need to be able to run the tests in Raft mode.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209123144.321344-1-nyh@scylladb.com>
2022-02-10 09:43:10 +02:00
Nadav Har'El
8409a42baa merge: Convert table::compact_sstables to coroutines
Patch series by Mikołaj Sielużycki

  compaction: Fix indentation in table::compact_sstables.
  compaction: Convert table::compact_sstables to coroutines.
2022-02-10 09:10:24 +03:00
Nadav Har'El
a1635b553e cql-pytest: fix detection of "raft" experimental feature
In a previous patch we fixed the output of experimental features list
(issue #10047), so we also need to fix the test code which detects the
"raft" experimental feature - to use the string "raft" and not the
silly byte 4 we had there before.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209104331.312999-1-nyh@scylladb.com>
2022-02-10 09:10:24 +03:00
Nadav Har'El
de586ef856 test/cql-pytest: mechanism for tests requiring raft-based schema updates
Issue #8968 no longer exists when Raft-based schema updates are enabled
in Scylla (with --experimental-features=raft). Before we can close this
issue we need a way to re-run its test

        test_keyspace.py::test_concurrent_create_and_drop_keyspace

with Raft and see it pass. But we also want the tests to continue to run
by default the older raft-less schema updates - so that this mode doesn't
regress during the potentially-long duration that it's still the default!

The solution in this patch is:

1. Introduce a "--raft" option to test/cql-pytest/run, which runs the tests
   against a Scylla with the raft experimental feature, while the default is
   still to run without it.

2. Introduce a text fixture "fails_without_raft" which marks a test which
   is expected to fail with the old pre-raft code, but is expected to
   pass in the new code.

3. Mark the test test_concurrent_create_and_drop_keyspace with this new
   "fails_without_raft".

After this patch, running

        test/cql-pytest/run --raft
            test_keyspace.py::test_concurrent_create_and_drop_keyspace

Passes, which shows that issue 8968 was fixed (in Raft mode) - so we can say:
Fixes #8968

Running the same test without "--raft" still xfails (an expected failure).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220208162732.260888-1-nyh@scylladb.com>
2022-02-10 09:10:24 +03:00
Nadav Har'El
fef7934a2d config: fix some types in system.config virtual table
The system.config virtual tables prints each configuration variable of
type T based on the JSON printer specified in the config_type_for<T>
in db/config.cc.

For two variable types - experimental_features and tri_mode_restriction,
the specified converter was wrong: We used value_to_json<string> or
value_to_json<vector<string>> on something which was *not* a string.
Unfortunately, value_to_json silently casted the given objects into
strings, and the result was garbage: For example as noted in #10047,
for experimental_features instead of printing a list of features *names*,
e.g., "raft", we got a bizarre list of one-byte strings with each feature's
number (which isn't documented or even guaranteed to not change) as well
as carriage-return characters (!?).

So solution is a new printable_to_json<T> which works on a type T that
can be printed with operator<< - as in fact the above two types can -
and the type is converted into a string or vector of strings using this
operator<<, not a cast.

Also added a cql-pytest test for reading system.config and in particular
options of the above two types - checking that they contain sensible
strings and not "garbage" like before this patch.

Fixes #10047.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209090421.298849-1-nyh@scylladb.com>
2022-02-10 09:10:24 +03:00
David Garcia
e092bf3bad Add missing lexer 2022-02-09 11:25:10 +00:00
Mikołaj Sielużycki
ee386213c2 compaction: Fix indentation in table::compact_sstables. 2022-02-09 12:19:23 +01:00
Mikołaj Sielużycki
ec91192525 compaction: Convert table::compact_sstables to coroutines. 2022-02-09 12:19:23 +01:00
David Garcia
24b5584941 docs: update theme 1.1 2022-02-09 11:13:38 +00:00
Avi Kivity
7f0dec9227 Update seastar submodule
* seastar 0d250d15a...d27bf8b5a (5):
  > Merge "Clean internal namespace in io_queue.cc" from Pavel E
  > Making par.._for_each and max_conc.._for_each compatible with move-only views (like generators)
  > tests: Perf test for smp::submit_to efficiency
  > Merge "Auto-increase IO latency goal from reactor" from Pavel E
  > reactor: Fix default task-quota-ms to be 0.5ms
2022-02-09 10:17:26 +02:00
Tomasz Grabiec
23da2b5879 schema_registry: Increase grace period for schema version cache
If version is absent in cache, it will be fetched from the
coordinator. This is not expensive, but if the version is not known,
it must be also "synced". It means that the node will do a full schema
pull from the coordinator. This pull is expensive and can take seconds.

If the coordinator we pull from is at an old version, the pull will do
nothing and current node will soon forget the old version, initiating
another pull.

If some nodes stay at an old version for a long time for some reason,
this will make new coordinators initiate pulls frequently.

Increase the expiration period to 15 minutes to reduce the impact in
such scenarios.

Fixes #10042.

Message-Id: <20220207122317.674241-1-tgrabiec@scylladb.com>
2022-02-09 09:27:07 +02:00
Tomasz Grabiec
7ae947b7e1 Merge "raft: bootstrap nodes as non-voter" from Alejo
Make only the first node in group0 to start as voter. Subsequent nodes
start as non-voters and request change to voter once bootstrap is
successful.

Add support for this in raft and a couple of minor fixes.

* alejo/raft-join-non-voting-v6:
  raft: nodes joining as non-voters
  raft: group 0: use cfg.contains() for config check
  raft: modify_config: support voting state change
  raft: minor: fix log format string
2022-02-09 09:27:07 +02:00
Raphael S. Carvalho
d208d33636 Fix quadratic behavior and compaction inefficiency when adding new files
With trigger_compaction() being called after each new sstable is added
to the set, we'll get quadratic behavior because strategies like
tiered will sort all the candidates before iterating on them, so
complexity is ~ ((N - 1) * N * logN).
Additionally, compaction may be inefficient as we're not waiting for
the sstable set to settle, so table may end up missing files that
would allow for more efficient jobs.
The latter isn't a big problem because we have reshape running in an
earlier phase, so data layout should satisfy the strategy almost.
Boot is not affected by these problems because it temporarily
disables auto compaction, so trigger_compaction() is a no-op for it.
So refresh remains as the only one affected.

Fixes #10046.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220208151154.72606-1-raphaelsc@scylladb.com>
2022-02-09 09:27:07 +02:00
Alejo Sanchez
a0c2bc0df2 raft: nodes joining as non-voters
Except for the first node creating the group0, make other nodes join as
non-voters and make them voters after successful bootstrap.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-02-08 09:16:30 -04:00
Avi Kivity
5099b1e272 Merge 'Propagate coordinator timeouts for regular writes and batches without throwing' from Piotr Dulikowski
Currently, most of the failures that occur during CQL reads or writes are reported using C++ exceptions. Although the seastar framework avoids most of the cost of unwinding by keeping exceptions in futures as `std::exception_ptr`s, the exceptions need to be inspected at various points for the purposes of accounting metrics or converting them to a CQL error response. Analyzing the value and type of an exception held by `std::exception_ptr`'s cannot be done without rethrowing the exception, and that can be very costly even if the exception is immediately caught. Because of that, exceptions are not a good fit for reporting failures which happen frequently during overload, especially if the CPU is the bottleneck.

This PR introduces facilities for reporting exceptions as values using the boost::outcome library. As a first step, the need to use exceptions for reporting timeouts was eliminated for regular and batch writes, and no exceptions are thrown between creation of a `mutation_write_timeout_exception` and its serialization as a CQL response in the `cql_server`.

The types and helpers introduced here can be reused in order to migrate more exceptions and exception paths in a similar fashion.

Results of `perf_simple_query --smp 1 --operations-per-shard 1000000`:

    Master (00a9326ae7)
    128789.53 tps ( 82.2 allocs/op,  12.2 tasks/op,   49245 insns/op)

    This PR
    127072.93 tps ( 82.2 allocs/op,  12.2 tasks/op,   49356 insns/op)

The new version seems to be slower by about 100 insns/op, fortunately not by much (about 0.2%).

Tests: unit(dev), unit(result_utils_test, debug)

Closes #10014

* github.com:scylladb/scylla:
  cql_test_env: optimize handling result_message::exception
  transport/server: handle exceptions from coordinator_result without throwing
  transport/server: propagate coordinator_result to the error handling code
  transport/server: unwrap the exception result_message in process_xyz_internal
  query_processor: add exception-returning variants of execute_ methods
  modification_statement: propagate failed result through result_message::exception
  batch_statement: propagate failed result through result_message::exception
  cql_statement: add `execute_without_checking_exception_message`
  result_message: add result_message::exception
  storage_proxy: change mutate_with_triggers to return future<result<>>
  storage_proxy: add mutate_atomically_result
  storage_proxy: return result<> from mutate_result
  storage_proxy: return result<> from mutate_internal
  storage_proxy: properly propagate future from mutate_begin to mutate_end
  storage_proxy: handle exceptions as values in mutate_end
  storage_proxy: let mutate_end take a future<result<>>
  storage_proxy: resultify mutate_begin
  storage_proxy: use result in the _ready future of write handlers
  storage_proxy: introduce helpers for dealing with results
  exceptions: add coordinator_exception_container and coordinator_result
  utils: add result utils
  utils: add exception_container
2022-02-08 14:27:09 +02:00
Alejo Sanchez
2d9f40f716 raft: group 0: use cfg.contains() for config check
There will be nodes in non-voting state in configuration, so can_vote()
is not a good check. Use newer cfg.contains().

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-02-08 08:00:07 -04:00
Alejo Sanchez
627275945f raft: modify_config: support voting state change
Handle requests to change voting for servers already present in the
current configuration.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-02-08 08:00:07 -04:00
Alejo Sanchez
a40417df08 raft: minor: fix log format string
Fix format string for log line.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-02-08 08:00:07 -04:00
Piotr Dulikowski
ffd439d908 cql_test_env: optimize handling result_message::exception
The single_node_cql_env uses query_processor::execute_xyz family of
methods to perform operations. Due to previous commits in this series,
they allocate one more task than before - a continuation that converts
result_message::exception into an exceptional future. We can recover
that one task by using variants of those methods which do not perform a
conversion, and turn .finally() invocations into .then()s which perform
conversion manually.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
81968f2c3a transport/server: handle exceptions from coordinator_result without throwing
Instead of throwing the exception contained in failed `result<>`, it is
now inspected with a visitor which avoids the need for throwing.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
4cc5d582e3 transport/server: propagate coordinator_result to the error handling code
Now, the failed `result<>` is throwlessly propagated to the continuation
which converts exceptions to CQL response messages, and is thrown there.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
c750f7895f transport/server: unwrap the exception result_message in process_xyz_internal
At the point where `result_message` is converted to a
`cql_server::response`, now the result message is inspected and returned
as failed `result<>` if it contained an error.

For now, the failed `result<>` is thrown as exception in `process` and
`process_on_shard`, but that will change in the next commit.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
53f3feb103 query_processor: add exception-returning variants of execute_ methods
Adds variants of the execute_prepared, execute_direct and execute_batch
which are allowed to return exceptions as `result_message::exception`.

Because the `result_message::exception` must be explicitly handled by
the receiver, new variants are introduced in order not to accidentally
ignore the exception, which would be very bad.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
2572104dfe modification_statement: propagate failed result through result_message::exception
Modifies the modification_statement code so that is converts failed
`result<>` into a `result_message::exception` without involving the C++
exception runtime.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
f9d1914e1c batch_statement: propagate failed result through result_message::exception
Modifies the batch_statement code so that is converts failed `result<>`
into a `result_message::exception` without involving the C++ exception
runtime.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
e1d762b110 cql_statement: add execute_without_checking_exception_message
Adds a new virtual method to the cql_statement with a wordy name. The
new method is a variant of `execute`, but it is allowed to return errors
via the `result_message::exception` object.

The reason for an additional method is that there are many places in the
code which call `execute` but do not check the result in any way.
Because ignoring an exception unintentionally is a very bad thing, the
new method needs to be explicitly implemented by statements which can
return a `result_message::exception`, and explicitly called in the code
which is prepared to handle a `result_message::exception`.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
e4ff22b4ca result_message: add result_message::exception
In order to propagate exceptions as values through the CQL layer with
minimal modifications to the interfaces, a new result_message type is
introduced: result_message::exception. Similarly to
result_message::bounce_to_shard, this is an internal type which is
supposed to be handled before being returned to the client.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
4c1eae7600 storage_proxy: change mutate_with_triggers to return future<result<>>
Changes the interface of `mutate_with_triggers` so that it returns
`future<result<>>` instead of `future<>`. No intermediate
`mutate_with_triggers_result` method is introduced because all call
sites will be changed in this PR so that they properly handle failed
`result<>`s with exceptions-as-values.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
7ed668a177 storage_proxy: add mutate_atomically_result
Similarly to `mutate_result` introduced in the previous commit,
`mutate_atomically_result` is introduced which returns some exceptions
inside `result<>`. The pre-existing `mutate_atomically` keeps the same
interface but uses `mutate_atomically_result` internally, converting
failed `result<>` to exceptional future if needed.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
f9ff5e7692 storage_proxy: return result<> from mutate_result
In order to be able to propagate exceptions-as-values from storage_proxy
but without having to modify all call sites of `mutate`, an in-between
method `mutate_result` is introduced which returns some exceptions
inside `result<>`. Now, `mutate` just calls the latter and converts
those exceptions to exceptional future if needed.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
f02b8614af storage_proxy: return result<> from mutate_internal
Changes the interface of `mutate_internal` so that it returns a
`future<result<>>` instead of `future<>`.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
f8bbf67e64 storage_proxy: properly propagate future from mutate_begin to mutate_end
Modifies all call sites of `mutate_begin` and `mutate_end` so that the
failed result<> created in the former is properly propagated to the
latter.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
e2893368a7 storage_proxy: handle exceptions as values in mutate_end
Instead of stupidly rethrowing the exception in failed result<>, the
`storage_proxy::mutate_end` function now inspects it with a visitor,
which does not involve any rethrows. Moreover, mutate_end now also
returns a `future<result<>>` instead of just `future<>`.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
5c00b27662 storage_proxy: let mutate_end take a future<result<>>
Changes the `storage_proxy::mutate_end` method to accept a
`future<result<>>` instead of `future<>`.

For the time being, all call call sites of that method pass a future
which is either exceptional or contains a result<> with a value.
Moreover, in case of a failed result<>, mutate_end just rethrows the
exception. Both of these will change in the upcoming commits of this PR.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
59efe085af storage_proxy: resultify mutate_begin
Changes the `storage_proxy::mutate_begin` method to return a
future<result<>>.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
3a92513ef6 storage_proxy: use result in the _ready future of write handlers
Changes the type of the _ready promise in
abstract_write_response_handler - a promise used by the coordinator
logic to wait until the write operation is complete - to keep a
`result<>` instead of `void`. Now, a timeout is signalled by setting the
promise to a value containing a `result<>` with a mutation write timeout
exception - previously it was signalled by setting the promise to an
exceptional value.

This is just a first step on a long road of throwless propagation of the
error to the cql_server - for now, a failed result is immediately
converted to an exceptional future in `storage_proxy::response_wait`.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
6ac98f26e0 storage_proxy: introduce helpers for dealing with results
Adds a number of typedefs in order to make working with coordinator
exceptions-as-values easier.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
9304791ce5 exceptions: add coordinator_exception_container and coordinator_result
Adds coordinator_exception_container which is a typedef over
exception_container and is meant to hold exceptions returned from the
coordinator code path. Currently, it can only hold mutation write
timeout exceptions, because only that kind of error will be returned by
value as a result of this PR. In the future, more exception types can be
added.

Adds coordinator_result which is a boost::outcome::result that uses
coordinator_exception_container as the error type.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
11cb670881 utils: add result utils
Adds a number of utilities for working with boost::outcome::result
combined with exception_container. The utilities are meant to help with
migration of the existing code to use the boost::outcome::result:

- `exception_container_throw_policy` - a NoValuePolicy meant to be used
  as a template parameter for the boost::outcome::result. It protects
  the caller of `result::value()` and `result::error()` methods - if the
  caller wishes to get a value but the result has an error
  (exception_container in our case), the exception in the container will
  be thrown instead. In case it's the other way around,
  boost::outcome::bad_result_access is thrown.
- `result_parallel_for_each` - a version of `parallel_for_each` which is
  aware of results and returns a failed result in case any of the
  parallel invocations return a failed result.
- `result_into_future` - converts a result into a future. If the result
  holds a value, converts it into make_ready_future; if it holds an
  exception, the exception is returned as make_exception_future.
- `then_ok_result` takes a `future<T>` and converts it into
  a `future<result<T>>`.
- `result_wrap` adapts a callable of type `T -> future<result<T>>` and
  returns a callable of type `result<T> -> future<result<T>>`.
2022-02-08 11:08:42 +01:00
Raphael S. Carvalho
38f83d8862 compaction_manager: Don't mix member functions and variables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220204190911.37276-1-raphaelsc@scylladb.com>
2022-02-07 18:40:48 +02:00
Botond Dénes
9cfde98cce Merge "Move is_replacing/get_replace_address from database" from Pavel Emelyanov
"
This is the continuation of 3e31126b (Brush up the initial tokens
generation
code). The replica::database is still used as the configuration
provider, and
two of those bits can be easily fixed.
"

tests: unit(dev)
* 'br-database-no-replacing-config' of https://github.com/xemul/scylla:
  database: Move is_replacing() and get_replace_address() (back) into storage_service
  bootstrapper: Get 'is-replacing' via argument too
  bootstrapper: Get replace address via argument
2022-02-07 18:40:48 +02:00
Nadav Har'El
9982a28007 alternator: allow REMOVE of non-existent nested attribute
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.

So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.

Fixes #10043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
2022-02-07 18:40:48 +02:00
Benny Halevy
31f4cd21eb shard_reader: close: degrade error message to warning
1. There's nothing we can do about this error.
2. It doesn't affect any query
3. No need to reprort timeout errors here.

Refs #10029

Note that in 4.6.rc4-0.20220203.34d470967a0 (where the issue above was opened against)
the error is likely to be related to read_ahead failure which
is already reported as a warning in master since fc729a804b.

When backported, this patch should be applied after:
fc729a804b
d7a993043d

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220207080041.174934-1-bhalevy@scylladb.com>
2022-02-07 18:40:48 +02:00
Kamil Braun
93eed6d0c7 service: storage_service: leave Raft group 0 before stop_transport in decommission
Leaving group 0 in `decommission` would previously fail with RPC
exception because it happened after messaging service was shutdown.

Fixes #9845.
Message-Id: <20220201112743.9705-1-kbraun@scylladb.com>
2022-02-07 18:40:48 +02:00
Piotr Sarna
5a13ff09e9 expression: fix get_value for mismatched column definitions
As observed in #10026, after schema changes it somehow happened
that a column defition that does not match any of the base table
columns was passed to expression verification code.
The function that looks up the index of a column happens to return
-1 when it doesn't find anything, so using this returned index
without checking if it's nonnegative results in accessing invalid
vector data, and a segfault or silent memory corruption.
Therefore, an explicit check is added to see if the column was actually
found. This serves two purposes:
 - avoiding segfaults/memory corruption
 - making it easier to investigate the root cause of #10026

Closes #10039
2022-02-07 18:40:48 +02:00
Nadav Har'El
203291f7ba cql: reject a map literal with the same key twice
The CQL parser currently accepts a command like:

    ALTER KEYSPACE ksname WITH replication = {
        'class' : 'NetworkTopologyStrategy',
        'dc1' : 2,
        'dc1' : 3 }

But because these options are read into an std::map, one of the
definitions of 'dc1' is silently ignored (counter-intuitively, it is
the first setting which is kept, and the second setting is ignored.)
But this is most likely a user's typo, so a better choice is to report
this as a parse error instead of arbitrarly and silently keeping just
one of the settings.

This is what Cassandra does since version 3.11 (see
https://issues.apache.org/jira/browse/CASSANDRA-13369 and Cassandra
commit 1a83efe2047d0138725d5e102cc40774f3b14641), and this is what we do
in this patch.

The unit test cassandra_tests/validation/operations/alter_test.py::
testAlterKeyspaceWithMultipleInstancesOfSameDCThrowsSyntaxException,
translated from Cassandra's unit tests, now passes.

Fixes #10037.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207113709.78613-1-nyh@scylladb.com>
2022-02-07 18:40:48 +02:00
Pavel Emelyanov
66b9a53808 database: Move is_replacing() and get_replace_address() (back) into storage_service
Both helpers (natuarally) used to be storage-service methods, but then
were moved to databse because bootstrapper code wanted to know this info.
Now the bootstraper is equipped with necessary arguments.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-07 12:43:08 +03:00
Pavel Emelyanov
469ded71a9 bootstrapper: Get 'is-replacing' via argument too
This also removes the only usage of this helper outside of the storage
service. The place that needs it is the use_strict_sources_for_ranges()
checker and all the callers of it are aware of whether it's replacing
happenning or not.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-07 12:41:02 +03:00
Pavel Emelyanov
9770f54789 bootstrapper: Get replace address via argument
This removes the only usage of db.get_replace_address outside of
storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-07 12:39:51 +03:00
Nadav Har'El
cc57ac8c1c cql3: add a cql3::util::quote() function
The function cql3::util::maybe_quote() is used throughout Scylla to
convert identifier names (column names, table names, etc.) into strings
that can be embedded in CQL commands. maybe_quote() sometimes needs to
quote these identifier names, but when the identifier name is lowercase,
and not a CQL keyword, it is not quoted.

Not quoting identifier names when not needed is nice and pretty, but has
a forward-compatibility problem: If some CQL command with an unquoted
identifier is saved somewhere, and new version of Scylla adss this
identifier as a new reserved keyword - the CQL command will break.

So this patch introduces a new function, cql3::util::quote(), which
unconditionally quotes the given identifier.

The new function is not yet used in Scylla, but we add a unit test
(based on the test of maybe_quote()) to confirm it behaves correctly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-2-nyh@scylladb.com>
2022-02-07 11:33:57 +02:00
Nadav Har'El
5d2f694a90 cql3: fix cql3::util::maybe_quote() for keywords
cql3::util::maybe_quote() is a utility function formatting an identifier
name (table name, column name, etc.) that needs to be embedded in a CQL
statement - and might require quoting if it contains non-alphanumeric
characters, uppercase characters, or a CQL keyword.

maybe_quote() made an effort to only quote the identifier name if neccessary,
e.g., a lowercase name usually does not need quoting. But lowercase names
that are CQL keywords - e.g., to or where - cannot be used as identifiers
without quoting. This can cause problems for code that wants to generate
CQL statements, such as the materialized-view problem in issue #9450 - where
a user had a column called "to" and wanted to create a materialized view
for it.

So in this patch we fix maybe_quote() to recognize invalid identifiers by
using the CQL parser, and quote them. This will quote reserved keywords,
but not so-called unreserved keywords, which *are* allowed as identifiers
and don't need quoting. This addition slows down maybe_quote(), but
maybe_quote() is anyway only used in heavy operations which need to
generate CQL.

This patch also adds two tests that reproduce the bug and verify its
fix:

1. Add to the low-level maybe_quote() test (a C++ unit test) also tests
   that maybe_quote() quotes reserved keywords like "to", but doesn't
   quote unreserved keywords like "int".

2. Add a test reproducing issue #9450 - creating a materialized view
   whose key column is a keyword. This new test passes on Cassandra,
   failed on Scylla before this patch, and passes after this patch.

It is worth noting that maybe_quote() now has a "forward compatiblity"
problem: If we save CQL statements generated by maybe_quote(), and a
future version introduces a new reserved keyword, the parser of the
future version may not be able to parse the saved CQL statement that
was generated with the old mayb_quote() and didn't quote what is now
a keyword. This problem can be solved in two ways:

1. Try hard not to introduced new reserved keywords. Instead, introduce
   unreserved keywords. We've been doing this even before recognizing
   this maybe_quote() future-compatibility problem.

2. In the next patch we will introduce quote() - which unconditionally
   quotes identifier names, even if lowercase. These quoted names will
   be uglier for lowercase names - but will be safe from future
   introduction of new keywords. So we can consider switching some or
   all uses of maybe_quote() to quote().

Fixes #9450

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-1-nyh@scylladb.com>
2022-02-07 11:33:56 +02:00
Nadav Har'El
b3cfd4ce07 cql-pytest: translate Cassandra's tests for ALTER operations
This is a translation of Cassandra's CQL unit test source file
validation/operations/AlterTest.java into our our cql-pytest framework.

This test file includes 24 tests for various types of ALTER operations
(of keyspaces, tables and types). Two additional tests which required
multiple data centers to test were dropped with a comment explaining why.

All 24 tests pass on Cassandra, with 8 failing on Scylla reproducing
one already known Scylla issue and 5 previously-unknown ones:

  Refs #8948:  Cassandra 3.11.10 uses "class" instead of
               "sstable_compression" for compression settings by default
  Refs #9929:  Cassandra added "USING TIMESTAMP" to "ALTER TABLE",
               we didn't.
  Refs #9930:  Forbid re-adding static columns as regular and vice versa
  Refs #9935:  Scylla stores un-expanded compaction class name in system
               tables.
  Refs #10036: Reject empty options while altering a keyspace
  Refs #10037: If there are multiple values for a key, CQL silently
               chooses last value

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220206163820.1875410-2-nyh@scylladb.com>
2022-02-07 10:57:43 +02:00
Nadav Har'El
b61876f4ff test/cql-pytest: implement nodetool.compact()
Implement the nodetool.compact() function, requesting a major compaction
of the given table. As usual for the nodetool.* functions, this is
implemented with the REST API if available (i.e., testing Scylla), or
with the external "nodetool" command if not (for testing Cassandra).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220206163820.1875410-1-nyh@scylladb.com>
2022-02-07 10:57:42 +02:00
Konstantin Osipov
caeaba60f9 cql_repl: use POSIX primitives to reset input/output
Seastar uses POSIX IO for output in addition to C++ iostreams,
e.g. in print_safe(), where it write()s directly to stdout.

Instead of manipulating C++ output streams to reset
stdout/log files, reopen the underlying file descriptors
to output/log files.

Fixes #9962 "cql_repl prints junk into the log"
Message-Id: <20220204205032.1313150-1-kostja@scylladb.com>
2022-02-07 10:53:20 +02:00
Nadav Har'El
c020ed7383 merge: test.py: assorted fixes
Merged patch series by Konstantin Osipov:

Assorted fixes in test.py in preparation for cluster
testing:
- better logging
- async search for unit test cases
- ubuntu fixes

  test.py: highlight the failure cause
  test.py: clean up setting of scylla executable
  test.py: speed up search for tests cases, use async
  test.py: make case cache global
  test.py: make --cpus option work on Ubuntu
  test.py: create an own TestSuite instance for each path/mode combo
  test.py: do not fail entire run if list-content fails due to ASAN
  test.py: print subtest name on cancel
  test.py: fix flake8 complaints
2022-02-06 10:13:36 +02:00
Pavel Solodovnikov
dce3159156 gms: gossiper: coroutinize wait_for_gossip
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:34:52 +03:00
Pavel Solodovnikov
ab41151a41 gms: gossiper: coroutinize advertise_token_removed
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:33:32 +03:00
Pavel Solodovnikov
4416070f56 gms: gossiper: coroutinize advertise_removing
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:33:13 +03:00
Pavel Solodovnikov
e9f5da9507 gms: gossiper: don't wrap convict calls into seastar::async
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:32:14 +03:00
Pavel Solodovnikov
e26829e202 gms: gossiper: coroutinize handle_major_state_change
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
705a759891 gms: gossiper: coroutinize handle_shutdown_msg
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
9ce0e2efa3 gms: gossiper: coroutinize mark_as_shutdown and convict
Since these two functions call each other, convert
to coroutines and eliminate the dependency on `seastar::async`
for both of them at the same time.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
c584a9cc1f gms: gossiper: remove comment about requiring thread context in mark_alive
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
ee30d0a385 gms: gossiper: don't use seastar::async in mark_alive
Since `real_mark_alive` does not require `seastar::async`
now, we can eliminate the wrapping async call, as well.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
529f4d0f98 gms: gossiper: coroutinize do_on_change_notifications
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
37066039df gms: gossiper: coroutinize do_before_change_notifications
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
231d8a3ad4 gms: gossiper: coroutinize real_mark_alive
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:21 +03:00
Pavel Solodovnikov
c929f23b8d gms: gossiper: coroutinize mark_dead
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-02-05 10:15:20 +03:00
Piotr Dulikowski
80f6224959 utils: add exception_container
Adds `exception_container` - a helper type used to hold exceptions as a
value, without involving the std::exception_ptr.

The motivation behind this type is that it allows inspecting exception's
type and value without having to rethrow that exception and catch it,
unlike std::exception_ptr. In our current codebase, some exception
handling paths need to rethrow the exception multiple times in order to
account it into metrics or encode it as an error response to the CQL
client. Some types of exceptions can be thrown very frequently in case
of overload (e.g. timeouts) and inspecting those exceptions with
rethrows can make the overload even worse. For those kinds of exceptions
it is important to handle them as cheaply as possible, and
exception_container used with conjunction with boost::outcome::result
can help achieve that.
2022-02-04 20:18:00 +01:00
Konstantin Osipov
dee6da53b3 test.py: highlight the failure cause
Use color palette to highlight the exception which aborted
the harness.
2022-02-04 17:15:52 +03:00
Konstantin Osipov
56aaabfa31 test.py: clean up setting of scylla executable
Now that suites are per mode, set scylla executable
path once per suite, not once per test.
Ditto for scylla env.
2022-02-04 17:15:52 +03:00
Konstantin Osipov
e9ec69494e test.py: speed up search for tests cases, use async
Search for test cases in parallel.
This speeds up the search for test cases from 30 to 4-5
seconds in absence of test case cache and from 4 to 3
seconds if case cache is present.
2022-02-04 17:15:52 +03:00
Konstantin Osipov
45270f5ad2 test.py: make case cache global
test.py runs each unit test's test case in a separate process.
The list of test cases is built at start, by running --list-cases
for each unit test. The output is cached, so that if one uses --repeat
option, we don't list the cases again and again.

The cache, however, was only useful for --repeat, because it was only
caching the last tests' output, not all tests output, so if I, for example,
run tests like:

./test.py foo bar foo

.. the cache was unused. Make the cache global which simplifies its
logic and makes it work in more cases.
2022-02-04 16:47:35 +03:00
Konstantin Osipov
445f90dc3b test.py: make --cpus option work on Ubuntu
The used API is only available in python3, so use it explicitly.
2022-02-04 16:16:17 +03:00
Konstantin Osipov
60fde39880 test.py: create an own TestSuite instance for each path/mode combo
To run tests in a given mode we will need to start off scylla
clusters, which we would want to pool and reuse between many tests.
TestSuite class was designed to share resources of common tests.
One can't pool together scylla servers compiled with different
tests, so create an own TestSuite instance for each mode.
2022-02-04 16:16:17 +03:00
Konstantin Osipov
efd7b9f4a3 test.py: do not fail entire run if list-content fails due to ASAN
If list-content of one test fails with ASAN error, do not abort.
2022-02-04 16:16:17 +03:00
Konstantin Osipov
c63e0ee271 test.py: print subtest name on cancel 2022-02-04 16:16:17 +03:00
Konstantin Osipov
8cc7c1a5bb test.py: fix flake8 complaints
It's good practice to use linters and style formatters for
all scripted languages. Python community is more strict
about formatting guidelines than others, and using
formatters (like flake8 or black) is almost universally
accepted.

test.py was adhering to flake8 standards at some point,
but later this was spoiled by random commits.
2022-02-04 16:16:17 +03:00
Raphael S. Carvalho
755cec1199 table: Close reader if flush fails to peek into fragment
An OOM failure while peeking into fragment, to determine if reader will
produce any fragment, causes Scylla to abort as flat_mutation_reader
expects reader to be closed before destroyed. Let's close it if
peek() fails, to handle the scenario more gracefully.

Fixes #10027.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220204031553.124848-1-raphaelsc@scylladb.com>
2022-02-04 12:48:36 +02:00
Avi Kivity
fe65122ccd Merge 'Distribute select count(*) queries' from Michał Sala
This pull request speeds up execution of `count(*)` queries. It does so by splitting given query into sub-queries and distributing them across some group of nodes for parallel execution.

New level of coordination was added. Node called super-coordinator splits aggregation query into sub-queries and distributes them across some group of coordinators. Super-coordinator is also responsible for merging results.

To develop a mechanism for speeding up `count(*)` queries, there was a need to detect which queries have a `count(*)` selector. Due to this pull request being a proof of concept, detection was realized rather poorly. It is only allows catching the simplest cases of `count(*)` queries (with only one selector and no column name specified).

After detecting that a query is a `count(*)` it should be split into sub-queries and sent to another coordinators. Splitting part wasn't that difficult, it has been achieved by limiting original query's partition ranges. Sending modified query to another node was much harder. The easiest scenario would be to send whole `cql3::statements::select_statement`. Unfortunately `cql3::statements::select_statement` can't be [de]serialized, so sending it was out of the question. Even more unfortunately, some non-[de]serializable members of `cql3::statements::select_statement` are required to start the execution process of this statement. Finally, I have decided to send a `query::read_command` paired with required [de]serializable members. Objects, that cannot be [de]serialized (such as query's selector) are mocked on the receiving end.

When a super-coordinator receives a `count(*)` query, it splits it into sub-queries. It does so, by splitting original query's partition ranges into list of vnodes, grouping them by their owner and creating sub-queries with partition ranges set to successive results of such grouping. After creation, each sub-query is sent to the owner of its partition ranges. Owner dispatches received sub-query to all of its shards. Shards slice partition ranges of the received sub-query, so that they will only query data that is owned by them. Each shard becomes a coordinator and executes so prepared sub-query.

3 node cluster set up on powerful desktops located in the office (3x32 cores)
Filled the cluster with ~2 * 10^8 rows using scylla-bench and run:
```
time cqlsh <ip> <port> --request-timeout=3600 -e "select count(*) from scylla_bench.test using timeout 1h;"
```

* master: 68s
* this branch: 2s

3 node cluster (each node had 2 shards, `murmur3_ignore_msb_bits` was set to 1, `num_tokens` was set to 3)

```
>  cqlsh -e 'tracing on; select count(*) from ks.t;
Now Tracing is enabled

 count
-------
  1000

(1 rows)

Tracing session: e5852020-7fc3-11ec-8600-4c4c210dd657

 activity                                                                                                                                    | timestamp                  | source    | source_elapsed | client
---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                          Execute CQL3 query | 2022-01-27 22:53:08.770000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                               Parsing a statement [shard 1] | 2022-01-27 22:53:08.770451 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                            Processing a statement [shard 1] | 2022-01-27 22:53:08.770487 | 127.0.0.1 |             36 | 127.0.0.1
                                                                                        Dispatching forward_request to 3 endpoints [shard 1] | 2022-01-27 22:53:08.770509 | 127.0.0.1 |             58 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.1:0 [shard 1] | 2022-01-27 22:53:08.770516 | 127.0.0.1 |             64 | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770519 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770528 | 127.0.0.1 |              9 | 127.0.0.1
                                             Start querying token range ({-4242912715832118944, end}, {-4075408479358018994, end}] [shard 1] | 2022-01-27 22:53:08.770531 | 127.0.0.1 |             12 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770537 | 127.0.0.1 |             18 | 127.0.0.1
                      Scanning cache for range ({-4242912715832118944, end}, {-4075408479358018994, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770541 | 127.0.0.1 |             22 | 127.0.0.1
    Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770589 | 127.0.0.1 |             70 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.2:0 [shard 1] | 2022-01-27 22:53:08.770600 | 127.0.0.1 |            149 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.3:0 [shard 1] | 2022-01-27 22:53:08.770608 | 127.0.0.1 |            157 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770627 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770639 | 127.0.0.1 |             11 | 127.0.0.1
                                               Start querying token range ({2507462623645193091, end}, {3897266736829642805, end}] [shard 0] | 2022-01-27 22:53:08.770643 | 127.0.0.1 |             15 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770646 | 127.0.0.1 |             19 | 127.0.0.1
                        Scanning cache for range ({2507462623645193091, end}, {3897266736829642805, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770649 | 127.0.0.1 |             22 | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770658 | 127.0.0.2 |             -- | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770674 | 127.0.0.3 |              5 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770698 | 127.0.0.2 |             40 | 127.0.0.1
                                             Start querying token range [{4611686018427387904, start}, {5592106830937975806, end}] [shard 1] | 2022-01-27 22:53:08.770704 | 127.0.0.2 |             46 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770710 | 127.0.0.2 |             52 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770712 | 127.0.0.3 |             43 | 127.0.0.1
                      Scanning cache for range [{4611686018427387904, start}, {5592106830937975806, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770714 | 127.0.0.2 |             56 | 127.0.0.1
                                           Start querying token range [{-4611686018427387904, start}, {-4242912715832118944, end}] [shard 1] | 2022-01-27 22:53:08.770718 | 127.0.0.3 |             49 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770739 | 127.0.0.3 |             70 | 127.0.0.1
                    Scanning cache for range [{-4611686018427387904, start}, {-4242912715832118944, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770743 | 127.0.0.3 |             73 | 127.0.0.1
    Page stats: 17 partition(s), 0 static row(s) (0 live, 0 dead), 17 clustering row(s) (17 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770814 | 127.0.0.3 |            145 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770846 | 127.0.0.3 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770862 | 127.0.0.3 |             16 | 127.0.0.1
    Page stats: 71 partition(s), 0 static row(s) (0 live, 0 dead), 71 clustering row(s) (71 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.770865 | 127.0.0.1 |            238 | 127.0.0.1
                                             Start querying token range ({-6683686776653114062, end}, {-6473446911791631266, end}] [shard 0] | 2022-01-27 22:53:08.770867 | 127.0.0.3 |             21 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770874 | 127.0.0.3 |             28 | 127.0.0.1
                      Scanning cache for range ({-6683686776653114062, end}, {-6473446911791631266, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770879 | 127.0.0.3 |             33 | 127.0.0.1
    Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770880 | 127.0.0.2 |            222 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.770888 | 127.0.0.1 |            369 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770909 | 127.0.0.1 |            390 | 127.0.0.1
                                             Start querying token range ({-4075408479358018994, end}, {-3391415989210253693, end}] [shard 1] | 2022-01-27 22:53:08.770911 | 127.0.0.1 |            392 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770914 | 127.0.0.1 |            395 | 127.0.0.1
                      Scanning cache for range ({-4075408479358018994, end}, {-3391415989210253693, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770936 | 127.0.0.1 |            418 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770951 | 127.0.0.2 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770966 | 127.0.0.2 |             15 | 127.0.0.1
    Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.770969 | 127.0.0.3 |            123 | 127.0.0.1
                                                                    Start querying token range (-inf, {-6683686776653114062, end}] [shard 0] | 2022-01-27 22:53:08.770969 | 127.0.0.2 |             18 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770974 | 127.0.0.2 |             23 | 127.0.0.1
                                             Scanning cache for range (-inf, {-6683686776653114062, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770977 | 127.0.0.2 |             26 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.770993 | 127.0.0.3 |            324 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770998 | 127.0.0.3 |            329 | 127.0.0.1
                                                              Start querying token range ({-3391415989210253693, end}, {0, start}) [shard 1] | 2022-01-27 22:53:08.771001 | 127.0.0.3 |            332 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.771004 | 127.0.0.3 |            335 | 127.0.0.1
                                       Scanning cache for range ({-3391415989210253693, end}, {0, start}) and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.771007 | 127.0.0.3 |            338 | 127.0.0.1
    Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.771044 | 127.0.0.1 |            525 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771069 | 127.0.0.1 |            442 | 127.0.0.1
                                                                                                 On shard execution result is [71] [shard 0] | 2022-01-27 22:53:08.771145 | 127.0.0.1 |            518 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771308 | 127.0.0.1 |            789 | 127.0.0.1
                                                                                                 On shard execution result is [60] [shard 1] | 2022-01-27 22:53:08.771351 | 127.0.0.1 |            832 | 127.0.0.1
 Page stats: 127 partition(s), 0 static row(s) (0 live, 0 dead), 127 clustering row(s) (127 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.771379 | 127.0.0.2 |            427 | 127.0.0.1
 Page stats: 183 partition(s), 0 static row(s) (0 live, 0 dead), 183 clustering row(s) (183 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.771385 | 127.0.0.3 |            716 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771402 | 127.0.0.3 |            556 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771403 | 127.0.0.2 |            745 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.771408 | 127.0.0.2 |            750 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771409 | 127.0.0.3 |            563 | 127.0.0.1
                                                                     Start querying token range ({5592106830937975806, end}, +inf) [shard 1] | 2022-01-27 22:53:08.771411 | 127.0.0.2 |            754 | 127.0.0.1
                                           Start querying token range ({-6272011798787969456, end}, {-4611686018427387904, start}) [shard 0] | 2022-01-27 22:53:08.771412 | 127.0.0.3 |            566 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771415 | 127.0.0.3 |            569 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.771415 | 127.0.0.2 |            757 | 127.0.0.1
                                              Scanning cache for range ({5592106830937975806, end}, +inf) and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.771419 | 127.0.0.2 |            761 | 127.0.0.1
                    Scanning cache for range ({-6272011798787969456, end}, {-4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771419 | 127.0.0.3 |            573 | 127.0.0.1
                                                                                    Received forward_result=[131] from 127.0.0.1:0 [shard 1] | 2022-01-27 22:53:08.771454 | 127.0.0.1 |           1003 | 127.0.0.1
    Page stats: 74 partition(s), 0 static row(s) (0 live, 0 dead), 74 clustering row(s) (74 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.771764 | 127.0.0.3 |            918 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771768 | 127.0.0.3 |            922 | 127.0.0.1
                                                               Start querying token range [{0, start}, {2507462623645193091, end}] [shard 0] | 2022-01-27 22:53:08.771771 | 127.0.0.3 |            925 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771775 | 127.0.0.3 |            929 | 127.0.0.1
                                        Scanning cache for range [{0, start}, {2507462623645193091, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771779 | 127.0.0.3 |            933 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771935 | 127.0.0.3 |           1265 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771950 | 127.0.0.2 |            998 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771956 | 127.0.0.2 |           1004 | 127.0.0.1
                                             Start querying token range ({-6473446911791631266, end}, {-6272011798787969456, end}] [shard 0] | 2022-01-27 22:53:08.771959 | 127.0.0.2 |           1008 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771963 | 127.0.0.2 |           1011 | 127.0.0.1
                      Scanning cache for range ({-6473446911791631266, end}, {-6272011798787969456, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771966 | 127.0.0.2 |           1014 | 127.0.0.1
    Page stats: 13 partition(s), 0 static row(s) (0 live, 0 dead), 13 clustering row(s) (13 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772008 | 127.0.0.2 |           1057 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.772012 | 127.0.0.2 |           1061 | 127.0.0.1
                                             Start querying token range ({3897266736829642805, end}, {4611686018427387904, start}) [shard 0] | 2022-01-27 22:53:08.772014 | 127.0.0.2 |           1063 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.772016 | 127.0.0.2 |           1065 | 127.0.0.1
                      Scanning cache for range ({3897266736829642805, end}, {4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.772019 | 127.0.0.2 |           1067 | 127.0.0.1
                                                                                                On shard execution result is [200] [shard 1] | 2022-01-27 22:53:08.772053 | 127.0.0.3 |           1384 | 127.0.0.1
    Page stats: 56 partition(s), 0 static row(s) (0 live, 0 dead), 56 clustering row(s) (56 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772138 | 127.0.0.2 |           1186 | 127.0.0.1
 Page stats: 190 partition(s), 0 static row(s) (0 live, 0 dead), 190 clustering row(s) (190 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.772364 | 127.0.0.2 |           1706 | 127.0.0.1
 Page stats: 149 partition(s), 0 static row(s) (0 live, 0 dead), 149 clustering row(s) (149 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772407 | 127.0.0.3 |           1561 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772417 | 127.0.0.3 |           1571 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.772418 | 127.0.0.2 |           1760 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772426 | 127.0.0.2 |           1475 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772428 | 127.0.0.2 |           1476 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772449 | 127.0.0.3 |           1604 | 127.0.0.1
                                                                                                On shard execution result is [196] [shard 0] | 2022-01-27 22:53:08.772555 | 127.0.0.2 |           1603 | 127.0.0.1
                                                                                                On shard execution result is [238] [shard 1] | 2022-01-27 22:53:08.772674 | 127.0.0.2 |           2016 | 127.0.0.1
                                                                                                On shard execution result is [235] [shard 0] | 2022-01-27 22:53:08.772770 | 127.0.0.3 |           1924 | 127.0.0.1
                                                                                    Received forward_result=[435] from 127.0.0.3:0 [shard 1] | 2022-01-27 22:53:08.772933 | 127.0.0.1 |           2482 | 127.0.0.1
                                                                                    Received forward_result=[434] from 127.0.0.2:0 [shard 1] | 2022-01-27 22:53:08.773110 | 127.0.0.1 |           2658 | 127.0.0.1
                                                                                                           Merged result is [1000] [shard 1] | 2022-01-27 22:53:08.773111 | 127.0.0.1 |           2660 | 127.0.0.1
                                                                                              Done processing - preparing a result [shard 1] | 2022-01-27 22:53:08.773114 | 127.0.0.1 |           2663 | 127.0.0.1
                                                                                                                            Request complete | 2022-01-27 22:53:08.772666 | 127.0.0.1 |           2666 | 127.0.0.1
```

Fixes #1385

Closes #9209

* github.com:scylladb/scylla:
  docs: add parallel aggregations design doc
  db: config: add a flag to disable new parallelized aggregation algorithm
  test: add parallelized select count test
  forward_service: add metrics
  forward_service: parallelize execution across shards
  forward_service: add tracing
  cql3: statements: introduce parallelized_select_statement
  cql3: query_processor: add forward_service reference to query_processor
  gms: add PARALLELIZED_AGGREGATION feature
  service: introduce forward_service
  storage_proxy: extract query_ranges_to_vnodes_generator to a separate file
  messaging_service: add verb for count(*) request forwarding
  cql3: selection: detect if a selection represents count(*)
2022-02-04 12:34:19 +02:00
Nadav Har'El
b54e85088d Merge 'snapshots: Fix snapshot-ctl to include snapshots of dropped tables' from Benny Halevy
Snapshot-ctl methods fetch information about snapshots from
column family objects. The problem with this is that we get rid
of these objects once the table gets dropped, while the snapshots
might still be present (the auto_snapshot option is specifically
made to create this kind of situation). This commit switches from
relying on column family interface to scanning every datadir
that the database knows of in search for "snapshots" folders.

This PR is a rebased version of #9539 (and slightly cleaned-up, cosmetically)
and so it replaces the previous PR.

Fixes #3463
Closes #7122

Closes #9884

* github.com:scylladb/scylla:
  snapshots: Fix snapshot-ctl to include snapshots of dropped tables
  table: snapshot: add debug messages
2022-02-04 12:34:19 +02:00
Piotr Sarna
c613d1ce87 alternator: migrate expression parsers to string_view
Following the advice in the FIXME note, helper functions for parsing
expressions are now based on string views to avoid a few unnecessary
conversions to std::string.

Tests: unit(dev)

Closes #10013
2022-02-04 12:34:19 +02:00
Nadav Har'El
87e48d61a7 build: rebuild relocatable packages if version changed
In commit d72465531e we fixed the building
of relocatable packages of submodules (tools/java, etc.) to use the
top-level Scylla's version. However, if on an active working directory
Scylla's version changes - as we just did from 4.7 to 5.0 - these
relocatable packages are not rebuilt with the new version number, and as
a result some of our scripts (such as the docker build) can't find them.

Because the build-submodule-reloc rule depends on the files
build/SCYLLA-{PRODUCT,VERSION,RELEASE}-FILE (which is what the
aforementioned commit did), in this patch we add those files as a
dependency whenever build-submodule-reloc is used. This means that if
any of these files change, we rebuild the relocatable packages and
anything depending on them (e.g., Debian packages).

Fixes #10018.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202131248.1610678-1-nyh@scylladb.com>
2022-02-03 10:19:15 +02:00
Botond Dénes
996e2f8048 Merge 'Handle serialized_action trigger exceptions' from Benny Halevy
"
which is currently unhandled from multiple call sites, leading to the following warning
as seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/1094/artifact/logs-all.release.2/1643794928169_materialized_views_test.py%3A%3ATestInterruptBuildProcess%3A%3Atest_interrupt_build_process_and_resharding_half_to_max_test/node2.log
```
Scylla version 5.0.dev-0.20220201.a026b4ef4 with build-id cebf6dca8edd8df843a07e0f01a1573f1d0a6dfc starting ...

WARN  2022-02-02 09:31:56,616 [shard 2] seastar - Exceptional future ignored: seastar::sleep_aborted (Sleep is aborted), backtrace: 0x463b65e 0x463bb50 0x463be58 0x426c165 0x230c744 0x42adad4 0x42aeea7 0x42cdb55 0x4281a2a /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/a026b4ef490074df0d31d4b0ed9189d0cfaa745e/scylla/libreloc/libpthread.so.0+0x9298 /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/a026b4ef490074df0d31d4b0ed9189d0cfaa745e/scylla/libreloc/libc.so.6+0x100352
    --------
    seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false> >(seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<serialized_action::trigger(bool)::{lambda()#2}, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
```

Decoded:
```
void seastar::backtrace(seastar::current_backtrace_tasklocal()::$_3&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
    (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:86
seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:137
seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:170
seastar::report_failed_future(std::__exception_ptr::exception_ptr const&) at ./build/release/seastar/./seastar/src/core/future.cc:210
    (inlined by) seastar::report_failed_future(seastar::future_state_base::any&&) at ./build/release/seastar/./seastar/src/core/future.cc:218
seastar::future_state_base::any::check_failure() at ././seastar/include/seastar/core/future.hh:567
    (inlined by) seastar::future_state::clear() at ././seastar/include/seastar/core/future.hh:609
    (inlined by) ~future_state at ././seastar/include/seastar/core/future.hh:614
    (inlined by) ~future at ././seastar/include/seastar/core/scheduling.hh:43
    (inlined by) void seastar::futurize >::satisfy_with_result_of::then_wrapped_nrvo, seastar::future::finally_body >(seastar::future::finally_body&&)::{lambda(seastar::internal::promise_base_with_type&&, serialized_action::trigger(bool)::{lambda()#2}&, seastar::future_state&&)#1}::operator()(seastar::internal::promise_base_with_type, seastar::internal::promise_base_with_type&&, seastar::future_state::finally_body&&::monostate>) const::{lambda()#1}>(seastar::internal::promise_base_with_type, seastar::future::finally_body&&) at ././seastar/include/seastar/core/future.hh:2120
    (inlined by) operator() at ././seastar/include/seastar/core/future.hh:1667
    (inlined by) seastar::continuation, seastar::future::finally_body, seastar::future::then_wrapped_nrvo, serialized_action::trigger(bool)::{lambda()#2}>(serialized_action::trigger(bool)::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type&&, serialized_action::trigger(bool)::{lambda()#2}&, seastar::future_state&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:767
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2344
    (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2754
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2923
operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4128
    (inlined by) void std::__invoke_impl(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_100&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:61
    (inlined by) std::enable_if, void>::type std::__invoke_r(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_100&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/invoke.h:111
    (inlined by) std::_Function_handler::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:291
std::function::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/std_function.h:560
    (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:60
```

This series handles exception handling to serialized actions triggers
that don't handle exceptions.

Test: unit(dev)
"

* tag 'handle-serialized_action-trigger-exception-v1' of https://github.com/bhalevy/scylla:
  migration_manager: passive_announce(version): handle exception
  view_builder: do_build_step: handle unexpected exceptions
  storage_service: no need to include utils/serialized_action.hh
2022-02-03 10:17:59 +02:00
Yaron Kaikov
e6ea0e04ed release: prepare for 5.1.dev 2022-02-03 08:11:24 +02:00
Michał Sala
4903f7a314 docs: add parallel aggregations design doc
Added document describes the design of a mechanism that parallelizes
execution of aggregation queries.
2022-02-02 17:52:22 +01:00
Benny Halevy
b94c9ed3e6 migration_manager: passive_announce(version): handle exception
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-02 14:54:19 +02:00
Benny Halevy
b56b10a4bb view_builder: do_build_step: handle unexpected exceptions
Exception are handled by do_build_step in principle,
Yet if an unhandled exception escapes handling
(e.g. get_units(_sem, 1) fails on a broken semaphore)
we should warn about it since the _build_step.trigger() calls
do no handle exceptions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-02 14:54:19 +02:00
Benny Halevy
71a9524175 storage_service: no need to include utils/serialized_action.hh 2022-02-02 14:42:05 +02:00
Piotr Wojtczak
0dd7739716 snapshots: Fix snapshot-ctl to include snapshots of dropped tables
Snapshot-ctl methods fetch information about snapshots from
column family objects. The problem with this is that we get rid
of these objects once the table gets dropped, while the snapshots
might still be present (the auto_snapshot option is specifically
made to create this kind of situation). This commit switches from
relying on column family interface to scanning every datadir
that the database knows of in search for "snapshots" folders.

Fixes #3463
Closes #7122

Closes #9884

Signed-off-by: Piotr Wojtczak <piotr.m.wojtczak@gmail.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-01 22:31:43 +02:00
Benny Halevy
2a90896b79 table: snapshot: add debug messages
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-01 22:31:37 +02:00
Michał Sala
b439d6e710 db: config: add a flag to disable new parallelized aggregation algorithm
Just in case the new algorithm turns out to be buggy, add a flag to
fall-back to the old algorithm.
2022-02-01 21:26:25 +01:00
Michał Sala
140bab279c test: add parallelized select count test
Added test that checks if a SELECT COUNT(*) query was transformed and
processed in a parallel way. Checking is done by looking at the cql
statistics and comparing subsequent counts of parallelized aggregation
SELECT query executions.
2022-02-01 21:14:41 +01:00
Michał Sala
e6e9553b4a forward_service: add metrics
Introduces metrics for `forward_service`. 3 counters were created, which
allows checking how many requests had been dispached or executed.
2022-02-01 21:14:41 +01:00
Michał Sala
354f7a1c34 forward_service: parallelize execution across shards
Coordinators processed each vnode sequentially on shards when executing
a `forward_request` sent by super-coordinator. This commit changes this
behavior and parallelizes execution of `forward_request` across shards.

It does that by adding additional layer of dispatching to
`forward_service`. When a coordinator receives a `forward_request`, it
forwards it to each of its shards. Shards slice `forward_request`'s
partition ranges so that they will only query data that is owned by
them. Implementation of slicing partition ranges was based on @nyh's
`token_ranges_owned_by_this_shard` from `alternator/ttl.cc`.
2022-02-01 21:14:41 +01:00
Michał Sala
aec96be553 forward_service: add tracing 2022-02-01 21:14:41 +01:00
Michał Sala
f344bd0aaa cql3: statements: introduce parallelized_select_statement
Detect whether a statement is a count(*) query in prepare time. If so,
instantiate a new `select_statement` subclass -
`parallelized_select_statement`. This subclass has a different execution
logic, that enables it to distribute count(*) queries across a cluster.
Also, a new counter was added - `select_parallelized` that counts the
number of parallelized aggregation SELECT query executions.
2022-02-01 21:14:41 +01:00
Michał Sala
66a93d3000 cql3: query_processor: add forward_service reference to query_processor 2022-02-01 21:14:41 +01:00
Michał Sala
3789a4d02b gms: add PARALLELIZED_AGGREGATION feature
This new feature will be used to determined whether the whole
cluster is ready to parallelize execution of aggregation queries.
2022-02-01 21:14:41 +01:00
Michał Sala
a6cf3f52bd service: introduce forward_service
The new service is responsible for:
* spreading forward_request execution across multiple nodes in cluster
* collecting forward_request execution results and merging them

`forward_service::dispatch` method takes forward_request as an
argument, and forwards its execution to group of other nodes (using rpc
verb added in previous commits). Each node (in the group chosen by
dispatch method) is provided with forward_request, which is no different
from the original argument except for changed partition ranges. They are
changed so that vnodes contained in them are owned by recipient node.

Executing forward_request is realized in `forward_service::execute`
method, that is registered to be called on FORWARD_REQUEST verb receipt.
Process of executing forward_request consists of mocking few
non-serializable object (such as `cql3::selection`) in order to create
`service:pager:query_pagers::pager` and `cql3::selection::result_set_builder`.
After pager and result_set_builder creation, execution process resembles
what might be seen in select_statement's execution path.
2022-02-01 21:14:41 +01:00
Michał Sala
0fe59082ec storage_proxy: extract query_ranges_to_vnodes_generator to a separate file
Such separation allows using query_ranges_to_vnodes_generator by other
services without needing a storage_proxy dependency.
2022-02-01 21:14:41 +01:00
Michał Sala
fff454761a messaging_service: add verb for count(*) request forwarding
Except for the verb addition, this commit also defines forward_request
and forward_result structures, used as an argument and result of the new
rpc. forward_request is used to forward information about select
statement that does count(*) (or other aggregating functions such as
max, min, avg in the future). Due to the inability to serialize
cql3::statements::select_statement, I chose to include
query::read_command, dht::partition_range_vector and some configuration
options in forward_request. They can be serialized and are sufficient
enough to allow creation of service::pager::query_pagers::pager.
2022-02-01 21:14:41 +01:00
Michał Sala
bb7edf3785 cql3: selection: detect if a selection represents count(*)
The way that this detection works is a bit clunky, but it does its job
given the simplest cases e.g. "SELECT COUNT(*) FROM ks.t". It fails when
there are multiple selectors, or when there is a column name specified
("SELECT COUNT(column_name) FROM ks.t").
2022-02-01 21:14:41 +01:00
2452 changed files with 151470 additions and 51106 deletions

1
.gitattributes vendored
View File

@@ -1,2 +1,3 @@
*.cc diff=cpp
*.hh diff=cpp
*.svg binary

22
.github/CODEOWNERS vendored
View File

@@ -2,14 +2,14 @@
auth/* @elcallio @vladzcloudius
# CACHE
row_cache* @tgrabiec @haaawk
*mutation* @tgrabiec @haaawk
test/boost/mvcc* @tgrabiec @haaawk
row_cache* @tgrabiec
*mutation* @tgrabiec
test/boost/mvcc* @tgrabiec
# CDC
cdc/* @haaawk @kbr- @elcallio @piodul @jul-stas
test/cql/cdc_* @haaawk @kbr- @elcallio @piodul @jul-stas
test/boost/cdc_* @haaawk @kbr- @elcallio @piodul @jul-stas
cdc/* @kbr- @elcallio @piodul @jul-stas
test/cql/cdc_* @kbr- @elcallio @piodul @jul-stas
test/boost/cdc_* @kbr- @elcallio @piodul @jul-stas
# COMMITLOG / BATCHLOG
db/commitlog/* @elcallio
@@ -28,8 +28,12 @@ transport/*
cql3/* @tgrabiec @psarna @cvybhu
# COUNTERS
counters* @haaawk @jul-stas
tests/counter_test* @haaawk @jul-stas
counters* @jul-stas
tests/counter_test* @jul-stas
# DOCS
docs/* @annastuchlik @tzach
docs/alternator @annastuchlik @tzach @nyh @psarna
# GOSSIP
gms/* @tgrabiec @asias
@@ -74,7 +78,7 @@ alternator/* @nyh @psarna
test/alternator/* @nyh @psarna
# HINTED HANDOFF
db/hints/* @haaawk @piodul @vladzcloudius
db/hints/* @piodul @vladzcloudius
# REDIS
redis/* @nyh @syuu1228

35
.github/workflows/docs-pages.yaml vendored Normal file
View File

@@ -0,0 +1,35 @@
name: "Docs / Publish"
# For more information,
# see https://sphinx-theme.scylladb.com/stable/deployment/production.html#available-workflows
on:
push:
branches:
- master
paths:
- "docs/**"
workflow_dispatch:
jobs:
release:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
with:
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: 3.7
- name: Set up env
run: make -C docs setupenv
- name: Build docs
run: make -C docs multiversion
- name: Build redirects
run: make -C docs redirects
- name: Deploy docs to GitHub Pages
run: ./docs/_utils/deploy.sh
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -1,29 +0,0 @@
name: "Docs / Publish"
on:
push:
branches:
- master
paths:
- "docs/**"
workflow_dispatch:
jobs:
release:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
with:
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v1
with:
python-version: 3.7
- name: Build docs
run: make -C docs multiversion
- name: Deploy
run: ./docs/_utils/deploy.sh
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

28
.github/workflows/docs-pr.yaml vendored Normal file
View File

@@ -0,0 +1,28 @@
name: "Docs / Build PR"
# For more information,
# see https://sphinx-theme.scylladb.com/stable/deployment/production.html#available-workflows
on:
pull_request:
branches:
- master
paths:
- "docs/**"
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
with:
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: 3.7
- name: Set up env
run: make -C docs setupenv
- name: Build docs
run: make -C docs test

View File

@@ -1,25 +0,0 @@
name: "Docs / Build PR"
on:
pull_request:
branches:
- master
paths:
- "docs/**"
jobs:
build:
name: Build
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
with:
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v1
with:
python-version: 3.7
- name: Build docs
run: make -C docs test

4
.gitignore vendored
View File

@@ -22,6 +22,7 @@ resources
.pytest_cache
/expressions.tokens
tags
!db/tags/
testlog
test/*/*.reject
.vscode
@@ -29,3 +30,6 @@ docs/_build
docs/poetry.lock
compile_commands.json
.ccls-cache/
.mypy_cache
.envrc
rust/Cargo.lock

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../scylla-seastar
url = ../seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

3
.mailmap Normal file
View File

@@ -0,0 +1,3 @@
Avi Kivity <avi@scylladb.com> Avi Kivity' via ScyllaDB development <scylladb-dev@googlegroups.com>
Raphael S. Carvalho <raphaelsc@scylladb.com> Raphael S. Carvalho' via ScyllaDB development <scylladb-dev@googlegroups.com>
Pavel Emelyanov <xemul@scylladb.com> Pavel Emelyanov' via ScyllaDB development <scylladb-dev@googlegroups.com>

View File

@@ -189,6 +189,8 @@ set(swagger_files
api/api-doc/storage_service.json
api/api-doc/stream_manager.json
api/api-doc/system.json
api/api-doc/task_manager.json
api/api-doc/task_manager_test.json
api/api-doc/utils.json)
set(swagger_gen_files)
@@ -301,6 +303,8 @@ set(scylla_sources
api/storage_service.cc
api/stream_manager.cc
api/system.cc
api/task_manager.cc
api/task_manager_test.cc
atomic_cell.cc
auth/allow_all_authenticator.cc
auth/allow_all_authorizer.cc
@@ -337,7 +341,6 @@ set(scylla_sources
compaction/size_tiered_compaction_strategy.cc
compaction/time_window_compaction_strategy.cc
compress.cc
connection_notifier.cc
converting_mutation_partition_applier.cc
counters.cc
cql3/abstract_marker.cc
@@ -350,6 +353,7 @@ set(scylla_sources
cql3/cql3_type.cc
cql3/expr/expression.cc
cql3/expr/prepare_expr.cc
cql3/expr/restrictions.cc
cql3/functions/aggregate_fcts.cc
cql3/functions/castas_fcts.cc
cql3/functions/error_injection_fcts.cc
@@ -363,7 +367,6 @@ set(scylla_sources
cql3/prepare_context.cc
cql3/query_options.cc
cql3/query_processor.cc
cql3/relation.cc
cql3/restrictions/statement_restrictions.cc
cql3/result_set.cc
cql3/role_name.cc
@@ -374,7 +377,6 @@ set(scylla_sources
cql3/selection/selector_factories.cc
cql3/selection/simple_selector.cc
cql3/sets.cc
cql3/single_column_relation.cc
cql3/statements/alter_keyspace_statement.cc
cql3/statements/alter_service_level_statement.cc
cql3/statements/alter_table_statement.cc
@@ -426,8 +428,9 @@ set(scylla_sources
cql3/statements/sl_prop_defs.cc
cql3/statements/truncate_statement.cc
cql3/statements/update_statement.cc
cql3/statements/strongly_consistent_modification_statement.cc
cql3/statements/strongly_consistent_select_statement.cc
cql3/statements/use_statement.cc
cql3/token_relation.cc
cql3/type_json.cc
cql3/untyped_result_set.cc
cql3/update_parameters.cc
@@ -453,6 +456,7 @@ set(scylla_sources
db/large_data_handler.cc
db/legacy_schema_migrator.cc
db/marshal/type_parser.cc
db/rate_limiter.cc
db/schema_tables.cc
db/size_estimates_virtual_reader.cc
db/snapshot-ctl.cc
@@ -468,10 +472,10 @@ set(scylla_sources
dht/murmur3_partitioner.cc
dht/range_streamer.cc
dht/token.cc
distributed_loader.cc
replica/distributed_loader.cc
duration.cc
exceptions/exceptions.cc
flat_mutation_reader.cc
readers/mutation_readers.cc
frozen_mutation.cc
frozen_schema.cc
generic_server.cc
@@ -491,7 +495,7 @@ set(scylla_sources
index/secondary_index_manager.cc
init.cc
keys.cc
lister.cc
utils/lister.cc
locator/abstract_replication_strategy.cc
locator/azure_snitch.cc
locator/ec2_multi_region_snitch.cc
@@ -509,7 +513,7 @@ set(scylla_sources
locator/token_metadata.cc
lang/lua.cc
main.cc
memtable.cc
replica/memtable.cc
message/messaging_service.cc
multishard_mutation_query.cc
mutation.cc
@@ -518,7 +522,7 @@ set(scylla_sources
mutation_partition_serializer.cc
mutation_partition_view.cc
mutation_query.cc
mutation_reader.cc
readers/mutation_reader.cc
mutation_writer/feed_writers.cc
mutation_writer/multishard_writer.cc
mutation_writer/partition_based_splitting_writer.cc
@@ -528,12 +532,14 @@ set(scylla_sources
partition_version.cc
querier.cc
query.cc
query_ranges_to_vnodes.cc
query-result-set.cc
raft/fsm.cc
raft/log.cc
raft/raft.cc
raft/server.cc
raft/tracker.cc
service/broadcast_tables/experimental/lang.cc
range_tombstone.cc
range_tombstone_list.cc
tombstone_gc_options.cc
@@ -562,6 +568,7 @@ set(scylla_sources
schema_registry.cc
serializer.cc
service/client_state.cc
service/forward_service.cc
service/migration_manager.cc
service/misc_services.cc
service/pager/paging_state.cc
@@ -574,7 +581,6 @@ set(scylla_sources
service/qos/qos_common.cc
service/qos/service_level_controller.cc
service/qos/standard_service_level_distributed_data_accessor.cc
service/raft/raft_gossip_failure_detector.cc
service/raft/raft_group_registry.cc
service/raft/raft_rpc.cc
service/raft/raft_sys_table_storage.cc
@@ -613,6 +619,7 @@ set(scylla_sources
streaming/stream_task.cc
streaming/stream_transfer_task.cc
table_helper.cc
tasks/task_manager.cc
thrift/controller.cc
thrift/handler.cc
thrift/server.cc

View File

@@ -18,3 +18,5 @@ If you need help formatting or sending patches, [check out these instructions](h
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
Header files in Scylla must be self-contained, i.e., each can be included without having to include specific other headers first. To verify that your change did not break this property, run `ninja dev-headers`. If you added or removed header files, you must `touch configure.py` first - this will cause `configure.py` to be automatically re-run to generate a fresh list of header files.
For more criteria on what reviewers consider good code, see the [review checklist](https://github.com/scylladb/scylla/blob/master/docs/dev/review-checklist.md).

View File

@@ -383,6 +383,40 @@ Open the link printed at the end. Be horrified. Go and write more tests.
For more details see `./scripts/coverage.py --help`.
### Resolving stack backtraces
Scylla may print stack backtraces to the log for several reasons.
For example:
- When aborting (e.g. due to assertion failure, internal error, or segfault)
- When detecting seastar reactor stalls (where a seastar task runs for a long time without yielding the cpu to other tasks on that shard)
The backtraces contain code pointers so they are not very helpful without resolving into code locations.
To resolve the backtraces, one needs the scylla relocatable package that contains the scylla binary (with debug information),
as well as the dynamic libraries it is linked against.
Builds from our automated build system are uploaded to the cloud
and can be searched on http://backtrace.scylladb.com/
Make sure you have the scylla server exact `build-id` to locate
its respective relocatable package, required for decoding backtraces it prints.
The build-id is printed to the system log when scylla starts.
It can also be found by executing `scylla --build-id`, or
by using the `file` utility, for example:
```
$ scylla --build-id
4cba12e6eb290a406bfa4930918db23941fd4be3
$ file scylla
scylla: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=4cba12e6eb290a406bfa4930918db23941fd4be3, with debug_info, not stripped, too many notes (256)
```
To find the build-id of a coredump, use the `eu-unstrip` utility as follows:
```
$ eu-unstrip -n --core <coredump> | awk '/scylla$/ { s=$2; sub(/@.*$/, "", s); print s; exit(0); }'
4cba12e6eb290a406bfa4930918db23941fd4be3
```
### Core dump debugging
See [debugging.md](debugging.md).
See [debugging.md](docs/dev/debugging.md).

View File

@@ -42,7 +42,7 @@ For further information, please see:
* [Docker image build documentation] for information on how to build Docker images.
[developer documentation]: HACKING.md
[build documentation]: docs/guides/building.md
[build documentation]: docs/dev/building.md
[docker image build documentation]: dist/docker/debian/README.md
## Running Scylla
@@ -65,7 +65,7 @@ $ ./tools/toolchain/dbuild ./build/release/scylla --help
## Testing
See [test.py manual](docs/guides/testing.md).
See [test.py manual](docs/dev/testing.md).
## Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and
@@ -78,7 +78,7 @@ and the current compatibility of this feature as well as Scylla-specific extensi
## Documentation
Documentation can be found [here](https://scylla.docs.scylladb.com).
Documentation can be found [here](docs/dev/README.md).
Seastar documentation can be found [here](http://docs.seastar.io/master/index.html).
User documentation can be found [here](https://docs.scylladb.com/).

View File

@@ -1,11 +1,12 @@
#!/bin/sh
USAGE=$(cat <<-END
Usage: $(basename "$0") [-h|--help] [-o|--output-dir PATH] -- generate Scylla version and build information files.
Usage: $(basename "$0") [-h|--help] [-o|--output-dir PATH] [--date-stamp DATE] -- generate Scylla version and build information files.
Options:
-h|--help show this help message.
-o|--output-dir PATH specify destination path at which the version files are to be created.
-d|--date-stamp DATE manually set date for release parameter
By default, the script will attempt to parse 'version' file
in the current directory, which should contain a string of
@@ -31,6 +32,8 @@ using '-o PATH' option.
END
)
DATE=""
while [[ $# -gt 0 ]]; do
opt="$1"
case $opt in
@@ -43,6 +46,11 @@ while [[ $# -gt 0 ]]; do
shift
shift
;;
--date-stamp)
DATE="$2"
shift
shift
;;
*)
echo "Unexpected argument found: $1"
echo
@@ -58,24 +66,33 @@ if [ -z "$OUTPUT_DIR" ]; then
OUTPUT_DIR="$SCRIPT_DIR/build"
fi
if [ -z "$DATE" ]; then
DATE=$(date --utc +%Y%m%d)
fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=5.0.13
VERSION=5.2.0-dev
if test -f version
then
SCYLLA_VERSION=$(cat version | awk -F'-' '{print $1}')
SCYLLA_RELEASE=$(cat version | awk -F'-' '{print $2}')
else
DATE=$(date --utc +%Y%m%d)
GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1)
SCYLLA_VERSION=$VERSION
# For custom package builds, replace "0" with "counter.your_name",
# where counter starts at 1 and increments for successive versions.
# This ensures that the package manager will select your custom
# package over the standard release.
SCYLLA_BUILD=0
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
if [ -z "$SCYLLA_RELEASE" ]; then
DATE=$(date --utc +%Y%m%d)
GIT_COMMIT=$(git -C "$SCRIPT_DIR" log --pretty=format:'%h' -n 1 --abbrev=12)
# For custom package builds, replace "0" with "counter.your_name",
# where counter starts at 1 and increments for successive versions.
# This ensures that the package manager will select your custom
# package over the standard release.
SCYLLA_BUILD=0
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
elif [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
echo "setting SCYLLA_RELEASE only makes sense in clean builds" 1>&2
exit 1
fi
fi
if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then

2
abseil

Submodule abseil updated: f70eadadd7...7f3c0d7811

View File

@@ -129,11 +129,12 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::strin
std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};
const column_definition* salted_hash_col = schema->get_column_definition(bytes("salted_hash"));
if (!salted_hash_col) {
co_return coroutine::make_exception(api_error::unrecognized_client(format("Credentials cannot be fetched for: {}", username)));
co_await coroutine::return_exception(api_error::unrecognized_client(format("Credentials cannot be fetched for: {}", username)));
}
auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col});
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id}, selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice));
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,
proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
auto cl = auth::password_authenticator::consistency_for_user(username);
service::client_state client_state{service::client_state::internal_tag()};
@@ -145,11 +146,11 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::strin
auto result_set = builder.build();
if (result_set->empty()) {
co_return coroutine::make_exception(api_error::unrecognized_client(format("User not found: {}", username)));
co_await coroutine::return_exception(api_error::unrecognized_client(format("User not found: {}", username)));
}
const bytes_opt& salted_hash = result_set->rows().front().front(); // We only asked for 1 row and 1 column
if (!salted_hash) {
co_return coroutine::make_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));
co_await coroutine::return_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));
}
co_return value_cast<sstring>(utf8_type->deserialize(*salted_hash));
}

View File

@@ -14,6 +14,8 @@
#include "db/config.hh"
#include "cdc/generation_service.hh"
#include "service/memory_limiter.hh"
#include "auth/service.hh"
#include "service/qos/service_level_controller.hh"
using namespace seastar;
@@ -28,6 +30,8 @@ controller::controller(
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
const db::config& config)
: _gossiper(gossiper)
, _proxy(proxy)
@@ -35,6 +39,8 @@ controller::controller(
, _sys_dist_ks(sys_dist_ks)
, _cdc_gen_svc(cdc_gen_svc)
, _memory_limiter(memory_limiter)
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _config(config)
{
}
@@ -77,7 +83,7 @@ future<> controller::start_server() {
auto get_cdc_metadata = [] (cdc::generation_service& svc) { return std::ref(svc.get_cdc_metadata()); };
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks), sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value()).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper)).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();
// Note: from this point on, if start_server() throws for any reason,
// it must first call stop_server() to stop the executor and server
// services we just started - or Scylla will cause an assertion

View File

@@ -34,6 +34,14 @@ class gossiper;
}
namespace auth {
class service;
}
namespace qos {
class service_level_controller;
}
namespace alternator {
// This is the official DynamoDB API version.
@@ -53,6 +61,8 @@ class controller : public protocol_server {
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<cdc::generation_service>& _cdc_gen_svc;
sharded<service::memory_limiter>& _memory_limiter;
sharded<auth::service>& _auth_service;
sharded<qos::service_level_controller>& _sl_controller;
const db::config& _config;
std::vector<socket_address> _listen_addresses;
@@ -68,6 +78,8 @@ public:
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
const db::config& config);
virtual sstring name() const override;

View File

@@ -73,6 +73,9 @@ public:
static api_error serialization(std::string msg) {
return api_error("SerializationException", std::move(msg));
}
static api_error table_not_found(std::string msg) {
return api_error("TableNotFoundException", std::move(msg));
}
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);
}

File diff suppressed because it is too large Load Diff

View File

@@ -81,10 +81,10 @@ namespace parsed {
class path;
};
const std::map<sstring, sstring>& get_tags_of_table(schema_ptr schema);
std::optional<std::string> find_tag(const schema& s, const sstring& tag);
future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map);
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
bool is_alternator_keyspace(const sstring& ks_name);
// Wraps the db::get_tags_of_table and throws if the table is missing the tags extension.
const std::map<sstring, sstring>& get_tags_of_table_or_throw(schema_ptr schema);
// An attribute_path_map object is used to hold data for various attributes
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
@@ -144,6 +144,11 @@ template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
// attrs_to_get lists which top-level attribute are needed, and possibly also
// which part of the top-level attribute is really needed (when nested
// attribute paths appeared in the query).
// Most code actually uses optional<attrs_to_get>. There, a disengaged
// optional means we should get all attributes, not specific ones.
using attrs_to_get = attribute_path_map<std::monostate>;
@@ -191,6 +196,7 @@ public:
future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_records(client_state& client_state, tracing::trace_state_ptr, service_permit permit, rjson::value request);
future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request);
future<> start();
future<> stop() { return make_ready_future<>(); }
@@ -206,21 +212,25 @@ public:
private:
friend class rmw_operation;
static bool is_alternator_keyspace(const sstring& ks_name);
static sstring make_keyspace_name(const sstring& table_name);
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr);
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&);
public:
public:
static std::optional<rjson::value> describe_single_item(schema_ptr,
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const attrs_to_get&);
const std::optional<attrs_to_get>&);
static std::vector<rjson::value> describe_multi_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<bytes_opt>&,
const attrs_to_get&,
const std::optional<attrs_to_get>&,
rjson::value&,
bool = false);

View File

@@ -29,7 +29,7 @@
namespace alternator {
template <typename Func, typename Result = std::result_of_t<Func(expressionsParser&)>>
Result do_with_parser(std::string input, Func&& f) {
Result do_with_parser(std::string_view input, Func&& f) {
expressionsLexer::InputStreamType input_stream{
reinterpret_cast<const ANTLR_UINT8*>(input.data()),
ANTLR_ENC_UTF8,
@@ -44,7 +44,7 @@ Result do_with_parser(std::string input, Func&& f) {
}
parsed::update_expression
parse_update_expression(std::string query) {
parse_update_expression(std::string_view query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::update_expression));
} catch (...) {
@@ -53,7 +53,7 @@ parse_update_expression(std::string query) {
}
std::vector<parsed::path>
parse_projection_expression(std::string query) {
parse_projection_expression(std::string_view query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::projection_expression));
} catch (...) {
@@ -62,7 +62,7 @@ parse_projection_expression(std::string query) {
}
parsed::condition_expression
parse_condition_expression(std::string query) {
parse_condition_expression(std::string_view query) {
try {
return do_with_parser(query, std::mem_fn(&expressionsParser::condition_expression));
} catch (...) {

View File

@@ -26,9 +26,9 @@ public:
using runtime_error::runtime_error;
};
parsed::update_expression parse_update_expression(std::string query);
std::vector<parsed::path> parse_projection_expression(std::string query);
parsed::condition_expression parse_condition_expression(std::string query);
parsed::update_expression parse_update_expression(std::string_view query);
std::vector<parsed::path> parse_projection_expression(std::string_view query);
parsed::condition_expression parse_condition_expression(std::string_view query);
void resolve_update_expression(parsed::update_expression& ue,
const rjson::value* expression_attribute_names,

View File

@@ -14,11 +14,14 @@
#include "rapidjson/writer.h"
#include "concrete_types.hh"
#include "cql3/type_json.hh"
#include "position_in_partition.hh"
static logging::logger slogger("alternator-serialization");
namespace alternator {
bool is_alternator_keyspace(const sstring& ks_name);
type_info type_info_from_string(std::string_view type) {
static thread_local const std::unordered_map<std::string_view, type_info> type_infos = {
{"S", {alternator_type::S, utf8_type}},
@@ -162,31 +165,42 @@ bytes get_key_column_value(const rjson::value& item, const column_definition& co
}
// Parses the JSON encoding for a key value, which is a map with a single
// entry, whose key is the type (expected to match the key column's type)
// and the value is the encoded value.
bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column) {
// entry whose key is the type and the value is the encoded value.
// If this type does not match the desired "type_str", an api_error::validation
// error is thrown (the "name" parameter is the name of the column which will
// mentioned in the exception message).
// If the type does match, a reference to the encoded value is returned.
static const rjson::value& get_typed_value(const rjson::value& key_typed_value, std::string_view type_str, std::string_view name, std::string_view value_name) {
if (!key_typed_value.IsObject() || key_typed_value.MemberCount() != 1 ||
!key_typed_value.MemberBegin()->value.IsString()) {
throw api_error::validation(
format("Malformed value object for key column {}: {}",
column.name_as_text(), key_typed_value));
format("Malformed value object for {} {}: {}",
value_name, name, key_typed_value));
}
auto it = key_typed_value.MemberBegin();
if (it->name != type_to_string(column.type)) {
if (rjson::to_string_view(it->name) != type_str) {
throw api_error::validation(
format("Type mismatch: expected type {} for key column {}, got type {}",
type_to_string(column.type), column.name_as_text(), it->name));
format("Type mismatch: expected type {} for {} {}, got type {}",
type_str, value_name, name, it->name));
}
std::string_view value_view = rjson::to_string_view(it->value);
return it->value;
}
// Parses the JSON encoding for a key value, which is a map with a single
// entry, whose key is the type (expected to match the key column's type)
// and the value is the encoded value.
bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column) {
auto& value = get_typed_value(key_typed_value, type_to_string(column.type), column.name_as_text(), "key column");
std::string_view value_view = rjson::to_string_view(value);
if (value_view.empty()) {
throw api_error::validation(
format("The AttributeValue for a key attribute cannot contain an empty string value. Key: {}", column.name_as_text()));
}
if (column.type == bytes_type) {
return rjson::base64_decode(it->value);
return rjson::base64_decode(value);
} else {
return column.type->from_string(rjson::to_string_view(it->value));
return column.type->from_string(value_view);
}
}
@@ -237,6 +251,39 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema) {
return clustering_key::from_exploded(raw_ck);
}
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {
auto ck = ck_from_json(item, schema);
if (is_alternator_keyspace(schema->ks_name())) {
return position_in_partition::for_key(std::move(ck));
}
const auto region_item = rjson::find(item, scylla_paging_region);
const auto weight_item = rjson::find(item, scylla_paging_weight);
if (bool(region_item) != bool(weight_item)) {
throw api_error::validation("Malformed value object: region and weight has to be either both missing or both present");
}
partition_region region;
bound_weight weight;
if (region_item) {
auto region_view = rjson::to_string_view(get_typed_value(*region_item, "S", scylla_paging_region, "key region"));
auto weight_view = rjson::to_string_view(get_typed_value(*weight_item, "N", scylla_paging_weight, "key weight"));
auto region = parse_partition_region(region_view);
if (weight_view == "-1") {
weight = bound_weight::before_all_prefixed;
} else if (weight_view == "0") {
weight = bound_weight::equal;
} else if (weight_view == "1") {
weight = bound_weight::after_all_prefixed;
} else {
throw std::runtime_error(fmt::format("Invalid value for weight: {}", weight_view));
}
return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);
}
if (ck.is_empty()) {
return position_in_partition(position_in_partition::partition_start_tag_t());
}
return position_in_partition::for_key(std::move(ck));
}
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {
if (!v.IsObject() || v.MemberCount() != 1) {
throw api_error::validation(format("{}: invalid number object", diagnostic));

View File

@@ -17,6 +17,8 @@
#include "utils/rjson.hh"
#include "utils/big_decimal.hh"
class position_in_partition;
namespace alternator {
enum class alternator_type : int8_t {
@@ -33,6 +35,9 @@ struct type_representation {
data_type dtype;
};
inline constexpr std::string_view scylla_paging_region(":scylla:paging:region");
inline constexpr std::string_view scylla_paging_weight(":scylla:paging:weight");
type_info type_info_from_string(std::string_view type);
type_representation represent_type(alternator_type atype);
@@ -47,6 +52,7 @@ rjson::value json_key_column_value(bytes_view cell, const column_definition& col
partition_key pk_from_json(const rjson::value& item, schema_ptr schema);
clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema);
// If v encodes a number (i.e., it is a {"N": [...]}, returns an object representing it. Otherwise,
// raises ValidationException with diagnostic.

View File

@@ -16,11 +16,11 @@
#include <seastar/util/short_streams.hh>
#include "seastarx.hh"
#include "error.hh"
#include "service/qos/service_level_controller.hh"
#include "utils/rjson.hh"
#include "auth.hh"
#include <cctype>
#include "service/storage_proxy.hh"
#include "locator/snitch_base.hh"
#include "gms/gossiper.hh"
#include "utils/overloaded_functor.hh"
#include "utils/fb_utilities.hh"
@@ -152,8 +152,10 @@ public:
protected:
void generate_error_reply(reply& rep, const api_error& err) {
rep._content += "{\"__type\":\"com.amazonaws.dynamodb.v20120810#" + err._type + "\"," +
"\"message\":\"" + err._msg + "\"}";
rjson::value results = rjson::empty_object();
rjson::add(results, "__type", rjson::from_string("com.amazonaws.dynamodb.v20120810#" + err._type));
rjson::add(results, "message", err._msg);
rep._content = rjson::print(std::move(results));
rep._status = err._http_code;
slogger.trace("api_handler error case: {}", rep._content);
}
@@ -199,10 +201,9 @@ protected:
// It's very easy to get a list of all live nodes on the cluster,
// using _gossiper().get_live_members(). But getting
// just the list of live nodes in this DC needs more elaborate code:
sstring local_dc = locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(
utils::fb_utilities::get_broadcast_address());
std::unordered_set<gms::inet_address> local_dc_nodes =
_proxy.get_token_metadata_ptr()->get_topology().get_datacenter_endpoints().at(local_dc);
auto& topology = _proxy.get_token_metadata_ptr()->get_topology();
sstring local_dc = topology.get_datacenter();
std::unordered_set<gms::inet_address> local_dc_nodes = topology.get_datacenter_endpoints().at(local_dc);
for (auto& ip : local_dc_nodes) {
if (_gossiper.is_alive(ip)) {
rjson::push_back(results, rjson::from_string(ip.to_sstring()));
@@ -234,7 +235,7 @@ protected:
future<std::string> server::verify_signature(const request& req, const chunked_content& content) {
if (!_enforce_authorization) {
slogger.debug("Skipping authorization");
return make_ready_future<std::string>("<unauthenticated request>");
return make_ready_future<std::string>();
}
auto host_it = req._headers.find("Host");
if (host_it == req._headers.end()) {
@@ -364,7 +365,9 @@ static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_
tracing::add_session_param(trace_state, "alternator_op", op);
tracing::add_query(trace_state, truncated_content_view(query, buf));
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
tracing::set_username(trace_state, auth::authenticated_user(username));
if (!username.empty()) {
tracing::set_username(trace_state, auth::authenticated_user(username));
}
}
return trace_state;
}
@@ -407,7 +410,11 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
executor::client_state client_state{executor::client_state::internal_tag()};
executor::client_state client_state = username.empty()
? service::client_state{service::client_state::internal_tag()}
: service::client_state{service::client_state::internal_tag(), _auth_service, _sl_controller, username};
co_await client_state.maybe_update_per_service_level_params();
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);
tracing::trace(trace_state, op);
rjson::value json_request = co_await _json_parser.parse(std::move(content));
@@ -440,12 +447,14 @@ void server::set_routes(routes& r) {
//FIXME: A way to immediately invalidate the cache should be considered,
// e.g. when the system table which stores the keys is changed.
// For now, this propagation may take up to 1 minute.
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper)
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& auth_service, qos::service_level_controller& sl_controller)
: _http_server("http-alternator")
, _https_server("https-alternator")
, _executor(exec)
, _proxy(proxy)
, _gossiper(gossiper)
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _key_cache(1024, 1min, slogger)
, _enforce_authorization(false)
, _enabled_servers{}
@@ -520,6 +529,9 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
{"GetRecords", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_records(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"DescribeContinuousBackups", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_continuous_backups(client_state, std::move(permit), std::move(json_request));
}},
} {
}
@@ -611,7 +623,7 @@ future<> server::json_parser::stop() {
const char* api_error::what() const noexcept {
if (_what_string.empty()) {
_what_string = format("{} {}: {}", _http_code, _type, _msg);
_what_string = format("{} {}: {}", static_cast<int>(_http_code), _type, _msg);
}
return _what_string.c_str();
}

View File

@@ -15,6 +15,7 @@
#include <seastar/net/tls.hh>
#include <optional>
#include "alternator/auth.hh"
#include "service/qos/service_level_controller.hh"
#include "utils/small_vector.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/units.hh>
@@ -34,6 +35,8 @@ class server {
executor& _executor;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
auth::service& _auth_service;
qos::service_level_controller& _sl_controller;
key_cache _key_cache;
bool _enforce_authorization;
@@ -65,7 +68,7 @@ class server {
json_parser _json_parser;
public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper);
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);

View File

@@ -33,7 +33,6 @@
#include "gms/feature_service.hh"
#include "executor.hh"
#include "tags_extension.hh"
#include "rmw_operation.hh"
/**
@@ -75,8 +74,8 @@ struct rapidjson::internal::TypeHelper<ValueType, utils::UUID>
: public from_string_helper<ValueType, utils::UUID>
{};
static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
return db_clock::time_point{utils::UUID_gen::unix_timestamp(uuid)};
static db_clock::time_point as_timepoint(const table_id& tid) {
return db_clock::time_point{utils::UUID_gen::unix_timestamp(tid.uuid())};
}
/**
@@ -107,6 +106,9 @@ public:
stream_arn(const UUID& uuid)
: UUID(uuid)
{}
stream_arn(const table_id& tid)
: UUID(tid.uuid())
{}
stream_arn(std::string_view v)
: UUID(v.substr(1))
{
@@ -143,25 +145,20 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
auto table = find_table(_proxy, request);
auto db = _proxy.data_dictionary();
auto cfs = db.get_tables();
auto i = cfs.begin();
auto e = cfs.end();
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
}
// # 12601 (maybe?) - sort the set of tables on ID. This should ensure we never
// generate duplicates in a paged listing here. Can obviously miss things if they
// are added between paged calls and end up with a "smaller" UUID/ARN, but that
// is to be expected.
std::sort(cfs.begin(), cfs.end(), [](const data_dictionary::table& t1, const data_dictionary::table& t2) {
return t1.schema()->id() < t2.schema()->id();
});
auto i = cfs.begin();
auto e = cfs.end();
// TODO: the unordered_map here is not really well suited for partial
// querying - we're sorting on local hash order, and creating a table
// between queries may or may not miss info. But that should be rare,
// and we can probably expect this to be a single call.
if (streams_start) {
i = std::find_if(i, e, [&](const data_dictionary::table& t) {
return t.schema()->id() == streams_start
i = std::find_if(i, e, [&](data_dictionary::table t) {
return t.schema()->id().uuid() == streams_start
&& cdc::get_base_table(db.real_database(), *t.schema())
&& is_alternator_keyspace(t.schema()->ks_name())
;
@@ -436,7 +433,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto db = _proxy.data_dictionary();
try {
auto cf = db.find_column_family(stream_arn);
auto cf = db.find_column_family(table_id(stream_arn));
schema = cf.schema();
bs = cdc::get_base_table(db.real_database(), *schema);
} catch (...) {
@@ -723,7 +720,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
std::optional<shard_id> sid;
try {
auto cf = db.find_column_family(stream_arn);
auto cf = db.find_column_family(table_id(stream_arn));
schema = cf.schema();
sid = rjson::get<shard_id>(request, "ShardId");
} catch (...) {
@@ -808,7 +805,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto db = _proxy.data_dictionary();
schema_ptr schema, base;
try {
auto log_table = db.find_column_family(iter.table);
auto log_table = db.find_column_family(table_id(iter.table));
schema = log_table.schema();
base = cdc::get_base_table(db.real_database(), *schema);
} catch (...) {
@@ -838,14 +835,14 @@ future<executor::request_return_type> executor::get_records(client_state& client
static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");
static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");
auto key_names = boost::copy_range<attrs_to_get>(
std::optional<attrs_to_get> key_names = boost::copy_range<attrs_to_get>(
boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))
| boost::adaptors::transformed([&] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
// Include all base table columns as values (in case pre or post is enabled).
// This will include attributes not stored in the frozen map column
auto attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()
std::optional<attrs_to_get> attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()
// this will include the :attrs column, which we will also force evaluating.
// But not having this set empty forces out any cdc columns from actual result
| boost::adaptors::transformed([] (const column_definition& cdef) {
@@ -882,7 +879,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
++mul;
}
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::row_limit(limit * mul));
query::tombstone_limit(_proxy.get_tombstone_limit()), query::row_limit(limit * mul));
return _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(
[this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), start_time = std::move(start_time), limit, key_names = std::move(key_names), attr_names = std::move(attr_names), type, iter, high_ts] (service::storage_proxy::coordinator_query_result qr) mutable {
@@ -1050,10 +1047,10 @@ void executor::add_stream_options(const rjson::value& stream_specification, sche
if (stream_enabled->GetBool()) {
auto db = sp.data_dictionary();
if (!db.features().cluster_supports_cdc()) {
if (!db.features().cdc) {
throw api_error::validation("StreamSpecification: streams (CDC) feature not enabled in cluster.");
}
if (!db.features().cluster_supports_alternator_streams()) {
if (!db.features().alternator_streams) {
throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");
}

View File

@@ -13,6 +13,7 @@
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/future.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <boost/multiprecision/cpp_int.hpp>
@@ -44,6 +45,8 @@
#include "alternator/controller.hh"
#include "alternator/serialization.hh"
#include "dht/sharder.hh"
#include "db/config.hh"
#include "db/tags/utils.hh"
#include "ttl.hh"
@@ -62,7 +65,7 @@ static const sstring TTL_TAG_KEY("system:ttl_attribute");
future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.update_time_to_live++;
if (!_proxy.data_dictionary().features().cluster_supports_alternator_ttl()) {
if (!_proxy.data_dictionary().features().alternator_ttl) {
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
}
@@ -89,7 +92,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
}
sstring attribute_name(v->GetString(), v->GetStringLength());
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
co_return api_error::validation("TTL is already enabled");
@@ -106,7 +109,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
}
tags_map.erase(TTL_TAG_KEY);
}
co_await update_tags(_mm, schema, std::move(tags_map));
co_await db::update_tags(_mm, schema, std::move(tags_map));
// Prepare the response, which contains a TimeToLiveSpecification
// basically identical to the request's
rjson::value response = rjson::empty_object();
@@ -117,7 +120,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.describe_time_to_live++;
schema_ptr schema = get_table(_proxy, request);
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
rjson::value desc = rjson::empty_object();
auto i = tags_map.find(TTL_TAG_KEY);
if (i == tags_map.end()) {
@@ -133,7 +136,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLeave request.
// Alternator tables with TTL configured via a UpdateTimeToLive request.
//
// Here is a brief overview of how the expiration service works:
//
@@ -147,28 +150,26 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
// To avoid scanning the same items RF times in RF replicas, only one node is
// responsible for scanning a token range at a time. Normally, this is the
// node owning this range as a "primary range" (the first node in the ring
// with this range), but when this node is down, other nodes may take over
// (FIXME: this is not implemented yet).
// with this range), but when this node is down, the secondary owner (the
// second in the ring) may take over.
// An expiration thread is reponsible for all tables which need expiration
// scans. FIXME: explain how this is done with multiple tables - parallel,
// staggered, or what?
// scans. Currently, the different tables are scanned sequentially (not in
// parallel).
// The expiration thread scans item using CL=QUORUM to ensures that it reads
// a consistent expiration-time attribute. This means that the items are read
// locally and in addition QUORUM-1 additional nodes (one additional node
// when RF=3) need to read the data and send digests.
// FIXME: explain if we can read the exact attribute or the entire map.
// When the expiration thread decides that an item has expired and wants
// to delete it, it does it using a CL=QUORUM write. This allows this
// deletion to be visible for consistent (quorum) reads. The deletion,
// like user deletions, will also appear on the CDC log and therefore
// Alternator Streams if enabled (FIXME: explain how we mark the
// deletion different from user deletes. We don't do it yet.).
expiration_service::expiration_service(data_dictionary::database db, service::storage_proxy& proxy)
// Alternator Streams if enabled - currently as ordinary deletes (the
// userIdentity flag is currently missing this is issue #11523).
expiration_service::expiration_service(data_dictionary::database db, service::storage_proxy& proxy, gms::gossiper& g)
: _db(db)
, _proxy(proxy)
, _gossiper(g)
{
//FIXME: add metrics for the service
//setup_metrics();
}
// Convert the big_decimal used to represent expiration time to an integer.
@@ -281,10 +282,13 @@ static future<> expire_item(service::storage_proxy& proxy,
auto ck = clustering_key::from_exploded(exploded_ck);
m.partition().clustered_row(*schema, ck).apply(tombstone(ts, gc_clock::now()));
}
return proxy.mutate(std::vector<mutation>{std::move(m)},
std::vector<mutation> mutations;
mutations.push_back(std::move(m));
return proxy.mutate(std::move(mutations),
db::consistency_level::LOCAL_QUORUM,
executor::default_timeout(), // FIXME - which timeout?
qs.get_trace_state(), qs.get_permit());
qs.get_trace_state(), qs.get_permit(),
db::allow_per_partition_rate_limit::no);
}
static size_t random_offset(size_t min, size_t max) {
@@ -363,7 +367,7 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
// 2. The primary replica for this token is currently marked down.
// 3. In this node, this shard is responsible for this token.
// We use the <secondary> case to handle the possibility that some of the
// nodes in the system are down. A dead node will not be expiring expiring
// nodes in the system are down. A dead node will not be expiring
// the tokens owned by it, so we want the secondary owner to take over its
// primary ranges.
//
@@ -376,12 +380,11 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
enum primary_or_secondary_t {primary, secondary};
template<primary_or_secondary_t primary_or_secondary>
class token_ranges_owned_by_this_shard {
template<primary_or_secondary_t> class ranges_holder;
// ranges_holder<primary> holds just the primary ranges themselves
template<> class ranges_holder<primary> {
// ranges_holder_primary holds just the primary ranges themselves
class ranges_holder_primary {
const dht::token_range_vector _token_ranges;
public:
ranges_holder(const locator::effective_replication_map_ptr& erm, gms::inet_address ep)
ranges_holder_primary(const locator::effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)
: _token_ranges(erm->get_primary_ranges(ep)) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
@@ -393,13 +396,13 @@ class token_ranges_owned_by_this_shard {
};
// ranges_holder<secondary> holds the secondary token ranges plus each
// range's primary owner, needed to implement should_skip().
template<> class ranges_holder<secondary> {
class ranges_holder_secondary {
std::vector<std::pair<dht::token_range, gms::inet_address>> _token_ranges;
gms::gossiper& _gossiper;
public:
ranges_holder(const locator::effective_replication_map_ptr& erm, gms::inet_address ep)
ranges_holder_secondary(const locator::effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)
: _token_ranges(get_secondary_ranges(erm, ep))
, _gossiper(gms::get_local_gossiper()) {}
, _gossiper(g) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i].first;
@@ -414,17 +417,21 @@ class token_ranges_owned_by_this_shard {
// _token_ranges will contain a list of token ranges owned by this node.
// We'll further need to split each such range to the pieces owned by
// the current shard, using _intersecter.
const ranges_holder<primary_or_secondary> _token_ranges;
using ranges_holder = std::conditional_t<
primary_or_secondary == primary_or_secondary_t::primary,
ranges_holder_primary,
ranges_holder_secondary>;
const ranges_holder _token_ranges;
// NOTICE: _range_idx is used modulo _token_ranges size when accessing
// the data to ensure that it doesn't go out of bounds
size_t _range_idx;
size_t _end_idx;
std::optional<dht::selective_token_range_sharder> _intersecter;
public:
token_ranges_owned_by_this_shard(replica::database& db, schema_ptr s)
token_ranges_owned_by_this_shard(replica::database& db, gms::gossiper& g, schema_ptr s)
: _s(s)
, _token_ranges(db.find_keyspace(s->ks_name()).get_effective_replication_map(),
utils::fb_utilities::get_broadcast_address())
g, utils::fb_utilities::get_broadcast_address())
, _range_idx(random_offset(0, _token_ranges.size() - 1))
, _end_idx(_range_idx + _token_ranges.size())
{
@@ -502,9 +509,11 @@ struct scan_ranges_context {
selection = cql3::selection::selection::wildcard(s);
query::partition_slice::option_set opts = selection->get_query_options();
opts.set<query::partition_slice::option::allow_short_read>();
// It is important that the scan bypass cache to avoid polluting it:
opts.set<query::partition_slice::option::bypass_cache>();
std::vector<query::clustering_range> ck_bounds{query::clustering_range::make_open_ended_both_sides()};
auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), opts);
command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice));
command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state;
// NOTICE: empty_service_permit is used because the TTL service has fixed parallelism
@@ -521,13 +530,14 @@ struct scan_ranges_context {
// Scan data in a list of token ranges in one table, looking for expired
// items and deleting them.
// Because of issue #9167, partition_ranges must have a single partition
// for this code to work correctly.
// range for this code to work correctly.
static future<> scan_table_ranges(
service::storage_proxy& proxy,
const scan_ranges_context& scan_ctx,
dht::partition_range_vector&& partition_ranges,
abort_source& abort_source,
named_semaphore& page_sem)
named_semaphore& page_sem,
expiration_service::stats& expiration_stats)
{
const schema_ptr& s = scan_ctx.s;
assert (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
@@ -595,6 +605,7 @@ static future<> scan_table_ranges(
expired = is_expired(n, now);
}
if (expired) {
expiration_stats.items_deleted++;
// FIXME: maybe don't recalculate new_timestamp() all the time
// FIXME: if expire_item() throws on timeout, we need to retry it.
auto ts = api::new_timestamp();
@@ -606,7 +617,7 @@ static future<> scan_table_ranges(
}
}
// scan_table() scans data in one table "owned" by this shard, looking for
// scan_table() scans, in one table, data "owned" by this shard, looking for
// expired items and deleting them.
// We consider each node to "own" its primary token ranges, i.e., the tokens
// that this node is their first replica in the ring. Inside the node, each
@@ -628,13 +639,16 @@ static future<> scan_table_ranges(
static future<bool> scan_table(
service::storage_proxy& proxy,
data_dictionary::database db,
gms::gossiper& gossiper,
schema_ptr s,
abort_source& abort_source,
named_semaphore& page_sem)
named_semaphore& page_sem,
expiration_service::stats& expiration_stats)
{
// Check if an expiration-time attribute is enabled for this table.
// If not, just return false immediately.
std::optional<std::string> attribute_name = find_tag(*s, TTL_TAG_KEY);
// FIXME: the setting of the TTL may change in the middle of a long scan!
std::optional<std::string> attribute_name = db::find_tag(*s, TTL_TAG_KEY);
if (!attribute_name) {
co_return false;
}
@@ -675,11 +689,10 @@ static future<bool> scan_table(
tlogger.info("table {} TTL column has unsupported type, not scanning", s->cf_name());
co_return false;
}
expiration_stats.scan_table++;
// FIXME: need to pace the scan, not do it all at once.
// FIXME: consider if we should ask the scan without caching?
// can we use cache but not fill it?
scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};
token_ranges_owned_by_this_shard<primary> my_ranges(db.real_database(), s);
token_ranges_owned_by_this_shard<primary> my_ranges(db.real_database(), gossiper, s);
while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {
// Note that because of issue #9167 we need to run a separate
// query on each partition range, and can't pass several of
@@ -690,7 +703,7 @@ static future<bool> scan_table(
// we fail the entire scan (and rescan from the beginning). Need to
// reconsider this. Saving the scan position might be a good enough
// solution for this problem.
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem);
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);
}
// If each node only scans its own primary ranges, then when any node is
// down part of the token range will not get scanned. This can be viewed
@@ -699,11 +712,12 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
token_ranges_owned_by_this_shard<secondary> my_secondary_ranges(db.real_database(), s);
token_ranges_owned_by_this_shard<secondary> my_secondary_ranges(db.real_database(), gossiper, s);
while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {
expiration_stats.secondary_ranges_scanned++;
dht::partition_range_vector partition_ranges;
partition_ranges.push_back(std::move(*range));
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem);
co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);
}
co_return true;
}
@@ -716,6 +730,7 @@ future<> expiration_service::run() {
// also need to notice when a new table is added, a table is
// deleted or when ttl is enabled or disabled for a table!
for (;;) {
auto start = lowres_clock::now();
// _db.tables() may change under our feet during a
// long-living loop, so we must keep our own copy of the list of
// schemas.
@@ -729,7 +744,7 @@ future<> expiration_service::run() {
co_return;
}
try {
co_await scan_table(_proxy, _db, s, _abort_source, _page_sem);
co_await scan_table(_proxy, _db, _gossiper, s, _abort_source, _page_sem, _expiration_stats);
} catch (...) {
// The scan of a table may fail in the middle for many
// reasons, including network failure and even the table
@@ -748,17 +763,30 @@ future<> expiration_service::run() {
}
}
}
// FIXME: replace this silly 1-second sleep by something smarter.
try {
co_await seastar::sleep_abortable(std::chrono::seconds(1), _abort_source);
} catch(seastar::sleep_aborted&) {}
_expiration_stats.scan_passes++;
// The TTL scanner runs above once over all tables, at full steam.
// After completing such a scan, we sleep until it's time start
// another scan. TODO: If the scan went too fast, we can slow it down
// in the next iteration by reducing the scanner's scheduling-group
// share (if using a separate scheduling group), or introduce
// finer-grain sleeps into the scanning code.
std::chrono::milliseconds scan_duration(std::chrono::duration_cast<std::chrono::milliseconds>(lowres_clock::now() - start));
std::chrono::milliseconds period(long(_db.get_config().alternator_ttl_period_in_seconds() * 1000));
if (scan_duration < period) {
try {
tlogger.info("sleeping {} seconds until next period", (period - scan_duration).count()/1000.0);
co_await seastar::sleep_abortable(period - scan_duration, _abort_source);
} catch(seastar::sleep_aborted&) {}
} else {
tlogger.warn("scan took {} seconds, longer than period - not sleeping", scan_duration.count()/1000.0);
}
}
}
future<> expiration_service::start() {
// Called by main() on each shard to start the expiration-service
// thread. Just runs run() in the background and allows stop().
if (_db.features().cluster_supports_alternator_ttl()) {
if (_db.features().alternator_ttl) {
if (!shutting_down()) {
_end = run().handle_exception([] (std::exception_ptr ep) {
tlogger.error("expiration_service failed: {}", ep);
@@ -780,4 +808,18 @@ future<> expiration_service::stop() {
return std::move(*_end);
}
expiration_service::stats::stats() {
_metrics.add_group("expiration", {
seastar::metrics::make_total_operations("scan_passes", scan_passes,
seastar::metrics::description("number of passes over the database")),
seastar::metrics::make_total_operations("scan_table", scan_table,
seastar::metrics::description("number of table scans (counting each scan of each table that enabled expiration)")),
seastar::metrics::make_total_operations("items_deleted", items_deleted,
seastar::metrics::description("number of items deleted after expiration")),
seastar::metrics::make_total_operations("secondary_ranges_scanned", secondary_ranges_scanned,
seastar::metrics::description("number of token ranges scanned by this node while their primary owner was down")),
});
}
} // namespace alternator

View File

@@ -14,6 +14,10 @@
#include <seastar/core/semaphore.hh>
#include "data_dictionary/data_dictionary.hh"
namespace gms {
class gossiper;
}
namespace replica {
class database;
}
@@ -28,8 +32,26 @@ namespace alternator {
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLeave request.
class expiration_service final : public seastar::peering_sharded_service<expiration_service> {
public:
// Object holding per-shard statistics related to the expiration service.
// While this object is alive, these metrics are also registered to be
// visible by the metrics REST API, with the "expiration_" prefix.
class stats {
public:
stats();
uint64_t scan_passes = 0;
uint64_t scan_table = 0;
uint64_t items_deleted = 0;
uint64_t secondary_ranges_scanned = 0;
private:
// The metric_groups object holds this stat object's metrics registered
// as long as the stats object is alive.
seastar::metrics::metric_groups _metrics;
};
private:
data_dictionary::database _db;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
// _end is set by start(), and resolves when the the background service
// started by it ends. To ask the background service to end, _abort_source
// should be triggered. stop() below uses both _abort_source and _end.
@@ -38,11 +60,12 @@ class expiration_service final : public seastar::peering_sharded_service<expirat
// Ensures that at most 1 page of scan results at a time is processed by the TTL service
named_semaphore _page_sem{1, named_semaphore_exception_factory{"alternator_ttl"}};
bool shutting_down() { return _abort_source.abort_requested(); }
stats _expiration_stats;
public:
// sharded_service<expiration_service>::start() creates this object on
// all shards, so calls this constructor on each shard. Later, the
// additional start() function should be invoked on all shards.
expiration_service(data_dictionary::database, service::storage_proxy&);
expiration_service(data_dictionary::database, service::storage_proxy&, gms::gossiper&);
future<> start();
future<> run();
// sharded_service<expiration_service>::stop() calls the following stop()

View File

@@ -0,0 +1,29 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/authorization_cache",
"produces":[
"application/json"
],
"apis":[
{
"path":"/authorization_cache/reset",
"operations":[
{
"method":"POST",
"summary":"Reset cache",
"type":"void",
"nickname":"authorization_cache_reset",
"produces":[
"application/json"
],
"parameters":[
]
}
]
}
],
"models":{
}
}

View File

@@ -134,7 +134,7 @@
},
{
"name":"tables",
"description":"Comma-seperated tables to stop compaction in",
"description":"Comma-separated tables to stop compaction in",
"required":false,
"allowMultiple":false,
"type":"string",

View File

@@ -667,7 +667,7 @@
},
{
"name":"kn",
"description":"Comma seperated keyspaces name that their snapshot will be deleted",
"description":"Comma-separated keyspaces name that their snapshot will be deleted",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -723,7 +723,7 @@
},
{
"name":"cf",
"description":"Comma seperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -755,7 +755,7 @@
},
{
"name":"cf",
"description":"Comma seperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -787,7 +787,7 @@
},
{
"name":"cf",
"description":"Comma-seperated table names",
"description":"Comma-separated table names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -862,7 +862,7 @@
},
{
"name":"cf",
"description":"Comma seperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -902,7 +902,7 @@
},
{
"name":"cf",
"description":"Comma seperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -934,7 +934,7 @@
},
{
"name":"cf",
"description":"Comma seperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -1228,7 +1228,7 @@
"operations":[
{
"method":"POST",
"summary":"Removes token (and all data associated with enpoint that had it) from the ring",
"summary":"Removes a node from the cluster. Replicated data that logically belonged to this node is redistributed among the remaining nodes.",
"type":"void",
"nickname":"remove_node",
"produces":[
@@ -1245,7 +1245,7 @@
},
{
"name":"ignore_nodes",
"description":"List of dead nodes to ingore in removenode operation",
"description":"Comma-separated list of dead nodes to ignore in removenode operation. Use the same method for all nodes to ignore: either Host IDs or ip addresses.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -2073,7 +2073,7 @@
},
{
"name":"cf",
"description":"Comma seperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -2100,7 +2100,7 @@
},
{
"name":"cf",
"description":"Comma seperated column family names",
"description":"Comma-separated column family names",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -2641,7 +2641,7 @@
"version":{
"type":"string",
"enum":[
"ka", "la", "mc", "md"
"ka", "la", "mc", "md", "me"
],
"description":"SSTable version"
},

View File

@@ -52,6 +52,45 @@
}
]
},
{
"path":"/system/log",
"operations":[
{
"method":"POST",
"summary":"Write a message to the Scylla log",
"type":"void",
"nickname":"write_log_message",
"produces":[
"application/json"
],
"parameters":[
{
"name":"message",
"description":"The message to write to the log",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"level",
"description":"The logging level to use",
"required":true,
"allowMultiple":false,
"type":"string",
"enum":[
"error",
"warn",
"info",
"debug",
"trace"
],
"paramType":"query"
}
]
}
]
},
{
"path":"/system/drop_sstable_caches",
"operations":[

View File

@@ -0,0 +1,251 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/task_manager",
"produces":[
"application/json"
],
"apis":[
{
"path":"/task_manager/list_modules",
"operations":[
{
"method":"GET",
"summary":"Get all modules names",
"type":"array",
"items":{
"type":"string"
},
"nickname":"get_modules",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager/list_module_tasks/{module}",
"operations":[
{
"method":"GET",
"summary":"Get a list of tasks",
"type":"array",
"items":{
"type":"task_stats"
},
"nickname":"get_tasks",
"produces":[
"application/json"
],
"parameters":[
{
"name":"module",
"description":"The module to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"internal",
"description":"Boolean flag indicating whether internal tasks should be shown (false by default)",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace to query about",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"The table to query about",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager/task_status/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Get task status",
"type":"task_status",
"nickname":"get_task_status",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to query about",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/abort_task/{task_id}",
"operations":[
{
"method":"POST",
"summary":"Abort running task and its descendants",
"type":"void",
"nickname":"abort_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to abort",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
},
{
"path":"/task_manager/wait_task/{task_id}",
"operations":[
{
"method":"GET",
"summary":"Wait for a task to complete",
"type":"task_status",
"nickname":"wait_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to wait for",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
],
"models":{
"task_stats" :{
"id": "task_stats",
"description":"A task statistics object",
"properties":{
"task_id":{
"type":"string",
"description":"The uuid of a task"
},
"state":{
"type":"string",
"enum":[
"created",
"running",
"done",
"failed"
],
"description":"The state of a task"
}
}
},
"task_status":{
"id":"task_status",
"description":"A task status object",
"properties":{
"id":{
"type":"string",
"description":"The uuid of the task"
},
"type":{
"type":"string",
"description":"The description of the task"
},
"state":{
"type":"string",
"enum":[
"created",
"running",
"done",
"failed"
],
"description":"The state of the task"
},
"is_abortable":{
"type":"boolean",
"description":"Boolean flag indicating whether the task can be aborted"
},
"start_time":{
"type":"datetime",
"description":"The start time of the task"
},
"end_time":{
"type":"datetime",
"description":"The end time of the task (unspecified when the task is not completed)"
},
"error":{
"type":"string",
"description":"Error string, if the task failed"
},
"parent_id":{
"type":"string",
"description":"The uuid of the parent task"
},
"sequence_number":{
"type":"long",
"description":"The running sequence number of the task"
},
"shard":{
"type":"long",
"description":"The number of a shard the task is running on"
},
"keyspace":{
"type":"string",
"description":"The keyspace the task is working on (if applicable)"
},
"table":{
"type":"string",
"description":"The table the task is working on (if applicable)"
},
"entity":{
"type":"string",
"description":"Task-specific entity description"
},
"progress_units":{
"type":"string",
"description":"A description of the progress units"
},
"progress_total":{
"type":"double",
"description":"The total number of units to complete for the task"
},
"progress_completed":{
"type":"double",
"description":"The number of units completed so far"
}
}
}
}
}

View File

@@ -0,0 +1,185 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/task_manager_test",
"produces":[
"application/json"
],
"apis":[
{
"path":"/task_manager_test/test_module",
"operations":[
{
"method":"POST",
"summary":"Register test module in task manager",
"type":"void",
"nickname":"register_test_module",
"produces":[
"application/json"
],
"parameters":[
]
},
{
"method":"DELETE",
"summary":"Unregister test module in task manager",
"type":"void",
"nickname":"unregister_test_module",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager_test/test_task",
"operations":[
{
"method":"POST",
"summary":"Register test task",
"type":"string",
"nickname":"register_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to register",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"shard",
"description":"The shard of the task",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"parent_id",
"description":"The uuid of a parent task",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"The keyspace the task is working on",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"The table the task is working on",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"type",
"description":"The type of the task",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"entity",
"description":"Task-specific entity description",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
},
{
"method":"DELETE",
"summary":"Unregister test task",
"type":"void",
"nickname":"unregister_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to register",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager_test/finish_test_task/{task_id}",
"operations":[
{
"method":"POST",
"summary":"Finish test task",
"type":"void",
"nickname":"finish_test_task",
"produces":[
"application/json"
],
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to finish",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"error",
"description":"The error with which task fails (if it does)",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/task_manager_test/ttl",
"operations":[
{
"method":"POST",
"summary":"Set ttl in seconds and get last value",
"type":"long",
"nickname":"get_and_update_ttl",
"produces":[
"application/json"
],
"parameters":[
{
"name":"ttl",
"description":"The number of seconds for which the tasks will be kept in memory after it finishes",
"required":true,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
]
}
]
}

View File

@@ -24,10 +24,13 @@
#include "compaction_manager.hh"
#include "hinted_handoff.hh"
#include "error_injection.hh"
#include "authorization_cache.hh"
#include <seastar/http/exception.hh>
#include "stream_manager.hh"
#include "system.hh"
#include "api/config.hh"
#include "task_manager.hh"
#include "task_manager_test.hh"
logging::logger apilog("api");
@@ -96,9 +99,9 @@ future<> unset_rpc_controller(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_rpc_controller(ctx, r); });
}
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs) {
return register_api(ctx, "storage_service", "The storage service API", [&ss, &g, &cdc_gs] (http_context& ctx, routes& r) {
set_storage_service(ctx, r, ss, g.local(), cdc_gs);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks) {
return register_api(ctx, "storage_service", "The storage service API", [&ss, &g, &cdc_gs, &sys_ks] (http_context& ctx, routes& r) {
set_storage_service(ctx, r, ss, g.local(), cdc_gs, sys_ks);
});
}
@@ -126,6 +129,17 @@ future<> unset_server_repair(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_repair(ctx, r); });
}
future<> set_server_authorization_cache(http_context &ctx, sharded<auth::service> &auth_service) {
return register_api(ctx, "authorization_cache",
"The authorization cache API", [&auth_service] (http_context &ctx, routes &r) {
set_authorization_cache(ctx, r, auth_service);
});
}
future<> unset_server_authorization_cache(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_authorization_cache(ctx, r); });
}
future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl) {
return ctx.http_server.set_routes([&ctx, &snap_ctl] (routes& r) { set_snapshot(ctx, r, snap_ctl); });
}
@@ -134,8 +148,14 @@ future<> unset_server_snapshot(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_snapshot(ctx, r); });
}
future<> set_server_snitch(http_context& ctx) {
return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", set_endpoint_snitch);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch) {
return register_api(ctx, "endpoint_snitch_info", "The endpoint snitch info API", [&snitch] (http_context& ctx, routes& r) {
set_endpoint_snitch(ctx, r, snitch);
});
}
future<> unset_server_snitch(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_endpoint_snitch(ctx, r); });
}
future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g) {
@@ -233,5 +253,56 @@ future<> set_server_done(http_context& ctx) {
});
}
future<> set_server_task_manager(http_context& ctx) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx](routes& r) {
rb->register_function(r, "task_manager",
"The task manager API");
set_task_manager(ctx, r);
});
}
#ifndef SCYLLA_BUILD_MODE_RELEASE
future<> set_server_task_manager_test(http_context& ctx, lw_shared_ptr<db::config> cfg) {
auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);
return ctx.http_server.set_routes([rb, &ctx, &cfg = *cfg](routes& r) mutable {
rb->register_function(r, "task_manager_test",
"The task manager test API");
set_task_manager_test(ctx, r, cfg);
});
}
#endif
void req_params::process(const request& req) {
// Process mandatory parameters
for (auto& [name, ent] : params) {
if (!ent.is_mandatory) {
continue;
}
try {
ent.value = req.param[name];
} catch (std::out_of_range&) {
throw httpd::bad_param_exception(fmt::format("Mandatory parameter '{}' was not provided", name));
}
}
// Process optional parameters
for (auto& [name, value] : req.query_parameters) {
try {
auto& ent = params.at(name);
if (ent.is_mandatory) {
throw httpd::bad_param_exception(fmt::format("Parameter '{}' is expected to be provided as part of the request url", name));
}
ent.value = value;
} catch (std::out_of_range&) {
throw httpd::bad_param_exception(fmt::format("Unsupported optional parameter '{}'", name));
}
}
}
}

View File

@@ -137,6 +137,14 @@ future<json::json_return_type> sum_timer_stats(distributed<T>& d, utils::timed_
});
}
template<class T, class F>
future<json::json_return_type> sum_timer_stats(distributed<T>& d, utils::timed_rate_moving_average_summary_and_histogram F::*f) {
return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average_and_histogram(),
std::plus<utils::rate_moving_average_and_histogram>()).then([](const utils::rate_moving_average_and_histogram& val) {
return make_ready_future<json::json_return_type>(timer_to_json(val));
});
}
inline int64_t min_int64(int64_t a, int64_t b) {
return std::min(a,b);
}
@@ -237,6 +245,67 @@ public:
operator T() const { return value; }
};
using mandatory = bool_class<struct mandatory_tag>;
class req_params {
public:
struct def {
std::optional<sstring> value;
mandatory is_mandatory = mandatory::no;
def(std::optional<sstring> value_ = std::nullopt, mandatory is_mandatory_ = mandatory::no)
: value(std::move(value_))
, is_mandatory(is_mandatory_)
{ }
def(mandatory is_mandatory_)
: is_mandatory(is_mandatory_)
{ }
};
private:
std::unordered_map<sstring, def> params;
public:
req_params(std::initializer_list<std::pair<sstring, def>> l) {
for (const auto& [name, ent] : l) {
add(std::move(name), std::move(ent));
}
}
void add(sstring name, def ent) {
params.emplace(std::move(name), std::move(ent));
}
void process(const request& req);
const std::optional<sstring>& get(const char* name) const {
return params.at(name).value;
}
template <typename T = sstring>
const std::optional<T> get_as(const char* name) const {
return get(name);
}
template <typename T = sstring>
requires std::same_as<T, bool>
const std::optional<bool> get_as(const char* name) const {
auto value = get(name);
if (!value) {
return std::nullopt;
}
std::transform(value->begin(), value->end(), value->begin(), ::tolower);
if (value == "true" || value == "yes" || value == "1") {
return true;
}
if (value == "false" || value == "no" || value == "0") {
return false;
}
throw boost::bad_lexical_cast{};
}
};
utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimated_histogram& val);
}

View File

@@ -11,6 +11,7 @@
#include <seastar/core/future.hh>
#include "replica/database_fwd.hh"
#include "tasks/task_manager.hh"
#include "seastarx.hh"
namespace service {
@@ -31,6 +32,7 @@ namespace locator {
class token_metadata;
class shared_token_metadata;
class snitch_ptr;
} // namespace locator
@@ -42,6 +44,7 @@ class config;
namespace view {
class view_builder;
}
class system_keyspace;
}
namespace netw { class messaging_service; }
class repair_service;
@@ -53,6 +56,8 @@ class gossiper;
}
namespace auth { class service; }
namespace api {
struct http_context {
@@ -63,11 +68,12 @@ struct http_context {
distributed<service::storage_proxy>& sp;
service::load_meter& lmeter;
const sharded<locator::shared_token_metadata>& shared_token_metadata;
sharded<tasks::task_manager>& tm;
http_context(distributed<replica::database>& _db,
distributed<service::storage_proxy>& _sp,
service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm)
: db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm) {
service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm, sharded<tasks::task_manager>& _tm)
: db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm), tm(_tm) {
}
const locator::token_metadata& get_token_metadata();
@@ -75,8 +81,9 @@ struct http_context {
future<> set_server_init(http_context& ctx);
future<> set_server_config(http_context& ctx, const db::config& cfg);
future<> set_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);
future<> unset_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<gms::gossiper>& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks);
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);
future<> unset_server_sstables_loader(http_context& ctx);
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb);
@@ -87,6 +94,8 @@ future<> set_transport_controller(http_context& ctx, cql_transport::controller&
future<> unset_transport_controller(http_context& ctx);
future<> set_rpc_controller(http_context& ctx, thrift_controller& ctl);
future<> unset_rpc_controller(http_context& ctx);
future<> set_server_authorization_cache(http_context& ctx, sharded<auth::service> &auth_service);
future<> unset_server_authorization_cache(http_context& ctx);
future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl);
future<> unset_server_snapshot(http_context& ctx);
future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g);
@@ -102,5 +111,7 @@ future<> set_server_gossip_settle(http_context& ctx, sharded<gms::gossiper>& g);
future<> set_server_cache(http_context& ctx);
future<> set_server_compaction_manager(http_context& ctx);
future<> set_server_done(http_context& ctx);
future<> set_server_task_manager(http_context& ctx);
future<> set_server_task_manager_test(http_context& ctx, lw_shared_ptr<db::config> cfg);
}

View File

@@ -0,0 +1,33 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "api/api-doc/authorization_cache.json.hh"
#include "api/authorization_cache.hh"
#include "api/api.hh"
#include "auth/common.hh"
namespace api {
using namespace json;
void set_authorization_cache(http_context& ctx, routes& r, sharded<auth::service> &auth_service) {
httpd::authorization_cache_json::authorization_cache_reset.set(r, [&auth_service] (std::unique_ptr<request> req) -> future<json::json_return_type> {
co_await auth_service.invoke_on_all([] (auth::service& auth) -> future<> {
auth.reset_authorization_cache();
return make_ready_future<>();
});
co_return json_void();
});
}
void unset_authorization_cache(http_context& ctx, routes& r) {
httpd::authorization_cache_json::authorization_cache_reset.unset(r);
}
}

View File

@@ -0,0 +1,18 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api.hh"
namespace api {
void set_authorization_cache(http_context& ctx, routes& r, sharded<auth::service> &auth_service);
void unset_authorization_cache(http_context& ctx, routes& r);
}

View File

@@ -29,8 +29,11 @@ static auto transformer(const std::vector<collectd_value>& values) {
case scollectd::data_type::GAUGE:
collected_value.values.push(v.d());
break;
case scollectd::data_type::DERIVE:
collected_value.values.push(v.i());
case scollectd::data_type::COUNTER:
collected_value.values.push(v.ui());
break;
case scollectd::data_type::REAL_COUNTER:
collected_value.values.push(v.d());
break;
default:
collected_value.values.push(v.ui());

View File

@@ -14,7 +14,7 @@
#include "sstables/metadata_collector.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include "db/system_keyspace_view_types.hh"
#include "db/system_keyspace.hh"
#include "db/data_listeners.hh"
#include "storage_service.hh"
#include "unimplemented.hh"
@@ -43,7 +43,7 @@ std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name) {
return std::make_tuple(name.substr(0, pos), name.substr(end));
}
const utils::UUID& get_uuid(const sstring& ks, const sstring& cf, const replica::database& db) {
const table_id& get_uuid(const sstring& ks, const sstring& cf, const replica::database& db) {
try {
return db.find_uuid(ks, cf);
} catch (replica::no_such_column_family& e) {
@@ -51,7 +51,7 @@ const utils::UUID& get_uuid(const sstring& ks, const sstring& cf, const replica:
}
}
const utils::UUID& get_uuid(const sstring& name, const replica::database& db) {
const table_id& get_uuid(const sstring& name, const replica::database& db) {
auto [ks, cf] = parse_fully_qualified_cf_name(name);
return get_uuid(ks, cf, db);
}
@@ -79,14 +79,14 @@ future<json::json_return_type> get_cf_stats(http_context& ctx,
}
static future<json::json_return_type> get_cf_stats_count(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
return map_reduce_cf(ctx, name, int64_t(0), [f](const replica::column_family& cf) {
return (cf.get_stats().*f).hist.count;
}, std::plus<int64_t>());
}
static future<json::json_return_type> get_cf_stats_sum(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([uuid, f](replica::database& db) {
// Histograms information is sample of the actual load
@@ -102,7 +102,7 @@ static future<json::json_return_type> get_cf_stats_sum(http_context& ctx, const
static future<json::json_return_type> get_cf_stats_count(http_context& ctx,
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
return map_reduce_cf(ctx, int64_t(0), [f](const replica::column_family& cf) {
return (cf.get_stats().*f).hist.count;
}, std::plus<int64_t>());
@@ -110,7 +110,7 @@ static future<json::json_return_type> get_cf_stats_count(http_context& ctx,
static future<json::json_return_type> get_cf_histogram(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
utils::UUID uuid = get_uuid(name, ctx.db.local());
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([f, uuid](const replica::database& p) {
return (p.find_column_family(uuid).get_stats().*f).hist;},
utils::ihistogram(),
@@ -120,7 +120,19 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, const
});
}
static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
static future<json::json_return_type> get_cf_histogram(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([f, uuid](const replica::database& p) {
return (p.find_column_family(uuid).get_stats().*f).hist;},
utils::ihistogram(),
std::plus<utils::ihistogram>())
.then([](const utils::ihistogram& val) {
return make_ready_future<json::json_return_type>(to_json(val));
});
}
static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
std::function<utils::ihistogram(const replica::database&)> fun = [f] (const replica::database& db) {
utils::ihistogram res;
for (auto i : db.get_column_families()) {
@@ -136,8 +148,8 @@ static future<json::json_return_type> get_cf_histogram(http_context& ctx, utils:
}
static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, const sstring& name,
utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
utils::UUID uuid = get_uuid(name, ctx.db.local());
utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([f, uuid](const replica::database& p) {
return (p.find_column_family(uuid).get_stats().*f).rate();},
utils::rate_moving_average_and_histogram(),
@@ -147,7 +159,7 @@ static future<json::json_return_type> get_cf_rate_and_histogram(http_context& c
});
}
static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, utils::timed_rate_moving_average_and_histogram replica::column_family_stats::*f) {
static future<json::json_return_type> get_cf_rate_and_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram replica::column_family_stats::*f) {
std::function<utils::rate_moving_average_and_histogram(const replica::database&)> fun = [f] (const replica::database& db) {
utils::rate_moving_average_and_histogram res;
for (auto i : db.get_column_families()) {
@@ -382,7 +394,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_cf_all_memtables_off_heap_size.set(r, [&ctx] (std::unique_ptr<request> req) {
warn(unimplemented::cause::INDEXES);
return ctx.db.map_reduce0([](const replica::database& db){
return db.dirty_memory_region_group().memory_used();
return db.dirty_memory_region_group().real_memory_used();
}, int64_t(0), std::plus<int64_t>()).then([](int res) {
return make_ready_future<json::json_return_type>(res);
});
@@ -803,19 +815,19 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_cas_prepare.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_cas_prepare;
return cf.get_stats().cas_prepare.histogram();
});
});
cf::get_cas_propose.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_cas_accept;
return cf.get_stats().cas_accept.histogram();
});
});
cf::get_cas_commit.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_cas_learn;
return cf.get_stats().cas_learn.histogram();
});
});
@@ -843,7 +855,7 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_auto_compaction.set(r, [&ctx] (const_req req) {
const utils::UUID& uuid = get_uuid(req.param["name"], ctx.db.local());
auto uuid = get_uuid(req.param["name"], ctx.db.local());
replica::column_family& cf = ctx.db.local().find_column_family(uuid);
return !cf.is_auto_compaction_disabled_by_user();
});
@@ -921,13 +933,13 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_read_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_read;
return cf.get_stats().reads.histogram();
});
});
cf::get_write_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const replica::column_family& cf) {
return cf.get_stats().estimated_write;
return cf.get_stats().writes.histogram();
});
});

View File

@@ -18,7 +18,7 @@ namespace api {
void set_column_family(http_context& ctx, routes& r);
const utils::UUID& get_uuid(const sstring& name, const replica::database& db);
const table_id& get_uuid(const sstring& name, const replica::database& db);
future<> foreach_column_family(http_context& ctx, const sstring& name, std::function<void(replica::column_family&)> f);
@@ -63,7 +63,7 @@ struct map_reduce_column_families_locally {
std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)> reducer;
future<std::unique_ptr<std::any>> operator()(replica::database& db) const {
auto res = seastar::make_lw_shared<std::unique_ptr<std::any>>(std::make_unique<std::any>(init));
return do_for_each(db.get_column_families(), [res, this](const std::pair<utils::UUID, seastar::lw_shared_ptr<replica::table>>& i) {
return do_for_each(db.get_column_families(), [res, this](const std::pair<table_id, seastar::lw_shared_ptr<replica::table>>& i) {
*res = reducer(std::move(*res), mapper(*i.second.get()));
}).then([res] {
return std::move(*res);

View File

@@ -68,7 +68,7 @@ void set_compaction_manager(http_context& ctx, routes& r) {
cm::get_pending_tasks_by_table.set(r, [&ctx] (std::unique_ptr<request> req) {
return ctx.db.map_reduce0([&ctx](replica::database& db) {
return do_with(std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>(), [&ctx, &db](std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>& tasks) {
return do_for_each(db.get_column_families(), [&tasks](const std::pair<utils::UUID, seastar::lw_shared_ptr<replica::table>>& i) {
return do_for_each(db.get_column_families(), [&tasks](const std::pair<table_id, seastar::lw_shared_ptr<replica::table>>& i) {
replica::table& cf = *i.second.get();
tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.get_compaction_strategy().estimated_pending_compactions(cf.as_table_state());
return make_ready_future<>();
@@ -119,7 +119,7 @@ void set_compaction_manager(http_context& ctx, routes& r) {
auto& cm = db.get_compaction_manager();
return parallel_for_each(table_names, [&db, &cm, &ks_name, type] (sstring& table_name) {
auto& t = db.find_column_family(ks_name, table_name);
return cm.stop_compaction(type, &t);
return cm.stop_compaction(type, &t.as_table_state());
});
});
co_return json_void();

View File

@@ -6,30 +6,63 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "locator/token_metadata.hh"
#include "locator/snitch_base.hh"
#include "locator/production_snitch_base.hh"
#include "endpoint_snitch.hh"
#include "api/api-doc/endpoint_snitch_info.json.hh"
#include "api/api-doc/storage_service.json.hh"
#include "utils/fb_utilities.hh"
namespace api {
void set_endpoint_snitch(http_context& ctx, routes& r) {
void set_endpoint_snitch(http_context& ctx, routes& r, sharded<locator::snitch_ptr>& snitch) {
static auto host_or_broadcast = [](const_req req) {
auto host = req.get_query_param("host");
return host.empty() ? gms::inet_address(utils::fb_utilities::get_broadcast_address()) : gms::inet_address(host);
};
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(host_or_broadcast(req));
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [&ctx](const_req req) {
auto& topology = ctx.shared_token_metadata.local().get()->get_topology();
auto ep = host_or_broadcast(req);
if (!topology.has_endpoint(ep, locator::topology::pending::yes)) {
// Cannot return error here, nodetool status can race, request
// info about just-left node and not handle it nicely
return sstring(locator::production_snitch_base::default_dc);
}
return topology.get_datacenter(ep);
});
httpd::endpoint_snitch_info_json::get_rack.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(host_or_broadcast(req));
httpd::endpoint_snitch_info_json::get_rack.set(r, [&ctx](const_req req) {
auto& topology = ctx.shared_token_metadata.local().get()->get_topology();
auto ep = host_or_broadcast(req);
if (!topology.has_endpoint(ep, locator::topology::pending::yes)) {
// Cannot return error here, nodetool status can race, request
// info about just-left node and not handle it nicely
return sstring(locator::production_snitch_base::default_rack);
}
return topology.get_rack(ep);
});
httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_name();
httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [&snitch] (const_req req) {
return snitch.local()->get_name();
});
httpd::storage_service_json::update_snitch.set(r, [&snitch](std::unique_ptr<request> req) {
locator::snitch_config cfg;
cfg.name = req->get_query_param("ep_snitch_class_name");
return locator::i_endpoint_snitch::reset_snitch(snitch, cfg).then([] {
return make_ready_future<json::json_return_type>(json::json_void());
});
});
}
void unset_endpoint_snitch(http_context& ctx, routes& r) {
httpd::endpoint_snitch_info_json::get_datacenter.unset(r);
httpd::endpoint_snitch_info_json::get_rack.unset(r);
httpd::endpoint_snitch_info_json::get_snitch_name.unset(r);
httpd::storage_service_json::update_snitch.unset(r);
}
}

View File

@@ -10,8 +10,13 @@
#include "api.hh"
namespace locator {
class snitch_ptr;
}
namespace api {
void set_endpoint_snitch(http_context& ctx, routes& r);
void set_endpoint_snitch(http_context& ctx, routes& r, sharded<locator::snitch_ptr>&);
void unset_endpoint_snitch(http_context& ctx, routes& r);
}

View File

@@ -12,7 +12,7 @@
#include <seastar/http/exception.hh>
#include "log.hh"
#include "utils/error_injection.hh"
#include "seastar/core/future-util.hh"
#include <seastar/core/future-util.hh>
namespace api {

View File

@@ -18,7 +18,7 @@ namespace fd = httpd::failure_detector_json;
void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_all_endpoint_states.set(r, [&g](std::unique_ptr<request> req) {
std::vector<fd::endpoint_state> res;
for (auto i : g.endpoint_state_map) {
for (auto i : g.get_endpoint_states()) {
fd::endpoint_state val;
val.addrs = boost::lexical_cast<std::string>(i.first);
val.is_alive = i.second.is_alive();
@@ -40,54 +40,53 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
});
fd::get_up_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {
return gms::get_up_endpoint_count(g).then([](int res) {
return make_ready_future<json::json_return_type>(res);
});
int res = g.get_up_endpoint_count();
return make_ready_future<json::json_return_type>(res);
});
fd::get_down_endpoint_count.set(r, [&g](std::unique_ptr<request> req) {
return gms::get_down_endpoint_count(g).then([](int res) {
return make_ready_future<json::json_return_type>(res);
});
int res = g.get_down_endpoint_count();
return make_ready_future<json::json_return_type>(res);
});
fd::get_phi_convict_threshold.set(r, [] (std::unique_ptr<request> req) {
return gms::get_phi_convict_threshold().then([](double res) {
return make_ready_future<json::json_return_type>(res);
});
return make_ready_future<json::json_return_type>(8);
});
fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {
return gms::get_simple_states(g).then([](const std::map<sstring, sstring>& map) {
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(map));
});
std::map<sstring, sstring> nodes_status;
for (auto& entry : g.get_endpoint_states()) {
nodes_status.emplace(entry.first.to_sstring(), entry.second.is_alive() ? "UP" : "DOWN");
}
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
});
fd::set_phi_convict_threshold.set(r, [](std::unique_ptr<request> req) {
double phi = atof(req->get_query_param("phi").c_str());
return gms::set_phi_convict_threshold(phi).then([]() {
return make_ready_future<json::json_return_type>("");
});
return make_ready_future<json::json_return_type>("");
});
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
return get_endpoint_state(g, req->param["addr"]).then([](const sstring& state) {
return make_ready_future<json::json_return_type>(state);
});
auto* state = g.get_endpoint_state_for_endpoint_ptr(gms::inet_address(req->param["addr"]));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->param["addr"]));
}
std::stringstream ss;
g.append_endpoint_state(ss, *state);
return make_ready_future<json::json_return_type>(sstring(ss.str()));
});
fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {
return gms::get_arrival_samples().then([](std::map<gms::inet_address, gms::arrival_window> map) {
std::vector<fd::endpoint_phi_value> res;
auto now = gms::arrival_window::clk::now();
for (auto& p : map) {
fd::endpoint_phi_value val;
val.endpoint = p.first.to_sstring();
val.phi = p.second.phi(now);
res.emplace_back(std::move(val));
}
return make_ready_future<json::json_return_type>(res);
});
std::map<gms::inet_address, gms::arrival_window> map;
std::vector<fd::endpoint_phi_value> res;
auto now = gms::arrival_window::clk::now();
for (auto& p : map) {
fd::endpoint_phi_value val;
val.endpoint = p.first.to_sstring();
val.phi = p.second.phi(now);
res.emplace_back(std::move(val));
}
return make_ready_future<json::json_return_type>(res);
});
}

View File

@@ -14,7 +14,7 @@
#include "db/config.hh"
#include "utils/histogram.hh"
#include "replica/database.hh"
#include "seastar/core/scheduling_specific.hh"
#include <seastar/core/scheduling_specific.hh>
namespace api {
@@ -22,6 +22,9 @@ namespace sp = httpd::storage_proxy_json;
using proxy = service::storage_proxy;
using namespace json;
utils::time_estimated_histogram timed_rate_moving_average_summary_merge(utils::time_estimated_histogram a, const utils::timed_rate_moving_average_summary_and_histogram& b) {
return a.merge(b.histogram());
}
/**
* This function implement a two dimentional map reduce where
@@ -55,10 +58,10 @@ future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,
* @param initial_value - the initial value to use for both aggregations* @return
* @return A future that resolves to the result of the aggregation.
*/
template<typename V, typename Reducer, typename F>
template<typename V, typename Reducer, typename F, typename C>
future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,
V F::*f, Reducer reducer, V initial_value) {
return two_dimensional_map_reduce(d, [f] (F& stats) {
C F::*f, Reducer reducer, V initial_value) {
return two_dimensional_map_reduce(d, [f] (F& stats) -> V {
return stats.*f;
}, reducer, initial_value);
}
@@ -112,10 +115,10 @@ utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimat
return res;
}
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::time_estimated_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, f, utils::time_estimated_histogram_merge,
utils::time_estimated_histogram()).then([](const utils::time_estimated_histogram& val) {
static future<json::json_return_type> sum_estimated_histogram(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).histogram();
}, utils::time_estimated_histogram_merge, utils::time_estimated_histogram()).then([](const utils::time_estimated_histogram& val) {
return make_ready_future<json::json_return_type>(time_to_json_histogram(val));
});
}
@@ -130,7 +133,7 @@ static future<json::json_return_type> sum_estimated_histogram(http_context& ctx
});
}
static future<json::json_return_type> total_latency(http_context& ctx, utils::timed_rate_moving_average_and_histogram service::storage_proxy_stats::stats::*f) {
static future<json::json_return_type> total_latency(http_context& ctx, utils::timed_rate_moving_average_summary_and_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(ctx.sp, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).hist.mean * (stats.*f).hist.count;
}, std::plus<double>(), 0.0).then([](double val) {
@@ -150,7 +153,7 @@ static future<json::json_return_type> total_latency(http_context& ctx, utils::t
template<typename F>
future<json::json_return_type>
sum_histogram_stats_storage_proxy(distributed<proxy>& d,
utils::timed_rate_moving_average_and_histogram F::*f) {
utils::timed_rate_moving_average_summary_and_histogram F::*f) {
return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).hist;
}, std::plus<utils::ihistogram>(), utils::ihistogram()).
@@ -170,7 +173,7 @@ sum_histogram_stats_storage_proxy(distributed<proxy>& d,
template<typename F>
future<json::json_return_type>
sum_timer_stats_storage_proxy(distributed<proxy>& d,
utils::timed_rate_moving_average_and_histogram F::*f) {
utils::timed_rate_moving_average_summary_and_histogram F::*f) {
return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {
return (stats.*f).rate();
@@ -491,14 +494,14 @@ void set_storage_proxy(http_context& ctx, routes& r, sharded<service::storage_se
});
sp::get_read_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::estimated_read);
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::read);
});
sp::get_read_latency.set(r, [&ctx](std::unique_ptr<request> req) {
return total_latency(ctx, &service::storage_proxy_stats::stats::read);
});
sp::get_write_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::estimated_write);
return sum_estimated_histogram(ctx, &service::storage_proxy_stats::stats::write);
});
sp::get_write_latency.set(r, [&ctx](std::unique_ptr<request> req) {

View File

@@ -11,6 +11,7 @@
#include "db/config.hh"
#include "db/schema_tables.hh"
#include "utils/hash.hh"
#include <optional>
#include <sstream>
#include <time.h>
#include <algorithm>
@@ -24,8 +25,9 @@
#include "db/commitlog/commitlog.hh"
#include "gms/gossiper.hh"
#include "db/system_keyspace.hh"
#include "seastar/http/exception.hh"
#include <seastar/http/exception.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include "repair/row_level.hh"
#include "locator/snitch_base.hh"
#include "column_family.hh"
@@ -56,23 +58,30 @@ const locator::token_metadata& http_context::get_token_metadata() {
namespace ss = httpd::storage_service_json;
using namespace json;
sstring validate_keyspace(http_context& ctx, const parameters& param) {
const auto& ks_name = param["keyspace"];
sstring validate_keyspace(http_context& ctx, sstring ks_name) {
if (ctx.db.local().has_keyspace(ks_name)) {
return ks_name;
}
throw bad_param_exception(replica::no_such_keyspace(ks_name).what());
}
sstring validate_keyspace(http_context& ctx, const parameters& param) {
return validate_keyspace(ctx, param["keyspace"]);
}
locator::host_id validate_host_id(const sstring& param) {
auto hoep = locator::host_id_or_endpoint(param, locator::host_id_or_endpoint::param_type::host_id);
return hoep.id;
}
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {
auto it = query_params.find(param_name);
if (it == query_params.end()) {
return {};
std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, sstring value) {
if (value.empty()) {
return map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());
}
std::vector<sstring> names = split(it->second, ",");
std::vector<sstring> names = split(value, ",");
try {
for (const auto& table_name : names) {
ctx.db.local().find_column_family(ks_name, table_name);
@@ -83,6 +92,14 @@ std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, con
return names;
}
std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {
auto it = query_params.find(param_name);
if (it == query_params.end()) {
return {};
}
return parse_tables(ks_name, ctx, it->second);
}
static ss::token_range token_range_endpoints_to_json(const dht::token_range_endpoints& d) {
ss::token_range r;
r.start_token = d._start_token;
@@ -145,7 +162,7 @@ seastar::future<json::json_return_type> run_toppartitions_query(db::toppartition
});
}
future<json::json_return_type> set_tables_autocompaction(http_context& ctx, service::storage_service& ss, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
@@ -384,11 +401,10 @@ static future<json::json_return_type> describe_ring_as_json(sharded<service::sto
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring(keyspace), token_range_endpoints_to_json));
}
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs) {
ss::local_hostid.set(r, [](std::unique_ptr<request> req) {
return db::system_keyspace::load_local_host_id().then([](const utils::UUID& id) {
return make_ready_future<json::json_return_type>(id.to_sstring());
});
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ks) {
ss::local_hostid.set(r, [&ctx](std::unique_ptr<request> req) {
auto id = ctx.db.local().get_config().host_id;
return make_ready_future<json::json_return_type>(id.to_sstring());
});
ss::get_tokens.set(r, [&ctx] (std::unique_ptr<request> req) {
@@ -504,10 +520,10 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return ctx.db.local().get_config().saved_caches_directory();
});
ss::get_range_to_endpoint_map.set(r, [&ctx, &ss](std::unique_ptr<request> req) {
ss::get_range_to_endpoint_map.set(r, [&ctx, &ss](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
std::vector<ss::maplist_mapper> res;
return make_ready_future<json::json_return_type>(stream_range_as_array(ss.local().get_range_to_address_map(keyspace),
co_return stream_range_as_array(co_await ss.local().get_range_to_address_map(keyspace),
[](const std::pair<dht::token_range, inet_address_vector_replica_set>& entry){
ss::maplist_mapper m;
if (entry.first.start()) {
@@ -524,7 +540,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
m.value.push(address.to_sstring());
}
return m;
}));
});
});
ss::get_pending_range_to_endpoint_map.set(r, [&ctx](std::unique_ptr<request> req) {
@@ -536,7 +552,13 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::describe_any_ring.set(r, [&ctx, &ss](std::unique_ptr<request> req) {
return describe_ring_as_json(ss, "");
// Find an arbitrary non-system keyspace.
auto keyspaces = ctx.db.local().get_non_local_strategy_keyspaces();
if (keyspaces.empty()) {
throw std::runtime_error("No keyspace provided and no non system kespace exist");
}
auto ks = keyspaces[0];
return describe_ring_as_json(ss, ks);
});
ss::describe_ring.set(r, [&ctx, &ss](std::unique_ptr<request> req) {
@@ -593,13 +615,12 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
apilog.debug("force_keyspace_compaction: keyspace={} tables={}", keyspace, column_families);
return ctx.db.invoke_on_all([keyspace, column_families] (replica::database& db) -> future<> {
auto table_ids = boost::copy_range<std::vector<utils::UUID>>(column_families | boost::adaptors::transformed([&] (auto& cf_name) {
auto table_ids = boost::copy_range<std::vector<table_id>>(column_families | boost::adaptors::transformed([&] (auto& cf_name) {
return db.find_uuid(keyspace, cf_name);
}));
// major compact smaller tables first, to increase chances of success if low on space.
std::ranges::sort(table_ids, std::less<>(), [&] (const utils::UUID& id) {
std::ranges::sort(table_ids, std::less<>(), [&] (const table_id& id) {
return db.find_column_family(id).get_stats().live_disk_space_used;
});
// as a table can be dropped during loop below, let's find it before issuing major compaction request.
@@ -618,7 +639,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
if (column_families.empty()) {
column_families = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
}
apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, column_families);
return ss.local().is_cleanup_allowed(keyspace).then([&ctx, keyspace,
column_families = std::move(column_families)] (bool is_cleanup_allowed) mutable {
if (!is_cleanup_allowed) {
@@ -626,18 +646,19 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
std::runtime_error("Can not perform cleanup operation when topology changes"));
}
return ctx.db.invoke_on_all([keyspace, column_families] (replica::database& db) -> future<> {
auto table_ids = boost::copy_range<std::vector<utils::UUID>>(column_families | boost::adaptors::transformed([&] (auto& table_name) {
auto table_ids = boost::copy_range<std::vector<table_id>>(column_families | boost::adaptors::transformed([&] (auto& table_name) {
return db.find_uuid(keyspace, table_name);
}));
// cleanup smaller tables first, to increase chances of success if low on space.
std::ranges::sort(table_ids, std::less<>(), [&] (const utils::UUID& id) {
std::ranges::sort(table_ids, std::less<>(), [&] (const table_id& id) {
return db.find_column_family(id).get_stats().live_disk_space_used;
});
auto& cm = db.get_compaction_manager();
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(db.get_keyspace_local_ranges(keyspace));
// as a table can be dropped during loop below, let's find it before issuing the cleanup request.
for (auto& id : table_ids) {
replica::table& t = db.find_column_family(id);
co_await t.perform_cleanup_compaction(db);
co_await cm.perform_cleanup(owned_ranges_ptr, t.as_table_state());
}
co_return;
}).then([]{
@@ -647,7 +668,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::perform_keyspace_offstrategy_compaction.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> tables) -> future<json::json_return_type> {
apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, tables);
co_return co_await ctx.db.map_reduce0([&keyspace, &tables] (replica::database& db) -> future<bool> {
bool needed = false;
for (const auto& table : tables) {
@@ -661,12 +681,12 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, column_families, exclude_current_version);
return ctx.db.invoke_on_all([=] (replica::database& db) {
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(db.get_keyspace_local_ranges(keyspace));
return do_for_each(column_families, [=, &db](sstring cfname) {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_upgrade(db, &cf, exclude_current_version);
return cm.perform_sstable_upgrade(owned_ranges_ptr, cf.as_table_state(), exclude_current_version);
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
@@ -676,19 +696,17 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::force_keyspace_flush.set(r, [&ctx](std::unique_ptr<request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req->param);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, column_families);
auto &db = ctx.db.local();
auto& db = ctx.db;
if (column_families.empty()) {
co_await db.flush_on_all(keyspace);
co_await replica::database::flush_keyspace_on_all_shards(db, keyspace);
} else {
co_await db.flush_on_all(keyspace, std::move(column_families));
co_await replica::database::flush_tables_on_all_shards(db, keyspace, std::move(column_families));
}
co_return json_void();
});
ss::decommission.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("decommission");
return ss.local().decommission().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -702,21 +720,23 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::remove_node.set(r, [&ss](std::unique_ptr<request> req) {
auto host_id = req->get_query_param("host_id");
auto host_id = validate_host_id(req->get_query_param("host_id"));
std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
apilog.info("remove_node: host_id={} ignore_nodes={}", host_id, ignore_nodes_strs);
auto ignore_nodes = std::list<gms::inet_address>();
auto ignore_nodes = std::list<locator::host_id_or_endpoint>();
for (std::string n : ignore_nodes_strs) {
try {
std::replace(n.begin(), n.end(), '\"', ' ');
std::replace(n.begin(), n.end(), '\'', ' ');
boost::trim_all(n);
if (!n.empty()) {
auto node = gms::inet_address(n);
ignore_nodes.push_back(node);
auto hoep = locator::host_id_or_endpoint(n);
if (!ignore_nodes.empty() && hoep.has_host_id() != ignore_nodes.front().has_host_id()) {
throw std::runtime_error("All nodes should be identified using the same method: either Host IDs or ip addresses.");
}
ignore_nodes.push_back(std::move(hoep));
}
} catch (...) {
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}", ignore_nodes_strs, n));
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}: {}", ignore_nodes_strs, n, std::current_exception()));
}
}
return ss.local().removenode(host_id, std::move(ignore_nodes)).then([] {
@@ -757,13 +777,13 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::get_operation_mode.set(r, [&ss](std::unique_ptr<request> req) {
return ss.local().get_operation_mode().then([] (auto mode) {
return make_ready_future<json::json_return_type>(mode);
return make_ready_future<json::json_return_type>(format("{}", mode));
});
});
ss::is_starting.set(r, [&ss](std::unique_ptr<request> req) {
return ss.local().is_starting().then([] (auto starting) {
return make_ready_future<json::json_return_type>(starting);
return ss.local().get_operation_mode().then([] (auto mode) {
return make_ready_future<json::json_return_type>(mode <= service::storage_service::mode::STARTING);
});
});
@@ -777,7 +797,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::drain.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("drain");
return ss.local().drain().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -793,31 +812,20 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::get_keyspaces.set(r, [&ctx](const_req req) {
auto type = req.get_query_param("type");
if (type == "user") {
return ctx.db.local().get_non_system_keyspaces();
return ctx.db.local().get_user_keyspaces();
} else if (type == "non_local_strategy") {
return map_keys(ctx.db.local().get_keyspaces() | boost::adaptors::filtered([](const auto& p) {
return p.second.get_replication_strategy().get_type() != locator::replication_strategy_type::local;
}));
return ctx.db.local().get_non_local_strategy_keyspaces();
}
return map_keys(ctx.db.local().get_keyspaces());
});
ss::update_snitch.set(r, [](std::unique_ptr<request> req) {
auto ep_snitch_class_name = req->get_query_param("ep_snitch_class_name");
return locator::i_endpoint_snitch::reset_snitch(ep_snitch_class_name).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::stop_gossiping.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("stop_gossiping");
return ss.local().stop_gossiping().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::start_gossiping.set(r, [&ss](std::unique_ptr<request> req) {
apilog.info("start_gossiping");
return ss.local().start_gossiping().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -836,9 +844,13 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return make_ready_future<json::json_return_type>(json_void());
});
ss::is_initialized.set(r, [&ss](std::unique_ptr<request> req) {
return ss.local().is_initialized().then([] (bool initialized) {
return make_ready_future<json::json_return_type>(initialized);
ss::is_initialized.set(r, [&ss, &g](std::unique_ptr<request> req) {
return ss.local().get_operation_mode().then([&g] (auto mode) {
bool is_initialized = mode >= service::storage_service::mode::STARTING;
if (mode == service::storage_service::mode::NORMAL) {
is_initialized = g.is_enabled();
}
return make_ready_future<json::json_return_type>(is_initialized);
});
});
@@ -847,7 +859,9 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::is_joined.set(r, [&ss] (std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(ss.local().is_joined());
return ss.local().get_operation_mode().then([] (auto mode) {
return make_ready_future<json::json_return_type>(mode >= service::storage_service::mode::JOINING);
});
});
ss::set_stream_throughput_mb_per_sec.set(r, [](std::unique_ptr<request> req) {
@@ -914,7 +928,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::rebuild.set(r, [&ss](std::unique_ptr<request> req) {
auto source_dc = req->get_query_param("source_dc");
apilog.info("rebuild: source_dc={}", source_dc);
return ss.local().rebuild(std::move(source_dc)).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -947,19 +960,17 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return make_ready_future<json::json_return_type>(res);
});
ss::reset_local_schema.set(r, [](std::unique_ptr<request> req) {
ss::reset_local_schema.set(r, [&sys_ks](std::unique_ptr<request> req) {
// FIXME: We should truncate schema tables if more than one node in the cluster.
auto& sp = service::get_storage_proxy();
auto& fs = sp.local().features();
apilog.info("reset_local_schema");
return db::schema_tables::recalculate_schema_version(sp, fs).then([] {
return db::schema_tables::recalculate_schema_version(sys_ks, sp, fs).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::set_trace_probability.set(r, [](std::unique_ptr<request> req) {
auto probability = req->get_query_param("probability");
apilog.info("set_trace_probability: probability={}", probability);
return futurize_invoke([probability] {
double real_prob = std::stod(probability.c_str());
return tracing::tracing::tracing_instance().invoke_on_all([real_prob] (auto& local_tracing) {
@@ -997,7 +1008,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto ttl = req->get_query_param("ttl");
auto threshold = req->get_query_param("threshold");
auto fast = req->get_query_param("fast");
apilog.info("set_slow_query: enable={} ttl={} threshold={} fast={}", enable, ttl, threshold, fast);
try {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold, fast] (auto& local_tracing) {
if (threshold != "") {
@@ -1020,20 +1030,18 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
}
});
ss::enable_auto_compaction.set(r, [&ctx, &ss](std::unique_ptr<request> req) {
ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, ss.local(), keyspace, tables, true);
return set_tables_autocompaction(ctx, keyspace, tables, true);
});
ss::disable_auto_compaction.set(r, [&ctx, &ss](std::unique_ptr<request> req) {
ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto tables = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, ss.local(), keyspace, tables, false);
return set_tables_autocompaction(ctx, keyspace, tables, false);
});
ss::deliver_hints.set(r, [](std::unique_ptr<request> req) {
@@ -1182,7 +1190,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::sstable info;
info.timestamp = t;
info.generation = sstable->generation();
info.generation = sstables::generation_value(sstable->generation());
info.level = sstable->get_sstable_level();
info.size = sstable->bytes_on_disk();
info.data_size = sstable->ondisk_data_size();
@@ -1259,6 +1267,13 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
}
enum class scrub_status {
successful = 0,
aborted,
unable_to_cancel, // Not used in Scylla, included to ensure compability with nodetool api.
validation_errors,
};
void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl) {
ss::get_snapshot_details.set(r, [&snap_ctl](std::unique_ptr<request> req) {
return snap_ctl.local().get_snapshot_details().then([] (std::unordered_map<sstring, std::vector<db::snapshot_ctl::snapshot_details>>&& result) {
@@ -1297,7 +1312,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
});
});
ss::take_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) -> future<json::json_return_type> {
ss::take_snapshot.set(r, [&ctx, &snap_ctl](std::unique_ptr<request> req) -> future<json::json_return_type> {
apilog.info("take_snapshot: {}", req->query_parameters);
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
@@ -1315,7 +1330,13 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);
for (const auto& table_name : column_families) {
auto& t = ctx.db.local().find_column_family(keynames[0], table_name);
if (t.schema()->is_view()) {
throw std::invalid_argument("Do not take a snapshot of a materialized view or a secondary index by itself. Run snapshot on the base table instead.");
}
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, db::snapshot_ctl::snap_views::yes, sf);
}
co_return json_void();
} catch (...) {
@@ -1345,17 +1366,29 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
});
});
ss::scrub.set(r, wrap_ks_cf(ctx, [&snap_ctl] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
ss::scrub.set(r, [&ctx, &snap_ctl] (std::unique_ptr<request> req) {
auto rp = req_params({
{"keyspace", {mandatory::yes}},
{"cf", {""}},
{"scrub_mode", {}},
{"skip_corrupted", {}},
{"disable_snapshot", {}},
{"quarantine_mode", {}},
});
rp.process(*req);
auto keyspace = validate_keyspace(ctx, *rp.get("keyspace"));
auto column_families = parse_tables(keyspace, ctx, *rp.get("cf"));
auto scrub_mode_opt = rp.get("scrub_mode");
auto scrub_mode = sstables::compaction_type_options::scrub::mode::abort;
const sstring scrub_mode_str = req_param<sstring>(*req, "scrub_mode", "");
if (scrub_mode_str == "") {
const auto skip_corrupted = req_param<bool>(*req, "skip_corrupted", false);
if (!scrub_mode_opt) {
const auto skip_corrupted = rp.get_as<bool>("skip_corrupted").value_or(false);
if (skip_corrupted) {
scrub_mode = sstables::compaction_type_options::scrub::mode::skip;
}
} else {
auto scrub_mode_str = *scrub_mode_opt;
if (scrub_mode_str == "ABORT") {
scrub_mode = sstables::compaction_type_options::scrub::mode::abort;
} else if (scrub_mode_str == "SKIP") {
@@ -1365,7 +1398,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
} else if (scrub_mode_str == "VALIDATE") {
scrub_mode = sstables::compaction_type_options::scrub::mode::validate;
} else {
throw std::invalid_argument(fmt::format("Unknown argument for 'scrub_mode' parameter: {}", scrub_mode_str));
throw httpd::bad_param_exception(fmt::format("Unknown argument for 'scrub_mode' parameter: {}", scrub_mode_str));
}
}
@@ -1373,7 +1406,10 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
if (!req_param<bool>(*req, "disable_snapshot", false)) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
f = parallel_for_each(column_families, [&snap_ctl, keyspace, tag](sstring cf) {
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag, db::snapshot_ctl::skip_flush::no, db::snapshot_ctl::allow_view_snapshots::yes);
// We always pass here db::snapshot_ctl::snap_views::no since:
// 1. When scrubbing particular tables, there's no need to auto-snapshot their views.
// 2. When scrubbing the whole keyspace, column_families will contain both base tables and views.
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag, db::snapshot_ctl::snap_views::no, db::snapshot_ctl::skip_flush::no);
});
}
@@ -1388,20 +1424,39 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
} else if (quarantine_mode_str == "ONLY") {
opts.quarantine_operation_mode = sstables::compaction_type_options::scrub::quarantine_mode::only;
} else {
throw std::invalid_argument(fmt::format("Unknown argument for 'quarantine_mode' parameter: {}", quarantine_mode_str));
throw httpd::bad_param_exception(fmt::format("Unknown argument for 'quarantine_mode' parameter: {}", quarantine_mode_str));
}
return f.then([&ctx, keyspace, column_families, opts] {
return ctx.db.invoke_on_all([=] (replica::database& db) {
return do_for_each(column_families, [=, &db](sstring cfname) {
const auto& reduce_compaction_stats = [] (const compaction_manager::compaction_stats_opt& lhs, const compaction_manager::compaction_stats_opt& rhs) {
sstables::compaction_stats stats{};
stats += lhs.value();
stats += rhs.value();
return stats;
};
return f.then([&ctx, keyspace, column_families, opts, &reduce_compaction_stats] {
return ctx.db.map_reduce0([=] (replica::database& db) {
return map_reduce(column_families, [=, &db] (sstring cfname) {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_scrub(&cf, opts);
});
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
return cm.perform_sstable_scrub(cf.as_table_state(), opts);
}, std::make_optional(sstables::compaction_stats{}), reduce_compaction_stats);
}, std::make_optional(sstables::compaction_stats{}), reduce_compaction_stats);
}).then_wrapped([] (auto f) {
if (f.failed()) {
auto ex = f.get_exception();
if (try_catch<sstables::compaction_aborted_exception>(ex)) {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::aborted));
} else {
return make_exception_future<json::json_return_type>(std::move(ex));
}
} else if (f.get()->validation_errors) {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::validation_errors));
} else {
return make_ready_future<json::json_return_type>(static_cast<int>(scrub_status::successful));
}
});
}));
});
}
void unset_snapshot(http_context& ctx, routes& r) {

View File

@@ -19,6 +19,7 @@ class snapshot_ctl;
namespace view {
class view_builder;
}
class system_keyspace;
}
namespace netw { class messaging_service; }
class repair_service;
@@ -42,7 +43,7 @@ sstring validate_keyspace(http_context& ctx, const parameters& param);
// containing the description of the respective no_such_column_family error.
std::vector<sstring> parse_tables(const sstring& ks_name, http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs);
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, gms::gossiper& g, sharded<cdc::generation_service>& cdc_gs, sharded<db::system_keyspace>& sys_ls);
void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>& sst_loader);
void unset_sstables_loader(http_context& ctx, routes& r);
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb);

View File

@@ -61,6 +61,16 @@ void set_system(http_context& ctx, routes& r) {
return json::json_void();
});
hs::write_log_message.set(r, [](const_req req) {
try {
logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level")));
apilog.log(level, "/system/log: {}", std::string(req.get_query_param("message")));
} catch (boost::bad_lexical_cast& e) {
throw bad_param_exception("Unknown logging level " + req.get_query_param("level"));
}
return json::json_void();
});
hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {
apilog.info("Dropping sstable caches");
return ctx.db.invoke_on_all([] (replica::database& db) {

164
api/task_manager.cc Normal file
View File

@@ -0,0 +1,164 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include <seastar/core/coroutine.hh>
#include "task_manager.hh"
#include "api/api-doc/task_manager.json.hh"
#include "db/system_keyspace.hh"
#include "column_family.hh"
#include "unimplemented.hh"
#include "storage_service.hh"
#include <utility>
#include <boost/range/adaptors.hpp>
namespace api {
namespace tm = httpd::task_manager_json;
using namespace json;
inline bool filter_tasks(tasks::task_manager::task_ptr task, std::unordered_map<sstring, sstring>& query_params) {
return (!query_params.contains("keyspace") || query_params["keyspace"] == task->get_status().keyspace) &&
(!query_params.contains("table") || query_params["table"] == task->get_status().table);
}
struct full_task_status {
tasks::task_manager::task::status task_status;
tasks::task_manager::task::progress progress;
std::string module;
tasks::task_id parent_id;
tasks::is_abortable abortable;
};
struct task_stats {
task_stats(tasks::task_manager::task_ptr task) : task_id(task->id().to_sstring()), state(task->get_status().state) {}
sstring task_id;
tasks::task_manager::task_state state;
};
tm::task_status make_status(full_task_status status) {
auto start_time = db_clock::to_time_t(status.task_status.start_time);
auto end_time = db_clock::to_time_t(status.task_status.end_time);
::tm st, et;
::gmtime_r(&end_time, &et);
::gmtime_r(&start_time, &st);
tm::task_status res{};
res.id = status.task_status.id.to_sstring();
res.type = status.task_status.type;
res.state = status.task_status.state;
res.is_abortable = bool(status.abortable);
res.start_time = st;
res.end_time = et;
res.error = status.task_status.error;
res.parent_id = status.parent_id.to_sstring();
res.sequence_number = status.task_status.sequence_number;
res.shard = status.task_status.shard;
res.keyspace = status.task_status.keyspace;
res.table = status.task_status.table;
res.entity = status.task_status.entity;
res.progress_units = status.task_status.progress_units;
res.progress_total = status.progress.total;
res.progress_completed = status.progress.completed;
return res;
}
future<json::json_return_type> retrieve_status(tasks::task_manager::foreign_task_ptr task) {
if (task.get() == nullptr) {
co_return coroutine::return_exception(httpd::bad_param_exception("Task not found"));
}
auto progress = co_await task->get_progress();
full_task_status s;
s.task_status = task->get_status();
s.parent_id = task->get_parent_id();
s.abortable = task->is_abortable();
s.module = task->get_module_name();
s.progress.completed = progress.completed;
s.progress.total = progress.total;
co_return make_status(s);
}
void set_task_manager(http_context& ctx, routes& r) {
tm::get_modules.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
std::vector<std::string> v = boost::copy_range<std::vector<std::string>>(ctx.tm.local().get_modules() | boost::adaptors::map_keys);
co_return v;
});
tm::get_tasks.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
using chunked_stats = utils::chunked_vector<task_stats>;
auto internal = tasks::is_internal{req_param<bool>(*req, "internal", false)};
std::vector<chunked_stats> res = co_await ctx.tm.map([&req, internal] (tasks::task_manager& tm) {
chunked_stats local_res;
auto module = tm.find_module(req->param["module"]);
const auto& filtered_tasks = module->get_tasks() | boost::adaptors::filtered([&params = req->query_parameters, internal] (const auto& task) {
return (internal || !task.second->is_internal()) && filter_tasks(task.second, params);
});
for (auto& [task_id, task] : filtered_tasks) {
local_res.push_back(task_stats{task});
}
return local_res;
});
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
auto s = std::move(os);
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
co_await formatter::write(s, ts);
}
}
co_await s.write("]");
co_await s.close();
};
co_return std::move(f);
});
tm::get_task_status.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
auto state = task->get_status().state;
if (state == tasks::task_manager::task_state::done || state == tasks::task_manager::task_state::failed) {
task->unregister_task();
}
co_return std::move(task);
}));
co_return co_await retrieve_status(std::move(task));
});
tm::abort_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
if (!task->is_abortable()) {
co_await coroutine::return_exception(std::runtime_error("Requested task cannot be aborted"));
}
co_await task->abort();
});
co_return json_void();
});
tm::wait_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto task = co_await tasks::task_manager::invoke_on_task(ctx.tm, id, std::function([] (tasks::task_manager::task_ptr task) {
return task->done().then_wrapped([task] (auto f) {
task->unregister_task();
f.get();
return make_foreign(task);
});
}));
co_return co_await retrieve_status(std::move(task));
});
}
}

17
api/task_manager.hh Normal file
View File

@@ -0,0 +1,17 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include "api.hh"
namespace api {
void set_task_manager(http_context& ctx, routes& r);
}

109
api/task_manager_test.cc Normal file
View File

@@ -0,0 +1,109 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#ifndef SCYLLA_BUILD_MODE_RELEASE
#include <seastar/core/coroutine.hh>
#include "task_manager_test.hh"
#include "api/api-doc/task_manager_test.json.hh"
#include "tasks/test_module.hh"
namespace api {
namespace tmt = httpd::task_manager_test_json;
using namespace json;
void set_task_manager_test(http_context& ctx, routes& r, db::config& cfg) {
tmt::register_test_module.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
co_await ctx.tm.invoke_on_all([] (tasks::task_manager& tm) {
auto m = make_shared<tasks::test_module>(tm);
tm.register_module("test", m);
});
co_return json_void();
});
tmt::unregister_test_module.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
co_await ctx.tm.invoke_on_all([] (tasks::task_manager& tm) -> future<> {
auto module_name = "test";
auto module = tm.find_module(module_name);
co_await module->stop();
});
co_return json_void();
});
tmt::register_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
sharded<tasks::task_manager>& tms = ctx.tm;
auto it = req->query_parameters.find("task_id");
auto id = it != req->query_parameters.end() ? tasks::task_id{utils::UUID{it->second}} : tasks::task_id::create_null_id();
it = req->query_parameters.find("shard");
unsigned shard = it != req->query_parameters.end() ? boost::lexical_cast<unsigned>(it->second) : 0;
it = req->query_parameters.find("keyspace");
std::string keyspace = it != req->query_parameters.end() ? it->second : "";
it = req->query_parameters.find("table");
std::string table = it != req->query_parameters.end() ? it->second : "";
it = req->query_parameters.find("type");
std::string type = it != req->query_parameters.end() ? it->second : "";
it = req->query_parameters.find("entity");
std::string entity = it != req->query_parameters.end() ? it->second : "";
it = req->query_parameters.find("parent_id");
tasks::task_info data;
if (it != req->query_parameters.end()) {
data.id = tasks::task_id{utils::UUID{it->second}};
auto parent_ptr = co_await tasks::task_manager::lookup_task_on_all_shards(ctx.tm, data.id);
data.shard = parent_ptr->get_status().shard;
}
auto module = tms.local().find_module("test");
id = co_await module->make_task<tasks::test_task_impl>(shard, id, keyspace, table, type, entity, data);
co_await tms.invoke_on(shard, [id] (tasks::task_manager& tm) {
auto it = tm.get_all_tasks().find(id);
if (it != tm.get_all_tasks().end()) {
it->second->start();
}
});
co_return id.to_sstring();
});
tmt::unregister_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->query_parameters["task_id"]}};
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
co_await test_task.unregister_task();
});
co_return json_void();
});
tmt::finish_test_task.set(r, [&ctx] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->param["task_id"]}};
auto it = req->query_parameters.find("error");
bool fail = it != req->query_parameters.end();
std::string error = fail ? it->second : "";
co_await tasks::task_manager::invoke_on_task(ctx.tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) {
tasks::test_task test_task{task};
if (fail) {
test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
} else {
test_task.finish();
}
return make_ready_future<>();
});
co_return json_void();
});
tmt::get_and_update_ttl.set(r, [&ctx, &cfg] (std::unique_ptr<request> req) -> future<json::json_return_type> {
uint32_t ttl = cfg.task_ttl_seconds();
cfg.task_ttl_seconds.set(boost::lexical_cast<uint32_t>(req->query_parameters["ttl"]));
co_return json::json_return_type(ttl);
});
}
}
#endif

22
api/task_manager_test.hh Normal file
View File

@@ -0,0 +1,22 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#ifndef SCYLLA_BUILD_MODE_RELEASE
#pragma once
#include "api.hh"
#include "db/config.hh"
namespace api {
void set_task_manager_test(http_context& ctx, routes& r, db::config& cfg);
}
#endif

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*
@@ -77,7 +74,7 @@ future<bool> default_authorizer::any_granted() const {
query,
db::consistency_level::LOCAL_ONE,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
cql3::query_processor::cache_internal::yes).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
});
}
@@ -88,7 +85,8 @@ future<> default_authorizer::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::LOCAL_ONE,
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
@@ -168,7 +166,8 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
{*maybe_role.name, r.name()},
cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
}
@@ -197,7 +196,8 @@ default_authorizer::modify(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
{permissions::to_strings(set), sstring(role_name), resource.name()},
cql3::query_processor::cache_internal::no).discard_result();
});
}
@@ -223,7 +223,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
db::consistency_level::ONE,
internal_distributed_query_state(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
for (const auto& row : *results) {
@@ -249,7 +249,8 @@ future<> default_authorizer::revoke_all(std::string_view role_name) const {
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
@@ -268,7 +269,8 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
{resource.name()},
cql3::query_processor::cache_internal::no).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
return parallel_for_each(
@@ -284,7 +286,8 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
{r.get_as<sstring>(ROLE_NAME), resource.name()},
cql3::query_processor::cache_internal::no).discard_result().handle_exception(
[resource](auto ep) {
try {
std::rethrow_exception(ep);

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*
@@ -87,7 +84,8 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state(),
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
@@ -96,7 +94,8 @@ future<> password_authenticator::migrate_legacy_metadata() const {
update_row_query(),
consistency_for_user(username),
internal_distributed_query_state(),
{std::move(salted_hash), username}).discard_result();
{std::move(salted_hash), username},
cql3::query_processor::cache_internal::no).discard_result();
}).finally([results] {});
}).then([] {
plogger.info("Finished migrating legacy authentication metadata.");
@@ -113,7 +112,8 @@ future<> password_authenticator::create_default_if_missing() const {
update_row_query(),
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME}).then([](auto&&) {
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME},
cql3::query_processor::cache_internal::no).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
}
@@ -211,7 +211,7 @@ future<authenticated_user> password_authenticator::authenticate(
consistency_for_user(username),
internal_distributed_query_state(),
{username},
true);
cql3::query_processor::cache_internal::yes);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
@@ -244,7 +244,8 @@ future<> password_authenticator::create(std::string_view role_name, const authen
update_row_query(),
consistency_for_user(role_name),
internal_distributed_query_state(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
}
future<> password_authenticator::alter(std::string_view role_name, const authentication_options& options) const {
@@ -261,7 +262,8 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
query,
consistency_for_user(role_name),
internal_distributed_query_state(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
}
future<> password_authenticator::drop(std::string_view name) const {
@@ -273,7 +275,8 @@ future<> password_authenticator::drop(std::string_view name) const {
return _qp.execute_internal(
query, consistency_for_user(name),
internal_distributed_query_state(),
{sstring(name)}).discard_result();
{sstring(name)},
cql3::query_processor::cache_internal::no).discard_result();
}
future<custom_options> password_authenticator::query_custom_options(std::string_view role_name) const {

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -14,13 +14,21 @@
namespace auth {
permissions_cache::permissions_cache(const permissions_cache_config& c, service& ser, logging::logger& log)
: _cache(c.max_entries, c.validity_period, c.update_period, log, [&ser, &log](const key_type& k) {
permissions_cache::permissions_cache(const utils::loading_cache_config& c, service& ser, logging::logger& log)
: _cache(c, log, [&ser, &log](const key_type& k) {
log.debug("Refreshing permissions for {}", k.first);
return ser.get_uncached_permissions(k.first, k.second);
}) {
}
bool permissions_cache::update_config(utils::loading_cache_config c) {
return _cache.update_config(std::move(c));
}
void permissions_cache::reset() {
_cache.reset();
}
future<permission_set> permissions_cache::get(const role_or_anonymous& maybe_role, const resource& r) {
return do_with(key_type(maybe_role, r), [this](const auto& k) {
return _cache.get(k);

View File

@@ -44,12 +44,6 @@ namespace auth {
class service;
struct permissions_cache_config final {
std::size_t max_entries;
std::chrono::milliseconds validity_period;
std::chrono::milliseconds update_period;
};
class permissions_cache final {
using cache_type = utils::loading_cache<
std::pair<role_or_anonymous, resource>,
@@ -64,12 +58,14 @@ class permissions_cache final {
cache_type _cache;
public:
explicit permissions_cache(const permissions_cache_config&, service&, logging::logger&);
explicit permissions_cache(const utils::loading_cache_config&, service&, logging::logger&);
future <> stop() {
return _cache.stop();
}
bool update_config(utils::loading_cache_config);
void reset();
future<permission_set> get(const role_or_anonymous&, const resource&);
};

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2016-present ScyllaDB
*

View File

@@ -56,14 +56,14 @@ future<bool> default_role_row_satisfies(
query,
db::consistency_level::ONE,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
cql3::query_processor::cache_internal::yes).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
cql3::query_processor::cache_internal::yes).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return make_ready_future<bool>(false);
}
@@ -86,7 +86,8 @@ future<bool> any_nondefault_role_row_satisfies(
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state(),
cql3::query_processor::cache_internal::no).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2019-present ScyllaDB
*

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2019-present ScyllaDB
*

View File

@@ -22,6 +22,7 @@
#include "auth/role_or_anonymous.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/config.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
@@ -100,23 +101,28 @@ static future<> validate_role_exists(const service& ser, std::string_view role_n
}
service::service(
permissions_cache_config c,
utils::loading_cache_config c,
cql3::query_processor& qp,
::service::migration_notifier& mn,
std::unique_ptr<authorizer> z,
std::unique_ptr<authenticator> a,
std::unique_ptr<role_manager> r)
: _permissions_cache_config(std::move(c))
: _loading_cache_config(std::move(c))
, _permissions_cache(nullptr)
, _qp(qp)
, _mnotifier(mn)
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer)) {}
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer))
, _permissions_cache_cfg_cb([this] (uint32_t) { (void) _permissions_cache_config_action.trigger_later(); })
, _permissions_cache_config_action([this] { update_cache_config(); return make_ready_future<>(); })
, _permissions_cache_max_entries_observer(_qp.db().get_config().permissions_cache_max_entries.observe(_permissions_cache_cfg_cb))
, _permissions_cache_update_interval_in_ms_observer(_qp.db().get_config().permissions_update_interval_in_ms.observe(_permissions_cache_cfg_cb))
, _permissions_cache_validity_in_ms_observer(_qp.db().get_config().permissions_validity_in_ms.observe(_permissions_cache_cfg_cb)) {}
service::service(
permissions_cache_config c,
utils::loading_cache_config c,
cql3::query_processor& qp,
::service::migration_notifier& mn,
::service::migration_manager& mm,
@@ -160,7 +166,7 @@ future<> service::start(::service::migration_manager& mm) {
return when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();
});
}).then([this] {
_permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
_permissions_cache = std::make_unique<permissions_cache>(_loading_cache_config, *this, log);
}).then([this] {
return once_among_shards([this] {
_mnotifier.register_listener(_migration_listener.get());
@@ -182,6 +188,24 @@ future<> service::stop() {
});
}
void service::update_cache_config() {
auto db = _qp.db();
utils::loading_cache_config perm_cache_config;
perm_cache_config.max_size = db.get_config().permissions_cache_max_entries();
perm_cache_config.expiry = std::chrono::milliseconds(db.get_config().permissions_validity_in_ms());
perm_cache_config.refresh = std::chrono::milliseconds(db.get_config().permissions_update_interval_in_ms());
if (!_permissions_cache->update_config(std::move(perm_cache_config))) {
log.error("Failed to apply permissions cache changes. Please read the documentation of these parameters");
}
}
void service::reset_authorization_cache() {
_permissions_cache->reset();
_qp.reset_cache();
}
future<bool> service::has_existing_legacy_users() const {
if (!_qp.db().has_schema(meta::AUTH_KS, meta::USERS_CF)) {
return make_ready_future<bool>(false);
@@ -203,7 +227,7 @@ future<bool> service::has_existing_legacy_users() const {
default_user_query,
db::consistency_level::ONE,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
cql3::query_processor::cache_internal::yes).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
@@ -212,14 +236,15 @@ future<bool> service::has_existing_legacy_users() const {
default_user_query,
db::consistency_level::QUORUM,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
cql3::query_processor::cache_internal::yes).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.execute_internal(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
db::consistency_level::QUORUM,
cql3::query_processor::cache_internal::no).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});

View File

@@ -23,6 +23,8 @@
#include "auth/permissions_cache.hh"
#include "auth/role_manager.hh"
#include "seastarx.hh"
#include "utils/observable.hh"
#include "utils/serialized_action.hh"
namespace cql3 {
class query_processor;
@@ -68,7 +70,7 @@ public:
/// peering_sharded_service inheritance is needed to be able to access shard local authentication service
/// given an object from another shard. Used for bouncing lwt requests to correct shard.
class service final : public seastar::peering_sharded_service<service> {
permissions_cache_config _permissions_cache_config;
utils::loading_cache_config _loading_cache_config;
std::unique_ptr<permissions_cache> _permissions_cache;
cql3::query_processor& _qp;
@@ -84,9 +86,16 @@ class service final : public seastar::peering_sharded_service<service> {
// Only one of these should be registered, so we end up with some unused instances. Not the end of the world.
std::unique_ptr<::service::migration_listener> _migration_listener;
std::function<void(uint32_t)> _permissions_cache_cfg_cb;
serialized_action _permissions_cache_config_action;
utils::observer<uint32_t> _permissions_cache_max_entries_observer;
utils::observer<uint32_t> _permissions_cache_update_interval_in_ms_observer;
utils::observer<uint32_t> _permissions_cache_validity_in_ms_observer;
public:
service(
permissions_cache_config,
utils::loading_cache_config,
cql3::query_processor&,
::service::migration_notifier&,
std::unique_ptr<authorizer>,
@@ -99,7 +108,7 @@ public:
/// of the instances themselves.
///
service(
permissions_cache_config,
utils::loading_cache_config,
cql3::query_processor&,
::service::migration_notifier&,
::service::migration_manager&,
@@ -109,6 +118,10 @@ public:
future<> stop();
void update_cache_config();
void reset_authorization_cache();
///
/// \returns an exceptional future with \ref nonexistant_role if the named role does not exist.
///

View File

@@ -95,7 +95,7 @@ static future<std::optional<record>> find_record(cql3::query_processor& qp, std:
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return std::optional<record>();
}
@@ -178,7 +178,8 @@ future<> standard_role_manager::create_default_role_if_missing() const {
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
{meta::DEFAULT_SUPERUSER_NAME},
cql3::query_processor::cache_internal::no).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
});
@@ -204,7 +205,8 @@ future<> standard_role_manager::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state(),
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_or<bool>("super", false);
@@ -267,7 +269,7 @@ future<> standard_role_manager::create_or_replace(std::string_view role_name, co
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
cql3::query_processor::cache_internal::yes).discard_result();
}
future<>
@@ -309,7 +311,8 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
meta::roles_table::role_col_name),
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result();
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
});
}
@@ -328,7 +331,8 @@ future<> standard_role_manager::drop(std::string_view role_name) {
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
{sstring(role_name)},
cql3::query_processor::cache_internal::no).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
members->end(),
@@ -360,7 +364,7 @@ future<> standard_role_manager::drop(std::string_view role_name) {
// Delete all attributes for that role
const auto remove_attributes_of = [this, role_name] {
static const sstring query = format("DELETE FROM {} WHERE role = ?", meta::role_attributes_table::qualified_name());
return _qp.execute_internal(query, {sstring(role_name)}).discard_result();
return _qp.execute_internal(query, {sstring(role_name)}, cql3::query_processor::cache_internal::yes).discard_result();
};
// Finally, delete the role itself.
@@ -373,7 +377,8 @@ future<> standard_role_manager::drop(std::string_view role_name) {
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result();
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
};
return when_all_succeed(revoke_from_members(), revoke_members_of(),
@@ -401,7 +406,8 @@ standard_role_manager::modify_membership(
query,
consistency_for_role(grantee_name),
internal_distributed_query_state(),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
{role_set{sstring(role_name)}, sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
};
const auto modify_role_members = [this, role_name, grantee_name, ch] {
@@ -412,7 +418,8 @@ standard_role_manager::modify_membership(
meta::role_members_table::qualified_name),
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
{sstring(role_name), sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
case membership_change::remove:
return _qp.execute_internal(
@@ -420,7 +427,8 @@ standard_role_manager::modify_membership(
meta::role_members_table::qualified_name),
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
{sstring(role_name), sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
}
return make_ready_future<>();
@@ -522,7 +530,8 @@ future<role_set> standard_role_manager::query_all() {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state()).then([](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state(),
cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(
@@ -557,7 +566,7 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
static const sstring query = format("SELECT name, value FROM {} WHERE role = ? AND name = ?", meta::role_attributes_table::qualified_name());
return _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}).then([] (shared_ptr<cql3::untyped_result_set> result_set) {
return _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes).then([] (shared_ptr<cql3::untyped_result_set> result_set) {
if (!result_set->empty()) {
const cql3::untyped_result_set_row &row = result_set->one();
return std::optional<sstring>(row.get_as<sstring>("value"));
@@ -590,7 +599,7 @@ future<> standard_role_manager::set_attribute(std::string_view role_name, std::s
if (!role_exists) {
throw auth::nonexistant_role(role_name);
}
return _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name), sstring(attribute_value)}).discard_result();
return _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, cql3::query_processor::cache_internal::yes).discard_result();
});
});
@@ -603,7 +612,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
if (!role_exists) {
throw auth::nonexistant_role(role_name);
}
return _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}).discard_result();
return _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes).discard_result();
});
});
}

View File

@@ -1,6 +1,3 @@
/*
*/
/*
* Copyright (C) 2017-present ScyllaDB
*

View File

@@ -37,19 +37,27 @@
// The constants q1 and q2 are used to determine the proportional factor at each stage.
class backlog_controller {
public:
struct scheduling_group {
seastar::scheduling_group cpu = default_scheduling_group();
seastar::io_priority_class io = default_priority_class();
};
future<> shutdown() {
_update_timer.cancel();
return std::move(_inflight_update);
}
future<> update_static_shares(float static_shares) {
_static_shares = static_shares;
return make_ready_future<>();
}
protected:
struct control_point {
float input;
float output;
};
seastar::scheduling_group _scheduling_group;
const ::io_priority_class& _io_priority;
std::chrono::milliseconds _interval;
scheduling_group _scheduling_group;
timer<> _update_timer;
std::vector<control_point> _control_points;
@@ -58,41 +66,36 @@ protected:
// updating shares for an I/O class may contact another shard and returns a future.
future<> _inflight_update;
// Used when the controllers are disabled and a static share is used
// When that option is deprecated we should remove this.
float _static_shares;
virtual void update_controller(float quota);
bool controller_disabled() const noexcept {
return _static_shares > 0;
}
void adjust();
backlog_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval,
std::vector<control_point> control_points, std::function<float()> backlog)
: _scheduling_group(sg)
, _io_priority(iop)
, _interval(interval)
backlog_controller(scheduling_group sg, std::chrono::milliseconds interval,
std::vector<control_point> control_points, std::function<float()> backlog,
float static_shares = 0)
: _scheduling_group(std::move(sg))
, _update_timer([this] { adjust(); })
, _control_points()
, _current_backlog(std::move(backlog))
, _inflight_update(make_ready_future<>())
, _static_shares(static_shares)
{
_control_points.insert(_control_points.end(), control_points.begin(), control_points.end());
_update_timer.arm_periodic(_interval);
}
// Used when the controllers are disabled and a static share is used
// When that option is deprecated we should remove this.
backlog_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares)
: _scheduling_group(sg)
, _io_priority(iop)
, _inflight_update(make_ready_future<>())
{
update_controller(static_shares);
_update_timer.arm_periodic(interval);
}
virtual ~backlog_controller() {}
public:
backlog_controller(backlog_controller&&) = default;
float backlog_of_shares(float shares) const;
seastar::scheduling_group sg() {
return _scheduling_group;
}
};
// memtable flush CPU controller.
@@ -113,11 +116,11 @@ public:
class flush_controller : public backlog_controller {
static constexpr float hard_dirty_limit = 1.0f;
public:
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty)
: backlog_controller(sg, iop, std::move(interval),
flush_controller(backlog_controller::scheduling_group sg, float static_shares, std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty)
: backlog_controller(std::move(sg), std::move(interval),
std::vector<backlog_controller::control_point>({{0.0, 0.0}, {soft_limit, 10}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::move(current_dirty)
std::move(current_dirty),
static_shares
)
{}
};
@@ -127,11 +130,11 @@ public:
static constexpr unsigned normalization_factor = 30;
static constexpr float disable_backlog = std::numeric_limits<double>::infinity();
static constexpr float backlog_disabled(float backlog) { return std::isinf(backlog); }
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, std::function<float()> current_backlog)
: backlog_controller(sg, iop, std::move(interval),
compaction_controller(backlog_controller::scheduling_group sg, float static_shares, std::chrono::milliseconds interval, std::function<float()> current_backlog)
: backlog_controller(std::move(sg), std::move(interval),
std::vector<backlog_controller::control_point>({{0.0, 50}, {1.5, 100} , {normalization_factor, 1000}}),
std::move(current_backlog)
std::move(current_backlog),
static_shares
)
{}
};

59
build_mode.hh Normal file
View File

@@ -0,0 +1,59 @@
/*
* Copyright (C) 2022-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#ifndef SCYLLA_BUILD_MODE
#error SCYLLA_BUILD_MODE must be defined
#endif
#ifndef STRINGIFY
// We need to levels of indirection
// to make a string out of the macro name.
// The outer level expands the macro
// and the inner level makes a string out of the expanded macro.
#define STRINGIFY_VALUE(x) #x
#define STRINGIFY_MACRO(x) STRINGIFY_VALUE(x)
#endif
#define SCYLLA_BUILD_MODE_STR STRINGIFY_MACRO(SCYLLA_BUILD_MODE)
// We use plain macro definitions
// so the preprocessor can expand them
// inline in the #if directives below
#define SCYLLA_BUILD_MODE_CODE_debug 0
#define SCYLLA_BUILD_MODE_CODE_release 1
#define SCYLLA_BUILD_MODE_CODE_dev 2
#define SCYLLA_BUILD_MODE_CODE_sanitize 3
#define SCYLLA_BUILD_MODE_CODE_coverage 4
#define _SCYLLA_BUILD_MODE_CODE(sbm) SCYLLA_BUILD_MODE_CODE_ ## sbm
#define SCYLLA_BUILD_MODE_CODE(sbm) _SCYLLA_BUILD_MODE_CODE(sbm)
#if SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_debug
#define SCYLLA_BUILD_MODE_DEBUG
#elif SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_release
#define SCYLLA_BUILD_MODE_RELEASE
#elif SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_dev
#define SCYLLA_BUILD_MODE_DEV
#elif SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_sanitize
#define SCYLLA_BUILD_MODE_SANITIZE
#elif SCYLLA_BUILD_MODE_CODE(SCYLLA_BUILD_MODE) == SCYLLA_BUILD_MODE_CODE_coverage
#define SCYLLA_BUILD_MODE_COVERAGE
#else
#error unrecognized SCYLLA_BUILD_MODE
#endif
#if (defined(SCYLLA_BUILD_MODE_RELEASE) || defined(SCYLLA_BUILD_MODE_DEV)) && defined(SEASTAR_DEBUG)
#error SEASTAR_DEBUG is not expected to be defined when SCYLLA_BUILD_MODE is "release" or "dev"
#endif
#if (defined(SCYLLA_BUILD_MODE_DEBUG) || defined(SCYLLA_BUILD_MODE_SANITIZE)) && !defined(SEASTAR_DEBUG)
#error SEASTAR_DEBUG is expected to be defined when SCYLLA_BUILD_MODE is "debug" or "sanitize"
#endif

View File

@@ -11,9 +11,11 @@
#include <boost/range/iterator_range.hpp>
#include "bytes.hh"
#include "utils/managed_bytes.hh"
#include "hashing.hh"
#include <seastar/core/simple-stream.hh>
#include <seastar/core/loop.hh>
#include <bit>
#include <concepts>
/**
@@ -31,26 +33,15 @@ public:
static constexpr size_type max_chunk_size() { return max_alloc_size() - sizeof(chunk); }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
// FIXME: group fragment pointers to reduce pointer chasing when packetizing
std::unique_ptr<chunk> next;
~chunk() {
auto p = std::move(next);
while (p) {
// Avoid recursion when freeing chunks
auto p_next = std::move(p->next);
p = std::move(p_next);
}
}
size_type offset; // Also means "size" after chunk is closed
size_type size;
value_type data[0];
void operator delete(void* ptr) { free(ptr); }
};
// Note: while appending data, chunk::size refers to the allocated space in the chunk,
// and chunk::frag_size refers to the currently occupied space in the chunk.
// After building, the first chunk::size is the whole object size, and chunk::frag_size
// doesn't change. This fits with managed_bytes interpretation.
using chunk = blob_storage;
static constexpr size_type default_chunk_size{512};
static constexpr size_type max_alloc_size() { return 128 * 1024; }
private:
std::unique_ptr<chunk> _begin;
blob_storage::ref_type _begin;
chunk* _current;
size_type _size;
size_type _initial_chunk_size = default_chunk_size;
@@ -70,13 +61,13 @@ public:
fragment_iterator(const fragment_iterator&) = default;
fragment_iterator& operator=(const fragment_iterator&) = default;
bytes_view operator*() const {
return { _current->data, _current->offset };
return { _current->data, _current->frag_size };
}
bytes_view operator->() const {
return *(*this);
}
fragment_iterator& operator++() {
_current = _current->next.get();
_current = _current->next;
return *this;
}
fragment_iterator operator++(int) {
@@ -119,19 +110,21 @@ private:
if (!_current) {
return 0;
}
return _current->size - _current->offset;
return _current->size - _current->frag_size;
}
// Figure out next chunk size.
// - must be enough for data_size + sizeof(chunk)
// - must be at least _initial_chunk_size
// - try to double each time to prevent too many allocations
// - should not exceed max_alloc_size, unless data_size requires so
// - will be power-of-two so the allocated memory can be fully utilized.
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: _initial_chunk_size;
next_size = std::min(next_size, max_alloc_size());
return std::max<size_type>(next_size, data_size + sizeof(chunk));
auto r = std::max<size_type>(next_size, data_size + sizeof(chunk));
return std::bit_ceil(r);
}
// Makes room for a contiguous region of given size.
// The region is accounted for as already written.
@@ -139,8 +132,8 @@ private:
[[gnu::always_inline]]
value_type* alloc(size_type size) {
if (__builtin_expect(size <= current_space_left(), true)) {
auto ret = _current->data + _current->offset;
_current->offset += size;
auto ret = _current->data + _current->frag_size;
_current->frag_size += size;
_size += size;
return ret;
} else {
@@ -154,19 +147,21 @@ private:
if (!space) {
throw std::bad_alloc();
}
auto new_chunk = std::unique_ptr<chunk>(new (space) chunk());
new_chunk->offset = size;
new_chunk->size = alloc_size - sizeof(chunk);
if (_current) {
_current->next = std::move(new_chunk);
_current = _current->next.get();
} else {
_begin = std::move(new_chunk);
_current = _begin.get();
}
auto backref = _current ? &_current->next : &_begin;
auto new_chunk = new (space) chunk(backref, alloc_size - sizeof(chunk), size);
_current = new_chunk;
_size += size;
return _current->data;
}
[[gnu::noinline]]
void free_chain(chunk* c) noexcept {
while (c) {
auto n = c->next;
c->~chunk();
::free(c);
c = n;
}
}
public:
explicit bytes_ostream(size_t initial_chunk_size) noexcept
: _begin()
@@ -178,7 +173,7 @@ public:
bytes_ostream() noexcept : bytes_ostream(default_chunk_size) {}
bytes_ostream(bytes_ostream&& o) noexcept
: _begin(std::move(o._begin))
: _begin(std::exchange(o._begin, {}))
, _current(o._current)
, _size(o._size)
, _initial_chunk_size(o._initial_chunk_size)
@@ -196,6 +191,10 @@ public:
append(o);
}
~bytes_ostream() {
free_chain(_begin.ptr);
}
bytes_ostream& operator=(const bytes_ostream& o) {
if (this != &o) {
auto x = bytes_ostream(o);
@@ -243,8 +242,8 @@ public:
auto this_size = std::min(v.size(), size_t(current_space_left()));
if (__builtin_expect(this_size, true)) {
memcpy(_current->data + _current->offset, v.begin(), this_size);
_current->offset += this_size;
memcpy(_current->data + _current->frag_size, v.begin(), this_size);
_current->frag_size += this_size;
_size += this_size;
v.remove_prefix(this_size);
}
@@ -287,19 +286,20 @@ public:
throw std::bad_alloc();
}
auto new_chunk = std::unique_ptr<chunk>(new (space) chunk());
new_chunk->offset = _size;
new_chunk->size = _size;
auto old_begin = _begin;
auto new_chunk = new (space) chunk(&_begin, _size, _size);
auto dst = new_chunk->data;
auto r = _begin.get();
auto r = old_begin.ptr;
while (r) {
auto next = r->next.get();
dst = std::copy_n(r->data, r->offset, dst);
auto next = r->next;
dst = std::copy_n(r->data, r->frag_size, dst);
r->~chunk();
::free(r);
r = next;
}
_current = new_chunk.get();
_current = new_chunk;
_begin = std::move(new_chunk);
return bytes_view(_current->data, _size);
}
@@ -333,22 +333,23 @@ public:
void remove_suffix(size_t n) {
_size -= n;
auto left = _size;
auto current = _begin.get();
auto current = _begin.ptr;
while (current) {
if (current->offset >= left) {
current->offset = left;
if (current->frag_size >= left) {
current->frag_size = left;
_current = current;
current->next.reset();
free_chain(current->next);
current->next = nullptr;
return;
}
left -= current->offset;
current = current->next.get();
left -= current->frag_size;
current = current->next;
}
}
// begin() and end() form an input range to bytes_view representing fragments.
// Any modification of this instance invalidates iterators.
fragment_iterator begin() const { return { _begin.get() }; }
fragment_iterator begin() const { return { _begin.ptr }; }
fragment_iterator end() const { return { nullptr }; }
output_iterator write_begin() { return output_iterator(*this); }
@@ -363,7 +364,7 @@ public:
};
position pos() const {
return { _current, _current ? _current->offset : 0 };
return { _current, _current ? _current->frag_size : 0 };
}
// Returns the amount of bytes written since given position.
@@ -373,11 +374,11 @@ public:
if (!c) {
return _size;
}
size_type total = c->offset - pos._offset;
c = c->next.get();
size_type total = c->frag_size - pos._offset;
c = c->next;
while (c) {
total += c->offset;
c = c->next.get();
total += c->frag_size;
c = c->next;
}
return total;
}
@@ -391,8 +392,9 @@ public:
}
_size -= written_since(pos);
_current = pos._chunk;
free_chain(_current->next);
_current->next = nullptr;
_current->offset = pos._offset;
_current->frag_size = pos._offset;
}
void reduce_chunk_count() {
@@ -441,11 +443,23 @@ public:
// the clear() calls then writes will not involve any memory allocations,
// except for the first write made on this instance.
void clear() {
if (_begin) {
_begin->offset = 0;
if (_begin.ptr) {
_begin.ptr->frag_size = 0;
_size = 0;
_current = _begin.get();
_begin->next.reset();
free_chain(_begin.ptr->next);
_begin.ptr->next = nullptr;
_current = _begin.ptr;
}
}
managed_bytes to_managed_bytes() && {
if (_size) {
_begin.ptr->size = _size;
_current = nullptr;
_size = 0;
return managed_bytes(std::exchange(_begin.ptr, {}));
} else {
return managed_bytes();
}
}
@@ -456,15 +470,17 @@ public:
// the clear() calls then writes will not involve any memory allocations,
// except for the first write made on this instance.
future<> clear_gently() noexcept {
if (!_begin) {
if (!_begin.ptr) {
return make_ready_future<>();
}
_begin->offset = 0;
_begin->frag_size = 0;
_current = _begin.ptr;
_size = 0;
return do_until([this] { return !_begin->next; }, [this] {
// move next->next first to avoid it being recursively destroyed
// in ~chunk when _begin->next is move-assigned.
auto next = std::move(_begin->next->next);
return do_until([this] { return !_begin.ptr->next; }, [this] {
auto second_chunk = _begin.ptr->next;
auto next = second_chunk->next;
second_chunk->~chunk();
::free(second_chunk);
_begin->next = std::move(next);
return make_ready_future<>();
});

View File

@@ -10,19 +10,19 @@
#include <vector>
#include "row_cache.hh"
#include "mutation_reader.hh"
#include "mutation_fragment.hh"
#include "query-request.hh"
#include "partition_snapshot_row_cursor.hh"
#include "range_tombstone_assembler.hh"
#include "read_context.hh"
#include "flat_mutation_reader.hh"
#include "readers/delegating_v2.hh"
#include "clustering_key_filter.hh"
namespace cache {
extern logging::logger clogger;
class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
class cache_flat_mutation_reader final : public flat_mutation_reader_v2::impl {
enum class state {
before_static_row,
@@ -51,6 +51,46 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
end_of_stream
};
enum class source {
cache = 0,
underlying = 1,
};
// Merges range tombstone change streams coming from underlying and the cache.
// Ensures no range tombstone change fragment is emitted when there is no
// actual change in the effective tombstone.
class range_tombstone_change_merger {
const schema& _schema;
position_in_partition _pos;
tombstone _current_tombstone;
std::array<tombstone, 2> _tombstones;
private:
std::optional<range_tombstone_change> do_flush(position_in_partition pos, bool end_of_range) {
std::optional<range_tombstone_change> ret;
position_in_partition::tri_compare cmp(_schema);
const auto res = cmp(_pos, pos);
const auto should_flush = end_of_range ? res <= 0 : res < 0;
if (should_flush) {
auto merged_tomb = std::max(_tombstones.front(), _tombstones.back());
if (merged_tomb != _current_tombstone) {
_current_tombstone = merged_tomb;
ret.emplace(_pos, _current_tombstone);
}
_pos = std::move(pos);
}
return ret;
}
public:
range_tombstone_change_merger(const schema& s) : _schema(s), _pos(position_in_partition::before_all_clustered_rows()), _tombstones{}
{ }
std::optional<range_tombstone_change> apply(source src, range_tombstone_change&& rtc) {
auto ret = do_flush(rtc.position(), false);
_tombstones[static_cast<size_t>(src)] = rtc.tombstone();
return ret;
}
std::optional<range_tombstone_change> flush(position_in_partition_view pos, bool end_of_range) {
return do_flush(position_in_partition(pos), end_of_range);
}
};
partition_snapshot_ptr _snp;
query::clustering_key_filter_ranges _ck_ranges; // Query schema domain, reversed reads use native order
@@ -66,6 +106,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// range_tombstones with positions <= _lower_bound.
position_in_partition _lower_bound; // Query schema domain
position_in_partition_view _upper_bound; // Query schema domain
std::optional<position_in_partition> _underlying_upper_bound; // Query schema domain
// cache_flat_mutation_reader may be constructed either
// with a read_context&, where it knows that the read_context
@@ -80,6 +121,19 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
read_context& _read_context;
partition_snapshot_row_cursor _next_row;
range_tombstone_change_generator _rt_gen; // cache -> reader
range_tombstone_assembler _rt_assembler; // underlying -> cache
range_tombstone_change_merger _rt_merger; // {cache, underlying} -> reader
// When the read moves to the underlying, the read range will be
// (_lower_bound, x], where x is either _next_row.position() or _upper_bound.
// In the former case (x is _next_row.position()), underlying can emit
// a range tombstone change for after_key(x), which is outside the range.
// We can't push this fragment into the buffer straight away, the cache may
// have fragments with smaller position. So we save it here and flush it when
// a fragment with a larger position is seen.
std::optional<mutation_fragment_v2> _queued_underlying_fragment;
state _state = state::before_static_row;
bool _next_row_in_range = false;
@@ -98,8 +152,8 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// Points to the underlying reader conforming to _schema,
// either to *_underlying_holder or _read_context.underlying().underlying().
flat_mutation_reader* _underlying = nullptr;
flat_mutation_reader_opt _underlying_holder;
flat_mutation_reader_v2* _underlying = nullptr;
flat_mutation_reader_v2_opt _underlying_holder;
future<> do_fill_buffer();
future<> ensure_underlying();
@@ -110,11 +164,13 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
void move_to_range(query::clustering_row_ranges::const_iterator);
void move_to_next_entry();
void maybe_drop_last_entry() noexcept;
void flush_tombstones(position_in_partition_view, bool end_of_range = false);
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
void add_clustering_row_to_buffer(mutation_fragment_v2&&);
void add_to_buffer(range_tombstone_change&&, source);
void do_add_to_buffer(range_tombstone_change&&);
void add_range_tombstone_to_buffer(range_tombstone&&);
void add_to_buffer(mutation_fragment&&);
void add_to_buffer(mutation_fragment_v2&&);
future<> read_from_underlying();
void start_reading_from_underlying();
bool after_current_range(position_in_partition_view position);
@@ -131,9 +187,9 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// if !_read_context.is_reversed() then _last_row is valid after this or the population lower bound
// is before all rows (so _last_row doesn't point at any entry).
bool ensure_population_lower_bound();
void maybe_add_to_cache(const mutation_fragment& mf);
void maybe_add_to_cache(const mutation_fragment_v2& mf);
void maybe_add_to_cache(const clustering_row& cr);
void maybe_add_to_cache(const range_tombstone& rt);
void maybe_add_to_cache(const range_tombstone_change& rtc);
void maybe_add_to_cache(const static_row& sr);
void maybe_set_static_row_continuous();
void finish_reader() {
@@ -177,7 +233,7 @@ public:
read_context& ctx,
partition_snapshot_ptr snp,
row_cache& cache)
: flat_mutation_reader::impl(std::move(s), ctx.permit())
: flat_mutation_reader_v2::impl(std::move(s), ctx.permit())
, _snp(std::move(snp))
, _ck_ranges(std::move(crr))
, _ck_ranges_curr(_ck_ranges.begin())
@@ -188,6 +244,8 @@ public:
, _read_context_holder()
, _read_context(ctx) // ctx is owned by the caller, who's responsible for closing it.
, _next_row(*_schema, *_snp, false, _read_context.is_reversed())
, _rt_gen(*_schema)
, _rt_merger(*_schema)
{
clogger.trace("csm {}: table={}.{}, reversed={}, snap={}", fmt::ptr(this), _schema->ks_name(), _schema->cf_name(), _read_context.is_reversed(),
fmt::ptr(&*_snp));
@@ -238,13 +296,13 @@ future<> cache_flat_mutation_reader::process_static_row() {
return _snp->static_row(_read_context.digest_requested());
});
if (!sr.empty()) {
push_mutation_fragment(mutation_fragment(*_schema, _permit, std::move(sr)));
push_mutation_fragment(*_schema, _permit, std::move(sr));
}
return make_ready_future<>();
} else {
_read_context.cache().on_row_miss();
return ensure_underlying().then([this] {
return (*_underlying)().then([this] (mutation_fragment_opt&& sr) {
return (*_underlying)().then([this] (mutation_fragment_v2_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
@@ -294,7 +352,7 @@ future<> cache_flat_mutation_reader::ensure_underlying() {
return make_ready_future<>();
}
return _read_context.ensure_underlying().then([this] {
flat_mutation_reader& ctx_underlying = _read_context.underlying().underlying();
flat_mutation_reader_v2& ctx_underlying = _read_context.underlying().underlying();
if (ctx_underlying.schema() != _schema) {
_underlying_holder = make_delegating_reader(ctx_underlying);
_underlying_holder->upgrade_schema(_schema);
@@ -318,9 +376,9 @@ future<> cache_flat_mutation_reader::do_fill_buffer() {
if (!_read_context.partition_exists()) {
return read_from_underlying();
}
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
_underlying_upper_bound = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _underlying->fast_forward_to(position_range{_lower_bound, std::move(end)}).then([this] {
return _underlying->fast_forward_to(position_range{_lower_bound, *_underlying_upper_bound}).then([this] {
return read_from_underlying();
});
}
@@ -363,12 +421,13 @@ inline
future<> cache_flat_mutation_reader::read_from_underlying() {
return consume_mutation_fragments_until(*_underlying,
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
[this] (mutation_fragment_v2 mf) {
_read_context.cache().on_row_miss();
maybe_add_to_cache(mf);
add_to_buffer(std::move(mf));
},
[this] {
_underlying_upper_bound.reset();
_state = state::reading_from_cache;
_lsa_manager.run_in_update_section([this] {
auto same_pos = _next_row.maybe_refresh();
@@ -495,9 +554,9 @@ void cache_flat_mutation_reader::maybe_update_continuity() {
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const mutation_fragment& mf) {
if (mf.is_range_tombstone()) {
maybe_add_to_cache(mf.as_range_tombstone());
void cache_flat_mutation_reader::maybe_add_to_cache(const mutation_fragment_v2& mf) {
if (mf.is_range_tombstone_change()) {
maybe_add_to_cache(mf.as_range_tombstone_change());
} else {
assert(mf.is_clustering_row());
const clustering_row& cr = mf.as_clustering_row();
@@ -513,9 +572,16 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
_read_context.cache().on_mispopulate();
return;
}
auto rt_opt = _rt_assembler.flush(*_schema, position_in_partition::after_key(cr.key()));
clogger.trace("csm {}: populate({})", fmt::ptr(this), clustering_row::printer(*_schema, cr));
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
_lsa_manager.run_in_update_section_with_allocator([this, &cr, &rt_opt] {
mutation_partition& mp = _snp->version()->partition();
if (rt_opt) {
clogger.trace("csm {}: populate flushed rt({})", fmt::ptr(this), *rt_opt);
mp.mutable_row_tombstones().apply_monotonically(table_schema(), to_table_domain(range_tombstone(*rt_opt)));
}
rows_entry::tri_compare cmp(table_schema());
if (_read_context.digest_requested()) {
@@ -571,11 +637,6 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
auto upper_bound = _next_row_in_range ? next_lower_bound : _upper_bound;
if (_snp->range_tombstones(_lower_bound, upper_bound, [&] (range_tombstone rts) {
position_in_partition::less_compare less(*_schema);
// Avoid emitting overlapping range tombstones for performance reasons.
if (less(upper_bound, rts.end_position())) {
rts.set_end(*_schema, upper_bound);
}
add_range_tombstone_to_buffer(std::move(rts));
return stop_iteration(_lower_bound_changed && is_buffer_full());
}, _read_context.is_reversed()) == stop_iteration::no) {
@@ -599,6 +660,10 @@ void cache_flat_mutation_reader::move_to_end() {
inline
void cache_flat_mutation_reader::move_to_next_range() {
if (_queued_underlying_fragment) {
add_to_buffer(*std::exchange(_queued_underlying_fragment, {}));
}
flush_tombstones(position_in_partition::for_range_end(*_ck_ranges_curr), true);
auto next_it = std::next(_ck_ranges_curr);
if (next_it == _ck_ranges_end) {
move_to_end();
@@ -615,6 +680,7 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
_last_row = nullptr;
_lower_bound = std::move(lb);
_upper_bound = std::move(ub);
_rt_gen.trim(_lower_bound);
_lower_bound_changed = true;
_ck_ranges_curr = next_it;
auto adjacent = _next_row.advance_to(_lower_bound);
@@ -661,7 +727,7 @@ void cache_flat_mutation_reader::maybe_drop_last_entry() noexcept {
// This prevents unnecessary dummy entries from accumulating in cache and slowing down scans.
//
// Eviction can happen only from oldest versions to preserve the continuity non-overlapping rule
// (See docs/design-notes/row_cache.md)
// (See docs/dev/row_cache.md)
//
if (_last_row
&& !_read_context.is_reversed() // FIXME
@@ -671,7 +737,9 @@ void cache_flat_mutation_reader::maybe_drop_last_entry() noexcept {
&& _snp->at_oldest_version()) {
with_allocator(_snp->region().allocator(), [&] {
_last_row->on_evicted(_read_context.cache()._tracker);
cache_tracker& tracker = _read_context.cache()._tracker;
tracker.get_lru().remove(*_last_row);
_last_row->on_evicted(tracker);
});
_last_row = nullptr;
@@ -706,27 +774,49 @@ void cache_flat_mutation_reader::move_to_next_entry() {
}
}
void cache_flat_mutation_reader::flush_tombstones(position_in_partition_view pos, bool end_of_range) {
// Ensure position is appropriate for range tombstone bound
pos = position_in_partition_view::after_key(pos);
clogger.trace("csm {}: flush_tombstones({}) end_of_range: {}", fmt::ptr(this), pos, end_of_range);
_rt_gen.flush(pos, [this] (range_tombstone_change&& rtc) {
add_to_buffer(std::move(rtc), source::cache);
}, end_of_range);
if (auto rtc_opt = _rt_merger.flush(pos, end_of_range)) {
do_add_to_buffer(std::move(*rtc_opt));
}
}
inline
void cache_flat_mutation_reader::add_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_to_buffer({})", fmt::ptr(this), mutation_fragment::printer(*_schema, mf));
void cache_flat_mutation_reader::add_to_buffer(mutation_fragment_v2&& mf) {
clogger.trace("csm {}: add_to_buffer({})", fmt::ptr(this), mutation_fragment_v2::printer(*_schema, mf));
position_in_partition::less_compare less(*_schema);
if (_underlying_upper_bound && less(*_underlying_upper_bound, mf.position())) {
_queued_underlying_fragment = std::move(mf);
return;
}
flush_tombstones(mf.position());
if (mf.is_clustering_row()) {
add_clustering_row_to_buffer(std::move(mf));
} else {
assert(mf.is_range_tombstone());
add_to_buffer(std::move(mf).as_range_tombstone());
assert(mf.is_range_tombstone_change());
add_to_buffer(std::move(mf).as_range_tombstone_change(), source::underlying);
}
}
inline
void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_cursor& row) {
position_in_partition::less_compare less(*_schema);
if (_queued_underlying_fragment && less(_queued_underlying_fragment->position(), row.position())) {
add_to_buffer(*std::exchange(_queued_underlying_fragment, {}));
}
if (!row.dummy()) {
_read_context.cache().on_row_hit();
if (_read_context.digest_requested()) {
row.latest_row().cells().prepare_hash(table_schema(), column_kind::regular_column);
}
add_clustering_row_to_buffer(mutation_fragment(*_schema, _permit, row.row()));
flush_tombstones(position_in_partition_view::for_key(row.key()));
add_clustering_row_to_buffer(mutation_fragment_v2(*_schema, _permit, row.row()));
} else {
position_in_partition::less_compare less(*_schema);
if (less(_lower_bound, row.position())) {
_lower_bound = row.position();
_lower_bound_changed = true;
@@ -739,8 +829,8 @@ void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_curs
// (1) no fragment with position >= _lower_bound was pushed yet
// (2) If _lower_bound > mf.position(), mf was emitted
inline
void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_clustering_row_to_buffer({})", fmt::ptr(this), mutation_fragment::printer(*_schema, mf));
void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment_v2&& mf) {
clogger.trace("csm {}: add_clustering_row_to_buffer({})", fmt::ptr(this), mutation_fragment_v2::printer(*_schema, mf));
auto& row = mf.as_clustering_row();
auto new_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(mf));
@@ -752,32 +842,46 @@ void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&
}
inline
void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {
clogger.trace("csm {}: add_to_buffer({})", fmt::ptr(this), rt);
// This guarantees that rt starts after any emitted clustering_row
// and not before any emitted range tombstone.
position_in_partition::less_compare less(*_schema);
if (less(_lower_bound, rt.end_position())) {
add_range_tombstone_to_buffer(std::move(rt));
void cache_flat_mutation_reader::add_to_buffer(range_tombstone_change&& rtc, source src) {
clogger.trace("csm {}: add_to_buffer({})", fmt::ptr(this), rtc);
if (auto rtc_opt = _rt_merger.apply(src, std::move(rtc))) {
do_add_to_buffer(std::move(*rtc_opt));
}
}
inline
void cache_flat_mutation_reader::do_add_to_buffer(range_tombstone_change&& rtc) {
clogger.trace("csm {}: push({})", fmt::ptr(this), rtc);
position_in_partition::less_compare less(*_schema);
auto lower_bound_changed = less(_lower_bound, rtc.position());
_lower_bound = position_in_partition(rtc.position());
_lower_bound_changed = lower_bound_changed;
push_mutation_fragment(*_schema, _permit, std::move(rtc));
_read_context.cache()._tracker.on_range_tombstone_read();
}
inline
void cache_flat_mutation_reader::add_range_tombstone_to_buffer(range_tombstone&& rt) {
position_in_partition::less_compare less(*_schema);
if (_queued_underlying_fragment && less(_queued_underlying_fragment->position(), rt.position())) {
add_to_buffer(*std::exchange(_queued_underlying_fragment, {}));
}
clogger.trace("csm {}: add_to_buffer({})", fmt::ptr(this), rt);
if (!less(_lower_bound, rt.position())) {
rt.set_start(_lower_bound);
} else {
_lower_bound = position_in_partition(rt.position());
_lower_bound_changed = true;
}
clogger.trace("csm {}: push({})", fmt::ptr(this), rt);
push_mutation_fragment(*_schema, _permit, std::move(rt));
_read_context.cache()._tracker.on_range_tombstone_read();
flush_tombstones(rt.position());
_rt_gen.consume(std::move(rt));
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone& rt) {
void cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone_change& rtc) {
clogger.trace("csm {}: maybe_add_to_cache({})", fmt::ptr(this), rtc);
auto rt_opt = _rt_assembler.consume(*_schema, range_tombstone_change(rtc));
if (!rt_opt) {
return;
}
const auto& rt = *rt_opt;
if (can_populate()) {
clogger.trace("csm {}: maybe_add_to_cache({})", fmt::ptr(this), rt);
_lsa_manager.run_in_update_section_with_allocator([&] {
@@ -825,25 +929,25 @@ bool cache_flat_mutation_reader::can_populate() const {
// pass a reference to ctx to cache_flat_mutation_reader
// keeping its ownership at caller's.
inline flat_mutation_reader make_cache_flat_mutation_reader(schema_ptr s,
inline flat_mutation_reader_v2 make_cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges crr,
row_cache& cache,
cache::read_context& ctx,
partition_snapshot_ptr snp)
{
return make_flat_mutation_reader<cache::cache_flat_mutation_reader>(
return make_flat_mutation_reader_v2<cache::cache_flat_mutation_reader>(
std::move(s), std::move(dk), std::move(crr), ctx, std::move(snp), cache);
}
// transfer ownership of ctx to cache_flat_mutation_reader
inline flat_mutation_reader make_cache_flat_mutation_reader(schema_ptr s,
inline flat_mutation_reader_v2 make_cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges crr,
row_cache& cache,
std::unique_ptr<cache::read_context> unique_ctx,
partition_snapshot_ptr snp)
{
return make_flat_mutation_reader<cache::cache_flat_mutation_reader>(
return make_flat_mutation_reader_v2<cache::cache_flat_mutation_reader>(
std::move(s), std::move(dk), std::move(crr), std::move(unique_ctx), std::move(snp), cache);
}

View File

@@ -15,14 +15,7 @@
#include "converting_mutation_partition_applier.hh"
#include "hashing_partition_visitor.hh"
#include "utils/UUID.hh"
#include "serializer.hh"
#include "idl/uuid.dist.hh"
#include "idl/keys.dist.hh"
#include "idl/mutation.dist.hh"
#include "serializer_impl.hh"
#include "serialization_visitors.hh"
#include "idl/uuid.dist.impl.hh"
#include "idl/keys.dist.impl.hh"
#include "idl/mutation.dist.impl.hh"
#include <iostream>
@@ -44,7 +37,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
}).end_canonical_mutation();
}
utils::UUID canonical_mutation::column_family_id() const {
table_id canonical_mutation::column_family_id() const {
auto in = ser::as_input_stream(_data);
auto mv = ser::deserialize(in, boost::type<ser::canonical_mutation_view>());
return mv.table_id();
@@ -120,17 +113,19 @@ std::ostream& operator<<(std::ostream& os, const canonical_mutation& cm) {
auto&& entry = _cm.static_column_at(id);
fmt::print(_os, "static column {} {}", bytes_to_text(entry.name()), collection_mutation_view::printer(*entry.type(), cmv));
}
virtual void accept_row_tombstone(range_tombstone rt) override {
virtual stop_iteration accept_row_tombstone(range_tombstone rt) override {
print_separator();
fmt::print(_os, "row tombstone {}", rt);
return stop_iteration::no;
}
virtual void accept_row(position_in_partition_view pipv, row_tombstone rt, row_marker rm, is_dummy, is_continuous) override {
virtual stop_iteration accept_row(position_in_partition_view pipv, row_tombstone rt, row_marker rm, is_dummy, is_continuous) override {
if (_in_row) {
fmt::print(_os, "}}, ");
}
fmt::print(_os, "{{row {} tombstone {} marker {}", pipv, rt, rm);
_in_row = true;
_first = false;
return stop_iteration::no;
}
virtual void accept_row_cell(column_id id, atomic_cell ac) override {
print_separator();

View File

@@ -14,10 +14,6 @@
#include "bytes_ostream.hh"
#include <iosfwd>
namespace utils {
class UUID;
} // namespace utils
// Immutable mutation form which can be read using any schema version of the same table.
// Safe to access from other shards via const&.
// Safe to pass serialized across nodes.
@@ -39,7 +35,7 @@ public:
// is not intended, user should sync the schema first.
mutation to_mutation(schema_ptr) const;
utils::UUID column_family_id() const;
table_id column_family_id() const;
const bytes_ostream& representation() const { return _data; }

View File

@@ -14,12 +14,12 @@
#include "cdc/generation.hh"
#include "keys.hh"
static const sstring cdc_partitioner_name = "com.scylladb.dht.CDCPartitioner";
namespace cdc {
const sstring cdc_partitioner::classname = "com.scylladb.dht.CDCPartitioner";
const sstring cdc_partitioner::name() const {
return cdc_partitioner_name;
return classname;
}
static dht::token to_token(int64_t value) {
@@ -48,7 +48,7 @@ cdc_partitioner::get_token(const schema& s, partition_key_view key) const {
}
using registry = class_registrator<dht::i_partitioner, cdc_partitioner>;
static registry registrator(cdc_partitioner_name);
static registry registrator(cdc::cdc_partitioner::classname);
static registry registrator_short_name("CDCPartitioner");
}

View File

@@ -25,6 +25,8 @@ class key_view;
namespace cdc {
struct cdc_partitioner final : public dht::i_partitioner {
static const sstring classname;
cdc_partitioner() = default;
virtual const sstring name() const override;
virtual dht::token get_token(const schema& s, partition_key_view key) const override;

View File

@@ -340,7 +340,7 @@ future<cdc::generation_id> generation_service::make_new_generation(const std::un
auto normal_token_owners = tmptr->count_normal_token_owners();
assert(normal_token_owners);
if (_feature_service.cluster_supports_cdc_generations_v2()) {
if (_feature_service.cdc_generations_v2) {
auto uuid = utils::make_random_uuid();
cdc_log.info("Inserting new generation data at UUID {}", uuid);
// This may take a while.
@@ -455,7 +455,12 @@ static future<> update_streams_description(
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) -> future<> {
while (true) {
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
try {
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
} catch (seastar::sleep_aborted&) {
cdc_log.warn( "Aborted update CDC description table with generation {}", gen_id);
co_return;
}
try {
co_await do_update_streams_description(gen_id, *sys_dist_ks, { get_num_token_owners() });
co_return;
@@ -578,7 +583,7 @@ future<> generation_service::maybe_rewrite_streams_descriptions() {
co_return;
}
if (co_await db::system_keyspace::cdc_is_rewritten()) {
if (co_await _sys_ks.local().cdc_is_rewritten()) {
co_return;
}
@@ -602,13 +607,13 @@ future<> generation_service::maybe_rewrite_streams_descriptions() {
continue;
}
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id()), cdc_opts.ttl()});
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id().uuid()), cdc_opts.ttl()});
}
if (times_and_ttls.empty()) {
// There's no point in rewriting old generations' streams (they don't contain any data).
cdc_log.info("No CDC log tables present, not rewriting stream tables.");
co_return co_await db::system_keyspace::cdc_set_rewritten(std::nullopt);
co_return co_await _sys_ks.local().cdc_set_rewritten(std::nullopt);
}
auto get_num_token_owners = [tm = _token_metadata.get()] { return tm->count_normal_token_owners(); };
@@ -626,7 +631,7 @@ future<> generation_service::maybe_rewrite_streams_descriptions() {
std::move(get_num_token_owners),
_abort_src);
co_await db::system_keyspace::cdc_set_rewritten(last_rewritten);
co_await _sys_ks.local().cdc_set_rewritten(last_rewritten);
}
static void assert_shard_zero(const sstring& where) {
@@ -670,11 +675,13 @@ constexpr char could_not_retrieve_msg_template[]
generation_service::generation_service(
config cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<db::system_keyspace>& sys_ks,
abort_source& abort_src, const locator::shared_token_metadata& stm, gms::feature_service& f,
replica::database& db)
: _cfg(std::move(cfg))
, _gossiper(g)
, _sys_dist_ks(sys_dist_ks)
, _sys_ks(sys_ks)
, _abort_src(abort_src)
, _token_metadata(stm)
, _feature_service(f)
@@ -702,7 +709,7 @@ generation_service::~generation_service() {
future<> generation_service::after_join(std::optional<cdc::generation_id>&& startup_gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
assert(db::system_keyspace::bootstrap_complete());
assert(_sys_ks.local().bootstrap_complete());
_gen_id = std::move(startup_gen_id);
_gossiper.register_(shared_from_this());
@@ -776,7 +783,7 @@ future<> generation_service::check_and_repair_cdc_streams() {
cdc_log.warn("check_and_repair_cdc_streams: no generation observed in gossip");
should_regenerate = true;
} else if (std::holds_alternative<cdc::generation_id_v1>(*latest)
&& _feature_service.cluster_supports_cdc_generations_v2()) {
&& _feature_service.cdc_generations_v2) {
cdc_log.info(
"Cluster still using CDC generation storage format V1 (id: {}), even though it already understands the V2 format."
" Creating a new generation using V2.", *latest);
@@ -868,7 +875,7 @@ future<> generation_service::check_and_repair_cdc_streams() {
{ gms::application_state::CDC_GENERATION_ID, gms::versioned_value::cdc_generation_id(new_gen_id) },
{ gms::application_state::STATUS, *status }
});
co_await db::system_keyspace::update_cdc_generation_id(new_gen_id);
co_await _sys_ks.local().update_cdc_generation_id(new_gen_id);
}
future<> generation_service::handle_cdc_generation(std::optional<cdc::generation_id> gen_id) {
@@ -878,7 +885,7 @@ future<> generation_service::handle_cdc_generation(std::optional<cdc::generation
co_return;
}
if (!db::system_keyspace::bootstrap_complete() || !_sys_dist_ks.local_is_initialized()
if (!_sys_ks.local().bootstrap_complete() || !_sys_dist_ks.local_is_initialized()
|| !_sys_dist_ks.local().started()) {
// The service should not be listening for generation changes until after the node
// is bootstrapped. Therefore we would previously assume that this condition
@@ -1011,7 +1018,7 @@ future<bool> generation_service::do_handle_cdc_generation(cdc::generation_id gen
// The assumption follows from the requirement of bootstrapping nodes sequentially.
if (!_gen_id || get_ts(*_gen_id) < get_ts(gen_id)) {
_gen_id = gen_id;
co_await db::system_keyspace::update_cdc_generation_id(gen_id);
co_await _sys_ks.local().update_cdc_generation_id(gen_id);
co_await _gossiper.add_local_application_state(
gms::application_state::CDC_GENERATION_ID, gms::versioned_value::cdc_generation_id(gen_id));
}

View File

@@ -15,6 +15,7 @@
namespace db {
class system_distributed_keyspace;
class system_keyspace;
}
namespace gms {
@@ -51,6 +52,7 @@ private:
config _cfg;
gms::gossiper& _gossiper;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<db::system_keyspace>& _sys_ks;
abort_source& _abort_src;
const locator::shared_token_metadata& _token_metadata;
gms::feature_service& _feature_service;
@@ -77,7 +79,9 @@ private:
future<> _cdc_streams_rewrite_complete = make_ready_future<>();
public:
generation_service(config cfg, gms::gossiper&,
sharded<db::system_distributed_keyspace>&, abort_source&, const locator::shared_token_metadata&,
sharded<db::system_distributed_keyspace>&,
sharded<db::system_keyspace>& sys_ks,
abort_source&, const locator::shared_token_metadata&,
gms::feature_service&, replica::database& db);
future<> stop();

View File

@@ -20,6 +20,7 @@
#include "cdc/cdc_options.hh"
#include "cdc/change_visitor.hh"
#include "cdc/metadata.hh"
#include "cdc/cdc_partitioner.hh"
#include "bytes.hh"
#include "replica/database.hh"
#include "db/schema_tables.hh"
@@ -30,7 +31,6 @@
#include "service/storage_proxy.hh"
#include "types/tuple.hh"
#include "cql3/statements/select_statement.hh"
#include "cql3/multi_column_relation.hh"
#include "cql3/untyped_result_set.hh"
#include "log.hh"
#include "utils/rjson.hh"
@@ -59,7 +59,7 @@ using namespace std::chrono_literals;
logging::logger cdc_log("cdc");
namespace cdc {
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {}, schema_ptr = nullptr);
static schema_ptr create_log_schema(const schema&, std::optional<table_id> = {}, schema_ptr = nullptr);
}
static constexpr auto cdc_group_name = "cdc";
@@ -169,9 +169,8 @@ public:
// in seastar thread
auto log_schema = create_log_schema(schema);
auto& keyspace = db.find_keyspace(schema.ks_name());
auto log_mut = db::schema_tables::make_create_table_mutations(keyspace.metadata(), log_schema, timestamp);
auto log_mut = db::schema_tables::make_create_table_mutations(log_schema, timestamp);
mutations.insert(mutations.end(), std::make_move_iterator(log_mut.begin()), std::make_move_iterator(log_mut.end()));
}
@@ -181,36 +180,36 @@ public:
bool is_cdc = new_schema.cdc_options().enabled();
bool was_cdc = old_schema.cdc_options().enabled();
// we need to create or modify the log & stream schemas iff either we changed cdc status (was != is)
// or if cdc is on now unconditionally, since then any actual base schema changes will affect the column
// etc.
if (was_cdc || is_cdc) {
// if we are turning off cdc we can skip this, since even if columns change etc,
// any writer should see cdc -> off together with any actual schema changes to
// base table, so should never try to write to non-existent log column etc.
// note that if user has set ttl=0 in cdc options, he is still reponsible
// for emptying the log.
if (is_cdc) {
auto& db = _ctxt._proxy.get_db().local();
if (!was_cdc) {
check_that_cdc_log_table_does_not_exist(db, new_schema, log_name(new_schema.cf_name()));
}
if (is_cdc) {
check_for_attempt_to_create_nested_cdc_log(db, new_schema);
ensure_that_table_has_no_counter_columns(new_schema);
}
auto logname = log_name(old_schema.cf_name());
auto& keyspace = db.find_keyspace(old_schema.ks_name());
auto log_schema = was_cdc ? db.find_column_family(old_schema.ks_name(), logname).schema() : nullptr;
auto has_cdc_log = db.has_schema(old_schema.ks_name(), logname);
auto log_schema = has_cdc_log ? db.find_schema(old_schema.ks_name(), logname) : nullptr;
if (!is_cdc) {
auto log_mut = db::schema_tables::make_drop_table_mutations(keyspace.metadata(), log_schema, timestamp);
mutations.insert(mutations.end(), std::make_move_iterator(log_mut.begin()), std::make_move_iterator(log_mut.end()));
return;
if (!was_cdc && has_cdc_log) {
// make sure the apparent log table really is a cdc log (not user table)
// we just check the partitioner - since user tables should _not_ be able
// set/use this.
if (log_schema->get_partitioner().name() != cdc::cdc_partitioner::classname) {
// will throw
check_that_cdc_log_table_does_not_exist(db, old_schema, logname);
}
}
check_for_attempt_to_create_nested_cdc_log(db, new_schema);
ensure_that_table_has_no_counter_columns(new_schema);
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt, log_schema);
auto log_mut = log_schema
? db::schema_tables::make_update_table_mutations(db, keyspace.metadata(), log_schema, new_log_schema, timestamp, false)
: db::schema_tables::make_create_table_mutations(keyspace.metadata(), new_log_schema, timestamp)
: db::schema_tables::make_create_table_mutations(new_log_schema, timestamp)
;
mutations.insert(mutations.end(), std::make_move_iterator(log_mut.begin()), std::make_move_iterator(log_mut.end()));
@@ -218,14 +217,16 @@ public:
}
void on_before_drop_column_family(const schema& schema, std::vector<mutation>& mutations, api::timestamp_type timestamp) override {
if (schema.cdc_options().enabled()) {
auto logname = log_name(schema.cf_name());
auto& db = _ctxt._proxy.get_db().local();
auto logname = log_name(schema.cf_name());
auto& db = _ctxt._proxy.get_db().local();
auto has_cdc_log = db.has_schema(schema.ks_name(), logname);
if (has_cdc_log) {
auto log_schema = db.find_schema(schema.ks_name(), logname);
if (log_schema->get_partitioner().name() != cdc::cdc_partitioner::classname) {
return;
}
auto& keyspace = db.find_keyspace(schema.ks_name());
auto log_schema = db.find_column_family(schema.ks_name(), logname).schema();
auto log_mut = db::schema_tables::make_drop_table_mutations(keyspace.metadata(), log_schema, timestamp);
mutations.insert(mutations.end(), std::make_move_iterator(log_mut.begin()), std::make_move_iterator(log_mut.end()));
}
}
@@ -407,7 +408,7 @@ static const sstring cdc_meta_column_prefix = "cdc$";
static const sstring cdc_deleted_column_prefix = cdc_meta_column_prefix + "deleted_";
static const sstring cdc_deleted_elements_column_prefix = cdc_meta_column_prefix + "deleted_elements_";
static bool is_log_name(const std::string_view& table_name) {
bool is_log_name(const std::string_view& table_name) {
return boost::ends_with(table_name, cdc_log_suffix);
}
@@ -484,9 +485,9 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
return to_bytes(cdc_deleted_elements_column_prefix) + column_name;
}
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid, schema_ptr old) {
static schema_ptr create_log_schema(const schema& s, std::optional<table_id> uuid, schema_ptr old) {
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.with_partitioner("com.scylladb.dht.CDCPartitioner");
b.with_partitioner(cdc::cdc_partitioner::classname);
b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);
b.set_comment(fmt::format("CDC log for {}.{}", s.ks_name(), s.cf_name()));
auto ttl_seconds = s.cdc_options().ttl();
@@ -1654,7 +1655,8 @@ public:
auto partition_slice = query::partition_slice(std::move(bounds), std::move(static_columns), std::move(regular_columns), std::move(opts));
const auto max_result_size = _ctx._proxy.get_max_result_size(partition_slice);
auto command = ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(), partition_slice, query::max_result_size(max_result_size), query::row_limit(row_limit));
const auto tombstone_limit = query::tombstone_limit(_ctx._proxy.get_tombstone_limit());
auto command = ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(), partition_slice, query::max_result_size(max_result_size), tombstone_limit, query::row_limit(row_limit));
const auto select_cl = adjust_cl(write_cl);

View File

@@ -60,6 +60,8 @@ struct operation_result_tracker;
class db_context;
class metadata;
bool is_log_name(const std::string_view& table_name);
/// \brief CDC service, responsible for schema listeners
///
/// CDC service will listen for schema changes and iff CDC is enabled/changed

View File

@@ -8,8 +8,8 @@
#pragma once
#include "seastar/core/file.hh"
#include "seastar/core/seastar.hh"
#include <seastar/core/file.hh>
#include <seastar/core/seastar.hh>
#include "utils/disk-error-handler.hh"
#include "seastarx.hh"

28
client_data.cc Normal file
View File

@@ -0,0 +1,28 @@
/*
* Copyright (C) 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "client_data.hh"
#include <stdexcept>
sstring to_string(client_type ct) {
switch (ct) {
case client_type::cql: return "cql";
case client_type::thrift: return "thrift";
case client_type::alternator: return "alternator";
}
throw std::runtime_error("Invalid client_type");
}
sstring to_string(client_connection_stage ccs) {
switch (ccs) {
case client_connection_stage::established: return "ESTABLISHED";
case client_connection_stage::authenticating: return "AUTHENTICATING";
case client_connection_stage::ready: return "READY";
}
throw std::runtime_error("Invalid client_connection_stage");
}

51
client_data.hh Normal file
View File

@@ -0,0 +1,51 @@
/*
* Copyright (C) 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#pragma once
#include <seastar/net/inet_address.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include <optional>
enum class client_type {
cql = 0,
thrift,
alternator,
};
sstring to_string(client_type ct);
enum class client_connection_stage {
established = 0,
authenticating,
ready,
};
sstring to_string(client_connection_stage ct);
// Representation of a row in `system.clients'. std::optionals are for nullable cells.
struct client_data {
net::inet_address ip;
int32_t port;
client_type ct;
client_connection_stage connection_stage = client_connection_stage::established;
int32_t shard_id; /// ID of server-side shard which is processing the connection.
std::optional<sstring> driver_name;
std::optional<sstring> driver_version;
std::optional<sstring> hostname;
std::optional<int32_t> protocol_version;
std::optional<sstring> ssl_cipher_suite;
std::optional<bool> ssl_enabled;
std::optional<sstring> ssl_protocol;
std::optional<sstring> username;
sstring stage_str() const { return to_string(connection_stage); }
sstring client_type_str() const { return to_string(ct); }
};

View File

@@ -13,16 +13,22 @@
class schema;
class partition_key;
class clustering_row;
struct atomic_cell_view;
struct tombstone;
namespace db::view {
struct view_key_and_action;
}
class column_computation;
using column_computation_ptr = std::unique_ptr<column_computation>;
/*
* Column computation represents a computation performed in order to obtain a value for a computed column.
* Computed columns description is also available at docs/system_schema_keyspace.md. They hold values
* Computed columns description is also available at docs/dev/system_schema_keyspace.md. They hold values
* not provided directly by the user, but rather computed: from other column values and possibly other sources.
* This class is able to serialize/deserialize column computations and perform the computation itself,
* based on given schema, partition key and clustering row. Responsibility for providing enough data
* based on given schema, and partition key. Responsibility for providing enough data
* in the clustering row in order for computation to succeed belongs to the caller. In particular,
* generating a value might involve performing a read-before-write if the computation is performed
* on more values than are present in the update request.
@@ -36,7 +42,19 @@ public:
virtual column_computation_ptr clone() const = 0;
virtual bytes serialize() const = 0;
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const = 0;
virtual bytes compute_value(const schema& schema, const partition_key& key) const = 0;
/*
* depends_on_non_primary_key_column for a column computation is needed to
* detect a case where the primary key of a materialized view depends on a
* non primary key column from the base table, but at the same time, the view
* itself doesn't have non-primary key columns. This is an issue, since as
* for now, it was assumed that no non-primary key columns in view schema
* meant that the update cannot change the primary key of the view, and
* therefore the update path can be simplified.
*/
virtual bool depends_on_non_primary_key_column() const {
return false;
}
};
/*
@@ -54,7 +72,7 @@ public:
return std::make_unique<legacy_token_column_computation>(*this);
}
virtual bytes serialize() const override;
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;
virtual bytes compute_value(const schema& schema, const partition_key& key) const override;
};
@@ -75,5 +93,53 @@ public:
return std::make_unique<token_column_computation>(*this);
}
virtual bytes serialize() const override;
virtual bytes_opt compute_value(const schema& schema, const partition_key& key, const clustering_row& row) const override;
virtual bytes compute_value(const schema& schema, const partition_key& key) const override;
};
/*
* collection_column_computation is used for a secondary index on a collection
* column. In this case we don't have a single value to compute, but rather we
* want to return multiple values (e.g., all the keys in the collection).
* So this class does not implement the base class's compute_value() -
* instead it implements a new method compute_collection_values(), which
* can return multiple values. This new method is currently called only from
* the materialized-view code which uses collection_column_computation.
*/
class collection_column_computation final : public column_computation {
enum class kind {
keys,
values,
entries,
};
const bytes _collection_name;
const kind _kind;
collection_column_computation(const bytes& collection_name, kind kind) : _collection_name(collection_name), _kind(kind) {}
using collection_kv = std::pair<bytes_view, atomic_cell_view>;
void operate_on_collection_entries(
std::invocable<collection_kv*, collection_kv*, tombstone> auto&& old_and_new_row_func, const schema& schema,
const partition_key& key, const clustering_row& update, const std::optional<clustering_row>& existing) const;
public:
static collection_column_computation for_keys(const bytes& collection_name) {
return {collection_name, kind::keys};
}
static collection_column_computation for_values(const bytes& collection_name) {
return {collection_name, kind::values};
}
static collection_column_computation for_entries(const bytes& collection_name) {
return {collection_name, kind::entries};
}
static column_computation_ptr for_target_type(std::string_view type, const bytes& collection_name);
virtual bytes serialize() const override;
virtual bytes compute_value(const schema& schema, const partition_key& key) const override;
virtual column_computation_ptr clone() const override {
return std::make_unique<collection_column_computation>(*this);
}
virtual bool depends_on_non_primary_key_column() const override {
return true;
}
std::vector<db::view::view_key_and_action> compute_values_with_action(const schema& schema, const partition_key& key, const clustering_row& row, const std::optional<clustering_row>& existing) const;
};

View File

@@ -26,6 +26,7 @@
#include <seastar/core/scheduling.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/util/closeable.hh>
#include <seastar/core/shared_ptr.hh>
#include "sstables/sstables.hh"
#include "sstables/sstable_writer.hh"
@@ -33,7 +34,6 @@
#include "sstables/sstables_manager.hh"
#include "compaction.hh"
#include "compaction_manager.hh"
#include "mutation_reader.hh"
#include "schema.hh"
#include "db/system_keyspace.hh"
#include "service/priority_manager.hh"
@@ -48,7 +48,11 @@
#include "utils/UUID_gen.hh"
#include "utils/utf8.hh"
#include "utils/fmt-compat.hh"
#include "utils/error_injection.hh"
#include "readers/filtering.hh"
#include "readers/compacting.hh"
#include "tombstone_gc.hh"
#include "keys.hh"
namespace sstables {
@@ -86,7 +90,7 @@ compaction_type to_compaction_type(sstring type_name) {
throw std::runtime_error("Invalid Compaction Type Name");
}
static std::string_view to_string(compaction_type type) {
std::string_view to_string(compaction_type type) {
switch (type) {
case compaction_type::Compaction: return "Compact";
case compaction_type::Cleanup: return "Cleanup";
@@ -180,7 +184,7 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
}
static std::vector<shared_sstable> get_uncompacting_sstables(const table_state& table_s, std::vector<shared_sstable> sstables) {
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*table_s.get_sstable_set().all());
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*table_s.main_sstable_set().all());
auto& compacted_undeleted = table_s.compacted_undeleted_sstables();
all_sstables.insert(all_sstables.end(), compacted_undeleted.begin(), compacted_undeleted.end());
boost::sort(all_sstables, [] (const shared_sstable& x, const shared_sstable& y) {
@@ -274,6 +278,18 @@ class compacted_fragments_writer {
stop_func_t _stop_compaction_writer;
std::optional<utils::observer<>> _stop_request_observer;
bool _unclosed_partition = false;
struct partition_state {
dht::decorated_key_opt dk;
// Partition tombstone is saved for the purpose of replicating it to every fragment storing a partition pL.
// Then when reading from the SSTable run, we won't unnecessarily have to open >= 2 fragments, the one which
// contains the tombstone and another one(s) that has the partition slice being queried.
::tombstone tombstone;
// Used to determine whether any active tombstones need closing at EOS.
::tombstone current_emitted_tombstone;
// Track last emitted clustering row, which will be used to close active tombstone if splitting partition
position_in_partition last_pos = position_in_partition::before_all_clustered_rows();
bool is_splitting_partition = false;
} _current_partition;
private:
inline void maybe_abort_compaction();
@@ -283,6 +299,13 @@ private:
consume_end_of_stream();
});
}
void stop_current_writer();
bool can_split_large_partition() const;
void track_last_position(position_in_partition_view pos);
void split_large_partition();
void do_consume_new_partition(const dht::decorated_key& dk);
stop_iteration do_consume_end_of_partition();
public:
explicit compacted_fragments_writer(compaction& c, creator_func_t cpw, stop_func_t scw)
: _c(c)
@@ -301,7 +324,7 @@ public:
void consume_new_partition(const dht::decorated_key& dk);
void consume(tombstone t) { _compaction_writer->writer.consume(t); }
void consume(tombstone t);
stop_iteration consume(static_row&& sr, tombstone, bool) {
maybe_abort_compaction();
return _compaction_writer->writer.consume(std::move(sr));
@@ -309,17 +332,11 @@ public:
stop_iteration consume(static_row&& sr) {
return consume(std::move(sr), tombstone{}, bool{});
}
stop_iteration consume(clustering_row&& cr, row_tombstone, bool) {
maybe_abort_compaction();
return _compaction_writer->writer.consume(std::move(cr));
}
stop_iteration consume(clustering_row&& cr, row_tombstone, bool);
stop_iteration consume(clustering_row&& cr) {
return consume(std::move(cr), row_tombstone{}, bool{});
}
stop_iteration consume(range_tombstone&& rt) {
maybe_abort_compaction();
return _compaction_writer->writer.consume(std::move(rt));
}
stop_iteration consume(range_tombstone_change&& rtc);
stop_iteration consume_end_of_partition();
void consume_end_of_stream();
@@ -390,7 +407,7 @@ struct compaction_read_monitor_generator final : public read_monitor_generator {
}
private:
table_state& _table_s;
std::unordered_map<int64_t, compaction_read_monitor> _generated_monitors;
std::unordered_map<generation_type, compaction_read_monitor> _generated_monitors;
};
class formatted_sstables_list {
@@ -430,7 +447,7 @@ protected:
schema_ptr _schema;
reader_permit _permit;
std::vector<shared_sstable> _sstables;
std::vector<unsigned long> _input_sstable_generations;
std::vector<generation_type> _input_sstable_generations;
// Unused sstables are tracked because if compaction is interrupted we can only delete them.
// Deleting used sstables could potentially result in data loss.
std::unordered_set<shared_sstable> _new_partial_sstables;
@@ -445,10 +462,11 @@ protected:
uint64_t _estimated_partitions = 0;
db::replay_position _rp;
encoding_stats_collector _stats_collector;
bool _can_split_large_partition = false;
bool _contains_multi_fragment_runs = false;
mutation_source_metadata _ms_metadata = {};
compaction_sstable_replacer_fn _replacer;
utils::UUID _run_identifier;
run_id _run_identifier;
::io_priority_class _io_priority;
// optional clone of sstable set to be used for expiration purposes, so it will be set if expiration is enabled.
std::optional<sstable_set> _sstable_set;
@@ -476,6 +494,7 @@ protected:
, _type(descriptor.options.type())
, _max_sstable_size(descriptor.max_sstable_bytes)
, _sstable_level(descriptor.level)
, _can_split_large_partition(descriptor.can_split_large_partition)
, _replacer(std::move(descriptor.replacer))
, _run_identifier(descriptor.run_identifier)
, _io_priority(descriptor.io_priority)
@@ -486,7 +505,7 @@ protected:
for (auto& sst : _sstables) {
_stats_collector.update(sst->get_encoding_stats_for_compaction());
}
std::unordered_set<utils::UUID> ssts_run_ids;
std::unordered_set<run_id> ssts_run_ids;
_contains_multi_fragment_runs = std::any_of(_sstables.begin(), _sstables.end(), [&ssts_run_ids] (shared_sstable& sst) {
return !ssts_run_ids.insert(sst->run_identifier()).second;
});
@@ -596,6 +615,10 @@ protected:
const std::vector<shared_sstable>& used_garbage_collected_sstables() const {
return _used_garbage_collected_sstables;
}
bool enable_garbage_collected_sstable_writer() const noexcept {
return _contains_multi_fragment_runs && _max_sstable_size != std::numeric_limits<uint64_t>::max();
}
public:
compaction& operator=(const compaction&) = delete;
compaction(const compaction&) = delete;
@@ -613,13 +636,15 @@ private:
return _table_s.get_compaction_strategy().make_sstable_set(_schema);
}
void setup() {
future<> setup() {
auto ssts = make_lw_shared<sstables::sstable_set>(make_sstable_set_for_input());
formatted_sstables_list formatted_msg;
auto fully_expired = _table_s.fully_expired_sstables(_sstables, gc_clock::now());
min_max_tracker<api::timestamp_type> timestamp_tracker;
_input_sstable_generations.reserve(_sstables.size());
for (auto& sst : _sstables) {
co_await coroutine::maybe_yield();
auto& sst_stats = sst->get_stats_metadata();
timestamp_tracker.update(sst_stats.min_timestamp);
timestamp_tracker.update(sst_stats.max_timestamp);
@@ -669,13 +694,14 @@ private:
// to be compacted together.
future<> consume_without_gc_writer(gc_clock::time_point compaction_time) {
auto consumer = make_interposer_consumer([this] (flat_mutation_reader_v2 reader) mutable {
return seastar::async([this, reader = downgrade_to_v1(std::move(reader))] () mutable {
return seastar::async([this, reader = std::move(reader)] () mutable {
auto close_reader = deferred_close(reader);
auto cfc = compacted_fragments_writer(get_compacted_fragments_writer());
reader.consume_in_thread(std::move(cfc));
});
});
return consumer(upgrade_to_v2(make_compacting_reader(downgrade_to_v1(make_sstable_reader()), compaction_time, max_purgeable_func())));
const auto& gc_state = _table_s.get_tombstone_gc_state();
return consumer(make_compacting_reader(make_sstable_reader(), compaction_time, max_purgeable_func(), gc_state));
}
future<> consume() {
@@ -692,18 +718,20 @@ private:
auto close_reader = deferred_close(reader);
if (enable_garbage_collected_sstable_writer()) {
using compact_mutations = compact_for_compaction<compacted_fragments_writer, compacted_fragments_writer>;
using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, compacted_fragments_writer>;
auto cfc = compact_mutations(*schema(), now,
max_purgeable_func(),
_table_s.get_tombstone_gc_state(),
get_compacted_fragments_writer(),
get_gc_compacted_fragments_writer());
reader.consume_in_thread(std::move(cfc));
return;
}
using compact_mutations = compact_for_compaction<compacted_fragments_writer, noop_compacted_fragments_consumer>;
using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, noop_compacted_fragments_consumer>;
auto cfc = compact_mutations(*schema(), now,
max_purgeable_func(),
_table_s.get_tombstone_gc_state(),
get_compacted_fragments_writer(),
noop_compacted_fragments_consumer());
reader.consume_in_thread(std::move(cfc));
@@ -719,12 +747,15 @@ private:
virtual bool use_interposer_consumer() const {
return _table_s.get_compaction_strategy().use_interposer_consumer();
}
compaction_result finish(std::chrono::time_point<db_clock> started_at, std::chrono::time_point<db_clock> ended_at) {
protected:
virtual compaction_result finish(std::chrono::time_point<db_clock> started_at, std::chrono::time_point<db_clock> ended_at) {
compaction_result ret {
.new_sstables = std::move(_all_new_sstables),
.ended_at = ended_at,
.end_size = _end_size,
.stats {
.ended_at = ended_at,
.start_size = _start_size,
.end_size = _end_size,
},
};
auto ratio = double(_end_size) / double(_start_size);
@@ -748,6 +779,11 @@ private:
return ret;
}
private:
void on_interrupt(std::exception_ptr ex) {
log_info("{} of {} sstables interrupted due to: {}", report_start_desc(), _input_sstable_generations.size(), ex);
delete_sstables_for_interrupted_compaction();
}
virtual std::string_view report_start_desc() const = 0;
virtual std::string_view report_finish_desc() const = 0;
@@ -825,10 +861,6 @@ protected:
log(log_level::trace, std::move(fmt), std::forward<Args>(args)...);
}
public:
bool enable_garbage_collected_sstable_writer() const noexcept {
return _contains_multi_fragment_runs && _max_sstable_size != std::numeric_limits<uint64_t>::max();
}
static future<compaction_result> run(std::unique_ptr<compaction> c);
friend class compacted_fragments_writer;
@@ -851,7 +883,51 @@ void compacted_fragments_writer::maybe_abort_compaction() {
}
}
void compacted_fragments_writer::consume_new_partition(const dht::decorated_key& dk) {
void compacted_fragments_writer::stop_current_writer() {
// stop sstable writer being currently used.
_stop_compaction_writer(&*_compaction_writer);
_compaction_writer = std::nullopt;
}
bool compacted_fragments_writer::can_split_large_partition() const {
return _c._can_split_large_partition;
}
void compacted_fragments_writer::track_last_position(position_in_partition_view pos) {
if (can_split_large_partition()) {
_current_partition.last_pos = pos;
}
}
void compacted_fragments_writer::split_large_partition() {
// Closes the active range tombstone if needed, before emitting partition end.
// after_key(last_pos) is used for both closing and re-opening the active tombstone, which
// will result in current fragment storing an inclusive end bound for last pos, and the
// next fragment storing an exclusive start bound for last pos. This is very important
// for not losing information on the range tombstone.
auto after_last_pos = position_in_partition::after_key(_current_partition.last_pos.key());
if (_current_partition.current_emitted_tombstone) {
auto rtc = range_tombstone_change(after_last_pos, tombstone{});
_c.log_debug("Closing active tombstone {} with {} for partition {}", _current_partition.current_emitted_tombstone, rtc, *_current_partition.dk);
_compaction_writer->writer.consume(std::move(rtc));
}
_c.log_debug("Splitting large partition {} in order to respect SSTable size limit of {}", *_current_partition.dk, pretty_printed_data_size(_c._max_sstable_size));
// Close partition in current writer, and open it again in a new writer.
do_consume_end_of_partition();
stop_current_writer();
do_consume_new_partition(*_current_partition.dk);
// Replicate partition tombstone to every fragment, allowing the SSTable run reader
// to open a single fragment during the read.
if (_current_partition.tombstone) {
consume(_current_partition.tombstone);
}
if (_current_partition.current_emitted_tombstone) {
_compaction_writer->writer.consume(range_tombstone_change(after_last_pos, _current_partition.current_emitted_tombstone));
}
_current_partition.is_splitting_partition = false;
}
void compacted_fragments_writer::do_consume_new_partition(const dht::decorated_key& dk) {
maybe_abort_compaction();
if (!_compaction_writer) {
_compaction_writer = _create_compaction_writer(dk);
@@ -859,17 +935,55 @@ void compacted_fragments_writer::consume_new_partition(const dht::decorated_key&
_c.on_new_partition();
_compaction_writer->writer.consume_new_partition(dk);
_c._cdata.total_keys_written++;
_unclosed_partition = true;
}
stop_iteration compacted_fragments_writer::consume_end_of_partition() {
auto ret = _compaction_writer->writer.consume_end_of_partition();
stop_iteration compacted_fragments_writer::do_consume_end_of_partition() {
_unclosed_partition = false;
return _compaction_writer->writer.consume_end_of_partition();
}
void compacted_fragments_writer::consume_new_partition(const dht::decorated_key& dk) {
_current_partition = {
.dk = dk,
.tombstone = tombstone(),
.current_emitted_tombstone = tombstone(),
.last_pos = position_in_partition(position_in_partition::partition_start_tag_t()),
.is_splitting_partition = false
};
do_consume_new_partition(dk);
_c._cdata.total_keys_written++;
}
void compacted_fragments_writer::consume(tombstone t) {
_current_partition.tombstone = t;
_compaction_writer->writer.consume(t);
}
stop_iteration compacted_fragments_writer::consume(clustering_row&& cr, row_tombstone, bool) {
maybe_abort_compaction();
if (_current_partition.is_splitting_partition) [[unlikely]] {
split_large_partition();
}
track_last_position(cr.position());
auto ret = _compaction_writer->writer.consume(std::move(cr));
if (can_split_large_partition() && ret == stop_iteration::yes) [[unlikely]] {
_current_partition.is_splitting_partition = true;
}
return stop_iteration::no;
}
stop_iteration compacted_fragments_writer::consume(range_tombstone_change&& rtc) {
maybe_abort_compaction();
_current_partition.current_emitted_tombstone = rtc.tombstone();
track_last_position(rtc.position());
return _compaction_writer->writer.consume(std::move(rtc));
}
stop_iteration compacted_fragments_writer::consume_end_of_partition() {
auto ret = do_consume_end_of_partition();
if (ret == stop_iteration::yes) {
// stop sstable writer being currently used.
_stop_compaction_writer(&*_compaction_writer);
_compaction_writer = std::nullopt;
stop_current_writer();
}
return ret;
}
@@ -888,7 +1002,7 @@ public:
}
virtual sstables::sstable_set make_sstable_set_for_input() const override {
return sstables::make_partitioned_sstable_set(_schema, make_lw_shared<sstable_list>(sstable_list{}), false);
return sstables::make_partitioned_sstable_set(_schema, false);
}
flat_mutation_reader_v2 make_sstable_reader() const override {
@@ -1014,7 +1128,7 @@ private:
_new_unused_sstables.insert(_new_unused_sstables.end(), unused_gc_sstables.begin(), unused_gc_sstables.end());
auto exhausted_ssts = std::vector<shared_sstable>(exhausted, _sstables.end());
log_debug("Replacing earlier exhausted sstable(s) {} by new sstable {}", formatted_sstables_list(exhausted_ssts, false), sst->get_filename());
log_debug("Replacing earlier exhausted sstable(s) {} by new sstable(s) {}", formatted_sstables_list(exhausted_ssts, false), formatted_sstables_list(_new_unused_sstables, true));
_replacer(get_compaction_completion_desc(exhausted_ssts, std::move(_new_unused_sstables)));
_sstables.erase(exhausted, _sstables.end());
_monitor_generator.remove_exhausted_sstables(exhausted_ssts);
@@ -1081,13 +1195,13 @@ class cleanup_compaction final : public regular_compaction {
}
};
const dht::token_range_vector _owned_ranges;
owned_ranges_ptr _owned_ranges;
incremental_owned_ranges_checker _owned_ranges_checker;
private:
// Called in a seastar thread
dht::partition_range_vector
get_ranges_for_invalidation(const std::vector<shared_sstable>& sstables) {
auto owned_ranges = dht::to_partition_ranges(_owned_ranges, utils::can_yield::yes);
auto owned_ranges = dht::to_partition_ranges(*_owned_ranges, utils::can_yield::yes);
auto non_owned_ranges = boost::copy_range<dht::partition_range_vector>(sstables
| boost::adaptors::transformed([] (const shared_sstable& sst) {
@@ -1119,10 +1233,10 @@ protected:
}
private:
cleanup_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, dht::token_range_vector owned_ranges)
cleanup_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, owned_ranges_ptr owned_ranges)
: regular_compaction(table_s, std::move(descriptor), cdata)
, _owned_ranges(std::move(owned_ranges))
, _owned_ranges_checker(_owned_ranges)
, _owned_ranges_checker(*_owned_ranges)
{
}
@@ -1144,7 +1258,7 @@ public:
return "Cleaned";
}
flat_mutation_reader::filter make_partition_filter() const {
flat_mutation_reader_v2::filter make_partition_filter() const {
return [this] (const dht::decorated_key& dk) {
#ifdef SEASTAR_DEBUG
// sstables should never be shared with other shards at this point.
@@ -1170,9 +1284,9 @@ public:
type,
schema.ks_name(),
schema.cf_name(),
partition_key_to_string(new_key.key(), schema),
new_key.key().with_schema(schema),
new_key,
partition_key_to_string(current_key.key(), schema),
current_key.key().with_schema(schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1185,9 +1299,9 @@ public:
type,
schema.ks_name(),
schema.cf_name(),
partition_key_to_string(new_key.key(), schema),
new_key.key().with_schema(schema),
new_key,
partition_key_to_string(current_key.key(), schema),
current_key.key().with_schema(schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1205,7 +1319,7 @@ public:
mf.mutation_fragment_kind(),
mf.has_key() ? format(" with key {}", mf.key().with_schema(schema)) : "",
mf.position(),
partition_key_to_string(key.key(), schema),
key.key().with_schema(schema),
key,
prev_pos.region(),
prev_pos.has_key() ? format(" with key {}", prev_pos.key().with_schema(schema)) : "",
@@ -1217,14 +1331,10 @@ public:
const auto& schema = validator.schema();
const auto& key = validator.previous_partition_key();
clogger.error("[{} compaction {}.{}] Invalid end-of-stream, last partition {} ({}) didn't end with a partition-end fragment{}{}",
type, schema.ks_name(), schema.cf_name(), partition_key_to_string(key.key(), schema), key, action.empty() ? "" : "; ", action);
type, schema.ks_name(), schema.cf_name(), key.key().with_schema(schema), key, action.empty() ? "" : "; ", action);
}
private:
static sstring partition_key_to_string(const partition_key& key, const ::schema& s) {
sstring ret = format("{}", key.with_schema(s));
return utils::utf8::validate((const uint8_t*)ret.data(), ret.size()) ? ret : "<non-utf8-key>";
}
class reader : public flat_mutation_reader_v2::impl {
using skip = bool_class<class skip_tag>;
@@ -1233,18 +1343,23 @@ private:
flat_mutation_reader_v2 _reader;
mutation_fragment_stream_validator _validator;
bool _skip_to_next_partition = false;
uint64_t& _validation_errors;
private:
void maybe_abort_scrub() {
void maybe_abort_scrub(std::function<void()> report_error) {
if (_scrub_mode == compaction_type_options::scrub::mode::abort) {
report_error();
throw compaction_aborted_exception(_schema->ks_name(), _schema->cf_name(), "scrub compaction found invalid data");
}
++_validation_errors;
}
void on_unexpected_partition_start(const mutation_fragment_v2& ps) {
maybe_abort_scrub();
report_invalid_partition_start(compaction_type::Scrub, _validator, ps.as_partition_start().key(),
"Rectifying by adding assumed missing partition-end");
auto report_fn = [this, &ps] (std::string_view action = "") {
report_invalid_partition_start(compaction_type::Scrub, _validator, ps.as_partition_start().key(), action);
};
maybe_abort_scrub(report_fn);
report_fn("Rectifying by adding assumed missing partition-end");
auto pe = mutation_fragment_v2(*_schema, _permit, partition_end{});
if (!_validator(pe)) {
@@ -1264,20 +1379,26 @@ private:
}
skip on_invalid_partition(const dht::decorated_key& new_key) {
maybe_abort_scrub();
auto report_fn = [this, &new_key] (std::string_view action = "") {
report_invalid_partition(compaction_type::Scrub, _validator, new_key, action);
};
maybe_abort_scrub(report_fn);
if (_scrub_mode == compaction_type_options::scrub::mode::segregate) {
report_invalid_partition(compaction_type::Scrub, _validator, new_key, "Detected");
report_fn("Detected");
_validator.reset(new_key);
// Let the segregating interposer consumer handle this.
return skip::no;
}
report_invalid_partition(compaction_type::Scrub, _validator, new_key, "Skipping");
report_fn("Skipping");
_skip_to_next_partition = true;
return skip::yes;
}
skip on_invalid_mutation_fragment(const mutation_fragment_v2& mf) {
maybe_abort_scrub();
auto report_fn = [this, &mf] (std::string_view action = "") {
report_invalid_mutation_fragment(compaction_type::Scrub, _validator, mf, "");
};
maybe_abort_scrub(report_fn);
const auto& key = _validator.previous_partition_key();
@@ -1292,8 +1413,7 @@ private:
// The only case a partition end is invalid is when it comes after
// another partition end, and we can just drop it in that case.
if (!mf.is_end_of_partition() && _scrub_mode == compaction_type_options::scrub::mode::segregate) {
report_invalid_mutation_fragment(compaction_type::Scrub, _validator, mf,
"Injecting partition start/end to segregate out-of-order fragment");
report_fn("Injecting partition start/end to segregate out-of-order fragment");
push_mutation_fragment(*_schema, _permit, partition_end{});
// We loose the partition tombstone if any, but it will be
@@ -1306,19 +1426,23 @@ private:
return skip::no;
}
report_invalid_mutation_fragment(compaction_type::Scrub, _validator, mf, "Skipping");
report_fn("Skipping");
return skip::yes;
}
void on_invalid_end_of_stream() {
maybe_abort_scrub();
auto report_fn = [this] (std::string_view action = "") {
report_invalid_end_of_stream(compaction_type::Scrub, _validator, action);
};
maybe_abort_scrub(report_fn);
// Handle missing partition_end
push_mutation_fragment(mutation_fragment_v2(*_schema, _permit, partition_end{}));
report_invalid_end_of_stream(compaction_type::Scrub, _validator, "Rectifying by adding missing partition-end to the end of the stream");
report_fn("Rectifying by adding missing partition-end to the end of the stream");
}
void fill_buffer_from_underlying() {
utils::get_local_injector().inject("rest_api_keyspace_scrub_abort", [] { throw compaction_aborted_exception("", "", "scrub compaction found invalid data"); });
while (!_reader.is_buffer_empty() && !is_buffer_full()) {
auto mf = _reader.pop_mutation_fragment();
if (mf.is_partition_start()) {
@@ -1358,11 +1482,12 @@ private:
}
public:
reader(flat_mutation_reader_v2 underlying, compaction_type_options::scrub::mode scrub_mode)
reader(flat_mutation_reader_v2 underlying, compaction_type_options::scrub::mode scrub_mode, uint64_t& validation_errors)
: impl(underlying.schema(), underlying.permit())
, _scrub_mode(scrub_mode)
, _reader(std::move(underlying))
, _validator(*_schema)
, _validation_errors(validation_errors)
{ }
virtual future<> fill_buffer() override {
if (_end_of_stream) {
@@ -1410,6 +1535,7 @@ private:
std::string _scrub_start_description;
mutable std::string _scrub_finish_description;
uint64_t _bucket_count = 0;
mutable uint64_t _validation_errors = 0;
public:
scrub_compaction(table_state& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_type_options::scrub options)
@@ -1432,7 +1558,7 @@ public:
flat_mutation_reader_v2 make_sstable_reader() const override {
auto crawling_reader = _compacting->make_crawling_reader(_schema, _permit, _io_priority, nullptr);
return make_flat_mutation_reader_v2<reader>(std::move(crawling_reader), _options.operation_mode);
return make_flat_mutation_reader_v2<reader>(std::move(crawling_reader), _options.operation_mode, _validation_errors);
}
uint64_t partitions_per_sstable() const override {
@@ -1463,12 +1589,17 @@ public:
return _options.operation_mode == compaction_type_options::scrub::mode::segregate;
}
friend flat_mutation_reader_v2 make_scrubbing_reader(flat_mutation_reader_v2 rd, compaction_type_options::scrub::mode scrub_mode);
friend flat_mutation_reader make_scrubbing_reader(flat_mutation_reader rd, compaction_type_options::scrub::mode scrub_mode);
compaction_result finish(std::chrono::time_point<db_clock> started_at, std::chrono::time_point<db_clock> ended_at) override {
auto ret = compaction::finish(started_at, ended_at);
ret.stats.validation_errors = _validation_errors;
return ret;
}
friend flat_mutation_reader_v2 make_scrubbing_reader(flat_mutation_reader_v2 rd, compaction_type_options::scrub::mode scrub_mode, uint64_t& validation_errors);
};
flat_mutation_reader_v2 make_scrubbing_reader(flat_mutation_reader_v2 rd, compaction_type_options::scrub::mode scrub_mode) {
return make_flat_mutation_reader_v2<scrub_compaction::reader>(std::move(rd), scrub_mode);
flat_mutation_reader_v2 make_scrubbing_reader(flat_mutation_reader_v2 rd, compaction_type_options::scrub::mode scrub_mode, uint64_t& validation_errors) {
return make_flat_mutation_reader_v2<scrub_compaction::reader>(std::move(rd), scrub_mode, validation_errors);
}
class resharding_compaction final : public compaction {
@@ -1486,7 +1617,7 @@ class resharding_compaction final : public compaction {
uint64_t estimated_partitions = 0;
};
std::vector<estimated_values> _estimation_per_shard;
std::vector<utils::UUID> _run_identifiers;
std::vector<run_id> _run_identifiers;
private:
// return estimated partitions per sstable for a given shard
uint64_t partitions_per_sstable(shard_id s) const {
@@ -1510,7 +1641,7 @@ public:
}
}
for (auto i : boost::irange(0u, smp::count)) {
_run_identifiers[i] = utils::make_random_uuid();
_run_identifiers[i] = run_id::create_random_id();
}
}
@@ -1567,14 +1698,14 @@ public:
future<compaction_result> compaction::run(std::unique_ptr<compaction> c) {
return seastar::async([c = std::move(c)] () mutable {
c->setup();
c->setup().get();
auto consumer = c->consume();
auto start_time = db_clock::now();
try {
consumer.get();
} catch (...) {
c->delete_sstables_for_interrupted_compaction();
c->on_interrupt(std::current_exception());
c = nullptr; // make sure writers are stopped while running in thread context. This is because of calls to file.close().get();
throw;
}
@@ -1626,10 +1757,10 @@ static std::unique_ptr<compaction> make_compaction(table_state& table_s, sstable
return descriptor.options.visit(visitor_factory);
}
future<bool> scrub_validate_mode_validate_reader(flat_mutation_reader_v2 reader, const compaction_data& cdata) {
future<uint64_t> scrub_validate_mode_validate_reader(flat_mutation_reader_v2 reader, const compaction_data& cdata) {
auto schema = reader.schema();
bool valid = true;
uint64_t errors = 0;
std::exception_ptr ex;
try {
@@ -1648,24 +1779,24 @@ future<bool> scrub_validate_mode_validate_reader(flat_mutation_reader_v2 reader,
if (!validator(mf)) {
scrub_compaction::report_invalid_partition_start(compaction_type::Scrub, validator, ps.key());
validator.reset(mf);
valid = false;
++errors;
}
if (!validator(ps.key())) {
scrub_compaction::report_invalid_partition(compaction_type::Scrub, validator, ps.key());
validator.reset(ps.key());
valid = false;
++errors;
}
} else {
if (!validator(mf)) {
scrub_compaction::report_invalid_mutation_fragment(compaction_type::Scrub, validator, mf);
validator.reset(mf);
valid = false;
++errors;
}
}
}
if (!validator.on_end_of_stream()) {
scrub_compaction::report_invalid_end_of_stream(compaction_type::Scrub, validator);
valid = false;
++errors;
}
} catch (...) {
ex = std::current_exception();
@@ -1677,14 +1808,14 @@ future<bool> scrub_validate_mode_validate_reader(flat_mutation_reader_v2 reader,
co_return coroutine::exception(std::move(ex));
}
co_return valid;
co_return errors;
}
static future<compaction_result> scrub_sstables_validate_mode(sstables::compaction_descriptor descriptor, compaction_data& cdata, table_state& table_s) {
auto schema = table_s.schema();
formatted_sstables_list sstables_list_msg;
auto sstables = make_lw_shared<sstables::sstable_set>(sstables::make_partitioned_sstable_set(schema, make_lw_shared<sstable_list>(sstable_list{}), false));
auto sstables = make_lw_shared<sstables::sstable_set>(sstables::make_partitioned_sstable_set(schema, false));
for (const auto& sst : descriptor.sstables) {
sstables_list_msg += sst;
sstables->insert(sst);
@@ -1695,11 +1826,11 @@ static future<compaction_result> scrub_sstables_validate_mode(sstables::compacti
auto permit = table_s.make_compaction_reader_permit();
auto reader = sstables->make_crawling_reader(schema, permit, descriptor.io_priority, nullptr);
const auto valid = co_await scrub_validate_mode_validate_reader(std::move(reader), cdata);
const auto validation_errors = co_await scrub_validate_mode_validate_reader(std::move(reader), cdata);
clogger.info("Finished scrubbing in validate mode {} - sstable(s) are {}", sstables_list_msg, valid ? "valid" : "invalid");
clogger.info("Finished scrubbing in validate mode {} - sstable(s) are {}", sstables_list_msg, validation_errors == 0 ? "valid" : "invalid");
if (!valid) {
if (validation_errors != 0) {
for (auto& sst : *sstables->all()) {
co_await sst->move_to_quarantine();
}
@@ -1707,7 +1838,10 @@ static future<compaction_result> scrub_sstables_validate_mode(sstables::compacti
co_return compaction_result {
.new_sstables = {},
.ended_at = db_clock::now(),
.stats = {
.ended_at = db_clock::now(),
.validation_errors = validation_errors,
},
};
}
@@ -1740,26 +1874,26 @@ get_fully_expired_sstables(const table_state& table_s, const std::vector<sstable
int64_t min_timestamp = std::numeric_limits<int64_t>::max();
for (auto& sstable : overlapping) {
auto gc_before = sstable->get_gc_before_for_fully_expire(compaction_time);
auto gc_before = sstable->get_gc_before_for_fully_expire(compaction_time, table_s.get_tombstone_gc_state());
if (sstable->get_max_local_deletion_time() >= gc_before) {
min_timestamp = std::min(min_timestamp, sstable->get_stats_metadata().min_timestamp);
}
}
auto compacted_undeleted_gens = boost::copy_range<std::unordered_set<int64_t>>(table_s.compacted_undeleted_sstables()
auto compacted_undeleted_gens = boost::copy_range<std::unordered_set<generation_type>>(table_s.compacted_undeleted_sstables()
| boost::adaptors::transformed(std::mem_fn(&sstables::sstable::generation)));
auto has_undeleted_ancestor = [&compacted_undeleted_gens] (auto& candidate) {
// Get ancestors from sstable which is empty after restart. It works for this purpose because
// we only need to check that a sstable compacted *in this instance* hasn't an ancestor undeleted.
// Not getting it from sstable metadata because mc format hasn't it available.
return boost::algorithm::any_of(candidate->compaction_ancestors(), [&compacted_undeleted_gens] (auto gen) {
return boost::algorithm::any_of(candidate->compaction_ancestors(), [&compacted_undeleted_gens] (const generation_type& gen) {
return compacted_undeleted_gens.contains(gen);
});
};
// SStables that do not contain live data is added to list of possibly expired sstables.
for (auto& candidate : compacting) {
auto gc_before = candidate->get_gc_before_for_fully_expire(compaction_time);
auto gc_before = candidate->get_gc_before_for_fully_expire(compaction_time, table_s.get_tombstone_gc_state());
clogger.debug("Checking if candidate of generation {} and max_deletion_time {} is expired, gc_before is {}",
candidate->generation(), candidate->get_stats_metadata().max_local_deletion_time, gc_before);
// A fully expired sstable which has an ancestor undeleted shouldn't be compacted because
@@ -1790,7 +1924,11 @@ get_fully_expired_sstables(const table_state& table_s, const std::vector<sstable
}
unsigned compaction_descriptor::fan_in() const {
return boost::copy_range<std::unordered_set<utils::UUID>>(sstables | boost::adaptors::transformed(std::mem_fn(&sstables::sstable::run_identifier))).size();
return boost::copy_range<std::unordered_set<run_id>>(sstables | boost::adaptors::transformed(std::mem_fn(&sstables::sstable::run_identifier))).size();
}
uint64_t compaction_descriptor::sstables_size() const {
return boost::accumulate(sstables | boost::adaptors::transformed(std::mem_fn(&sstables::sstable::data_size)), uint64_t(0));
}
}

View File

@@ -17,8 +17,8 @@
#include "utils/UUID.hh"
#include "table_state.hh"
#include <seastar/core/thread.hh>
#include <seastar/core/abort_source.hh>
class flat_mutation_reader;
using namespace compaction;
namespace sstables {
@@ -40,9 +40,19 @@ public:
friend std::ostream& operator<<(std::ostream&, pretty_printed_throughput);
};
// Return the name of the compaction type
// as used over the REST api, e.g. "COMPACTION" or "CLEANUP".
sstring compaction_name(compaction_type type);
// Reverse map the name of the compaction type
// as used over the REST api, e.g. "COMPACTION" or "CLEANUP",
// to the compaction_type enum code.
compaction_type to_compaction_type(sstring type_name);
// Return a string respresenting the compaction type
// as a verb for logging purposes, e.g. "Compact" or "Cleanup".
std::string_view to_string(compaction_type type);
struct compaction_info {
utils::UUID compaction_uuid;
compaction_type type = compaction_type::Compaction;
@@ -56,6 +66,7 @@ struct compaction_data {
uint64_t total_partitions = 0;
uint64_t total_keys_written = 0;
sstring stop_requested;
abort_source abort;
utils::UUID compaction_uuid;
unsigned compaction_fan_in = 0;
struct replacement {
@@ -70,13 +81,33 @@ struct compaction_data {
void stop(sstring reason) {
stop_requested = std::move(reason);
abort.request_abort();
}
};
struct compaction_stats {
std::chrono::time_point<db_clock> ended_at;
uint64_t start_size = 0;
uint64_t end_size = 0;
uint64_t validation_errors = 0;
compaction_stats& operator+=(const compaction_stats& r) {
ended_at = std::max(ended_at, r.ended_at);
start_size += r.start_size;
end_size += r.end_size;
validation_errors += r.validation_errors;
return *this;
}
friend compaction_stats operator+(const compaction_stats& l, const compaction_stats& r) {
auto tmp = l;
tmp += r;
return tmp;
}
};
struct compaction_result {
std::vector<sstables::shared_sstable> new_sstables;
std::chrono::time_point<db_clock> ended_at;
uint64_t end_size = 0;
compaction_stats stats;
};
// Compact a list of N sstables into M sstables.
@@ -95,9 +126,9 @@ std::unordered_set<sstables::shared_sstable>
get_fully_expired_sstables(const table_state& table_s, const std::vector<sstables::shared_sstable>& compacting, gc_clock::time_point gc_before);
// For tests, can drop after we virtualize sstables.
flat_mutation_reader_v2 make_scrubbing_reader(flat_mutation_reader_v2 rd, compaction_type_options::scrub::mode scrub_mode);
flat_mutation_reader_v2 make_scrubbing_reader(flat_mutation_reader_v2 rd, compaction_type_options::scrub::mode scrub_mode, uint64_t& validation_errors);
// For tests, can drop after we virtualize sstables.
future<bool> scrub_validate_mode_validate_reader(flat_mutation_reader_v2 rd, const compaction_data& info);
future<uint64_t> scrub_validate_mode_validate_reader(flat_mutation_reader_v2 rd, const compaction_data& info);
}

View File

@@ -60,8 +60,7 @@ public:
using ongoing_compactions = std::unordered_map<sstables::shared_sstable, backlog_read_progress_manager*>;
struct impl {
virtual void add_sstable(sstables::shared_sstable sst) = 0;
virtual void remove_sstable(sstables::shared_sstable sst) = 0;
virtual void replace_sstables(std::vector<sstables::shared_sstable> old_ssts, std::vector<sstables::shared_sstable> new_ssts) = 0;
virtual double backlog(const ongoing_writes& ow, const ongoing_compactions& oc) const = 0;
virtual ~impl() { }
};
@@ -72,22 +71,21 @@ public:
~compaction_backlog_tracker();
double backlog() const;
void add_sstable(sstables::shared_sstable sst);
void remove_sstable(sstables::shared_sstable sst);
void replace_sstables(const std::vector<sstables::shared_sstable>& old_ssts, const std::vector<sstables::shared_sstable>& new_ssts);
void register_partially_written_sstable(sstables::shared_sstable sst, backlog_write_progress_manager& wp);
void register_compacting_sstable(sstables::shared_sstable sst, backlog_read_progress_manager& rp);
void transfer_ongoing_charges(compaction_backlog_tracker& new_bt, bool move_read_charges = true);
void revert_charges(sstables::shared_sstable sst);
private:
// Returns true if this SSTable can be added or removed from the tracker.
bool sstable_belongs_to_tracker(const sstables::shared_sstable& sst);
void disable() {
_disabled = true;
_impl = {};
_ongoing_writes = {};
_ongoing_compactions = {};
}
bool _disabled = false;
private:
// Returns true if this SSTable can be added or removed from the tracker.
bool sstable_belongs_to_tracker(const sstables::shared_sstable& sst);
bool disabled() const noexcept { return !_impl; }
std::unique_ptr<impl> _impl;
// We keep track of this so that we can transfer to a new tracker if the compaction strategy is
// changed in the middle of a compaction.

View File

@@ -14,12 +14,22 @@
#include <variant>
#include <seastar/core/smp.hh>
#include <seastar/core/file.hh>
#include "sstables/shared_sstable.hh"
#include "sstables/types_fwd.hh"
#include "sstables/sstable_set.hh"
#include "utils/UUID.hh"
#include "dht/i_partitioner.hh"
#include "compaction_weight_registration.hh"
namespace compaction {
using owned_ranges_ptr = lw_shared_ptr<const dht::token_range_vector>;
inline owned_ranges_ptr make_owned_ranges_ptr(dht::token_range_vector&& ranges) {
return make_lw_shared<const dht::token_range_vector>(std::move(ranges));
}
} // namespace compaction
namespace sstables {
enum class compaction_type {
@@ -54,10 +64,10 @@ public:
struct regular {
};
struct cleanup {
dht::token_range_vector owned_ranges;
compaction::owned_ranges_ptr owned_ranges;
};
struct upgrade {
dht::token_range_vector owned_ranges;
compaction::owned_ranges_ptr owned_ranges;
};
struct scrub {
enum class mode {
@@ -102,11 +112,11 @@ public:
return compaction_type_options(regular{});
}
static compaction_type_options make_cleanup(dht::token_range_vector&& owned_ranges) {
static compaction_type_options make_cleanup(compaction::owned_ranges_ptr owned_ranges) {
return compaction_type_options(cleanup{std::move(owned_ranges)});
}
static compaction_type_options make_upgrade(dht::token_range_vector&& owned_ranges) {
static compaction_type_options make_upgrade(compaction::owned_ranges_ptr owned_ranges) {
return compaction_type_options(upgrade{std::move(owned_ranges)});
}
@@ -143,10 +153,10 @@ struct compaction_descriptor {
int level;
// Threshold size for sstable(s) to be created.
uint64_t max_sstable_bytes;
// Can split large partitions at clustering boundary.
bool can_split_large_partition = false;
// Run identifier of output sstables.
utils::UUID run_identifier;
// Calls compaction manager's task for this compaction to release reference to exhausted sstables.
std::function<void(const std::vector<shared_sstable>& exhausted_sstables)> release_exhausted;
sstables::run_id run_identifier;
// The options passed down to the compaction code.
// This also selects the kind of compaction to do.
compaction_type_options options = compaction_type_options::make_regular();
@@ -165,14 +175,12 @@ struct compaction_descriptor {
static constexpr uint64_t default_max_sstable_bytes = std::numeric_limits<uint64_t>::max();
explicit compaction_descriptor(std::vector<sstables::shared_sstable> sstables,
std::optional<sstables::sstable_set> all_sstables_snapshot,
::io_priority_class io_priority,
int level = default_level,
uint64_t max_sstable_bytes = default_max_sstable_bytes,
utils::UUID run_identifier = utils::make_random_uuid(),
run_id run_identifier = run_id::create_random_id(),
compaction_type_options options = compaction_type_options::make_regular())
: sstables(std::move(sstables))
, all_sstables_snapshot(std::move(all_sstables_snapshot))
, level(level)
, max_sstable_bytes(max_sstable_bytes)
, run_identifier(run_identifier)
@@ -182,13 +190,11 @@ struct compaction_descriptor {
explicit compaction_descriptor(sstables::has_only_fully_expired has_only_fully_expired,
std::vector<sstables::shared_sstable> sstables,
std::optional<sstables::sstable_set> all_sstables_snapshot,
::io_priority_class io_priority)
: sstables(std::move(sstables))
, all_sstables_snapshot(std::move(all_sstables_snapshot))
, level(default_level)
, max_sstable_bytes(default_max_sstable_bytes)
, run_identifier(utils::make_random_uuid())
, run_identifier(run_id::create_random_id())
, options(compaction_type_options::make_regular())
, io_priority(io_priority)
, has_only_fully_expired(has_only_fully_expired)
@@ -196,6 +202,10 @@ struct compaction_descriptor {
// Return fan-in of this job, which is equal to its number of runs.
unsigned fan_in() const;
// Enables garbage collection for this descriptor, meaning that compaction will be able to purge expired data
void enable_garbage_collection(sstables::sstable_set snapshot) { all_sstables_snapshot = std::move(snapshot); }
// Returns total size of all sstables contained in this descriptor
uint64_t sstables_size() const;
};
}

File diff suppressed because it is too large Load Diff

Some files were not shown because too many files have changed in this diff Show More