24611 Commits

Author SHA1 Message Date
Avi Kivity
0f967f911d Merge "storage_service: get_token_metadata_ptr to hold on to token_metadata" from Benny
"
This series fixes use-after-free via token_metadata&

We may currently get a token_metadata& via get_token_metadata() and
use it across yield points in a couple of sites:
- do_decommission_removenode_with_repair
- get_new_source_ranges

To fix that, call get_token_metadata_ptr() and hold on to the
returned pointer across yield points.

Fixes #7790

Dtest: update_cluster_layout_tests:TestUpdateClusterLayout.simple_removenode_2_test(debug)
Test: unit(dev)
"

* tag 'storage_service-token_metadata_ptr-v2' of github.com:bhalevy/scylla:
  storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
  storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
  storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
  storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
2020-12-13 17:37:24 +02:00
Aleksandr Bykov
e74dc311e7 dist: scylla_util: fix aws_instance.ebs_disks method
aws_instance.ebs_disks() method should return ebs disk
instead of ephemeral

Signed-off-by: Aleksandr Bykov <alex.bykov@scylladb.com>

Closes #7780
2020-12-13 17:33:37 +02:00
Benny Halevy
1fbc831dae storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
Have the caller provide the token_metadata& to get_new_source_ranges
and keep it valid throughout the call.

Note that there is no need to clone_only_token_map
since the token_metadata_ptr is immutable and can be
used just as well for calling strat.get_range_addresses.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
f13913d251 storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
Now that we pass can_yield::yes to calculate_natural_endpoints
for each token_range.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
89ed0705e8 storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
No need to hold on to the shared token_metadata_ptr
after we got clone_after_all_left().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
684c4143df storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
When yielding in clone_only_token_map or clone_after_all_left,
the token_metadata obtained with get_token_metadata() may go away.

Use get_token_metadata_ptr() instead to hold on to it.

And with that, we don't need to clone_only_token_map.
`metadata` is not modified by calculate_natural_endpoints, so we
can just refer to the immutable copy retrieved with
get_token_metadata_ptr.

Fixes #7790

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:41:58 +02:00
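The lifetime problem described in the commit above can be illustrated with plain std::shared_ptr. This is only a sketch under stated assumptions: token_metadata, shared_token_metadata and count_tokens_across_yield are hypothetical stand-ins rather than Scylla's real classes, and the "yield" is simulated by a callback that runs in the middle of the function.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <utility>

// Hypothetical stand-ins; these are not Scylla's real classes.
struct token_metadata {
    std::set<long> tokens;
};

class shared_token_metadata {
    std::shared_ptr<const token_metadata> _current =
        std::make_shared<const token_metadata>();
public:
    // Returns a reference that is only valid until the next set().
    const token_metadata& get() const { return *_current; }

    // Returns a shared pointer that keeps its snapshot alive for the holder.
    std::shared_ptr<const token_metadata> get_ptr() const { return _current; }

    // Publishes a new snapshot; the old one is freed once nobody holds it.
    void set(token_metadata tm) {
        _current = std::make_shared<const token_metadata>(std::move(tm));
    }
};

// Simulates a function that yields in the middle: on_yield stands for
// whatever other work runs while this function is suspended.
template <typename Func>
std::size_t count_tokens_across_yield(const shared_token_metadata& stm, Func on_yield) {
    auto tmptr = stm.get_ptr();   // pin the snapshot before yielding
    assert(tmptr != nullptr);
    on_yield();                   // another task may call stm.set() here
    return tmptr->tokens.size();  // still safe: tmptr owns the old snapshot
}
```

Had the function kept the result of get() instead, the reference would dangle as soon as the callback published a new snapshot. Holding the pointer costs one reference-count increment and makes the old snapshot outlive the yield.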
Avi Kivity
65a0244614 Update tools/jmx submodule
* tools/jmx 6174a47...20469bf (1):
  > column_family: Return proper cardinality for toppartitions requests
2020-12-13 13:51:38 +02:00
Avi Kivity
9265b87610 Merge "Remove get_local_storage_proxy from validation" from Pavel E
"
The validate_column_family() helper uses the global proxy
reference to get the database. Fortunately, all of its
callers can provide one via an argument.

tests: unit(dev)
"

* 'br-no-proxy-in-validate' of https://github.com/xemul/scylla:
  validation: Remove get_local_storage_proxy call
  client_state: Call validate_column_family() with database arg
  client_state: Add database& arg to has_column_family_access
  storage_proxy: Add .local_db() getters
  validate: Mark database argument const
2020-12-13 13:12:57 +02:00
Avi Kivity
19aaf8eb83 Merge "Remove global storage service from index manager" from Pavel E
"
The initial intent was to remove the call to the global storage service from
the secondary index manager's create_view_for_index(), but along the way
one of the intermediate schema-tables helpers also benefited
by reusing the database reference passed along.

The cleanup is done by simply pushing the database reference down the
stack, from the code that already has it to create_view_for_index().

tests: unit(dev)
"

* 'br-no-storages-in-index-and-schema' of https://github.com/xemul/scylla:
  schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
  schema-tables: Add database argument to make_update_table_mutations
  schema-tables: Factor out calls getting database instance
  index-manager: Move feature evaluation one level up
2020-12-13 12:41:51 +02:00
Benny Halevy
aae3991246 repair: do_decommission_removenode_with_repair: don't deref ops when null
`ops` might be passed as a disengaged shared_ptr when called
from `decommission_with_repair`.

In this case we need to propagate to sync_data_using_repair a
disengaged std::optional<utils::UUID>.

Fixes #7788

DTest: update_cluster_layout_tests:TestUpdateClusterLayout.verify_latest_copy_decommission_node_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213073743.331253-1-bhalevy@scylladb.com>
2020-12-13 12:37:18 +02:00
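The pattern this fix describes, turning a possibly disengaged shared_ptr into a disengaged std::optional instead of dereferencing it, might look like the sketch below. The names node_ops_info and ops_uuid_or_none are hypothetical, and a std::string stands in for utils::UUID.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-in for the operation descriptor; not Scylla's real type.
struct node_ops_info {
    std::string ops_uuid;
};

// Maps a possibly-null shared_ptr to an optional id instead of
// dereferencing it unconditionally.
std::optional<std::string> ops_uuid_or_none(const std::shared_ptr<node_ops_info>& ops) {
    if (!ops) {
        return std::nullopt;   // the caller passed no ops, so propagate "no id"
    }
    return ops->ops_uuid;
}
```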
Avi Kivity
18be57a4e5 Update seastar submodule
* seastar 8b400c7b45...2de43eb6bf (3):
  > core: show span free sizes correctly in diagnostics
  > Merge "IO queues to share capacities" from Pavel E
  > file: make_file_impl: determine blockdev using st_mode
2020-12-12 21:57:01 +02:00
Pekka Enberg
c990f2bd34 Merge 'Reinstate [[nodiscard]] support' from Avi Kivity
The switch to clang disabled the clang-specific -Wunused-value warning
since it generated some harmless warnings. Unfortunately, that also
prevented [[nodiscard]] violations from warning.

Fix by clearing all instances of the warning (including [[nodiscard]]
violations that crept in while it was disabled) and reinstating the warning.

Closes #7767

* github.com:scylladb/scylla:
  build: reinstate -Wunused-value warning for [[nodiscard]]
  test: lib: don't ignore future in compare_readers()
  test: mutation_test: check both ranges when comparing summaries
  serialializer: silence unused value warning in variant deserializer
2020-12-12 09:54:05 +02:00
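For readers unfamiliar with the attribute, here is a minimal sketch of what a [[nodiscard]] violation looks like. The functions same_elements and compare_and_report are hypothetical and unrelated to the helpers touched by the series. The offending calls are left in comments so that the block compiles cleanly.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: the result tells the caller whether the two
// sequences matched, so discarding it would silently hide a mismatch.
[[nodiscard]] bool same_elements(const std::vector<int>& a, const std::vector<int>& b) {
    return a == b;
}

bool compare_and_report(const std::vector<int>& a, const std::vector<int>& b) {
    // same_elements(a, b);        // result ignored: a [[nodiscard]] violation
    // (void)same_elements(a, b);  // explicit discard, accepted without a warning
    return same_elements(a, b);    // the result is actually used
}
```

According to the commit message, turning off -Wunused-value under clang also silenced the diagnostic for the first commented-out line, which is how such violations could creep in unnoticed.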
Avi Kivity
615b8e8184 dist: rpm: uninstall tuned when installing scylla-kernel-conf
tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).

Not needed for .deb, since Debian/Ubuntu do not install tuned by
default.

Fixes #7696

Closes #7776
2020-12-12 09:54:05 +02:00
Pavel Emelyanov
3a025cfa52 schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
Two halves of the tunnel finally connect -- the
latter helper needs the local database instance and
is only called by the former one which already has it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:23:53 +03:00
Pavel Emelyanov
89fd524c5a schema-tables: Add database argument to make_update_table_mutations
There are 3 callers of this helper (cdc, migration manager and tests)
and all of them already have the database object at hand.

The argument will be used by the next patch to remove the call to the global
storage proxy instance from make_update_indices_mutations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:21:22 +03:00
Pavel Emelyanov
1bcef04c7a schema-tables: Factor out calls getting database instance
make_update_indices_mutations gets the database instance
for two things: to find the cf to work with and to get
the value of a feature for index view creation.

To serve both, and to remove the calls to the global storage proxy
and service instances, get the database once at the
function entrance. The next patch will clean this up further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:17:11 +03:00
Pavel Emelyanov
6dd10e771d index-manager: Move feature evaluation one level up
create_view_for_index needs to know the state of the
correct-idx-token-in-secondary-index feature. To get it,
it takes quite a long route through the global storage service
instance.

Since there's only one caller of the method in question,
and the method is called in a loop, it's a bit faster to
get the feature value in the caller and pass it as an argument.

This will also help to get rid of the call to the global
storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:14:12 +03:00
Pavel Emelyanov
83073f4e8b validation: Remove get_local_storage_proxy call
It is used in validate_column_family. The last caller of that helper was removed by
the previous patch, so we can remove the helper itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:52:42 +03:00
Pavel Emelyanov
12cc539835 client_state: Call validate_column_family() with database arg
The previous patch brought the database reference arg. And since
the currently called validate_column_family() overload _just_
gets the database from the global proxy, it's better to shortcut.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:50:49 +03:00
Pavel Emelyanov
b0c4a9087d client_state: Add database& arg to has_column_family_access
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:49:16 +03:00
Pavel Emelyanov
4c7bc8a3d1 storage_proxy: Add .local_db() getters
To facilitate the following patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:48:02 +03:00
Avi Kivity
a11ecfe231 Merge 'types: don't linearize in validate()' from Michał Chojnowski
A sequel to #7692.

This series gets rid of linearization when validating collections and tuple types. (Other types were already validated without linearizing).
The necessary helpers for reading from fragmented buffers were introduced in #7692. All this series does is put them to use in `validate()`.

Refs: #6138

Closes #7770

* github.com:scylladb/scylla:
  types: add single-fragment optimization in validate()
  utils: fragment_range: add with_simplified()
  cql3: statements: select_statement: remove unnecessary use of with_linearized
  cql3: maps: remove unnecessary use of with_linearized
  cql3: lists: remove unnecessary use of with_linearized
  cql3: tuples: remove unnecessary use of with_linearized
  cql3: sets: remove unnecessary use of with_linearized
  cql3: tuples: remove unnecessary use of with_linearized
  cql3: attributes: remove unnecessary uses of with_linearized
  types: validate lists without linearizing
  types: validate tuples without linearizing
  types: validate sets without linearizing
  types: validate maps without linearizing
  types: template abstract_type::validate on FragmentedView
  types: validate_visitor: transition from FragmentRange to FragmentedView
  utils: fragmented_temporary_buffer: add empty() to FragmentedView
  utils: fragmented_temporary_buffer: don't add to null pointer
2020-12-11 17:33:59 +02:00
Pavel Emelyanov
563b466227 validate: Mark database argument const
They are indeed used like that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:27:45 +03:00
Michał Chojnowski
150473f074 types: add single-fragment optimization in validate()
Manipulating fragmented views is costlier than manipulating contiguous views,
so let's detect the common situation when the fragmented view is actually
contiguous underneath, and make use of that.

Note: this optimization is only useful for big types. For trivial types,
validation usually only checks the size of the view.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
e2d17879fc utils: fragment_range: add with_simplified()
Reading from contiguous memory (bytes_view) is significantly simpler
runtime-wise than reading from a fragmented view, due to less state and less
branching, so we often want to convert a fragmented view to a simple view before
processing it, if the fragmented view contains at most one fragment, which is
common. with_simplified() does just that.
2020-12-11 09:53:07 +01:00
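The idea behind with_simplified() can be sketched as follows. This is not the real implementation: fragmented_view, with_simplified_sketch and byte_counter are hypothetical names, and std::string_view stands in for bytes_view. The callback must accept both view types and return the same type for each.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical fragmented view: a list of contiguous chunks.
struct fragmented_view {
    std::vector<std::string_view> fragments;
};

// Calls func with a contiguous view when there is at most one fragment,
// and with the fragmented view otherwise.
template <typename Func>
auto with_simplified_sketch(const fragmented_view& fv, Func&& func) {
    if (fv.fragments.size() <= 1) {
        std::string_view flat =
            fv.fragments.empty() ? std::string_view() : fv.fragments.front();
        return func(flat);   // fast path: contiguous memory
    }
    return func(fv);         // generic path: several fragments
}

// Example callback with one overload per view type.
struct byte_counter {
    std::size_t operator()(std::string_view v) const { return v.size(); }
    std::size_t operator()(const fragmented_view& fv) const {
        std::size_t n = 0;
        for (auto f : fv.fragments) {
            n += f.size();
        }
        return n;
    }
};
```

The branch is taken once per call, so the per-element processing inside the callback does not have to track fragment boundaries in the common single-fragment case.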
Michał Chojnowski
51ca5fa4c5 cql3: statements: select_statement: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
72186bee69 cql3: maps: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
3f3a10c588 cql3: lists: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
efa036329d cql3: tuples: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
4f359a7a99 cql3: sets: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
281417917b cql3: tuples: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
d1d1a00311 cql3: attributes: remove unnecessary uses of with_linearized
We can validate and deserialize directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
0581b3ff31 types: validate lists without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
4fe41b69fd types: validate tuples without linearizing
We can validate tuples directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
a7dd736d03 types: validate sets without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
1459608375 types: validate maps without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
82befbe8c0 types: template abstract_type::validate on FragmentedView
This is primarily a stylistic change. It makes the interface more consistent
with deserialize(). It will also allow us to call `validate()` for collection
elements in `validate_aux()`.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
15dbe00e8a types: validate_visitor: transition from FragmentRange to FragmentedView
This will allow us to easily get rid of linearizations when validating
collections and tuples, because the helpers used in validate_aux() already
have FragmentedView overloads.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
3647c0ba47 utils: fragmented_temporary_buffer: add empty() to FragmentedView
It's redundant with size_bytes(), but sometimes empty() is more readable and
reduces churn when replacing other types with FragmentedView.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
b4dd5d3bdb utils: fragmented_temporary_buffer: don't add to null pointer
When fragmented_temporary_buffer::view is created from a bytes_view,
_current is null. In that case, remove_current() applies an offset to a
null pointer, and ubsan complains. Fix that.
2020-12-11 09:53:07 +01:00
Raphael S. Carvalho
e4b55f40f3 sstables: Fix sstable reshaping for STCS
The heuristic of STCS reshape is correct, and it built the compaction
descriptor correctly, but forgot to return it to the caller, so no
reshape was ever done on behalf of STCS even when the strategy
needed it.

Fixes #7774.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>
2020-12-10 12:45:25 +02:00
Asias He
829b4c1438 repair: Make removenode safe by default
Currently removenode works as follows:

- The coordinator node advertises the node to be removed in
  REMOVING_TOKEN status in gossip

- Existing nodes learn the node in REMOVING_TOKEN status

- Existing nodes sync data for the ranges they own

- Existing nodes send notification to the coordinator

- The coordinator node waits for the notifications and announces the node in
  REMOVED_TOKEN status

Current problems:

- Existing nodes do not tell the coordinator whether the data sync succeeded or failed.

- The coordinator cannot abort the removenode operation in case of error

- A failed removenode operation leaves the node to be removed in
  REMOVING_TOKEN status forever.

- The removenode runs in best effort mode which may cause data
  consistency issues.

  It means if a node that owns the range after the removenode
  operation is down during the operation, the removenode operation
  will continue to succeed without requiring that node to perform data
  syncing. This can cause data consistency issues.

  For example, with five nodes in the cluster and RF = 3, for a range, n1, n2,
  n3 are the old replicas, n2 is being removed, and after the removenode
  operation, the new replicas are n1, n5, n3. If n3 is down during the
  removenode operation, only n1 will be used to sync data with the new
  owner n5. This will break QUORUM read consistency if n1 happens to
  miss some writes.

Improvements in this patch:

- This patch makes the removenode safe by default.

We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.

If the user wants the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.

$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>

Example REST API call:

$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"

- The coordinator can abort data sync on existing nodes

For example, if one of the nodes fails to sync data, it makes no sense for
other nodes to continue to sync data because the whole operation will
fail anyway.

- The coordinator can decide which nodes to ignore and pass the decision
  to other nodes

Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users had to modify the
config file or run a REST API command on all the nodes to select strict
or best effort mode. With this patch, the cluster-wide configuration is
eliminated.

Fixes #7359

Closes #7626
2020-12-10 10:14:39 +02:00
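The "safe by default" rule described above (fail unless every node is up or explicitly ignored) can be sketched as a simple set check. This is an illustration only: blocking_nodes and removenode_may_proceed are hypothetical names, nodes are identified by plain strings, and the real patch also covers aborting and sync failures, which are not modelled here.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns the nodes that block the operation: every node that is down
// and was not explicitly listed as ignorable by the operator.
std::vector<std::string> blocking_nodes(const std::set<std::string>& all_nodes,
                                        const std::set<std::string>& live_nodes,
                                        const std::set<std::string>& ignore_nodes) {
    std::vector<std::string> blocking;
    for (const auto& node : all_nodes) {
        if (!live_nodes.count(node) && !ignore_nodes.count(node)) {
            blocking.push_back(node);
        }
    }
    return blocking;
}

// Strict mode: proceed only when no node blocks the operation.
bool removenode_may_proceed(const std::set<std::string>& all_nodes,
                            const std::set<std::string>& live_nodes,
                            const std::set<std::string>& ignore_nodes) {
    return blocking_nodes(all_nodes, live_nodes, ignore_nodes).empty();
}
```

With the five-node example from the message (n2 being removed, n3 down), the check fails unless n3 is passed in the ignore list, which mirrors the --ignore-dead-nodes option.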
Piotr Sarna
20bdeb315a Merge 'types: add constraint on lexicographical_tri_compare()' from Avi Kivity
Verify that the input types are iterators and their value types are compatible
with the compare function.

Because some of the inputs were not actually valid iterators, they are adjusted
too.

Closes #7631

* github.com:scylladb/scylla:
  types: add constraint on lexicographical_tri_compare()
  composite: make composite::iterator a real input_iterator
  compound: make compount_type::iterator a real input_iterator
2020-12-09 18:48:01 +01:00
Nadav Har'El
a8fdbf31cd alternator: fix UpdateItem ADD for non-existent attribute
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).

We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.

Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(

Fixes #7763.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
2020-12-09 18:44:30 +01:00
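The semantics this commit describes, where ADD on a missing attribute behaves like adding to zero or to an empty set, can be modelled in a few lines. This is a toy model and not Alternator's code: item, add_number and add_to_set are hypothetical, and numbers are plain long values rather than DynamoDB's decimal type.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy item model with numeric and string-set attributes.
struct item {
    std::map<std::string, long> numbers;
    std::map<std::string, std::set<std::string>> sets;
};

// ADD on a number: a missing attribute is treated as 0, so it gets created.
void add_number(item& it, const std::string& attr, long delta) {
    it.numbers[attr] += delta;   // operator[] value-initializes a missing entry to 0
}

// ADD on a set: a missing attribute is treated as an empty set.
void add_to_set(item& it, const std::string& attr, const std::set<std::string>& elems) {
    it.sets[attr].insert(elems.begin(), elems.end());
}
```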
Juliusz Stasiewicz
b150906d39 gossip: Added SNITCH_NAME to application_state
The snitch name needs to be exchanged within the cluster once, during the shadow
round, so that joining nodes cannot use the wrong snitch. The snitch names
are compared on bootstrap and on normal node start.

If the cluster already uses mixed snitches, the upgrade to this
version will fail. In this case the user needs to add a node with the
correct snitch for every node with the wrong snitch, then take
down the nodes with the wrong snitch, and only then do the upgrade.

Fixes #6832

Closes #7739
2020-12-09 15:45:25 +02:00
Nadav Har'El
781f9d9aca alternator: make default timeout configurable
Whereas in CQL the client can pass a timeout parameter to the server, in
the DynamoDB API there is no such feature; the server needs to choose
reasonable timeouts for its own internal operations - e.g., writes to disk,
querying other replicas, etc.

Until now, Alternator had a fixed timeout of 10 seconds for its
requests. This choice was reasonable - it is much higher than we expect
during normal operations, and still lower than the client-side timeouts
that some DynamoDB libraries have (boto3 has a one-minute timeout).
However, there's nothing holy about this number of 10 seconds, and some
installations might want to change this default.

So this patch adds a configuration option, "--alternator-timeout-in-ms",
to choose this timeout. As before, it defaults to 10 seconds (10,000ms).

In particular, some test runs are unusually slow - consider for example
testing a debug build (which is already very slow) in an extremely
over-committed test host. In some cases (see issue #7706) we noticed
the 10 second timeout was not enough. So in this patch we increase the
default timeout chosen in the "test/alternator/run" script to 30 seconds.

Please note that as the code is structured today, this timeout only
applies to some operations, such as GetItem, UpdateItem or Scan, but
does not apply to CreateTable, for example. This is a pre-existing
issue that this patch does not change.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207122758.2570332-1-nyh@scylladb.com>
2020-12-09 14:30:43 +01:00
Avi Kivity
f802356572 Revert "Revert "Merge "raft: fix replication if existing log on leader" from Gleb""
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
2020-12-08 19:19:55 +02:00
Avi Kivity
1badd315ef Merge "Speed up devel tests 10 times" from Pavel E
"
The multishard_mutation_query test is toooo slow when built
with clang in dev mode. By reducing the number of scans it's
possible to shrink the full suite run time from half an hour
down to ~3 minutes.

tests: unit(dev)
"

* 'br-devel-mode-tests' of https://github.com/xemul/scylla:
  test: Make multishard_mutation_query test do less scans
  configure: Add -DDEVEL to dev build flags
2020-12-08 15:42:12 +02:00
Pavel Emelyanov
b837cf25b1 test: Make multishard_mutation_query test do less scans
When built by clang this dev-mode test takes ~30 minutes to
complete. Let's reduce this time by reducing the scale of
the test if DEVEL is set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-08 15:55:04 +03:00
Pavel Emelyanov
703451311f configure: Add -DDEVEL to dev build flags
To let the source code tell debug, dev and release builds
apart.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-08 15:54:30 +03:00
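The two commits above combine a build flag with a compile-time switch in the test. A minimal sketch of that pattern, assuming the build system passes -DDEVEL for dev builds, is shown below. The constant scans_per_run, its values and total_scans are hypothetical and are not taken from the actual test.

```cpp
#include <cassert>

// DEVEL is expected to come from the build system (for example, -DDEVEL
// on the compiler command line of dev builds).
#ifdef DEVEL
constexpr int scans_per_run = 10;    // hypothetical reduced scale for dev builds
#else
constexpr int scans_per_run = 100;   // hypothetical full scale for other builds
#endif

// Total number of scans the test would perform for the given number of runs.
int total_scans(int runs) {
    return runs * scans_per_run;
}
```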