Commit Graph

4972 Commits

Author SHA1 Message Date
Avi Kivity
6ffd813b7b Merge 'hints: delay repair until hints are replayed' from Piotr Dulikowski
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participaring replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.

When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.

This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches marked point, or configured timeout is reached.

This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.

Fixes #8102

Tests:
- unit(dev)
- some manual tests:
    - shutting down repair coordinator during hints replay,
    - shutting down node participating in repair during hints replay,

Closes #8452

* github.com:scylladb/scylla:
  repair: introduce abort_source for repair abort
  repair: introduce abort_source for shutdown
  storage_proxy: add abort_source to wait_for_hints_to_be_replayed
  storage_proxy: stop waiting for hints replay when node goes down
  hints: dismiss segment waiters when hint queue can't send
  repair: plug in waiting for hints to be sent before repair
  repair: add get_hosts_participating_in_repair
  storage_proxy: coordinate waiting for hints to be sent
  config: add wait_for_hint_replay_before_repair option
  storage_proxy: implement verbs for hint sync points
  messaging_service: add verbs for hint sync points
  storage_proxy: add functions for syncing with hints queue
  db/hints: make it possible to wait until current hints are sent
  db/hints: add a metric for counting processed files
  db/hints: allow to forcefully update segment list on flush
2021-05-03 18:47:27 +03:00
Piotr Dulikowski
9d68824327 hints: dismiss segment waiters when hint queue can't send
When a hint queue becomes stuck due to not being able to send to its
destination (e.g. destination node is no longer UP, or we failed to send
some hints from a file), then it's better to immediately dismiss anybody
who waits for hint replay instead of letting them wait until timeout.
2021-04-27 15:58:15 +02:00
Piotr Dulikowski
86d831b319 config: add wait_for_hint_replay_before_repair option
Adds the `wait_for_hint_replay_before_repair` configuration option. If
set to true, the repair coordinator will first wait until the cluster
replays its hints towards the nodes participating in repair. It is set
to true by default, and is live-updateable. It will be used in
subsequent commits from the same PR.
2021-04-27 15:16:26 +02:00
Piotr Dulikowski
82c419870a messaging_service: add verbs for hint sync points
Adds two verbs: HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK.
Those will make it possible to create a sync point and regularly poll
to check its existence.
2021-04-27 15:06:39 +02:00
Piotr Dulikowski
244738b0d5 storage_proxy: add functions for syncing with hints queue
Adds two methods to `storage_proxy`:

- `create_hint_queue_sync_point` - creates a "hint sync point" which
  is kept present in storage_proxy until all hint queues on the local
  node reach their curent end. It will also disappear if given deadline
  is reached first.
- `check_hint_queue_sync_point` - checks if given hint sync point still
  exists.

The created sync point waits for hint queues in all hint managers, on
all shards.
2021-04-27 15:06:39 +02:00
Piotr Dulikowski
427bbf6d86 db/hints: make it possible to wait until current hints are sent
Implements `wait_until_hints_are_replayed` method returning a future
which blocks until either all current hint segments are replayed
(returns success in this case), or when provided timeout is reached
(returns a timeout exception in this case).
2021-04-26 13:57:03 +02:00
Benny Halevy
dad6c94476 view_builder: stop: close all build_step readers
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
02d74e1530 view_update_generator: start: close staging_sstable_reader when done
The staging_sstable_reader has to be closed before it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
1e1c8ea824 view: build_progress_virtual_reader: implement close method
Close underlying reader.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
2d8b00f2d8 view: generate_view_updates: close builder readers when done
Make sure to close the builder's _updates and optional _existings
readers before they are destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
652ba714fe view_builder: initialize_reader_at_current_token: close reader before reassigning it
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
7093610931 view_builder: do_build_step: close build_step reader when done
Make sure to close the build_step reader before destroying it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Benny Halevy
38e48bb462 size_estimates_reader: close partition_reader
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00
Piotr Dulikowski
5a49fe74bb db/hints: add a metric for counting processed files
Adds a field to `end_point_hints_manager::sender`:
`_total_replayed_segments_count` which keeps track of how many segments
were replayed so far. This metric will be used to calculate the
sequence number of the last current hint segments in the queue - so
that we can implement waiting for current segments to be replayed.
2021-04-22 18:45:34 +02:00
Piotr Dulikowski
e48739a6da db/hints: allow to forcefully update segment list on flush
Endpoint hints manager keeps a list of segments to replay. New segments
are appended to it lazily - only when a hint flush occurs (hints
commitlog instance is re-created) and the list is empty. Because of
that, this list cannot be currently used to tell how many segments are
on disk.

This commit allows to trigger hints flush and forcefully update the list
of segments to replay. In later commits, a mechanism will be implemented
which will allow to wait until a given number of hint segments is
replayed. Triggering a hints flush with segment list update will allow
us to properly synchronize and determine up to which segment we need to
wait.
2021-04-22 17:34:04 +02:00
Piotr Sarna
ad661561c8 db: stop using infinite timeout for service level updates
Due to a porting bug, the routines for updating service levels
used the default infinite timeout for internal CQL queries,
which causes Scylla to hang on shutdown. The behavior is now
fixed and the routines use the same timeout as the other
similar functions - 10s at the time of writing this message.
2021-04-22 09:03:21 +02:00
Avi Kivity
daeddda7cc treewide: remove inclusions of storage_proxy.hh from headers
storage_proxy.hh is huge and includes many headers itself, so
remove its inclusions from headers and re-add smaller headers
where needed (and storage_proxy.hh itself in source files that
need it).

Ref #1.
2021-04-20 21:23:00 +03:00
Avi Kivity
14a4173f50 treewide: make headers self-sufficient
In preparation for some large header changes, fix up any headers
that aren't self-sufficient by adding needed includes or forward
declarations.
2021-04-20 21:23:00 +03:00
Kamil Braun
617813ba66 sys_dist_ks: new keyspace for system tables with Everywhere strategy
`system_distributed_everywhere` is a new keyspace that uses Everywhere
replication strategy. This is useful, for example, when we want to store
internal data that should be accessible by every node; the data can be
written using CL=ALL (e.g. during node operations such as node
bootstrap, which require all nodes to be alive - at least currently) and
then read by each node locally using CL=ONE (e.g. during node restarts).

Closes #8457
2021-04-19 11:22:57 +03:00
Eliran Sinvani
dd74556ad9 service/qos: adding service level table to the distributed keyspace
This patch adds the service level table and functions to manipulate it
to the distributed keyspace.

Message-Id: <b6cb7f311ac1ee6802d8f3d78eac9cf40fe21f68.1609161341.git.sarna@scylladb.com>
2021-04-12 15:58:09 +02:00
Benny Halevy
705f9c4f79 commitlog: segment_manager: max_size must be aligned
This was triggered by the test_total_space_limit_of_commitlog dtest.
When it passes a very large commitlog_segment_size_in_mb (1/6th of the
free memory size, in mb), segment_manager constructor limits max_size
to std::numeric_limits<position_type>::max() which is 0xffffffff.

This causes allocate_segment_ex to loop forever when writing the segment
file since `dma_write` returns 0 when the count is unaligned (seen 4095).

The fix here is to select a sligtly small maxsize that is aligned
down to a multiple of 1MB.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>
2021-04-11 13:17:50 +03:00
Pavel Emelyanov
70c851e69b view: Don't expect int from position_in_partition::tri_compare
Now it's int, but soon will be std::strong_ordering.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 18:20:39 +03:00
Piotr Sarna
8e808a56d2 Merge 'commitlog: Fix race and edge condition in delete_segments' from Calle Wilund
Fixes #8363
Fixes #8376

Delete segements has two issues when running with size-limited
commit log and strict adherence to said limit.

1.) It uses parallel processing, with deferral. This means that
    the disk usage variables it looks at might not be fully valid
    - i.e. we might have already issued a file delete that will
    reduce disk footprint such that a segment could instead be
    recycled, but since vars are (and should) only updated
    _post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
    delete a single segment, and this segment is the border segment
    - i.e. the one pushing us over the limit, yet allocation is
    desperately waiting for recycling. In this case we should
    allow it to live on, and assume that next delete will reduce
    footprint. Note: to ensure exact size limit, make sure
    total size is a multiple of segment size.

if we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segements.
abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.

Closes #8372

* github.com:scylladb/scylla:
  commitlog: Add signalling to recycle queue iff we fail to recycle
  commitlog: Fix race and edge condition in delete_segments
  commitlog: coroutinize delete_segments
  commitlog_test: Add test for deadlock in recycle waiter
2021-04-07 15:13:25 +02:00
Nadav Har'El
0dd6f2db8f Merge 'CDC generations: refactors and improvements' from Kamil Braun
The "most important" major changes are:

1. storage_service: simplify CDC generation management during node replace

Previously, when node A replaced node B, it would obtain B's
generation timestamp from its application state (gossiped by other
nodes) and start gossiping it immediately on bootstrap.

But that's not necessary:
  - if this is the timestamp of the last (current) generation, we would
     obtain it from other nodes anyway (every node gossips the last known
     timestamp),
  - if this is the timestamp of an earlier generation, we would forget
     it immediately and start gossiping the last timestamp (obtained from
     other nodes).

This commit simplifies the bootstrap code (in node-replace case) a bit:
the replacing node no longer attempts to retrieve the CDC generation
timestamp from the node being replaced.

2. tree-wide: introduce cdc::generation_id type

Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.

However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.

Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.

In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.

This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
- if a piece of code doesn't care about the timestamp, it just passes
   the ID around
- if it does care, it can access it using the `get_ts` function.
   The fact that `get_ts` simply accesses the ID's only field is an
   implementation detail.

3. cdc: handle missing generation case in check_and_repair_cdc_streams

check_and_repair_cdc_streams assumed that there is always at least
one generation being gossiped by at least one of the nodes. Otherwise it
would enter undefined behavior.

I'm not aware of any "real" scenario where this assumption wouldn't be
satisfied at the moment where check_and_repair_cdc_streams makes it
except perhaps some theoretical races. But it's best to stay on the safe
side.

---

Additionally the PR does some simplifications, stylistic improvements,
removes some dead code, coroutinizes some functions, uncoroutinizes others
(due to miscompiles), adds additional logging, updates some stale comments.
Read commit messages for more details.

Closes #8283

* github.com:scylladb/scylla:
  cdc: log a message when creating a new CDC generation
  cdc: handle missing generation case in check_and_repair_cdc_streams
  tree-wide: introduce cdc::generation_id type
  tree-wide: rename "cdc streams timestamp" to "cdc generation id"
  cdc: remove some functions from generation.hh
  storage_service: make set_gossip_tokens a static free-function
  db: system_keyspace: group cdc functions in single place
  cdc: get rid of "get_local_streams_timestamp"
  sys_dist_ks: update comment at quorum_if_many
  storage_service: simplify CDC generation management during node replace
2021-04-07 14:49:02 +03:00
Kamil Braun
99fd2244a3 tree-wide: introduce cdc::generation_id type
This is a follow-up to the previous commit.

Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.

However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.

Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.

In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.

This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
   the ID around
2. if it does care, it can simply access it using the `get_ts` function.
   The fact that `get_ts` simply accesses the ID's only field is an
   implementation detail.

Using the occasion, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).
2021-04-07 13:47:13 +02:00
Avi Kivity
5109bf8b99 config: relax batch size warning and failure thresholds
We inherited very low threshold for warning and failing multi-partition
batches, but these warnings aren't useful. The size of a batch in bytes
as no impact on node stability. In fact the warnings can cause more
problems if they flood the log.

Fix by raising the warning threshold to 128 kiB (our magic size)
and the fail threshold to 1 MiB.

Fixes #8416.

Closes #8417
2021-04-06 20:56:06 +03:00
Calle Wilund
d734f85280 commitlog: Add signalling to recycle queue iff we fail to recycle
Fixes #8376

If a recycle should fail, we will sort of handle it by deleting
the segment, so no leaks. But if we have waiter(s) on the recycle
queue, we could end up deadlocked/starved because nothing is
incoming there.

This adds an abort of the queue iff we failed and no objects are
available. This will wake up any waiter, and he should retry,
and hopefully at least be able to create a new segment.
We then reset the queue to a new one. So we can go on.

v2:
* Forgot to reset queue
v3:
* Nicer exception handling in allocate_segment_ex
2021-04-06 16:38:14 +00:00
Calle Wilund
15dd76f0c2 commitlog: Fix race and edge condition in delete_segments
Fixes #8363

Delete segements has two issues when running with size-limited
commit log and strict adherence to said limit.

1.) It uses parallel processing, with deferral. This means that
    the disk usage variables it looks at might not be fully valid
    - i.e. we might have already issued a file delete that will
    reduce disk footprint such that a segment could instead be
    recycled, but since vars are (and should) only updated
    _post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
    delete a single segment, and this segment is the border segment
    - i.e. the one pushing us over the limit, yet allocation is
    desperately waiting for recycling. In this case we should
    allow it to live on, and assume that next delete will reduce
    footprint. Note: to ensure exact size limit, make sure
    total size is a multiple of segment size.

Fixed by
a.) Doing delete serialized. It is not like being parallel here will
    win us speed awards. And now we can know exact footprint, and
    how many segments we have left to delete
b.) Check if we are a block across the footprint boundry, and people
    might be waiting for a segment. If so, don't delete segment, but
    recycle.

As a follow-up, we should probably instead adjust the commitlog size
limit (per shard) to be a multiple of segment sizes, but there is
risks in that too.
2021-04-06 16:38:14 +00:00
Calle Wilund
d9a9897892 commitlog: coroutinize delete_segments
Because we like cow routines.
2021-04-06 16:38:14 +00:00
Calle Wilund
813694b617 commitlog_test: Add test for deadlock in recycle waiter
Not a very good test, mind you. Nothing to verify, just see if
the test times out. But try to make it at least complete for
failure report.
2021-04-06 16:38:14 +00:00
Konstantin Osipov
c83cf1f965 uuid: switch the API to use std::chrono
A follow up for the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.

The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.

For one, we plan to switch the entire code base to std::chrono units,
to ensure type safety. Secondly, using std::chrono units allows to
increase code reuse with template metaprogramming and remove a few
of UUID_gen functions that beceme redundant as a result.

* switch  get_time_UUID(), unix_timestamp(), get_time_UUID_raw(), switch
  min_time_UUID(), max_time_UUID(), create_time_safe() to
  std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
  redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse to similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
2021-04-06 17:12:54 +03:00
Kamil Braun
e486e0f759 tree-wide: rename "cdc streams timestamp" to "cdc generation id"
Each CDC generation always has a timestamp, but the fact that the
timestamp identifies the generation is an implementation detail.
We abstract away from this detail by using a more generic naming scheme:
a generation "identifier" (whatever that is - a timestamp or something
else).

It's possible that a CDC generation will be identified by more than a
timestamp in the (near) future.

The actual string gossiped by nodes in their application state is left
as "CDC_STREAMS_TIMESTAMP" for backward compatibility.

Some stale comments have been updated.
2021-04-06 13:15:31 +02:00
Kamil Braun
1019ff07cb db: system_keyspace: group cdc functions in single place 2021-04-06 13:15:31 +02:00
Kamil Braun
3cebe99613 sys_dist_ks: update comment at quorum_if_many
The comment mentioned tables that no longer exist: their names have
changed some time ago. Update the comment to be name-agnostic.

Furthemore, the second part of the comment related to a case of "joining
a node without bootstrapping". Fortunately this operation is no longer
possible (after #6848 which became part of Scylla 4.3) so we can shorten
the comment.
2021-04-06 13:15:31 +02:00
Avi Kivity
56cd058b34 config: correct description of listen_address
- it does not support using interface names
 - listen_interface is not supported
 - 0.0.0.0 will work (and is reasonable) if you set broadcast_address
 - empty setting is not supported

Fixes #8381.

Closes #8409
2021-04-05 14:06:48 +03:00
Avi Kivity
82c76832df treewide: don't include "db/system_distributed_keyspace.hh" from headers
This just causes unneeded and slower recompliations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.

Ref #1.

Closes #8403
2021-04-04 14:00:26 +03:00
Kamil Braun
641040d465 sys_dist_ks: remove dead code (expire_cdc_* functions)
These functions were not used anywhere but had to be maintained anyway.
When (if) the expiration algorithm actually gets implemented (see issue #7300),
the functions can be added back (perhaps they will need to look differently
at that time, and it's likely that the `expire` column won't be used in the
expiration algorithm in the end anyway).
2021-04-04 13:12:12 +03:00
Kamil Braun
4f3f245188 sys_dist_ks: coroutinize system_distributed_keyspace::start 2021-04-04 13:10:44 +03:00
Tomasz Grabiec
307bd354d2 Merge 'hints: use token_metadata to tell if node has left the ring' from Piotr Dulikowski
This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function.

Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node.

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)

Closes #8387

* github.com:scylladb/scylla:
  hints: clarify docstring comment for can_send
  hints: use token_metadata to tell if node is in the ring
  hints: slightly reogranize "if" statement in can_send
  storage_service: release token_metadata lock before notify_left
  storage_service: notify_left after token_metadata is replicated
2021-04-01 15:51:46 +02:00
Piotr Dulikowski
6a1152ea9b hints: clarify docstring comment for can_send
Now, the docstring comment next to can_send better represents the
condition that is checked inside that function. The statement about
returning true when destination left the NORMAL state is replaced with a
statement about returning true when the destination has left the ring.
2021-04-01 03:58:29 +02:00
Piotr Dulikowski
4f90514247 hints: use token_metadata to tell if node is in the ring
Now, instead of looking at the gossiper state to check if the
destination node is still in the ring, we are using token_metadata as a
source of truth. This results in much simpler code in can_send() as
token_metadata has an is_member method which does exactly what we want.
2021-04-01 03:58:29 +02:00
Piotr Dulikowski
e7d9057d0c hints: slightly reogranize "if" statement in can_send
This commit reverses the order of if-else blocks in can_send, which
makes it - in my opinion, at least - slightly easier to read.
2021-04-01 03:58:29 +02:00
Piotr Jastrzebski
57c7964d6c config: ignore enable_sstables_mc_format flag
Don't allow users to disable MC sstables format any more.
We would like to retire some old cluster features that has been around
for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this
we first have to make sure that all existing clusters have them enabled.
It is impossible to know that unless we stop supporting
enable_sstables_mc_format flag.

Test: unit(dev)

Refs #8352

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8360
2021-03-31 12:23:59 +03:00
Calle Wilund
c0666ea89b commitlog: Fix inner loop condition in allocation pre-fill
Fixes #8369

This was originally found (and fixed) by @gleb-cloudius, but the patch set with
the fix was reverted at some point, and the fix went away. Now the error remains
even in new, nice coroutine code.

We check the wrong var in the inner loop of the pre-fill path of
allocate_segment_ex, often causing us to generate giant writev:s of more or less
the whole file.  Not intended.

Closes #8370
2021-03-30 12:14:55 +02:00
Nadav Har'El
ccc75bfe2a Merge 'Disable thrift by default' from Piotr Sarna
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.

Fixes #8336

Closes #8338

* github.com:scylladb/scylla:
  docs: mention disabling Thrift by default
  db,config: disable Thrift by default
2021-03-29 12:48:20 +03:00
Piotr Wojtczak
c1daf2bb24 column_family: Make toppartitions queries more generic
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.

We provide three ways for filtering in the query parameter "name_list":
    1. A specific column family to include in the form "ks:cf"
    2. A keyspace, telling the server to include all column families in it.
       Specified by omitting the cf name, i.e. "ks:"
    3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.

Fixes #4520

Closes #7864
2021-03-24 17:54:05 +02:00
Pavel Emelyanov
37bec6fb76 commitlog: Open files with append_is_unlikely
This open option tells seastar that the file in question
will be truncated to the needed size right at once and all
the subsequent writes will happen within this size. This
hint turns off append optimization in seastar that's not
that cheap and helps so save few cpu cycles.

The option was introduced in seastar by 8bec57bc.

tests: unit(dev), dtest(commitlog:
                        test_batch_commitlog,
                        test_periodic_commitlog,
                        test_commitlog_replay_on_startup)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
2021-03-24 13:05:33 +02:00
Piotr Sarna
e2443337d9 db,config: disable Thrift by default
It will still be possible to use Thrift once it's enabled
in the yaml file, but it's better to not open this port
by default, since Thrift is definitely not the first choice
for Scylla users.

Fixes #8336
2021-03-22 10:54:26 +01:00
Avi Kivity
972ea9900c Merge 'commitlog: Make pre-allocation drop O_DSYNC while pre-filling' from Calle Wilund
Refs #7794

Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

Closes #8250

* github.com:scylladb/scylla:
  commitlog: Make pre-allocation drop O_DSYNC while pre-filling
  commitlog: coroutinize allocate_segment_ex
2021-03-17 09:59:22 +02:00
Calle Wilund
48ca01c3ab commitlog: Make pre-allocation drop O_DSYNC while pre-filling
Refs #7794

Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
2021-03-15 09:35:45 +00:00