Compare commits

...

261 Commits

Author SHA1 Message Date
Shlomi Livne
d82f2fb7ee release: prepare for 2.0.5
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-05-24 14:27:09 +03:00
Avi Kivity
5aaa8031a2 Update seastar submodule
* seastar f5162dc...da2e1af (1):
  > net/tls: Wait for output to be sent when shutting down

Fixes #3459.
2018-05-24 12:02:15 +03:00
Avi Kivity
3d50e7077a Merge "Backport fixes for streaming segfault with bogus dst_cpu_id for 2.0" from Asias
"
The minimal changes that make the backport of "streaming: Do send failed
message for uninitialized session" possible without backport conflicts.

Fixes a similar issue we saw:

  https://github.com/scylladb/scylla/issues/3115
"

* tag 'asias/backport_issue_3115_for_2.0/v1' of github.com:scylladb/seastar-dev:
  streaming: Do send failed message for uninitialized session
  streaming: Introduce streaming::abort()
  streaming: Log peer address in on_error
  streaming: Check if _stream_result is valid
  streaming: Introduce received_failed_complete_message
2018-05-24 11:14:20 +03:00
Avi Kivity
4063e92f57 dist: redhat: get rid of raid0.devices_discard_performance
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed in 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it causes the module load to fail, with the result that the
storage is not available.

Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.

Fixes #3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>

(cherry picked from commit 3b8118d4e5)
2018-05-24 11:08:13 +03:00
Asias He
b6de30bb87 streaming: Do send failed message for uninitialized session
The uninitialized session has no peer associated with it yet. There is
no point in sending the failed message when aborting the session. Sending
the failed message in this case will send it to a peer with an uninitialized
dst_cpu_id, which causes the receiver to pass a bogus shard id to
smp::submit_to, which causes a segfault.

In addition, to be safe, initialize the dst_cpu_id to zero, so that an
uninitialized session will send messages to shard zero instead of a
random bogus shard id.

Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test

Fixes #3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>

(cherry picked from commit 774307b3a7)
2018-05-24 15:24:29 +08:00
Asias He
c23e3a1eda streaming: Introduce streaming::abort()
It will be used soon by stream_plan::abort() to abort a stream session.

(cherry picked from commit fad34801bf)
2018-05-24 15:21:54 +08:00
Asias He
2732b6cf1d streaming: Log peer address in on_error
(cherry picked from commit 8a3f6acdd2)
2018-05-24 15:20:43 +08:00
Asias He
49722e74da streaming: Check if _stream_result is valid
If on_error() was called before init() was executed, the
_stream_result can be invalid.

(cherry picked from commit be573bcafb)
2018-05-24 15:20:02 +08:00
Asias He
ba7623ac55 streaming: Introduce received_failed_complete_message
It is the handler for the failed complete message. Add a flag to
remember whether we received such a message from the peer; if so, do
not send the failed complete message back to the peer when running
close_session with failed status.

(cherry picked from commit eace5fc6e8)
2018-05-24 15:19:26 +08:00
Avi Kivity
9db2ff36f2 dist: redhat: fix binutils dependency on alternatives
Since /sbin is a symlink, file dependencies on /sbin/alternatives
don't work.

Change to /usr/bin to fix.
2018-04-26 12:00:42 +03:00
Avi Kivity
378029b8da schema_tables: discard [[nodiscard]] attribute
Not supported on gcc 5.3, which is used to build this branch.
2018-04-25 18:37:17 +03:00
Mika Eloranta
2b7644dc36 build: fix rpm build script --jobs N handling
Fixes argument misquoting in the $SRPM_OPTS expansion for the mock
commands and makes the --jobs argument work as intended.

Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212904.85907-1-mel@aiven.io>
(cherry picked from commit 7266446227)
2018-04-25 17:47:09 +03:00
Duarte Nunes
4bd931ba59 db/schema_tables: Only drop UDTs after merging tables
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.

When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.

Fixes #3068

Tests: unit-tests (release)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
(cherry picked from commit 1e3fae5bef)
2018-04-25 01:15:45 +03:00
Avi Kivity
78eebe74c7 loading_cache: adjust code for older compilers
Need this-> qualifier in a generic lambda, and static_assert()s need
a message.
2018-04-23 16:12:25 +03:00
Avi Kivity
30e21afb13 streaming: adjust code for older compilers
Need this-> qualifier in a generic lambda.
2018-04-23 16:12:25 +03:00
Shlomi Livne
e8616b10e5 release: prepare for 2.0.4
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-04-23 09:11:35 +03:00
Avi Kivity
0cb842dde1 Update seastar submodule
* seastar e7facd4...f5162dc (1):
  > tls: Ensure we always pass through semaphores on shutdown

Fixes #3358.
2018-04-14 20:52:57 +03:00
Gleb Natapov
7945f5edda cql_server: fix a race between closing of a connection and notifier registration
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete, a stale pointer to the connection will remain in the
notification list, since the attempt to unregister the connection
happens too early. The fix is to move notifier unregistration to after
the connection's gate is closed, which ensures that there is no
outstanding registration request. But this means that a connection with
a closed gate can now be in the notifier list, so with_gate() may throw
and abort the notifier loop. Fix that by replacing with_gate() with a
call to is_closed().

Fixes: #3355
Tests: unit(release)

Message-Id: <20180412134744.GB22593@scylladb.com>
(cherry picked from commit 1a9aaece3e)
2018-04-12 16:57:30 +03:00
Asias He
9c2a328000 gossip: Relax generation max difference check
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node

I saw that node2 could not bootstrap; it was stuck waiting for schema information to complete forever:

On node1, node3

    [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704

On node2

    [shard 0] storage_service - JOINING: waiting for schema information to complete

This is because during the nodetool removenode operation, the generation of node2 was increased from 0 to 2.

   gossiper::advertise_removing() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
   gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();

Each force_newer_generation_unsafe increases the generation by 1.

Here is an example,

Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
   "addrs": "127.0.0.2",
   "generation": 0,
   "is_alive": false,
   "update_time": 1521778757334,
   "version": 0
   },
```

After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
 {
     "addrs": "127.0.0.2",
     "application_state": [
         {
             "application_state": 0,
             "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
             "version": 214
         },
         {
             "application_state": 6,
             "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
             "version": 153
            }
     ],
     "generation": 2,
     "is_alive": false,
     "update_time": 1521779276246,
     "version": 0
 },
```

In gossiper::apply_state_locally, we have this check:

```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
    logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}", ep, local_generation, remote_generation);
}
```
to skip the gossip update.

To fix, we relax the generation max-difference check to allow the
generation of a removed node.

After this patch, the removed node bootstraps successfully.
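The check quoted above and the described relaxation can be sketched as follows. This is a hypothetical standalone sketch: both the constant's value and the exact shape of the relaxed condition are assumptions, not the actual Scylla code.

```cpp
#include <cstdint>

// Assumed value; the real MAX_GENERATION_DIFFERENCE in the gossiper may differ.
constexpr int64_t MAX_GENERATION_DIFFERENCE = 86400LL * 365;

// The original check: reject a remote generation that is implausibly far
// ahead of the locally recorded one.
bool reject_generation(int64_t local_generation, int64_t remote_generation) {
    return local_generation != 0 &&
           remote_generation > local_generation + MAX_GENERATION_DIFFERENCE;
}

// Sketch of the relaxation: a tiny local generation (e.g. 2, left over from
// force_newer_generation_unsafe() during removenode) is not epoch-based, so a
// legitimate epoch-seconds generation from a re-bootstrapped node should not
// be rejected against it.
bool reject_generation_relaxed(int64_t local_generation, int64_t remote_generation) {
    if (local_generation < MAX_GENERATION_DIFFERENCE) {
        return false; // local value is not epoch-based; accept the update
    }
    return reject_generation(local_generation, remote_generation);
}
```

With the values from the log above (local generation 2, received generation 1521779704), the original check rejects the update while the relaxed one accepts it.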

Tests: dtest:update_cluster_layout_tests.py
Fixes #3331

Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
(cherry picked from commit f539e993d3)
2018-03-29 12:10:37 +03:00
Tomasz Grabiec
98498c679b test.py: set BOOST_TEST_CATCH_SYSTEM_ERRORS=no
This will make boost UTF abort execution on SIGABRT rather than trying
to continue running other test cases, which doesn't work well with
seastar integration: the suite will hang.
Message-Id: <1516205469-16378-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit ab6ec571cb)
2018-03-27 22:42:00 +03:00
Duarte Nunes
b147b5854b view_schema_test: Retry failed queries
Due to the asynchronous nature of view update propagation, results
might still be absent from views when we query them. To be able to
deterministically assert on view rows, this patch retries a query a
bounded number of times until it succeeds.
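The bounded-retry idea can be sketched as below; `query_with_retries` and its parameters are hypothetical names for illustration, since the real test uses Scylla's CQL test fixtures and futures.

```cpp
#include <functional>
#include <string>
#include <vector>

using rows = std::vector<std::string>;

// Re-run `query` up to `max_attempts` times until `accept` is satisfied
// (e.g. the expected view rows have finally appeared), then return the
// last result. A real implementation would sleep between attempts.
rows query_with_retries(const std::function<rows()>& query,
                        const std::function<bool(const rows&)>& accept,
                        int max_attempts) {
    rows result;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        result = query();
        if (accept(result)) {
            break; // view updates have propagated
        }
    }
    return result;
}
```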

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718212646.2958-1-duarte@scylladb.com>
(cherry picked from commit ab72132cb1)
2018-03-27 22:42:00 +03:00
Duarte Nunes
226095f4db types: Implement hash() for collections
This patch provides a rather trivial implementation of hash() for
collection types.

It is needed for view building, where we hold mutations in a map
indexed by partition keys (and frozen collection types can be part of
the key).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718192107.13746-1-duarte@scylladb.com>
(cherry picked from commit 3bfcf47cc6)
2018-03-27 22:42:00 +03:00
Avi Kivity
3dd1f68590 tests: fix view_schema_test with clang
Clang is happy to create a vector<data_value> from a {}, a {1, 2}, but not a {1}.
No doubt it is correct, but sheesh.

Make the data_value explicit to humor it.
Message-Id: <20170713074315.9857-1-avi@scylladb.com>

(cherry picked from commit 162d9aa85d)
2018-03-27 22:25:17 +03:00
Avi Kivity
e08e4c75d7 Merge "Fix abort during counter table read-on-delete" from Tomasz
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (which happens on a counter table mutation with no live
rows) that also doesn't have any static columns. In such a case, the
sstable_mutation_reader will set up the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later, when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to an attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.

Fixes #3304.
"

* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test reads with no clustering ranges and no static columns
  tests: simple_schema: Allow creating schema with no static column
  clustering_ranges_walker: Stop after static row in case no clustering ranges

(cherry picked from commit 054854839a)
2018-03-22 18:16:37 +02:00
Vlad Zolotarov
8bcb4e7439 test.py: limit the tests to run on 2 shards with 4GB of memory
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 57a6ed5aaa)
2018-03-22 12:47:12 +02:00
Avi Kivity
97369adb1c Merge "streaming backport for 2.0 branch" from Asias
"
Backport streaming improvements and bug fixes from the 2.1 branch. With this
series, adding and decommissioning nodes are more robust, because the single
big stream plan is split into multiple smaller stream plans. In addition,
failed stream plans will be retried automatically.
"

Fixes #3285, Fixes #3310, Fixes #3065, Fixes #1743, Fixes #3311.

* 'asias/ticket-352' of github.com:scylladb/seastar-dev:
  range_streamer: Stream 10% of ranges instead of 10 ranges per time
  Revert "streaming: Do not abort session too early in idle detection"
  dht: Fix log in range_streamer
  streaming: One cf per time on sender
  messaging_service: Get rid of timeout and retry logic for streaming verb
  storage_service: Remove rpc client on all shards in on_dead
  Merge "streaming error handling improvement" from Asias
  streaming: Fix streaming not streaming all ranges
  Merge "Use range_streamer everywhere" from Asias
2018-03-22 10:39:08 +02:00
Duarte Nunes
c89ead5e55 gms/gossiper: Synchronize endpoint state destruction
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.

This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).

Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.

Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.

Fixes #3299.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
(cherry picked from commit 810db425a5)
2018-03-18 14:55:50 +02:00
Asias He
46fd96d877 range_streamer: Stream 10% of ranges instead of 10 ranges per time
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plans to be created to stream the
data, each carrying very little data. This causes each stream plan to
have low transfer bandwidth, so the total time to complete the
streaming increases.

It makes more sense to send a percentage of the total ranges per stream
plan than a fixed number of ranges.
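Sketched below, with the rounding and the minimum of one range per plan as assumptions (the real range_streamer may compute this differently):

```cpp
#include <algorithm>
#include <cstddef>

// Stream a fraction of the total ranges per stream plan instead of a
// fixed count of 10. With 513 ranges in total, 10% gives 51 ranges
// per plan.
size_t ranges_per_stream_plan(size_t total_ranges, size_t percentage = 10) {
    // at least one range per plan, otherwise streaming would never progress
    return std::max<size_t>(1, total_ranges * percentage / 100);
}
```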

Here is an example to stream a keyspace with 513 ranges in
total, 10 ranges v.s. 10% ranges:

Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds

After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds

Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
(cherry picked from commit 9b5585ebd5)
2018-03-15 10:11:08 +08:00
Asias He
19806fc056 Revert "streaming: Do not abort session too early in idle detection"
This reverts commit f792c78c96.

With the "Use range_streamer everywhere" (7217b7ab36) series,
all the users of streaming now stream relatively small ranges
and can retry streaming at a higher level.

This reduces the time-to-recover from 5 hours to 10 minutes per stream session.

Even if the 10-minute idle detection might cause more false positives,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with one
where, whenever the stream initiator goes away, the stream slave goes
away too.

Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
(cherry picked from commit ad7b132188)
2018-03-15 10:10:58 +08:00
Asias He
0b314a745f dht: Fix log in range_streamer
The address and keyspace should be swapped.

Before:
  range_streamer - Bootstrap with ks3 for keyspace=127.0.0.1 succeeded,
  took 56 seconds

After:
  range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded,
  took 56 seconds

Message-Id: <5c49646f1fbe45e3a1e7545b8470e04b166922c4.1520416042.git.asias@scylladb.com>
(cherry picked from commit 73d8e2743f)
2018-03-13 15:07:13 +08:00
Asias He
73870751d9 streaming: One cf per time on sender
When there is a large number of column families, the sender will send
all of them in parallel. We allow 20% of shard memory for streaming on
the receiver, so each column family gets 1/N of that memory for its
memtable, where N is the number of in-flight column families. A large N
causes a lot of small sstables to be generated.

It is possible for there to be multiple senders to a single receiver,
e.g., when a new node joins the cluster; the maximum number of
in-flight column families is then the number of peer nodes. The column
families are sent in the order of cf_id. It is not guaranteed that all
peers have the same speed, so they are not necessarily sending the same
cf_id at the same time; still, there is a chance that some of the peers
are sending the same cf_id.

Fixes #3065

Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
(cherry picked from commit a9dab60b6c)
2018-03-13 12:20:41 +08:00
Asias He
c8983034c0 messaging_service: Get rid of timeout and retry logic for streaming verb
With the "Use range_streamer everywhere" (7217b7ab36) series, all
the users of streaming now stream relatively small ranges and
can retry streaming at a higher level.

There are problems with timeout and retry at the RPC verb level in streaming:
1) Timeouts can be false negatives.
2) We cannot cancel send operations that have already been issued. When
the user aborts the streaming, the retry logic keeps running for a long
time.

This patch removes all the timeout and retry logic for streaming verbs.
After this, the timeout is the job of TCP, and the retry is the job of
the upper layer.

Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>
(cherry picked from commit 8fa35d6ddf)
2018-03-13 12:20:41 +08:00
Asias He
77d14a6256 storage_service: Remove rpc client on all shards in on_dead
We should close connections to nodes that are down on all shards instead
of the shard which runs the on_dead gossip callback.

Found by Gleb.
Message-Id: <527a14105a07218066e9f1da943693d9de6993e5.1505894260.git.asias@scylladb.com>

(cherry picked from commit 173cba67ba)
2018-03-13 12:20:41 +08:00
Avi Kivity
21259bcfb3 Merge "streaming error handling improvement" from Asias
"This series improves the streaming error handling so that when one side of the
streaming fails, it will propagate the error to the other side and the peer
will close the failed session accordingly. This removes the unnecessary wait
and timeout time for the peer to discover the failed session and fail
eventually.

Fix it by:

- Using the complete message to notify the peer node that the local session has failed
- Listening on the shutdown gossip callback so that, when we detect that the
  peer is shut down, we can close the session with the peer

Fixes #1743"

* tag 'asias/streaming/error_handling_v2' of github.com:cloudius-systems/seastar-dev:
  streaming: Listen on shutdown gossip callback
  gms: Add is_shutdown helper for endpoint_state class
  streaming: Send complete message with failed flag when session is failed
  streaming: Handle failed flag in complete message
  streaming: Do not fail the session when failed to send complete message
  streaming: Introduce send_failed_complete_message
  streaming: Do not send complete message when session is successful
  streaming: Introduce the failed parameter for complete message
  streaming: Remove unused session_failed function
  streaming: Less verbose in logging
  streaming: Better stats

(cherry picked from commit d5aba779d4)
2018-03-13 12:20:41 +08:00
Tomasz Grabiec
9f02b44537 streaming: Fix streaming not streaming all ranges
It skipped one sub-range in each batch of 10 ranges, and
tried to access the range vector using the end() iterator.

Fixes sporadic failures of
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test.

Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 741ec61269)
2018-03-13 10:36:19 +08:00
Avi Kivity
9dc7a63014 Merge "Use range_streamer everywhere" from Asias
"With this series, all the following cluster operations:

- bootstrap
- rebuild
- decommission
- removenode

will use the same code to do the streaming.

The range_streamer is now extended to support both fetching from and
pushing to a peer node. Another big change is that the range_streamer
will now stream fewer ranges at a time, and hence less data, per
stream_plan, and the range_streamer will remember which ranges failed
to stream and can retry them later.

The retry policy is very simple at the moment: it retries at most 5 times
and sleeps 1 minute, 1.5^2 minutes, 1.5^3 minutes, and so on.

Later, we can introduce an API for the user to decide when to stop
retrying and what the retry interval should be.

The benefits:

 - All the cluster operations share the same code to stream.
 - We can know the operation progress, e.g., the total number of
   ranges that need to be streamed and the number of ranges finished in
   bootstrap, decommission, etc.
 - All the cluster operations can survive a peer node going down during the
   operation, which usually takes a long time to complete. E.g., when adding
   a new node, currently if any of the existing nodes that stream data to
   the new node has an issue sending data to it, the whole bootstrap
   process will fail. After this patch, we can fix the problematic node
   and restart it, and the joining node will retry streaming from that node
   again.
 - We can fail streaming early, time out early and retry less, because
   all the operations that use streaming can survive the failure of a single
   stream_plan. It is no longer that important to make a single
   stream_plan successful. Note that another user of streaming, repair, is
   now using small stream_plans as well and can rerun the repair for the
   failed ranges too.

This is one step closer to supporting resumable add/remove node
operations."
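The retry schedule quoted above can be sketched as follows, taking the wording literally (first sleep 1 minute, then 1.5^2, 1.5^3, ... minutes); how the real range_streamer computes the exponents is an assumption here.

```cpp
#include <cmath>

constexpr int max_retries = 5; // retry at most 5 times, per the description

// Minutes to sleep before the n-th retry (1-based), per the quoted schedule.
double sleep_minutes_before_retry(int retry) {
    return retry == 1 ? 1.0 : std::pow(1.5, retry);
}
```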

* tag 'asias/use_range_streamer_everywhere_v4' of github.com:cloudius-systems/seastar-dev:
  storage_service: Use the new range_streamer interface for removenode
  storage_service: Use the new range_streamer interface for decommission
  storage_service: Use the new range_streamer interface for rebuild
  storage_service: Use the new range_streamer interface for bootstrap
  dht: Extend range_streamer interface

(cherry picked from commit 7217b7ab36)
2018-03-13 10:34:10 +08:00
Asias He
5dcef25f6f storage_service: Add missing return in pieces empty check
If pieces is empty, it is bogus to access pieces[0]:

   sstring move_name = pieces[0];

Fix by adding the missing return.

Spotted by Vlad Zolotarov <vladz@scylladb.com>

Fixes #3258
Message-Id: <bcb446f34f953bc51c3704d06630b53fda82e8d2.1520297558.git.asias@scylladb.com>

(cherry picked from commit 8900e830a3)
2018-03-06 09:58:39 +02:00
Duarte Nunes
f763bf7f0d Merge 'backport of the "loading_shared_values and size limited and evicting prepared statements cache" series' from Vlad
This backport includes changes from the "loading_shared_values and size limited and evicting prepared statements cache" series,
"missing bits from loading_shared_values series" series and the "cql_transport::cql_server: fix the distributed prepared statements cache population"
patch (the last one is squashed inside the "cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache"
patch in order to avoid the bisect breakage).

* 'branch-2-0-backport-prepared-cache-fixes-v1' of https://github.com/vladzcloudius/scylla:
  tests: loading_cache_test: initial commit
  utils + cql3: use a functor class instead of std::function
  cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
  transport::server::process_prepare() don't ignore errors on other shards
  cql3: prepared statements cache on top of loading_cache
  utils::loading_cache: make the size limitation more strict
  utils::loading_cache: added static_asserts for checking the callbacks signatures
  utils::loading_cache: add a bunch of standard synchronous methods
  utils::loading_cache: add the ability to create a cache that would not reload the values
  utils::loading_cache: add the ability to work with not-copy-constructable values
  utils::loading_cache: add EntrySize template parameter
  utils::loading_cache: rework on top of utils::loading_shared_values
  sstables::shared_index_list: use utils::loading_shared_values
  utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
2018-02-22 10:04:41 +00:00
Tomasz Grabiec
9af9ca0d60 tests: mutation_source_tests: Fix use-after-scope on partition range
Message-Id: <1506096881-3076-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit d11d696072)
2018-02-19 14:30:47 +00:00
Avi Kivity
fbc30221b5 row_cache_test: remove unused overload populate_range()
Breaks the build.
2018-02-11 17:01:31 +02:00
Shlomi Livne
d17aa3cd1c release: prepare for 2.0.3
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-02-08 18:11:35 +02:00
Vlad Zolotarov
f7e79322f1 tests: loading_cache_test: initial commit
Test utils::loading_shared_values and utils::loading_cache.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 15:28:45 -05:00
Vlad Zolotarov
e31331bdb2 utils + cql3: use a functor class instead of std::function
Define value_extractor_fn as a functor class instead of std::function.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 15:28:45 -05:00
Vlad Zolotarov
6873e26060 cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
- Transition the prepared statements caches for both CQL and Thrift to the cql3::prepared_statements_cache class.
- Add the corresponding metrics to the query_processor:
   - Evictions count.
   - Current entries count.
   - Current memory footprint.

Fixes #2474

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 15:28:35 -05:00
Vlad Zolotarov
24bee2c887 transport::server::process_prepare() don't ignore errors on other shards
If storing of the statement fails on any shard we should fail the whole PREPARE
request.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1502325392-31169-13-git-send-email-vladz@scylladb.com>
2018-02-02 13:27:34 -05:00
Vlad Zolotarov
8bba15a709 cql3: prepared statements cache on top of loading_cache
This is a template class that implements caching of prepared statements for a given ID type:
   - Each cache instance is given 1/256 of the total shard memory. If a new entry would overflow
     this memory limit, the least recently used entries are evicted so that the new entry can
     be added.
   - The memory consumption of a single prepared statement is defined by a cql3::prepared_cache_entry_size
     functor class that returns the number of bytes for a given prepared statement (currently it
     returns 10000 bytes for any statement).
   - A cache entry is evicted if not used for 60 minutes or more.
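The size-limited, LRU-evicting behaviour described above can be illustrated with a toy synchronous cache. The class name, the flat 10000-byte entry cost, and the synchronous interface are all illustrative; the real cql3::prepared_statements_cache is asynchronous and built on loading_cache.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Toy memory-bounded LRU cache: evicts least-recently-used entries to
// make room for new ones so the total size never exceeds the budget.
class toy_prepared_cache {
    struct entry { std::string key; std::string stmt; size_t size; };
    std::list<entry> _lru; // front = most recently used
    std::unordered_map<std::string, std::list<entry>::iterator> _index;
    size_t _max_size;
    size_t _used = 0;
public:
    explicit toy_prepared_cache(size_t max_size) : _max_size(max_size) {}

    void put(const std::string& key, const std::string& stmt,
             size_t size = 10000) { // flat cost per statement, as described
        auto it = _index.find(key);
        if (it != _index.end()) { // replace an existing entry
            _used -= it->second->size;
            _lru.erase(it->second);
            _index.erase(it);
        }
        // evict LRU entries until the new one fits
        while (!_lru.empty() && _used + size > _max_size) {
            _used -= _lru.back().size;
            _index.erase(_lru.back().key);
            _lru.pop_back();
        }
        _lru.push_front({key, stmt, size});
        _index[key] = _lru.begin();
        _used += size;
    }

    const std::string* find(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) return nullptr;
        _lru.splice(_lru.begin(), _lru, it->second); // mark recently used
        return &it->second->stmt;
    }

    size_t size() const { return _index.size(); }
};
```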

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:37:00 -05:00
Vlad Zolotarov
ad68d3ecfd utils::loading_cache: make the size limitation more strict
Ensure that the size of the cache is never bigger than "max_size".

Before this patch the size of the cache could have grown indefinitely
beyond the requested value during the refresh period, which is clearly
undesirable behaviour.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:36:56 -05:00
Vlad Zolotarov
707ac9242e utils::loading_cache: added static_asserts for checking the callbacks signatures
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:36:53 -05:00
Vlad Zolotarov
dae0563ff8 utils::loading_cache: add a bunch of standard synchronous methods
Add a few standard synchronous methods to the cache, e.g. find(), remove_if(), etc.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:36:43 -05:00
Vlad Zolotarov
7ba50b87f1 utils::loading_cache: add the ability to create a cache that would not reload the values
Sometimes we don't want the cached values to be periodically reloaded.
This patch adds the ability to control this with a ReloadEnabled template parameter.

In case reloading is not needed, the "loading" function is not given to the constructor
but rather to the get_ptr(key, loader) method (currently it's the only method that is
used; we may add a corresponding get(key, loader) method in the future when needed).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:35:25 -05:00
Vlad Zolotarov
1da277c78e utils::loading_cache: add the ability to work with not-copy-constructable values
The current get(...) interface restricts the cache to work only with
copy-constructible values (it returns future<Tp>).
To make it work with non-copyable values we need to introduce an interface
that returns something like a reference to the cached value (like regular
containers do).

We can't return future<Tp&>, since the caller would have to somehow ensure
that the underlying value is still alive. A much safer and easier-to-use way
is to return a shared_ptr-like pointer to that value.

"Luckily" for us, the value we actually store in the cache is already wrapped
in an lw_shared_ptr, and we may simply return an object that impersonates a
smart_pointer<Tp> while keeping a "reference" to the object stored in the cache.
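A minimal sketch of such a wrapper; std::shared_ptr stands in for seastar's lw_shared_ptr, and the name value_ptr is illustrative, not the actual class name in loading_cache.

```cpp
#include <memory>
#include <string>
#include <utility>

// Impersonates a smart pointer to the cached value while holding a
// reference that keeps the underlying cache entry alive.
template <typename Tp>
class value_ptr {
    std::shared_ptr<Tp> _v; // stand-in for lw_shared_ptr
public:
    explicit value_ptr(std::shared_ptr<Tp> v) : _v(std::move(v)) {}
    Tp& operator*() const { return *_v; }
    Tp* operator->() const { return _v.get(); }
    explicit operator bool() const { return static_cast<bool>(_v); }
};
```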

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:35:21 -05:00
Vlad Zolotarov
36dfd4b990 utils::loading_cache: add EntrySize template parameter
Allow a variable entry size parameter.
Provide an EntrySize functor that would return a size for a
specific entry.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:35:19 -05:00
Vlad Zolotarov
c5ce2765dc utils::loading_cache: rework on top of utils::loading_shared_values
Get rid of the "proprietary" solution for asynchronous values on-demand loading.
Use utils::loading_shared_values instead.

We would still need to maintain an intrusive set and list for efficient
shrink and invalidate operations, but their entries no longer contain the
actual key and value; instead they contain a loading_shared_values::entry_ptr,
which is essentially a shared pointer to a key-value pair.

In general, we added another level of dereferencing in order to get to the
key and value, but since we use bi::store_hash<true> in the hook and
bi::compare_hash<true> in the bi::unordered_set, this should not translate
into additional set lookup latency.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:35:15 -05:00
Vlad Zolotarov
25ffdf527b sstables::shared_index_list: use utils::loading_shared_values
Since utils::loading_shared_values API is based on the original shared_index_list
this change is mostly a drop-in replacement of the corresponding parts.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:33:34 -05:00
Vlad Zolotarov
dbc6d9fe01 utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
Arm the timer with a period that is not greater than either the permissions_validity_in_ms
or the permissions_update_interval_in_ms in order to ensure that we are not stuck with
the values older than permissions_validity_in_ms.

Fixes #2590

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:17:41 -05:00
Asias He
915683bddd locator: Get rid of assert in token_metadata
In commit 69c81bcc87 (repair: Do not allow repair until node is in
NORMAL status), we saw a coredump due to an assert in
token_metadata::first_token_index.

Throw an exception instead of aborting the whole scylla process.
Message-Id: <c110645cee1ee3897e30a3ae1b7ab3f49c97412c.1504752890.git.asias@scylladb.com>

(cherry picked from commit 0ec574610d)
2018-01-29 15:28:44 +02:00
Avi Kivity
7ca8988d0e Update seastar submodule
* seastar d37cf28...e7facd4 (11):
  > tls_test: Fix echo test not setting server trust store
  > tls: Actually verify client certificate if requested
  > tls: Do not restrict re-handshake to client
  > Work around GCC 5 bug: scylladb/seastar#338, scylladb/seastar#339
  > tls: Make put/push mechanism operate "opportunisticly"
  > tls: Move handshake logic outside connect/accept
  > tls: Make sure handshake exceptions are futurized
  > tls: Guard non-established sockets in sesrefs + more explicit close + states
  > tls: Make vec_push fully exception safe
  > net/tls: explicitly ignore ready future during shutdown
  > tls: remove unneeded lambda captures

Fixes #3072
2018-01-28 14:46:06 +02:00
Paweł Dziepak
383d7e6c91 Update scylla-ami submodule
* scylla-ami be90a3f...fa2461d (1):
  > Update Amazon kernel packages release stream to 2017.09
2018-01-24 13:31:04 +00:00
Avi Kivity
7bef696ee5 Update seastar submodule
* seastar 66eb33f...d37cf28 (1):
  > Update dpdk submodule

Fixes build with glibc 2.25 (see scylladb/dpdk#3).
2018-01-18 13:52:39 +02:00
Avi Kivity
a603111a85 Merge "Fix memory leak on zone reclaim" from Tomek
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120"

* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
  tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
  lsa: Expose max_zone_segments for tests
  lsa: Expose tracker::non_lsa_used_space()
  lsa: Fix memory leak on zone reclaim

(cherry picked from commit 4ad212dc01)
2018-01-16 15:54:54 +02:00
Asias He
d5884d3c7c storage_service: Do not wait for restore_replica_count in handle_state_removing
The call chain is:

storage_service::on_change() -> storage_service::handle_state_removing()
-> storage_service::restore_replica_count() -> streamer->stream_async()

Listeners run as part of gossip message processing, which is serialized.
This means we won't be processing any gossip messages until streaming
completes.

In fact, there is no need to wait for restore_replica_count to complete
which can take a long time, since when it completes, this node will send
notification to tell the removal_coordinator that the restore process is
finished on this node. This node will be removed from _replicating_nodes
on the removal_coordinator.

Tested with update_cluster_layout_tests.py

Fixes #2886

Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com>
(cherry picked from commit 5107b6ad16)
2018-01-16 11:38:08 +02:00
Tomasz Grabiec
e6cb685178 range_tombstone_list: Fix insert_from()
end_bound was not updated in one of the cases in which end and
end_kind was changed, as a result later merging decision using
end_bound were incorrect. end_bound was using the new key, but the old
end_kind.

Fixes #3083.
Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit dfe48bbbc7)
2017-12-21 09:29:59 +01:00
Vlad Zolotarov
cd19e5885a messaging_service: fix multi-NIC support
Don't enforce the outgoing connections from the 'listen_address'
interface only.

If 'local_address' is given to connect() it will enforce it to use a
particular interface to connect from, even if the destination address
should be accessed from a different interface. If we don't specify the
'local_address' the source interface will be chosen according to the
routing configuration.

Fixes #3066

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit be6f8be9cb)
2017-12-17 10:52:19 +02:00
Takuya ASADA
f367031016 dist/common/systemd: specify correct repo file path for housekeeping service on Ubuntu/Debian
Currently scylla-housekeeping-daily.service/-restart.service hardcode
"--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify the CentOS .repo file,
but we use the same .service files for Ubuntu/Debian.
That doesn't work correctly; we need to specify a .list file for Debian variants.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit c2e87f4677)
2017-12-16 22:05:19 +02:00
Glauber Costa
7ae67331ad database: delete created SSTables if streaming writes fail
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.

We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same issue.

Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.

Fixes #3062

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
(cherry picked from commit 1aabbc75ab)
2017-12-13 10:12:35 +02:00
Jesse Haber-Kucharsky
0e6561169b cql3: Add missing return
Since `return` is missing, the "else" branch is also taken, and this
results in a user being created from scratch.

Fixes #3058.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bf3ca5907b046586d9bfe00f3b61b3ac695ba9c5.1512951084.git.jhaberku@scylladb.com>
(cherry picked from commit 7e3a344460)
2017-12-11 09:55:46 +02:00
Paweł Dziepak
5b3aa8e90d Merge "Fix range tombstone emitting which led to skipping over data" from Tomasz
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.

Introduced in 2.0

Fixes #3053."

* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add reproducer for issue #3053
  tests: mvcc: Add test for partition_snapshot::range_tombstones()
  mvcc: Optimize partition_snapshot::range_tombstones() for single version case
  mvcc: Fix partition_snapshot::range_tombstones()
  tests: random_mutation_generator: Do not emit dummy entries at clustering row positions

(cherry picked from commit 051cbbc9af)
(cherry picked from commit be5127388d)

[tgrabiec: dropped mvcc_test change, because the file does not exist here]
2017-12-08 14:38:24 +01:00
Tomasz Grabiec
db9d502f82 tests: simple_schema: Add new_tombstone() helper
(cherry picked from commit 204ec9c673)
2017-12-08 14:18:42 +01:00
Amos Kong
cde39bffd0 dist/debian: add scylla-tools-core to depends list
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <db39cbda0e08e501633556ab238d816e357ad327.1512646123.git.amos@scylladb.com>
(cherry picked from commit 8fd5d27508)
2017-12-07 13:42:24 +02:00
Amos Kong
0fbcc852a5 dist/redhat: add scylla-tools-core to requires list
Fixes #3051

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f7013a4fbc241bb4429d855671fee4b845b255cd.1512646123.git.amos@scylladb.com>
(cherry picked from commit eb3b138ee2)
2017-12-07 13:42:17 +02:00
Duarte Nunes
16d5f68886 thrift/server: Handle exception within gate
The exception handling code inspects server state, which could be
destroyed before the handle_exception() task runs since it runs after
exiting the gate. Move the exception handling inside the gate and
avoid scheduling another accept if the server has been stopped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171116122921.21273-1-duarte@scylladb.com>
(cherry picked from commit 34a0b85982)
2017-12-07 13:07:11 +02:00
Pekka Enberg
91540c8181 Update seastar submodule
* seastar 0dbedf0...66eb33f (1):
  > core/gate: Add is_closed() function
2017-12-07 13:06:39 +02:00
Raphael S. Carvalho
eaa8ed929f thrift: fix compilation error
thrift/server.cc:237:6:   required from here
thrift/server.cc:236:9: error: cannot call member function ‘void thrift_server::maybe_retry_accept(int, bool, std::__exception_ptr::exception_ptr)’ without object
         maybe_retry_accept(which, keepalive, std::move(ex));

gcc version: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113184537.10472-1-raphaelsc@scylladb.com>
(cherry picked from commit 564046a135)
2017-12-07 12:55:34 +02:00
Avi Kivity
5ba1621716 Merge "thrift/server: Ensure stop() waits for accepts" from Duarte
"Ensure stop() waits for the accept loop to complete to avoid crashes
during shutdown."

* 'thrift-server-stop/v4' of https://github.com/duarten/scylla:
  thrift/server: Restore code format
  thrift/server: Stopping the server waits for connection shutdown
  thrift/server: Abort listeners on stop()
  thrift/server: Avoid manual memory management
  thrift/server: Add move ctor for connection
  thrift/server: Extract retry logic
  thrift/server: Retry with backoff for some error types
  thrift/server: Retry accept in case of error

(cherry picked from commit 061f6830fa)
2017-12-07 12:54:45 +02:00
Avi Kivity
b4f515035a build: disable -fsanitize-address-use-after-scope on CqlParser.o
The parser generator somehow confuses the use-after-scope sanitizer, causing it
to use large amounts of stack space. Disable that sanitizer on that file.
Message-Id: <20170905110628.18047-1-avi@scylladb.com>

(cherry picked from commit 4751402709)
2017-12-04 12:41:05 +02:00
Avi Kivity
d55e3f6a7f build: fix excessive stack usage in CqlParser in debug mode
The state machines generated by antlr allocate many local variables per function.
In release mode, the stack space occupied by the variables is reused, but in debug
build, it is not, due to Address Sanitizer setting -fstack-reuse=none. This causes
a single function to take above 100k of stack space.

Fix by hacking the generated code to use just one variable.

Fixes #2546
Message-Id: <20170704135824.13225-1-avi@scylladb.com>

(cherry picked from commit a6d9cf09a7)
2017-12-04 12:39:27 +02:00
Shlomi Livne
07b039feab release: prepare for 2.0.2
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-11-30 21:02:42 +02:00
Avi Kivity
35b7353efd Update seastar submodule
* seastar 0489655...0dbedf0 (1):
  > fstream: do not ignore dma_write return value
2017-11-30 17:36:36 +02:00
Duarte Nunes
200e01cc31 compound_compact: Change universal reference to const reference
The universal reference was introduced so we could bind an rvalue to
the argument, but it would have sufficed to make the argument a const
reference. This is also more consistent with the function's other
overload.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171129132758.19654-1-duarte@scylladb.com>
(cherry picked from commit cda3ddd146)
(cherry picked from commit 106c69ad45)
2017-11-30 16:51:16 +02:00
Tomasz Grabiec
b5c4cf2d87 Merge "compact_storage serialization fixes" from Duarte
Fix two issues with serializing non-compound range tombstones as
compound: convert a non-compound clustering element to compound and
actually advertise the issue to other nodes.

* git@github.com:duarten/scylla.git  rt-compact-fixes/v1:
  compound_compact: Allow rvalues in size()
  sstables/sstables: Convert non-compound clustering element to compound
  tests/sstable_mutation_test: Verify we can write/read non-correct RTs
  service/storage_service: Export non-compound RT feature

(cherry picked from commit e9cce59b85)
(cherry picked from commit 740fcc73b8)

Undid changes to size_estimates_virtual_reader
Changed test from memtable::make_flat_reader to memtable::make_reader
2017-11-30 16:50:15 +02:00
Duarte Nunes
f96cb361aa tests: Initialize storage service for some tests
These tests now require having the storage service initialized, which
is needed to decide whether correct non-compound range tombstones
should be emitted or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126152921.5199-1-duarte@scylladb.com>
(cherry picked from commit 922f095f22)
(cherry picked from commit 8567723a7b)
2017-11-30 16:27:43 +02:00
Duarte Nunes
bd59d7c968 cql3/delete_statement: Allow non-range deletions on non-compound schemas
This patch fixes a regression introduced in
1c872e2ddc.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126102333.3736-1-duarte@scylladb.com>
(cherry picked from commit 15fbb8e1ca)
(cherry picked from commit b0b7c73acd)
2017-11-30 16:22:53 +02:00
Tomasz Grabiec
9d923a61e1 Merge "Fixes to sstable files for non-compound schemas" from Duarte
This series mainly fixes issues with the serialization of promoted
index entries for non-compound schemas and with the serialization of
range tombstones, also for non-compound schemas.

We lift the correct cell name writing code into its own function,
and direct all users to it. We also ensure backward compatibility with
incorrectly generated promoted indexes and range tombstones.

Fixes #2995
Fixes #2986
Fixes #2979
Fixes #2992
Fixes #2993

* git@github.com:duarten/scylla.git  promoted-index-serialization/v3:
  sstables/sstables: Unify column name writers
  sstables/sstables: Don't write index entry for a missing row maker
  sstables/sstables: Reuse write_range_tombstone() for row tombstones
  sstables/sstables: Lift index writing for row tombstones
  sstables/sstables: Leverage index code upon range tombstone consume
  sstables/sstables: Move out tombstone check in write_range_tombstone()
  sstables/sstables: A schema with static columns is always compound
  sstables/sstables: Lift column name writing logic
  sstables/sstables: Use schema-aware write_column_name() for
    collections
  sstables/sstables: Use schema-aware write_column_name() for row marker
  sstables/sstables: Use schema-aware write_column_name() for static row
  sstables/sstables: Writing promoted index entry leverages
    column_name_writer
  sstables/sstables: Add supported feature list to sstables
  sstables/sstables: Don't use incorrectly serialized promoted index
  cql3/single_column_primary_key_restrictions: Implement is_inclusive()
  cql3/delete_statement: Constrain range deletions for non-compound
    schemas
  tests/cql_query_test: Verify range deletion constraints
  sstables/sstables: Correctly deserialize range tombstones
  service/storage_service: Add feature for correct non-compound RTs
  tests/sstable_*: Start the storage service for some cases
  sstables/sstable_writer: Prepare to control range tombstone
    serialization
  sstables/sstables: Correctly serialize range tombstones
  tests/sstable_assertions: Fix monotonicity check for promoted indexes
  tests/sstable_assertions: Assert a promoted index is empty
  tests/sstable_mutation_test: Verify promoted index serializes
    correctly
  tests/sstable_mutation_test: Verify promoted index repeats tombstones
  tests/sstable_mutation_test: Ensure range tombstone serializes
    correctly
  tests/sstable_datafile_test: Add test for incorrect promoted index
  tests/sstable_datafile_test: Verify reading of incorrect range
    tombstones
  sstables/sstable: Rename schema-oblivious write_column_name() function
  sstables/sstables: No promoted index without clustering keys
  tests/sstable_mutation_test: Verify promoted index is not generated
  sstables/sstables: Optimize column name writing and indexing
  compound_compat: Don't assume compoundness

(cherry picked from commit bd1efbc25c)

Also added sstables::make_sstable() to preserve source compatibility in tests.
2017-11-30 16:21:13 +02:00
Tomasz Grabiec
0b23bcbe29 sstables: index_reader: Reset lower bound for promoted index lookups from advance_to_next_partition()
_current_pi_idx was not reset from advance_to_next_partition(), which
is used when we skip to the next partition before fully consuming
it. As a result, if we try to skip to a clustering position which is
before the index block used by the last skip in the previous
partition, we would not skip assuming that the new position is in the
current block. This may result in more data being read from the
sstable than necessary.

Fixes #2984
Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 2113299b61)
2017-11-21 14:25:49 +01:00
Duarte Nunes
b1899f000a db/view: Use view schema for view pk operations
Instead of base schema.

Fixes #2504

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718190703.12972-1-duarte@scylladb.com>
(cherry picked from commit 115ff1095e)
2017-11-15 20:41:24 +00:00
Tomasz Grabiec
7b19167cbd gossiper: Replicate endpoint_state::is_alive()
Broken in f570e41d18.

Not replicating this may cause the coordinator to treat a node which is
down as alive, or vice versa.

Fixes regression in dtest:

  consistency_test.py:TestAvailability.test_simple_strategy

which was expected to get "unavailable" exception but it was getting a
timeout.

Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7323fe76db)
2017-11-14 16:21:33 +02:00
Shlomi Livne
164f97fd88 release: prepare for 2.0.1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-11-10 16:37:29 +02:00
Paweł Dziepak
b3bc82bc77 Merge "Fix exception safety related to range tombstones in cache" from Tomasz
Fixes #2938.

* 'tgrabiec/fix-range-tombstone-list-exception-safety-v1' of github.com:scylladb/seastar-dev:
  tests: range_tombstone_list: Add test for exception safety of apply()
  tests: Introduce range_tombstone_list assertions
  cache: Make range tombstone merging exception-safe
  range_tombstone_list: Introduce apply_monotonically()
  range_tombstone_list: Make reverter::erase() exception-safe
  range_tombstone_list: Fix memory leaks in case of bad_alloc
  mutation_partition: Fix abort in case range tombstone copying fails
  managed_bytes: Declare copy constructor as allocation point
  Integrate with allocation failure injection framework

(cherry picked from commit 5a4b46f555)
2017-11-10 13:10:26 +01:00
Tomasz Grabiec
8f6ffb0487 Update seastar submodule
* seastar 124467d...0489655 (9):
  > alloc_failure_injector: Fix compilation error with gcc 7.1
  > alloc_failure_injector: Replace set_alloc_failure_callback() with run_with_callback()
  > alloc_failure_injector: Log backtrace of failures
  > alloc_failure_injector: Extract fail()
  > util: Introduce support for allocation failure injection
  > noncopyable_function: improve support for capturing mutable lambdas
  > noncopyable_function add bool operator
  > test.py: fix typo in noncopyable_function_test
  > utils: introduce noncopyable_function
2017-11-10 13:10:26 +01:00
Raphael S. Carvalho
4bba0c403e compaction: Make resharding go through compaction manager
Two reasons for this change:
1) every compaction should be multiplexed through the manager, which in turn
decides when to schedule it. Improvements to the manager will
immediately benefit every existing compaction type.
2) the active tasks metric will now track ongoing reshard jobs.

Fixes #2671.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
(cherry picked from commit 10eaa2339e)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171103012758.19428-1-raphaelsc@scylladb.com>
2017-11-07 15:23:45 +02:00
Calle Wilund
59aae504ae storage_service: Only replicate token metadata iff modified in on_change
Fixes #2869

Message-Id: <20171101105629.22104-1-calle@scylladb.com>
(cherry picked from commit 8c257c40b4)
2017-11-07 14:46:36 +02:00
Avi Kivity
941e5eef4f Merge "gossip backport for 2.0" from Asias
"This series backports both the cleanup series from Duarte and large cluster
fixes series from Tomek."

* tag 'gossip-2.0-backport-v2' of github.com:scylladb/seastar-dev:
  Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz
  utils: introduce loading_shared_values
  tests/serialized_action: add missing forced defers
  utils: Introduce serialized_action
  Merge "gms/gossiper: Multiple cleanups" from Duarte
  storage_service: Do not use c_str() in the logger
  gms/gossiper: Introduce copy-less endpoint_state::get_application_state_ptr()
  Merge "gossiper: Optimize endpoint_state lookup" from Duarte
  gms: Add is_shutdown helper for endpoint_state class
  gossip: Better check for gossip stabilization on startup
  Merge "Fix miss opportunity to update gossiper features" from Asias
  gossip: Fix a log message typo in compare_endpoint_startup
  gossip: Do not use c_str() in the logger
  Revert "gossip: Make bootstrap more robust"
  gossip: Switch to seastar::lowres_system_clock
  gossip: Use unordered_map for _unreachable_endpoints and _shadow_unreachable_endpoints
  gossip: Introduce the shadow_round_ms option
2017-11-07 13:44:20 +02:00
Duarte Nunes
b12a2e6b08 Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having an up-to-date view of the
ring, and serving incorrect data.

Fixes #2855."

* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
  gms/gossiper: Remove periodic replication of endpoint state map
  gossiper: Check for features in the change listener
  gms/gossiper: Replicate changes incrementally to other shards
  gms/gossiper: Document validity of endpoint_state properties
  storage_service: Update token_metadata after changing endpoint_state
  gms/gossiper: Process endpoints in parallel
  gms/gossiper: Serialize state changes and notifications for given node
  utils/loading_shared_values: Allow Loader to return non-future result
  gms/gossiper: Encapsulate lookup of endpoint_state
  storage_service: Batch token metadata and endpoint state replication
  utils/serialized_action: Introduce trigger_later()
  gossiper: Add and improve logging
  gms/gossiper: Don't fire change listeners when there is no change
  gms/gossiper: Allow parallel apply_state_locally()
  gms/gossiper: Avoid copies in endpoint_state::add_application_state()
  gms/failure_detector: Ignore short update intervals

(cherry picked from commit 044b8deae4)
2017-11-07 19:21:30 +08:00
Vlad Zolotarov
eb34937ff6 utils: introduce loading_shared_values
This class implements a key-value container that is populated
using a provided asynchronous callback.

The value stays loaded for as long as there are active references to it for the given key.

Container ensures that only one entry is loaded per key at any given time.

The returned value is a lw_shared_ptr to the actual value.

The value for a specific key is immediately evicted when there are no
more references to it.

The container is based on the boost::intrusive::unordered_set and is rehashed (grown) if needed
every time a new value is added (asynchronously loaded).

The container has a rehash() method that would grow or shrink the container as needed
in order to get the load factor into the [0.25, 0.75] range.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit ec3fed5c4d)
2017-11-07 19:21:29 +08:00
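The container semantics described in the commit message above (one load per key, immediate eviction when the last reference goes away) can be sketched in a minimal standalone form. This simplification is synchronous and uses std::map plus std::weak_ptr; the real utils::loading_shared_values is asynchronous (the loader returns a future) and is built on boost::intrusive::unordered_set with rehashing. The get_or_load name is hypothetical.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Simplified sketch: values are loaded on demand, at most once per key at a
// time, and a value is dropped as soon as its last reference goes away.
template <typename Key, typename Value>
class loading_shared_values {
    std::map<Key, std::weak_ptr<Value>> _entries;  // non-owning references
public:
    std::shared_ptr<Value> get_or_load(const Key& k,
                                       std::function<Value(const Key&)> loader) {
        if (auto sp = _entries[k].lock()) {
            return sp;                 // already loaded and still referenced
        }
        auto sp = std::make_shared<Value>(loader(k));
        _entries[k] = sp;              // remember without keeping it alive
        return sp;
    }
};
```

Callers share a single loaded value for as long as any of them holds the returned shared pointer; once all references are dropped, the next lookup triggers a fresh load.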
Paweł Dziepak
127ffff9e5 tests/serialized_action: add missing forced defers
serialized_action_tests depends on the fact that first part of the
serialized_action is executed at cetrtain points (in which it reads a
global variable that is later updated by the main thread).
This worked well in the release mode before ready continuations were
inlined and run immediately, but not in the debug mode since inlining
was not happening and the main seastar::thread was missing some yield
points.
Message-Id: <20170731103013.26542-1-pdziepak@scylladb.com>

(cherry picked from commit e970630272)
2017-11-07 19:21:29 +08:00
Tomasz Grabiec
f06bb656a4 utils: Introduce serialized_action
(cherry picked from commit 6a3703944b)

Conflicts:
	configure.py
2017-11-07 19:21:29 +08:00
Pekka Enberg
cc84cd60c5 Merge "gms/gossiper: Multiple cleanups" from Duarte
"Based on the functions get_endpoint_state_for_endpoint_ptr(),
get_application_state_ptr() and
endpoint_state::get_application_state_ptr(), this series
cleans up miscellaneous functions related to the gossiper.

It not only removes duplicated code, but also avoids many copies.

All pointer usages have been audited for safety."

Acked-by: Asias He <asias@scylladb.com>
Acked-by: Tomasz Grabiec <tgrabiec@scylladb.com>

* 'gossiper-cleanup/v2' of github.com:duarten/scylla: (27 commits)
  gms/endpoint_state: Remove get_application_state()
  service/storage_service: Avoid copies in prepare_replacement_info()
  service/storage_service: Cleanup get_application_state_value()
  service/storage_service: Cleanup handle_state_removing()
  service/storage_service: Cleanup get_rpc_address()
  locator/reconnectable_snitch_helper: Avoid versioned_value copies
  locator/production_snitch_base: Cleanup get_endpoint_info()
  service/migration_manager: Avoid copies in is_ready_for_bootstrap()
  service/migration_manager: Cleanup has_compatible_schema_tables_version()
  service/migration_manager: Fix usages of get_application_state()
  cache_hit_rate: Avoid copies in get_hit_rate()
  gms/endpoint_state: Avoid copies in is_shutdown()
  service/load_broadcaster: Avoid copy in on_join()
  gms/gossiper: Cleanup get_supported_features()
  gms/gossiper: Cleanup get_gossip_status()
  gms/gossiper: Cleanup seen_any_seed()
  gms/gossiper: Cleanup get_host_id()
  gms/gossiper: Removed dead uses_vnodes() function
  gms/gossiper: Cleanup uses_host_id()
  gms/gossiper: Add get_application_state_ptr()
  ...

(cherry picked from commit 1701fc2e50)
2017-11-07 19:21:29 +08:00
Asias He
bb6ee1e4b1 storage_service: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <d2f15007a54554b58e29fd05331c06ae030d582f.1504832296.git.asias@scylladb.com>
(cherry picked from commit bb9dbc5ade)
2017-11-07 19:21:28 +08:00
Tomasz Grabiec
ce72299a38 gms/gossiper: Introduce copy-less endpoint_state::get_application_state_ptr()
Message-Id: <1507642411-28680-3-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 66a15ccd18)
2017-11-07 19:21:28 +08:00
Avi Kivity
ef39bc4216 Merge "gossiper: Optimize endpoint_state lookup" from Duarte
"gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive. This
series introduces a function that returns a pointer and avoids
the copy.

Fixes #764"

* 'endpoint-state/v2' of https://github.com/duarten/scylla:
  gossiper: Avoid endpoint_state copies
  endpoint_state: const-qualify functions
  storage_service: Remove duplicate endpoint state check

(cherry picked from commit 4ad3900d8d)
2017-11-07 19:21:28 +08:00
Asias He
fb03f9a73c gms: Add is_shutdown helper for endpoint_state class
It will be used by streaming manager to check if a node is in shutdown
status.

(cherry picked from commit ed7e6974d5)
2017-11-07 19:21:27 +08:00
Asias He
025af9d297 gossip: Better check for gossip stabilization on startup
This is a backport of Apache CASSANDRA-9401
(2b1e6aba405002ce86d5badf4223de9751bf867d)

It is better to check that the number of nodes in the endpoint_state_map
is not changing when waiting for gossip stabilization.

Fixes #2853
Message-Id: <e9f901ac9cadf5935c9c473433dd93e9d02cb748.1506666004.git.asias@scylladb.com>

(cherry picked from commit c0b965ee56)
2017-11-07 19:21:27 +08:00
Tomasz Grabiec
09da4f3d08 Merge "Fix miss opportunity to update gossiper features" from Asias
The gossiper checks if features should be enabled from its timer
callback when it detects that endpoint_state_map changed, i.e. is
different from shadow_endpoint_state_map.

shadow_endpoint_state_map is also assigned from endpoint_state_map in
storage_service::replicate_tm_and_ep_map(), called from
storage_service::on_change()

Call gossiper:maybe_enable_features() in replicate_tm_and_ep_map so
that we won't miss gossip feature update.

Fixes #2824

* git@github.com:scylladb/seastar-dev asias/gossip_miss_feature_update_v1:
  gossip: Move the _features_condvar signal code to
    maybe_enable_features
  gossip: Make maybe_enable_features public
  storage_service: Check gossip feature update in
    replicate_tm_and_ep_map

(cherry picked from commit 02d41864af)
2017-11-07 19:21:27 +08:00
Asias He
0d22b0c949 gossip: Fix a log message typo in compare_endpoint_startup
Message-Id: <c4958950e1108082b63e08ab81ee2177edc9b232.1505286843.git.asias@scylladb.com>
(cherry picked from commit fa9d47c7f3)
2017-11-07 19:21:27 +08:00
Asias He
5c5296a683 gossip: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <52c24d7dfe082ee926f065a6268d83fcb31ddc28.1504832289.git.asias@scylladb.com>
(cherry picked from commit 57dd3cb2c5)
2017-11-07 19:21:27 +08:00
Asias He
5ecb07bbc4 Revert "gossip: Make bootstrap more robust"
This reverts commit b56ba02335.

After commit 8fa35d6ddf (messaging_service: Get rid of timeout and retry
logic for streaming verb), the streaming verb in rpc does not check if a
node is in gossip membership since all the retry logic is removed.

Remove the extra wait before removing the joining node from gossip
membership.

Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>
(cherry picked from commit cc18da5640)
2017-11-07 19:21:26 +08:00
Asias He
a90023a119 gossip: Switch to seastar::lowres_system_clock
The newly added lowres_system_clock is good enough for gossip
resolution. Switch to use it.

Message-Id: <fe0e7a9ef1ea0caffaa8364afe5c78b6988613bf.1503971833.git.asias@scylladb.com>
(cherry picked from commit a36141843a)
2017-11-07 19:21:26 +08:00
Asias He
258f0a383b gossip: Use unordered_map for _unreachable_endpoints and _shadow_unreachable_endpoints
The _unreachable_endpoints map will soon be accessed in the fast path by the
hinted handoff code.

Message-Id: <500d9cbb2117ab7b070fd1bd111c5590f46c3c3a.1503971826.git.asias@scylladb.com>
(cherry picked from commit 2701bfd1f8)
2017-11-07 19:21:26 +08:00
Asias He
3455bfaa44 gossip: Introduce the shadow_round_ms option
It specifies the maximum gossip shadow round time. It can be used to
reduce the gossip feature check time during node boot up.
For instance, when the first node in the cluster, which listed both
itself and other nodes as seeds in the yaml config, boots up, it will try
to talk to the other seed nodes, which are not started yet. The gossip shadow
round will be used to fetch the feature info of the cluster. Since no
other seed node in the cluster is up, the shadow round will fail. The user
can reduce the default shadow_round_ms option to reduce the boot time.

Fixes #2615
Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>

(cherry picked from commit cf6f4a5185)
2017-11-07 19:21:26 +08:00
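For reference, the option above would live in scylla.yaml; a minimal hypothetical fragment (the exact key name, units, and value here are assumptions based on this commit message — check the configuration reference for your version):

```yaml
# Hypothetical scylla.yaml fragment: cap the gossip shadow round at 5s
# so the first seed node does not stall at boot waiting for other seed
# nodes that have not started yet.
shadow_round_ms: 5000
```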
Avi Kivity
cd0b4903e9 Update seastar submodule
* seastar 7ebbb26...124467d (4):
  > peering_sharded_service: prevent over-run the container
  > sharded: fix move constructor for peering_sharded_service services
  > sharded: improve support for cooperating sharded<> services
  > sharded: support for peer services

Includes change to batchlog_manager constructor to adapt it to
seastar::sharded::start() change.

Needed for gossip backport.
2017-11-07 10:49:01 +02:00
Avi Kivity
4d76f564f8 Update seastar submodule
* seastar e4fcb6c...7ebbb26:
  Warn: seastar doesn't contain commit e4fcb6c27cc5dce70d44472522166abe1af29af6

The submodule link pointed nowhere, point it at the tip of scylla-seastar/branch-2.0.
2017-11-06 13:15:16 +02:00
Tomasz Grabiec
a05d3280c4 Update seastar submodule
* seastar b85b0fa...e4fcb6c (1):
  > log: Print nested exceptions

Refs #1011
2017-11-03 10:22:46 +01:00
Glauber Costa
68c41b2346 dist/redhat: do not use s3 addresses in mock profile
They are uglier and less future-proof.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171019142034.30139-1-glauber@scylladb.com>
2017-10-19 18:11:59 +03:00
Glauber Costa
9c907222c5 dist: point mock profile to the right stream
2.0 builds are currently failing because of that.

Message-Id: <20171018182457.5180-1-glauber@scylladb.com>
2017-10-19 11:37:31 +03:00
Tomasz Grabiec
ba02e2688a cache_streamed_mutation: Read static row with cache region locked
_snp->static_row() allocates and needs reference stability.
Message-Id: <1507555031-11567-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 44faaafc29)

Fixes #2880.
2017-10-10 15:55:08 +02:00
Avi Kivity
fa540581e8 Update ami submodule
* dist/ami/files/scylla-ami 5ffa449...be90a3f (1):
  > amazon kernel: enable updates

Still tracking master branch.
2017-10-02 17:11:30 +03:00
Shlomi Livne
e265c91616 release: prepare for 2.0.0
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-09-29 21:12:18 +03:00
Tomasz Grabiec
8a9f8970e4 Update seastar submodule
Refs #2770.

* seastar d763623...b85b0fa (1):
  > scollectd: increment the metadata iterator with the values
2017-09-28 15:32:26 +02:00
Tomasz Grabiec
1e3c777a10 Update seastar submodule
* seastar c853473...d763623 (1):
  > rpc: make sure that _write_buf stream is always properly closed
2017-09-28 15:03:12 +02:00
Tomasz Grabiec
43d785a177 migration_manager: Make sure schema pulls eventually happen when schema_tables_v3 is enabled
We don't pull schema during rolling upgrade, that is, until the
schema_tables_v3 feature is enabled on all nodes.

Because features are enabled from gossiper timer, there is a race
between feature enablement and processing of endpoint states which may
trigger schema pull.  It can happen that we first try to pull, but
only later enable the feature. In that case the schema pull will not
happen until the next schema change.

The fix is to ensure that pulls abandoned due to the feature not being enabled
will be retried when it is enabled.

Fixes sporadic failure in dtest:

  repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit b704710954)
2017-09-27 12:06:45 +01:00
Tomasz Grabiec
b53d3d225d gossiper: Allow waiting for feature to be enabled
Message-Id: <1506428715-8182-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7a58fb5767)
2017-09-27 12:06:37 +01:00
Paweł Dziepak
f9864686d2 Merge "Fix cache reader skipping rows in some cases" from Tomasz
"Fixes the problem of concurrent populations of clustering row ranges
leading to some readers skipping over some of the rows.
Spotted during code review.

Fixes #2834."

* tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev:
  tests: mvcc: Add test for partition_snapshot_row_cursor
  tests: row_cache: Add test for concurrent population
  tests: row_cache: Make populate_range() accept partition_range
  tests: Add simple_schema::make_ckey_range()
  cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
  mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
  mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()
  mvcc: Keep track of all iterators in partition_snapshot_row_cursor
  mvcc: Make partition_snapshot_row_cursor printable

(cherry picked from commit af1976bc30)

[tgrabiec: resolved conflicts]
2017-09-26 19:18:29 +02:00
Tomasz Grabiec
454b90980a streamed_mutation: Allow setting buffer capacity
Needed in tests to limit amount of prefetching done by readers, so
that it's easier to test interleaving of various events.

(cherry picked from commit cb16b038ef)
2017-09-26 19:18:29 +02:00
Tomasz Grabiec
13c66b7145 Update seastar submodule
Fixes #2738.

* seastar e380a07...c853473 (1):
  > httpd: handle exception when shutting down
2017-09-26 18:36:50 +02:00
Asias He
df04418fa4 gossip: Print SCHEMA_TABLES_VERSION correctly
Found this when debugging gossip with debug print. The application state
SCHEMA_TABLES_VERSION was printed as UNKNOWN.
Message-Id: <d7616920d2e6516b5470a758bcf9c88f3d857381.1506391495.git.asias@scylladb.com>

(cherry picked from commit 98e9049820)
2017-09-26 08:39:30 +02:00
Tomasz Grabiec
6e2858a47d storage_service: Register features before joining
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registerred, and thus can be enabled, before the
bootstrapping of a non-seed node happens. Otherwise the bootstrap will
hang waiting on schema sync which will not happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 8e46d15f91)
2017-09-25 09:40:22 +01:00
Tomasz Grabiec
0f79503cf1 storage_service: Extract register_features()
Message-Id: <1505893837-27876-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b92dcb0284)
2017-09-25 09:40:22 +01:00
Tomasz Grabiec
056f6df859 Update seastar submodule
* seastar 06790c0...e380a07 (1):
  > configure: disable exception scalability hack on debug build
2017-09-25 10:13:01 +02:00
Avi Kivity
7833129ab4 Merge "row_cache: Call fast_forward_to() outside allocating section" from Tomasz
"On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after an exception is
thrown, since it is in an unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.

We shouldn't call fast_forward_to() with the cache region locked.

Fixes #2791."

* 'tgrabiec/dont-ffwd-in-alloc-section' of github.com:scylladb/seastar-dev:
  cache_streamed_mutation: De-futurize cursor movement
  cache_streamed_mutation: Call fast_forward_to() outside allocating section
  cache_streamed_mutation: Switch from flags to explicit state machine

(cherry picked from commit 5b0cb28af9)

[tgrabiec: resolved minor conflicts]
2017-09-20 10:23:08 +02:00
Asias He
c3c5ec1d4a gossip: Fix indentation in apply_state_locally
Message-Id: <2bdefa8d982ad8da7452b41e894f41d865b83b0b.1505356245.git.asias@scylladb.com>
(cherry picked from commit 5ff0b113c9)
2017-09-19 22:59:46 +08:00
Asias He
7839cebc6c gossip: Use boost::copy_range in apply_state_locally
boost::copy_range is better because the vector is allocated with the
correct size instead of growing when the inserter is called.

[avi: also crashes less]

Message-Id: <b19ca92d56ad070fca1e848daa67c00c024e3a4d.1505291199.git.asias@scylladb.com>
(cherry picked from commit c84dcabb8f)
2017-09-19 22:59:46 +08:00
Pekka Enberg
e428d06f40 Merge "gossip: optimize apply_state_locally for large cluster" from Asias
"This series tries to improve the bootstrap of a node in a large cluster by
improving how gossip applies the gossip node state. In #2404, the joining node
failed to bootstrap, because it did not see the seed node when
storage_service::bootstrap ran. After this series, we apply the whole gossip
state contained in the gossip ack/ack2 message before applying the next one,
and we apply the state of the seed node earlier than non-seed node so we can
have the seed node's state faster. We also add some randomness to the order of
applying gossip node state to prevent some nodes' state from always being
applied earlier than the others'.

This series improves apply_state_locally for large cluster:

 - Tune the order of applying endpoint_state
 - Serialize apply_state_locally
 - Avoid copying of the gossip state map

Fixes #2404"

* tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev:
  gossip: Avoid copying with apply_state_locally
  gossip: Serialize apply_state_locally
  gossip: Tune the order of applying endpoint_state in apply_state_locally
  gossip: Introduce is_seed helper
  gossip: Pass const endpoint_state& in notify_failure_detector
  gossip: Pass reference in notify_failure_detector

(cherry picked from commit d2632ddf1d)
2017-09-19 22:59:45 +08:00
Asias He
6834ba16a3 gossip: Do not wait for echo message in mark_alive
gossiper::apply_state_locally() calls handle_major_state_change() for
each endpoint, in a seastar thread, which calls mark_alive() for new
nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously
waits for each node to respond before it moves on to the next entry. As
a result it may take a while before whole state is processed.

Apache (tm) Cassandra (tm) sends echoes in the background.

In a large cluster, we see that by the time the joining node starts
streaming, it hasn't managed to apply all the endpoint_state for peer
nodes, so the joining node does not know some of the nodes yet, which
results in the joining node skipping streaming from some of the existing
nodes.

Fixes #2787
Fixes #2797

Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>
(cherry picked from commit 8f8273969d)
2017-09-19 17:12:26 +03:00
Shlomi Livne
0b49cfcf12 release: prepare for 2.0.rc5
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-09-17 17:05:53 +03:00
Duarte Nunes
5e8c9a369e Merge 'Fix schema version mismatch during rolling upgrade from 1.7' from Tomasz
"When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema
for some reason, reads or writes which involve both 1.7 and 2.0 nodes may
start to fail with the following error logged:

    storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d

The situation should heal after the whole cluster is upgraded.

Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes
due to change in the schema tables format. Mismatch is meant to be avoided by
having 2.0 nodes calculate the old digest on schema migration during upgrade,
and use that version until next time the table is altered. It is thus not
allowed to alter tables during the rolling upgrade.

Two 2.0 nodes may exchange schema, if they detect through gossip that their
schema versions don't match. They may not match temporarily during boot, until
the upgraded node completes the bootstrap and propagates its new schema
through gossip. One source of such temporary mismatch is construction of new
tracing tables, which didn't exist on 1.7. Such schema pull will result in a
schema merge, which causes all tables to be altered and their schema version to
be recalculated. The new schema will not match the one used by 1.7 nodes,
causing reads and writes to fail, because schema requesting won't work during
rolling upgrade from 1.7 to 2.0.

The main fix employed here is to hold schema pulls, even among 2.0 nodes,
until rolling upgrade is complete."

Fixes #2802.

* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev:
  tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case
  tests: cql_test_env: Enable all features in tests
  schema_tables: Make make_scylla_tables_mutation() visible
  migration_manager: Disable pulls during rolling upgrade from 1.7
  storage_service: Introduce SCHEMA_TABLES_V3 feature
  schema_tables: Don't alter tables which differ only in version
  schema_mutations: Use mutation_opt instead of stdx::optional<mutation>

(cherry picked from commit 8378fe190a)
2017-09-15 12:07:56 +02:00
Avi Kivity
8567762339 Merge "Refuse to load non-Scylla counter sstables" from Paweł
"These patches make Scylla refuse to load counter sstables that may
contain unsupported counter shards. They are recognised by the lack of
the Scylla component.

Fixes #2766."

* tag 'reject-non-scylla-counter-sstables/v1' of https://github.com/pdziepak/scylla:
  db: reject non-Scylla counter sstables in flush_upload_dir
  db: disallow loading non-Scylla counter sstables
  sstable: add has_scylla_component()

(cherry picked from commit fe019ad84d)
2017-09-11 13:29:32 +03:00
Avi Kivity
f698496ab2 Merge "Fix Scylla upgrades when counters are used" from Paweł
"Scylla 1.7.4 and older use incorrect ordering of counter shards, this
was fixed in 0d87f3dd7d ("utils::UUID:
operator< should behave as comparison of hex strings/bytes"). However,
that patch was not backported to 1.7 branch until very recently. This
means that versions 1.7.4 and older emit counter shards in an incorrect
order and expect them to be so. This is particularly bad when dealing
with imported correct sstables in which case some shards may become
duplicated.

The solution implemented in this patch is to allow any order of counter
shards and automatically merge all duplicates. The code is written in a
way so that the correct ordering is expected in the fast path in order
not to excessively punish unaffected deployments.

A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless
upgrade from 1.7.4 to later Scylla versions. If that feature is not
available Scylla still writes sstables and sends on-wire counters using
the old ordering so that it can be correctly understood by 1.7.4, once
the flag becomes available Scylla switches to the correct order.

Fixes #2752."

* tag 'fix-upgrade-with-counters/v2' of https://github.com/pdziepak/scylla:
  tests/counter: verify counter_id ordering
  counter: check that utils::UUID uses int64_t
  mutation_partition_serializer: use old counter ordering if necessary
  mutation_partition_view: do not expect counter shards to be sorted
  sstables: write counter shards in the order expected by the cluster
  tests/sstables: add storage_service_for_tests to counter write test
  tests/sstables: add test for reading wrong-order counter cells
  sstables: do not expect counter shards to be sorted
  storage_service: introduce CORRECT_COUNTER_ORDER feature
  tests/counter: test 1.7.4 compatible shard ordering
  counters: add helper for retrieving shards in 1.7.4 order
  tests/counter: add tests for 1.7.4 counter shard order
  counters: add counter id comparator compatible with Scylla 1.7.4
  tests/counter: verify order of counter shards
  tests/counter: add test for sorting and deduplicating shards
  counters: add function for sorting and deduplicating counter cells
  counters: add counter_id::operator>

(cherry picked from commit 31706ba989)
2017-09-05 14:25:36 +03:00
Shlomi Livne
6e6de348ea release: prepare for 2.0.rc4
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-09-03 14:58:01 +03:00
Avi Kivity
117db58531 Update AMI submodule
* dist/ami/files/scylla-ami b41e5eb...5ffa449 (3):
  > amzn-main.repo: stick to Amazon Linux 2017.03 kernel (4.9.x)
  > Prevent dependency error on 'yum update'
  > scylla_create_devices: don't raise error when no disks found

Fixes #2751.

Still tracking master branch.
2017-08-31 15:15:42 +03:00
Vlad Zolotarov
086f8b7af2 service::storage_service: initialize auth and tracing after we joined the ring
Initialize the system_auth and system_traces keyspaces and their tables after
the node joins the token ring, because as part of system_auth initialization
SELECT and possibly INSERT CQL statements are going to be issued.

This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.

Fixes #2273

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit e98adb13d5)
2017-08-30 09:33:05 +02:00
Avi Kivity
bee9fbe3fc Merge "Fix sstable reader not working for empty set of clustering ranges" from Tomasz
"Fixes #2734."

* 'tgrabiec/make-sstable-reader-work-with-empty-range-set' of github.com:scylladb/seastar-dev:
  tests: Introduce clustering_ranges_walker_test
  tests: simple_schema: Add missing include
  sstables: reader: Make clustering_ranges_walker work with empty range set
  clustering_ranges_walker: Make adjacency more accurate

(cherry picked from commit 5224ab9c92)
2017-08-29 15:54:58 +02:00
Tomer Sandler
b307a36f1e node_health_check: Various updates
- Removed text from Report's "PURPOSE" section, which was referring to the "MANUAL CHECK LIST" (not needed anymore).
- Removed curl command (no longer using the api_address), instead using scylla --version
- Added -v flag in iptables command, for more verbosity
- Added support for OEL (Oracle Enterprise Linux) - minor fix
- Some text changes - minor
- OEL support indentation fix + collecting all files under /etc/scylla
- Added line separation under cp output message

Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828131429.4212-1-tomer@scylladb.com>
(cherry picked from commit f1eb6a8de3)
2017-08-29 15:17:16 +03:00
Tomer Sandler
527e12c432 node_health_check: added line separation under cp output message
Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828124307.2564-1-tomer@scylladb.com>
(cherry picked from commit 83f249c15d)
2017-08-29 15:17:07 +03:00
Shlomi Livne
c57cc55aa6 release: prepare for 2.0.rc3
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-08-28 14:57:54 +03:00
Avi Kivity
5d3c015d27 Merge "Fixes for skipping in sstable reader" from Tomasz
Ref #2733.

* 'tgrabiec/fix-fast-forwarding' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Add more tests for fast forwarding across partitions
  sstables: Fix abort in mutation reader for certain skip pattern
  sstables: Fix reader returning partition past the query range in some cases
  sstables: Introduce data_consume_context::eof()

(cherry picked from commit 4e67bc9573)
2017-08-28 12:48:50 +03:00
Avi Kivity
428831b16a Merge "consider the pre-existing cpuset.conf when configuring networking mode" from Vlad
"Preserve the networking configuration mode during the upgrade by generating the /etc/scylla.d/perftune.yaml
file and using it."

Fixes #2725.

* 'dist_respect_cpuset_conf-v3' of https://github.com/vladzcloudius/scylla:
  scylla_prepare: respect the cpuset.conf when configuring the networking
  scylla_cpuset_setup: rm perftune.yaml
  scylla_cpuset_setup: add a missing "include" of scylla_lib.sh

(cherry picked from commit 40aeb00151)
2017-08-24 18:59:37 +03:00
Paweł Dziepak
918339cf2e mvcc: allow invoking maybe_merge_versions() inside allocating section
Message-Id: <20170823083544.4225-1-pdziepak@scylladb.com>
(cherry picked from commit 1006a946e8)
2017-08-24 14:31:00 +02:00
Paweł Dziepak
6c846632e4 abstract_read_executor: make make_requests() exception safe
Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>
(cherry picked from commit 9d82a1ebfd)
2017-08-24 14:29:32 +02:00
Paweł Dziepak
af7b7f1eff shared_index_lists: restore indentation
Message-Id: <20170821162934.25386-4-pdziepak@scylladb.com>
(cherry picked from commit 31afc2f242)
2017-08-24 14:28:49 +02:00
Paweł Dziepak
701128f8a1 sstables: make shared_index_lists::get_or_load exception safe
Message-Id: <20170821162934.25386-3-pdziepak@scylladb.com>
(cherry picked from commit 93eaa95378)
2017-08-24 14:28:49 +02:00
Avi Kivity
c03118fbe9 Update seastar submodule
* seastar 2993cae...06790c0 (3):
  > scripts: posix_net_conf.sh: allow passing a perftune.py configuration file as a parameter
  > scripts: perftune.py: add the possibility to pass the parameters in a configuration file and print the YAML file with the current configuration
  > scripts: perftune.py: actually use the number of Rx queues when comparing to the number of CPU threads
2017-08-24 11:37:38 +03:00
Piotr Jastrzebski
a98e3aec45 Make streamed_mutation more exception safe
Make sure that push_mutation_fragment leaves
_buffer_size with a correct value if an exception
is thrown from emplace_back.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <83398412aa78332d88d91336b79140aecc988602.1503474403.git.piotr@scylladb.com>
(cherry picked from commit 477068d2c3)
2017-08-23 19:10:49 +03:00
Avi Kivity
29baf7966c Merge "repair: Do not allow repair until node is in NORMAL status" from Asias
Fixes #2723.

* tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev:
  repair: Do not allow repair until node is in NORMAL status
  gossip: Add is_normal helper

(cherry picked from commit 2f41ed8493)
2017-08-23 09:45:34 +03:00
Amnon Heiman
ba63f74d7e Add configuration to disable per keyspace and column family metrics
The number of keyspace and column family metrics reported is
proportional to the number of shards times the number of keyspace/column
families.

This can cause a performance issue both on the reporting system and on
the collecting system.

This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.

Fixes #2701

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
(cherry picked from commit abbd78367c)
2017-08-22 19:20:50 +03:00
Alexys Jacob
1733f092ef dist: Fix Gentoo Linux scylla-jmx and scylla-tools packages detection
These two admin related packages will be packaged under the "app-admin"
category and not the "dev-db" one.

This fixes the detection path of the packages for scylla_setup.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817094756.21550-1-ultrabug@gentoo.org>
(cherry picked from commit e5ff8efea3)
2017-08-17 15:44:02 +03:00
Paweł Dziepak
179ff956ee sstables: initialise index metrics on all shards
Fixes #2702.

Message-Id: <20170816085454.21554-1-pdziepak@scylladb.com>
(cherry picked from commit 784dcbf1ca)
2017-08-16 15:44:58 +03:00
Avi Kivity
61c2e8c7e2 Update seastar submodule
* seastar d67c344...2993cae (1):
  > fstream: do not ignore unresolved future

Fixes #2697.
2017-08-16 15:11:10 +03:00
Shlomi Livne
8370e1bc2c release: prepare for 2.0.rc2
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-08-15 15:52:35 +03:00
Pekka Enberg
3a244a4734 docker: Switch to Scylla 2.0 RPM repository 2017-08-15 13:27:41 +03:00
Avi Kivity
3261c927d2 Update seastar submodule
* seastar 2383d60...d67c344 (1):
  > Merge "Fix crash in rpc due to access to already destroyed server socket" from Gleb

Fixes #2690
2017-08-14 16:24:05 +03:00
Avi Kivity
4577a89982 Update seastar submodule
* seastar cfe280c...2383d60 (1):
  > tls: Only recurse once in shutdown code

Fixes #2691
2017-08-14 15:10:27 +03:00
Avi Kivity
6ea306f898 Update seastar submodule
* seastar b9f4568...cfe280c (1):
  > scripts: perftune.py: change the network module mode auto selection heuristic
2017-08-14 10:30:42 +03:00
Avi Kivity
2afcc684b4 Update seastar submodule
* seastar 867b7c7...b9f4568 (4):
  > http: removed unneeded lamda captures
  > Merge "Prometheus to use output stream" from Amnon
  > http_test: Fix an http output stream test
  > Merge "Add output stream to http message reply" from Amnon

Fixes #2475
2017-08-10 12:05:14 +03:00
Avi Kivity
bdb8c861c7 Fork seastar submodule for 2.0 2017-08-10 12:00:31 +03:00
Takuya ASADA
ea933b4306 dist/debian: append postfix '~DISTRIBUTION' to scylla package version
We are moving to aptly to release .deb packages, that requires debian repository
structure changes.
After the change, we will share 'pool' directory between distributions.
However, our .deb package name on specific release is exactly same between
distributions, so we have file name confliction.
To avoid the problem, we need to append distribution name on package version.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502312935-22348-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 8e115d69a9)
2017-08-10 10:54:17 +03:00
Raphael S. Carvalho
19391cff14 sstables: close index file when sstable writer fails
index's file output stream uses write-behind, but it's not closed
when the sstable write fails, and that may lead to a crash.
It happened before for data file (which is obviously easier to
reproduce for it) and was fixed by 0977f4fdf8.

Fixes #2673.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>
(cherry picked from commit dddbd34b52)
2017-08-08 09:53:36 +03:00
Glauber Costa
87d9a4f1f1 add active streaming reads metric
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.

This patch adds this metric so we don't fly blind.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
(cherry picked from commit 4a911879a3)
2017-08-05 11:07:18 +03:00
Pekka Enberg
c0f894ccef docker: Disable stall detector
Fixes #2162

Message-Id: <1501759957-4380-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 90872ffa1f)
2017-08-03 14:53:03 +03:00
Takuya ASADA
2c892488ef dist/debian: check scylla user/group existence before adding them
To prevent install failure on environments which already have a scylla
user/group, an existence check is needed.

Fixes #2389

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1495023805-14905-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 91ade1a660)
2017-08-03 13:01:30 +03:00
Avi Kivity
911608e9c4 database: prevent streaming reads from blocking normal reads
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.

Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.

Fixes #2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>

(cherry picked from commit f38e4ff3f9)
2017-08-03 12:27:48 +03:00
Avi Kivity
d8ab07de37 database: remove streaming read queue length limit
If we fail a streaming read due to queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.

Fixes #2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>

(cherry picked from commit 911536960a)
2017-08-03 12:27:46 +03:00
Duarte Nunes
15eefbc434 tests/sstable_mutation_test: Don't use moved-from object
Fix a bug introduced in dbbb9e93d and exposed by gcc6 by not using a
moved-from object. Twice.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170802161033.4213-1-duarte@scylladb.com>
(cherry picked from commit 4c9206ba2f)
2017-08-03 09:46:18 +03:00
Vlad Zolotarov
4bb6ba6d58 utils::loading_cache: cancel the timer after closing the gate
The timer is armed inside the section guarded by the _timer_reads_gate
and therefore has to be canceled after the gate is closed.

Otherwise we may end up with the armed timer after stop() method has
returned a ready future.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 4b28ea216d)
2017-08-01 17:23:53 +01:00
Avi Kivity
6dd4a9a5a2 Merge "Ensure correct EOC for PI block cell names" from Duarte
"This series ensures we always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.

Fixes #2333"

* 'pi-cell-name/v1' of github.com:duarten/scylla:
  tests/sstable_mutation_test: Test promoted index blocks are monotonic
  sstables: Consider eoc when flushing pi block
  sstables: Extract out converting bound_kind to eoc

(cherry picked from commit db7329b1cb)
2017-08-01 18:09:54 +03:00
Gleb Natapov
222e85d502 cql transport: run accept loop in the foreground
It was meant to be run in the foreground since it is waited upon during
stop(), but as it is now, from the stop() perspective it completes
after the first connection is accepted.

Fixes #2652

Message-Id: <20170801125558.GS20001@scylladb.com>
(cherry picked from commit 1da4d5c5ee)
2017-08-01 17:06:08 +03:00
Takuya ASADA
8d4a30e852 dist/ami: follow scylla-tools package name change on RedHat variants
Since scylla-tools generates two .rpm packages, we need to copy them to our AMI.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170722090002.9850-1-syuu@scylladb.com>
(cherry picked from commit a998b7b3eb)
2017-07-31 18:57:29 +03:00
Avi Kivity
4710ee229d Merge "Reduce the effect of the latency metrics" from Amnon
"This series reduces that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
   stop the last bucket."

Fixes #2650.

* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
  database: remove latency from the system table
  estimated histogram: return a smaller histogram

(cherry picked from commit 3fe6731436)
2017-07-31 16:01:05 +03:00
Vlad Zolotarov
93cb78f21d utils::loading_cache: add stop() method
loading_cache invokes a timer that may issue asynchronous operations
(queries) that would end with writing into the internal fields.

We have to ensure that these operations are over before we can destroy
the loading_cache object.

Fixes #2624

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501208345-3687-1-git-send-email-vladz@scylladb.com>
2017-07-31 15:55:46 +03:00
Avi Kivity
af8151c4b7 Update scylla-ami submodule
* dist/ami/files/scylla-ami 2bd1481...b41e5eb (1):
  > Fix incorrect scylla-server sysconfig file edit for i3 memflush controller
2017-07-31 09:41:56 +03:00
Takuya ASADA
846d9da9c2 dist/debian: refuse upgrade if current scylla < 1.7.3 && commitlog remains
Commitlog replay fails when upgrading from <1.7.3 to 2.0, so we need to refuse
to update the package if the current scylla is < 1.7.3 and a commitlog remains.

Note: We have the problem on the scylla-server package, but to prevent the
scylla-conf package upgrade, %pretrans should be defined on scylla-conf.

Fixes #2551

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501187555-4629-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 714540cd4c)
2017-07-31 09:18:58 +03:00
Paweł Dziepak
aaa59d3437 streamed_mutation: do not call fill_buffer() ahead of time
consume_mutation_fragments_until() allows consuming mutation fragments
until a specified condition happens. This patch reorganises its
implementation so that we avoid situations when fill_buffer() is called
with the stop condition being true.
Message-Id: <20170727122218.7703-1-pdziepak@scylladb.com>

(cherry picked from commit f02bef7917)
2017-07-27 17:48:26 +02:00
Tomasz Grabiec
66dd817582 mutation_partition: Always mark static row as continuous when no static columns
To avoid unnecessary cache misses after static columns are added.

Message-Id: <1500650057-26036-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 136d205855)
2017-07-27 14:59:33 +02:00
Tomasz Grabiec
5857a4756d Merge "Some fixes for performance regressions in perf_fast_forward" from Paweł
These patches contain some minor fixes for performance regression reported
by perf_fast_forward after partial cache was merged. The solution is still
far from perfect, there is one case that still has 30% degradation, but
there is some improvement so there is no reason to hold these changes back.

Refs #2582.

Some numbers:
before - before cache changes were merged
(555621b537)

cache - at the commit that introduced the partial cache
(9b21a9bfb6)

after - recent master + this series
(based on e988121dbb)

Differences are shown relative to "before".

Testing effectiveness of caching of large partition, single-key slicing reads:
Large partitions, range [0, 500000], populating cache
  before      cache      after
 1636840    1013688    1234606
              -38%        -25%

Large partitions, range [0, 500000], reading from cache
  before      cache      after
 2012615    3076812    3035423
               +53%       +51%

Testing scanning small partitions with skips.
reading small partitions (skip 0)
 before      cache      after
 227060     165261     200639
              -27%       -11%

skipping small partitions (skip 1)
 before      cache      after
  29813      27312      38210
               -8%       +28%

Testing slicing small partitions:
slicing small partitions (offset 0, read 4096)
 before      cache      after
 195282     149695     180497
              -23%        -8%

* https://github.com/pdziepak/scylla.git perf_fast_forward-regression/v3:
  sstables: make sure that fill_buffer() actually fills buffer
  mutation_merger: improve handling of non-deferring fill_buffer()s
  partition_snapshot_row_cursor: avoid apply() in single-version cases
  sstables: introduce decorated_key_view
  ring_position_comparator: accept sstables::decorated_key_view
  sstable: keep a pre-computed token in summary_entry
  sstables: cache token in index entries
  index_reader: advance_and_check_if_present() use index_comparator
  ring_position_comparator: drop unused overloads
  cache_streamed_mutation: avoid moving clustering_row
  streamed_mutation: introduce consume_mutation_fragments_until()
  cache_streamed_mutation: use consumer based read_context reader
  rows_entry: make position() inlineable
  mutation_fragment: make destructor always_inline
  keys: introduce compound_wrapper::from_exploded_view()
  sstables: avoid copying key components
  compound_compat: explode: reserve some elements in a vector
  cache: short-circut static row logic if there are no static columns
  cache: use equality comparators instead of tri_compare
  sstables: avoid indirect calls to abstract_type::is_multi_cell()

(cherry picked from commit e9fc0b0491)
2017-07-27 13:58:23 +02:00
Takuya ASADA
f199047601 dist/redhat: limit metapackage dependencies to specific version of scylla packages
When we install the scylla metapackage with a version (ex: scylla-1.7.1),
it always installs the newest scylla-server/-jmx/-tools from the repo,
instead of installing the specified version of the packages.

To install the same version of the packages as the metapackage, limit the
dependencies to the current package version.

Fixes #2642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170726193321.7399-1-syuu@scylladb.com>
(cherry picked from commit 91a75f141b)
2017-07-27 14:21:55 +03:00
Tomasz Grabiec
a8dcbb6bd0 row_cache: Fix potential timeout or deadlock due to sstable read concurrency limit
database::make_sstable_reader() creates a reader which will need to
obtain a semaphore permit when invoked. Therefore, each read may
create at most one such reader in order to be guaranteed to make
progress. If the reader tries to create another reader, that may
deadlock (or for non-system tables, timeout), if enough number of such
readers tries to do the same thing at the same time.

Avoid the problem by dropping previous reader before creating a new
one.

Refs #2644.

Message-Id: <1501152454-4866-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 22948238b6)
2017-07-27 13:58:40 +03:00
Duarte Nunes
bfd99d4e74 db/schema_tables: Drop dropped columns when dropping tables
Fixes #2633

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726150228.2593-2-duarte@scylladb.com>
(cherry picked from commit 50ad0003c6)
2017-07-26 18:48:59 +02:00
Duarte Nunes
d40df89271 db/schema_tables: Store column_name in text form
As does Cassandra.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726150228.2593-1-duarte@scylladb.com>
(cherry picked from commit 3425403126)
2017-07-26 18:48:58 +02:00
Duarte Nunes
3da54ffff0 schema_builder: Replace type when re-dropping column
Fixes #2634

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725183933.5311-1-duarte@scylladb.com>
(cherry picked from commit e988121dbb)
2017-07-26 16:26:59 +02:00
Duarte Nunes
804793e291 tests/schema_change_test: Add test case for add+drop notification
Reproduces #2616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-2-duarte@scylladb.com>
(cherry picked from commit 472f32fb06)
2017-07-26 16:26:59 +02:00
Duarte Nunes
83ea9b6fc0 db/schema_tables: Consider differing dropped columns
If a node is notified of a schema change where the schema's dropped
columns have changed, that node will miss the changes to the dropped
columns. A scenario where this can happen is one where a column c is
dropped, then added with a different type, and then dropped again, with
a node n having seen the first drop and being notified of the
subsequent add and drop.

Fixes #2616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-1-duarte@scylladb.com>
(cherry picked from commit 33e18a1779)
2017-07-26 16:26:59 +02:00
Asias He
b45855fc1c gossip: Fix nr_live_nodes calculation
We need to consider the _live_endpoints size. The nr_live_nodes should
not be larger than the _live_endpoints size, otherwise the loop collecting
the live nodes can run forever.

It is a regression introduced in commit 437899909d
(gossip: Talk to more live nodes in each gossip round).

Fixes #2637

Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>
(cherry picked from commit 515a744303)
2017-07-26 16:48:51 +03:00
Duarte Nunes
3900babff2 schema: Remove unnecessary print
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725174000.71061-1-duarte@scylladb.com>
(cherry picked from commit 9c831b4e97)
2017-07-26 16:07:41 +03:00
Tomasz Grabiec
7c805187a9 Merge fixes related to row cache from Raphael
* git@github.com:raphaelsc/scylla.git row_cache_fixes:
  db: atomically synchronize cache with changes to the snapshot
  db: refresh row cache's underlying data source after compaction

(cherry picked from commit 18be42f71a)
2017-07-25 15:37:40 +02:00
Paweł Dziepak
345a91d55d tests/row_cache: test queries with no clustering ranges
Reproducer for #2604.
Message-Id: <20170725131220.17467-3-pdziepak@scylladb.com>

(cherry picked from commit 79a1ad7a37)
2017-07-25 15:37:32 +02:00
Paweł Dziepak
fda8b35cda tests: do not overload the meaning of empty clustering range
Empty clustering key range is perfectly valid and signifies that the
reader is not interested in anything but the static row. Let's not
make it mean anything else.
Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>

(cherry picked from commit 1ea507d6ae)
2017-07-25 15:37:29 +02:00
Paweł Dziepak
08ac0f1100 cache: fix aborts if no clustering range is specified
cache_streamed_mutation assumed that at least one clustering range was
specified. That was wrong since the readers are allowed to query just
for a static row (e.g. counter update that modifies only static
columns).

Fixes #2604.
Message-Id: <20170725131220.17467-1-pdziepak@scylladb.com>

(cherry picked from commit 6572f38450)
2017-07-25 15:37:28 +02:00
Calle Wilund
db455305a2 system_keyspace: Make sure "system" is written to keyspaces (visible)
Fixes #2514

Bug in schema version 3 update: We failed to write "system" to the
schema tables. Only visible on an empty instance of course.

Message-Id: <1500469809-23546-2-git-send-email-calle@scylladb.com>
(cherry picked from commit 7a583585a2)
2017-07-24 11:33:25 +02:00
Avi Kivity
e1a3052e76 tests: fix sstable_datafile_test build with boost 1.55
Boost 1.55 accidentally removed support for "range for" on
recursive_directory_iterator (previous and latter versions do
support it). Use old-style iteration instead.

Message-Id: <20170724080128.8824-1-avi@scylladb.com>
(cherry picked from commit c21bb5ae05)
2017-07-24 11:20:53 +03:00
Tomasz Grabiec
50fa3f3b89 schema_registry: Keep unused entries around for 1 second
This is in order to avoid frequent misses which have a relatively high
cost. A miss means we need to fetch schema definition from another
node and in case of writes do a schema merge.

If the schema is kept alive only by the incoming request, then it
will be forgotten immediately when the request is done, and the next
request using the same schema version will miss again.

Refs #2608.
Message-Id: <1500632447-10104-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 29a82f5554)
2017-07-24 10:12:09 +02:00
Tomasz Grabiec
8474b7a725 legacy_schema_migrator: Don't snapshot empty legacy tables
Otherwise we will create a new (empty) snapshot each time we boot.
Message-Id: <1500573920-31478-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit ecc85988dd)
2017-07-24 09:56:22 +02:00
Tomasz Grabiec
0fc874e129 database: Allow disabling auto snapshots during drop/truncate
Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 408cea66cd)
2017-07-24 09:56:19 +02:00
Duarte Nunes
5cf1a19f3f Merge 'Fix possible inconsistency of table schema version' from Tomasz
"Fixes issues uncovered in longevity test (#2608).

Main problem is that due to time drift scylla_tables.version column
may not get deleted on all nodes doing the schema merge, which will
make some nodes come up with different table schema version than others.

The inconsistency will not heal because scylla_tables doesn't
take part in the schema sync. This is fixed by the last patch.

This will cause nodes to constantly try to sync the schema, which under
some conditions triggers #2617."

* tag 'tgrabiec/fix-table-schema-version-inconsistency-v1' of github.com:scylladb/seastar-dev:
  schema_tables: Add scylla_tables to ALL
  schema: Make schema_mutations equality consistent with digest
  schema_tables: Extract compact_for_schema_digest()
  schema_tables: Always drop scylla_tables::version

(cherry picked from commit 937fe80a1a)
2017-07-24 09:54:45 +02:00
Tomasz Grabiec
f48466824f schema_registry: Ensure schema_ptr is always synced on the other core
global_schema_ptr ensures that schema object is replicated to other
cores on access. It was replicating the "synced" state as well, but
only when the shard didn't know about the schema. It could happen that
the other shard has the entry, but it's not yet synced, in which case
we would fail to replicate the "synced" state. This will result in
exception from mutate(), which rejects attempts to mutate using an
unsynced schema.

The fix is to always replicate the "synced" state. If the entry is
syncing, we will preemptively mark it as synced earlier. The syncing
code is already prepared for this.

Refs #2617.
Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 65c64614aa)
2017-07-24 09:52:31 +02:00
Avi Kivity
914f6f019f Update ami submodule
* dist/ami/files/scylla-ami 5dfe42f...2bd1481 (1):
  > Enable support for experimental CPU controller in i3 instances
2017-07-24 10:27:35 +03:00
Shlomi Livne
f5bb363f96 release: prepare for 2.0.rc1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-07-23 09:47:11 +03:00
Duarte Nunes
61ba56f628 schema: Support compaction enabled attribute
Fixes #2547

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170721132206.3037-1-duarte@scylladb.com>
(cherry picked from commit 7eecda3a61)
2017-07-21 15:39:48 +02:00
Tomasz Grabiec
f4d3e5cdcf Merge "Drop mutations that raced with truncate" from Duarte
Instead of retrying, just drop mutations that raced with a truncate.

* git@github.com:duarten/scylla.git truncate-reorder/v1:
  database: Rename replay_position_reordered_exception
  database: Drop mutations that raced with truncate

(cherry picked from commit 63caa58b70)
2017-07-21 15:39:20 +02:00
Avi Kivity
0291a4491e Merge "restrict background writers with scheduling groups" from Glauber
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.

The said maximum is recommended to be set  50 % - it is default
disabled, but can be adjusted through a configuration option until we
are able to auto-tune this.

The second patch in this series provides a preview on how such auto-tune
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.

Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."

* 'memtable-controller-v5' of https://github.com/glommer/scylla:
  simple controller for memtable/streaming writer shares.
  restrict background writers to 50 % of CPU.

(cherry picked from commit c5ee62a6a4)
2017-07-20 15:13:39 +03:00
Duarte Nunes
83cc640c6a Merge 'Revert back to 1.7 schema layout in memory' from Tomasz
"Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555)
by reverting back to using the old layout in memory and thus also
in across-node requests. We still use the new v3 layout in schema
tables (needed by drivers and external tools). Translations happen
when converting to/from schema mutations."

* tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev:
  schema: Revert back to the 1.7 layout of static compact tables in memory
  schema: Use v3 column layout when converting to/from schema mutations
  schema: Encapsulate column layout translations in the v3_columns class

(cherry picked from commit 1daf1bc4bb)
2017-07-19 19:49:43 +03:00
Duarte Nunes
2f06c54033 thrift/handler: Remove leftover debug artifacts
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705161156.2307-1-duarte@scylladb.com>
(cherry picked from commit d583ef6860)
2017-07-19 19:49:35 +03:00
Calle Wilund
9abe7651f7 system_schema: Fix remaining places not handling two system keyspaces
Some places remained where code looked directly at
system_keyspace::NAME to determine whether a ks is
considered special/system/protected, including
schema digest calculation.

Export "is_system_keyspace" and use accordingly.

Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 247c36e048)
2017-07-19 19:48:30 +03:00
Amos Kong
784aea12e7 scylla_raid_setup: fix syntax error
/usr/lib/scylla/scylla_raid_setup: line 132: syntax error
near unexpected token `fi'

Fixes #2610

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <af3a5bc77c5ba2b49a8f48a5aaa19afffb787886.1500430021.git.amos@scylladb.com>
(cherry picked from commit 2bdcad5bc3)
2017-07-19 11:10:43 +03:00
Avi Kivity
3a98959eba dist: tolerate sysctl failures
sysctl may fail in a container environment if /proc is not virtualized
properly.

Fixes #1990
Message-Id: <20170625145930.31619-1-avi@scylladb.com>

(cherry picked from commit 08488a75e0)
2017-07-18 15:45:41 +03:00
Duarte Nunes
2c7d597307 wrapping_range: Fix lvalue transform()
Instead of copying and moving the bound, pass it by reference so the
transformer can decide whether it wants to copy. The only
caller so far doesn't want a copy and takes the value by reference,
which would otherwise capture a temporary. Caught by the
view_schema_test with gcc7.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705210255.29669-1-duarte@scylladb.com>
(cherry picked from commit 3dd0397700)
2017-07-18 14:35:58 +03:00
Duarte Nunes
8d46c4e049 thrift: Fail when mixed CFs are detected
Fixes #2588

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170717222612.7429-1-duarte@scylladb.com>
(cherry picked from commit d9fa3bf322)
2017-07-18 10:21:45 +03:00
Asias He
b1c080984f gossip: Implement the missing fd_max_interval_ms and fd_initial_value_ms option
It is useful for larger clusters with higher gossip message latency. By
default fd_max_interval_ms is 2 seconds, which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in a larger cluster, the gossip message update
interval can exceed 2 seconds.

Fixes #2603.

Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
(cherry picked from commit adc5f0bd21)
2017-07-17 13:29:30 +03:00
Duarte Nunes
e1706c36b7 Merge 'Fixes around migration to v3 schema tables' from Tomasz
branch 'tgrabiec/schema-migration-fixes' of github.com:scylladb/seastar-dev:
  schema: Use proper name comparator
  legacy_schema_migrator: Properly migrate non-UTF8 named columns
  schema_tables: Store column_name in text form
  legacy_schema_migrator: Migrate columns like Cassandra
  schema_builder: Add factory method for default_names
  legacy_schema_migrator: Simplify logic
  thrift: Don't set regular_column_name_type
  schema: Use proper column name type for static columns
  schema: Fix column_name_type() for static compact tables
  schema: Introduce clustering_column_at()
  thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type
  partition_slice_builder: Use proper column's type instead of regular_column_name_type()

(cherry picked from commit 13caccf1cf)
2017-07-17 12:42:19 +03:00
Avi Kivity
63c8306733 Update seastar submodule
* seastar b812cee...867b7c7 (1):
  > rpc: start server's send loop only after protocol negotiation

Fixes #2600.

Still tracking upstream.
2017-07-17 10:41:59 +03:00
Avi Kivity
a7dfdc0155 tests: move tmpdir to /tmp
Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk
with write-back cache, and forever on a spinning disk.
Message-Id: <20170716081653.10018-1-avi@scylladb.com>

(cherry picked from commit d9c64ef737)
2017-07-17 08:47:17 +03:00
Avi Kivity
70be29173a tests: copy the sstable with an unknown component to the data directory
We will be creating links to those sstable's files, and those don't work
if the data directory and the test sstable are on different devices.

Copying the files to the same directory fixes the problem.
Message-Id: <20170716090405.14307-1-avi@scylladb.com>

(cherry picked from commit 9116dd91cb)
2017-07-17 08:47:08 +03:00
Avi Kivity
e09d4a9b75 Update seastar submodule
* seastar 844bcfb...b812cee (1):
  > Update dpdk submodule

Fixes #2595 (again).

Still tracking master.
2017-07-16 17:01:48 +03:00
Avi Kivity
67f25e56a6 Update seastar submodule
* seastar ff34c42...844bcfb (1):
  > Update dpdk submodule

Still tracking master.

Fixes #2595.
2017-07-15 19:18:10 +03:00
Tomasz Grabiec
74c4651b95 Merge "Fixes for memtable flushing and replay positions" from Duarte
We don't ensure mutations are applied in memory following the order of their
replay positions. A memtable can thus be flushed with replay position rp,
with the new one being at replay position rp', where rp' < rp. This breaks
an intrinsic assumption in the code, which this series addresses.

Fixes #2074

branch memtable-flush/v3 of git@github.com:duarten/scylla.git:
  commitlog: Always flush latest memtable
  column_family: More precise count of switched memtables
  column_family: Fix typo in pending_tasks metric name
  column_family: More precise count of pending flushes
  dirty_memory_manager: Remove unnecessary check from flush_one()
  column_family: Don't rely on flush_queue to guarantee flushes finished
  column_family: Don't bother closing the flush_queue on stop()
  column_family: Stop using flush_queue
  column_family: Remove outdated comment about the flush_queue
  memtable: Stop tracking the highest flushed rp

(cherry picked from commit caa62f7f05)
2017-07-14 19:07:33 +02:00
Duarte Nunes
58bfb86d73 storage_proxy: Preserve replica order across mutations
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.

There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.

This patch fixes this and enforces the expected order.

Fixes #2531
Fixes #2593

Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
(cherry picked from commit b8235f2e88)
2017-07-14 12:11:50 +03:00
Tomasz Grabiec
cb94c66823 legacy_schema_migrator: Fix calculation of is_dense
The current algorithm marked tables with regular columns not named
"value" as not dense, which doesn't have to be the case. It can be
either way.

It should be enough to look at clustering components. If there is a
clustering key, then table is dense if and only if all comparator
components belong to the clustering key.

If there is no clustering key, then if there are any regular columns
we're sure it's not dense.

Fixes #2587.

Message-Id: <1499877777-7083-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 30ec4af949)
2017-07-13 17:28:25 +03:00
Tomasz Grabiec
5aa3e23fcd gdb: Fix "scylla columnfamilies" command
Broken in 0e4d5bc2f3.

Message-Id: <1499951956-26206-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 54953c8d27)
2017-07-13 16:33:50 +03:00
Takuya ASADA
aac1d5d54d dist/common/systemd: move scylla-server.service to be after network-online.target instead of network.target
To make sure start Scylla after network is up, we need to move from
network.target to network-online.target.

Fixes #2337

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493661832-9545-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 0c81974bc4)
2017-07-12 13:36:52 +03:00
Glauber Costa
a371b8a5bf change task quota's default
The default of 2ms is somewhat arbitrary. Now that we have a lot more
mileage deploying Scylla applications in production it sounds not
only arbitrary, but also too high.

In particular, it is really hard to achieve 1ms latencies in the face of
CPU-heavy workloads with it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1499354495-27173-1-git-send-email-glauber@scylladb.com>
(cherry picked from commit 780a6e4d2e)
2017-07-12 10:21:35 +03:00
Avi Kivity
a69fb8a8ed Update seastar submodule
* seastar 89cc97c...ff34c42 (6):
  > tls: Wrap all IO in semaphore (Fixes #2575)
  > tests/lowres_clock_test.cc: Declare helper static
  > tests/lowres_clock_test.cc: fix compilation error for older GCC
  > configure.py: verifies boost version
  > pkg-config: Eliminate spaces in include path arguments
  > allow applications to override task-quota-ms

Still tracking seastar master.
2017-07-12 10:20:49 +03:00
Avi Kivity
00b9640b2c Merge "Preserve table schema digest on schema tables migration" from Tomasz
"Currently new nodes calculate digests based on v3 schema mutations,
which are very different from v2 mutations. As a result they will
use schemas with different table_schema_version that the old nodes.
The old nodes will not recognize the version and will try to request
its definition. That will fail, because old nodes don't understand
v3 schema mutations.

To fix this problem, let's preserve the digests during migration,
so that they're the same on new and old nodes. This will allow
requests to proceed as usual.

This does not solve the problem of schema being changed during
the rolling upgrade. This is not allowed, as it would bring the
same problem back.

Fixes #2549."

* tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev:
  tests: Add test for concurrent column addition
  legacy_schema_migrator: Set digest to one compatible with the old nodes
  schema_tables: Persist table_schema_version
  schema_tables: Introduce system_schema.scylla_tables
  schema_tables: Simplify read_table_mutations()
  schema_tables: Resurrect v2 read_table_mutations()
  system_keyspace: Forward-declare legacy schemas
  legacy_schema_migrator: Take storage_proxy as dependency

(cherry picked from commit a397889c81)
2017-07-11 17:23:21 +03:00
Gleb Natapov
59d608f77f consistency_level: report less live endpoints in Unavailable exception if there are pending nodes
DowngradingConsistencyRetryPolicy uses the live replica count from the
Unavailable exception to adjust the CL for a retry, but when there are
pending nodes the CL is increased internally by the coordinator, which
may prevent the retried query from succeeding. Adjust the live replica
count when pending nodes are present so that the retried query can proceed.

Fixes #2535

Message-Id: <20170710085238.GY2324@scylladb.com>
(cherry picked from commit 739dd878e3)
2017-07-11 17:16:46 +03:00
Botond Dénes
1717922219 Fix crash in the out-of order restrictions error msg composition
Use the name of the existing preceding column with a restriction
(last_column) instead of assuming that the column right after the
current column already has restrictions.
This yields an error message that differs from Cassandra's,
albeit still a correct one.

Fixes #2421

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>
(cherry picked from commit 33bc62a9cf)
2017-07-11 17:15:45 +03:00
Paweł Dziepak
7cd4bb0c4a transport: send correct type id for counter columns
CQL reply may contain metadata that describes columns present in the
response including the information about their type.

However, Scylla incorrectly reports counter types as bigint. The
serialised format of counters and bigint is exactly the same, which
could explain why the problem hasn't been noticed earlier but it is a
bug nevertheless.

Fixes #2569.
Message-Id: <20170711130520.27603-1-pdziepak@scylladb.com>

(cherry picked from commit 5aa523aaf9)
2017-07-11 16:37:24 +03:00
Tomasz Grabiec
588ae935e7 legacy_schema_migrator: Use separate joinpoint instance for each table
Otherwise we may deadlock, as explained in commit 5e8f0efc8:

Table drop starts with creating a snapshot on all shards. All shards
must use the same snapshot timestamp which, among other things, is
part of the snapshot name. The timestamp is generated using supplied
timestamp generating function (joinpoint object). The joinpoint object
will wait for all shards to arrive and then generate and return the
timestamp.

However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
Message-Id: <1499762663-21967-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 310d2a54d2)
2017-07-11 12:31:21 +03:00
Avi Kivity
c292e86b3c sstables: fix use-after-free in read_simple()
`r` is moved-from, and later captured in a different lambda. The compiler may
choose to move and perform the other capture later, resulting in a use-after-free.

Fix by copying `r` instead of moving it.

Discovered by sstable_test in debug mode.
Message-Id: <20170702082546.20570-1-avi@scylladb.com>

(cherry picked from commit 07b8adce0e)
2017-07-10 15:32:57 +03:00
Asias He
3dc0d734b0 repair: Do not store the failed ranges
The number of failed ranges can be large, so storing them can consume a
lot of memory. The failed ranges are already logged, so there is no need
to store them in memory.

Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com>
(cherry picked from commit b2a2fbcf73)
2017-07-10 14:37:47 +03:00
Takuya ASADA
2d612022ba dist/common/scripts/scylla_cpuscaling_setup: skip configuration when cpufreq driver isn't loaded
Configuring the cpufreq service on VMs/IaaS causes an error because they don't support cpufreq.
To prevent the error, skip the whole configuration when the driver isn't loaded.

Fixes #2051

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498809504-27029-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 1c35549932)
2017-07-10 14:08:54 +03:00
Nadav Har'El
5f6100c0aa repair: further limit parallelism of checksum calculation
Repair today has a semaphore limiting the number of ongoing checksum
comparisons running in parallel (on one shard) to 100. We needed this
number to be fairly high, because a "checksum comparison" can involve
high latency operations - namely, sending an RPC request to another node
in a remote DC and waiting for it to calculate a checksum there, and while
waiting for a response we need to proceed calculating checksums in parallel.

But as a consequence, in the current code, we can end up with as many as
100 fibers all at the same stage of reading partitions to checksum from
sstables. This requires tons of memory, to hold at least 128K of buffer
(even more with read-ahead) for each of these fibers, plus partition data
for each. But doing 100 reads in parallel is pointless - one (or very few)
should be enough.

So this patch adds another semaphore to limit the number of checksum
*calculations* (including the read and checksum calculation) on each shard
to just 2. There may still be 100 ongoing checksum *comparisons*, in
other stages of the comparisons (sending the checksum requests to other
and waiting for them to return), but only 2 will ever be in the stage of
reading from disk and checksumming them.

The limit of 2 checksum calculations (per shard) applies on the repair
slave, not just to the master: The slave may receive many checksum
requests in parallel, but will only actually work on 2 at a time.

Because the parallelism=100 now rate-limits operations which use very little
memory, in the future we can safely increase it even more, to support
situations where the disk is very fast but the link between nodes has
very high latency.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170703151329.25716-1-nyh@scylladb.com>
(cherry picked from commit d177ec05cb)
2017-07-10 14:08:28 +03:00
Avi Kivity
d475a44b01 Merge "Silence schema pull errors during upgrade from 1.7 to 2.0" from Tomasz
"Old and new nodes will advertise different schema version because
of different format of schema tables. This will result in attempts
to sync the schema by each of the node.

Currently this will result in scary error messages in logs about
sync failing due to not being able to find schema of given version.
It's benign, but may scare users. In the future, incompatibilities
could result in more subtle errors. Better to inhibit it completely."

* 'tgrabiec/fix-schema-pull-errors-during-upgrade' of github.com:cloudius-systems/seastar-dev:
  migration_manager: Give empty response to schema pulls from incompatible nodes
  migration_manager: Don't pull schema from incompatible nodes
  service: Advertise schema tables format version through gossip

(cherry picked from commit 91221e020b)
2017-07-10 14:04:41 +03:00
Pekka Enberg
e02d4935ee idl: Fix frozen_schema version numbers
The IDL changes will appear in 2.0 so fix up the version numbers.

Message-Id: <1499680669-6757-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 8112d7c5c0)
2017-07-10 14:02:37 +03:00
Botond Dénes
25f8d365b5 Add text(sstring) version of count, max and min functions
Fixes #2459

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6abb97f21c0caea8e36c7590b92a12d148195db.1499666251.git.bdenes@scylladb.com>
(cherry picked from commit 66cbc45321)
2017-07-10 12:48:29 +03:00
Tomasz Grabiec
de7cb7bfa4 tests: commitlog: Check there are no segments left on disk after clean shutdown
Reproduces #2550.

Message-Id: <1499358825-17855-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 72e01b7fe8)
2017-07-09 19:25:44 +03:00
Tomasz Grabiec
b8eb4ed9cd commitlog: Discard active but unused segments on shutdown
So that they are not left on disk even though we did a clean shutdown.

First part of the fix is to ensure that closed segments are recognized
as not allocating (_closed flag). Not doing this prevents them from
being collected by discard_unused_segments(). Second part is to
actually call discard_unused_segments() on shutdown after all segments
were shut down, so that those whose positions are cleared can be
removed.

Fixes #2550.

Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 6555a2f50b)
2017-07-09 19:25:42 +03:00
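A minimal sketch of the two-part fix described above — mark closed segments as no longer allocating, then collect the unused ones at shutdown. The `segment` fields and `segment_manager` names are illustrative, not the real commitlog API:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

struct segment {
    bool closed = false;   // no longer allocating (first part of the fix)
    bool has_data = false; // cleared once all positions are released
};

struct segment_manager {
    std::vector<std::shared_ptr<segment>> segments;

    // Collect segments that are closed for allocation and hold no live data.
    void discard_unused_segments() {
        segments.erase(std::remove_if(segments.begin(), segments.end(),
            [](const std::shared_ptr<segment>& s) {
                return s->closed && !s->has_data;
            }), segments.end());
    }

    void shutdown() {
        for (auto& s : segments) s->closed = true; // close all segments first
        discard_unused_segments();                 // second part of the fix
    }
};
```

Without the `closed` flag, a shut-down segment still looks like it might allocate, so `discard_unused_segments()` skips it and the file is left on disk.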
Tomasz Grabiec
fcc05e8ae9 legacy_schema_migrator: Drop tables instead of truncate()+remove()
It achieves a similar effect, but is safer than the non-standard remove()
path. The latter was missing unregistration from the compaction manager.

Fixes #2554.

Message-Id: <1499447165-30253-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit d33d29ad95)
2017-07-09 18:36:56 +03:00
Botond Dénes
05e2ac80af cql3: Add K_FROZEN and K_TUPLE to basic_unreserved_keyword
To allow the non-reserved keywords "frozen" and "tuple" to be used as
column names without double-quotes.

Fixes #2507

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9ae17390662aca90c14ae695c9b4a39531c6cde6.1499329781.git.bdenes@scylladb.com>
(cherry picked from commit c4277d6774)
2017-07-06 18:20:22 +03:00
Avi Kivity
8fa1add26d Update seastar submodule
* seastar 0ab7ae5...89cc97c (4):
  > future-utils: fix do_for_each exception reporting
  > core/thread: Fix unwind information for seastar threads
  > build: export full cflags in pkgconfig file
  > configure: Avoid putting tmp file on /tmp

Still tracking seastar master.
2017-07-06 17:31:06 +03:00
Takuya ASADA
c0a2ca96dd dist/common/scripts/scylla_raid_setup: prevent renaming MDRAID device after reboot
On Debian variants, mdadm.conf should be placed at /etc/mdadm instead of /etc.
Also, it seems we need to run update-initramfs to fix the renaming issue.

Fixes #2502

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499179912-14125-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 71624d7919)
2017-07-04 18:07:33 +03:00
Avi Kivity
c9ed522fa8 Merge "Adjust row cache metrics for row granularity" from Tomasz
* tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev:
  row_cache: Switch _stats.hits/misses to row granularity
  row_cache: Rename num_entries() to partitions() for clarity
  row_cache: Track mispopulations also at row level
  row_cache: Track row insertions
  row_cache: Track row hits and misses
  row_cache: Make mispopulation counter also apply for continuity information
  row_cache: Add partition_ prefix to current counters
  misc_services: Switch to using reads_with[_no]_misses counters
  row_cache: Add metrics for operations on underlying reader
  row_cache: Add reader-related metrics
  row_cache: Remove dead code

(cherry picked from commit b1a0e37fcb)
2017-07-04 15:21:00 +03:00
Tomasz Grabiec
9078433a7f row_cache: Restore update of concurrent_misses_same_key
It was lost in action in 6f6575f456.

Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e720b317c9)
2017-07-04 14:51:19 +03:00
Avi Kivity
7893a3aad2 Merge "Use selective_token_range_sharder in repair" from Asias
"This series introduces selective_token_range_sharder and uses it in repair to
generate dht::token_range objects belonging to a specific shard."

* tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev:
  repair: Use selective_token_range_sharder
  tests: Add test_selective_token_range_sharder
  dht: Add selective_token_range_sharder

(cherry picked from commit 66e56511d6)
2017-07-04 14:18:08 +03:00
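The idea behind selective_token_range_sharder can be illustrated as follows. This is a toy sketch: real Scylla maps tokens to shards via the partitioner, not modulo arithmetic, and token ranges are not small integer intervals.

```cpp
#include <utility>
#include <vector>

using token_range = std::pair<long, long>; // [first, last], inclusive

// Walk a token range and emit only the contiguous sub-ranges whose tokens
// map to the requested shard (here: token % shard_count, for illustration).
std::vector<token_range> subranges_for_shard(token_range r, int shard, int shard_count) {
    std::vector<token_range> out;
    long start = -1;
    for (long t = r.first; t <= r.second; ++t) {
        if (t % shard_count == shard) {
            if (start < 0) start = t;      // open a run owned by this shard
        } else if (start >= 0) {
            out.push_back({start, t - 1}); // close the current run
            start = -1;
        }
    }
    if (start >= 0) out.push_back({start, r.second});
    return out;
}
```

Repair can then hand each shard only the sub-ranges it owns instead of the full range.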
Nadav Har'El
e467eef58d Fix test to use non-wrapping range
The test put a wrapping range into a non-wrapping range variable.
This was harmless at the time this test was written, but newer code
may not be as forgiving, so better to use a non-wrapping range as intended.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170704103128.29689-1-nyh@scylladb.com>
(cherry picked from commit d95f908586)
2017-07-04 14:18:01 +03:00
Tomasz Grabiec
19a07143eb row_cache: Drop not very useful prefixes from metric names
This drops "total_operations_" and "objects_" prefixes. There is no
convention of adding them in other parts of the system, and they don't
add much value.

Fixes scylladb/scylla-grafana-monitoring#169.

Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 1d6fec0755)
2017-07-04 13:37:24 +03:00
Raphael S. Carvalho
a619b978c4 database: fix potential use-after-free in sstable cleanup
When do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, the sstable object will be used after
being freed, because it was taken by reference and the container it lives
in was destroyed prematurely.

Let's fix it with a do_with, also making the code nicer.

Fixes #2537.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
(cherry picked from commit b9d0645199)
2017-07-03 12:49:13 +03:00
Gleb Natapov
2c66b40a69 main: wait for wait_for_gossip_to_settle() to complete during boot
Boot should not continue until a future returned by
wait_for_gossip_to_settle() is resolved. Commit 991ec4a16 mistakenly
broke that, so restore it. Also fix calls to supervisor::notify()
to be in the right places.

Message-Id: <20170702082355.GQ14563@scylladb.com>
(cherry picked from commit d23111312f)
2017-07-02 11:33:04 +03:00
Tomasz Grabiec
079844a51d row_cache: Fix compilation errors with gcc 5
Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 97005825bf)
2017-06-29 16:35:02 +03:00
Avi Kivity
ea59e1fbd6 Update ami submodule
* dist/ami/files/scylla-ami f10db69...5dfe42f (1):
  > don't fetch perf from amazon repo

(cherry picked from commit 1317c4a03e)
2017-06-29 09:39:29 +03:00
Tomasz Grabiec
089b58ddfe row_cache: Use continuity information to decide whether to populate
If the cache is missing a given key, but the range is marked as continuous,
it means the sstables don't have that entry and we can insert it without
asking the presence checker (bloom filter based). The latter is more
expensive and gives false positives. So this improves update
performance and hit ratio.

Another positive effect is that we don't have to clear continuity now.

Fixes #1999.

Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 786e75dbf7)
2017-06-28 13:33:34 +03:00
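The population decision described in the commit above can be sketched like this. The names are hypothetical; the real code works on cache entries and a bloom-filter-backed presence checker rather than bare keys:

```cpp
// A stand-in for the bloom-filter-based check: expensive, may give
// false positives, so we count how often it is consulted.
struct presence_checker {
    int queries = 0;
    bool may_contain(long /*key*/) { ++queries; return false; }
};

// If the cache range around the missing key is marked continuous, the key
// cannot exist in the sstables, so it is safe to populate without asking.
bool should_populate(long key, bool range_continuous, presence_checker& pc) {
    if (range_continuous) {
        return true;                 // continuity proves absence in sstables
    }
    return !pc.may_contain(key);     // fall back to the expensive check
}
```

The win is twofold: the bloom-filter query is skipped on the hot path, and false positives no longer block population, improving hit ratio.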
Tomasz Grabiec
d76e9e4026 lsa: Fix performance regression in eviction and compact_on_idle
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).

This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slowdown as well.

Introduced in 11b5076b3c.

Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3489c68a68)
2017-06-28 12:33:11 +03:00
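The complexity difference the commit describes can be sketched with a toy histogram (illustrative, not the real `log_histogram`): `largest()` scans every element, O(N), while `one_of_largest()` only peeks at the highest non-empty size-class bucket, O(#buckets), effectively constant.

```cpp
#include <algorithm>
#include <array>
#include <vector>

struct log_histogram {
    // Bucket i holds values in [2^i, 2^(i+1)).
    std::array<std::vector<int>, 32> buckets;

    void add(int v) {
        int i = 0;
        while ((1 << (i + 1)) <= v) ++i;
        buckets[i].push_back(v);
    }

    // O(N): visits every element; suitable only for tests.
    int largest() const {
        int best = 0;
        for (auto& b : buckets)
            for (int v : b) best = std::max(best, v);
        return best;
    }

    // O(#buckets): returns some member of the highest non-empty bucket,
    // which is within 2x of the true maximum -- good enough for comparing
    // regions by occupancy.
    int one_of_largest() const {
        for (int i = 31; i >= 0; --i)
            if (!buckets[i].empty()) return buckets[i].front();
        return 0;
    }
};
```

The fix swaps `largest()` for `one_of_largest()` in the region comparator, so eviction and compact-on-idle no longer slow down as the segment count grows.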
Glauber Costa
7709b885c4 disable defragment-memory-on-idle-by-default
It's been linked with various performance issues, either by causing
them or making them worse. One example is #1634, and recently
I investigated continuous performance degradation that was also
linked to defrag-on-idle activity.

Until we can figure out how to reduce its impact, we should disable it.

Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
(cherry picked from commit f3742d1e38)
2017-06-28 00:21:35 +03:00
Avi Kivity
3de701dbe1 Merge "Fix compilation issues in older environments" from Tomasz
* 'tgrabiec/fix-compilation-issues' of github.com:cloudius-systems/seastar-dev:
  tests: streamed_mutation_test: Avoid using boost::size() on row ranges
  tests: row_cache: Remove unused method

(cherry picked from commit ff7be8241f)
2017-06-27 16:31:42 +03:00
Shlomi Livne
9912b7d1eb release: prepare for 2.0-rc0 2017-06-27 12:37:59 +03:00
204 changed files with 7789 additions and 2668 deletions

.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=2.0.5
if test -f version
then

View File

@@ -252,13 +252,13 @@ void set_cache_service(http_context& ctx, routes& r) {
// In origin row size is the weighted size.
// We currently do not support weights, so we use num entries instead
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return cf.get_row_cache().num_entries();
return cf.get_row_cache().partitions();
}, std::plus<uint64_t>());
});
cs::get_row_entries.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return cf.get_row_cache().num_entries();
return cf.get_row_cache().partitions();
}, std::plus<uint64_t>());
});

View File

@@ -114,7 +114,7 @@ struct hash<auth::authenticated_user> {
class auth::auth::permissions_cache {
public:
typedef utils::loading_cache<std::pair<authenticated_user, data_resource>, permission_set, utils::tuple_hash> cache_type;
typedef utils::loading_cache<std::pair<authenticated_user, data_resource>, permission_set, utils::loading_cache_reload_enabled::yes, utils::simple_entry_size<permission_set>, utils::tuple_hash> cache_type;
typedef typename cache_type::key_type key_type;
permissions_cache()
@@ -130,7 +130,7 @@ public:
}) {}
future<> stop() {
return make_ready_future<>();
return _cache.stop();
}
future<permission_set> get(::shared_ptr<authenticated_user> user, data_resource resource) {

View File

@@ -69,6 +69,29 @@ public:
};
class cache_streamed_mutation final : public streamed_mutation::impl {
enum class state {
before_static_row,
// Invariants:
// - position_range(_lower_bound, _upper_bound) covers all not yet emitted positions from current range
// - _next_row points to the nearest row in cache >= _lower_bound
// - _next_row_in_range = _next.position() < _upper_bound
reading_from_cache,
// Starts reading from underlying reader.
// The range to read is position_range(_lower_bound, min(_next_row.position(), _upper_bound)).
// Invariants:
// - _next_row_in_range = _next.position() < _upper_bound
move_to_underlying,
// Invariants:
// - Upper bound of the read is min(_next_row.position(), _upper_bound)
// - _next_row_in_range = _next.position() < _upper_bound
// - _last_row_key contains the key of last emitted clustering_row
reading_from_underlying,
end_of_stream
};
lw_shared_ptr<partition_snapshot> _snp;
position_in_partition::tri_compare _position_cmp;
@@ -92,25 +115,24 @@ class cache_streamed_mutation final : public streamed_mutation::impl {
position_in_partition _lower_bound;
position_in_partition_view _upper_bound;
bool _static_row_done = false;
bool _reading_underlying = false;
state _state = state::before_static_row;
lw_shared_ptr<read_context> _read_context;
partition_snapshot_row_cursor _next_row;
bool _next_row_in_range = false;
future<> do_fill_buffer();
future<> copy_from_cache_to_buffer();
void copy_from_cache_to_buffer();
future<> process_static_row();
void move_to_end();
future<> move_to_next_range();
future<> move_to_current_range();
future<> move_to_next_entry();
void move_to_next_range();
void move_to_current_range();
void move_to_next_entry();
// Emits all delayed range tombstones with positions smaller than upper_bound.
void drain_tombstones(position_in_partition_view upper_bound);
// Emits all delayed range tombstones.
void drain_tombstones();
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_to_buffer(clustering_row&&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
void add_to_buffer(mutation_fragment&&);
future<> read_from_underlying();
@@ -154,12 +176,16 @@ public:
inline
future<> cache_streamed_mutation::process_static_row() {
if (_snp->version()->partition().static_row_continuous()) {
row sr = _snp->static_row();
_read_context->cache().on_row_hit();
row sr = _lsa_manager.run_in_read_section([this] {
return _snp->static_row();
});
if (!sr.empty()) {
push_mutation_fragment(mutation_fragment(static_row(std::move(sr))));
}
return make_ready_future<>();
} else {
_read_context->cache().on_row_miss();
return _read_context->get_next_fragment().then([this] (mutation_fragment_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
@@ -173,15 +199,24 @@ future<> cache_streamed_mutation::process_static_row() {
inline
future<> cache_streamed_mutation::fill_buffer() {
if (!_static_row_done) {
_static_row_done = true;
return process_static_row().then([this] {
return _lsa_manager.run_in_read_section([this] {
return move_to_current_range();
}).then([this] {
return fill_buffer();
if (_state == state::before_static_row) {
auto after_static_row = [this] {
if (_ck_ranges_curr == _ck_ranges_end) {
_end_of_stream = true;
_state = state::end_of_stream;
return make_ready_future<>();
}
_state = state::reading_from_cache;
_lsa_manager.run_in_read_section([this] {
move_to_current_range();
});
});
return fill_buffer();
};
if (_schema->has_static_columns()) {
return process_static_row().then(std::move(after_static_row));
} else {
return after_static_row();
}
}
return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this] {
return do_fill_buffer();
@@ -190,18 +225,27 @@ future<> cache_streamed_mutation::fill_buffer() {
inline
future<> cache_streamed_mutation::do_fill_buffer() {
if (_reading_underlying) {
if (_state == state::move_to_underlying) {
_state = state::reading_from_underlying;
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}).then([this] {
return read_from_underlying();
});
}
if (_state == state::reading_from_underlying) {
return read_from_underlying();
}
// assert(_state == state::reading_from_cache)
return _lsa_manager.run_in_read_section([this] {
auto same_pos = _next_row.maybe_refresh();
// FIXME: If continuity changed anywhere between _lower_bound and _next_row.position()
// we need to redo the lookup with _lower_bound. There is no eviction yet, so not yet a problem.
assert(same_pos);
while (!is_buffer_full() && !_end_of_stream && !_reading_underlying) {
future<> f = copy_from_cache_to_buffer();
if (!f.available() || need_preempt()) {
return f;
while (!is_buffer_full() && _state == state::reading_from_cache) {
copy_from_cache_to_buffer();
if (need_preempt()) {
break;
}
}
return make_ready_future<>();
@@ -210,33 +254,34 @@ future<> cache_streamed_mutation::do_fill_buffer() {
inline
future<> cache_streamed_mutation::read_from_underlying() {
return do_until([this] { return !_reading_underlying || is_buffer_full(); }, [this] {
return _read_context->get_next_fragment().then([this] (auto&& mfopt) {
if (!mfopt) {
_reading_underlying = false;
return _lsa_manager.run_in_update_section([this] {
auto same_pos = _next_row.maybe_refresh();
assert(same_pos); // FIXME: handle eviction
if (_next_row_in_range) {
return consume_mutation_fragments_until(_read_context->get_streamed_mutation(),
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
_read_context->cache().on_row_miss();
maybe_add_to_cache(mf);
add_to_buffer(std::move(mf));
},
[this] {
_state = state::reading_from_cache;
_lsa_manager.run_in_update_section([this] {
auto same_pos = _next_row.maybe_refresh();
assert(same_pos); // FIXME: handle eviction
if (_next_row_in_range) {
maybe_update_continuity();
add_to_buffer(_next_row);
move_to_next_entry();
} else {
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
this->add_to_buffer(_next_row);
return this->move_to_next_entry();
} else {
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
} else {
// FIXME: Insert dummy entry at _upper_bound.
}
return this->move_to_next_range();
// FIXME: Insert dummy entry at _upper_bound.
_read_context->cache().on_mispopulate();
}
});
} else {
this->maybe_add_to_cache(*mfopt);
this->add_to_buffer(std::move(*mfopt));
return make_ready_future<>();
}
move_to_next_range();
}
});
return make_ready_future<>();
});
});
}
inline
@@ -249,6 +294,8 @@ void cache_streamed_mutation::maybe_update_continuity() {
} else if (!_ck_ranges_curr->start()) {
_next_row.set_continuous(true);
}
} else {
_read_context->cache().on_mispopulate();
}
}
@@ -266,6 +313,7 @@ void cache_streamed_mutation::maybe_add_to_cache(const mutation_fragment& mf) {
inline
void cache_streamed_mutation::maybe_add_to_cache(const clustering_row& cr) {
if (!can_populate()) {
_read_context->cache().on_mispopulate();
return;
}
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
@@ -281,10 +329,11 @@ void cache_streamed_mutation::maybe_add_to_cache(const clustering_row& cr) {
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));
new_entry->set_continuous(false);
auto it = _next_row.has_up_to_date_row_from_latest_version()
auto it = _next_row.has_valid_row_from_latest_version()
? _next_row.get_iterator_in_latest_version() : mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
if (insert_result.second) {
_read_context->cache().on_row_insert();
new_entry.release();
}
it = insert_result.first;
@@ -294,11 +343,12 @@ void cache_streamed_mutation::maybe_add_to_cache(const clustering_row& cr) {
if (it == mp.clustered_rows().begin()) {
// FIXME: check whether entry for _last_row_key is in older versions and if so set
// continuity to true.
_read_context->cache().on_mispopulate();
} else {
auto prev_it = it;
--prev_it;
clustering_key_prefix::tri_compare tri_comp(*_schema);
if (tri_comp(*_last_row_key, prev_it->key()) == 0) {
clustering_key_prefix::equality eq(*_schema);
if (eq(*_last_row_key, prev_it->key())) {
e.set_continuous(true);
}
}
@@ -306,6 +356,7 @@ void cache_streamed_mutation::maybe_add_to_cache(const clustering_row& cr) {
e.set_continuous(true);
} else {
// FIXME: Insert dummy entry at _ck_ranges_curr->start()
_read_context->cache().on_mispopulate();
}
});
}
@@ -317,26 +368,24 @@ bool cache_streamed_mutation::after_current_range(position_in_partition_view p)
inline
future<> cache_streamed_mutation::start_reading_from_underlying() {
_reading_underlying = true;
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)});
_state = state::move_to_underlying;
return make_ready_future<>();
}
inline
future<> cache_streamed_mutation::copy_from_cache_to_buffer() {
void cache_streamed_mutation::copy_from_cache_to_buffer() {
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto&& rts : _snp->range_tombstones(*_schema, _lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
add_to_buffer(std::move(rts));
if (is_buffer_full()) {
return make_ready_future<>();
return;
}
}
if (_next_row_in_range) {
add_to_buffer(_next_row);
return move_to_next_entry();
move_to_next_entry();
} else {
return move_to_next_range();
move_to_next_range();
}
}
@@ -344,47 +393,45 @@ inline
void cache_streamed_mutation::move_to_end() {
drain_tombstones();
_end_of_stream = true;
_state = state::end_of_stream;
}
inline
future<> cache_streamed_mutation::move_to_next_range() {
void cache_streamed_mutation::move_to_next_range() {
++_ck_ranges_curr;
if (_ck_ranges_curr == _ck_ranges_end) {
move_to_end();
return make_ready_future<>();
} else {
return move_to_current_range();
move_to_current_range();
}
}
inline
future<> cache_streamed_mutation::move_to_current_range() {
void cache_streamed_mutation::move_to_current_range() {
_last_row_key = std::experimental::nullopt;
_lower_bound = position_in_partition::for_range_start(*_ck_ranges_curr);
_upper_bound = position_in_partition_view::for_range_end(*_ck_ranges_curr);
auto complete_until_next = _next_row.advance_to(_lower_bound) || _next_row.continuous();
_next_row_in_range = !after_current_range(_next_row.position());
if (!complete_until_next) {
return start_reading_from_underlying();
start_reading_from_underlying();
}
return make_ready_future<>();
}
// _next_row must be inside the range.
inline
future<> cache_streamed_mutation::move_to_next_entry() {
void cache_streamed_mutation::move_to_next_entry() {
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
return move_to_next_range();
move_to_next_range();
} else {
if (!_next_row.next()) {
move_to_end();
return make_ready_future<>();
return;
}
_next_row_in_range = !after_current_range(_next_row.position());
if (!_next_row.continuous()) {
return start_reading_from_underlying();
start_reading_from_underlying();
}
return make_ready_future<>();
}
}
@@ -405,7 +452,7 @@ void cache_streamed_mutation::drain_tombstones() {
inline
void cache_streamed_mutation::add_to_buffer(mutation_fragment&& mf) {
if (mf.is_clustering_row()) {
add_to_buffer(std::move(std::move(mf).as_clustering_row()));
add_clustering_row_to_buffer(std::move(mf));
} else {
assert(mf.is_range_tombstone());
add_to_buffer(std::move(mf).as_range_tombstone());
@@ -415,16 +462,18 @@ void cache_streamed_mutation::add_to_buffer(mutation_fragment&& mf) {
inline
void cache_streamed_mutation::add_to_buffer(const partition_snapshot_row_cursor& row) {
if (!row.dummy()) {
add_to_buffer(row.row());
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(row.row());
}
}
inline
void cache_streamed_mutation::add_to_buffer(clustering_row&& row) {
void cache_streamed_mutation::add_clustering_row_to_buffer(mutation_fragment&& mf) {
auto& row = mf.as_clustering_row();
drain_tombstones(row.position());
_last_row_key = row.key();
_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(row));
push_mutation_fragment(std::move(mf));
}
inline
@@ -442,17 +491,22 @@ inline
void cache_streamed_mutation::maybe_add_to_cache(const range_tombstone& rt) {
if (can_populate()) {
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().apply_row_tombstone(*_schema, rt);
_snp->version()->partition().row_tombstones().apply_monotonically(*_schema, rt);
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const static_row& sr) {
if (can_populate()) {
_read_context->cache().on_row_insert();
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().static_row().apply(*_schema, column_kind::static_column, sr.cells());
});
} else {
_read_context->cache().on_mispopulate();
}
}
@@ -460,6 +514,8 @@ inline
void cache_streamed_mutation::maybe_set_static_row_continuous() {
if (can_populate()) {
_snp->version()->partition().set_static_row_continuous(true);
} else {
_read_context->cache().on_mispopulate();
}
}

View File

@@ -43,10 +43,14 @@ private:
bool advance_to_next_range() {
_in_current = false;
if (!_current_start.is_static_row()) {
if (_current == _end) {
return false;
}
++_current;
}
++_change_counter;
if (_current == _end) {
_current_end = _current_start = position_in_partition_view::after_all_clustered_rows();
return false;
}
_current_start = position_in_partition_view::for_range_start(*_current);
@@ -61,11 +65,18 @@ public:
, _end(ranges.end())
, _in_current(with_static_row)
, _with_static_row(with_static_row)
, _current_start(with_static_row ? position_in_partition_view::for_static_row()
: position_in_partition_view::for_range_start(*_current))
, _current_end(with_static_row ? position_in_partition_view::before_all_clustered_rows()
: position_in_partition_view::for_range_end(*_current))
{ }
, _current_start(position_in_partition_view::for_static_row())
, _current_end(position_in_partition_view::before_all_clustered_rows())
{
if (!with_static_row) {
if (_current == _end) {
_current_start = position_in_partition_view::before_all_clustered_rows();
} else {
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
}
}
}
clustering_ranges_walker(clustering_ranges_walker&& o) noexcept
: _schema(o._schema)
, _ranges(o._ranges)
@@ -94,10 +105,6 @@ public:
void trim_front(position_in_partition pos) {
position_in_partition::less_compare less(_schema);
if (_current == _end) {
return;
}
do {
if (!less(_current_start, pos)) {
break;
@@ -118,10 +125,6 @@ public:
bool advance_to(position_in_partition_view pos) {
position_in_partition::less_compare less(_schema);
if (_current == _end) {
return false;
}
do {
if (!_in_current && less(pos, _current_start)) {
break;
@@ -146,12 +149,8 @@ public:
bool advance_to(position_in_partition_view start, position_in_partition_view end) {
position_in_partition::less_compare less(_schema);
if (_current == _end) {
return false;
}
do {
if (less(end, _current_start)) {
if (!less(_current_start, end)) {
break;
}
if (less(start, _current_end)) {
@@ -192,7 +191,7 @@ public:
// Returns true if advanced past all contained positions. Any later advance_to() until reset() will return false.
bool out_of_range() const {
return _current == _end;
return !_in_current && _current == _end;
}
// Resets the state of the walker so that advance_to() can be now called for new sequence of positions.

View File

@@ -241,7 +241,7 @@ public:
using component_view = std::pair<bytes_view, eoc>;
private:
template<typename Value, typename = std::enable_if_t<!std::is_same<const data_value, std::decay_t<Value>>::value>>
static size_t size(Value& val) {
static size_t size(const Value& val) {
return val.size();
}
static size_t size(const data_value& val) {
@@ -445,17 +445,16 @@ public:
return _is_compound;
}
// The following factory functions assume this composite is a compound value.
template <typename ClusteringElement>
static composite from_clustering_element(const schema& s, const ClusteringElement& ce) {
return serialize_value(ce.components(s));
return serialize_value(ce.components(s), s.is_compound());
}
static composite from_exploded(const std::vector<bytes_view>& v, eoc marker = eoc::none) {
static composite from_exploded(const std::vector<bytes_view>& v, bool is_compound, eoc marker = eoc::none) {
if (v.size() == 0) {
return composite(bytes(size_t(1), bytes::value_type(marker)));
return composite(bytes(size_t(1), bytes::value_type(marker)), is_compound);
}
return serialize_value(v, true, marker);
return serialize_value(v, is_compound, marker);
}
static composite static_prefix(const schema& s) {
@@ -499,14 +498,15 @@ public:
, _is_compound(true)
{ }
std::vector<bytes> explode() const {
std::vector<bytes_view> explode() const {
if (!_is_compound) {
return { to_bytes(_bytes) };
return { _bytes };
}
std::vector<bytes> ret;
std::vector<bytes_view> ret;
ret.reserve(8);
for (auto it = begin(), e = end(); it != e; ) {
ret.push_back(to_bytes(it->first));
ret.push_back(it->first);
auto marker = it->second;
++it;
if (it != e && marker != composite::eoc::none) {

View File

@@ -34,7 +34,7 @@ for line in open('/etc/os-release'):
os_ids += value.split(' ')
# distribution "internationalization", converting package names.
# Fedora name is key, values is distro -> package name dict.
# Fedora name is key, values is distro -> package name dict.
i18n_xlat = {
'boost-devel': {
'debian': 'libboost-dev',
@@ -48,7 +48,7 @@ def pkgname(name):
for id in os_ids:
if id in dict:
return dict[id]
return name
return name
def get_flags():
with open('/proc/cpuinfo') as f:
@@ -175,6 +175,8 @@ scylla_tests = [
'tests/keys_test',
'tests/partitioner_test',
'tests/frozen_mutation_test',
'tests/serialized_action_test',
'tests/clustering_ranges_walker_test',
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
'tests/lsa_sync_eviction_test',
@@ -236,6 +238,7 @@ scylla_tests = [
'tests/view_schema_test',
'tests/counter_test',
'tests/cell_locker_test',
'tests/loading_cache_test',
]
apps = [
@@ -290,6 +293,8 @@ add_tristate(arg_parser, name = 'hwloc', dest = 'hwloc', help = 'hwloc support')
add_tristate(arg_parser, name = 'xen', dest = 'xen', help = 'Xen support')
arg_parser.add_argument('--enable-gcc6-concepts', dest='gcc6_concepts', action='store_true', default=False,
help='enable experimental support for C++ Concepts as implemented in GCC 6')
arg_parser.add_argument('--enable-alloc-failure-injector', dest='alloc_failure_injector', action='store_true', default=False,
help='enable allocation failure injection')
args = arg_parser.parse_args()
defines = []
@@ -640,7 +645,7 @@ for t in tests_not_using_seastar_test_framework:
for t in scylla_tests:
deps[t] = [t + '.cc']
if t not in tests_not_using_seastar_test_framework:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
@@ -726,6 +731,9 @@ if not try_compile(compiler=args.cxx, source='''\
print('Installed boost version too old. Please update {}.'.format(pkgname("boost-devel")))
sys.exit(1)
has_sanitize_address_use_after_scope = try_compile(compiler=args.cxx, flags=['-fsanitize-address-use-after-scope'], source='int f() {}')
defines = ' '.join(['-D' + d for d in defines])
globals().update(vars(args))
@@ -760,6 +768,8 @@ if args.staticboost:
seastar_flags += ['--static-boost']
if args.gcc6_concepts:
seastar_flags += ['--enable-gcc6-concepts']
if args.alloc_failure_injector:
seastar_flags += ['--enable-alloc-failure-injector']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_flags += ['--compiler', args.cxx, '--c-compiler', args.cc, '--cflags=%s' % (seastar_cflags)]
@@ -857,7 +867,7 @@ with open(buildfile, 'w') as f:
f.write(textwrap.dedent('''\
cxxflags_{mode} = -I. -I $builddir/{mode}/gen -I seastar -I seastar/build/{mode}/gen
rule cxx.{mode}
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} -c -o $out $in
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} $obj_cxxflags -c -o $out $in
description = CXX $out
depfile = $out.d
rule link.{mode}
@@ -875,7 +885,16 @@ with open(buildfile, 'w') as f:
command = thrift -gen cpp:cob_style -out $builddir/{mode}/gen $in
description = THRIFT $in
rule antlr3.{mode}
command = sed -e '/^#if 0/,/^#endif/d' $in > $builddir/{mode}/gen/$in && antlr3 $builddir/{mode}/gen/$in && sed -i 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' build/{mode}/gen/${{stem}}Parser.cpp
# We replace many local `ExceptionBaseType* ex` variables with a single function-scope one.
# Because we add such a variable to every function, and because `ExceptionBaseType` is not a global
# name, we also add a global typedef to avoid compilation errors.
command = sed -e '/^#if 0/,/^#endif/d' $in > $builddir/{mode}/gen/$in $
&& antlr3 $builddir/{mode}/gen/$in $
&& sed -i -e 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' $
-e '1i using ExceptionBaseType = int;' $
-e 's/^{{/{{ ExceptionBaseType\* ex = nullptr;/; $
s/ExceptionBaseType\* ex = new/ex = new/' $
build/{mode}/gen/${{stem}}Parser.cpp
description = ANTLR3 $in
''').format(mode = mode, **modeval))
f.write('build {mode}: phony {artifacts}\n'.format(mode = mode,
@@ -918,7 +937,7 @@ with open(buildfile, 'w') as f:
if binary.startswith('tests/'):
local_libs = '$libs'
if binary not in tests_not_using_seastar_test_framework or binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
# Our code's debugging information is huge, and multiplied
@@ -992,6 +1011,9 @@ with open(buildfile, 'w') as f:
for cc in grammar.sources('$builddir/{}/gen'.format(mode)):
obj = cc.replace('.cpp', '.o')
f.write('build {}: cxx.{} {} || {}\n'.format(obj, mode, cc, ' '.join(serializers)))
if cc.endswith('Parser.cpp') and has_sanitize_address_use_after_scope:
# Parsers end up using huge amounts of stack space and overflowing their stack
f.write(' obj_cxxflags = -fno-sanitize-address-use-after-scope\n')
f.write('build seastar/build/{mode}/libseastar.a seastar/build/{mode}/apps/iotune/iotune seastar/build/{mode}/gen/http/request_parser.hh seastar/build/{mode}/gen/http/http_response_parser.hh: ninja {seastar_deps}\n'
.format(**locals()))
f.write(' pool = seastar_pool\n')


@@ -29,6 +29,15 @@ counter_id counter_id::local()
return counter_id(service::get_local_storage_service().get_local_id());
}
bool counter_id::less_compare_1_7_4::operator()(const counter_id& a, const counter_id& b) const
{
if (a._most_significant != b._most_significant) {
return a._most_significant < b._most_significant;
} else {
return a._least_significant < b._least_significant;
}
}
std::ostream& operator<<(std::ostream& os, const counter_id& id) {
return os << id.to_uuid();
}
@@ -42,6 +51,33 @@ std::ostream& operator<<(std::ostream& os, counter_cell_view ccv) {
return os << "{counter_cell timestamp: " << ccv.timestamp() << " shards: {" << ::join(", ", ccv.shards()) << "}}";
}
void counter_cell_builder::do_sort_and_remove_duplicates()
{
boost::range::sort(_shards, [] (auto& a, auto& b) { return a.id() < b.id(); });
std::vector<counter_shard> new_shards;
new_shards.reserve(_shards.size());
for (auto& cs : _shards) {
if (new_shards.empty() || new_shards.back().id() != cs.id()) {
new_shards.emplace_back(cs);
} else {
new_shards.back().apply(cs);
}
}
_shards = std::move(new_shards);
_sorted = true;
}
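The sort-and-merge above can be sketched in Python. This is an illustrative sketch, not the real implementation: shards are assumed to be `(id, value, logical_clock)` tuples, and `apply_shard` mimics `counter_shard::apply`, where the shard with the higher logical clock wins.

```python
def apply_shard(dst, src):
    """Merge src into dst: the shard with the higher logical clock wins."""
    if dst[2] < src[2]:
        return (dst[0], src[1], src[2])
    return dst

def sort_and_remove_duplicates(shards):
    """Sort shards by id, then merge adjacent shards with equal ids."""
    shards = sorted(shards, key=lambda s: s[0])
    merged = []
    for s in shards:
        if merged and merged[-1][0] == s[0]:
            merged[-1] = apply_shard(merged[-1], s)
        else:
            merged.append(s)
    return merged
```

After sorting, duplicates are always adjacent, so a single linear pass suffices to merge them.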
std::vector<counter_shard> counter_cell_view::shards_compatible_with_1_7_4() const
{
auto sorted_shards = boost::copy_range<std::vector<counter_shard>>(shards());
counter_id::less_compare_1_7_4 cmp;
boost::range::sort(sorted_shards, [&] (auto& a, auto& b) {
return cmp(a.id(), b.id());
});
return sorted_shards;
}
static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());


@@ -36,6 +36,10 @@ class counter_id {
int64_t _least_significant;
int64_t _most_significant;
public:
static_assert(std::is_same<decltype(std::declval<utils::UUID>().get_least_significant_bits()), int64_t>::value
&& std::is_same<decltype(std::declval<utils::UUID>().get_most_significant_bits()), int64_t>::value,
"utils::UUID is expected to work with two signed 64-bit integers");
counter_id() = default;
explicit counter_id(utils::UUID uuid) noexcept
: _least_significant(uuid.get_least_significant_bits())
@@ -49,12 +53,20 @@ public:
bool operator<(const counter_id& other) const {
return to_uuid() < other.to_uuid();
}
bool operator>(const counter_id& other) const {
return other.to_uuid() < to_uuid();
}
bool operator==(const counter_id& other) const {
return to_uuid() == other.to_uuid();
}
bool operator!=(const counter_id& other) const {
return !(*this == other);
}
public:
// (Wrong) Counter ID ordering used by Scylla 1.7.4 and earlier.
struct less_compare_1_7_4 {
bool operator()(const counter_id& a, const counter_id& b) const;
};
public:
static counter_id local();
@@ -139,6 +151,22 @@ private:
static void write(const T& value, bytes::iterator& out) {
out = std::copy_n(reinterpret_cast<const signed char*>(&value), sizeof(T), out);
}
private:
// Shared logic for applying counter_shards and counter_shard_views.
// T is either counter_shard or basic_counter_shard_view<U>.
template<typename T>
GCC6_CONCEPT(requires requires(T shard) {
{ shard.value() } -> int64_t;
{ shard.logical_clock() } -> int64_t;
})
counter_shard& do_apply(T&& other) noexcept {
auto other_clock = other.logical_clock();
if (_logical_clock < other_clock) {
_logical_clock = other_clock;
_value = other.value();
}
return *this;
}
public:
counter_shard(counter_id id, int64_t value, int64_t logical_clock) noexcept
: _id(id)
@@ -163,12 +191,11 @@ public:
}
counter_shard& apply(counter_shard_view other) noexcept {
auto other_clock = other.logical_clock();
if (_logical_clock < other_clock) {
_logical_clock = other_clock;
_value = other.value();
}
return *this;
return do_apply(other);
}
counter_shard& apply(const counter_shard& other) noexcept {
return do_apply(other);
}
static size_t serialized_size() {
@@ -183,6 +210,9 @@ public:
class counter_cell_builder {
std::vector<counter_shard> _shards;
bool _sorted = true;
private:
void do_sort_and_remove_duplicates();
public:
counter_cell_builder() = default;
counter_cell_builder(size_t shard_count) {
@@ -193,6 +223,21 @@ public:
_shards.emplace_back(cs);
}
void add_maybe_unsorted_shard(const counter_shard& cs) {
add_shard(cs);
if (_sorted && _shards.size() > 1) {
auto current = _shards.rbegin();
auto previous = std::next(current);
_sorted = current->id() > previous->id();
}
}
void sort_and_remove_duplicates() {
if (!_sorted) {
do_sort_and_remove_duplicates();
}
}
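The `add_maybe_unsorted_shard` bookkeeping above can be sketched in Python (a hypothetical builder holding bare ids, not real `counter_shard`s): after each append, only the new element and its predecessor are compared, so the check is O(1), and once `_sorted` drops to false it stays false until the full sort runs.

```python
class ShardBuilder:
    def __init__(self):
        self.shards = []
        self.sorted = True

    def add_maybe_unsorted_shard(self, shard_id):
        self.shards.append(shard_id)
        if self.sorted and len(self.shards) > 1:
            # strictly greater, as in the C++ code: a duplicate id
            # also marks the builder as needing a sort-and-merge pass
            self.sorted = self.shards[-1] > self.shards[-2]

    def sort_and_remove_duplicates(self):
        if not self.sorted:
            self.shards = sorted(set(self.shards))
            self.sorted = True
```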
size_t serialized_size() const {
return _shards.size() * counter_shard::serialized_size();
}
@@ -339,6 +384,9 @@ public:
struct counter_cell_view : basic_counter_cell_view<bytes_view> {
using basic_counter_cell_view::basic_counter_cell_view;
// Returns counter shards in an order that is compatible with Scylla 1.7.4.
std::vector<counter_shard> shards_compatible_with_1_7_4() const;
// Reversibly applies two counter cells, at least one of them must be live.
// Returns true iff dst was modified.
static bool apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);

cpu_controller.hh Normal file

@@ -0,0 +1,89 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastar/core/thread.hh>
#include <seastar/core/timer.hh>
#include <chrono>
// Simple proportional controller to adjust shares of memtable/streaming flushes.
//
// Goal is to flush as fast as we can, but not so fast that we steal all the CPU from incoming
// requests, and at the same time minimize user-visible fluctuations in the flush quota.
//
// What that translates to is we'll try to keep virtual dirty's first derivative at 0 (IOW, we keep
// virtual dirty constant), which means that the rate of incoming writes is equal to the rate of
// flushed bytes.
//
// The exact point at which the controller stops determines the desired flush CPU usage. As we
// approach the hard dirty limit, we need to be more aggressive. We will therefore define two
// thresholds, and increase the constant as we cross them.
//
// 1) the soft limit line
// 2) halfway between soft limit and dirty limit
//
// The constants q1 and q2 are used to determine the proportional factor at each stage.
//
// Below the soft limit, we are in no particular hurry to flush, since it means we're set to
// complete flushing before a new memtable is ready. The quota is dirty * q1, and q1 is set to a
// low number.
//
// The first half of the virtual dirty region is where we expect to be usually, so we have a low
// slope corresponding to a sluggish response between q1 * soft_limit and q2.
//
// In the second half, we're getting close to the hard dirty limit so we increase the slope and
// become more responsive, up to a maximum quota of qmax.
//
// For now we'll just set them in the structure so as not to complicate the constructor. But q1, q2 and
// qmax can easily become parameters if we find another user.
class flush_cpu_controller {
static constexpr float hard_dirty_limit = 0.50;
static constexpr float q1 = 0.01;
static constexpr float q2 = 0.2;
static constexpr float qmax = 1;
float _current_quota = 0.0f;
float _goal;
std::function<float()> _current_dirty;
std::chrono::milliseconds _interval;
timer<> _update_timer;
seastar::thread_scheduling_group _scheduling_group;
seastar::thread_scheduling_group *_current_scheduling_group = nullptr;
void adjust();
public:
seastar::thread_scheduling_group* scheduling_group() {
return _current_scheduling_group;
}
float current_quota() const {
return _current_quota;
}
struct disabled {
seastar::thread_scheduling_group *backup;
};
flush_cpu_controller(disabled d) : _scheduling_group(std::chrono::nanoseconds(0), 0), _current_scheduling_group(d.backup) {}
flush_cpu_controller(std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty);
flush_cpu_controller(flush_cpu_controller&&) = default;
};
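The two-slope curve described in the comment above can be sketched in Python. Note the diff does not show the body of `adjust()`, so the linear interpolation below is an illustrative assumption consistent with the comment: quota is `dirty * q1` below the soft limit, ramps from `q1 * soft_limit` to `q2` over the first half of the virtual-dirty region, and from `q2` up to `qmax` over the second half.

```python
# Constants taken from the flush_cpu_controller class above.
HARD_DIRTY_LIMIT = 0.50
Q1, Q2, QMAX = 0.01, 0.2, 1.0

def flush_quota(dirty, soft_limit):
    """Piecewise-linear CPU quota as a function of the virtual dirty fraction."""
    mid = (soft_limit + HARD_DIRTY_LIMIT) / 2
    if dirty <= soft_limit:
        # no particular hurry below the soft limit
        return dirty * Q1
    if dirty <= mid:
        # first half: low slope from q1 * soft_limit up to q2
        t = (dirty - soft_limit) / (mid - soft_limit)
        return Q1 * soft_limit + t * (Q2 - Q1 * soft_limit)
    # second half: steeper slope from q2 up to qmax, clamped at qmax
    t = min(1.0, (dirty - mid) / (HARD_DIRTY_LIMIT - mid))
    return Q2 + t * (QMAX - Q2)
```

As the dirty fraction crosses each threshold the slope increases, making the controller progressively more aggressive as it approaches the hard dirty limit.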


@@ -1550,6 +1550,8 @@ basic_unreserved_keyword returns [sstring str]
| K_DISTINCT
| K_CONTAINS
| K_STATIC
| K_FROZEN
| K_TUPLE
| K_FUNCTION
| K_AGGREGATE
| K_SFUNC


@@ -75,6 +75,10 @@ functions::init() {
declare(aggregate_fcts::make_max_function<double>());
declare(aggregate_fcts::make_min_function<double>());
declare(aggregate_fcts::make_count_function<sstring>());
declare(aggregate_fcts::make_max_function<sstring>());
declare(aggregate_fcts::make_min_function<sstring>());
//FIXME:
//declare(aggregate_fcts::make_count_function<bytes>());
//declare(aggregate_fcts::make_max_function<bytes>());


@@ -0,0 +1,171 @@
/*
* Copyright (C) 2017 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "utils/loading_cache.hh"
#include "cql3/statements/prepared_statement.hh"
namespace cql3 {
using prepared_cache_entry = std::unique_ptr<statements::prepared_statement>;
struct prepared_cache_entry_size {
size_t operator()(const prepared_cache_entry& val) {
// TODO: improve the size approximation
return 10000;
}
};
typedef bytes cql_prepared_id_type;
typedef int32_t thrift_prepared_id_type;
/// \brief The key of the prepared statements cache
///
/// We are going to store the CQL and Thrift prepared statements in the same cache, therefore we need to generate a key
/// that is unique in both cases. Thrift uses an int32_t as a prepared statement ID; CQL uses an MD5 digest.
///
/// We are going to use a std::pair<CQL_PREP_ID_TYPE, int64_t> as a key. For CQL statements we will use {CQL_PREP_ID, std::numeric_limits<int64_t>::max()} as a key
/// and for Thrift, {CQL_PREP_ID_TYPE(0), THRIFT_PREP_ID}. This way CQL and Thrift key values will never collide.
class prepared_cache_key_type {
public:
using cache_key_type = std::pair<cql_prepared_id_type, int64_t>;
private:
cache_key_type _key;
public:
prepared_cache_key_type() = default;
explicit prepared_cache_key_type(cql_prepared_id_type cql_id) : _key(std::move(cql_id), std::numeric_limits<int64_t>::max()) {}
explicit prepared_cache_key_type(thrift_prepared_id_type thrift_id) : _key(cql_prepared_id_type(), thrift_id) {}
cache_key_type& key() { return _key; }
const cache_key_type& key() const { return _key; }
static const cql_prepared_id_type& cql_id(const prepared_cache_key_type& key) {
return key.key().first;
}
static thrift_prepared_id_type thrift_id(const prepared_cache_key_type& key) {
return key.key().second;
}
};
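The collision-freedom of the key scheme above can be illustrated with a small Python sketch (pairs standing in for the C++ `std::pair`): CQL keys pair the MD5 digest with INT64_MAX, while Thrift keys pair an empty digest with the 32-bit Thrift id, and since a Thrift id fits in 32 bits it can never equal INT64_MAX.

```python
INT64_MAX = 2**63 - 1  # std::numeric_limits<int64_t>::max()

def cql_key(md5_digest: bytes):
    """CQL key: (digest, INT64_MAX)."""
    return (md5_digest, INT64_MAX)

def thrift_key(thrift_id: int):
    """Thrift key: (empty digest, 32-bit id); the id is never INT64_MAX."""
    return (b"", thrift_id)
```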
class prepared_statements_cache {
public:
struct stats {
uint64_t prepared_cache_evictions = 0;
};
static stats& shard_stats() {
static thread_local stats _stats;
return _stats;
}
struct prepared_cache_stats_updater {
static void inc_hits() noexcept {}
static void inc_misses() noexcept {}
static void inc_blocks() noexcept {}
static void inc_evictions() noexcept {
++shard_stats().prepared_cache_evictions;
}
};
private:
using cache_key_type = typename prepared_cache_key_type::cache_key_type;
using cache_type = utils::loading_cache<cache_key_type, prepared_cache_entry, utils::loading_cache_reload_enabled::no, prepared_cache_entry_size, utils::tuple_hash, std::equal_to<cache_key_type>, prepared_cache_stats_updater>;
using cache_value_ptr = typename cache_type::value_ptr;
using cache_iterator = typename cache_type::iterator;
using checked_weak_ptr = typename statements::prepared_statement::checked_weak_ptr;
struct value_extractor_fn {
checked_weak_ptr operator()(prepared_cache_entry& e) const {
return e->checked_weak_from_this();
}
};
static const std::chrono::minutes entry_expiry;
public:
using key_type = prepared_cache_key_type;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
/// \note both iterator::reference and iterator::value_type are checked_weak_ptr
using iterator = boost::transform_iterator<value_extractor_fn, cache_iterator>;
private:
cache_type _cache;
value_extractor_fn _value_extractor_fn;
public:
prepared_statements_cache(logging::logger& logger)
: _cache(memory::stats().total_memory() / 256, entry_expiry, logger)
{}
template <typename LoadFunc>
future<value_type> get(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); }).then([] (cache_value_ptr v_ptr) {
return make_ready_future<value_type>((*v_ptr)->checked_weak_from_this());
});
}
iterator find(const key_type& key) {
return boost::make_transform_iterator(_cache.find(key.key()), _value_extractor_fn);
}
iterator end() {
return boost::make_transform_iterator(_cache.end(), _value_extractor_fn);
}
iterator begin() {
return boost::make_transform_iterator(_cache.begin(), _value_extractor_fn);
}
template <typename Pred>
void remove_if(Pred&& pred) {
static_assert(std::is_same<bool, std::result_of_t<Pred(::shared_ptr<cql_statement>)>>::value, "Bad Pred signature");
_cache.remove_if([&pred] (const prepared_cache_entry& e) {
return pred(e->statement);
});
}
size_t size() const {
return _cache.size();
}
size_t memory_footprint() const {
return _cache.memory_footprint();
}
};
}
namespace std { // for prepared_statements_cache log printouts
inline std::ostream& operator<<(std::ostream& os, const typename cql3::prepared_cache_key_type::cache_key_type& p) {
os << "{cql_id: " << p.first << ", thrift_id: " << p.second << "}";
return os;
}
inline std::ostream& operator<<(std::ostream& os, const cql3::prepared_cache_key_type& p) {
os << p.key();
return os;
}
}


@@ -57,11 +57,14 @@ using namespace statements;
using namespace cql_transport::messages;
logging::logger log("query_processor");
logging::logger prep_cache_log("prepared_statements_cache");
distributed<query_processor> _the_query_processor;
const sstring query_processor::CQL_VERSION = "3.3.1";
const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono::minutes(60);
class query_processor::internal_state {
service::query_state _qs;
public:
@@ -95,6 +98,7 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy,
, _proxy(proxy)
, _db(db)
, _internal_state(new internal_state())
, _prepared_cache(prep_cache_log)
{
namespace sm = seastar::metrics;
@@ -130,6 +134,15 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy,
sm::make_derive("batches_unlogged_from_logged", _cql_stats.batches_unlogged_from_logged,
sm::description("Counts a total number of LOGGED batches that were executed as UNLOGGED batches.")),
sm::make_derive("prepared_cache_evictions", [] { return prepared_statements_cache::shard_stats().prepared_cache_evictions; },
sm::description("Counts a number of prepared statements cache entries evictions.")),
sm::make_gauge("prepared_cache_size", [this] { return _prepared_cache.size(); },
sm::description("A number of entries in the prepared statements cache.")),
sm::make_gauge("prepared_cache_memory_footprint", [this] { return _prepared_cache.memory_footprint(); },
sm::description("Size (in bytes) of the prepared statements cache.")),
});
service::get_local_migration_manager().register_listener(_migration_subscriber.get());
@@ -197,31 +210,21 @@ query_processor::process_statement(::shared_ptr<cql_statement> statement,
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(const std::experimental::string_view& query_string, service::query_state& query_state)
query_processor::prepare(sstring query_string, service::query_state& query_state)
{
auto& client_state = query_state.get_client_state();
return prepare(query_string, client_state, client_state.is_thrift());
return prepare(std::move(query_string), client_state, client_state.is_thrift());
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(const std::experimental::string_view& query_string,
const service::client_state& client_state,
bool for_thrift)
query_processor::prepare(sstring query_string, const service::client_state& client_state, bool for_thrift)
{
auto existing = get_stored_prepared_statement(query_string, client_state.get_raw_keyspace(), for_thrift);
if (existing) {
return make_ready_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(existing);
using namespace cql_transport::messages;
if (for_thrift) {
return prepare_one<result_message::prepared::thrift>(std::move(query_string), client_state, compute_thrift_id, prepared_cache_key_type::thrift_id);
} else {
return prepare_one<result_message::prepared::cql>(std::move(query_string), client_state, compute_id, prepared_cache_key_type::cql_id);
}
return futurize<::shared_ptr<cql_transport::messages::result_message::prepared>>::apply([this, &query_string, &client_state, for_thrift] {
auto prepared = get_statement(query_string, client_state);
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Too many markers(?). %d markers exceed the allowed maximum of %d", bound_terms, std::numeric_limits<uint16_t>::max()));
}
assert(bound_terms == prepared->bound_names.size());
return store_prepared_statement(query_string, client_state.get_raw_keyspace(), std::move(prepared), for_thrift);
});
}
::shared_ptr<cql_transport::messages::result_message::prepared>
@@ -229,50 +232,11 @@ query_processor::get_stored_prepared_statement(const std::experimental::string_v
const sstring& keyspace,
bool for_thrift)
{
using namespace cql_transport::messages;
if (for_thrift) {
auto statement_id = compute_thrift_id(query_string, keyspace);
auto it = _thrift_prepared_statements.find(statement_id);
if (it == _thrift_prepared_statements.end()) {
return ::shared_ptr<result_message::prepared>();
}
return ::make_shared<result_message::prepared::thrift>(statement_id, it->second->checked_weak_from_this());
return get_stored_prepared_statement_one<result_message::prepared::thrift>(query_string, keyspace, compute_thrift_id, prepared_cache_key_type::thrift_id);
} else {
auto statement_id = compute_id(query_string, keyspace);
auto it = _prepared_statements.find(statement_id);
if (it == _prepared_statements.end()) {
return ::shared_ptr<result_message::prepared>();
}
return ::make_shared<result_message::prepared::cql>(statement_id, it->second->checked_weak_from_this());
}
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::store_prepared_statement(const std::experimental::string_view& query_string,
const sstring& keyspace,
std::unique_ptr<statements::prepared_statement> prepared,
bool for_thrift)
{
#if 0
// Concatenate the current keyspace so we don't mix prepared statements between keyspace (#5352).
// (if the keyspace is null, queryString has to have a fully-qualified keyspace so it's fine.
long statementSize = measure(prepared.statement);
// don't execute the statement if it's bigger than the allowed threshold
if (statementSize > MAX_CACHE_PREPARED_MEMORY)
throw new InvalidRequestException(String.format("Prepared statement of size %d bytes is larger than allowed maximum of %d bytes.",
statementSize,
MAX_CACHE_PREPARED_MEMORY));
#endif
prepared->raw_cql_statement = query_string.data();
if (for_thrift) {
auto statement_id = compute_thrift_id(query_string, keyspace);
auto msg = ::make_shared<result_message::prepared::thrift>(statement_id, prepared->checked_weak_from_this());
_thrift_prepared_statements.emplace(statement_id, std::move(prepared));
return make_ready_future<::shared_ptr<result_message::prepared>>(std::move(msg));
} else {
auto statement_id = compute_id(query_string, keyspace);
auto msg = ::make_shared<result_message::prepared::cql>(statement_id, prepared->checked_weak_from_this());
_prepared_statements.emplace(statement_id, std::move(prepared));
return make_ready_future<::shared_ptr<result_message::prepared>>(std::move(msg));
return get_stored_prepared_statement_one<result_message::prepared::cql>(query_string, keyspace, compute_id, prepared_cache_key_type::cql_id);
}
}
@@ -289,19 +253,19 @@ static sstring hash_target(const std::experimental::string_view& query_string, c
return keyspace + query_string.to_string();
}
bytes query_processor::compute_id(const std::experimental::string_view& query_string, const sstring& keyspace)
prepared_cache_key_type query_processor::compute_id(const std::experimental::string_view& query_string, const sstring& keyspace)
{
return md5_calculate(hash_target(query_string, keyspace));
return prepared_cache_key_type(md5_calculate(hash_target(query_string, keyspace)));
}
int32_t query_processor::compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace)
prepared_cache_key_type query_processor::compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace)
{
auto target = hash_target(query_string, keyspace);
uint32_t h = 0;
for (auto&& c : hash_target(query_string, keyspace)) {
h = 31*h + c;
}
return static_cast<int32_t>(h);
return prepared_cache_key_type(static_cast<int32_t>(h));
}
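The loop above is the classic Java-style `31*h + c` string hash. A Python sketch of it (assuming ASCII input, since the C++ code iterates over the bytes of an sstring), with the unsigned 32-bit accumulator reinterpreted as a signed int32 at the end:

```python
def compute_thrift_id(hash_target: str) -> int:
    """Java-style polynomial string hash, truncated to a signed int32."""
    h = 0
    for c in hash_target:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    # reinterpret the unsigned 32-bit accumulator as a signed int32
    return h - 2**32 if h >= 2**31 else h
```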
std::unique_ptr<prepared_statement>
@@ -527,7 +491,7 @@ void query_processor::migration_subscriber::on_drop_view(const sstring& ks_name,
void query_processor::migration_subscriber::remove_invalid_prepared_statements(sstring ks_name, std::experimental::optional<sstring> cf_name)
{
_qp->invalidate_prepared_statements([&] (::shared_ptr<cql_statement> stmt) {
_qp->_prepared_cache.remove_if([&] (::shared_ptr<cql_statement> stmt) {
return this->should_invalidate(ks_name, cf_name, stmt);
});
}


@@ -57,6 +57,7 @@
#include "statements/prepared_statement.hh"
#include "transport/messages/result_message.hh"
#include "untyped_result_set.hh"
#include "prepared_statements_cache.hh"
namespace cql3 {
@@ -64,9 +65,32 @@ namespace statements {
class batch_statement;
}
class prepared_statement_is_too_big : public std::exception {
public:
static constexpr int max_query_prefix = 100;
private:
sstring _msg;
public:
prepared_statement_is_too_big(const sstring& query_string)
: _msg(seastar::format("Prepared statement is too big: {}", query_string.substr(0, max_query_prefix)))
{
// mark that we clipped the query string
if (query_string.size() > max_query_prefix) {
_msg += "...";
}
}
virtual const char* what() const noexcept override {
return _msg.c_str();
}
};
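The message-clipping logic in `prepared_statement_is_too_big` can be sketched in Python: only the first `max_query_prefix` characters of the offending query are echoed back, with `...` appended to mark that the query string was clipped.

```python
MAX_QUERY_PREFIX = 100  # mirrors prepared_statement_is_too_big::max_query_prefix

def too_big_message(query_string: str) -> str:
    """Build the error message, clipping long queries to a fixed prefix."""
    msg = "Prepared statement is too big: " + query_string[:MAX_QUERY_PREFIX]
    if len(query_string) > MAX_QUERY_PREFIX:
        msg += "..."  # mark that we clipped the query string
    return msg
```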
class query_processor {
public:
class migration_subscriber;
private:
std::unique_ptr<migration_subscriber> _migration_subscriber;
distributed<service::storage_proxy>& _proxy;
@@ -127,9 +151,7 @@ private:
}
};
#endif
std::unordered_map<bytes, std::unique_ptr<statements::prepared_statement>> _prepared_statements;
std::unordered_map<int32_t, std::unique_ptr<statements::prepared_statement>> _thrift_prepared_statements;
prepared_statements_cache _prepared_cache;
std::unordered_map<sstring, std::unique_ptr<statements::prepared_statement>> _internal_statements;
#if 0
@@ -221,21 +243,14 @@ private:
}
#endif
public:
statements::prepared_statement::checked_weak_ptr get_prepared(const bytes& id) {
auto it = _prepared_statements.find(id);
if (it == _prepared_statements.end()) {
statements::prepared_statement::checked_weak_ptr get_prepared(const prepared_cache_key_type& key) {
auto it = _prepared_cache.find(key);
if (it == _prepared_cache.end()) {
return statements::prepared_statement::checked_weak_ptr();
}
return it->second->checked_weak_from_this();
return *it;
}
statements::prepared_statement::checked_weak_ptr get_prepared_for_thrift(int32_t id) {
auto it = _thrift_prepared_statements.find(id);
if (it == _thrift_prepared_statements.end()) {
return statements::prepared_statement::checked_weak_ptr();
}
return it->second->checked_weak_from_this();
}
#if 0
public static void validateKey(ByteBuffer key) throws InvalidRequestException
{
@@ -435,42 +450,61 @@ public:
#endif
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare(const std::experimental::string_view& query_string, service::query_state& query_state);
prepare(sstring query_string, service::query_state& query_state);
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare(const std::experimental::string_view& query_string, const service::client_state& client_state, bool for_thrift);
prepare(sstring query_string, const service::client_state& client_state, bool for_thrift);
static bytes compute_id(const std::experimental::string_view& query_string, const sstring& keyspace);
static int32_t compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace);
static prepared_cache_key_type compute_id(const std::experimental::string_view& query_string, const sstring& keyspace);
static prepared_cache_key_type compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace);
private:
///
/// \tparam ResultMsgType type of the returned result message (CQL or Thrift)
/// \tparam PreparedKeyGenerator a function that generates the prepared statement cache key for given query and keyspace
/// \tparam IdGetter a function that returns the corresponding prepared statement ID (CQL or Thrift) for a given prepared statement cache key
/// \param query_string
/// \param client_state
/// \param id_gen prepared ID generator, called before the first deferring
/// \param id_getter prepared ID getter, passed to deferred context by reference. The caller must ensure its liveness.
/// \return
template <typename ResultMsgType, typename PreparedKeyGenerator, typename IdGetter>
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare_one(sstring query_string, const service::client_state& client_state, PreparedKeyGenerator&& id_gen, IdGetter&& id_getter) {
return do_with(id_gen(query_string, client_state.get_raw_keyspace()), std::move(query_string), [this, &client_state, &id_getter] (const prepared_cache_key_type& key, const sstring& query_string) {
return _prepared_cache.get(key, [this, &query_string, &client_state] {
auto prepared = get_statement(query_string, client_state);
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Too many markers(?). %d markers exceed the allowed maximum of %d", bound_terms, std::numeric_limits<uint16_t>::max()));
}
assert(bound_terms == prepared->bound_names.size());
prepared->raw_cql_statement = query_string;
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
}).then([&key, &id_getter] (auto prep_ptr) {
return make_ready_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(::make_shared<ResultMsgType>(id_getter(key), std::move(prep_ptr)));
}).handle_exception_type([&query_string] (typename prepared_statements_cache::statement_is_too_big&) {
return make_exception_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(prepared_statement_is_too_big(query_string));
});
});
};
template <typename ResultMsgType, typename KeyGenerator, typename IdGetter>
::shared_ptr<cql_transport::messages::result_message::prepared>
get_stored_prepared_statement_one(const std::experimental::string_view& query_string, const sstring& keyspace, KeyGenerator&& key_gen, IdGetter&& id_getter)
{
auto cache_key = key_gen(query_string, keyspace);
auto it = _prepared_cache.find(cache_key);
if (it == _prepared_cache.end()) {
return ::shared_ptr<cql_transport::messages::result_message::prepared>();
}
return ::make_shared<ResultMsgType>(id_getter(cache_key), *it);
}
::shared_ptr<cql_transport::messages::result_message::prepared>
get_stored_prepared_statement(const std::experimental::string_view& query_string, const sstring& keyspace, bool for_thrift);
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
store_prepared_statement(const std::experimental::string_view& query_string, const sstring& keyspace, std::unique_ptr<statements::prepared_statement> prepared, bool for_thrift);
// Erases the statements for which filter returns true.
template <typename Pred>
void invalidate_prepared_statements(Pred filter) {
static_assert(std::is_same<bool, std::result_of_t<Pred(::shared_ptr<cql_statement>)>>::value,
"bad Pred signature");
for (auto it = _prepared_statements.begin(); it != _prepared_statements.end(); ) {
if (filter(it->second->statement)) {
it = _prepared_statements.erase(it);
} else {
++it;
}
}
for (auto it = _thrift_prepared_statements.begin(); it != _thrift_prepared_statements.end(); ) {
if (filter(it->second->statement)) {
it = _thrift_prepared_statements.erase(it);
} else {
++it;
}
}
}
#if 0
public ResultMessage processPrepared(CQLStatement statement, QueryState queryState, QueryOptions options)
throws RequestExecutionException, RequestValidationException


@@ -101,6 +101,10 @@ public:
return boost::algorithm::all_of(_restrictions->restrictions(), [b] (auto&& r) { return r.second->has_bound(b); });
}
virtual bool is_inclusive(statements::bound b) const override {
return boost::algorithm::all_of(_restrictions->restrictions(), [b] (auto&& r) { return r.second->is_inclusive(b); });
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return _restrictions->uses_function(ks_name, function_name);
}
@@ -120,7 +124,7 @@ public:
if (restriction->is_slice()) {
throw exceptions::invalid_request_exception(sprint(
"PRIMARY KEY column \"%s\" cannot be restricted (preceding column \"%s\" is restricted by a non-EQ relation)",
_restrictions->next_column(new_column)->name_as_text(), new_column.name_as_text()));
last_column.name_as_text(), new_column.name_as_text()));
}
}


@@ -63,7 +63,7 @@ void cql3::statements::alter_keyspace_statement::validate(distributed<service::s
service::get_local_storage_proxy().get_db().local().find_keyspace(_name); // throws on failure
auto tmp = _name;
std::transform(tmp.begin(), tmp.end(), tmp.begin(), ::tolower);
if (tmp == db::system_keyspace::NAME) {
if (is_system_keyspace(tmp)) {
throw exceptions::invalid_request_exception("Cannot alter system keyspace");
}


@@ -41,6 +41,8 @@
#include "cql3/statements/cf_prop_defs.hh"
#include <boost/algorithm/string/predicate.hpp>
namespace cql3 {
namespace statements {
@@ -65,6 +67,8 @@ const sstring cf_prop_defs::KW_CRC_CHECK_CHANCE = "crc_check_chance";
const sstring cf_prop_defs::COMPACTION_STRATEGY_CLASS_KEY = "class";
const sstring cf_prop_defs::COMPACTION_ENABLED_KEY = "enabled";
void cf_prop_defs::validate() {
// Skip validation if the compaction strategy class is already set as it means we've already
// prepared (and redoing it would set strategyClass back to null, which we don't want)
@@ -188,6 +192,13 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder) {
builder.set_min_compaction_threshold(min_compaction_threshold);
builder.set_max_compaction_threshold(max_compaction_threshold);
if (has_property(KW_COMPACTION)) {
if (get_compaction_options().count(COMPACTION_ENABLED_KEY)) {
auto enabled = boost::algorithm::iequals(get_compaction_options().at(COMPACTION_ENABLED_KEY), "true");
builder.set_compaction_enabled(enabled);
}
}
builder.set_default_time_to_live(gc_clock::duration(get_int(KW_DEFAULT_TIME_TO_LIVE, DEFAULT_DEFAULT_TIME_TO_LIVE)));
if (has_property(KW_SPECULATIVE_RETRY)) {


@@ -73,6 +73,7 @@ public:
static const sstring KW_CRC_CHECK_CHANCE;
static const sstring COMPACTION_STRATEGY_CLASS_KEY;
static const sstring COMPACTION_ENABLED_KEY;
// FIXME: In origin the following consts are in CFMetaData.
static constexpr int32_t DEFAULT_DEFAULT_TIME_TO_LIVE = 0;


@@ -72,7 +72,7 @@ void create_keyspace_statement::validate(distributed<service::storage_proxy>&, c
std::string name;
name.resize(_name.length());
std::transform(_name.begin(), _name.end(), name.begin(), ::tolower);
if (name == db::system_keyspace::NAME) {
if (is_system_keyspace(name)) {
throw exceptions::invalid_request_exception("system keyspace is not user-modifiable");
}
// keyspace name


@@ -75,7 +75,7 @@ cql3::statements::create_user_statement::execute(distributed<service::storage_pr
throw exceptions::invalid_request_exception(sprint("User %s already exists", _username));
}
if (exists && _if_not_exists) {
make_ready_future<::shared_ptr<cql_transport::messages::result_message>>();
return make_ready_future<::shared_ptr<cql_transport::messages::result_message>>();
}
return auth::authenticator::get().create(_username, _opts->options()).then([this] {
return auth::auth::insert_user(_username, _superuser).then([] {


@@ -106,6 +106,9 @@ delete_statement::prepare_internal(database& db, schema_ptr schema, shared_ptr<v
|| !stmt->restrictions()->get_clustering_columns_restrictions()->has_bound(bound::END)) {
throw exceptions::invalid_request_exception("A range deletion operation needs to specify both bounds");
}
if (!schema->is_compound() && stmt->restrictions()->get_clustering_columns_restrictions()->is_slice()) {
throw exceptions::invalid_request_exception("Range deletions on \"compact storage\" schemas are not supported");
}
return stmt;
}


@@ -65,13 +65,13 @@
#include <core/fstream.hh>
#include <seastar/core/enum.hh>
#include "utils/latency.hh"
#include "utils/flush_queue.hh"
#include "schema_registry.hh"
#include "service/priority_manager.hh"
#include "cell_locking.hh"
#include <seastar/core/execution_stage.hh>
#include "view_info.hh"
#include "memtable-sstable.hh"
#include "db/schema_tables.hh"
#include "checked-file-impl.hh"
#include "disk-error-handler.hh"
@@ -84,28 +84,10 @@ static const std::unordered_set<sstring> system_keyspaces = {
db::system_keyspace::NAME, db::schema_tables::NAME
};
static bool is_system_keyspace(const sstring& name) {
bool is_system_keyspace(const sstring& name) {
return system_keyspaces.find(name) != system_keyspaces.end();
}
// Slight extension to the flush_queue type.
class column_family::memtable_flush_queue : public utils::flush_queue<db::replay_position> {
public:
template<typename Func, typename Post>
auto run_cf_flush(db::replay_position rp, Func&& func, Post&& post) {
// special case: empty rp, yet still data.
// We generate a few memtables with no valid, "high_rp", yet
// still containing data -> actual flush.
// And to make matters worse, we can initiate a flush of N such
// tables at the same time.
// Just queue them at the end of the queue and treat them as such.
if (rp == db::replay_position() && !empty()) {
rp = highest_key();
}
return run_with_ordered_post_op(rp, std::forward<Func>(func), std::forward<Post>(post));
}
};
// Used for tests where the CF exists without a database object. We need to pass a valid
// dirty_memory manager in that case.
thread_local dirty_memory_manager default_dirty_memory_manager;
@@ -147,7 +129,6 @@ column_family::column_family(schema_ptr schema, config config, db::commitlog* cl
, _cache(_schema, sstables_as_snapshot_source(), global_cache_tracker())
, _commitlog(cl)
, _compaction_manager(compaction_manager)
, _flush_queue(std::make_unique<memtable_flush_queue>())
, _counter_cell_locks(std::make_unique<cell_locker>(_schema, cl_stats))
{
if (!_config.enable_disk_writes) {
@@ -190,7 +171,6 @@ column_family::sstables_as_mutation_source() {
snapshot_source
column_family::sstables_as_snapshot_source() {
return snapshot_source([this] () {
// FIXME: Will keep sstables on disk until next memtable flush. Make compaction force cache refresh.
auto sst_set = _sstables;
return mutation_source([this, sst_set = std::move(sst_set)] (schema_ptr s,
const dht::partition_range& r,
@@ -779,6 +759,9 @@ column_family::open_sstable(sstables::foreign_sstable_open_info info, sstring di
}
void column_family::load_sstable(sstables::shared_sstable& sst, bool reset_level) {
if (schema()->is_counter() && !sst->has_scylla_component()) {
throw std::runtime_error("Loading non-Scylla SSTables containing counters is not supported. Use sstableloader instead.");
}
auto shards = sst->get_shards_for_this_sstable();
if (belongs_to_other_shard(shards)) {
// If we're here, this sstable is shared by this and other
@@ -890,18 +873,21 @@ column_family::seal_active_streaming_memtable_immediate() {
//
// Lastly, we don't have any commitlog RP to update, and we don't need to deal manipulate the
// memtable list, since this memtable was not available for reading up until this point.
return write_memtable_to_sstable(*old, newtab, incremental_backups_enabled(), priority).then([this, newtab, old] {
return write_memtable_to_sstable(*old, newtab, incremental_backups_enabled(), priority, false, _config.background_writer_scheduling_group).then([this, newtab, old] {
return newtab->open_data();
}).then([this, old, newtab] () {
add_sstable(newtab, {engine().cpu_id()});
trigger_compaction();
// Cache synchronization must be started atomically with add_sstable()
if (_config.enable_cache) {
return _cache.update_invalidating(*old);
} else {
return old->clear_gently();
}
}).handle_exception([old] (auto ep) {
return with_semaphore(_cache_update_sem, 1, [this, newtab, old] {
add_sstable(newtab, {engine().cpu_id()});
trigger_compaction();
// Cache synchronization must be started atomically with add_sstable()
if (_config.enable_cache) {
return _cache.update_invalidating(*old);
} else {
return old->clear_gently();
}
});
}).handle_exception([old, newtab] (auto ep) {
newtab->mark_for_deletion();
dblog.error("failed to write streamed sstable: {}", ep);
return make_exception_future<>(ep);
});
@@ -937,9 +923,10 @@ future<> column_family::seal_active_streaming_memtable_big(streaming_memtable_bi
newtab->set_unshared();
auto&& priority = service::get_local_streaming_write_priority();
return write_memtable_to_sstable(*old, newtab, incremental_backups_enabled(), priority, true).then([this, newtab, old, &smb] {
return write_memtable_to_sstable(*old, newtab, incremental_backups_enabled(), priority, true, _config.background_writer_scheduling_group).then([this, newtab, old, &smb] {
smb.sstables.emplace_back(newtab);
}).handle_exception([] (auto ep) {
}).handle_exception([newtab] (auto ep) {
newtab->mark_for_deletion();
dblog.error("failed to write streamed sstable: {}", ep);
return make_exception_future<>(ep);
});
@@ -955,34 +942,32 @@ column_family::seal_active_memtable(memtable_list::flush_behavior ignored) {
if (old->empty()) {
dblog.debug("Memtable is empty");
return make_ready_future<>();
return _flush_barrier.advance_and_await();
}
_memtables->add_memtable();
_stats.memtable_switch_count++;
auto previous_flush = _flush_barrier.advance_and_await();
auto op = _flush_barrier.start();
assert(_highest_flushed_rp < old->replay_position()
|| (_highest_flushed_rp == db::replay_position() && old->replay_position() == db::replay_position())
);
_highest_flushed_rp = old->replay_position();
auto memtable_size = old->occupancy().total_space();
return _flush_queue->run_cf_flush(old->replay_position(), [old, this] {
auto memtable_size = old->occupancy().total_space();
_stats.pending_flushes++;
_config.cf_stats->pending_memtables_flushes_count++;
_config.cf_stats->pending_memtables_flushes_bytes += memtable_size;
_config.cf_stats->pending_memtables_flushes_count++;
_config.cf_stats->pending_memtables_flushes_bytes += memtable_size;
return repeat([this, old] {
return repeat([this, old] {
return with_lock(_sstables_lock.for_read(), [this, old] {
_flush_queue->check_open_gate();
return try_flush_memtable_to_sstable(old);
});
}).then([this, memtable_size] {
}).then([this, memtable_size, old, op = std::move(op), previous_flush = std::move(previous_flush)] () mutable {
_stats.pending_flushes--;
_config.cf_stats->pending_memtables_flushes_count--;
_config.cf_stats->pending_memtables_flushes_bytes -= memtable_size;
});
}, [old, this] {
if (_commitlog) {
_commitlog->discard_completed_segments(_schema->id(), old->rp_set());
}
return previous_flush.finally([op = std::move(op)] { });
});
// FIXME: release commit log
// FIXME: provide back-pressure to upper layers
@@ -1011,7 +996,7 @@ column_family::try_flush_memtable_to_sstable(lw_shared_ptr<memtable> old) {
// The code as is guarantees that we'll never partially backup a
// single sstable, so that is enough of a guarantee.
auto&& priority = service::get_local_memtable_flush_priority();
return write_memtable_to_sstable(*old, newtab, incremental_backups_enabled(), priority).then([this, newtab, old] {
return write_memtable_to_sstable(*old, newtab, incremental_backups_enabled(), priority, false, _config.memtable_scheduling_group).then([this, newtab, old] {
return newtab->open_data();
}).then_wrapped([this, old, newtab] (future<> ret) {
dblog.debug("Flushing to {} done", newtab->get_filename());
@@ -1067,9 +1052,7 @@ column_family::stop() {
return when_all(_memtables->request_flush(), _streaming_memtables->request_flush()).discard_result().finally([this] {
return _compaction_manager.remove(this).then([this] {
// Nest, instead of using when_all, so we don't lose any exceptions.
return _flush_queue->close().then([this] {
return _streaming_flush_gate.close();
});
return _streaming_flush_gate.close();
}).then([this] {
return _sstable_deletion_gate.close();
});
@@ -1123,7 +1106,10 @@ distributed_loader::flush_upload_dir(distributed<database>& db, sstring ks_name,
auto gen = cf.calculate_generation_for_new_table();
// Read toc content as it will be needed for moving and deleting a sstable.
return sst->read_toc().then([sst] {
return sst->read_toc().then([sst, s = cf.schema()] {
if (s->is_counter() && !sst->has_scylla_component()) {
return make_exception_future<>(std::runtime_error("Loading non-Scylla SSTables containing counters is not supported. Use sstableloader instead."));
}
return sst->mutate_sstable_level(0);
}).then([&cf, sst, gen] {
return sst->create_links(cf._config.datadir, gen);
@@ -1208,20 +1194,22 @@ void column_family::set_metrics() {
auto cf = column_family_label(_schema->cf_name());
auto ks = keyspace_label(_schema->ks_name());
namespace ms = seastar::metrics;
_metrics.add_group("column_family", {
ms::make_histogram("read_latency", ms::description("Read latency histogram"), [this] {return _stats.estimated_read.get_histogram();})(cf)(ks),
ms::make_histogram("write_latency", ms::description("Write latency histogram"), [this] {return _stats.estimated_write.get_histogram();})(cf)(ks),
ms::make_derive("memtable_switch", ms::description("Number of times flush has resulted in the memtable being switched out"), _stats.memtable_switch_count)(cf)(ks),
ms::make_gauge("pending_taks", ms::description("Estimated number of tasks pending for this column family"), _stats.pending_flushes)(cf)(ks),
ms::make_gauge("live_disk_space", ms::description("Live disk space used"), _stats.live_disk_space_used)(cf)(ks),
ms::make_gauge("total_disk_space", ms::description("Total disk space used"), _stats.total_disk_space_used)(cf)(ks),
ms::make_gauge("live_sstable", ms::description("Live sstable count"), _stats.live_sstable_count)(cf)(ks),
ms::make_gauge("pending_compaction", ms::description("Estimated number of compactions pending for this column family"), _stats.pending_compactions)(cf)(ks)
});
if (_schema->ks_name() != db::system_keyspace::NAME) {
if (_config.enable_metrics_reporting) {
_metrics.add_group("column_family", {
ms::make_gauge("cache_hit_rate", ms::description("Cache hit rate"), [this] {return float(_global_cache_hit_rate);})(cf)(ks)
ms::make_derive("memtable_switch", ms::description("Number of times flush has resulted in the memtable being switched out"), _stats.memtable_switch_count)(cf)(ks),
ms::make_gauge("pending_tasks", ms::description("Estimated number of tasks pending for this column family"), _stats.pending_flushes)(cf)(ks),
ms::make_gauge("live_disk_space", ms::description("Live disk space used"), _stats.live_disk_space_used)(cf)(ks),
ms::make_gauge("total_disk_space", ms::description("Total disk space used"), _stats.total_disk_space_used)(cf)(ks),
ms::make_gauge("live_sstable", ms::description("Live sstable count"), _stats.live_sstable_count)(cf)(ks),
ms::make_gauge("pending_compaction", ms::description("Estimated number of compactions pending for this column family"), _stats.pending_compactions)(cf)(ks)
});
if (_schema->ks_name() != db::system_keyspace::NAME && _schema->ks_name() != db::schema_tables::v3::NAME && _schema->ks_name() != "system_traces") {
_metrics.add_group("column_family", {
ms::make_histogram("read_latency", ms::description("Read latency histogram"), [this] {return _stats.estimated_read.get_histogram(std::chrono::microseconds(100));})(cf)(ks),
ms::make_histogram("write_latency", ms::description("Write latency histogram"), [this] {return _stats.estimated_write.get_histogram(std::chrono::microseconds(100));})(cf)(ks),
ms::make_gauge("cache_hit_rate", ms::description("Cache hit rate"), [this] {return float(_global_cache_hit_rate);})(cf)(ks)
});
}
}
}
@@ -1311,6 +1299,10 @@ column_family::rebuild_sstable_list(const std::vector<sstables::shared_sstable>&
} catch (sstables::atomic_deletion_cancelled& adc) {
dblog.debug("Failed to delete sstables after compaction: {}", adc);
}
}).then([this] {
// refresh underlying data source in row cache to prevent it from holding reference
// to sstables files which were previously deleted.
_cache.refresh_snapshot();
});
});
}
@@ -1366,7 +1358,7 @@ column_family::compact_sstables(sstables::compaction_descriptor descriptor, bool
return sst;
};
return sstables::compact_sstables(*sstables_to_compact, *this, create_sstable, descriptor.max_sstable_bytes, descriptor.level,
cleanup).then([this, sstables_to_compact] (auto new_sstables) {
cleanup, _config.background_writer_scheduling_group).then([this, sstables_to_compact] (auto new_sstables) {
_compaction_strategy.notify_completion(*sstables_to_compact, new_sstables);
return this->rebuild_sstable_list(new_sstables, *sstables_to_compact);
});
@@ -1374,7 +1366,7 @@ column_family::compact_sstables(sstables::compaction_descriptor descriptor, bool
}
static bool needs_cleanup(const lw_shared_ptr<sstables::sstable>& sst,
const lw_shared_ptr<dht::token_range_vector>& owned_ranges,
const dht::token_range_vector& owned_ranges,
schema_ptr s) {
auto first = sst->get_first_partition_key();
auto last = sst->get_last_partition_key();
@@ -1383,7 +1375,7 @@ static bool needs_cleanup(const lw_shared_ptr<sstables::sstable>& sst,
dht::token_range sst_token_range = dht::token_range::make(first_token, last_token);
// return true iff sst partition range isn't fully contained in any of the owned ranges.
for (auto& r : *owned_ranges) {
for (auto& r : owned_ranges) {
if (r.contains(sst_token_range, dht::token_comparator())) {
return false;
}
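`needs_cleanup()` above answers one question: does the sstable's `[first, last]` token span escape every range this node owns? A toy model of that containment test, with tokens as plain ints and inclusive bounds (all names here are illustrative, not Scylla's):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Token range modeled as an inclusive [first, second] pair of ints.
using token_range = std::pair<int, int>;

// An sstable "needs cleanup" unless its token span is fully contained in
// one of the owned ranges -- mirroring the loop in needs_cleanup() above.
inline bool toy_needs_cleanup(const token_range& sst,
                              const std::vector<token_range>& owned) {
    for (const auto& r : owned) {
        if (r.first <= sst.first && sst.second <= r.second) {
            return false; // fully owned: nothing to throw away
        }
    }
    return true; // spills outside every owned range
}
```

Note the real caller also skips cleanup entirely when the owned-range list is empty, which is checked before `needs_cleanup` is consulted.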
@@ -1393,11 +1385,10 @@ static bool needs_cleanup(const lw_shared_ptr<sstables::sstable>& sst,
future<> column_family::cleanup_sstables(sstables::compaction_descriptor descriptor) {
dht::token_range_vector r = service::get_local_storage_service().get_local_ranges(_schema->ks_name());
auto owned_ranges = make_lw_shared<dht::token_range_vector>(std::move(r));
auto sstables_to_cleanup = make_lw_shared<std::vector<sstables::shared_sstable>>(std::move(descriptor.sstables));
return do_for_each(*sstables_to_cleanup, [this, owned_ranges = std::move(owned_ranges), sstables_to_cleanup] (auto& sst) {
if (!owned_ranges->empty() && !needs_cleanup(sst, owned_ranges, _schema)) {
return do_with(std::move(descriptor.sstables), std::move(r), [this] (auto& sstables, auto& owned_ranges) {
return do_for_each(sstables, [this, &owned_ranges] (auto& sst) {
if (!owned_ranges.empty() && !needs_cleanup(sst, owned_ranges, _schema)) {
return make_ready_future<>();
}
@@ -1411,6 +1402,7 @@ future<> column_family::cleanup_sstables(sstables::compaction_descriptor descrip
return this->compact_sstables(sstables::compaction_descriptor({ sst }, sst->get_sstable_level()), true);
});
});
});
}
// FIXME: this is just an example, should be changed to something more general
@@ -1673,12 +1665,12 @@ template <typename Func>
static future<> invoke_all_resharding_jobs(global_column_family_ptr cf, std::vector<sstables::resharding_descriptor> jobs, Func&& func) {
return parallel_for_each(std::move(jobs), [cf, func] (sstables::resharding_descriptor& job) mutable {
return forward_sstables_to(job.reshard_at, std::move(job.sstables), cf,
[func, level = job.level, max_sstable_bytes = job.max_sstable_bytes] (auto sstables) {
// used to ensure that only one reshard operation will run per shard.
static thread_local semaphore sem(1);
return with_semaphore(sem, 1, [func, sstables = std::move(sstables), level, max_sstable_bytes] () mutable {
[cf, func, level = job.level, max_sstable_bytes = job.max_sstable_bytes] (auto sstables) {
// compaction manager ensures that only one reshard operation will run per shard.
auto job = [func, sstables = std::move(sstables), level, max_sstable_bytes] () mutable {
return func(std::move(sstables), level, max_sstable_bytes);
});
};
return cf->get_compaction_manager().run_resharding_job(&*cf, std::move(job));
});
});
}
@@ -1733,7 +1725,7 @@ void distributed_loader::reshard(distributed<database>& db, sstring ks_name, sst
gc_clock::now(), default_io_error_handler_gen());
return sst;
};
auto f = sstables::reshard_sstables(sstables, *cf, creator, max_sstable_bytes, level);
auto f = sstables::reshard_sstables(sstables, *cf, creator, max_sstable_bytes, level, cf->background_writer_scheduling_group());
return f.then([&cf, sstables = std::move(sstables)] (std::vector<sstables::shared_sstable> new_sstables) mutable {
// an input sstable may belong to shard 1 and 2 and only have data which
@@ -1776,14 +1768,6 @@ void distributed_loader::reshard(distributed<database>& db, sstring ks_name, sst
});
}
});
}).then_wrapped([] (future<> f) {
try {
f.get();
} catch (sstables::compaction_stop_exception& e) {
dblog.info("resharding was abruptly stopped, reason: {}", e.what());
} catch (...) {
dblog.error("resharding failed: {}", std::current_exception());
}
});
}).get();
});
@@ -1805,15 +1789,17 @@ future<> distributed_loader::load_new_sstables(distributed<database>& db, sstrin
}).then([&db, ks, cf] {
return db.invoke_on_all([ks = std::move(ks), cfname = std::move(cf)] (database& db) {
auto& cf = db.find_column_family(ks, cfname);
// atomically load all opened sstables into column family.
for (auto& sst : cf._sstables_opened_but_not_loaded) {
cf.load_sstable(sst, true);
}
cf._sstables_opened_but_not_loaded.clear();
cf.trigger_compaction();
// Drop entire cache for this column family because it may be populated
// with stale data.
return cf.get_row_cache().invalidate();
return with_semaphore(cf._cache_update_sem, 1, [&cf] {
// atomically load all opened sstables into column family.
for (auto& sst : cf._sstables_opened_but_not_loaded) {
cf.load_sstable(sst, true);
}
cf._sstables_opened_but_not_loaded.clear();
cf.trigger_compaction();
// Drop entire cache for this column family because it may be populated
// with stale data.
return cf.get_row_cache().invalidate();
});
});
}).then([&db, ks, cf] () mutable {
return smp::submit_to(0, [&db, ks = std::move(ks), cf = std::move(cf)] () mutable {
@@ -1989,6 +1975,15 @@ future<> distributed_loader::populate_column_family(distributed<database>& db, s
}
inline
flush_cpu_controller
make_flush_cpu_controller(db::config& cfg, seastar::thread_scheduling_group* backup, std::function<double()> fn) {
if (cfg.auto_adjust_flush_quota()) {
return flush_cpu_controller(250ms, cfg.virtual_dirty_soft_limit(), std::move(fn));
}
return flush_cpu_controller(flush_cpu_controller::disabled{backup});
}
utils::UUID database::empty_version = utils::UUID_gen::get_name_UUID(bytes{});
database::database() : database(db::config())
@@ -2002,6 +1997,10 @@ database::database(const db::config& cfg)
, _system_dirty_memory_manager(*this, 10 << 20, cfg.virtual_dirty_soft_limit())
, _dirty_memory_manager(*this, memory::stats().total_memory() * 0.45, cfg.virtual_dirty_soft_limit())
, _streaming_dirty_memory_manager(*this, memory::stats().total_memory() * 0.10, cfg.virtual_dirty_soft_limit())
, _background_writer_scheduling_group(1ms, _cfg->background_writer_scheduling_quota())
, _memtable_cpu_controller(make_flush_cpu_controller(*_cfg, &_background_writer_scheduling_group, [this, limit = 2.0f * _dirty_memory_manager.throttle_threshold()] {
return (_dirty_memory_manager.virtual_dirty_memory()) / limit;
}))
, _version(empty_version)
, _enable_incremental_backups(cfg.incremental_backups())
{
@@ -2011,6 +2010,32 @@ database::database(const db::config& cfg)
dblog.info("Row: max_vector_size: {}, internal_count: {}", size_t(row::max_vector_size), size_t(row::internal_count));
}
void flush_cpu_controller::adjust() {
auto mid = _goal + (hard_dirty_limit - _goal) / 2;
auto dirty = _current_dirty();
if (dirty < _goal) {
_current_quota = dirty * q1 / _goal;
} else if ((dirty >= _goal) && (dirty < mid)) {
_current_quota = q1 + (dirty - _goal) * (q2 - q1)/(mid - _goal);
} else {
_current_quota = q2 + (dirty - mid) * (qmax - q2) / (hard_dirty_limit - mid);
}
dblog.trace("dirty {}, goal {}, mid {} quota {}", dirty, _goal, mid, _current_quota);
_scheduling_group.update_usage(_current_quota);
}
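`flush_cpu_controller::adjust()` above maps the current dirty fraction onto a CPU quota through three linear segments: 0..goal ramps from 0 to q1, goal..mid from q1 to q2, and mid..hard_dirty_limit from q2 to qmax, where mid sits halfway between the goal and the hard limit. A standalone sketch of that mapping; the q1/q2/qmax/hard_dirty_limit constants here are illustrative, since their real values are not visible in this diff:

```cpp
#include <cassert>
#include <cmath>

// Three-segment linear controller: higher dirty fraction -> higher flush
// quota, with steeper slopes as dirty approaches the hard limit.
inline float flush_quota(float dirty, float goal) {
    constexpr float hard_dirty_limit = 1.0f;           // illustrative
    constexpr float q1 = 0.1f, q2 = 0.5f, qmax = 1.0f; // illustrative
    const float mid = goal + (hard_dirty_limit - goal) / 2;
    if (dirty < goal) {
        return dirty * q1 / goal;
    } else if (dirty < mid) {
        return q1 + (dirty - goal) * (q2 - q1) / (mid - goal);
    }
    return q2 + (dirty - mid) * (qmax - q2) / (hard_dirty_limit - mid);
}
```

The piecewise form lets the controller stay gentle while dirty memory is under the goal and react sharply once it approaches the hard limit.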
flush_cpu_controller::flush_cpu_controller(std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty)
: _goal(soft_limit / 2)
, _current_dirty(std::move(current_dirty))
, _interval(interval)
, _update_timer([this] { adjust(); })
, _scheduling_group(1ms, 0.0f)
, _current_scheduling_group(&_scheduling_group)
{
_update_timer.arm_periodic(_interval);
}
void
dirty_memory_manager::setup_collectd(sstring namestr) {
@@ -2108,6 +2133,14 @@ database::setup_metrics() {
sm::make_gauge("queued_reads", [this] { return _read_concurrency_sem.waiters(); },
sm::description("Holds the number of currently queued read operations.")),
sm::make_gauge("active_reads_streaming", [this] { return max_streaming_concurrent_reads() - _streaming_concurrency_sem.current(); },
sm::description(seastar::format("Holds the number of currently active read operations issued on behalf of streaming "
"If this value gets close to {} we are likely to start dropping new read requests. "
"In that case sstable_read_queue_overloads is going to get a non-zero value.", max_streaming_concurrent_reads()))),
sm::make_gauge("queued_reads_streaming", [this] { return _streaming_concurrency_sem.waiters(); },
sm::description("Holds the number of currently queued read operations on behalf of streaming.")),
sm::make_gauge("active_reads_system_keyspace", [this] { return max_system_concurrent_reads() - _system_read_concurrency_sem.current(); },
sm::description(seastar::format("Holds the number of currently active read operations from \"system\" keyspace tables. "
"If this value gets close to {} we are likely to start dropping new read requests. "
@@ -2119,6 +2152,9 @@ database::setup_metrics() {
sm::make_gauge("total_result_bytes", [this] { return get_result_memory_limiter().total_used_memory(); },
sm::description("Holds the current amount of memory used for results.")),
sm::make_gauge("cpu_flush_quota", [this] { return _memtable_cpu_controller.current_quota(); },
sm::description("The current quota for memtable CPU scheduling group")),
sm::make_derive("short_data_queries", _stats->short_data_queries,
sm::description("The rate of data queries (data or digest reads) that returned less rows than requested due to result size limiting.")),
@@ -2330,7 +2366,7 @@ database::init_commitlog() {
_commitlog->discard_completed_segments(id);
return;
}
_column_families[id]->flush(pos);
_column_families[id]->flush();
}).release(); // we have longer life time than CL. Ignore reg anchor
});
}
@@ -2444,12 +2480,12 @@ void database::remove(const column_family& cf) {
}
}
future<> database::drop_column_family(const sstring& ks_name, const sstring& cf_name, timestamp_func tsf) {
future<> database::drop_column_family(const sstring& ks_name, const sstring& cf_name, timestamp_func tsf, bool snapshot) {
auto uuid = find_uuid(ks_name, cf_name);
auto cf = _column_families.at(uuid);
remove(*cf);
auto& ks = find_keyspace(ks_name);
return truncate(ks, *cf, std::move(tsf)).then([this, cf] {
return truncate(ks, *cf, std::move(tsf), snapshot).then([this, cf] {
return cf->stop();
}).then([this, cf] {
return make_ready_future<>();
@@ -2589,6 +2625,9 @@ keyspace::make_column_family_config(const schema& s, const db::config& db_config
cfg.streaming_read_concurrency_config = _config.streaming_read_concurrency_config;
cfg.cf_stats = _config.cf_stats;
cfg.enable_incremental_backups = _config.enable_incremental_backups;
cfg.background_writer_scheduling_group = _config.background_writer_scheduling_group;
cfg.memtable_scheduling_group = _config.memtable_scheduling_group;
cfg.enable_metrics_reporting = db_config.enable_keyspace_column_family_metrics();
return cfg;
}
@@ -3035,7 +3074,7 @@ void column_family::apply_streaming_big_mutation(schema_ptr m_schema, utils::UUI
void
column_family::check_valid_rp(const db::replay_position& rp) const {
if (rp != db::replay_position() && rp < _lowest_allowed_rp) {
throw replay_position_reordered_exception();
throw mutation_reordered_with_truncate_exception();
}
}
@@ -3079,10 +3118,6 @@ lw_shared_ptr<memtable> memtable_list::new_memtable() {
}
future<> dirty_memory_manager::flush_one(memtable_list& mtlist, semaphore_units<> permit) {
if (mtlist.back()->empty()) {
return make_ready_future<>();
}
auto* region = &(mtlist.back()->region());
auto schema = mtlist.back()->schema();
@@ -3185,25 +3220,24 @@ future<mutation> database::apply_counter_update(schema_ptr s, const frozen_mutat
}
}
static future<> maybe_handle_reorder(std::exception_ptr exp) {
try {
std::rethrow_exception(exp);
return make_exception_future(exp);
} catch (mutation_reordered_with_truncate_exception&) {
// This mutation raced with a truncate, so we can just drop it.
dblog.debug("replay_position reordering detected");
return make_ready_future<>();
}
}
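`maybe_handle_reorder()` above uses the standard idiom for type-dispatching a `std::exception_ptr`: the only portable way to inspect it by type is to rethrow and catch. A self-contained sketch with a stand-in exception type (the Scylla types are not reproduced here):

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Stand-in for the exception handled above.
struct mutation_reordered_with_truncate_exception : std::exception {};

// Rethrow-and-catch dispatch: a reorder is benign (the mutation raced with
// a truncate and can be dropped); anything else must propagate.
inline bool swallow_reorder(std::exception_ptr ep) {
    try {
        std::rethrow_exception(ep);
    } catch (mutation_reordered_with_truncate_exception&) {
        return true;   // drop the mutation silently
    } catch (...) {
        return false;  // unknown failure: caller should propagate it
    }
}

inline bool demo_reorder() {
    return swallow_reorder(
        std::make_exception_ptr(mutation_reordered_with_truncate_exception{}));
}
inline bool demo_other() {
    return swallow_reorder(std::make_exception_ptr(std::runtime_error("io")));
}
```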
future<> database::apply_with_commitlog(column_family& cf, const mutation& m, timeout_clock::time_point timeout) {
if (cf.commitlog() != nullptr) {
return do_with(freeze(m), [this, &m, &cf, timeout] (frozen_mutation& fm) {
commitlog_entry_writer cew(m.schema(), fm);
return cf.commitlog()->add_entry(m.schema()->id(), cew, timeout);
}).then([this, &m, &cf, timeout] (db::rp_handle h) {
return apply_in_memory(m, cf, std::move(h), timeout).handle_exception([this, &cf, &m, timeout] (auto ep) {
try {
std::rethrow_exception(ep);
} catch (replay_position_reordered_exception&) {
// expensive, but we're assuming this is super rare.
// if we failed to apply the mutation due to future re-ordering
// (which should be the ever only reason for rp mismatch in CF)
// let's just try again, add the mutation to the CL once more,
// and assume success is inevitable eventually.
dblog.debug("replay_position reordering detected");
return this->apply_with_commitlog(cf, m, timeout);
}
});
return apply_in_memory(m, cf, std::move(h), timeout).handle_exception(maybe_handle_reorder);
});
}
return apply_in_memory(m, cf, {}, timeout);
@@ -3214,19 +3248,7 @@ future<> database::apply_with_commitlog(schema_ptr s, column_family& cf, utils::
if (cl != nullptr) {
commitlog_entry_writer cew(s, m);
return cf.commitlog()->add_entry(uuid, cew, timeout).then([&m, this, s, timeout, cl](db::rp_handle h) {
return this->apply_in_memory(m, s, std::move(h), timeout).handle_exception([this, s, &m, timeout] (auto ep) {
try {
std::rethrow_exception(ep);
} catch (replay_position_reordered_exception&) {
// expensive, but we're assuming this is super rare.
// if we failed to apply the mutation due to future re-ordering
// (which should be the ever only reason for rp mismatch in CF)
// let's just try again, add the mutation to the CL once more,
// and assume success is inevitable eventually.
dblog.debug("replay_position reordering detected");
return this->apply(s, m, timeout);
}
});
return this->apply_in_memory(m, s, std::move(h), timeout).handle_exception(maybe_handle_reorder);
});
}
return apply_in_memory(m, std::move(s), {}, timeout);
@@ -3317,10 +3339,17 @@ database::make_keyspace_config(const keyspace_metadata& ksm) {
++_stats->sstable_read_queue_overloaded;
throw std::runtime_error("sstable inactive read queue overloaded");
};
cfg.streaming_read_concurrency_config = cfg.read_concurrency_config;
cfg.streaming_read_concurrency_config.timeout = {};
// No timeouts or queue length limits - a failure here can kill an entire repair.
// Trust the caller to limit concurrency.
cfg.streaming_read_concurrency_config.sem = &_streaming_concurrency_sem;
cfg.cf_stats = &_cf_stats;
cfg.enable_incremental_backups = _enable_incremental_backups;
if (_cfg->background_writer_scheduling_quota() < 1.0f) {
cfg.background_writer_scheduling_group = &_background_writer_scheduling_group;
cfg.memtable_scheduling_group = _memtable_cpu_controller.scheduling_group();
}
cfg.enable_metrics_reporting = _cfg->enable_keyspace_column_family_metrics();
return cfg;
}
@@ -3444,10 +3473,10 @@ future<> database::truncate(sstring ksname, sstring cfname, timestamp_func tsf)
return truncate(ks, cf, std::move(tsf));
}
future<> database::truncate(const keyspace& ks, column_family& cf, timestamp_func tsf)
future<> database::truncate(const keyspace& ks, column_family& cf, timestamp_func tsf, bool with_snapshot)
{
const auto durable = ks.metadata()->durable_writes();
const auto auto_snapshot = get_config().auto_snapshot();
const auto auto_snapshot = with_snapshot && get_config().auto_snapshot();
// Force mutations coming in to re-acquire higher rp:s
// This creates a "soft" ordering, in that we will guarantee that
@@ -3774,35 +3803,6 @@ future<std::unordered_map<sstring, column_family::snapshot_details>> column_fami
}
future<> column_family::flush() {
_stats.pending_flushes++;
// highest_flushed_rp is only updated when we flush. If the memtable is currently alive, then
// the most up-to-date replay position is the one that's in there now. Otherwise, if the memtable
// hasn't received any writes yet, that's the one from the last flush we made.
auto desired_rp = _memtables->back()->empty() ? _highest_flushed_rp : _memtables->back()->replay_position();
return _memtables->request_flush().finally([this, desired_rp] {
_stats.pending_flushes--;
// In origin memtable_switch_count is incremented inside
// ColumnFamilyMetrics Flush.run
_stats.memtable_switch_count++;
// wait for all up until us.
return _flush_queue->wait_for_pending(desired_rp);
});
}
future<> column_family::flush(const db::replay_position& pos) {
// Technically possible if we've already issued the
// sstable write, but it is not done yet.
if (pos < _highest_flushed_rp) {
return make_ready_future<>();
}
// TODO: Origin looks at "secondary" memtables
// It also considers "minReplayPosition", which is simply where
// the CL "started" (the first ever RP in this run).
// We ignore this for now and just say that if we're asked for
// a CF and it exists, we pretty much have to have data that needs
// flushing. Let's do it.
return _memtables->request_flush();
}
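The `flush(const db::replay_position&)` overload above short-circuits when the requested position is already below the highest flushed replay position. A toy model of that check (the real `replay_position` also carries a segment id and shard):

```cpp
#include <cassert>
#include <cstdint>

// Reduced replay position: a single monotonically increasing counter.
struct replay_position {
    uint64_t pos = 0;
    bool operator<(const replay_position& o) const { return pos < o.pos; }
};

// Mirrors the early return: only issue a flush when the requested position
// has not already been covered by a previous flush.
bool needs_flush(const replay_position& requested,
                 const replay_position& highest_flushed) {
    return !(requested < highest_flushed);
}
```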
@@ -3824,12 +3824,14 @@ future<> column_family::flush_streaming_mutations(utils::UUID plan_id, dht::part
return _streaming_memtables->seal_active_memtable(memtable_list::flush_behavior::delayed).then([this] {
return _streaming_flush_phaser.advance_and_await();
}).then([this, sstables = std::move(sstables), ranges = std::move(ranges)] () mutable {
for (auto&& sst : sstables) {
// seal_active_streaming_memtable_big() ensures sst is unshared.
this->add_sstable(sst, {engine().cpu_id()});
}
this->trigger_compaction();
return _cache.invalidate(std::move(ranges));
return with_semaphore(_cache_update_sem, 1, [this, sstables = std::move(sstables), ranges = std::move(ranges)] () mutable {
for (auto&& sst : sstables) {
// seal_active_streaming_memtable_big() ensures sst is unshared.
this->add_sstable(sst, {engine().cpu_id()});
}
this->trigger_compaction();
return _cache.invalidate(std::move(ranges));
});
});
});
});
@@ -4092,9 +4094,9 @@ column_family::cache_hit_rate column_family::get_hit_rate(gms::inet_address addr
if (it == _cluster_cache_hit_rates.end()) {
// no data yet, get it from the gossiper
auto& gossiper = gms::get_local_gossiper();
auto eps = gossiper.get_endpoint_state_for_endpoint(addr);
auto* eps = gossiper.get_endpoint_state_for_endpoint_ptr(addr);
if (eps) {
auto state = eps->get_application_state(gms::application_state::CACHE_HITRATES);
auto* state = eps->get_application_state_ptr(gms::application_state::CACHE_HITRATES);
float f = -1.0f; // missing state means old node
if (state) {
sstring me = sprint("%s.%s", _schema->ks_name(), _schema->cf_name());
@@ -4119,11 +4121,12 @@ void column_family::drop_hit_rate(gms::inet_address addr) {
}
future<>
write_memtable_to_sstable(memtable& mt, sstables::shared_sstable sst, bool backup, const io_priority_class& pc, bool leave_unsealed) {
write_memtable_to_sstable(memtable& mt, sstables::shared_sstable sst, bool backup, const io_priority_class& pc, bool leave_unsealed, seastar::thread_scheduling_group *tsg) {
sstables::sstable_writer_config cfg;
cfg.replay_position = mt.replay_position();
cfg.backup = backup;
cfg.leave_unsealed = leave_unsealed;
cfg.thread_scheduling_group = tsg;
return sst->write_components(mt.make_flush_reader(mt.schema(), pc), mt.partition_count(), mt.schema(), cfg, pc);
}


@@ -77,6 +77,8 @@
#include <boost/intrusive/parent_from_member.hpp>
#include "db/view/view.hh"
#include "lister.hh"
#include "utils/phased_barrier.hh"
#include "cpu_controller.hh"
class cell_locker;
class cell_locker_stats;
@@ -114,7 +116,7 @@ void make(database& db, bool durable, bool volatile_testing_only);
}
}
class replay_position_reordered_exception : public std::exception {};
class mutation_reordered_with_truncate_exception : public std::exception {};
using shared_memtable = lw_shared_ptr<memtable>;
class memtable_list;
@@ -429,6 +431,9 @@ public:
restricted_mutation_reader_config read_concurrency_config;
restricted_mutation_reader_config streaming_read_concurrency_config;
::cf_stats* cf_stats = nullptr;
seastar::thread_scheduling_group* background_writer_scheduling_group = nullptr;
seastar::thread_scheduling_group* memtable_scheduling_group = nullptr;
bool enable_metrics_reporting = false;
};
struct no_commitlog {};
struct stats {
@@ -538,7 +543,6 @@ private:
mutable row_cache _cache; // Cache covers only sstables.
std::experimental::optional<int64_t> _sstable_generation = {};
db::replay_position _highest_flushed_rp;
db::replay_position _highest_rp;
db::replay_position _lowest_allowed_rp;
@@ -546,15 +550,7 @@ private:
db::commitlog* _commitlog;
compaction_manager& _compaction_manager;
int _compaction_disabled = 0;
class memtable_flush_queue;
std::unique_ptr<memtable_flush_queue> _flush_queue;
// Because streaming mutations bypass the commitlog, there is
// no need for the complications of the flush queue. Besides, it
// is easier to just use a common gate than it is to modify the flush_queue
// to work both with and without a replay position.
//
// Last but not least, we seldom need to guarantee any ordering here: as long
// as all data is waited for, we're good.
utils::phased_barrier _flush_barrier;
seastar::gate _streaming_flush_gate;
std::vector<view_ptr> _views;
semaphore _cache_update_sem{1};
@@ -753,7 +749,6 @@ public:
void start();
future<> stop();
future<> flush();
future<> flush(const db::replay_position&);
future<> flush_streaming_mutations(utils::UUID plan_id, dht::partition_range_vector ranges = dht::partition_range_vector{});
future<> fail_streaming_mutations(utils::UUID plan_id);
future<> clear(); // discards memtable(s) without flushing them to disk.
@@ -864,6 +859,10 @@ public:
return _config.cf_stats;
}
seastar::thread_scheduling_group* background_writer_scheduling_group() {
return _config.background_writer_scheduling_group;
}
compaction_manager& get_compaction_manager() const {
return _compaction_manager;
}
@@ -1072,6 +1071,9 @@ public:
restricted_mutation_reader_config read_concurrency_config;
restricted_mutation_reader_config streaming_read_concurrency_config;
::cf_stats* cf_stats = nullptr;
seastar::thread_scheduling_group* background_writer_scheduling_group = nullptr;
seastar::thread_scheduling_group* memtable_scheduling_group = nullptr;
bool enable_metrics_reporting = false;
};
private:
std::unique_ptr<locator::abstract_replication_strategy> _replication_strategy;
@@ -1154,6 +1156,7 @@ public:
private:
::cf_stats _cf_stats;
static constexpr size_t max_concurrent_reads() { return 100; }
static constexpr size_t max_streaming_concurrent_reads() { return 10; } // They're rather heavyweight, so limit more
static constexpr size_t max_system_concurrent_reads() { return 10; }
static constexpr size_t max_concurrent_sstable_loads() { return 3; }
struct db_stats {
@@ -1177,7 +1180,11 @@ private:
dirty_memory_manager _dirty_memory_manager;
dirty_memory_manager _streaming_dirty_memory_manager;
seastar::thread_scheduling_group _background_writer_scheduling_group;
flush_cpu_controller _memtable_cpu_controller;
semaphore _read_concurrency_sem{max_concurrent_reads()};
semaphore _streaming_concurrency_sem{max_streaming_concurrent_reads()};
restricted_mutation_reader_config _read_concurrency_config;
semaphore _system_read_concurrency_sem{max_system_concurrent_reads()};
restricted_mutation_reader_config _system_read_concurrency_config;
@@ -1332,10 +1339,10 @@ public:
/** Truncates the given column family */
future<> truncate(sstring ksname, sstring cfname, timestamp_func);
future<> truncate(const keyspace& ks, column_family& cf, timestamp_func);
future<> truncate(const keyspace& ks, column_family& cf, timestamp_func, bool with_snapshot = true);
bool update_column_family(schema_ptr s);
future<> drop_column_family(const sstring& ks_name, const sstring& cf_name, timestamp_func);
future<> drop_column_family(const sstring& ks_name, const sstring& cf_name, timestamp_func, bool with_snapshot = true);
void remove(const column_family&);
const logalloc::region_group& dirty_memory_region_group() const {


@@ -84,9 +84,6 @@ public:
// to be per shard and does no dispatching beyond delegating to the
// shard qp (which is what you feed here).
batchlog_manager(cql3::query_processor&);
batchlog_manager(distributed<cql3::query_processor>& qp)
: batchlog_manager(qp.local())
{}
future<> start();
future<> stop();


@@ -511,6 +511,7 @@ public:
if (shutdown) {
auto me = shared_from_this();
return _gate.close().then([me] {
me->_closed = true;
return me->sync().finally([me] {
// When we get here, nothing should add ops,
// and we should have waited out all pending.
@@ -1319,6 +1320,7 @@ future<> db::commitlog::segment_manager::shutdown() {
return _gate.close().then(std::bind(&segment_manager::sync_all_segments, this, true));
});
}).finally([this] {
discard_unused_segments();
// Now that the gate is closed and requests completed we are sure nobody else will pop()
return clear_reserve_segments().finally([this] {
return std::move(_reserve_replenisher).then_wrapped([this] (auto f) {
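The commitlog shutdown path above leans on a seastar gate: `_gate.close()` resolves only once in-flight operations drain, after which `sync_all_segments` and `clear_reserve_segments` can safely run. A single-threaded sketch of the gate contract (the real gate returns futures and parks waiters instead of polling):

```cpp
#include <cassert>
#include <stdexcept>

// Minimal gate: operations enter()/leave(), close() forbids new entries,
// and cleanup is allowed once the gate is closed and the count drains to 0.
class gate {
    int _count = 0;
    bool _closed = false;
public:
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_count;
    }
    void leave() { --_count; }
    bool closed() const { return _closed; }
    // seastar's close() returns a future resolving when _count hits zero;
    // here we record the intent and expose whether the drain is complete.
    void close() { _closed = true; }
    bool drained() const { return _closed && _count == 0; }
};
```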


@@ -166,6 +166,12 @@ public:
*/
#define _make_config_values(val) \
val(background_writer_scheduling_quota, double, 1.0, Used, \
"max cpu usage ratio (between 0 and 1) for the compaction process. Not intended for setting in normal operations. Setting it to 1 or higher disables the quota; the recommended operational setting is 0.5." \
) \
val(auto_adjust_flush_quota, bool, false, Used, \
"true: auto-adjust the quota for flush processes. false: put all flush processes together in the static background writer group, if that group is enabled. Not intended for setting in normal operations." \
) \
/* Initialization properties */ \
/* The minimal properties needed for configuring a cluster. */ \
val(cluster_name, sstring, "Test Cluster", Used, \
@@ -330,7 +336,7 @@ public:
val(sstable_preemptive_open_interval_in_mb, uint32_t, 50, Unused, \
"When compacting, the replacement opens SSTables before they are completely written and uses them in place of the prior SSTables for any range previously written. This setting helps to smoothly transfer reads between the SSTables by reducing page cache churn and keeps hot rows hot." \
) \
val(defragment_memory_on_idle, bool, true, Used, "Set to true to defragment memory when the cpu is idle. This reduces the amount of work Scylla performs when processing client requests.") \
val(defragment_memory_on_idle, bool, false, Used, "When set to true, will defragment memory when the cpu is idle. This reduces the amount of work Scylla performs when processing client requests.") \
/* Memtable settings */ \
val(memtable_allocation_type, sstring, "heap_buffers", Invalid, \
"Specify the way Cassandra allocates and manages memtable memory. See Off-heap memtables in Cassandra 2.1. Options are:\n" \
@@ -754,6 +760,9 @@ public:
val(replace_address_first_boot, sstring, "", Used, "Like replace_address option, but if the node has been bootstrapped successfully it will be ignored. Same as -Dcassandra.replace_address_first_boot.") \
val(override_decommission, bool, false, Used, "Set true to force a decommissioned node to join the cluster") \
val(ring_delay_ms, uint32_t, 30 * 1000, Used, "Time a node waits to hear from other nodes before joining the ring in milliseconds. Same as -Dcassandra.ring_delay_ms in cassandra.") \
val(shadow_round_ms, uint32_t, 300 * 1000, Used, "The maximum gossip shadow round time. Can be used to reduce the gossip feature check time during node boot up.") \
val(fd_max_interval_ms, uint32_t, 2 * 1000, Used, "The maximum failure_detector interval time in milliseconds. Intervals larger than the maximum will be ignored. Larger clusters may need to increase the default.") \
val(fd_initial_value_ms, uint32_t, 2 * 1000, Used, "The initial failure_detector interval time in milliseconds.") \
val(shutdown_announce_in_ms, uint32_t, 2 * 1000, Used, "Time a node waits after sending gossip shutdown message in milliseconds. Same as -Dcassandra.shutdown_announce_in_ms in cassandra.") \
val(developer_mode, bool, false, Used, "Relax environment checks. Setting to true can reduce performance and reliability significantly.") \
val(skip_wait_for_gossip_to_settle, int32_t, -1, Used, "An integer to configure the wait for gossip to settle. -1: wait normally, 0: do not wait at all, n: wait for at most n polls. Same as -Dcassandra.skip_wait_for_gossip_to_settle in cassandra.") \
@@ -765,6 +774,7 @@ public:
val(abort_on_lsa_bad_alloc, bool, false, Used, "Abort when allocation in LSA region fails") \
val(murmur3_partitioner_ignore_msb_bits, unsigned, 0, Used, "Number of most significant token bits to ignore in murmur3 partitioner; increase for very large clusters") \
val(virtual_dirty_soft_limit, double, 0.6, Used, "Soft limit of virtual dirty memory expressed as a portion of the hard limit") \
val(enable_keyspace_column_family_metrics, bool, false, Used, "Enable per keyspace and per column family metrics reporting") \
/* done! */
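`_make_config_values` is an X-macro: one table of `val(name, type, default, status, description)` entries that the header expands several times into members, defaults, and help text. A reduced sketch of the pattern with made-up entries (no `status` column; these are not real Scylla options):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// One table of (name, type, default, description) entries, expanded twice.
#define MY_CONFIG_VALUES(val) \
    val(ring_delay_ms, uint32_t, 30 * 1000, "Time to wait before joining the ring") \
    val(developer_mode, bool, false, "Relax environment checks")

// Expansion 1: struct members with defaults.
struct config {
#define MAKE_MEMBER(name, type, deflt, desc) type name = deflt;
    MY_CONFIG_VALUES(MAKE_MEMBER)
#undef MAKE_MEMBER
};

// Expansion 2: description lookup by option name.
inline std::string describe(const std::string& opt) {
#define MAKE_DESC(name, type, deflt, desc) if (opt == #name) { return desc; }
    MY_CONFIG_VALUES(MAKE_DESC)
#undef MAKE_DESC
    return "unknown option";
}
```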
#define _make_value_member(name, type, deflt, status, desc, ...) \


@@ -162,6 +162,14 @@ inline void assure_sufficient_live_nodes(
const PendingRange& pending_endpoints = std::array<gms::inet_address, 0>()) {
size_t need = block_for(ks, cl);
auto adjust_live_for_error = [] (size_t live, size_t pending) {
// DowngradingConsistencyRetryPolicy uses alive replicas count from Unavailable
// exception to adjust CL for retry. When pending node is present CL is increased
// by 1 internally, so reported number of live nodes has to be adjusted to take
// this into account
return pending <= live ? live - pending : 0;
};
switch (cl) {
case consistency_level::ANY:
// local hint is acceptable, and local node is always live
@@ -176,7 +184,7 @@ inline void assure_sufficient_live_nodes(
size_t pending = count_local_endpoints(pending_endpoints);
if (local_live < need + pending) {
cl_logger.debug("Local replicas {} are insufficient to satisfy LOCAL_QUORUM requirement of needed {} and pending {}", live_endpoints, local_live, pending);
throw exceptions::unavailable_exception(cl, need, local_live);
throw exceptions::unavailable_exception(cl, need, adjust_live_for_error(local_live, pending));
}
break;
}
@@ -190,7 +198,7 @@ inline void assure_sufficient_live_nodes(
size_t pending = pending_endpoints.size();
if (live < need + pending) {
cl_logger.debug("Live nodes {} do not satisfy ConsistencyLevel ({} required, {} pending)", live, need, pending);
throw exceptions::unavailable_exception(cl, need, live);
throw exceptions::unavailable_exception(cl, need, adjust_live_for_error(live, pending));
}
break;
}
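The `adjust_live_for_error` lambda above can be exercised standalone: it deflates the live count reported in the `unavailable_exception` by the pending-node count (which inflated the required count by the same amount), clamping at zero:

```cpp
#include <cassert>
#include <cstddef>

// Same arithmetic as the lambda in assure_sufficient_live_nodes().
inline size_t adjust_live_for_error(size_t live, size_t pending) {
    return pending <= live ? live - pending : 0;
}
```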


@@ -66,8 +66,8 @@ class migrator {
public:
static const std::unordered_set<sstring> legacy_schema_tables;
migrator(cql3::query_processor& qp)
: _qp(qp) {
migrator(sharded<service::storage_proxy>& sp, cql3::query_processor& qp)
: _sp(sp), _qp(qp) {
}
migrator(migrator&&) = default;
@@ -147,15 +147,18 @@ public:
auto cq = fmt_query(fmt, db::system_keyspace::legacy::COLUMNS);
auto zq = fmt_query(fmt, db::system_keyspace::legacy::TRIGGERS);
typedef std::tuple<future<result_set_type>, future<result_set_type>, future<result_set_type>> result_tuple;
typedef std::tuple<future<result_set_type>, future<result_set_type>, future<result_set_type>, future<db::schema_tables::legacy::schema_mutations>> result_tuple;
return when_all(_qp.execute_internal(tq, { dst.name, cf_name }),
_qp.execute_internal(cq, { dst.name, cf_name }),
_qp.execute_internal(zq, { dst.name, cf_name })).then([this, &dst, cf_name, timestamp](result_tuple&& t) {
_qp.execute_internal(zq, { dst.name, cf_name }),
db::schema_tables::legacy::read_table_mutations(_sp, dst.name, cf_name, db::system_keyspace::legacy::column_families()))
.then([this, &dst, cf_name, timestamp](result_tuple&& t) {
result_set_type tables = std::get<0>(t).get0();
result_set_type columns = std::get<1>(t).get0();
result_set_type triggers = std::get<2>(t).get0();
db::schema_tables::legacy::schema_mutations sm = std::get<3>(t).get0();
row_type& td = tables->one();
@@ -165,6 +168,8 @@ public:
schema_builder builder(dst.name, cf_name, id);
builder.with_version(sm.digest());
cf_type cf = sstring_to_cf_type(td.get_or("type", sstring("standard")));
if (cf == cf_type::super) {
fail(unimplemented::cause::SUPER);
@@ -183,6 +188,7 @@ public:
if (default_validator->is_counter()) {
builder.set_is_counter(true);
}
builder.set_default_validation_class(default_validator);
}
/*
@@ -191,10 +197,8 @@ public:
* but we can trust is_dense value of false.
*/
auto is_dense = td.get_opt<bool>("is_dense");
if (is_dense && !*is_dense) {
builder.set_is_dense(false);
} else {
auto calulated_is_dense = [&] {
if (!is_dense || *is_dense) {
is_dense = [&] {
/*
* As said above, this method is only here because we need to deal with thrift upgrades.
* Once a CF has been "upgraded", i.e. we've rebuilt and save its CQL3 metadata at least once,
@@ -252,40 +256,48 @@ public:
return comparator.compare(off, end - off, utf8_type->name()) == 0;
};
if (regular) {
auto name = regular->get_or("column_name", bytes());
// This is a lame attempt at determining if this was in fact a compact_value column
if (!max_cl_idx || (!name.empty() && name != to_bytes("value"))
|| db::schema_tables::parse_type(regular->get_as<sstring>("type")) != default_validator) {
return false;
}
// Ok, we will assume this was in fact a (scylla-created) compact value.
}
if (max_cl_idx) {
auto n = std::count(comparator.begin(), comparator.end(), ','); // num comp - 1
return *max_cl_idx == n;
}
if (regular) {
return false;
}
return !is_cql3_only_pk_comparator(comparator);
}();
builder.set_is_dense(calulated_is_dense);
// now, if switched to sparse, remove redundant compact_value column and the last clustering column,
// directly copying CASSANDRA-11502 logic. See CASSANDRA-11315.
filter_sparse = !calulated_is_dense && is_dense.value_or(true);
filter_sparse = !*is_dense;
}
builder.set_is_dense(*is_dense);
auto is_cql = !*is_dense && is_compound;
auto is_static_compact = !*is_dense && !is_compound;
// org.apache.cassandra.schema.LegacySchemaMigrator#isEmptyCompactValueColumn
auto is_empty_compact_value = [](const cql3::untyped_result_set::row& column_row) {
auto kind_str = column_row.get_as<sstring>("type");
// Cassandra only checks for "compact_value", but Scylla generates "regular" instead (#2586)
return (kind_str == "compact_value" || kind_str == "regular")
&& column_row.get_as<sstring>("column_name").empty();
};
for (auto& row : *columns) {
auto kind_str = row.get_as<sstring>("type");
auto kind = db::schema_tables::deserialize_kind(kind_str);
auto component_index = kind > column_kind::clustering_key ? 0 : column_id(row.get_or("component_index", 0));
auto name = row.get_or("column_name", bytes());
auto name = row.get_or<sstring>("column_name", sstring());
auto validator = db::schema_tables::parse_type(row.get_as<sstring>("validator"));
if (is_empty_compact_value(row)) {
continue;
}
if (filter_sparse) {
if (kind_str == "compact_value") {
continue;
@@ -329,7 +341,7 @@ public:
type = "VALUES";
}
}
auto column = cql3::util::maybe_quote(utf8_type->to_string(name));
auto column = cql3::util::maybe_quote(name);
options["target"] = validator->is_collection()
? type + "(" + column + ")"
: column;
@@ -339,7 +351,26 @@ public:
builder.with_index(index_metadata(index_name, options, *index_kind));
}
builder.with_column(std::move(name), std::move(validator), kind, component_index);
data_type column_name_type = [&] {
if (is_static_compact && kind == column_kind::regular_column) {
return db::schema_tables::parse_type(comparator);
}
return utf8_type;
}();
auto column_name = [&] {
try {
return column_name_type->from_string(name);
} catch (marshal_exception) {
// #2597: Scylla < 2.0 writes names in serialized form, try to recover
column_name_type->validate(to_bytes_view(name));
return to_bytes(name);
}
}();
builder.with_column(std::move(column_name), std::move(validator), kind, component_index);
}
if (is_static_compact) {
builder.set_regular_column_name_type(db::schema_tables::parse_type(comparator));
}
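The `column_name` recovery above (issue #2597) is a try-parse-then-fall-back pattern: attempt `from_string`, and on `marshal_exception` validate the raw bytes and keep them as already-serialized. A toy version with a 32-bit integer "type" standing in for `column_name_type` (assumption: big-endian 4-byte serialization, as in the CQL wire format):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using bytes_t = std::vector<uint8_t>;

// "from_string": parse a decimal int32 literal into its serialized form.
bytes_t int32_from_text(const std::string& s) {
    size_t pos = 0;
    long v = std::stol(s, &pos);
    if (pos != s.size() || v < INT32_MIN || v > INT32_MAX) {
        throw std::invalid_argument("not an int32 literal");
    }
    auto u = static_cast<uint32_t>(static_cast<int32_t>(v));
    return { uint8_t(u >> 24), uint8_t(u >> 16), uint8_t(u >> 8), uint8_t(u) };
}

// Try to parse the stored name as text; on failure, "validate" that it is
// already a serialized value and keep the raw bytes (the #2597 fallback).
bytes_t column_name_bytes(const std::string& name) {
    try {
        return int32_from_text(name);
    } catch (const std::exception&) {
        if (name.size() != 4) {   // an int32 serializes to exactly 4 bytes
            throw;
        }
        return bytes_t(name.begin(), name.end());
    }
}
```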
if (td.has("read_repair_chance")) {
@@ -414,8 +445,6 @@ public:
throw unsupported_feature("triggers");
}
// TODO: table upgrades as in origin converter.
dst.tables.emplace_back(table{timestamp, builder.build() });
});
}
@@ -517,21 +546,13 @@ public:
});
}
future<> unload_legacy_tables() {
return _qp.db().invoke_on_all([](database& db) {
for (auto& cfname : legacy_schema_tables) {
auto& cf = db.find_column_family(db::system_keyspace::NAME, cfname);
db.remove(cf);
}
});
}
future<> truncate_legacy_tables() {
mlogger.info("Truncating legacy schema tables");
return do_with(utils::make_joinpoint([] { return db_clock::now();}),[this](auto& tsf) {
return _qp.db().invoke_on_all([&tsf](database& db) {
return parallel_for_each(legacy_schema_tables, [&db, &tsf](const sstring& cfname) {
return db.truncate(db::system_keyspace::NAME, cfname, [&tsf] { return tsf.value(); });
future<> drop_legacy_tables() {
mlogger.info("Dropping legacy schema tables");
return parallel_for_each(legacy_schema_tables, [this](const sstring& cfname) {
return do_with(utils::make_joinpoint([] { return db_clock::now();}),[this, cfname](auto& tsf) {
auto with_snapshot = !_keyspaces.empty();
return _qp.db().invoke_on_all([&tsf, cfname, with_snapshot](database& db) {
return db.drop_column_family(db::system_keyspace::NAME, cfname, [&tsf] { return tsf.value(); }, with_snapshot);
});
});
});
@@ -590,18 +611,15 @@ public:
future<> migrate() {
return read_all_keyspaces().then([this]() {
if (_keyspaces.empty()) {
return unload_legacy_tables();
}
// write metadata to the new schema tables
return store_keyspaces_in_new_schema_tables().then(std::bind(&migrator::migrate_indexes, this))
.then(std::bind(&migrator::flush_schemas, this))
.then(std::bind(&migrator::truncate_legacy_tables, this))
.then(std::bind(&migrator::unload_legacy_tables, this))
.then(std::bind(&migrator::drop_legacy_tables, this))
.then([] { mlogger.info("Completed migration of legacy schema tables"); });
});
}
sharded<service::storage_proxy>& _sp;
cql3::query_processor& _qp;
std::vector<keyspace> _keyspaces;
};
@@ -620,7 +638,7 @@ const std::unordered_set<sstring> migrator::legacy_schema_tables = {
}
future<>
db::legacy_schema_migrator::migrate(cql3::query_processor& qp) {
return do_with(migrator(qp), std::bind(&migrator::migrate, std::placeholders::_1));
db::legacy_schema_migrator::migrate(sharded<service::storage_proxy>& sp, cql3::query_processor& qp) {
return do_with(migrator(sp, qp), std::bind(&migrator::migrate, std::placeholders::_1));
}


@@ -48,10 +48,14 @@ namespace cql3 {
class query_processor;
}
namespace service {
class storage_proxy;
}
namespace db {
namespace legacy_schema_migrator {
future<> migrate(cql3::query_processor&);
future<> migrate(sharded<service::storage_proxy>&, cql3::query_processor&);
}
}


@@ -64,7 +64,11 @@
#include "db/config.hh"
#include "md5_hasher.hh"
#include <seastar/util/noncopyable_function.hh>
#include <boost/algorithm/string/predicate.hpp>
#include <boost/range/algorithm/copy.hpp>
#include <boost/range/algorithm/transform.hpp>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/join.hpp>
@@ -82,6 +86,8 @@ namespace schema_tables {
logging::logger slogger("schema_tables");
const sstring version = "3";
struct push_back_and_return {
std::vector<mutation> muts;
@@ -123,7 +129,11 @@ static void merge_tables_and_views(distributed<service::storage_proxy>& proxy,
std::map<qualified_name, schema_mutations>&& views_before,
std::map<qualified_name, schema_mutations>&& views_after);
static void merge_types(distributed<service::storage_proxy>& proxy,
struct user_types_to_drop final {
seastar::noncopyable_function<void()> drop;
};
static user_types_to_drop merge_types(distributed<service::storage_proxy>& proxy,
schema_result&& before,
schema_result&& after);
@@ -149,8 +159,8 @@ static void add_index_to_schema_mutation(schema_ptr table,
const index_metadata& index, api::timestamp_type timestamp,
mutation& mutation);
static void drop_column_from_schema_mutation(schema_ptr,
const column_definition&, long timestamp,
static void drop_column_from_schema_mutation(schema_ptr schema_table, schema_ptr table,
const sstring& column_name, long timestamp,
std::vector<mutation>&);
static void drop_index_from_schema_mutation(schema_ptr table,
@@ -165,13 +175,12 @@ static void prepare_builder_from_table_row(schema_builder&, const query::result_
using namespace v3;
std::vector<const char*> ALL { KEYSPACES, TABLES, COLUMNS, DROPPED_COLUMNS, TRIGGERS, VIEWS, TYPES, FUNCTIONS, AGGREGATES, INDEXES };
std::vector<const char*> ALL { KEYSPACES, TABLES, SCYLLA_TABLES, COLUMNS, DROPPED_COLUMNS, TRIGGERS, VIEWS, TYPES, FUNCTIONS, AGGREGATES, INDEXES };
using days = std::chrono::duration<int, std::ratio<24 * 3600>>;
/** add entries to system.schema_* for the hardcoded system definitions */
future<> save_system_keyspace_schema() {
auto& ks = db::qctx->db().find_keyspace(NAME);
future<> save_system_schema(const sstring & ksname) {
auto& ks = db::qctx->db().find_keyspace(ksname);
auto ksm = ks.metadata();
// delete old, possibly obsolete entries in schema tables
@@ -185,6 +194,11 @@ future<> save_system_keyspace_schema() {
});
}
/** add entries to system_schema.* for the hardcoded system definitions */
future<> save_system_keyspace_schema() {
return save_system_schema(NAME);
}
namespace v3 {
static constexpr auto schema_gc_grace = std::chrono::duration_cast<std::chrono::seconds>(days(7)).count();
@@ -256,6 +270,21 @@ schema_ptr tables() {
return schema;
}
// Holds Scylla-specific table metadata.
schema_ptr scylla_tables() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, SCYLLA_TABLES);
return schema_builder(NAME, SCYLLA_TABLES, stdx::make_optional(id))
.with_column("keyspace_name", utf8_type, column_kind::partition_key)
.with_column("table_name", utf8_type, column_kind::clustering_key)
.with_column("version", uuid_type)
.set_gc_grace_seconds(schema_gc_grace)
.with_version(generate_schema_version(id))
.build();
}();
return schema;
}
schema_ptr columns() {
static thread_local auto schema = [] {
schema_builder builder(make_lw_shared(::schema(generate_legacy_id(NAME, COLUMNS), NAME, COLUMNS,
@@ -519,7 +548,7 @@ future<utils::UUID> calculate_schema_digest(distributed<service::storage_proxy>&
for (auto&& p : rs->partitions()) {
auto mut = p.mut().unfreeze(s);
auto partition_key = value_cast<sstring>(utf8_type->deserialize(mut.key().get_component(*s, 0)));
if (partition_key == NAME) {
if (is_system_keyspace(partition_key)) {
continue;
}
mutations.emplace_back(std::move(mut));
@@ -552,7 +581,7 @@ future<std::vector<frozen_mutation>> convert_schema_to_mutations(distributed<ser
for (auto&& p : rs->partitions()) {
auto mut = p.mut().unfreeze(s);
auto partition_key = value_cast<sstring>(utf8_type->deserialize(mut.key().get_component(*s, 0)));
if (partition_key == NAME) {
if (is_system_keyspace(partition_key)) {
continue;
}
results.emplace_back(std::move(p.mut()));
@@ -727,6 +756,33 @@ read_tables_for_keyspaces(distributed<service::storage_proxy>& proxy, const std:
return result;
}
mutation compact_for_schema_digest(const mutation& m) {
// Cassandra skips tombstones in digest calculation
// to avoid disagreements due to tombstone GC.
// See https://issues.apache.org/jira/browse/CASSANDRA-6862.
// We achieve similar effect with compact_for_compaction().
mutation m_compacted(m);
m_compacted.partition().compact_for_compaction(*m.schema(), always_gc, gc_clock::time_point::max());
return m_compacted;
}
// Applies deletion of the "version" column to a system_schema.scylla_tables mutation.
static void delete_schema_version(mutation& m) {
if (m.column_family_id() != scylla_tables()->id()) {
return;
}
const column_definition& version_col = *scylla_tables()->get_column_definition(to_bytes("version"));
for (auto&& row : m.partition().clustered_rows()) {
auto&& cells = row.row().cells();
auto&& cell = cells.find_cell(version_col.id);
api::timestamp_type t = api::new_timestamp();
if (cell) {
t = std::max(t, cell->as_atomic_cell().timestamp());
}
cells.apply(version_col, atomic_cell::make_dead(t, gc_clock::now()));
}
}
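`delete_schema_version` picks the tombstone timestamp as the maximum of `api::new_timestamp()` and the existing cell's timestamp. Since a deletion only shadows a cell whose timestamp it meets or exceeds, this guarantees the stored version is erased even if its write timestamp lies in the future. A minimal model of that rule:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using timestamp_t = int64_t;

// Choose the deletion timestamp, as delete_schema_version() does.
timestamp_t tombstone_timestamp(timestamp_t now, const timestamp_t* existing_cell) {
    timestamp_t t = now;
    if (existing_cell) {
        t = std::max(t, *existing_cell);
    }
    return t;
}

// Reconciliation rule: a deletion shadows a cell iff its timestamp is at
// least the cell's (ties go to the tombstone).
bool deletion_wins(timestamp_t dead_ts, timestamp_t live_ts) {
    return dead_ts >= live_ts;
}
```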
static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std::vector<mutation> mutations, bool do_flush)
{
return seastar::async([&proxy, mutations = std::move(mutations), do_flush] () mutable {
@@ -737,6 +793,9 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std:
for (auto&& mutation : mutations) {
keyspaces.emplace(value_cast<sstring>(utf8_type->deserialize(mutation.key().get_component(*s, 0))));
column_families.emplace(mutation.column_family_id());
// We must force recalculation of schema version after the merge, since the resulting
// schema may be a mix of the old and new schemas.
delete_schema_version(mutation);
}
// current state of the schema
@@ -749,6 +808,15 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std:
/*auto& old_aggregates = */read_schema_for_keyspaces(proxy, AGGREGATES, keyspaces).get0();
#endif
// Incoming mutations have the version field deleted. Delete here as well so that
// schemas which are otherwise equal don't appear as differing.
for (auto&& e : old_column_families) {
schema_mutations& sm = e.second;
if (sm.scylla_tables()) {
delete_schema_version(*sm.scylla_tables());
}
}
proxy.local().mutate_locally(std::move(mutations)).get0();
if (do_flush) {
@@ -771,7 +839,7 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std:
#endif
std::set<sstring> keyspaces_to_drop = merge_keyspaces(proxy, std::move(old_keyspaces), std::move(new_keyspaces)).get0();
merge_types(proxy, std::move(old_types), std::move(new_types));
auto types_to_drop = merge_types(proxy, std::move(old_types), std::move(new_types));
merge_tables_and_views(proxy,
std::move(old_column_families), std::move(new_column_families),
std::move(old_views), std::move(new_views));
@@ -779,6 +847,8 @@ static future<> do_merge_schema(distributed<service::storage_proxy>& proxy, std:
mergeFunctions(oldFunctions, newFunctions);
mergeAggregates(oldAggregates, newAggregates);
#endif
types_to_drop.drop();
proxy.local().get_db().invoke_on_all([keyspaces_to_drop = std::move(keyspaces_to_drop)] (database& db) {
// it is safe to drop a keyspace only when all nested ColumnFamilies were deleted
return do_for_each(keyspaces_to_drop, [&db] (auto keyspace_to_drop) {
@@ -935,30 +1005,37 @@ static void merge_tables_and_views(distributed<service::storage_proxy>& proxy,
}).get();
}
static inline void collect_types(std::set<sstring>& keys, schema_result& result, std::vector<user_type>& to)
struct naked_user_type {
const sstring keyspace;
const sstring qualified_name;
};
static inline void collect_types(std::set<sstring>& keys, schema_result& result, std::vector<naked_user_type>& to)
{
for (auto&& key : keys) {
auto&& value = result[key];
auto types = create_types_from_schema_partition(schema_result_value_type{key, std::move(value)});
std::move(types.begin(), types.end(), std::back_inserter(to));
boost::transform(types, std::back_inserter(to), [] (user_type type) {
return naked_user_type{std::move(type->_keyspace), std::move(type->name())};
});
}
}
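`merge_types` classifies each keyspace's user types into created/altered/dropped buckets via `difference()`: `entries_only_on_left` become drops, `entries_only_on_right` become creates, and `entries_differing` become alters. A sketch of that three-way map diff with plain `std::map` (type definitions modeled as strings):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct type_diff {
    std::vector<std::string> created, altered, dropped;
};

// Compare before/after snapshots of user types keyed by name.
type_diff diff_types(const std::map<std::string, std::string>& before,
                     const std::map<std::string, std::string>& after) {
    type_diff d;
    for (auto& [name, def] : before) {
        auto it = after.find(name);
        if (it == after.end()) {
            d.dropped.push_back(name);        // entries_only_on_left
        } else if (it->second != def) {
            d.altered.push_back(name);        // entries_differing
        }
    }
    for (auto& [name, def] : after) {
        (void)def;
        if (!before.count(name)) {
            d.created.push_back(name);        // entries_only_on_right
        }
    }
    return d;
}
```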
// see the comments for merge_keyspaces()
static void merge_types(distributed<service::storage_proxy>& proxy, schema_result&& before, schema_result&& after)
// see the comments for merge_keyspaces()
static user_types_to_drop merge_types(distributed<service::storage_proxy>& proxy, schema_result&& before, schema_result&& after)
{
std::vector<user_type> created, altered, dropped;
std::vector<naked_user_type> created, altered, dropped;
auto diff = difference(before, after, indirect_equal_to<lw_shared_ptr<query::result_set>>());
collect_types(diff.entries_only_on_left, before, dropped); // Keyspaces with no more types
collect_types(diff.entries_only_on_right, after, created); // New keyspaces with types
for (auto&& key : diff.entries_differing) {
for (auto&& keyspace : diff.entries_differing) {
// The user types of this keyspace differ, so diff the current types with the updated ones
auto current_types = proxy.local().get_db().local().find_keyspace(key).metadata()->user_types()->get_all_types();
auto current_types = proxy.local().get_db().local().find_keyspace(keyspace).metadata()->user_types()->get_all_types();
decltype(current_types) updated_types;
auto ts = create_types_from_schema_partition(schema_result_value_type{key, std::move(after[key])});
auto ts = create_types_from_schema_partition(schema_result_value_type{keyspace, std::move(after[keyspace])});
updated_types.reserve(ts.size());
for (auto&& type : ts) {
updated_types[type->_name] = std::move(type);
@@ -966,36 +1043,46 @@ static void merge_types(distributed<service::storage_proxy>& proxy, schema_resul
auto delta = difference(current_types, updated_types, indirect_equal_to<user_type>());
for (auto&& key : delta.entries_only_on_left) {
dropped.emplace_back(current_types[key]);
for (auto&& type_name : delta.entries_only_on_left) {
dropped.emplace_back(naked_user_type{keyspace, current_types[type_name]->name()});
}
for (auto&& key : delta.entries_only_on_right) {
created.emplace_back(std::move(updated_types[key]));
for (auto&& type_name : delta.entries_only_on_right) {
created.emplace_back(naked_user_type{keyspace, updated_types[type_name]->name()});
}
for (auto&& key : delta.entries_differing) {
altered.emplace_back(std::move(updated_types[key]));
for (auto&& type_name : delta.entries_differing) {
altered.emplace_back(naked_user_type{keyspace, updated_types[type_name]->name()});
}
}
proxy.local().get_db().invoke_on_all([&created, &dropped, &altered] (database& db) {
// Create and update user types before any tables/views are created that potentially
// use those types. Similarly, defer dropping until after tables/views that may use
// some of these user types are dropped.
proxy.local().get_db().invoke_on_all([&created, &altered] (database& db) {
return seastar::async([&] {
for (auto&& type : created) {
auto user_type = dynamic_pointer_cast<const user_type_impl>(parse_type(type->name()));
auto user_type = dynamic_pointer_cast<const user_type_impl>(parse_type(type.qualified_name));
db.find_keyspace(user_type->_keyspace).add_user_type(user_type);
service::get_local_migration_manager().notify_create_user_type(user_type).get();
}
for (auto&& type : dropped) {
auto user_type = dynamic_pointer_cast<const user_type_impl>(parse_type(type->name()));
db.find_keyspace(user_type->_keyspace).remove_user_type(user_type);
service::get_local_migration_manager().notify_drop_user_type(user_type).get();
}
for (auto&& type : altered) {
auto user_type = dynamic_pointer_cast<const user_type_impl>(parse_type(type->name()));
auto user_type = dynamic_pointer_cast<const user_type_impl>(parse_type(type.qualified_name));
db.find_keyspace(user_type->_keyspace).add_user_type(user_type);
service::get_local_migration_manager().notify_update_user_type(user_type).get();
}
});
}).get();
return user_types_to_drop{[&proxy, dropped = std::move(dropped)] {
proxy.local().get_db().invoke_on_all([dropped = std::move(dropped)](database& db) {
return do_for_each(dropped, [&db](auto& user_type_to_drop) {
auto user_type = dynamic_pointer_cast<const user_type_impl>(
parse_type(std::move(user_type_to_drop.qualified_name)));
db.find_keyspace(user_type->_keyspace).remove_user_type(user_type);
return service::get_local_migration_manager().notify_drop_user_type(user_type);
});
}).get();
}};
}
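The `merge_types()` hunk above is built around a three-way diff: types present only in `before` are dropped, types present only in `after` are created, and types present on both sides with different definitions are altered. A minimal sketch of that diff over plain `std::map` values (hypothetical types — the real code diffs schema partitions and defers the drops via `user_types_to_drop`):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Three-way diff sketch: dropped = left-only keys, created = right-only keys,
// altered = shared keys whose values differ.
struct type_diff {
    std::vector<std::string> created, altered, dropped;
};

type_diff diff_types(const std::map<std::string, std::string>& before,
                     const std::map<std::string, std::string>& after) {
    type_diff d;
    for (const auto& [name, def] : before) {
        auto it = after.find(name);
        if (it == after.end()) {
            d.dropped.push_back(name);           // keyspace/type gone from "after"
        } else if (it->second != def) {
            d.altered.push_back(name);           // present on both sides, changed
        }
    }
    for (const auto& [name, def] : after) {
        if (!before.count(name)) {
            d.created.push_back(name);           // new in "after"
        }
    }
    return d;
}
```

Creations and alterations are applied immediately, while the `dropped` list is the part the real patch wraps in a deferred callback so tables that still reference those types can be dropped first.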
#if 0
@@ -1387,7 +1474,7 @@ static void add_table_params_to_mutations(mutation& m, const clustering_key& cke
{
auto map = table->compaction_strategy_options();
map["class"] = sstables::compaction_strategy::name(table->compaction_strategy());
map["class"] = sstables::compaction_strategy::name(table->configured_compaction_strategy());
store_map(m, ckey, "compaction", timestamp, map);
}
@@ -1461,6 +1548,15 @@ static void add_dropped_column_to_schema_mutation(schema_ptr table, const sstrin
m.set_clustered_cell(ckey, "type", expand_user_type(column.type)->as_cql3_type()->to_string(), timestamp);
}
mutation make_scylla_tables_mutation(schema_ptr table, api::timestamp_type timestamp) {
schema_ptr s = tables();
auto pkey = partition_key::from_singular(*s, table->ks_name());
auto ckey = clustering_key::from_singular(*s, table->cf_name());
mutation m(pkey, scylla_tables());
m.set_clustered_cell(ckey, "version", utils::UUID(table->version()), timestamp);
return m;
}
static schema_mutations make_table_mutations(schema_ptr table, api::timestamp_type timestamp, bool with_columns_and_triggers)
{
// When adding new schema properties, don't set cells for default values so that
@@ -1474,6 +1570,8 @@ static schema_mutations make_table_mutations(schema_ptr table, api::timestamp_ty
auto ckey = clustering_key::from_singular(*s, table->cf_name());
m.set_clustered_cell(ckey, "id", table->id(), timestamp);
auto scylla_tables_mutation = make_scylla_tables_mutation(table, timestamp);
{
list_type_impl::native_type flags;
if (table->is_super()) {
@@ -1499,7 +1597,7 @@ static schema_mutations make_table_mutations(schema_ptr table, api::timestamp_ty
mutation indices_mutation(pkey, indexes());
if (with_columns_and_triggers) {
for (auto&& column : table->all_columns()) {
for (auto&& column : table->v3().all_columns()) {
add_column_to_schema_mutation(table, column, timestamp, columns_mutation);
}
for (auto&& index : table->indices()) {
@@ -1512,7 +1610,8 @@ static schema_mutations make_table_mutations(schema_ptr table, api::timestamp_ty
}
}
return schema_mutations{std::move(m), std::move(columns_mutation), std::move(indices_mutation), std::move(dropped_columns_mutation)};
return schema_mutations{std::move(m), std::move(columns_mutation), std::move(indices_mutation), std::move(dropped_columns_mutation),
std::move(scylla_tables_mutation)};
}
void add_table_or_view_to_schema_mutation(schema_ptr s, api::timestamp_type timestamp, bool with_columns, std::vector<mutation>& mutations)
@@ -1561,23 +1660,23 @@ static void make_update_columns_mutations(schema_ptr old_table,
std::vector<mutation>& mutations) {
mutation columns_mutation(partition_key::from_singular(*columns(), old_table->ks_name()), columns());
auto diff = difference(old_table->columns_by_name(), new_table->columns_by_name());
auto diff = difference(old_table->v3().columns_by_name(), new_table->v3().columns_by_name());
// columns that are no longer needed
for (auto&& name : diff.entries_only_on_left) {
// Thrift only knows about the REGULAR ColumnDefinition type, so don't assume columns of other
// types are being deleted just because they are not here.
const column_definition& column = *old_table->columns_by_name().at(name);
const column_definition& column = *old_table->v3().columns_by_name().at(name);
if (from_thrift && !column.is_regular()) {
continue;
}
drop_column_from_schema_mutation(old_table, column, timestamp, mutations);
drop_column_from_schema_mutation(columns(), old_table, column.name_as_text(), timestamp, mutations);
}
// newly added columns and old columns with updated attributes
for (auto&& name : boost::range::join(diff.entries_differing, diff.entries_only_on_right)) {
const column_definition& column = *new_table->columns_by_name().at(name);
const column_definition& column = *new_table->v3().columns_by_name().at(name);
add_column_to_schema_mutation(new_table, column, timestamp, columns_mutation);
}
@@ -1588,7 +1687,7 @@ static void make_update_columns_mutations(schema_ptr old_table,
// newly dropped columns
// columns added then dropped again
for (auto& name : dc_diff.entries_only_on_right) {
for (auto& name : boost::range::join(dc_diff.entries_differing, dc_diff.entries_only_on_right)) {
add_drop_column_to_mutations(new_table, name, new_table->dropped_columns().at(name), timestamp, mutations);
}
}
@@ -1626,12 +1725,20 @@ static void make_drop_table_or_view_mutations(schema_ptr schema_table,
api::timestamp_type timestamp,
std::vector<mutation>& mutations) {
auto pkey = partition_key::from_singular(*schema_table, table_or_view->ks_name());
mutation m{std::move(pkey), schema_table};
mutation m{pkey, schema_table};
auto ckey = clustering_key::from_singular(*schema_table, table_or_view->cf_name());
m.partition().apply_delete(*schema_table, std::move(ckey), tombstone(timestamp, gc_clock::now()));
m.partition().apply_delete(*schema_table, ckey, tombstone(timestamp, gc_clock::now()));
mutations.emplace_back(m);
for (auto &column : table_or_view->all_columns()) {
drop_column_from_schema_mutation(table_or_view, column, timestamp, mutations);
for (auto& column : table_or_view->v3().all_columns()) {
drop_column_from_schema_mutation(columns(), table_or_view, column.name_as_text(), timestamp, mutations);
}
for (auto& column : table_or_view->dropped_columns() | boost::adaptors::map_keys) {
drop_column_from_schema_mutation(dropped_columns(), table_or_view, column, timestamp, mutations);
}
{
mutation m{pkey, scylla_tables()};
m.partition().apply_delete(*scylla_tables(), ckey, tombstone(timestamp, gc_clock::now()));
mutations.emplace_back(m);
}
}
@@ -1655,17 +1762,14 @@ future<std::vector<mutation>> make_drop_table_mutations(lw_shared_ptr<keyspace_m
static future<schema_mutations> read_table_mutations(distributed<service::storage_proxy>& proxy, const qualified_name& table, schema_ptr s)
{
return read_schema_partition_for_table(proxy, s, table.keyspace_name, table.table_name)
.then([&proxy, table] (mutation cf_m) {
return read_schema_partition_for_table(proxy, columns(), table.keyspace_name, table.table_name)
.then([&proxy, table, cf_m = std::move(cf_m)] (mutation col_m) {
return read_schema_partition_for_table(proxy, dropped_columns(), table.keyspace_name, table.table_name)
.then([&proxy, table, cf_m = std::move(cf_m), col_m = std::move(col_m)] (mutation dropped_m) {
return read_schema_partition_for_table(proxy, indexes(), table.keyspace_name, table.table_name)
.then([cf_m = std::move(cf_m), col_m = std::move(col_m), dropped_m = std::move(dropped_m)] (mutation idx_m) {
return schema_mutations{std::move(cf_m), std::move(col_m), std::move(idx_m), std::move(dropped_m)};
});
});
return when_all_succeed(
read_schema_partition_for_table(proxy, s, table.keyspace_name, table.table_name),
read_schema_partition_for_table(proxy, columns(), table.keyspace_name, table.table_name),
read_schema_partition_for_table(proxy, dropped_columns(), table.keyspace_name, table.table_name),
read_schema_partition_for_table(proxy, indexes(), table.keyspace_name, table.table_name),
read_schema_partition_for_table(proxy, scylla_tables(), table.keyspace_name, table.table_name)).then(
[] (mutation cf_m, mutation col_m, mutation dropped_m, mutation idx_m, mutation st_m) {
return schema_mutations{std::move(cf_m), std::move(col_m), std::move(idx_m), std::move(dropped_m), std::move(st_m)};
});
#if 0
// FIXME:
@@ -1680,7 +1784,6 @@ static future<schema_mutations> read_table_mutations(distributed<service::storag
throw new RuntimeException(e);
}
#endif
});
}
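The rewritten `read_table_mutations()` above replaces a chain of nested `.then()` continuations with `when_all_succeed`, so the five schema-partition reads can run concurrently instead of one after another. A rough sketch of the same shape using `std::async` in place of seastar futures (assumption: the stand-in strings are placeholders for the real mutations):

```cpp
#include <future>
#include <string>

// The five pieces that schema_mutations is assembled from in the hunk above.
struct schema_parts {
    std::string cf, col, dropped, idx, st;
};

schema_parts read_all_parts() {
    // Launch all reads concurrently, mirroring when_all_succeed(...).
    auto cf      = std::async(std::launch::async, [] { return std::string("tables"); });
    auto col     = std::async(std::launch::async, [] { return std::string("columns"); });
    auto dropped = std::async(std::launch::async, [] { return std::string("dropped_columns"); });
    auto idx     = std::async(std::launch::async, [] { return std::string("indexes"); });
    auto st      = std::async(std::launch::async, [] { return std::string("scylla_tables"); });
    // Equivalent of the final .then(...): combine once every future resolves.
    return schema_parts{cf.get(), col.get(), dropped.get(), idx.get(), st.get()};
}
```

The behavioral difference is latency only: the sequential version pays five round trips back to back, while the `when_all_succeed` version overlaps them.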
future<schema_ptr> create_table_from_name(distributed<service::storage_proxy>& proxy, const sstring& keyspace, const sstring& table)
@@ -1771,7 +1874,7 @@ static void prepare_builder_from_table_row(schema_builder& builder, const query:
builder.set_min_compaction_threshold(std::stoi(map["min_threshold"]));
}
if (map.count("enabled")) {
// TODO: enable/disable?
builder.set_compaction_enabled(boost::algorithm::iequals(map["enabled"], "true"));
}
builder.set_compaction_strategy_options(map);
@@ -1870,13 +1973,12 @@ schema_ptr create_table_from_mutations(schema_mutations sm, std::experimental::o
prepare_builder_from_table_row(builder, table_row);
for (auto&& cdef : column_defs) {
builder.with_column(cdef);
}
v3_columns columns(std::move(column_defs), is_dense, is_compound);
columns.apply_to(builder);
std::vector<index_metadata> index_defs;
if (sm.indices_mutation()) {
index_defs = create_indices_from_index_rows(query::result_set(sm.indices_mutation().value()), ks_name, cf_name);
index_defs = create_indices_from_index_rows(query::result_set(*sm.indices_mutation()), ks_name, cf_name);
}
for (auto&& index : index_defs) {
builder.with_index(index);
@@ -1909,7 +2011,8 @@ static void add_column_to_schema_mutation(schema_ptr table,
api::timestamp_type timestamp,
mutation& m)
{
auto ckey = clustering_key::from_exploded(*m.schema(), {utf8_type->decompose(table->cf_name()), column.name()});
auto ckey = clustering_key::from_exploded(*m.schema(), {utf8_type->decompose(table->cf_name()),
utf8_type->decompose(column.name_as_text())});
auto order = "NONE";
if (column.is_clustering_key()) {
@@ -2003,13 +2106,19 @@ static void drop_index_from_schema_mutation(schema_ptr table, const index_metada
mutations.push_back(std::move(m));
}
static void drop_column_from_schema_mutation(schema_ptr table, const column_definition& column, long timestamp, std::vector<mutation>& mutations) {
schema_ptr s = columns();
auto pkey = partition_key::from_singular(*s, table->ks_name());
auto ckey = clustering_key::from_exploded(*s, {utf8_type->decompose(table->cf_name()), column.name()});
static void drop_column_from_schema_mutation(
schema_ptr schema_table,
schema_ptr table,
const sstring& column_name,
long timestamp,
std::vector<mutation>& mutations)
{
auto pkey = partition_key::from_singular(*schema_table, table->ks_name());
auto ckey = clustering_key::from_exploded(*schema_table, {utf8_type->decompose(table->cf_name()),
utf8_type->decompose(column_name)});
mutation m{pkey, s};
m.partition().apply_delete(*s, ckey, tombstone(timestamp, gc_clock::now()));
mutation m{pkey, schema_table};
m.partition().apply_delete(*schema_table, ckey, tombstone(timestamp, gc_clock::now()));
mutations.emplace_back(m);
}
@@ -2153,7 +2262,7 @@ static schema_mutations make_view_mutations(view_ptr view, api::timestamp_type t
mutation indices_mutation(pkey, indexes());
if (with_columns) {
for (auto&& column : view->all_columns()) {
for (auto&& column : view->v3().all_columns()) {
add_column_to_schema_mutation(view, column, timestamp, columns_mutation);
}
@@ -2165,7 +2274,10 @@ static schema_mutations make_view_mutations(view_ptr view, api::timestamp_type t
}
}
return schema_mutations{std::move(m), std::move(columns_mutation), std::move(indices_mutation), std::move(dropped_columns_mutation)};
auto scylla_tables_mutation = make_scylla_tables_mutation(view, timestamp);
return schema_mutations{std::move(m), std::move(columns_mutation), std::move(indices_mutation), std::move(dropped_columns_mutation),
std::move(scylla_tables_mutation)};
}
schema_mutations make_schema_mutations(schema_ptr s, api::timestamp_type timestamp, bool with_columns)
@@ -2459,10 +2571,33 @@ data_type parse_type(sstring str)
std::vector<schema_ptr> all_tables() {
return {
keyspaces(), tables(), columns(), dropped_columns(), triggers(),
keyspaces(), tables(), scylla_tables(), columns(), dropped_columns(), triggers(),
views(), indexes(), types(), functions(), aggregates(),
};
}
namespace legacy {
table_schema_version schema_mutations::digest() const {
md5_hasher h;
db::schema_tables::feed_hash_for_schema_digest(h, _columnfamilies);
db::schema_tables::feed_hash_for_schema_digest(h, _columns);
return utils::UUID_gen::get_name_UUID(h.finalize());
}
future<schema_mutations> read_table_mutations(distributed<service::storage_proxy>& proxy,
sstring keyspace_name, sstring table_name, schema_ptr s)
{
return read_schema_partition_for_table(proxy, s, keyspace_name, table_name)
.then([&proxy, keyspace_name, table_name] (mutation cf_m) {
return read_schema_partition_for_table(proxy, db::system_keyspace::legacy::columns(), keyspace_name, table_name)
.then([cf_m = std::move(cf_m)] (mutation col_m) {
return schema_mutations{std::move(cf_m), std::move(col_m)};
});
});
}
} // namespace legacy
} // namespace schema_tables
} // namespace schema


@@ -64,6 +64,7 @@ namespace v3 {
static constexpr auto NAME = "system_schema";
static constexpr auto KEYSPACES = "keyspaces";
static constexpr auto TABLES = "tables";
static constexpr auto SCYLLA_TABLES = "scylla_tables";
static constexpr auto COLUMNS = "columns";
static constexpr auto DROPPED_COLUMNS = "dropped_columns";
static constexpr auto TRIGGERS = "triggers";
@@ -77,16 +78,43 @@ schema_ptr columns();
schema_ptr dropped_columns();
schema_ptr indexes();
schema_ptr tables();
schema_ptr scylla_tables();
schema_ptr views();
}
namespace legacy {
class schema_mutations {
mutation _columnfamilies;
mutation _columns;
public:
schema_mutations(mutation columnfamilies, mutation columns)
: _columnfamilies(std::move(columnfamilies))
, _columns(std::move(columns))
{ }
table_schema_version digest() const;
};
future<schema_mutations> read_table_mutations(distributed<service::storage_proxy>& proxy,
sstring keyspace_name, sstring table_name, schema_ptr s);
}
using namespace v3;
// Change on non-backwards compatible changes of schema mutations.
// Replication of schema between nodes with different version is inhibited.
extern const sstring version;
extern std::vector<const char*> ALL;
std::vector<schema_ptr> all_tables();
// saves/creates "ks" + all tables etc, while first deleting all old schema entries (will be rewritten)
future<> save_system_schema(const sstring & ks);
// saves/creates "system_schema" keyspace
future<> save_system_keyspace_schema();
future<utils::UUID> calculate_schema_digest(distributed<service::storage_proxy>& proxy);
@@ -137,6 +165,7 @@ view_ptr create_view_from_mutations(schema_mutations, std::experimental::optiona
future<std::vector<view_ptr>> create_views_from_schema_partition(distributed<service::storage_proxy>& proxy, const schema_result::mapped_type& result);
schema_mutations make_schema_mutations(schema_ptr s, api::timestamp_type timestamp, bool with_columns);
mutation make_scylla_tables_mutation(schema_ptr, api::timestamp_type timestamp);
void add_table_or_view_to_schema_mutation(schema_ptr view, api::timestamp_type timestamp, bool with_columns, std::vector<mutation>& mutations);
@@ -153,15 +182,11 @@ data_type parse_type(sstring str);
sstring serialize_index_kind(index_metadata_kind kind);
index_metadata_kind deserialize_index_kind(sstring kind);
mutation compact_for_schema_digest(const mutation& m);
template<typename Hasher>
void feed_hash_for_schema_digest(Hasher& h, const mutation& m) {
// Cassandra is skipping tombstones from digest calculation
// to avoid disagreements due to tombstone GC.
// See https://issues.apache.org/jira/browse/CASSANDRA-6862.
// We achieve a similar effect with compact_for_compaction().
mutation m_compacted(m);
m_compacted.partition().compact_for_compaction(*m.schema(), always_gc, gc_clock::time_point::max());
feed_hash(h, m_compacted);
feed_hash(h, compact_for_schema_digest(m));
}
} // namespace schema_tables


@@ -1044,6 +1044,9 @@ future<> setup(distributed<database>& db, distributed<cql3::query_processor>& qp
return check_health();
}).then([] {
return db::schema_tables::save_system_keyspace_schema();
}).then([] {
// #2514 - make sure "system" is written to system_schema.keyspaces.
return db::schema_tables::save_system_schema(NAME);
}).then([] {
return netw::get_messaging_service().invoke_on_all([] (auto& ms){
return ms.init_local_preferred_ip_cache();


@@ -62,6 +62,8 @@ namespace cql3 {
class query_processor;
}
bool is_system_keyspace(const sstring& ks_name);
namespace db {
namespace system_keyspace {
@@ -120,6 +122,18 @@ extern schema_ptr hints();
extern schema_ptr batchlog();
extern schema_ptr built_indexes(); // TODO (from Cassandra): make private
namespace legacy {
schema_ptr keyspaces();
schema_ptr column_families();
schema_ptr columns();
schema_ptr triggers();
schema_ptr usertypes();
schema_ptr functions();
schema_ptr aggregates();
}
table_schema_version generate_schema_version(utils::UUID table_id);
// Only for testing.


@@ -194,13 +194,13 @@ public:
: _view(std::move(view))
, _view_info(*_view->view_info())
, _base(std::move(base))
, _updates(8, partition_key::hashing(*_base), partition_key::equality(*_base)) {
, _updates(8, partition_key::hashing(*_view), partition_key::equality(*_view)) {
}
void move_to(std::vector<mutation>& mutations) && {
auto& partitioner = dht::global_partitioner();
std::transform(_updates.begin(), _updates.end(), std::back_inserter(mutations), [&, this] (auto&& m) {
return mutation(_view, partitioner.decorate_key(*_base, std::move(m.first)), std::move(m.second));
return mutation(_view, partitioner.decorate_key(*_view, std::move(m.first)), std::move(m.second));
});
}


@@ -59,14 +59,11 @@ future<> boot_strapper::bootstrap() {
streamer->add_ranges(keyspace_name, ranges);
}
return streamer->fetch_async().then_wrapped([streamer] (auto&& f) {
try {
auto state = f.get0();
} catch (...) {
throw std::runtime_error(sprint("Error during bootstrap: %s", std::current_exception()));
}
return streamer->stream_async().then([streamer] () {
service::get_local_storage_service().finish_bootstrapping();
return make_ready_future<>();
}).handle_exception([streamer] (std::exception_ptr eptr) {
blogger.warn("Error during bootstrap: {}", eptr);
return make_exception_future<>(std::move(eptr));
});
}


@@ -260,6 +260,27 @@ unsigned shard_of(const token& t) {
return global_partitioner().shard_of(t);
}
stdx::optional<dht::token_range>
selective_token_range_sharder::next() {
if (_done) {
return {};
}
while (_range.overlaps(dht::token_range(_start_boundary, {}), dht::token_comparator())
&& !(_start_boundary && _start_boundary->value() == maximum_token())) {
auto end_token = _partitioner.token_for_next_shard(_start_token, _next_shard);
auto candidate = dht::token_range(std::move(_start_boundary), range_bound<dht::token>(end_token, false));
auto intersection = _range.intersection(std::move(candidate), dht::token_comparator());
_start_token = _partitioner.token_for_next_shard(end_token, _shard);
_start_boundary = range_bound<dht::token>(_start_token);
if (intersection) {
return *intersection;
}
}
_done = true;
return {};
}
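`selective_token_range_sharder::next()` above walks a token range and yields only the sub-ranges owned by one shard, advancing past foreign shards via `token_for_next_shard()`. A toy model with dense integer tokens and `shard_of(t) = t % shard_count` (an assumption for illustration — the real partitioner's shard mapping is more involved, so owned sub-ranges are generally wider than one token):

```cpp
#include <utility>
#include <vector>

// Enumerate the sub-ranges of [start, end] whose tokens all belong to `shard`,
// under the toy mapping shard_of(t) = t % shard_count. With dense integer
// tokens each owned sub-range collapses to a single token [t, t].
std::vector<std::pair<int, int>> subranges_for_shard(int start, int end,
                                                     int shard, int shard_count) {
    std::vector<std::pair<int, int>> out;
    for (int t = start; t <= end; ++t) {
        if (t % shard_count == shard) {
            out.emplace_back(t, t);   // intersection of the range with this shard
        }
    }
    return out;
}
```

The real class does the same intersection lazily, one call to `next()` per sub-range, which is what lets resharding-style streaming pull only the data a given shard owns.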
stdx::optional<ring_position_range_and_shard>
ring_position_range_sharder::next(const schema& s) {
if (_done) {
@@ -462,14 +483,13 @@ int ring_position_comparator::operator()(ring_position_view lh, ring_position_vi
}
}
int ring_position_comparator::operator()(ring_position_view lh, sstables::key_view rh) const {
auto rh_token = global_partitioner().get_token(rh);
auto token_cmp = tri_compare(*lh._token, rh_token);
int ring_position_comparator::operator()(ring_position_view lh, sstables::decorated_key_view rh) const {
auto token_cmp = tri_compare(*lh._token, rh.token());
if (token_cmp) {
return token_cmp;
}
if (lh._key) {
auto rel = rh.tri_compare(s, *lh._key);
auto rel = rh.key().tri_compare(s, *lh._key);
if (rel) {
return -rel;
}
@@ -477,7 +497,7 @@ int ring_position_comparator::operator()(ring_position_view lh, sstables::key_vi
return lh._weight;
}
int ring_position_comparator::operator()(sstables::key_view a, ring_position_view b) const {
int ring_position_comparator::operator()(sstables::decorated_key_view a, ring_position_view b) const {
return -(*this)(b, a);
}


@@ -55,6 +55,7 @@
namespace sstables {
class key_view;
class decorated_key_view;
}
@@ -547,8 +548,8 @@ struct ring_position_comparator {
const schema& s;
ring_position_comparator(const schema& s_) : s(s_) {}
int operator()(ring_position_view, ring_position_view) const;
int operator()(ring_position_view, sstables::key_view) const;
int operator()(sstables::key_view, ring_position_view) const;
int operator()(ring_position_view, sstables::decorated_key_view) const;
int operator()(sstables::decorated_key_view, ring_position_view) const;
};
// "less" comparator giving the same order as ring_position_comparator
@@ -671,6 +672,29 @@ split_ranges_to_shards(const dht::token_range_vector& ranges, const schema& s);
std::vector<partition_range> split_range_to_single_shard(const schema& s, const dht::partition_range& pr, shard_id shard);
std::vector<partition_range> split_range_to_single_shard(const i_partitioner& partitioner, const schema& s, const dht::partition_range& pr, shard_id shard);
class selective_token_range_sharder {
const i_partitioner& _partitioner;
dht::token_range _range;
shard_id _shard;
bool _done = false;
shard_id _next_shard;
dht::token _start_token;
stdx::optional<range_bound<dht::token>> _start_boundary;
public:
explicit selective_token_range_sharder(dht::token_range range, shard_id shard)
: selective_token_range_sharder(global_partitioner(), std::move(range), shard) {}
selective_token_range_sharder(const i_partitioner& partitioner, dht::token_range range, shard_id shard)
: _partitioner(partitioner)
, _range(std::move(range))
, _shard(shard)
, _next_shard(_shard + 1 == _partitioner.shard_count() ? 0 : _shard + 1)
, _start_token(_range.start() ? _range.start()->value() : minimum_token())
, _start_boundary(_partitioner.shard_of(_start_token) == shard ?
_range.start() : range_bound<dht::token>(_partitioner.token_for_next_shard(_start_token, shard))) {
}
stdx::optional<dht::token_range> next();
};
} // dht
namespace std {


@@ -193,8 +193,7 @@ range_streamer::get_all_ranges_with_strict_sources_for(const sstring& keyspace_n
inet_address source_ip = range_sources.find(desired_range)->second;
auto& gossiper = gms::get_local_gossiper();
auto source_state = gossiper.get_endpoint_state_for_endpoint(source_ip);
if (gossiper.is_enabled() && source_state && !source_state->is_alive()) {
if (gossiper.is_enabled() && !gossiper.is_alive(source_ip)) {
throw std::runtime_error(sprint("A node required to move the data consistently is down (%s). If you wish to move the data from a potentially inconsistent replica, restart the node with consistent_rangemovement=false", source_ip));
}
}
@@ -211,7 +210,36 @@ bool range_streamer::use_strict_sources_for_ranges(const sstring& keyspace_name)
&& _metadata.get_all_endpoints().size() != strat.get_replication_factor();
}
void range_streamer::add_tx_ranges(const sstring& keyspace_name, std::unordered_map<inet_address, dht::token_range_vector> ranges_per_endpoint, std::vector<sstring> column_families) {
if (_nr_rx_added) {
throw std::runtime_error("Mixed sending and receiving is not supported");
}
_nr_tx_added++;
_to_stream.emplace(keyspace_name, std::move(ranges_per_endpoint));
auto inserted = _column_families.emplace(keyspace_name, std::move(column_families)).second;
if (!inserted) {
throw std::runtime_error("Cannot add column_families for the same keyspace more than once");
}
}
void range_streamer::add_rx_ranges(const sstring& keyspace_name, std::unordered_map<inet_address, dht::token_range_vector> ranges_per_endpoint, std::vector<sstring> column_families) {
if (_nr_tx_added) {
throw std::runtime_error("Mixed sending and receiving is not supported");
}
_nr_rx_added++;
_to_stream.emplace(keyspace_name, std::move(ranges_per_endpoint));
auto inserted = _column_families.emplace(keyspace_name, std::move(column_families)).second;
if (!inserted) {
throw std::runtime_error("Cannot add column_families for the same keyspace more than once");
}
}
// TODO: This is the legacy range_streamer interface, it is add_rx_ranges which adds rx ranges.
void range_streamer::add_ranges(const sstring& keyspace_name, dht::token_range_vector ranges) {
if (_nr_tx_added) {
throw std::runtime_error("Mixed sending and receiving is not supported");
}
_nr_rx_added++;
auto ranges_for_keyspace = use_strict_sources_for_ranges(keyspace_name)
? get_all_ranges_with_strict_sources_for(keyspace_name, ranges)
: get_all_ranges_with_sources_for(keyspace_name, ranges);
@@ -232,26 +260,114 @@ void range_streamer::add_ranges(const sstring& keyspace_name, dht::token_range_v
logger.debug("{} : range {} from source {} for keyspace {}", _description, x.second, x.first, keyspace_name);
}
}
_to_fetch.emplace(keyspace_name, std::move(range_fetch_map));
_to_stream.emplace(keyspace_name, std::move(range_fetch_map));
}
future<streaming::stream_state> range_streamer::fetch_async() {
for (auto& fetch : _to_fetch) {
const auto& keyspace = fetch.first;
for (auto& x : fetch.second) {
auto& source = x.first;
auto& ranges = x.second;
/* Send messages to respective folks to stream data over to me */
if (logger.is_enabled(logging::log_level::debug)) {
logger.debug("{}ing from {} ranges {}", _description, source, ranges);
future<> range_streamer::stream_async() {
return seastar::async([this] {
int sleep_time = 60;
for (;;) {
try {
do_stream_async().get();
break;
} catch (...) {
logger.warn("{} failed to stream. Will retry in {} seconds ...", _description, sleep_time);
sleep_abortable(std::chrono::seconds(sleep_time)).get();
sleep_time *= 1.5;
if (++_nr_retried >= _nr_max_retry) {
throw;
}
}
_stream_plan.request_ranges(source, keyspace, ranges);
}
});
}
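The retry loop in `stream_async()` above backs off multiplicatively: each failure sleeps, then scales the delay by 1.5 (with integer truncation, since `sleep_time` is an `int`), and the operation is abandoned after `_nr_max_retry` failures. A small sketch of the resulting sleep schedule:

```cpp
#include <vector>

// Reproduce the delays the loop above would sleep between attempts:
// start at initial_seconds, multiply by 1.5 (truncated to int) after each
// failure, stop after max_retries failures.
std::vector<int> backoff_schedule(int initial_seconds, int max_retries) {
    std::vector<int> sleeps;
    int sleep_time = initial_seconds;
    for (int retried = 0; retried < max_retries; ++retried) {
        sleeps.push_back(sleep_time);   // sleep before the next attempt
        sleep_time *= 1.5;              // int truncation, as in the loop above
    }
    return sleeps;
}
```

With the defaults in this patch (60 s initial, 5 retries) the delays come out to 60, 90, 135, 202, 303 seconds.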
future<> range_streamer::do_stream_async() {
auto nr_ranges_remaining = nr_ranges_to_stream();
logger.info("{} starts, nr_ranges_remaining={}", _description, nr_ranges_remaining);
auto start = lowres_clock::now();
return do_for_each(_to_stream, [this, start, description = _description] (auto& stream) {
const auto& keyspace = stream.first;
auto& ip_range_vec = stream.second;
// Fetch from or send to peer node in parallel
return parallel_for_each(ip_range_vec, [this, description, keyspace] (auto& ip_range) {
auto& source = ip_range.first;
auto& range_vec = ip_range.second;
return seastar::async([this, description, keyspace, source, &range_vec] () mutable {
// TODO: It is better to use a fiber instead of a thread here because
// creating a thread per peer can consume significant memory in a large cluster.
auto start_time = lowres_clock::now();
unsigned sp_index = 0;
unsigned nr_ranges_streamed = 0;
size_t nr_ranges_total = range_vec.size();
size_t nr_ranges_per_stream_plan = nr_ranges_total / 10;
dht::token_range_vector ranges_to_stream;
auto do_streaming = [&] {
auto sp = stream_plan(sprint("%s-%s-index-%d", description, keyspace, sp_index++));
logger.info("{} with {} for keyspace={}, {} out of {} ranges: ranges = {}",
description, source, keyspace, nr_ranges_streamed, nr_ranges_total, ranges_to_stream.size());
if (_nr_rx_added) {
sp.request_ranges(source, keyspace, ranges_to_stream, _column_families[keyspace]);
} else if (_nr_tx_added) {
sp.transfer_ranges(source, keyspace, ranges_to_stream, _column_families[keyspace]);
}
sp.execute().discard_result().get();
ranges_to_stream.clear();
};
try {
for (auto it = range_vec.begin(); it < range_vec.end();) {
ranges_to_stream.push_back(*it);
it = range_vec.erase(it);
nr_ranges_streamed++;
if (ranges_to_stream.size() < nr_ranges_per_stream_plan) {
continue;
} else {
do_streaming();
}
}
if (ranges_to_stream.size() > 0) {
do_streaming();
}
} catch (...) {
for (auto& range : ranges_to_stream) {
range_vec.push_back(range);
}
auto t = std::chrono::duration_cast<std::chrono::seconds>(lowres_clock::now() - start_time).count();
logger.warn("{} with {} for keyspace={} failed, took {} seconds: {}", description, source, keyspace, t, std::current_exception());
throw;
}
auto t = std::chrono::duration_cast<std::chrono::seconds>(lowres_clock::now() - start_time).count();
logger.info("{} with {} for keyspace={} succeeded, took {} seconds", description, source, keyspace, t);
});
});
}).finally([this, start] {
auto t = std::chrono::duration_cast<std::chrono::seconds>(lowres_clock::now() - start).count();
auto nr_ranges_remaining = nr_ranges_to_stream();
if (nr_ranges_remaining) {
logger.warn("{} failed, took {} seconds, nr_ranges_remaining={}", _description, t, nr_ranges_remaining);
} else {
logger.info("{} succeeded, took {} seconds, nr_ranges_remaining={}", _description, t, nr_ranges_remaining);
}
});
}
size_t range_streamer::nr_ranges_to_stream() {
size_t nr_ranges_remaining = 0;
for (auto& fetch : _to_stream) {
const auto& keyspace = fetch.first;
auto& ip_range_vec = fetch.second;
for (auto& ip_range : ip_range_vec) {
auto& source = ip_range.first;
auto& range_vec = ip_range.second;
nr_ranges_remaining += range_vec.size();
logger.debug("Remaining: keyspace={}, source={}, ranges={}", keyspace, source, range_vec);
}
}
return _stream_plan.execute();
return nr_ranges_remaining;
}
std::unordered_multimap<inet_address, dht::token_range>
range_streamer::get_work_map(const std::unordered_multimap<dht::token_range, inet_address>& ranges_with_source_target,
const sstring& keyspace) {


@@ -119,6 +119,8 @@ public:
}
void add_ranges(const sstring& keyspace_name, dht::token_range_vector ranges);
void add_tx_ranges(const sstring& keyspace_name, std::unordered_map<inet_address, dht::token_range_vector> ranges_per_endpoint, std::vector<sstring> column_families = {});
void add_rx_ranges(const sstring& keyspace_name, std::unordered_map<inet_address, dht::token_range_vector> ranges_per_endpoint, std::vector<sstring> column_families = {});
private:
bool use_strict_sources_for_ranges(const sstring& keyspace_name);
/**
@@ -159,16 +161,25 @@ public:
}
#endif
public:
future<streaming::stream_state> fetch_async();
future<> stream_async();
future<> do_stream_async();
size_t nr_ranges_to_stream();
private:
distributed<database>& _db;
token_metadata& _metadata;
std::unordered_set<token> _tokens;
inet_address _address;
sstring _description;
std::unordered_multimap<sstring, std::unordered_map<inet_address, dht::token_range_vector>> _to_fetch;
std::unordered_multimap<sstring, std::unordered_map<inet_address, dht::token_range_vector>> _to_stream;
std::unordered_set<std::unique_ptr<i_source_filter>> _source_filters;
stream_plan _stream_plan;
std::unordered_map<sstring, std::vector<sstring>> _column_families;
// Retry the stream plan _nr_max_retry times
unsigned _nr_retried = 0;
unsigned _nr_max_retry = 5;
// Number of tx and rx ranges added
unsigned _nr_tx_added = 0;
unsigned _nr_rx_added = 0;
};
} // dht


@@ -79,13 +79,14 @@ if [ $LOCALRPM -eq 1 ]; then
cd ../..
cp build/scylla-jmx/build/rpms/scylla-jmx-`cat build/scylla-jmx/build/SCYLLA-VERSION-FILE`-`cat build/scylla-jmx/build/SCYLLA-RELEASE-FILE`.*.noarch.rpm dist/ami/files/scylla-jmx.noarch.rpm
fi
if [ ! -f dist/ami/files/scylla-tools.noarch.rpm ]; then
if [ ! -f dist/ami/files/scylla-tools.noarch.rpm ] || [ ! -f dist/ami/files/scylla-tools-core.noarch.rpm ]; then
cd build
git clone --depth 1 https://github.com/scylladb/scylla-tools-java.git
cd scylla-tools-java
sh -x -e dist/redhat/build_rpm.sh
cd ../..
cp build/scylla-tools-java/build/rpms/scylla-tools-`cat build/scylla-tools-java/build/SCYLLA-VERSION-FILE`-`cat build/scylla-tools-java/build/SCYLLA-RELEASE-FILE`.*.noarch.rpm dist/ami/files/scylla-tools.noarch.rpm
cp build/scylla-tools-java/build/rpms/scylla-tools-core-`cat build/scylla-tools-java/build/SCYLLA-VERSION-FILE`-`cat build/scylla-tools-java/build/SCYLLA-RELEASE-FILE`.*.noarch.rpm dist/ami/files/scylla-tools-core.noarch.rpm
fi
else
sudo apt-get install -y git


@@ -1 +0,0 @@
options raid0 devices_discard_performance=Y


@@ -75,13 +75,16 @@ while getopts ":hdncap:q:" opt; do
done
##Check if server is Fedora/Debian release##
cat /etc/os-release | grep fedora &> /dev/null
##Check server release (Fedora/Oracle/Debian)##
cat /etc/os-release | grep -i fedora &> /dev/null
if [ $? -ne 0 ]; then
IS_FEDORA="1"
cat /etc/os-release | grep -i oracle &> /dev/null
if [ $? -ne 0 ]; then
IS_FEDORA="1"
fi
fi
cat /etc/os-release | grep debian &> /dev/null
cat /etc/os-release | grep -i debian &> /dev/null
if [ $? -ne 0 ]; then
IS_DEBIAN="1"
fi
@@ -91,25 +94,24 @@ if [ "$IS_FEDORA" == "1" ] && [ "$IS_DEBIAN" == "1" ]; then
exit 222
fi
##Pass criteria for script execution##
#Check scylla service#
##Scylla-server service status##
echo "--------------------------------------------------"
echo "Checking Scylla Service"
echo "Checking Scylla-server Service"
echo "--------------------------------------------------"
ps -C scylla --no-headers &> /dev/null
if [ $? -ne 0 ]; then
SCYLLA_SERVICE="1"
echo "ERROR: Scylla is NOT Running"
echo "ERROR: Scylla-server is NOT Running"
echo "Cannot Collect Data Model Info"
echo "--------------------------------------------------"
else
echo "Scylla Service: OK"
echo "Scylla-server Service: OK"
echo "--------------------------------------------------"
fi
#Check Scylla-JMX service#
##Scylla-JMX service status##
echo "Checking Scylla-JMX Service on Port $JMX_PORT"
echo "--------------------------------------------------"
@@ -121,7 +123,7 @@ if [ $? -ne 0 ]; then
echo "Use the '-p' Option to Provide the Scylla-JMX Port"
echo "--------------------------------------------------"
else
echo "JMX Service (nodetool): OK"
echo "Scylla-JMX Service (nodetool): OK"
echo "--------------------------------------------------"
fi
@@ -152,12 +154,12 @@ mkdir -p $OUTPUT_PATH1 $OUTPUT_PATH2 $OUTPUT_PATH3 $OUTPUT_PATH4 $OUTPUT_PATH5
#System Checks#
echo "Collecting System Info"
echo "--------------------------------------------------"
cat /etc/os-release > $OUTPUT_PATH1/os-release.txt
cp -p /etc/os-release $OUTPUT_PATH1
uname -r > $OUTPUT_PATH1/kernel-release.txt
lscpu > $OUTPUT_PATH1/cpu-info.txt
vmstat -s -S M | awk '{$1=$1};1' > $OUTPUT_PATH1/vmstat.txt
df -Th > $OUTPUT_PATH1/capacity-info.txt && echo "" >> $OUTPUT_PATH1/capacity-info.txt && sudo du -sh /var/lib/scylla/* >> $OUTPUT_PATH1/capacity-info.txt
cat /proc/mdstat > $OUTPUT_PATH1/raid-conf.txt
cp -p /proc/mdstat $OUTPUT_PATH1
for f in `sudo find /sys -name scheduler`; do echo -n "$f: "; cat $f; done > $OUTPUT_PATH1/io-sched-conf.txt && echo "" >> $OUTPUT_PATH1/io-sched-conf.txt
for f in `sudo find /sys -name nomerges`; do echo -n "$f: "; cat $f; done >> $OUTPUT_PATH1/io-sched-conf.txt
@@ -166,30 +168,23 @@ for f in `sudo find /sys -name nomerges`; do echo -n "$f: "; cat $f; done >> $O
echo "Collecting Scylla Info"
echo "--------------------------------------------------"
scylla --version > $OUTPUT_PATH2/scylla-version.txt
cp -p /etc/scylla/* $OUTPUT_PATH2
ls -ltrh /var/lib/scylla/coredump/ > $OUTPUT_PATH2/coredump-folder.txt
if [ "$IS_FEDORA" == "0" ]; then
rpm -qa | grep -i scylla > $OUTPUT_PATH2/scylla-pkgs.txt
cp -p /etc/sysconfig/scylla-server $OUTPUT_PATH2
fi
if [ "$IS_DEBIAN" == "0" ]; then
dpkg -l | grep -i scylla > $OUTPUT_PATH2/scylla-pkgs.txt
cp -p /etc/default/scylla-server $OUTPUT_PATH2
fi
curl -s -X GET "http://localhost:10000/storage_service/scylla_release_version" > $OUTPUT_PATH2/scylla-version.txt && echo "" >> $OUTPUT_PATH2/scylla-version.txt
cat /etc/scylla/scylla.yaml | grep -v "#" | grep -v "^[[:space:]]*$" > $OUTPUT_PATH2/scylla-yaml.txt
if [ "$IS_FEDORA" == "0" ]; then
cat /etc/sysconfig/scylla-server | grep -v "^[[:space:]]*$" > $OUTPUT_PATH2/scylla-server.txt
fi
if [ "$IS_DEBIAN" == "0" ]; then
cat /etc/default/scylla-server | grep -v "^[[:space:]]*$" > $OUTPUT_PATH2/scylla-server.txt
fi
cat /etc/scylla/cassandra-rackdc.properties | grep -v "#" |grep -v "^[[:space:]]*$" > $OUTPUT_PATH2/multi-DC.txt
ls -ltrh /var/lib/scylla/coredump/ > $OUTPUT_PATH2/coredump-folder.txt
#Scylla Logs#
echo "--------------------------------------------------"
echo "Collecting Logs"
echo "--------------------------------------------------"
@@ -256,7 +251,7 @@ for i in `ls -I lo /sys/class/net/`; do echo "--$i"; cat /sys/class/net/$i/queue
for i in `ls -I lo /sys/class/net/`; do echo "--$i"; cat /sys/class/net/$i/queues/rx-*/rps_flow_cnt; echo ""; done > $OUTPUT_PATH5/rfs-conf.txt
ps -elf | grep irqbalance > $OUTPUT_PATH5/irqbalance-conf.txt
sudo sysctl -a > $OUTPUT_PATH5/sysctl.txt 2>&1
sudo iptables -L > $OUTPUT_PATH5/iptables.txt
sudo iptables -L -v > $OUTPUT_PATH5/iptables.txt
netstat -an | grep tcp > $OUTPUT_PATH5/netstat-tcp.txt
@@ -297,7 +292,7 @@ echo "" >> $REPORT
echo "Host Operating System" >> $REPORT
echo "---------------------" >> $REPORT
cat $OUTPUT_PATH1/os-release.txt >> $REPORT
cat $OUTPUT_PATH1/os-release >> $REPORT
echo "" >> $REPORT
echo "" >> $REPORT
@@ -327,7 +322,7 @@ echo "" >> $REPORT
echo "RAID Configuration" >> $REPORT
echo "------------------" >> $REPORT
cat $OUTPUT_PATH1/raid-conf.txt >> $REPORT
cat $OUTPUT_PATH1/mdstat >> $REPORT
echo "" >> $REPORT
echo "" >> $REPORT
@@ -354,7 +349,7 @@ echo "" >> $REPORT
echo "Configuration files" >> $REPORT
echo "-------------------" >> $REPORT
echo "## /etc/scylla/scylla.yaml ##" >> $REPORT
cat $OUTPUT_PATH2/scylla-yaml.txt >> $REPORT
cat $OUTPUT_PATH2/scylla.yaml | grep -v "#" | grep -v "^[[:space:]]*$" >> $REPORT
echo "" >> $REPORT
echo "" >> $REPORT
@@ -366,11 +361,11 @@ if [ "$IS_DEBIAN" == "0" ]; then
echo "## /etc/default/scylla-server ##" >> $REPORT
fi
cat $OUTPUT_PATH2/scylla-server.txt >> $REPORT
cat $OUTPUT_PATH2/scylla-server | grep -v "^[[:space:]]*$" >> $REPORT
echo "" >> $REPORT
echo "" >> $REPORT
echo "## /etc/scylla/cassandra-rackdc.properties ##" >> $REPORT
cat $OUTPUT_PATH2/multi-DC.txt >> $REPORT
cat $OUTPUT_PATH2/cassandra-rackdc.properties | grep -v "#" |grep -v "^[[:space:]]*$" >> $REPORT
echo "" >> $REPORT
echo "" >> $REPORT


@@ -4,6 +4,10 @@
. /usr/lib/scylla/scylla_lib.sh
if [ ! -f /sys/devices/system/cpu/cpufreq/policy0/scaling_governor ]; then
echo "This computer doesn't support CPU scaling configuration."
exit 0
fi
if is_debian_variant; then
apt-get install -y cpufrequtils
service cpufrequtils stop


@@ -2,6 +2,8 @@
#
# Copyright (C) 2016 ScyllaDB
. /usr/lib/scylla/scylla_lib.sh
print_usage() {
echo "scylla_cpuset_setup --cpuset 1-7 --smp 7"
echo " --cpuset CPUs to use (in cpuset(7) format; default: all))"
@@ -38,5 +40,6 @@ fi
if [ "$SMP" != "" ]; then
OUT="$OUT--smp $SMP "
fi
rm -f /etc/scylla.d/perftune.yaml
OUT="$OUT\""
echo $OUT > /etc/scylla.d/cpuset.conf


@@ -38,6 +38,51 @@ ec2_is_supported_instance_type() {
esac
}
#
# get_tune_mode <NIC name>
#
get_tune_mode() {
local nic=$1
# if cpuset.conf doesn't exist use the default mode
[[ ! -e '/etc/scylla.d/cpuset.conf' ]] && return
local cur_cpuset=`cat /etc/scylla.d/cpuset.conf | cut -d "\"" -f2- | cut -d" " -f2`
local mq_cpuset=`/usr/lib/scylla/perftune.py --tune net --nic "$nic" --mode mq --get-cpu-mask | /usr/lib/scylla/hex2list.py`
local sq_cpuset=`/usr/lib/scylla/perftune.py --tune net --nic "$nic" --mode sq --get-cpu-mask | /usr/lib/scylla/hex2list.py`
local sq_split_cpuset=`/usr/lib/scylla/perftune.py --tune net --nic "$nic" --mode sq_split --get-cpu-mask | /usr/lib/scylla/hex2list.py`
local tune_mode=""
case "$cur_cpuset" in
"$mq_cpuset")
tune_mode="--mode mq"
;;
"$sq_cpuset")
tune_mode="--mode sq"
;;
"$sq_split_cpuset")
tune_mode="--mode sq_split"
;;
esac
# if cpuset is something different from what we expect - use the default mode
echo "$tune_mode"
}
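`get_tune_mode` above works backwards: it asks perftune.py what cpuset each mode (mq, sq, sq_split) would produce on this NIC, then matches the cpuset currently recorded in cpuset.conf against those candidates; anything unrecognized falls through to an empty string, which callers treat as "use the default mode". The matching step can be sketched in Python (the mask strings below are made-up examples, not real CPU lists):

```python
def get_tune_mode(cur_cpuset, candidate_masks):
    """Map a configured cpuset back to the perftune mode that produces it.

    candidate_masks: {mode_name: cpu_list_string} as computed per mode by
    perftune.py; returns '' when nothing matches, meaning "default mode".
    """
    for mode, cpuset in candidate_masks.items():
        if cur_cpuset == cpuset:
            return "--mode " + mode
    return ""
```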
#
# create_perftune_conf [<NIC name>]
#
create_perftune_conf() {
local nic=$1
[[ -z "$nic" ]] && nic='eth0'
# if the file already exists, do nothing
[[ -e '/etc/scylla.d/perftune.yaml' ]] && return
local mode=`get_tune_mode "$nic"`
/usr/lib/scylla/perftune.py --tune net --nic "$nic" $mode --dump-options-file > /etc/scylla.d/perftune.yaml
}
. /etc/os-release
if is_debian_variant || is_gentoo_variant; then
SYSCONFIG=/etc/default


@@ -22,7 +22,8 @@ elif [ "$NETWORK_MODE" = "dpdk" ]; then
done
else # NETWORK_MODE = posix
if [ "$SET_NIC" = "yes" ]; then
/usr/lib/scylla/posix_net_conf.sh $IFNAME
create_perftune_conf "$IFNAME"
/usr/lib/scylla/posix_net_conf.sh $IFNAME --options-file /etc/scylla.d/perftune.yaml
fi
fi
if [ "$ID" = "ubuntu" ]; then


@@ -104,7 +104,11 @@ else
mdadm --create --verbose --force --run $RAID --level=0 -c1024 --raid-devices=$NR_DISK $DISKS
mkfs.xfs $RAID -f -K
fi
mdadm --detail --scan > /etc/mdadm.conf
if is_debian_variant; then
mdadm --detail --scan > /etc/mdadm/mdadm.conf
else
mdadm --detail --scan > /etc/mdadm.conf
fi
mkdir -p "$MOUNT_AT"
mount -t xfs -o noatime $RAID "$MOUNT_AT"
@@ -122,3 +126,7 @@ if [ $FSTAB -ne 0 ]; then
UUID=`blkid $RAID | awk '{print $2}'`
echo "$UUID $MOUNT_AT xfs noatime 0 0" >> /etc/fstab
fi
if is_debian_variant; then
update-initramfs -u
fi


@@ -75,7 +75,7 @@ verify_package() {
if is_debian_variant; then
dpkg -s $1 > /dev/null 2>&1 &&:
elif is_gentoo_variant; then
find /var/db/pkg/dev-db -type d -name "${1}-*" | egrep -q ".*"
find /var/db/pkg/app-admin -type d -name "${1}-*" | egrep -q ".*"
else
rpm -q $1 > /dev/null 2>&1 &&:
fi


@@ -6,7 +6,7 @@ After=network.target
Type=simple
User=scylla
Group=scylla
ExecStart=/usr/lib/scylla/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid --repo-files '/etc/yum.repos.d/scylla*.repo' -q -c /etc/scylla.d/housekeeping.cfg version --mode d
ExecStart=/usr/lib/scylla/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid -q -c /etc/scylla.d/housekeeping.cfg --repo-files @@REPOFILES@@ version --mode d
[Install]
WantedBy=multi-user.target


@@ -6,7 +6,7 @@ After=network.target
Type=simple
User=scylla
Group=scylla
ExecStart=/usr/lib/scylla/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid -q --repo-files '/etc/yum.repos.d/scylla*.repo' -c /etc/scylla.d/housekeeping.cfg version --mode r
ExecStart=/usr/lib/scylla/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid -q -c /etc/scylla.d/housekeeping.cfg --repo-files @@REPOFILES@@ version --mode r
[Install]
WantedBy=multi-user.target


@@ -1,6 +1,6 @@
[Unit]
Description=Scylla Server
After=network.target
After=network-online.target
Wants=scylla-jmx.service
Wants=scylla-housekeeping-restart.timer
Wants=scylla-housekeeping-daily.timer


@@ -129,9 +129,11 @@ sed -i -e "s/@@CODENAME@@/$TARGET/g" debian/changelog
cp dist/debian/rules.in debian/rules
cp dist/debian/control.in debian/control
cp dist/debian/scylla-server.install.in debian/scylla-server.install
cp dist/debian/scylla-conf.preinst.in debian/scylla-conf.preinst
sed -i -e "s/@@VERSION@@/$SCYLLA_VERSION/g" debian/scylla-conf.preinst
if [ "$TARGET" = "jessie" ]; then
cp dist/debian/scylla-server.cron.d debian/
sed -i -e "s/@@REVISION@@/1/g" debian/changelog
sed -i -e "s/@@REVISION@@/1~$TARGET/g" debian/changelog
sed -i -e "s/@@DH_INSTALLINIT@@//g" debian/rules
sed -i -e "s/@@COMPILER@@/g++-5/g" debian/rules
sed -i -e "s/@@BUILD_DEPENDS@@/libsystemd-dev, g++-5, libunwind-dev/g" debian/control
@@ -145,7 +147,7 @@ if [ "$TARGET" = "jessie" ]; then
sed -i -e "s#@@SCRIPTS_DELAY_FSTRIM@@#dist/debian/scripts/scylla_delay_fstrim usr/lib/scylla#g" debian/scylla-server.install
elif [ "$TARGET" = "stretch" ] || [ "$TARGET" = "buster" ] || [ "$TARGET" = "sid" ]; then
cp dist/debian/scylla-server.cron.d debian/
sed -i -e "s/@@REVISION@@/1/g" debian/changelog
sed -i -e "s/@@REVISION@@/1~$TARGET/g" debian/changelog
sed -i -e "s/@@DH_INSTALLINIT@@//g" debian/rules
sed -i -e "s/@@COMPILER@@/g++/g" debian/rules
sed -i -e "s/@@BUILD_DEPENDS@@/libsystemd-dev, g++, libunwind8-dev/g" debian/control
@@ -159,7 +161,7 @@ elif [ "$TARGET" = "stretch" ] || [ "$TARGET" = "buster" ] || [ "$TARGET" = "sid
sed -i -e "s#@@SCRIPTS_DELAY_FSTRIM@@#dist/debian/scripts/scylla_delay_fstrim usr/lib/scylla#g" debian/scylla-server.install
elif [ "$TARGET" = "trusty" ]; then
cp dist/debian/scylla-server.cron.d debian/
sed -i -e "s/@@REVISION@@/0ubuntu1/g" debian/changelog
sed -i -e "s/@@REVISION@@/0ubuntu1~$TARGET/g" debian/changelog
sed -i -e "s/@@DH_INSTALLINIT@@/--upstart-only/g" debian/rules
sed -i -e "s/@@COMPILER@@/g++-5/g" debian/rules
sed -i -e "s/@@BUILD_DEPENDS@@/g++-5, libunwind8-dev/g" debian/control
@@ -172,7 +174,7 @@ elif [ "$TARGET" = "trusty" ]; then
sed -i -e "s#@@SCRIPTS_FSTRIM@@#dist/debian/scripts/scylla_fstrim usr/lib/scylla#g" debian/scylla-server.install
sed -i -e "s#@@SCRIPTS_DELAY_FSTRIM@@#dist/debian/scripts/scylla_delay_fstrim usr/lib/scylla#g" debian/scylla-server.install
elif [ "$TARGET" = "xenial" ] || [ "$TARGET" = "yakkety" ] || [ "$TARGET" = "zesty" ] || [ "$TARGET" = "artful" ]; then
sed -i -e "s/@@REVISION@@/0ubuntu1/g" debian/changelog
sed -i -e "s/@@REVISION@@/0ubuntu1~$TARGET/g" debian/changelog
sed -i -e "s/@@DH_INSTALLINIT@@//g" debian/rules
sed -i -e "s/@@COMPILER@@/g++/g" debian/rules
sed -i -e "s/@@BUILD_DEPENDS@@/libsystemd-dev, g++, libunwind-dev/g" debian/control
@@ -194,8 +196,10 @@ else
fi
cp dist/common/systemd/scylla-server.service.in debian/scylla-server.service
sed -i -e "s#@@SYSCONFDIR@@#/etc/default#g" debian/scylla-server.service
cp dist/common/systemd/scylla-housekeeping-daily.service debian/scylla-server.scylla-housekeeping-daily.service
cp dist/common/systemd/scylla-housekeeping-restart.service debian/scylla-server.scylla-housekeeping-restart.service
cp dist/common/systemd/scylla-housekeeping-daily.service.in debian/scylla-server.scylla-housekeeping-daily.service
sed -i -e "s#@@REPOFILES@@#'/etc/apt/sources.list.d/scylla*.list'#g" debian/scylla-server.scylla-housekeeping-daily.service
cp dist/common/systemd/scylla-housekeeping-restart.service.in debian/scylla-server.scylla-housekeeping-restart.service
sed -i -e "s#@@REPOFILES@@#'/etc/apt/sources.list.d/scylla*.list'#g" debian/scylla-server.scylla-housekeeping-restart.service
cp dist/common/systemd/node-exporter.service debian/scylla-server.node-exporter.service
if [ $REBUILD -eq 1 ]; then


@@ -40,7 +40,7 @@ Description: Scylla kernel tuning configuration
Package: scylla
Section: metapackages
Architecture: any
Depends: scylla-server, scylla-jmx, scylla-tools, scylla-kernel-conf
Depends: scylla-server, scylla-jmx, scylla-tools, scylla-tools-core, scylla-kernel-conf
Description: Scylla database metapackage
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.


@@ -7,7 +7,8 @@ KVER=$(uname -r)
if [[ $KVER =~ 3\.13\.0\-([0-9]+)-generic ]]; then
echo "kernel $KVER detected, skip running sysctl..."
else
sysctl -p/etc/sysctl.d/99-scylla-sched.conf
# expect failures in virtualized environments
sysctl -p/etc/sysctl.d/99-scylla-sched.conf || :
fi
#DEBHELPER#


@@ -3,12 +3,22 @@
set -e
if [ "$1" = configure ]; then
adduser --system \
--quiet \
--home /var/lib/scylla \
--no-create-home \
--disabled-password \
--group scylla
getent passwd scylla || NOUSR=1
getent group scylla || NOGRP=1
# this handles both cases: group does not exist || group already exists
if [ $NOUSR ]; then
adduser --system \
--quiet \
--home /var/lib/scylla \
--no-create-home \
--disabled-password \
--group scylla
# only the group is missing: create it and add the user to it
elif [ $NOGRP ]; then
addgroup --system scylla
adduser scylla scylla
fi
chown -R scylla:scylla /var/lib/scylla
chown -R scylla:scylla /var/lib/scylla-housekeeping
fi

dist/debian/scylla-conf.preinst.in (new file, 28 lines)

@@ -0,0 +1,28 @@
#!/bin/bash
ver=$(dpkg -l|grep scylla-server|awk '{print $3}'|sed -e "s/-.*$//")
if [ -n "$ver" ]; then
ver_fmt=$(echo $ver | awk -F. '{printf "%d%02d%02d", $1,$2,$3}')
if [ $ver_fmt -lt 10703 ]; then
# for <scylla-1.2
if [ ! -f /usr/lib/scylla/scylla_config_get.py ]; then
echo
echo "Error: Upgrading from scylla-$ver to scylla-@@VERSION@@ is not supported."
echo "Please upgrade to scylla-1.7.3 or later, before upgrade to @@VERSION@@."
echo
exit 1
fi
commitlog_directory=$(/usr/lib/scylla/scylla_config_get.py -g commitlog_directory)
commitlog_files=$(ls $commitlog_directory | wc -l)
if [ $commitlog_files -ne 0 ]; then
echo
echo "Error: Upgrading from scylla-$ver to scylla-@@VERSION@@ is not supported when commitlog is not clean."
echo "Please upgrade to scylla-1.7.3 or later, before upgrade to @@VERSION@@."
echo "Also make sure $commitlog_directory is empty."
echo
exit 1
fi
fi
fi
#DEBHELPER#
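The preinst above encodes the installed dotted version as a comparable integer: `awk -F. '{printf "%d%02d%02d", $1,$2,$3}'` turns 1.7.3 into 10703, so a plain numeric compare (`-lt 10703`) can gate the upgrade. The same encoding in Python, purely for illustration (it assumes a three-part `X.Y.Z` version with minor/patch below 100, as the awk form does):

```python
def encode_version(ver):
    """Encode 'X.Y.Z' as X*10000 + Y*100 + Z, mirroring awk's %d%02d%02d."""
    major, minor, patch = (int(p) for p in ver.split(".")[:3])
    return major * 10000 + minor * 100 + patch
```

With this encoding, ordinary integer ordering matches version ordering for the versions the script cares about, which is why the shell comparison works.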


@@ -7,7 +7,7 @@ ENV container docker
VOLUME [ "/sys/fs/cgroup" ]
#install scylla
RUN curl http://downloads.scylladb.com/rpm/unstable/centos/master/latest/scylla.repo -o /etc/yum.repos.d/scylla.repo
RUN curl http://downloads.scylladb.com/rpm/centos/scylla-2.0.repo -o /etc/yum.repos.d/scylla.repo
RUN yum -y install epel-release
RUN yum -y clean expire-cache
RUN yum -y update


@@ -70,5 +70,7 @@ class ScyllaSetup:
if self._experimental == "1":
args += [ "--experimental=on" ]
args += ["--blocked-reactor-notify-ms 999999999"]
with open("/etc/scylla.d/docker.conf", "w") as cqlshrc:
cqlshrc.write("SCYLLA_DOCKER_ARGS=\"%s\"\n" % " ".join(args))


@@ -104,9 +104,9 @@ fi
if [ $JOBS -gt 0 ]; then
SRPM_OPTS="$SRPM_OPTS --define='_smp_mflags -j$JOBS'"
RPM_JOBS_OPTS=(--define="_smp_mflags -j$JOBS")
fi
sudo mock --buildsrpm --root=$TARGET --resultdir=`pwd`/build/srpms --spec=build/scylla.spec --sources=build/scylla-$VERSION.tar $SRPM_OPTS
sudo mock --buildsrpm --root=$TARGET --resultdir=`pwd`/build/srpms --spec=build/scylla.spec --sources=build/scylla-$VERSION.tar $SRPM_OPTS "${RPM_JOBS_OPTS[@]}"
if [ "$TARGET" = "epel-7-x86_64" ] && [ $REBUILD = 1 ]; then
./dist/redhat/centos_dep/build_dependency.sh
sudo mock --init --root=$TARGET
@@ -116,4 +116,4 @@ elif [ "$TARGET" = "epel-7-x86_64" ] && [ $REBUILD = 0 ]; then
TARGET=scylla-$TARGET
RPM_OPTS="$RPM_OPTS --configdir=dist/redhat/mock"
fi
sudo mock --rebuild --root=$TARGET --resultdir=`pwd`/build/rpms $RPM_OPTS build/srpms/scylla-$VERSION*.src.rpm
sudo mock --rebuild --root=$TARGET --resultdir=`pwd`/build/rpms $RPM_OPTS "${RPM_JOBS_OPTS[@]}" build/srpms/scylla-$VERSION*.src.rpm


@@ -33,8 +33,8 @@
Requires(post): coreutils
-Requires(post): %{_sbindir}/alternatives
-Requires(preun): %{_sbindir}/alternatives
+Requires(post): /sbin/alternatives
+Requires(preun): /sbin/alternatives
+Requires(post): /usr/sbin/alternatives
+Requires(preun): /usr/sbin/alternatives
%endif
# On ARM EABI systems, we do want -gnueabi to be part of the
@@ -58,13 +58,13 @@
%if "%{build_gold}" == "both"
%__rm -f %{_bindir}/%{?cross}ld
-%{_sbindir}/alternatives --install %{_bindir}/%{?cross}ld %{?cross}ld \
+/sbin/alternatives --install %{_bindir}/%{?cross}ld %{?cross}ld \
+/usr/sbin/alternatives --install %{_bindir}/%{?cross}ld %{?cross}ld \
%{_bindir}/%{?cross}ld.bfd %{ld_bfd_priority}
-%{_sbindir}/alternatives --install %{_bindir}/%{?cross}ld %{?cross}ld \
+/sbin/alternatives --install %{_bindir}/%{?cross}ld %{?cross}ld \
+/usr/sbin/alternatives --install %{_bindir}/%{?cross}ld %{?cross}ld \
%{_bindir}/%{?cross}ld.gold %{ld_gold_priority}
-%{_sbindir}/alternatives --auto %{?cross}ld
+/sbin/alternatives --auto %{?cross}ld
+/usr/sbin/alternatives --auto %{?cross}ld
%endif
%if %{isnative}
/sbin/ldconfig
@@ -74,8 +74,8 @@
if [ $1 = 0 ]; then
- %{_sbindir}/alternatives --remove %{?cross}ld %{_bindir}/%{?cross}ld.bfd
- %{_sbindir}/alternatives --remove %{?cross}ld %{_bindir}/%{?cross}ld.gold
+ /sbin/alternatives --remove %{?cross}ld %{_bindir}/%{?cross}ld.bfd
+ /sbin/alternatives --remove %{?cross}ld %{_bindir}/%{?cross}ld.gold
+ /usr/sbin/alternatives --remove %{?cross}ld %{_bindir}/%{?cross}ld.bfd
+ /usr/sbin/alternatives --remove %{?cross}ld %{_bindir}/%{?cross}ld.gold
fi
%endif
%if %{isnative}


@@ -71,13 +71,13 @@ enabled=0
[scylla-3rdparty]
name=Scylla 3rdParty for Centos $releasever - $basearch
baseurl=http://downloads.scylladb.com/rpm/unstable/centos/master/latest/3rdparty/7/x86_64/
baseurl=http://downloads.scylladb.com/rpm/3rdparty/centos/scylladb-2.0/$releasever/$basearch/
enabled=1
gpgcheck=0
[scylla-3rdparty-generic]
name=Scylla 3rdParty for Centos $releasever
baseurl=http://downloads.scylladb.com/rpm/unstable/centos/master/latest/3rdparty/7/noarch/
baseurl=http://downloads.scylladb.com/rpm/3rdparty/centos/scylladb-2.0/$releasever/noarch/
enabled=1
gpgcheck=0
"""


@@ -7,14 +7,14 @@ Group: Applications/Databases
License: AGPLv3
URL: http://www.scylladb.com/
Source0: %{name}-@@VERSION@@-@@RELEASE@@.tar
Requires: scylla-server scylla-jmx scylla-tools scylla-kernel-conf
Requires: scylla-server = @@VERSION@@ scylla-jmx = @@VERSION@@ scylla-tools = @@VERSION@@ scylla-tools-core = @@VERSION@@ scylla-kernel-conf = @@VERSION@@
Obsoletes: scylla-server < 1.1
%description
Scylla is a highly scalable, eventually consistent, distributed,
partitioned row DB.
This package installs all required packages for ScyllaDB, including
scylla-server, scylla-jmx, scylla-tools.
scylla-server, scylla-jmx, scylla-tools, scylla-tools-core.
# this is needed to prevent python compilation error on CentOS (#2235)
%if 0%{?rhel}
@@ -78,6 +78,10 @@ python3.4 ./configure.py --enable-dpdk --mode=release --static-stdc++ --static-b
ninja-build %{?_smp_mflags} build/release/scylla build/release/iotune
cp dist/common/systemd/scylla-server.service.in build/scylla-server.service
sed -i -e "s#@@SYSCONFDIR@@#/etc/sysconfig#g" build/scylla-server.service
cp dist/common/systemd/scylla-housekeeping-restart.service.in build/scylla-housekeeping-restart.service
sed -i -e "s#@@REPOFILES@@#'/etc/yum.repos.d/scylla*.repo'#g" build/scylla-housekeeping-restart.service
cp dist/common/systemd/scylla-housekeeping-daily.service.in build/scylla-housekeeping-daily.service
sed -i -e "s#@@REPOFILES@@#'/etc/yum.repos.d/scylla*.repo'#g" build/scylla-housekeeping-daily.service
%install
rm -rf $RPM_BUILD_ROOT
@@ -88,9 +92,6 @@ mkdir -p $RPM_BUILD_ROOT%{_sysconfdir}/security/limits.d/
mkdir -p $RPM_BUILD_ROOT%{_sysconfdir}/collectd.d/
mkdir -p $RPM_BUILD_ROOT%{_sysconfdir}/scylla/
mkdir -p $RPM_BUILD_ROOT%{_sysconfdir}/scylla.d/
%if 0%{?rhel}
mkdir -p $RPM_BUILD_ROOT%{_sysconfdir}/modprobe.d/
%endif
mkdir -p $RPM_BUILD_ROOT%{_sysctldir}/
mkdir -p $RPM_BUILD_ROOT%{_docdir}/scylla/
mkdir -p $RPM_BUILD_ROOT%{_unitdir}
@@ -101,9 +102,6 @@ install -m644 dist/common/limits.d/scylla.conf $RPM_BUILD_ROOT%{_sysconfdir}/sec
install -m644 dist/common/collectd.d/scylla.conf $RPM_BUILD_ROOT%{_sysconfdir}/collectd.d/
install -m644 dist/common/scylla.d/*.conf $RPM_BUILD_ROOT%{_sysconfdir}/scylla.d/
install -m644 dist/common/sysctl.d/*.conf $RPM_BUILD_ROOT%{_sysctldir}/
%if 0%{?rhel}
install -m644 dist/common/modprobe.d/*.conf $RPM_BUILD_ROOT%{_sysconfdir}/modprobe.d/
%endif
install -d -m755 $RPM_BUILD_ROOT%{_sysconfdir}/scylla
install -m644 conf/scylla.yaml $RPM_BUILD_ROOT%{_sysconfdir}/scylla/
install -m644 conf/cassandra-rackdc.properties $RPM_BUILD_ROOT%{_sysconfdir}/scylla/
@@ -267,18 +265,9 @@ if Scylla is the main application on your server and you wish to optimize its la
# We cannot use the sysctl_apply rpm macro because it is not present in 7.0
# following is a "manual" expansion
/usr/lib/systemd/systemd-sysctl 99-scylla-sched.conf >/dev/null 2>&1 || :
# Write modprobe.d params when module already loaded
%if 0%{?rhel}
if [ -e /sys/module/raid0/parameters/devices_discard_performance ]; then
echo Y > /sys/module/raid0/parameters/devices_discard_performance
fi
%endif
%files kernel-conf
%defattr(-,root,root)
%if 0%{?rhel}
%config(noreplace) %{_sysconfdir}/modprobe.d/*.conf
%endif
%{_sysctldir}/*.conf
%changelog


@@ -62,6 +62,7 @@ static const std::map<application_state, sstring> application_state_names = {
{application_state::TOKENS, "TOKENS"},
{application_state::SUPPORTED_FEATURES, "SUPPORTED_FEATURES"},
{application_state::CACHE_HITRATES, "CACHE_HITRATES"},
{application_state::SCHEMA_TABLES_VERSION, "SCHEMA_TABLES_VERSION"},
};
std::ostream& operator<<(std::ostream& os, const application_state& m) {


@@ -59,8 +59,8 @@ enum class application_state {
TOKENS,
SUPPORTED_FEATURES,
CACHE_HITRATES,
SCHEMA_TABLES_VERSION,
// pad to allow adding new states to existing cluster
X3,
X4,
X5,
X6,


@@ -42,12 +42,12 @@
namespace gms {
std::experimental::optional<versioned_value> endpoint_state::get_application_state(application_state key) const {
const versioned_value* endpoint_state::get_application_state_ptr(application_state key) const {
auto it = _application_state.find(key);
if (it == _application_state.end()) {
return {};
return nullptr;
} else {
return _application_state.at(key);
return &it->second;
}
}


@@ -43,6 +43,8 @@
#include "gms/heart_beat_state.hh"
#include "gms/application_state.hh"
#include "gms/versioned_value.hh"
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <experimental/optional>
#include <chrono>
@@ -54,7 +56,7 @@ namespace gms {
*/
class endpoint_state {
public:
using clk = std::chrono::system_clock;
using clk = seastar::lowres_system_clock;
private:
heart_beat_state _heart_beat_state;
std::map<application_state, versioned_value> _application_state;
@@ -89,10 +91,12 @@ public:
, _is_alive(true) {
}
// Valid only on shard 0
heart_beat_state& get_heart_beat_state() {
return _heart_beat_state;
}
// Valid only on shard 0
const heart_beat_state& get_heart_beat_state() const {
return _heart_beat_state;
}
@@ -102,7 +106,7 @@ public:
_heart_beat_state = hbs;
}
std::experimental::optional<versioned_value> get_application_state(application_state key) const;
const versioned_value* get_application_state_ptr(application_state key) const;
/**
* TODO replace this with operations that don't expose private state
@@ -117,18 +121,36 @@ public:
}
void add_application_state(application_state key, versioned_value value) {
if (_application_state.count(key)) {
_application_state.at(key) = value;
} else {
_application_state.emplace(key, value);
_application_state[key] = std::move(value);
}
void apply_application_state(application_state key, versioned_value&& value) {
auto&& e = _application_state[key];
if (e.version < value.version) {
e = std::move(value);
}
}
void apply_application_state(application_state key, const versioned_value& value) {
auto&& e = _application_state[key];
if (e.version < value.version) {
e = value;
}
}
void apply_application_state(const endpoint_state& es) {
for (auto&& e : es._application_state) {
apply_application_state(e.first, e.second);
}
}
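The `apply_application_state` overloads above implement a last-writer-wins merge keyed on each value's version: an incoming value replaces the stored one only when its version is strictly newer, so stale gossip updates are ignored. A simplified Python rendering of that merge rule (types reduced to the essentials):

```python
class VersionedValue:
    """Simplified stand-in for gms::versioned_value: a payload plus a version."""
    def __init__(self, value, version):
        self.value = value
        self.version = version

def apply_application_state(state_map, key, incoming):
    """Keep the entry with the highest version; drop stale updates."""
    current = state_map.get(key)
    if current is None or current.version < incoming.version:
        state_map[key] = incoming
```

Applying a whole endpoint state, as the third overload does, is just this merge repeated per key.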
/* getters and setters */
/**
* @return System.nanoTime() when state was updated last time.
*
* Valid only on shard 0.
*/
clk::time_point get_update_timestamp() {
clk::time_point get_update_timestamp() const {
return _update_timestamp;
}
@@ -136,16 +158,34 @@ public:
_update_timestamp = clk::now();
}
bool is_alive() {
bool is_alive() const {
return _is_alive;
}
void set_alive(bool alive) {
_is_alive = alive;
}
void mark_alive() {
_is_alive = true;
set_alive(true);
}
void mark_dead() {
_is_alive = false;
set_alive(false);
}
bool is_shutdown() const {
auto* app_state = get_application_state_ptr(application_state::STATUS);
if (!app_state) {
return false;
}
auto value = app_state->value;
std::vector<sstring> pieces;
boost::split(pieces, value, boost::is_any_of(","));
if (pieces.empty()) {
return false;
}
return pieces[0] == sstring(versioned_value::SHUTDOWN);
}
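`is_shutdown` above inspects the gossip STATUS application state, whose value is a comma-separated string whose first token is the status name; the node counts as shut down only when that first token equals the shutdown marker (`versioned_value::SHUTDOWN` in the C++; the literal `"shutdown"` below is an assumption for illustration):

```python
def is_shutdown(status_value, shutdown_token="shutdown"):
    """True when the first comma-separated token of STATUS is the shutdown marker.

    A missing STATUS state (None) means the node is not known to be shut down.
    """
    if status_value is None:
        return False
    pieces = status_value.split(",")
    return bool(pieces) and pieces[0] == shutdown_token
```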
friend std::ostream& operator<<(std::ostream& os, const endpoint_state& x);


@@ -36,6 +36,7 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <boost/range/adaptor/map.hpp>
#include "gms/failure_detector.hh"
#include "gms/gossiper.hh"
#include "gms/i_failure_detector.hh"
@@ -43,6 +44,7 @@
#include "gms/endpoint_state.hh"
#include "gms/application_state.hh"
#include "gms/inet_address.hh"
#include "service/storage_service.hh"
#include "log.hh"
#include <iostream>
#include <chrono>
@@ -56,46 +58,26 @@ constexpr std::chrono::milliseconds failure_detector::DEFAULT_MAX_PAUSE;
using clk = arrival_window::clk;
static clk::duration get_initial_value() {
#if 0
String newvalue = System.getProperty("cassandra.fd_initial_value_ms");
if (newvalue == null)
{
return Gossiper.intervalInMillis * 2;
}
else
{
logger.info("Overriding FD INITIAL_VALUE to {}ms", newvalue);
return Integer.parseInt(newvalue);
}
#endif
warn(unimplemented::cause::GOSSIP);
return std::chrono::seconds(2);
auto& cfg = service::get_local_storage_service().db().local().get_config();
return std::chrono::milliseconds(cfg.fd_initial_value_ms());
}
clk::duration arrival_window::get_max_interval() {
#if 0
sstring newvalue = System.getProperty("cassandra.fd_max_interval_ms");
if (newvalue == null)
{
return failure_detector.INITIAL_VALUE_NANOS;
}
else
{
logger.info("Overriding FD MAX_INTERVAL to {}ms", newvalue);
return TimeUnit.NANOSECONDS.convert(Integer.parseInt(newvalue), TimeUnit.MILLISECONDS);
}
#endif
warn(unimplemented::cause::GOSSIP);
return get_initial_value();
auto& cfg = service::get_local_storage_service().db().local().get_config();
return std::chrono::milliseconds(cfg.fd_max_interval_ms());
}
static clk::duration get_min_interval() {
return gossiper::INTERVAL;
}
void arrival_window::add(clk::time_point value, const gms::inet_address& ep) {
if (_tlast > clk::time_point::min()) {
auto inter_arrival_time = value - _tlast;
if (inter_arrival_time <= get_max_interval()) {
if (inter_arrival_time <= get_max_interval() && inter_arrival_time >= get_min_interval()) {
_arrival_intervals.add(inter_arrival_time.count());
} else {
logger.debug("failure_detector: Ignoring interval time of {} for {}", inter_arrival_time.count(), ep);
logger.debug("failure_detector: Ignoring interval time of {} for {}, mean={}, size={}", inter_arrival_time.count(), ep, mean(), size());
}
} else {
// We use a very large initial interval since the "right" average depends on the cluster size
@@ -145,39 +127,27 @@ std::map<sstring, sstring> failure_detector::get_simple_states() {
auto& state = entry.second;
std::stringstream ss;
ss << ep;
if (state.is_alive())
if (state.is_alive()) {
nodes_status.emplace(sstring(ss.str()), "UP");
else
} else {
nodes_status.emplace(sstring(ss.str()), "DOWN");
}
}
return nodes_status;
}
int failure_detector::get_down_endpoint_count() {
int count = 0;
for (auto& entry : get_local_gossiper().endpoint_state_map) {
auto& state = entry.second;
if (!state.is_alive()) {
count++;
}
}
return count;
return get_local_gossiper().endpoint_state_map.size() - get_up_endpoint_count();
}
int failure_detector::get_up_endpoint_count() {
int count = 0;
for (auto& entry : get_local_gossiper().endpoint_state_map) {
auto& state = entry.second;
if (state.is_alive()) {
count++;
}
}
return count;
return boost::count_if(get_local_gossiper().endpoint_state_map | boost::adaptors::map_values, std::mem_fn(&endpoint_state::is_alive));
}
sstring failure_detector::get_endpoint_state(sstring address) {
std::stringstream ss;
auto eps = get_local_gossiper().get_endpoint_state_for_endpoint(inet_address(address));
auto* eps = get_local_gossiper().get_endpoint_state_for_endpoint_ptr(inet_address(address));
if (eps) {
append_endpoint_state(ss, *eps);
return sstring(ss.str());
@@ -186,7 +156,7 @@ sstring failure_detector::get_endpoint_state(sstring address) {
}
}
void failure_detector::append_endpoint_state(std::stringstream& ss, endpoint_state& state) {
void failure_detector::append_endpoint_state(std::stringstream& ss, const endpoint_state& state) {
ss << " generation:" << state.get_heart_beat_state().get_generation() << "\n";
ss << " heartbeat:" << state.get_heart_beat_state().get_heart_beat_version() << "\n";
for (const auto& entry : state.get_application_state_map()) {
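The failure-detector change above clamps inter-arrival samples to the range [min_interval, max_interval] before feeding them into the bounded stats deque, so gossip bursts and long pauses don't skew the mean. A minimal Python sketch of that arrival-window shape (class and parameter names are hypothetical, not the Scylla API):

```python
from collections import deque

class ArrivalWindow:
    """Bounded window of heartbeat inter-arrival times (sketch of arrival_window)."""
    def __init__(self, size, min_interval=0.5, max_interval=2.0):
        self.intervals = deque(maxlen=size)   # plays the role of bounded_stats_deque
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.tlast = None

    def add(self, now):
        if self.tlast is not None:
            dt = now - self.tlast
            # Only intervals inside [min, max] are representative: shorter
            # ones are message bursts, longer ones are pauses or partitions.
            if self.min_interval <= dt <= self.max_interval:
                self.intervals.append(dt)
        self.tlast = now

    def mean(self):
        return sum(self.intervals) / len(self.intervals) if self.intervals else None
```

With `min_interval=0.5` and `max_interval=2.0`, arrivals at t = 0, 1, 1.1, 4, 5 record only the two 1-second gaps; the 0.1s burst and the 2.9s pause are ignored, as in the diff's debug-logged branch.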


@@ -58,7 +58,7 @@ class endpoint_state;
class arrival_window {
public:
using clk = std::chrono::system_clock;
using clk = seastar::lowres_system_clock;
private:
clk::time_point _tlast{clk::time_point::min()};
utils::bounded_stats_deque _arrival_intervals;
@@ -87,6 +87,8 @@ public:
// see CASSANDRA-2597 for an explanation of the math at work here.
double phi(clk::time_point tnow);
size_t size() { return _arrival_intervals.size(); }
friend std::ostream& operator<<(std::ostream& os, const arrival_window& w);
};
@@ -154,7 +156,7 @@ public:
}
private:
void append_endpoint_state(std::stringstream& ss, endpoint_state& state);
void append_endpoint_state(std::stringstream& ss, const endpoint_state& state);
public:
/**


@@ -21,6 +21,8 @@
#pragma once
#include <seastar/core/shared_future.hh>
namespace gms {
/**
@@ -31,19 +33,16 @@ namespace gms {
*/
class feature final {
sstring _name;
bool _enabled;
bool _enabled = false;
mutable shared_promise<> _pr;
friend class gossiper;
public:
explicit feature(sstring name, bool enabled = false);
feature() = default;
~feature();
feature()
: _enabled(false)
{ }
feature(const feature& other)
: feature(other._name, other._enabled)
{ }
feature(const feature& other) = delete;
void enable();
feature& operator=(feature other);
feature& operator=(feature&& other);
const sstring& name() const {
return _name;
}
@@ -53,6 +52,7 @@ public:
friend inline std::ostream& operator<<(std::ostream& os, const feature& f) {
return os << "{ gossip feature = " << f._name << " }";
}
future<> when_enabled() const { return _pr.get_shared_future(); }
};
} // namespace gms
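The `gms::feature` change above backs `when_enabled()` with a `shared_promise<>` so any number of waiters can block until the cluster enables the feature. A rough asyncio analogue of that pattern (a sketch, not the Seastar API; `Feature` and `demo` are illustrative names):

```python
import asyncio

class Feature:
    """Sketch of gms::feature: waiters block until the feature is enabled."""
    def __init__(self, name, enabled=False):
        self.name = name
        self._event = asyncio.Event()   # plays the role of shared_promise<>
        if enabled:
            self._event.set()

    @property
    def enabled(self):
        return self._event.is_set()

    def enable(self):
        self._event.set()               # wakes every when_enabled() waiter

    async def when_enabled(self):
        await self._event.wait()

async def demo():
    f = Feature("correct_counter_order")
    waiter = asyncio.create_task(f.when_enabled())
    await asyncio.sleep(0)              # let the waiter start and block
    f.enable()
    await waiter                        # resolves once enable() fires
    return f.enabled
```

Like the shared promise in the diff, a single `enable()` satisfies every outstanding and future `when_enabled()` call.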


@@ -68,7 +68,11 @@ public:
return _digests;
}
std::map<inet_address, endpoint_state> get_endpoint_state_map() const {
std::map<inet_address, endpoint_state>& get_endpoint_state_map() {
return _map;
}
const std::map<inet_address, endpoint_state>& get_endpoint_state_map() const {
return _map;
}

File diff suppressed because it is too large


@@ -49,6 +49,7 @@
#include "gms/application_state.hh"
#include "gms/endpoint_state.hh"
#include "gms/feature.hh"
#include "utils/loading_shared_values.hh"
#include "message/messaging_service_fwd.hh"
#include <boost/algorithm/string.hpp>
#include <experimental/optional>
@@ -80,9 +81,9 @@ class i_failure_detector;
* Upon hearing a GossipShutdownMessage, this module will instantly mark the remote node as down in
* the Failure Detector.
*/
class gossiper : public i_failure_detection_event_listener, public seastar::async_sharded_service<gossiper> {
class gossiper : public i_failure_detection_event_listener, public seastar::async_sharded_service<gossiper>, public seastar::peering_sharded_service<gossiper> {
public:
using clk = std::chrono::system_clock;
using clk = seastar::lowres_system_clock;
private:
using messaging_verb = netw::messaging_verb;
using messaging_service = netw::messaging_service;
@@ -105,6 +106,7 @@ private:
std::set<inet_address> _seeds_from_config;
sstring _cluster_name;
semaphore _callback_running{1};
semaphore _apply_state_locally_semaphore{100};
public:
future<> timer_callback_lock() { return _callback_running.wait(); }
void timer_callback_unlock() { _callback_running.signal(); }
@@ -118,10 +120,18 @@ public:
void set_seeds(std::set<inet_address> _seeds);
public:
static clk::time_point inline now() { return clk::now(); }
public:
using endpoint_locks_map = utils::loading_shared_values<inet_address, semaphore>;
struct endpoint_permit {
endpoint_locks_map::entry_ptr _ptr;
semaphore_units<> _units;
};
future<endpoint_permit> lock_endpoint(inet_address);
public:
/* map where key is the endpoint and value is the state associated with the endpoint */
std::unordered_map<inet_address, endpoint_state> endpoint_state_map;
std::unordered_map<inet_address, endpoint_state> shadow_endpoint_state_map;
// Used for serializing changes to endpoint_state_map and running of associated change listeners.
endpoint_locks_map endpoint_locks;
const std::vector<sstring> DEAD_STATES = {
versioned_value::REMOVING_TOKEN,
@@ -192,7 +202,7 @@ private:
std::unordered_set<inet_address> _pending_mark_alive_endpoints;
/* unreachable member set */
std::map<inet_address, clk::time_point> _unreachable_endpoints;
std::unordered_map<inet_address, clk::time_point> _unreachable_endpoints;
/* initial seeds for joining the cluster */
std::set<inet_address> _seeds;
@@ -209,10 +219,19 @@ private:
clk::time_point _last_processed_message_at = now();
std::map<inet_address, clk::time_point> _shadow_unreachable_endpoints;
std::unordered_map<inet_address, clk::time_point> _shadow_unreachable_endpoints;
std::vector<inet_address> _shadow_live_endpoints;
void run();
// Replicates the given endpoint_state to all other shards.
// The state doesn't have to be kept alive until the call completes.
future<> replicate(inet_address, const endpoint_state&);
// Replicates "states" from "src" to all other shards.
// "src" and "states" must be kept alive and unchanged until the call completes.
future<> replicate(inet_address, const std::map<application_state, versioned_value>& src, const std::vector<application_state>& states);
// Replicates the given value to all other shards.
// The value must be kept alive and unchanged until the call completes.
future<> replicate(inet_address, application_state key, const versioned_value& value);
public:
gossiper();
@@ -384,7 +403,15 @@ private:
public:
clk::time_point get_expire_time_for_endpoint(inet_address endpoint);
std::experimental::optional<endpoint_state> get_endpoint_state_for_endpoint(inet_address ep) const;
const endpoint_state* get_endpoint_state_for_endpoint_ptr(inet_address ep) const;
endpoint_state& get_endpoint_state(inet_address ep);
endpoint_state* get_endpoint_state_for_endpoint_ptr(inet_address ep);
const versioned_value* get_application_state_ptr(inet_address endpoint, application_state appstate) const;
// Use with caution, copies might be expensive (see #764)
stdx::optional<endpoint_state> get_endpoint_state_for_endpoint(inet_address ep) const;
// removes ALL endpoint states; should only be called after shadow gossip
void reset_endpoint_state_map();
@@ -393,8 +420,6 @@ public:
bool uses_host_id(inet_address endpoint);
bool uses_vnodes(inet_address endpoint);
utils::UUID get_host_id(inet_address endpoint);
std::experimental::optional<endpoint_state> get_state_for_version_bigger_than(inet_address for_endpoint, int version);
@@ -404,10 +429,10 @@ public:
*/
int compare_endpoint_startup(inet_address addr1, inet_address addr2);
void notify_failure_detector(std::map<inet_address, endpoint_state> remoteEpStateMap);
void notify_failure_detector(const std::map<inet_address, endpoint_state>& remoteEpStateMap);
void notify_failure_detector(inet_address endpoint, endpoint_state remote_endpoint_state);
void notify_failure_detector(inet_address endpoint, const endpoint_state& remote_endpoint_state);
private:
void mark_alive(inet_address addr, endpoint_state& local_state);
@@ -425,10 +450,10 @@ private:
void handle_major_state_change(inet_address ep, const endpoint_state& eps);
public:
bool is_alive(inet_address ep);
bool is_alive(inet_address ep) const;
bool is_dead_state(const endpoint_state& eps) const;
future<> apply_state_locally(const std::map<inet_address, endpoint_state>& map);
future<> apply_state_locally(std::map<inet_address, endpoint_state> map);
private:
void apply_new_states(inet_address addr, endpoint_state& local_state, const endpoint_state& remote_state);
@@ -488,11 +513,11 @@ public:
future<> do_stop_gossiping();
public:
bool is_enabled();
bool is_enabled() const;
void finish_shadow_round();
bool is_in_shadow_round();
bool is_in_shadow_round() const;
void goto_shadow_round();
@@ -504,7 +529,9 @@ public:
void dump_endpoint_state_map();
void debug_show();
public:
bool is_seed(const inet_address& endpoint) const;
bool is_shutdown(const inet_address& endpoint) const;
bool is_normal(const inet_address& endpoint) const;
bool is_silent_shutdown_state(const endpoint_state& ep_state) const;
void mark_as_shutdown(const inet_address& endpoint);
void force_newer_generation();
@@ -534,10 +561,10 @@ private:
public:
void check_knows_remote_features(sstring local_features_string) const;
void check_knows_remote_features(sstring local_features_string, std::unordered_map<inet_address, sstring> peer_features_string) const;
void maybe_enable_features();
private:
void register_feature(feature* f);
void unregister_feature(feature* f);
void maybe_enable_features();
private:
seastar::metrics::metric_groups _metrics;
};
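The gossiper header above introduces `lock_endpoint()` over a `loading_shared_values` map of per-endpoint semaphores: changes to one endpoint's state (and its change listeners) are serialized without blocking work on other endpoints, and the lock entry is dropped when the last holder releases it. A minimal asyncio sketch of that per-key locking pattern (names hypothetical):

```python
import asyncio

class EndpointLocks:
    """Per-key locks created on demand and dropped when the last holder releases."""
    def __init__(self):
        self._locks = {}      # endpoint -> (lock, refcount)

    async def lock(self, endpoint):
        lock, refs = self._locks.get(endpoint, (asyncio.Lock(), 0))
        self._locks[endpoint] = (lock, refs + 1)
        await lock.acquire()

    def unlock(self, endpoint):
        lock, refs = self._locks[endpoint]
        lock.release()
        if refs == 1:
            del self._locks[endpoint]     # last reference: forget the entry
        else:
            self._locks[endpoint] = (lock, refs - 1)

async def demo():
    locks = EndpointLocks()
    order = []
    async def worker(ep, tag, delay):
        await locks.lock(ep)
        order.append(tag + ":in")
        await asyncio.sleep(delay)
        order.append(tag + ":out")
        locks.unlock(ep)
    # Two workers on the same endpoint run strictly one after the other.
    await asyncio.gather(worker("10.0.0.1", "a", 0.01),
                         worker("10.0.0.1", "b", 0))
    return order
```

In the real code the refcounting is what `loading_shared_values::entry_ptr` provides and the held units are `semaphore_units<>`; the sketch only shows the serialize-per-key shape.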


@@ -27,8 +27,9 @@ class schema_mutations {
canonical_mutation columnfamilies_canonical_mutation();
canonical_mutation columns_canonical_mutation();
bool is_view()[[version 1.6]];
std::experimental::optional<canonical_mutation> indices_canonical_mutation()[[version 1.9]];
std::experimental::optional<canonical_mutation> dropped_columns_canonical_mutation()[[version 1.9]];
std::experimental::optional<canonical_mutation> indices_canonical_mutation()[[version 2.0]];
std::experimental::optional<canonical_mutation> dropped_columns_canonical_mutation()[[version 2.0]];
std::experimental::optional<canonical_mutation> scylla_tables_canonical_mutation()[[version 2.0]];
};
class schema stub [[writable]] {


@@ -182,6 +182,9 @@ public:
static TopLevel from_exploded(const schema& s, const std::vector<bytes>& v) {
return from_exploded(v);
}
static TopLevel from_exploded_view(const std::vector<bytes_view>& v) {
return from_exploded(v);
}
// We don't allow optional values, but provide this method as an efficient adaptor
static TopLevel from_optional_exploded(const schema& s, const std::vector<bytes_opt>& v) {


@@ -126,14 +126,9 @@ private:
sstring get_endpoint_info(inet_address endpoint, gms::application_state key,
const sstring& default_val) {
gms::gossiper& local_gossiper = gms::get_local_gossiper();
auto state = local_gossiper.get_endpoint_state_for_endpoint(endpoint);
// First, look in the gossiper::endpoint_state_map...
if (state) {
auto ep_state = state->get_application_state(key);
if (ep_state) {
return ep_state->value;
}
auto* ep_state = local_gossiper.get_application_state_ptr(endpoint, key);
if (ep_state) {
return ep_state->value;
}
// ...if not found - look in the SystemTable...


@@ -56,7 +56,7 @@ private:
private:
void reconnect(gms::inet_address public_address, gms::versioned_value local_address_value) {
void reconnect(gms::inet_address public_address, const gms::versioned_value& local_address_value) {
reconnect(public_address, gms::inet_address(local_address_value.value));
}
@@ -97,10 +97,9 @@ public:
}
void on_join(gms::inet_address endpoint, gms::endpoint_state ep_state) override {
auto internal_ip_state_opt = ep_state.get_application_state(gms::application_state::INTERNAL_IP);
if (internal_ip_state_opt) {
reconnect(endpoint, *internal_ip_state_opt);
auto* internal_ip_state = ep_state.get_application_state_ptr(gms::application_state::INTERNAL_IP);
if (internal_ip_state) {
reconnect(endpoint, *internal_ip_state);
}
}
@@ -111,11 +110,7 @@ public:
}
void on_alive(gms::inet_address endpoint, gms::endpoint_state ep_state) override {
auto internal_ip_state_opt = ep_state.get_application_state(gms::application_state::INTERNAL_IP);
if (internal_ip_state_opt) {
reconnect(endpoint, *internal_ip_state_opt);
}
on_join(std::move(endpoint), std::move(ep_state));
}
void on_dead(gms::inet_address endpoint, gms::endpoint_state ep_state) override {


@@ -110,7 +110,11 @@ void token_metadata::update_normal_tokens(std::unordered_map<inet_address, std::
inet_address endpoint = i.first;
std::unordered_set<token>& tokens = i.second;
assert(!tokens.empty());
if (tokens.empty()) {
auto msg = sprint("tokens is empty in update_normal_tokens");
tlogger.error("{}", msg);
throw std::runtime_error(msg);
}
for(auto it = _token_to_endpoint_map.begin(), ite = _token_to_endpoint_map.end(); it != ite;) {
if(it->second == endpoint) {
@@ -141,7 +145,11 @@ void token_metadata::update_normal_tokens(std::unordered_map<inet_address, std::
}
size_t token_metadata::first_token_index(const token& start) const {
assert(_sorted_tokens.size() > 0);
if (_sorted_tokens.empty()) {
auto msg = sprint("sorted_tokens is empty in first_token_index!");
tlogger.error("{}", msg);
throw std::runtime_error(msg);
}
auto it = std::lower_bound(_sorted_tokens.begin(), _sorted_tokens.end(), start);
if (it == _sorted_tokens.end()) {
return 0;
@@ -292,7 +300,11 @@ void token_metadata::add_bootstrap_tokens(std::unordered_set<token> tokens, inet
}
void token_metadata::remove_bootstrap_tokens(std::unordered_set<token> tokens) {
assert(!tokens.empty());
if (tokens.empty()) {
auto msg = sprint("tokens is empty in remove_bootstrap_tokens!");
tlogger.error("{}", msg);
throw std::runtime_error(msg);
}
for (auto t : tokens) {
_bootstrap_tokens.erase(t);
}
@@ -320,7 +332,11 @@ void token_metadata::remove_from_moving(inet_address endpoint) {
token token_metadata::get_predecessor(token t) {
auto& tokens = sorted_tokens();
auto it = std::lower_bound(tokens.begin(), tokens.end(), t);
assert(it != tokens.end() && *it == t);
if (it == tokens.end() || *it != t) {
auto msg = sprint("token error in get_predecessor!");
tlogger.error("{}", msg);
throw std::runtime_error(msg);
}
if (it == tokens.begin()) {
// If the token is the first element, its predecessor is the last element
return tokens.back();
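The token_metadata hunks above replace `assert()` (which aborts the whole node) with a logged exception that fails only the calling operation. A Python sketch of that pattern applied to the `get_predecessor` logic shown in the diff (helper names hypothetical):

```python
import bisect
import logging

logger = logging.getLogger("token_metadata")

def get_predecessor(sorted_tokens, t):
    """Return the token preceding t in ring order, raising instead of asserting."""
    i = bisect.bisect_left(sorted_tokens, t)
    if i == len(sorted_tokens) or sorted_tokens[i] != t:
        msg = "token %r not found in get_predecessor" % (t,)
        logger.error(msg)
        raise RuntimeError(msg)   # recoverable by the caller, unlike assert/abort
    # The ring wraps: the first token's predecessor is the last one.
    return sorted_tokens[-1] if i == 0 else sorted_tokens[i - 1]
```

The exception propagates to the operation that hit the inconsistency while the node keeps serving, which is the point of the change.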

main.cc

@@ -59,6 +59,8 @@ thread_local disk_error_signal_type commit_error;
thread_local disk_error_signal_type general_disk_error;
seastar::metrics::metric_groups app_metrics;
using namespace std::chrono_literals;
namespace bpo = boost::program_options;
static boost::filesystem::path relative_conf_dir(boost::filesystem::path path) {
@@ -277,7 +279,10 @@ int main(int ac, char** av) {
}
runtime::init_uptime();
std::setvbuf(stdout, nullptr, _IOLBF, 1000);
app_template app;
app_template::config app_cfg;
app_cfg.name = "Scylla";
app_cfg.default_task_quota = 500us;
app_template app(std::move(app_cfg));
auto opt_add = app.add_options();
auto cfg = make_lw_shared<db::config>();
@@ -529,12 +534,12 @@ int main(int ac, char** av) {
db::get_batchlog_manager().start(std::ref(qp)).get();
// #293 - do not stop anything
// engine().at_exit([] { return db::get_batchlog_manager().stop(); });
sstables::init_metrics();
sstables::init_metrics().get();
db::system_keyspace::minimal_setup(db, qp);
// schema migration, if needed, is also done on shard 0
db::legacy_schema_migrator::migrate(qp.local()).get();
db::legacy_schema_migrator::migrate(proxy, qp.local()).get();
supervisor::notify("loading sstables");
@@ -625,13 +630,13 @@ int main(int ac, char** av) {
lb->start_broadcasting();
service::get_local_storage_service().set_load_broadcaster(lb);
engine().at_exit([lb = std::move(lb)] () mutable { return lb->stop_broadcasting(); });
supervisor::notify("starting cf cache hit rate calculator");
cf_cache_hitrate_calculator.start(std::ref(db), std::ref(cf_cache_hitrate_calculator)).get();
engine().at_exit([&cf_cache_hitrate_calculator] { return cf_cache_hitrate_calculator.stop(); });
cf_cache_hitrate_calculator.local().run_on(engine().cpu_id());
supervisor::notify("starting native transport");
gms::get_local_gossiper().wait_for_gossip_to_settle();
gms::get_local_gossiper().wait_for_gossip_to_settle().get();
api::set_server_gossip_settle(ctx).get();
supervisor::notify("starting cf cache hit rate calculator");
supervisor::notify("starting native transport");
service::get_local_storage_service().start_native_transport().get();
if (start_thrift) {
service::get_local_storage_service().start_rpc_server().get();


@@ -29,11 +29,13 @@
#include "sstables/sstables.hh"
#include <seastar/core/future.hh>
#include <seastar/core/file.hh>
#include <seastar/core/thread.hh>
future<>
write_memtable_to_sstable(memtable& mt,
sstables::shared_sstable sst,
bool backup = false,
const io_priority_class& pc = default_priority_class(),
bool leave_unsealed = false);
bool leave_unsealed = false,
seastar::thread_scheduling_group* tsg = nullptr);


@@ -514,7 +514,6 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
}();
auto remote_addr = ipv4_addr(get_preferred_ip(id.addr).raw_addr(), must_encrypt ? _ssl_port : _port);
auto local_addr = ipv4_addr{_listen_address.raw_addr(), 0};
rpc::client_options opts;
// send keepalive messages each minute if connection is idle, drop connection after 10 failures
@@ -526,9 +525,9 @@ shared_ptr<messaging_service::rpc_protocol_client_wrapper> messaging_service::ge
auto client = must_encrypt ?
::make_shared<rpc_protocol_client_wrapper>(*_rpc, std::move(opts),
remote_addr, local_addr, _credentials) :
remote_addr, ipv4_addr(), _credentials) :
::make_shared<rpc_protocol_client_wrapper>(*_rpc, std::move(opts),
remote_addr, local_addr);
remote_addr);
it = _clients[idx].emplace(id, shard_info(std::move(client))).first;
uint32_t src_cpu_id = engine().cpu_id();
@@ -640,59 +639,6 @@ auto send_message_timeout(messaging_service* ms, messaging_verb verb, msg_addr i
});
}
template <typename MsgIn, typename... MsgOut>
auto send_message_timeout_and_retry(messaging_service* ms, messaging_verb verb, msg_addr id,
std::chrono::seconds timeout, int nr_retry, std::chrono::seconds wait, MsgOut... msg) {
using MsgInTuple = typename futurize_t<MsgIn>::value_type;
return do_with(int(nr_retry), std::move(msg)..., [ms, verb, id, timeout, wait, nr_retry] (auto& retry, const auto&... messages) {
return repeat_until_value([ms, verb, id, timeout, wait, nr_retry, &retry, &messages...] {
return send_message_timeout<MsgIn>(ms, verb, id, timeout, messages...).then_wrapped(
[ms, verb, id, timeout, wait, nr_retry, &retry] (auto&& f) mutable {
auto vb = int(verb);
try {
MsgInTuple ret = f.get();
if (retry != nr_retry) {
mlogger.info("Retry verb={} to {}, retry={}: OK", vb, id, retry);
}
return make_ready_future<stdx::optional<MsgInTuple>>(std::move(ret));
} catch (rpc::timeout_error) {
mlogger.info("Retry verb={} to {}, retry={}: timeout in {} seconds", vb, id, retry, timeout.count());
throw;
} catch (rpc::closed_error) {
mlogger.info("Retry verb={} to {}, retry={}: {}", vb, id, retry, std::current_exception());
// Stop retrying if retry reaches 0 or message service is shutdown
// or the remote node is removed from gossip (on_remove())
retry--;
if (retry == 0) {
mlogger.debug("Retry verb={} to {}, retry={}: stop retrying: retry == 0", vb, id, retry);
throw;
}
if (ms->is_stopping()) {
mlogger.debug("Retry verb={} to {}, retry={}: stop retrying: messaging_service is stopped",
vb, id, retry);
throw;
}
if (!gms::get_local_gossiper().is_known_endpoint(id.addr)) {
mlogger.debug("Retry verb={} to {}, retry={}: stop retrying: node is removed from the cluster",
vb, id, retry);
throw;
}
return sleep_abortable(wait).then([] {
return make_ready_future<stdx::optional<MsgInTuple>>(stdx::nullopt);
}).handle_exception([vb, id, retry] (std::exception_ptr ep) {
mlogger.debug("Retry verb={} to {}, retry={}: stop retrying: {}", vb, id, retry, ep);
return make_exception_future<stdx::optional<MsgInTuple>>(ep);
});
} catch (...) {
throw;
}
});
}).then([ms = ms->shared_from_this()] (MsgInTuple result) {
return futurize<MsgIn>::from_tuple(std::move(result));
});
});
}
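The removed `send_message_timeout_and_retry` above retried a verb up to `nr_retry` times, sleeping between attempts and giving up early when the service was stopping or the peer had left the cluster. A synchronous Python sketch of that retry shape (function and parameter names hypothetical; the real code is future-based):

```python
import time

def send_with_retry(send, nr_retry, wait, should_stop=lambda: False):
    """Retry send() up to nr_retry times; bail out early if should_stop() turns true."""
    retry = nr_retry
    while True:
        try:
            return send()
        except ConnectionError:
            retry -= 1
            # Out of attempts, or the peer is gone / the service is stopping.
            if retry == 0 or should_stop():
                raise
            time.sleep(wait)   # streaming_wait_before_retry analogue
```

In the diff, `should_stop` corresponds to `ms->is_stopping()` and the gossiper's `is_known_endpoint()` check, and only `rpc::closed_error` is retried while `rpc::timeout_error` propagates immediately.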
// Send one way message for verb
template <typename... MsgOut>
auto send_message_oneway(messaging_service* ms, messaging_verb verb, msg_addr id, MsgOut&&... msg) {
@@ -707,13 +653,6 @@ auto send_message_oneway_timeout(messaging_service* ms, Timeout timeout, messagi
// Wrappers for verbs
// Retransmission parameters for streaming verbs.
// A stream plan gives up retrying in 10*30 + 10*60 seconds (15 minutes) at
// most, 10*30 seconds (5 minutes) at least.
static constexpr int streaming_nr_retry = 10;
static constexpr std::chrono::seconds streaming_timeout{10*60};
static constexpr std::chrono::seconds streaming_wait_before_retry{30};
// PREPARE_MESSAGE
void messaging_service::register_prepare_message(std::function<future<streaming::prepare_message> (const rpc::client_info& cinfo,
streaming::prepare_message msg, UUID plan_id, sstring description)>&& func) {
@@ -721,8 +660,7 @@ void messaging_service::register_prepare_message(std::function<future<streaming:
}
future<streaming::prepare_message> messaging_service::send_prepare_message(msg_addr id, streaming::prepare_message msg, UUID plan_id,
sstring description) {
return send_message_timeout_and_retry<streaming::prepare_message>(this, messaging_verb::PREPARE_MESSAGE, id,
streaming_timeout, streaming_nr_retry, streaming_wait_before_retry,
return send_message<streaming::prepare_message>(this, messaging_verb::PREPARE_MESSAGE, id,
std::move(msg), plan_id, std::move(description));
}
@@ -731,8 +669,7 @@ void messaging_service::register_prepare_done_message(std::function<future<> (co
register_handler(this, messaging_verb::PREPARE_DONE_MESSAGE, std::move(func));
}
future<> messaging_service::send_prepare_done_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id) {
return send_message_timeout_and_retry<void>(this, messaging_verb::PREPARE_DONE_MESSAGE, id,
streaming_timeout, streaming_nr_retry, streaming_wait_before_retry,
return send_message<void>(this, messaging_verb::PREPARE_DONE_MESSAGE, id,
plan_id, dst_cpu_id);
}
@@ -741,8 +678,7 @@ void messaging_service::register_stream_mutation(std::function<future<> (const r
register_handler(this, messaging_verb::STREAM_MUTATION, std::move(func));
}
future<> messaging_service::send_stream_mutation(msg_addr id, UUID plan_id, frozen_mutation fm, unsigned dst_cpu_id, bool fragmented) {
return send_message_timeout_and_retry<void>(this, messaging_verb::STREAM_MUTATION, id,
streaming_timeout, streaming_nr_retry, streaming_wait_before_retry,
return send_message<void>(this, messaging_verb::STREAM_MUTATION, id,
plan_id, std::move(fm), dst_cpu_id, fragmented);
}
@@ -757,19 +693,17 @@ void messaging_service::register_stream_mutation_done(std::function<future<> (co
});
}
future<> messaging_service::send_stream_mutation_done(msg_addr id, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id) {
return send_message_timeout_and_retry<void>(this, messaging_verb::STREAM_MUTATION_DONE, id,
streaming_timeout, streaming_nr_retry, streaming_wait_before_retry,
return send_message<void>(this, messaging_verb::STREAM_MUTATION_DONE, id,
plan_id, std::move(ranges), cf_id, dst_cpu_id);
}
// COMPLETE_MESSAGE
void messaging_service::register_complete_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id)>&& func) {
void messaging_service::register_complete_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id, rpc::optional<bool> failed)>&& func) {
register_handler(this, messaging_verb::COMPLETE_MESSAGE, std::move(func));
}
future<> messaging_service::send_complete_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id) {
return send_message_timeout_and_retry<void>(this, messaging_verb::COMPLETE_MESSAGE, id,
streaming_timeout, streaming_nr_retry, streaming_wait_before_retry,
plan_id, dst_cpu_id);
future<> messaging_service::send_complete_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id, bool failed) {
return send_message<void>(this, messaging_verb::COMPLETE_MESSAGE, id,
plan_id, dst_cpu_id, failed);
}
void messaging_service::register_gossip_echo(std::function<future<> ()>&& func) {
@@ -835,7 +769,7 @@ future<> messaging_service::send_definitions_update(msg_addr id, std::vector<fro
return send_message_oneway(this, messaging_verb::DEFINITIONS_UPDATE, std::move(id), std::move(fm));
}
void messaging_service::register_migration_request(std::function<future<std::vector<frozen_mutation>> ()>&& func) {
void messaging_service::register_migration_request(std::function<future<std::vector<frozen_mutation>> (const rpc::client_info&)>&& func) {
register_handler(this, netw::messaging_verb::MIGRATION_REQUEST, std::move(func));
}
void messaging_service::unregister_migration_request() {


@@ -249,8 +249,8 @@ public:
void register_stream_mutation_done(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id)>&& func);
future<> send_stream_mutation_done(msg_addr id, UUID plan_id, dht::token_range_vector ranges, UUID cf_id, unsigned dst_cpu_id);
void register_complete_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id)>&& func);
future<> send_complete_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id);
void register_complete_message(std::function<future<> (const rpc::client_info& cinfo, UUID plan_id, unsigned dst_cpu_id, rpc::optional<bool> failed)>&& func);
future<> send_complete_message(msg_addr id, UUID plan_id, unsigned dst_cpu_id, bool failed = false);
// Wrapper for REPAIR_CHECKSUM_RANGE verb
void register_repair_checksum_range(std::function<future<partition_checksum> (sstring keyspace, sstring cf, dht::token_range range, rpc::optional<repair_checksum> hash_version)>&& func);
@@ -288,7 +288,7 @@ public:
future<> send_definitions_update(msg_addr id, std::vector<frozen_mutation> fm);
// Wrapper for MIGRATION_REQUEST
void register_migration_request(std::function<future<std::vector<frozen_mutation>> ()>&& func);
void register_migration_request(std::function<future<std::vector<frozen_mutation>> (const rpc::client_info&)>&& func);
void unregister_migration_request();
future<std::vector<frozen_mutation>> send_migration_request(msg_addr id);


@@ -248,17 +248,14 @@ mutation_partition::mutation_partition(const mutation_partition& x, const schema
for (const rows_entry& e : x.range(schema, r)) {
_rows.insert(_rows.end(), *current_allocator().construct<rows_entry>(e), rows_entry::compare(schema));
}
for (auto&& rt : x._row_tombstones.slice(schema, r)) {
_row_tombstones.apply(schema, rt);
}
}
} catch (...) {
_rows.clear_and_dispose(current_deleter<rows_entry>());
throw;
}
for(auto&& r : ck_ranges) {
for (auto&& rt : x._row_tombstones.slice(schema, r)) {
_row_tombstones.apply(schema, rt);
}
}
}
mutation_partition::mutation_partition(mutation_partition&& x, const schema& schema,
@@ -932,15 +929,6 @@ rows_entry::equal(const schema& s, const rows_entry& other) const {
return equal(s, other, s);
}
position_in_partition_view rows_entry::position() const {
if (_flags._last) {
return position_in_partition_view::after_all_clustered_rows();
} else {
return position_in_partition_view(
position_in_partition_view::clustering_row_tag_t(), _key);
}
}
bool
rows_entry::equal(const schema& s, const rows_entry& other, const schema& other_schema) const {
position_in_partition::equal_compare eq(s);
@@ -2119,7 +2107,7 @@ public:
mutation_partition::mutation_partition(mutation_partition::incomplete_tag, const schema& s, tombstone t)
: _tombstone(t)
, _static_row_continuous(false)
, _static_row_continuous(!s.has_static_columns())
, _rows()
, _row_tombstones(s)
{


@@ -712,7 +712,15 @@ public:
const deletable_row& row() const {
return _row;
}
position_in_partition_view position() const;
position_in_partition_view position() const {
if (_flags._last) {
return position_in_partition_view::after_all_clustered_rows();
} else {
return position_in_partition_view(
position_in_partition_view::clustering_row_tag_t(), _key);
}
}
is_continuous continuous() const { return is_continuous(_flags._continuous); }
void set_continuous(bool value) { _flags._continuous = value; }
void set_continuous(is_continuous value) { set_continuous(bool(value)); }


@@ -62,8 +62,14 @@ auto write_counter_cell(Writer&& writer, atomic_cell_view c)
counter_cell_view ccv(c);
auto shards = std::move(value).start_value_counter_cell_full()
.start_shards();
for (auto csv : ccv.shards()) {
shards.add_shards(counter_shard(csv));
if (service::get_local_storage_service().cluster_supports_correct_counter_order()) {
for (auto csv : ccv.shards()) {
shards.add_shards(counter_shard(csv));
}
} else {
for (auto& cs : ccv.shards_compatible_with_1_7_4()) {
shards.add_shards(cs);
}
}
return std::move(shards).end_shards().end_counter_cell_full();
}


@@ -73,8 +73,9 @@ atomic_cell read_atomic_cell(atomic_cell_variant cv)
// TODO: a lot of copying for something called view
counter_cell_builder ccb; // we know the final number of shards
for (auto csv : ccv.shards()) {
ccb.add_shard(counter_shard(csv));
ccb.add_maybe_unsorted_shard(counter_shard(csv));
}
ccb.sort_and_remove_duplicates();
return ccb.build(_created_at);
}
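The change above feeds possibly-unsorted counter shards through `add_maybe_unsorted_shard()` and then `sort_and_remove_duplicates()`. Roughly, shards end up ordered by shard id and, for duplicate ids, one shard survives per id; the sketch below assumes the shard with the newest logical clock wins (an assumption about the semantics, not taken from this diff):

```python
def sort_and_remove_duplicates(shards):
    """shards: list of (shard_id, logical_clock, value) tuples, possibly unsorted."""
    merged = {}
    for shard_id, clock, value in shards:
        prev = merged.get(shard_id)
        # Assumed rule: on duplicate ids, keep the shard with the newest clock.
        if prev is None or clock > prev[0]:
            merged[shard_id] = (clock, value)
    # Emit shards back in sorted-by-id order.
    return [(sid, c, v) for sid, (c, v) in sorted(merged.items())]
```

This mirrors why the builder no longer needs to know the final number of shards up front: duplicates are resolved only after all shards are collected.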
atomic_cell operator()(ser::counter_cell_update_view& ccv) const {


@@ -105,7 +105,7 @@ partition_slice_builder::with_regular_column(bytes name) {
throw std::runtime_error(sprint("No such column: %s", _schema.regular_column_name_type()->to_string(name)));
}
if (!def->is_regular()) {
throw std::runtime_error(sprint("Column is not regular: %s", _schema.regular_column_name_type()->to_string(name)));
throw std::runtime_error(sprint("Column is not regular: %s", _schema.column_name_type(*def)->to_string(name)));
}
_regular_columns->push_back(def->id);
return *this;


@@ -41,9 +41,17 @@ inline void maybe_merge_versions(lw_shared_ptr<partition_snapshot>& snp,
with_allocator(lsa_region.allocator(), [&snp, &lsa_region, &read_section] {
return with_linearized_managed_bytes([&snp, &lsa_region, &read_section] {
try {
read_section(lsa_region, [&snp] {
snp->merge_partition_versions();
});
// Allocating sections require the region to be reclaimable
// which means that they cannot be nested.
// It is, however, possible that the snapshot was taken inside an
// allocating section and an exception was then thrown; this function
// is then called to clean up while we are still in the context of
// that allocating section.
if (lsa_region.reclaiming_enabled()) {
read_section(lsa_region, [&snp] {
snp->merge_partition_versions();
});
}
} catch (...) { }
snp = {};
});

@@ -34,6 +34,8 @@
// When the cursor is invalidated, it still maintains its previous position. It can be brought
// back to validity by calling maybe_refresh(), or advance_to().
//
// Insertion of row entries after cursor's position invalidates the cursor.
//
class partition_snapshot_row_cursor final {
struct position_in_version {
mutation_partition::rows_type::iterator it;
@@ -55,6 +57,7 @@ class partition_snapshot_row_cursor final {
logalloc::region& _region;
partition_snapshot& _snp;
std::vector<position_in_version> _heap;
std::vector<mutation_partition::rows_type::iterator> _iterators;
std::vector<position_in_version> _current_row;
position_in_partition _position;
uint64_t _last_reclaim_count = 0;
@@ -78,13 +81,16 @@ public:
, _snp(snp)
, _position(position_in_partition::static_row_tag_t{})
{ }
bool has_up_to_date_row_from_latest_version() const {
return up_to_date() && _current_row[0].version_no == 0;
bool has_valid_row_from_latest_version() const {
return iterators_valid() && _current_row[0].version_no == 0;
}
mutation_partition::rows_type::iterator get_iterator_in_latest_version() const {
return _current_row[0].it;
return _iterators[0];
}
bool up_to_date() const {
// Returns true iff the iterators obtained since the cursor was last made valid
// are still valid. Note that this doesn't mean that the cursor itself is valid.
bool iterators_valid() const {
return _region.reclaim_counter() == _last_reclaim_count && _last_versions_count == _snp.version_count();
}
@@ -97,9 +103,40 @@ public:
//
// but avoids work if not necessary.
bool maybe_refresh() {
if (!up_to_date()) {
if (!iterators_valid()) {
return advance_to(_position);
}
// Refresh latest version's iterator in case there was an insertion
// before it and after cursor's position. There cannot be any
// insertions for non-latest versions, so we don't have to update them.
if (_current_row[0].version_no != 0) {
rows_entry::compare less(_schema);
position_in_partition::equal_compare eq(_schema);
position_in_version::less_compare heap_less(_schema);
auto& rows = _snp.version()->partition().clustered_rows();
auto it = _iterators[0] = rows.lower_bound(_position, less);
auto heap_i = boost::find_if(_heap, [](auto&& v) { return v.version_no == 0; });
if (it == rows.end()) {
if (heap_i != _heap.end()) {
_heap.erase(heap_i);
boost::range::make_heap(_heap, heap_less);
}
} else if (eq(_position, it->position())) {
_current_row.insert(_current_row.begin(), position_in_version{it, rows.end(), 0});
if (heap_i != _heap.end()) {
_heap.erase(heap_i);
boost::range::make_heap(_heap, heap_less);
}
} else {
if (heap_i != _heap.end()) {
heap_i->it = it;
boost::range::make_heap(_heap, heap_less);
} else {
_heap.push_back({it, rows.end(), 0});
boost::range::push_heap(_heap, heap_less);
}
}
}
return true;
}
@@ -119,11 +156,13 @@ public:
position_in_version::less_compare heap_less(_schema);
_heap.clear();
_current_row.clear();
_iterators.clear();
int version_no = 0;
for (auto&& v : _snp.versions()) {
auto& rows = v.partition().clustered_rows();
auto pos = rows.lower_bound(lower_bound, less);
auto end = rows.end();
_iterators.push_back(pos);
if (pos != end) {
_heap.push_back({pos, end, version_no});
}
@@ -142,9 +181,10 @@ public:
// Can be only called on a valid cursor pointing at a row.
bool next() {
position_in_version::less_compare heap_less(_schema);
assert(up_to_date());
assert(iterators_valid());
for (auto&& curr : _current_row) {
++curr.it;
_iterators[curr.version_no] = curr.it;
if (curr.it != curr.end) {
_heap.push_back(curr);
boost::range::push_heap(_heap, heap_less);
@@ -168,12 +208,14 @@ public:
const clustering_key& key() const { return _current_row[0].it->key(); }
// Can be called only when cursor is valid and pointing at a row.
clustering_row row() const {
clustering_row result(key());
for (auto&& v : _current_row) {
result.apply(_schema, *v.it);
mutation_fragment row() const {
auto it = _current_row.begin();
auto mf = mutation_fragment(clustering_row(*it->it));
auto& cr = mf.as_mutable_clustering_row();
for (++it; it != _current_row.end(); ++it) {
cr.apply(_schema, *it->it);
}
return result;
return mf;
}
// Can be called when cursor is pointing at a row, even when invalid.
@@ -184,6 +226,32 @@ public:
bool is_in_latest_version() const;
bool previous_row_in_latest_version_has_key(const clustering_key_prefix& key) const;
void set_continuous(bool val);
friend std::ostream& operator<<(std::ostream& out, const partition_snapshot_row_cursor& cur) {
out << "{cursor: position=" << cur._position << ", ";
if (!cur.iterators_valid()) {
return out << " iterators invalid}";
}
out << "current=[";
bool first = true;
for (auto&& v : cur._current_row) {
if (!first) {
out << ", ";
}
first = false;
out << v.version_no;
}
out << "], heap=[";
first = true;
for (auto&& v : cur._heap) {
if (!first) {
out << ", ";
}
first = false;
out << "{v=" << v.version_no << ", pos=" << v.it->position() << "}";
}
return out << "]}";
};
};
inline
@@ -198,8 +266,8 @@ bool partition_snapshot_row_cursor::previous_row_in_latest_version_has_key(const
}
auto prev_it = _current_row[0].it;
--prev_it;
clustering_key_prefix::tri_compare tri_comp(_schema);
return tri_comp(prev_it->key(), key) == 0;
clustering_key_prefix::equality eq(_schema);
return eq(prev_it->key(), key);
}
inline

@@ -478,9 +478,9 @@ void partition_entry::apply_to_incomplete(const schema& s, partition_version* ve
}
range_tombstone_list& tombstones = dst.partition().row_tombstones();
if (can_move) {
tombstones.apply_reversibly(s, current->partition().row_tombstones()).cancel();
tombstones.apply_monotonically(s, std::move(current->partition().row_tombstones()));
} else {
tombstones.apply(s, current->partition().row_tombstones());
tombstones.apply_monotonically(s, current->partition().row_tombstones());
}
current = current->next();
}
@@ -545,13 +545,19 @@ lw_shared_ptr<partition_snapshot> partition_entry::read(schema_ptr entry_schema,
std::vector<range_tombstone>
partition_snapshot::range_tombstones(const schema& s, position_in_partition_view start, position_in_partition_view end)
{
partition_version* v = &*version();
if (!v->next()) {
return boost::copy_range<std::vector<range_tombstone>>(
v->partition().row_tombstones().slice(s, start, end));
}
range_tombstone_list list(s);
for (auto&& v : versions()) {
for (auto&& rt : v.partition().row_tombstones().slice(s, start, end)) {
while (v) {
for (auto&& rt : v->partition().row_tombstones().slice(s, start, end)) {
list.apply(s, rt);
}
v = v->next();
}
return boost::copy_range<std::vector<range_tombstone>>(list);
return boost::copy_range<std::vector<range_tombstone>>(list.slice(s, start, end));
}
std::ostream& operator<<(std::ostream& out, partition_entry& e) {

@@ -352,10 +352,10 @@ public:
return *this;
}
}
template<typename Transformer, typename U = typename std::result_of<Transformer(T)>::type>
static stdx::optional<typename wrapping_range<U>::bound> transform_bound(optional<bound> b, Transformer&& transformer) {
template<typename Bound, typename Transformer, typename U = typename std::result_of<Transformer(T)>::type>
static stdx::optional<typename wrapping_range<U>::bound> transform_bound(Bound&& b, Transformer&& transformer) {
if (b) {
return { { transformer(std::move(*b).value()), b->is_inclusive() } };
return { { transformer(std::forward<Bound>(b).value().value()), b->is_inclusive() } };
};
return {};
}

@@ -58,9 +58,10 @@ void range_tombstone_list::apply_reversibly(const schema& s,
insert_from(s, std::move(it), std::move(start), start_kind, std::move(end), end_kind, std::move(tomb), rev);
return;
}
auto rt = current_allocator().construct<range_tombstone>(
std::move(start), start_kind, std::move(end), end_kind, std::move(tomb));
auto rt = alloc_strategy_unique_ptr<range_tombstone>(current_allocator().construct<range_tombstone>(
std::move(start), start_kind, std::move(end), end_kind, std::move(tomb)));
rev.insert(_tombstones.end(), *rt);
rt.release();
}
/*
@@ -104,9 +105,11 @@ void range_tombstone_list::insert_from(const schema& s,
if (it->tomb == tomb && end_bound.adjacent(s, it->start_bound())) {
rev.update(it, {std::move(start), start_kind, it->end, it->end_kind, tomb});
} else {
auto rt = current_allocator().construct<range_tombstone>(std::move(start), start_kind, std::move(end),
end_kind, tomb);
auto rt = alloc_strategy_unique_ptr<range_tombstone>(
current_allocator().construct<range_tombstone>(std::move(start), start_kind, std::move(end),
end_kind, tomb));
rev.insert(it, *rt);
rt.release();
}
return;
}
@@ -121,6 +124,7 @@ void range_tombstone_list::insert_from(const schema& s,
if (less(end_bound, it->end_bound())) {
end = it->end;
end_kind = it->end_kind;
end_bound = bound_view(end, end_kind);
}
it = rev.erase(it);
} else if (c > 0) {
@@ -133,7 +137,8 @@ void range_tombstone_list::insert_from(const schema& s,
auto rt = alloc_strategy_unique_ptr<range_tombstone>(
current_allocator().construct<range_tombstone>(it->start_bound(), new_end, it->tomb));
rev.update(it, {start_bound, it->end_bound(), it->tomb});
rev.insert(it, *rt.release());
rev.insert(it, *rt);
rt.release();
}
}
@@ -142,7 +147,8 @@ void range_tombstone_list::insert_from(const schema& s,
auto rt = alloc_strategy_unique_ptr<range_tombstone>(
current_allocator().construct<range_tombstone>(std::move(start), start_kind, end, end_kind, std::move(tomb)));
rev.update(it, {std::move(end), invert_kind(end_kind), it->end, it->end_kind, it->tomb});
rev.insert(it, *rt.release());
rev.insert(it, *rt);
rt.release();
return;
}
@@ -157,16 +163,18 @@ void range_tombstone_list::insert_from(const schema& s,
// Here start < it->start and it->start < end.
auto new_end_kind = invert_kind(it->start_kind);
if (!less(bound_view(it->start, new_end_kind), start_bound)) {
auto rt = current_allocator().construct<range_tombstone>(
std::move(start), start_kind, it->start, new_end_kind, tomb);
auto rt = alloc_strategy_unique_ptr<range_tombstone>(current_allocator().construct<range_tombstone>(
std::move(start), start_kind, it->start, new_end_kind, tomb));
it = rev.insert(it, *rt);
rt.release();
++it;
}
} else {
// Here start < it->start and end <= it->start, so just insert the new tombstone.
auto rt = current_allocator().construct<range_tombstone>(
std::move(start), start_kind, std::move(end), end_kind, std::move(tomb));
auto rt = alloc_strategy_unique_ptr<range_tombstone>(current_allocator().construct<range_tombstone>(
std::move(start), start_kind, std::move(end), end_kind, std::move(tomb)));
rev.insert(it, *rt);
rt.release();
return;
}
}
@@ -184,9 +192,10 @@ void range_tombstone_list::insert_from(const schema& s,
}
// If we got here, then just insert the remainder at the end.
auto rt = current_allocator().construct<range_tombstone>(
std::move(start), start_kind, std::move(end), end_kind, std::move(tomb));
auto rt = alloc_strategy_unique_ptr<range_tombstone>(current_allocator().construct<range_tombstone>(
std::move(start), start_kind, std::move(end), end_kind, std::move(tomb)));
rev.insert(it, *rt);
rt.release();
}
range_tombstone_list::range_tombstones_type::iterator range_tombstone_list::find(const schema& s, const range_tombstone& rt) {
@@ -355,6 +364,7 @@ range_tombstone_list::reverter::insert(range_tombstones_type::iterator it, range
range_tombstone_list::range_tombstones_type::iterator
range_tombstone_list::reverter::erase(range_tombstones_type::iterator it) {
_ops.reserve(_ops.size() + 1);
_ops.emplace_back(erase_undo_op(*it));
return _dst._tombstones.erase(it);
}
@@ -413,3 +423,27 @@ bool range_tombstone_list::equal(const schema& s, const range_tombstone_list& ot
return rt1.equal(s, rt2);
});
}
void range_tombstone_list::apply_monotonically(const schema& s, range_tombstone_list&& list) {
auto del = current_deleter<range_tombstone>();
auto it = list.begin();
while (it != list.end()) {
// FIXME: Optimize by stealing the entry
apply_monotonically(s, *it);
it = list._tombstones.erase_and_dispose(it, del);
}
}
void range_tombstone_list::apply_monotonically(const schema& s, const range_tombstone_list& list) {
for (auto&& rt : list) {
apply_monotonically(s, rt);
}
}
void range_tombstone_list::apply_monotonically(const schema& s, const range_tombstone& rt) {
// FIXME: Optimize given this has relaxed exception guarantees.
// Note that apply() doesn't have monotonic guarantee because it doesn't restore erased entries.
reverter rev(s, *this);
apply_reversibly(s, rt.start, rt.start_kind, rt.end, rt.end_kind, rt.tomb, rev);
rev.cancel();
}

@@ -138,6 +138,19 @@ public:
nop_reverter rev(s, *this);
apply_reversibly(s, std::move(start), start_kind, std::move(end), end_kind, std::move(tomb), rev);
}
// Monotonic exception guarantees. In case of failure the object will contain at least as much information as before the call.
void apply_monotonically(const schema& s, const range_tombstone& rt);
// Merges another list with this object.
// Monotonic exception guarantees. In case of failure the object will contain at least as much information as before the call.
void apply_monotonically(const schema& s, const range_tombstone_list& list);
/// Merges another list with this object.
/// The other list must be governed by the same allocator as this object.
///
/// Monotonic exception guarantees. In case of failure the object will contain at least as much information as before the call.
/// The other list will be left in a state such that it would still commute with this object to the same state as it
/// would if the call didn't fail.
void apply_monotonically(const schema& s, range_tombstone_list&& list);
public:
tombstone search_tombstone_covering(const schema& s, const clustering_key_prefix& key) const;
// Returns range of tombstones which overlap with given range
boost::iterator_range<const_iterator> slice(const schema& s, const query::clustering_range&) const;

@@ -71,7 +71,11 @@ public:
_range = std::move(*new_range);
_last_key = {};
}
if (_reader) {
++_cache._tracker._stats.underlying_recreations;
}
auto& snap = _cache.snapshot_for_phase(phase);
_reader = {}; // See issue #2644
_reader = _cache.create_underlying_reader(_read_context, snap, _range);
_reader_creation_phase = phase;
}
@@ -90,8 +94,14 @@ public:
_range = std::move(range);
_last_key = { };
_new_last_key = { };
if (_reader && _reader_creation_phase == phase) {
return _reader->fast_forward_to(_range);
if (_reader) {
if (_reader_creation_phase == phase) {
++_cache._tracker._stats.underlying_partition_skips;
return _reader->fast_forward_to(_range);
} else {
++_cache._tracker._stats.underlying_recreations;
_reader = {}; // See issue #2644
}
}
_reader = _cache.create_underlying_reader(_read_context, snapshot, _range);
_reader_creation_phase = phase;
@@ -121,6 +131,7 @@ class read_context final : public enable_lw_shared_from_this<read_context> {
mutation_reader::forwarding _fwd_mr;
bool _range_query;
autoupdating_underlying_reader _underlying;
uint64_t _underlying_created = 0;
// When reader enters a partition, it must be set up for reading that
// partition from the underlying mutation source (_sm) in one of two ways:
@@ -155,7 +166,18 @@ public:
, _fwd_mr(fwd_mr)
, _range_query(!range.is_singular() || !range.start()->value().has_key())
, _underlying(_cache, *this)
{ }
{
++_cache._tracker._stats.reads;
}
~read_context() {
++_cache._tracker._stats.reads_done;
if (_underlying_created) {
_cache._stats.reads_with_misses.mark();
++_cache._tracker._stats.reads_with_misses;
} else {
_cache._stats.reads_with_no_misses.mark();
}
}
read_context(const read_context&) = delete;
row_cache& cache() { return _cache; }
const schema_ptr& schema() const { return _schema; }
@@ -169,6 +191,7 @@ public:
autoupdating_underlying_reader& underlying() { return _underlying; }
row_cache::phase_type phase() const { return _phase; }
const dht::decorated_key& key() const { return _sm->decorated_key(); }
void on_underlying_created() { ++_underlying_created; }
private:
future<> create_sm();
future<> ensure_sm_created() {
@@ -198,9 +221,17 @@ public:
// Fast forwards the underlying streamed_mutation to given range.
future<> fast_forward_to(position_range range) {
return ensure_sm_created().then([this, range = std::move(range)] () mutable {
++_cache._tracker._stats.underlying_row_skips;
return _sm->fast_forward_to(std::move(range));
});
}
// Returns the underlying streamed_mutation.
// The caller has to ensure that the streamed mutation was already created
// (e.g. the most recent call to enter_partition(const dht::decorated_key&, ...)
// was followed by a call to fast_forward_to()).
streamed_mutation& get_streamed_mutation() noexcept {
return *_sm;
}
// Gets the next fragment from the underlying streamed_mutation
future<mutation_fragment_opt> get_next_fragment() {
return ensure_sm_created().then([this] {

@@ -41,11 +41,6 @@
static logging::logger rlogger("repair");
struct failed_range {
sstring cf;
::dht::token_range range;
};
class repair_info {
public:
seastar::sharded<database>& db;
@@ -56,7 +51,7 @@ public:
shard_id shard;
std::vector<sstring> data_centers;
std::vector<sstring> hosts;
std::vector<failed_range> failed_ranges;
size_t nr_failed_ranges = 0;
// Map of peer -> <cf, ranges>
std::unordered_map<gms::inet_address, std::unordered_map<sstring, dht::token_range_vector>> ranges_need_repair_in;
std::unordered_map<gms::inet_address, std::unordered_map<sstring, dht::token_range_vector>> ranges_need_repair_out;
@@ -132,14 +127,11 @@ public:
});
}
void check_failed_ranges() {
if (failed_ranges.empty()) {
rlogger.info("repair {} on shard {} completed successfully", id, shard);
if (nr_failed_ranges) {
rlogger.info("repair {} on shard {} failed - {} ranges failed", id, shard, nr_failed_ranges);
throw std::runtime_error(sprint("repair %d on shard %d failed to do checksum for %d sub ranges", id, shard, nr_failed_ranges));
} else {
rlogger.info("repair {} on shard {} failed - {} ranges failed", id, shard, failed_ranges.size());
for (auto& frange: failed_ranges) {
rlogger.info("repair cf {} range {} failed", frange.cf, frange.range);
}
throw std::runtime_error(sprint("repair %d on shard %d failed to do checksum for %d sub ranges", id, shard, failed_ranges.size()));
rlogger.info("repair {} on shard {} completed successfully", id, shard);
}
}
future<> request_transfer_ranges(const sstring& cf,
@@ -504,6 +496,19 @@ static future<partition_checksum> checksum_range_shard(database &db,
});
}
// It is counter-productive to allow a large number of range checksum
// operations to proceed in parallel (on the same shard), because the read
// operation can already parallelize itself as much as needed, and doing
// multiple reads in parallel just adds a lot of memory overheads.
// So checksum_parallelism_semaphore is used to limit this parallelism,
// and should be set to 1, or another small number.
//
// Note that checksum_parallelism_semaphore applies not just in the
// repair master, but also in the slave: The repair slave may receive many
// checksum requests in parallel, but will only work on one or a few
// (checksum_parallelism_semaphore) at once.
static thread_local semaphore checksum_parallelism_semaphore(2);
// Calculate the checksum of the data held on all shards of a column family,
// in the given token range.
// In practice, we only need to consider one or two shards which intersect the
@@ -526,7 +531,9 @@ future<partition_checksum> checksum_range(seastar::sharded<database> &db,
auto& prs = shard_range.second;
return db.invoke_on(shard, [keyspace, cf, prs = std::move(prs), hash_version] (database& db) mutable {
return do_with(std::move(keyspace), std::move(cf), std::move(prs), [&db, hash_version] (auto& keyspace, auto& cf, auto& prs) {
return checksum_range_shard(db, keyspace, cf, prs, hash_version);
return seastar::with_semaphore(checksum_parallelism_semaphore, 1, [&db, hash_version, &keyspace, &cf, &prs] {
return checksum_range_shard(db, keyspace, cf, prs, hash_version);
});
});
}).then([&result] (partition_checksum sum) {
result.add(sum);
@@ -537,14 +544,15 @@ future<partition_checksum> checksum_range(seastar::sharded<database> &db,
});
}
// We don't need to wait for one checksum to finish before we start the
// next, but doing too many of these operations in parallel also doesn't
// make sense, so we limit the number of concurrent ongoing checksum
// requests with a semaphore.
//
// FIXME: We shouldn't use a magic number here, but rather bind it to
// some resource. Otherwise we'll be doing too little in some machines,
// and too much in others.
// parallelism_semaphore limits the number of parallel ongoing checksum
// comparisons. This could mean, for example, that this number of checksum
// requests have been sent to other nodes and we are waiting for them to
// return so we can compare those to our own checksums. This limit can be
// set fairly high because the outstanding comparisons take only few
// resources. In particular, we do NOT do this number of file reads in
// parallel because file reads have large memory overheads (read buffers,
// partitions, etc.) - the number of concurrent reads is further limited
// by an additional semaphore checksum_parallelism_semaphore (see above).
//
// FIXME: This would be better off in a repair service, or even a per-shard
// repair instance holding all repair state. However, since we are anyway
@@ -576,7 +584,6 @@ static future<uint64_t> estimate_partitions(seastar::sharded<database>& db, cons
static future<> repair_cf_range(repair_info& ri,
sstring cf, ::dht::token_range range,
const std::vector<gms::inet_address>& neighbors) {
ri.ranges_index++;
if (neighbors.empty()) {
// Nothing to do in this case...
return make_ready_future<>();
@@ -584,8 +591,6 @@ static future<> repair_cf_range(repair_info& ri,
return estimate_partitions(ri.db, ri.keyspace, cf, range).then([&ri, cf, range, &neighbors] (uint64_t estimated_partitions) {
range_splitter ranges(range, estimated_partitions, ri.target_partitions);
rlogger.info("Repair {} out of {} ranges, id={}, shard={}, keyspace={}, cf={}, range={}, target_partitions={}, estimated_partitions={}",
ri.ranges_index, ri.ranges.size(), ri.id, ri.shard, ri.keyspace, cf, range, ri.target_partitions, estimated_partitions);
return do_with(seastar::gate(), true, std::move(cf), std::move(ranges),
[&ri, &neighbors] (auto& completion, auto& success, const auto& cf, auto& ranges) {
return do_until([&ranges] () { return !ranges.has_next(); },
@@ -626,7 +631,7 @@ static future<> repair_cf_range(repair_info& ri,
utils::fb_utilities::get_broadcast_address()),
checksums[i].get_exception());
success = false;
ri.failed_ranges.push_back(failed_range{cf, range});
ri.nr_failed_ranges++;
// Do not break out of the loop here, so we can log
// (and discard) all the exceptions.
} else if (i > 0) {
@@ -751,7 +756,7 @@ static future<> repair_cf_range(repair_info& ri,
// any case, we need to remember that the repair failed to
// tell the caller.
success = false;
ri.failed_ranges.push_back(failed_range{cf, range});
ri.nr_failed_ranges++;
rlogger.warn("Failed sync of range {}: {}", range, eptr);
}).finally([&completion] {
parallelism_semaphore.signal(1);
@@ -997,8 +1002,22 @@ static future<> repair_ranges(repair_info ri) {
// repair all the ranges in sequence
return do_for_each(ri.ranges, [&ri] (auto&& range) {
#endif
check_in_shutdown();
return repair_range(ri, range);
ri.ranges_index++;
rlogger.info("Repair {} out of {} ranges, id={}, shard={}, keyspace={}, table={}, range={}",
ri.ranges_index, ri.ranges.size(), ri.id, ri.shard, ri.keyspace, ri.cfs, range);
return do_with(dht::selective_token_range_sharder(range, ri.shard), [&ri] (auto& sharder) {
return repeat([&ri, &sharder] () {
check_in_shutdown();
auto range_shard = sharder.next();
if (range_shard) {
return repair_range(ri, *range_shard).then([] {
return make_ready_future<stop_iteration>(stop_iteration::no);
});
} else {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
});
});
}).then([&ri] {
// Do streaming for the remaining ranges we do not stream in
// repair_cf_range
@@ -1013,27 +1032,6 @@ static future<> repair_ranges(repair_info ri) {
});
}
static void split_and_add(std::vector<::dht::token_range>& ranges,
const dht::token_range& range) {
// The use of minimum_token() here twice is not a typo - because wrap-
// around token ranges are supported by midpoint(), the beyond-maximum
// token can also be represented by minimum_token().
auto midpoint = dht::global_partitioner().midpoint(
range.start() ? range.start()->value() : dht::minimum_token(),
range.end() ? range.end()->value() : dht::minimum_token());
// This shouldn't happen, but if the range included just one token, we
// can't split further (split() may actually fail with an assertion failure)
if ((range.start() && midpoint == range.start()->value()) ||
(range.end() && midpoint == range.end()->value())) {
ranges.push_back(range);
return;
}
auto halves = range.split(midpoint, dht::token_comparator());
ranges.push_back(halves.first);
ranges.push_back(halves.second);
}
// repair_start() can run on any cpu; It runs on cpu0 the function
// do_repair_start(). The benefit of always running that function on the same
// CPU is that it allows us to keep some state (like a list of ongoing
@@ -1053,6 +1051,10 @@ static int do_repair_start(seastar::sharded<database>& db, sstring keyspace,
rlogger.info("starting user-requested repair for keyspace {}, repair id {}, options {}", keyspace, id, options_map);
repair_tracker.start(id);
if (!gms::get_local_gossiper().is_normal(utils::fb_utilities::get_broadcast_address())) {
throw std::runtime_error("Node is not in NORMAL status yet!");
}
// If the "ranges" option is not explicitly specified, we repair all the
// local ranges (the token ranges of which this node holds a replica).
// Each of these ranges may have a different set of replicas, so the
@@ -1125,35 +1127,12 @@ static int do_repair_start(seastar::sharded<database>& db, sstring keyspace,
cfs = list_column_families(db.local(), keyspace);
}
// Split the ranges so that we end up with more ranges than smp::count.
// Note that the split is not guaranteed when a range cannot be split
// any further.
dht::token_range_vector tosplit;
while (ranges.size() < smp::count) {
size_t sz = ranges.size();
tosplit.clear();
ranges.swap(tosplit);
for (const auto& range : tosplit) {
split_and_add(ranges, range);
}
if (sz == ranges.size()) {
// We can not split the ranges anymore
break;
}
}
std::map<shard_id, dht::token_range_vector> shard_ranges_map;
unsigned idx = 0;
for (auto& range : ranges) {
shard_ranges_map[idx++ % smp::count].push_back(std::move(range));
}
std::vector<future<>> repair_results;
repair_results.reserve(shard_ranges_map.size());
repair_results.reserve(smp::count);
for (auto& x : shard_ranges_map) {
shard_id shard = x.first;
auto& ranges = x.second;
auto f = db.invoke_on(shard, [keyspace, cfs, id, ranges = std::move(ranges),
for (auto shard : boost::irange(unsigned(0), smp::count)) {
auto f = db.invoke_on(shard, [keyspace, cfs, id, ranges,
data_centers = options.data_centers, hosts = options.hosts] (database& localdb) mutable {
return repair_ranges(repair_info(service::get_local_storage_service().db(),
std::move(keyspace), std::move(ranges), std::move(cfs),

@@ -46,6 +46,7 @@ thread_local seastar::thread_scheduling_group row_cache::_update_thread_scheduli
mutation_reader
row_cache::create_underlying_reader(read_context& ctx, mutation_source& src, const dht::partition_range& pr) {
ctx.on_underlying_created();
return src(_schema, pr, ctx.slice(), ctx.pc(), ctx.trace_state(), streamed_mutation::forwarding::yes);
}
@@ -74,7 +75,7 @@ cache_tracker::cache_tracker() {
}
evict_last(_lru);
--_stats.partitions;
++_stats.evictions;
++_stats.partition_evictions;
++_stats.modification_count;
return memory::reclaiming_result::reclaimed_something;
} catch (std::bad_alloc&) {
@@ -98,15 +99,24 @@ cache_tracker::setup_metrics() {
_metrics.add_group("cache", {
sm::make_gauge("bytes_used", sm::description("current bytes used by the cache out of the total size of memory"), [this] { return _region.occupancy().used_space(); }),
sm::make_gauge("bytes_total", sm::description("total size of memory for the cache"), [this] { return _region.occupancy().total_space(); }),
sm::make_derive("total_operations_hits", sm::description("total number of operation hits"), _stats.hits),
sm::make_derive("total_operations_misses", sm::description("total number of operation misses"), _stats.misses),
sm::make_derive("total_operations_insertions", sm::description("total number of operation insert"), _stats.insertions),
sm::make_derive("total_operations_concurrent_misses_same_key", sm::description("total number of operation with misses same key"), _stats.concurrent_misses_same_key),
sm::make_derive("total_operations_merges", sm::description("total number of operation merged"), _stats.merges),
sm::make_derive("total_operations_evictions", sm::description("total number of operation eviction"), _stats.evictions),
sm::make_derive("total_operations_removals", sm::description("total number of operation removals"), _stats.removals),
sm::make_derive("total_operations_mispopulations", sm::description("number of entries not inserted by reads"), _stats.mispopulations),
sm::make_gauge("objects_partitions", sm::description("total number of partition objects"), _stats.partitions)
sm::make_derive("partition_hits", sm::description("number of partitions needed by reads and found in cache"), _stats.partition_hits),
sm::make_derive("partition_misses", sm::description("number of partitions needed by reads and missing in cache"), _stats.partition_misses),
sm::make_derive("partition_insertions", sm::description("total number of partitions added to cache"), _stats.partition_insertions),
sm::make_derive("row_hits", sm::description("total number of rows needed by reads and found in cache"), _stats.row_hits),
sm::make_derive("row_misses", sm::description("total number of rows needed by reads and missing in cache"), _stats.row_misses),
sm::make_derive("row_insertions", sm::description("total number of rows added to cache"), _stats.row_insertions),
sm::make_derive("concurrent_misses_same_key", sm::description("total number of operation with misses same key"), _stats.concurrent_misses_same_key),
sm::make_derive("partition_merges", sm::description("total number of partitions merged"), _stats.partition_merges),
sm::make_derive("partition_evictions", sm::description("total number of evicted partitions"), _stats.partition_evictions),
sm::make_derive("partition_removals", sm::description("total number of invalidated partitions"), _stats.partition_removals),
sm::make_derive("mispopulations", sm::description("number of entries not inserted by reads"), _stats.mispopulations),
sm::make_gauge("partitions", sm::description("total number of cached partitions"), _stats.partitions),
sm::make_derive("reads", sm::description("number of started reads"), _stats.reads),
sm::make_derive("reads_with_misses", sm::description("number of reads which had to read from sstables"), _stats.reads_with_misses),
sm::make_gauge("active_reads", sm::description("number of currently active reads"), [this] { return _stats.active_reads(); }),
sm::make_derive("sstable_reader_recreations", sm::description("number of times sstable reader was recreated due to memtable flush"), _stats.underlying_recreations),
sm::make_derive("sstable_partition_skips", sm::description("number of times sstable reader was fast forwarded across partitions"), _stats.underlying_partition_skips),
sm::make_derive("sstable_row_skips", sm::description("number of times sstable reader was fast forwarded within a partition"), _stats.underlying_row_skips),
});
}
@@ -127,7 +137,7 @@ void cache_tracker::clear() {
};
clear(_lru);
});
_stats.removals += _stats.partitions;
_stats.partition_removals += _stats.partitions;
_stats.partitions = 0;
++_stats.modification_count;
}
@@ -141,7 +151,7 @@ void cache_tracker::touch(cache_entry& e) {
}
void cache_tracker::insert(cache_entry& entry) {
++_stats.insertions;
++_stats.partition_insertions;
++_stats.partitions;
++_stats.modification_count;
_lru.push_front(entry);
@@ -149,20 +159,28 @@ void cache_tracker::insert(cache_entry& entry) {
void cache_tracker::on_erase() {
--_stats.partitions;
++_stats.removals;
++_stats.partition_removals;
++_stats.modification_count;
}
void cache_tracker::on_merge() {
++_stats.merges;
++_stats.partition_merges;
}
void cache_tracker::on_hit() {
++_stats.hits;
void cache_tracker::on_partition_hit() {
++_stats.partition_hits;
}
void cache_tracker::on_miss() {
++_stats.misses;
void cache_tracker::on_partition_miss() {
++_stats.partition_misses;
}
void cache_tracker::on_row_hit() {
++_stats.row_hits;
}
void cache_tracker::on_row_miss() {
++_stats.row_misses;
}
void cache_tracker::on_mispopulate() {
@@ -348,14 +366,30 @@ void cache_tracker::clear_continuity(cache_entry& ce) {
ce.set_continuous(false);
}
void row_cache::on_hit() {
_stats.hits.mark();
_tracker.on_hit();
void row_cache::on_partition_hit() {
_tracker.on_partition_hit();
}
void row_cache::on_miss() {
void row_cache::on_partition_miss() {
_tracker.on_partition_miss();
}
void row_cache::on_row_hit() {
_stats.hits.mark();
_tracker.on_row_hit();
}
void row_cache::on_mispopulate() {
_tracker.on_mispopulate();
}
void row_cache::on_row_miss() {
_stats.misses.mark();
_tracker.on_miss();
_tracker.on_row_miss();
}
void row_cache::on_row_insert() {
++_tracker._stats.row_insertions;
}
class range_populating_reader {
@@ -369,6 +403,7 @@ private:
}
void handle_end_of_stream() {
if (!can_set_continuity()) {
_cache.on_mispopulate();
return;
}
if (!_reader.range().end() || !_reader.range().end()->is_inclusive()) {
@@ -379,11 +414,15 @@ private:
if (it == _cache._partitions.begin()) {
if (!_last_key->_key) {
it->set_continuous(true);
} else {
_cache.on_mispopulate();
}
} else {
auto prev = std::prev(it);
if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
it->set_continuous(true);
} else {
_cache.on_mispopulate();
}
}
}
@@ -403,17 +442,17 @@ public:
handle_end_of_stream();
return std::move(smopt);
}
_cache.on_miss();
_cache.on_partition_miss();
if (_reader.creation_phase() == _cache.phase_of(smopt->decorated_key())) {
return _cache._read_section(_cache._tracker.region(), [&] {
cache_entry& e = _cache.find_or_create(smopt->decorated_key(), smopt->partition_tombstone(), _reader.creation_phase(),
can_set_continuity() ? &*_last_key : nullptr);
_last_key = smopt->decorated_key();
_last_key = row_cache::previous_entry_pointer(smopt->decorated_key());
return e.read(_cache, _read_context, std::move(*smopt), _reader.creation_phase());
});
} else {
_cache._tracker.on_mispopulate();
_last_key = smopt->decorated_key();
_last_key = row_cache::previous_entry_pointer(smopt->decorated_key());
return read_directly_from_underlying(std::move(*smopt), _read_context);
}
}
@@ -424,7 +463,7 @@ public:
if (!pr.start()) {
_last_key = row_cache::previous_entry_pointer();
} else if (!pr.start()->is_inclusive() && pr.start()->value().has_key()) {
_last_key = pr.start()->value().as_decorated_key();
_last_key = row_cache::previous_entry_pointer(pr.start()->value().as_decorated_key());
} else {
// Inclusive start bound, cannot set continuity flag.
_last_key = {};
@@ -448,7 +487,7 @@ private:
streamed_mutation read_from_entry(cache_entry& ce) {
_cache.upgrade_entry(ce);
_cache._tracker.touch(ce);
_cache.on_hit();
_cache.on_partition_hit();
return ce.read(_cache, *_read_context);
}
@@ -469,7 +508,7 @@ private:
}
cache_entry& e = _primary.entry();
auto sm = read_from_entry(e);
_lower_bound = {e.key(), false};
_lower_bound = dht::partition_range::bound{e.key(), false};
// Delay the call to next() so that we don't see stale continuity on next invocation.
_advance_primary = true;
return streamed_mutation_opt(std::move(sm));
@@ -478,7 +517,7 @@ private:
cache_entry& e = _primary.entry();
_secondary_range = dht::partition_range(_lower_bound ? std::move(_lower_bound) : _pr->start(),
dht::partition_range::bound{e.key(), false});
_lower_bound = {e.key(), true};
_lower_bound = dht::partition_range::bound{e.key(), true};
_secondary_in_progress = true;
return stdx::nullopt;
} else {
@@ -487,7 +526,7 @@ private:
if (!range) {
return stdx::nullopt;
}
_lower_bound = {dht::ring_position::max()};
_lower_bound = dht::partition_range::bound{dht::ring_position::max()};
_secondary_range = std::move(*range);
_secondary_in_progress = true;
return stdx::nullopt;
@@ -570,10 +609,10 @@ row_cache::make_reader(schema_ptr s,
cache_entry& e = *i;
_tracker.touch(e);
upgrade_entry(e);
on_hit();
on_partition_hit();
return make_reader_returning(e.read(*this, *ctx));
} else {
on_miss();
on_partition_miss();
return make_mutation_reader<single_partition_populating_reader>(*this, std::move(ctx));
}
});
@@ -629,6 +668,8 @@ cache_entry& row_cache::do_find_or_create_entry(const dht::decorated_key& key,
|| (previous->_key && i != _partitions.begin()
&& std::prev(i)->key().equal(*_schema, *previous->_key))) {
i->set_continuous(true);
} else {
on_mispopulate();
}
return *i;
@@ -642,6 +683,7 @@ cache_entry& row_cache::find_or_create(const dht::decorated_key& key, tombstone
_tracker.insert(*entry);
return _partitions.insert(i, *entry);
}, [&] (auto i) { // visit
_tracker.on_miss_already_populated();
cache_entry& e = *i;
e.partition().open_version(*e.schema(), phase).partition().apply(t);
_tracker.touch(e);
@@ -760,7 +802,7 @@ future<> row_cache::do_update(memtable& m, Updater updater) {
if (m.partitions.empty()) {
_prev_snapshot_pos = {};
} else {
_prev_snapshot_pos = m.partitions.begin()->key();
_prev_snapshot_pos = dht::ring_position(m.partitions.begin()->key());
}
});
STAP_PROBE1(scylla, row_cache_update_one_batch_end, quota_before - quota);
@@ -790,13 +832,12 @@ future<> row_cache::update(memtable& m, partition_presence_checker is_present) {
entry.partition().apply_to_incomplete(*_schema, std::move(mem_e.partition()), *mem_e.schema());
_tracker.touch(entry);
_tracker.on_merge();
} else if (is_present(mem_e.key()) == partition_presence_checker_result::definitely_doesnt_exist) {
} else if (cache_i->continuous() || is_present(mem_e.key()) == partition_presence_checker_result::definitely_doesnt_exist) {
cache_entry* entry = current_allocator().construct<cache_entry>(
mem_e.schema(), std::move(mem_e.key()), std::move(mem_e.partition()));
entry->set_continuous(cache_i->continuous());
_tracker.insert(*entry);
_partitions.insert(cache_i, *entry);
} else {
_tracker.clear_continuity(*cache_i);
}
});
}
@@ -815,6 +856,10 @@ future<> row_cache::update_invalidating(memtable& m) {
});
}
void row_cache::refresh_snapshot() {
_underlying = _snapshot_source();
}
void row_cache::touch(const dht::decorated_key& dk) {
_read_section(_tracker.region(), [&] {
with_linearized_managed_bytes([&] {


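The row_cache.cc changes above split the old generic `hits`/`misses` counters into partition-level and row-level pairs, and derive the number of in-flight reads from started vs. completed reads. A minimal sketch of that counter layout (illustrative only, not Scylla's actual types) might look like:

```cpp
#include <cstdint>

// Sketch of the counter split introduced in the diff above: partition-level
// and row-level hits/misses are tracked separately, and "active reads" is
// derived as started-minus-completed. Names mirror the diff; the struct
// itself is a hypothetical stand-in.
struct cache_stats {
    uint64_t partition_hits = 0;
    uint64_t partition_misses = 0;
    uint64_t row_hits = 0;
    uint64_t row_misses = 0;
    uint64_t reads = 0;       // reads started
    uint64_t reads_done = 0;  // reads completed

    // Reads currently in flight.
    uint64_t active_reads() const { return reads - reads_done; }

    // Fraction of partition lookups served from cache.
    double partition_hit_ratio() const {
        uint64_t total = partition_hits + partition_misses;
        return total ? double(partition_hits) / double(total) : 0.0;
    }
};
```

Splitting the counters this way lets the partition hit ratio and the row hit ratio be exported as independent metrics, which the `sm::make_derive` registrations in the diff rely on.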
@@ -185,23 +185,35 @@ public:
using lru_type = bi::list<cache_entry,
bi::member_hook<cache_entry, cache_entry::lru_link_type, &cache_entry::_lru_link>,
bi::constant_time_size<false>>; // we need this to have bi::auto_unlink on hooks.
private:
// We will try to evict large partition after that many normal evictions
const uint32_t _normal_large_eviction_ratio = 1000;
// Number of normal evictions to perform before we try to evict large partition
uint32_t _normal_eviction_count = _normal_large_eviction_ratio;
public:
friend class row_cache;
friend class cache::read_context;
friend class cache::autoupdating_underlying_reader;
friend class cache::cache_streamed_mutation;
struct stats {
uint64_t hits;
uint64_t misses;
uint64_t insertions;
uint64_t partition_hits;
uint64_t partition_misses;
uint64_t row_hits;
uint64_t row_misses;
uint64_t partition_insertions;
uint64_t row_insertions;
uint64_t concurrent_misses_same_key;
uint64_t merges;
uint64_t evictions;
uint64_t removals;
uint64_t partition_merges;
uint64_t partition_evictions;
uint64_t partition_removals;
uint64_t partitions;
uint64_t modification_count;
uint64_t mispopulations;
uint64_t underlying_recreations;
uint64_t underlying_partition_skips;
uint64_t underlying_row_skips;
uint64_t reads;
uint64_t reads_with_misses;
uint64_t reads_done;
uint64_t active_reads() const {
return reads - reads_done;
}
};
private:
stats _stats{};
@@ -219,8 +231,10 @@ public:
void clear_continuity(cache_entry& ce);
void on_erase();
void on_merge();
void on_hit();
void on_miss();
void on_partition_hit();
void on_partition_miss();
void on_row_hit();
void on_row_miss();
void on_miss_already_populated();
void on_mispopulate();
allocation_strategy& allocator();
@@ -263,6 +277,8 @@ public:
struct stats {
utils::timed_rate_moving_average hits;
utils::timed_rate_moving_average misses;
utils::timed_rate_moving_average reads_with_misses;
utils::timed_rate_moving_average reads_with_no_misses;
};
private:
cache_tracker& _tracker;
@@ -313,8 +329,12 @@ private:
logalloc::allocating_section _read_section;
mutation_reader create_underlying_reader(cache::read_context&, mutation_source&, const dht::partition_range&);
mutation_reader make_scanning_reader(const dht::partition_range&, lw_shared_ptr<cache::read_context>);
void on_hit();
void on_miss();
void on_partition_hit();
void on_partition_miss();
void on_row_hit();
void on_row_miss();
void on_row_insert();
void on_mispopulate();
void upgrade_entry(cache_entry&);
void invalidate_locked(const dht::decorated_key&);
void invalidate_unwrapped(const dht::partition_range&);
@@ -422,6 +442,10 @@ public:
// as few elements as possible.
future<> update_invalidating(memtable&);
// Refreshes snapshot. Must only be used if logical state in the underlying data
// source hasn't changed.
void refresh_snapshot();
// Moves given partition to the front of LRU if present in cache.
void touch(const dht::decorated_key&);
@@ -449,7 +473,7 @@ public:
// If it did, use invalidate() instead.
void evict(const dht::partition_range& = query::full_partition_range);
auto num_entries() const {
size_t partitions() const {
return _partitions.size();
}
const cache_tracker& get_cache_tracker() const {

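The header diff above also adds an eviction-pacing pair: `_normal_large_eviction_ratio = 1000` and a countdown `_normal_eviction_count`, so that a large-partition eviction is attempted once per 1000 normal evictions. A small sketch of that countdown pattern (the class and method names here are illustrative, not Scylla's):

```cpp
#include <cstdint>

// Sketch of the eviction pacing from the header above: a countdown that
// signals one large-partition eviction attempt per N normal evictions
// (N = 1000 in the diff). This is a hypothetical stand-in, not Scylla's
// implementation.
class eviction_pacer {
    const uint32_t _normal_large_eviction_ratio = 1000;
    uint32_t _normal_eviction_count = _normal_large_eviction_ratio;
public:
    // Returns true when the caller should try evicting a large partition
    // instead of a normal LRU entry.
    bool should_try_large_eviction() {
        if (_normal_eviction_count == 0) {
            _normal_eviction_count = _normal_large_eviction_ratio;
            return true;
        }
        --_normal_eviction_count;
        return false;
    }
};
```

The effect is that large partitions, which are expensive to evict, are only considered occasionally rather than on every eviction pass.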
schema.cc

@@ -105,6 +105,97 @@ schema::make_column_specification(const column_definition& def) {
return ::make_shared<cql3::column_specification>(_raw._ks_name, _raw._cf_name, std::move(id), def.type);
}
v3_columns::v3_columns(std::vector<column_definition> cols, bool is_dense, bool is_compound)
: _is_dense(is_dense)
, _is_compound(is_compound)
, _columns(std::move(cols))
{
for (column_definition& def : _columns) {
_columns_by_name[def.name()] = &def;
}
}
v3_columns v3_columns::from_v2_schema(const schema& s) {
data_type static_column_name_type = utf8_type;
std::vector<column_definition> cols;
if (s.is_static_compact_table()) {
if (s.has_static_columns()) {
throw std::runtime_error(
sprint("v2 static compact table should not have static columns: %s.%s", s.ks_name(), s.cf_name()));
}
if (s.clustering_key_size()) {
throw std::runtime_error(
sprint("v2 static compact table should not have clustering columns: %s.%s", s.ks_name(), s.cf_name()));
}
static_column_name_type = s.regular_column_name_type();
for (auto& c : s.all_columns()) {
// Note that for "static" no-clustering compact storage we use static for the defined columns
if (c.kind == column_kind::regular_column) {
auto new_def = c;
new_def.kind = column_kind::static_column;
cols.push_back(new_def);
} else {
cols.push_back(c);
}
}
schema_builder::default_names names(s._raw);
cols.emplace_back(to_bytes(names.clustering_name()), static_column_name_type, column_kind::clustering_key, 0);
cols.emplace_back(to_bytes(names.compact_value_name()), s.make_legacy_default_validator(), column_kind::regular_column, 0);
} else {
cols = s.all_columns();
}
for (column_definition& def : cols) {
data_type name_type = def.is_static() ? static_column_name_type : utf8_type;
auto id = ::make_shared<cql3::column_identifier>(def.name(), name_type);
def.column_specification = ::make_shared<cql3::column_specification>(s.ks_name(), s.cf_name(), std::move(id), def.type);
}
return v3_columns(std::move(cols), s.is_dense(), s.is_compound());
}
void v3_columns::apply_to(schema_builder& builder) const {
if (is_static_compact()) {
for (auto& c : _columns) {
if (c.kind == column_kind::regular_column) {
builder.set_default_validation_class(c.type);
} else if (c.kind == column_kind::static_column) {
auto new_def = c;
new_def.kind = column_kind::regular_column;
builder.with_column(new_def);
} else if (c.kind == column_kind::clustering_key) {
builder.set_regular_column_name_type(c.type);
} else {
builder.with_column(c);
}
}
} else {
for (auto& c : _columns) {
if (is_compact() && c.kind == column_kind::regular_column) {
builder.set_default_validation_class(c.type);
}
builder.with_column(c);
}
}
}
bool v3_columns::is_static_compact() const {
return !_is_dense && !_is_compound;
}
bool v3_columns::is_compact() const {
return _is_dense || !_is_compound;
}
const std::unordered_map<bytes, const column_definition*>& v3_columns::columns_by_name() const {
return _columns_by_name;
}
const std::vector<column_definition>& v3_columns::all_columns() const {
return _columns;
}
void schema::rebuild() {
_partition_key_type = make_lw_shared<compound_type<>>(get_column_types(partition_key_columns()));
_clustering_key_type = make_lw_shared<compound_prefix>(get_column_types(clustering_key_columns()));
@@ -117,10 +208,10 @@ void schema::rebuild() {
}
static_assert(row_column_ids_are_ordered_by_name::value, "row columns don't need to be ordered by name");
if (!std::is_sorted(regular_columns().begin(), regular_columns().end(), column_definition::name_comparator())) {
if (!std::is_sorted(regular_columns().begin(), regular_columns().end(), column_definition::name_comparator(regular_column_name_type()))) {
throw std::runtime_error("Regular columns should be sorted by name");
}
if (!std::is_sorted(static_columns().begin(), static_columns().end(), column_definition::name_comparator())) {
if (!std::is_sorted(static_columns().begin(), static_columns().end(), column_definition::name_comparator(static_column_name_type()))) {
throw std::runtime_error("Static columns should be sorted by name");
}
@@ -137,7 +228,7 @@ void schema::rebuild() {
}
thrift()._compound = is_compound();
thrift()._is_dynamic = static_columns_count() == 0;
thrift()._is_dynamic = clustering_key_size() > 0;
if (is_counter()) {
for (auto&& cdef : boost::range::join(static_columns(), regular_columns())) {
@@ -152,6 +243,8 @@ void schema::rebuild() {
}
}
}
_v3_columns = v3_columns::from_v2_schema(*this);
}
const column_mapping& schema::get_column_mapping() const {
@@ -189,24 +282,15 @@ schema::schema(const raw_schema& raw, stdx::optional<raw_view_info> raw_view_inf
}())
, _regular_columns_by_name(serialized_compare(_raw._regular_column_name_type))
{
struct name_compare {
data_type type;
name_compare(data_type type) : type(type) {}
bool operator()(const column_definition& cd1, const column_definition& cd2) const {
return type->less(cd1.name(), cd2.name());
}
};
std::sort(
_raw._columns.begin() + column_offset(column_kind::static_column),
_raw._columns.begin()
+ column_offset(column_kind::regular_column),
name_compare(utf8_type));
column_definition::name_comparator(static_column_name_type()));
std::sort(
_raw._columns.begin()
+ column_offset(column_kind::regular_column),
_raw._columns.end(), name_compare(regular_column_name_type()));
_raw._columns.end(), column_definition::name_comparator(regular_column_name_type()));
std::sort(_raw._columns.begin(),
_raw._columns.begin() + column_offset(column_kind::clustering_key),
@@ -360,6 +444,7 @@ bool operator==(const schema& x, const schema& y)
&& x._raw._speculative_retry == y._raw._speculative_retry
&& x._raw._compaction_strategy == y._raw._compaction_strategy
&& x._raw._compaction_strategy_options == y._raw._compaction_strategy_options
&& x._raw._compaction_enabled == y._raw._compaction_enabled
&& x._raw._caching_options == y._raw._caching_options
&& x._raw._dropped_columns == y._raw._dropped_columns
&& x._raw._collections == y._raw._collections
@@ -478,11 +563,10 @@ std::ostream& operator<<(std::ostream& os, const schema& s) {
os << ",compactionStrategyOptions={";
n = 0;
for (auto& p : s._raw._compaction_strategy_options) {
if (n++ != 0) {
os << ", ";
}
os << p.first << "=" << p.second;
os << ", ";
}
os << "enabled=" << std::boolalpha << s._raw._compaction_enabled;
os << "}";
os << ",compressionParameters={";
n = 0;
@@ -500,7 +584,6 @@ std::ostream& operator<<(std::ostream& os, const schema& s) {
os << ",minIndexInterval=" << s._raw._min_index_interval;
os << ",maxIndexInterval=" << s._raw._max_index_interval;
os << ",speculativeRetry=" << s._raw._speculative_retry.to_sstring();
os << ",droppedColumns={}";
os << ",triggers=[]";
os << ",isDense=" << std::boolalpha << s._raw._is_dense;
os << ",version=" << s.version();
@@ -642,11 +725,7 @@ schema_builder& schema_builder::without_column(bytes name)
return column.name() == name;
});
assert(it != _raw._columns.end());
auto now = api::new_timestamp();
auto ret = _raw._dropped_columns.emplace(it->name_as_text(), schema::dropped_column{it->type, now});
if (!ret.second) {
ret.first->second.timestamp = std::max(ret.first->second.timestamp, now);
}
without_column(it->name_as_text(), it->type, api::new_timestamp());
_raw._columns.erase(it);
return *this;
}
@@ -658,8 +737,9 @@ schema_builder& schema_builder::without_column(sstring name, api::timestamp_type
schema_builder& schema_builder::without_column(sstring name, data_type type, api::timestamp_type timestamp)
{
auto ret = _raw._dropped_columns.emplace(name, schema::dropped_column{type, timestamp});
if (!ret.second) {
ret.first->second.timestamp = std::max(ret.first->second.timestamp, timestamp);
if (!ret.second && ret.first->second.timestamp < timestamp) {
ret.first->second.type = type;
ret.first->second.timestamp = timestamp;
}
return *this;
}
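The `without_column` change above makes re-dropping a column keep the record with the newest timestamp, and makes the newest drop's type win as well. That merge rule can be sketched in isolation (the map and struct here are illustrative stand-ins, not Scylla's `_raw._dropped_columns` machinery):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Sketch of the dropped-column bookkeeping in the diff above: dropping a
// column that was already dropped only updates the record when the new
// timestamp is strictly newer, and then both type and timestamp are taken
// from the newer drop.
struct dropped_column {
    std::string type;
    int64_t timestamp;
};

std::map<std::string, dropped_column> dropped;

void drop(const std::string& name, std::string type, int64_t ts) {
    auto ret = dropped.emplace(name, dropped_column{type, ts});
    if (!ret.second && ret.first->second.timestamp < ts) {
        ret.first->second.type = std::move(type);
        ret.first->second.timestamp = ts;
    }
}
```

Compared with the removed code, which only bumped the timestamp to the maximum, this version also records the type from the newest drop, so a column dropped as one type and later re-dropped as another is remembered with the later type.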
@@ -751,10 +831,8 @@ sstring schema_builder::default_names::compact_value_name() {
void schema_builder::prepare_dense_schema(schema::raw_schema& raw) {
auto is_dense = raw._is_dense;
auto is_compound = raw._is_compound;
auto is_static_compact = !is_dense && !is_compound;
auto is_compact_table = is_dense || !is_compound;
if (is_compact_table) {
auto count_kind = [&raw](column_kind kind) {
return std::count_if(raw._columns.begin(), raw._columns.end(), [kind](const column_definition& c) {
@@ -764,37 +842,7 @@ void schema_builder::prepare_dense_schema(schema::raw_schema& raw) {
default_names names(raw);
if (is_static_compact) {
/**
* In origin v3 the general cql-ification of the "storage engine" means
* that "static compact" tables are expressed as all defined columns static,
* but with synthetic clustering + regular columns.
* We unfortunately need to play along with this, both because we want
* schema tables on disk to be compatible (and they are explicit).
* More to the point, we are, at least until we upgrade to version "m"
* sstables, stuck with having origins java tools reading our schema tables
* for table schemas (this btw applies to db drivers too, though maybe a little
* less), and it asserts badly if we don't uphold the origin table tweaks.
*
* So transform away...
*
*/
if (!count_kind(column_kind::static_column)) {
assert(!count_kind(column_kind::clustering_key));
for (auto& c : raw._columns) {
// Note that for "static" no-clustering compact storage we use static for the defined columns
if (c.kind == column_kind::regular_column) {
c.kind = column_kind::static_column;
}
}
// Compact tables always have a clustering and a single regular value.
raw._columns.emplace_back(to_bytes(names.clustering_name()),
utf8_type, column_kind::clustering_key, 0);
raw._columns.emplace_back(to_bytes(names.compact_value_name()),
raw._is_counter ? counter_type : bytes_type,
column_kind::regular_column, 0);
}
} else if (is_dense) {
if (is_dense) {
auto regular_cols = count_kind(column_kind::regular_column);
// In Origin, dense CFs always have at least one regular column
if (regular_cols == 0) {
@@ -838,6 +886,10 @@ schema_ptr schema_builder::build() {
new_raw._version = utils::UUID_gen::get_time_UUID();
}
if (new_raw._is_counter) {
new_raw._default_validation_class = counter_type;
}
if (_compact_storage) {
// Dense means that no part of the comparator stores a CQL column name. This means
// COMPACT STORAGE with at least one columnAliases (otherwise it's a thrift "static" CF).
@@ -1032,7 +1084,10 @@ schema::static_upper_bound(const bytes& name) const {
}
data_type
schema::column_name_type(const column_definition& def) const {
return def.kind == column_kind::regular_column ? _raw._regular_column_name_type : utf8_type;
if (def.kind == column_kind::regular_column) {
return _raw._regular_column_name_type;
}
return utf8_type;
}
const column_definition&
@@ -1043,6 +1098,14 @@ schema::regular_column_at(column_id id) const {
return _raw._columns.at(column_offset(column_kind::regular_column) + id);
}
const column_definition&
schema::clustering_column_at(column_id id) const {
if (id >= clustering_key_size()) {
throw std::out_of_range(sprint("clustering column id %d >= %d", id, clustering_key_size()));
}
return _raw._columns.at(column_offset(column_kind::clustering_key) + id);
}
const column_definition&
schema::static_column_at(column_id id) const {
if (id >= static_columns_count()) {
@@ -1119,12 +1182,8 @@ schema::select_order_range schema::all_columns_in_select_order() const {
_raw._columns.begin() + (is_static_compact_table ?
column_offset(column_kind::clustering_key) :
column_offset(column_kind::static_column)));
auto ck_v_range =
(is_static_compact_table || no_non_pk_columns) ?
static_columns() :
const_iterator_range_type(
static_columns().begin(),
all_columns().end());
auto ck_v_range = no_non_pk_columns ? static_columns()
: const_iterator_range_type(static_columns().begin(), all_columns().end());
return boost::range::join(pk_range, ck_v_range);
}
@@ -1163,23 +1222,7 @@ std::vector<sstring> schema::index_names() const {
}
data_type schema::make_legacy_default_validator() const {
if (is_counter()) {
return counter_type;
}
if (is_compact_table()) {
// See CFMetaData.
if (is_super()) {
for (auto& c : regular_columns()) {
if (c.name().empty()) {
return c.type;
}
}
assert("Invalid super column table definition, no 'dynamic' map column");
} else {
return regular_columns().begin()->type;
}
}
return bytes_type;
return _raw._default_validation_class;
}
bool schema::is_synced() const {

Some files were not shown because too many files have changed in this diff.