Compare commits

...

1389 Commits

Author SHA1 Message Date
Asias He
f19fbc3058 gossip: Fix tokens assignment in assassinate_endpoint
The tokens vector is defined a few lines above and is needed outside the
if block.

Do not redefine it in the if block; otherwise the tokens will be empty.
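As a general illustration of this bug class (not the actual ScyllaDB code; the names here are hypothetical):

```cpp
#include <cassert>
#include <vector>

// Sketch of the shadowing bug: the inner declaration shadows the outer
// `tokens`, so the outer vector stays empty.
std::vector<int> collect_tokens_buggy(bool has_endpoint) {
    std::vector<int> tokens;                // needed outside the if block
    if (has_endpoint) {
        std::vector<int> tokens{10, 20};    // BUG: shadows the outer vector
        (void)tokens;
    }
    return tokens;                          // always empty
}

std::vector<int> collect_tokens_fixed(bool has_endpoint) {
    std::vector<int> tokens;
    if (has_endpoint) {
        tokens = {10, 20};                  // assign to the outer vector
    }
    return tokens;
}
```

Compiling with -Wshadow would flag the buggy variant.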

Found by code inspection.

Fixes #3551.

Message-Id: <c7a06375c65c950e94236571127f533e5a60cbfd.1530002177.git.asias@scylladb.com>
(cherry picked from commit c3b5a2ecd5)
2018-06-27 12:01:19 +03:00
Vlad Zolotarov
8eddb28954 locator::ec2_multi_region_snitch: don't call for ec2_snitch::gossiper_starting()
ec2_snitch::gossiper_starting() calls the base class (default) method,
which sets _gossip_started to TRUE and thereby prevents the following
reconnectable_snitch_helper registration.

Fixes #3454

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 2dde372ae6)
2018-06-12 19:02:48 +03:00
Avi Kivity
5aaa8031a2 Update seastar submodule
* seastar f5162dc...da2e1af (1):
  > net/tls: Wait for output to be sent when shutting down

Fixes #3459.
2018-05-24 12:02:15 +03:00
Avi Kivity
3d50e7077a Merge "Backport fixes for streaming segfault with bogus dst_cpu_id for 2.0" from Asias
"
The minimum changes that make the backport of "streaming: Do send failed
message for uninitialized session" possible without backport conflicts.

Fixes a similar issue we saw:

  https://github.com/scylladb/scylla/issues/3115
"

* tag 'asias/backport_issue_3115_for_2.0/v1' of github.com:scylladb/seastar-dev:
  streaming: Do send failed message for uninitialized session
  streaming: Introduce streaming::abort()
  streaming: Log peer address in on_error
  streaming: Check if _stream_result is valid
  streaming: Introduce received_failed_complete_message
2018-05-24 11:14:20 +03:00
Avi Kivity
4063e92f57 dist: redhat: get rid of raid0.devices_discard_performance
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed in 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it causes the module load to fail, with the result that the
storage is not available.

Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.

Fixes #3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>

(cherry picked from commit 3b8118d4e5)
2018-05-24 11:08:13 +03:00
Asias He
b6de30bb87 streaming: Do send failed message for uninitialized session
The uninitialized session has no peer associated with it yet. There is
no point sending the failed message when aborting the session. Sending the
failed message in this case will send it to a peer with an uninitialized
dst_cpu_id, which will cause the receiver to pass a bogus shard id to
smp::submit_to, which causes a segfault.

In addition, to be safe, initialize the dst_cpu_id to zero, so that an
uninitialized session will send messages to shard zero instead of a random
bogus shard id.
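A minimal sketch of the two changes described above, with hypothetical names (the real stream_session is far more involved):

```cpp
#include <cassert>

// Hypothetical simplification: dst_cpu_id is default-initialized to zero,
// and an uninitialized session (no peer yet) never sends the failed message.
struct stream_session_sketch {
    unsigned dst_cpu_id = 0;    // was uninitialized before the fix
    bool initialized = false;   // set once a peer is associated

    bool should_send_failed_message() const {
        // No peer associated yet: nothing to notify when aborting.
        return initialized;
    }
};
```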

Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test

Fixes #3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>

(cherry picked from commit 774307b3a7)
2018-05-24 15:24:29 +08:00
Asias He
c23e3a1eda streaming: Introduce streaming::abort()
It will be used soon by stream_plan::abort() to abort a stream session.

(cherry picked from commit fad34801bf)
2018-05-24 15:21:54 +08:00
Asias He
2732b6cf1d streaming: Log peer address in on_error
(cherry picked from commit 8a3f6acdd2)
2018-05-24 15:20:43 +08:00
Asias He
49722e74da streaming: Check if _stream_result is valid
If on_error() was called before init() was executed, the
_stream_result can be invalid.

(cherry picked from commit be573bcafb)
2018-05-24 15:20:02 +08:00
Asias He
ba7623ac55 streaming: Introduce received_failed_complete_message
It is the handler for the failed complete message. Add a flag to
remember if we received such a message from the peer; if so, do not send
the failed complete message back to the peer when running
close_session with failed status.
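A sketch of the flag under assumed, simplified names (the actual session state machine is larger):

```cpp
#include <cassert>

// Hypothetical simplification: remember that the peer already reported the
// session as failed, so closing with failed status does not echo the failed
// complete message back to that peer.
struct close_tracker_sketch {
    bool received_failed_complete = false;

    void on_received_failed_complete_message() {
        received_failed_complete = true;
    }

    bool should_send_failed_complete() const {
        return !received_failed_complete;
    }
};
```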

(cherry picked from commit eace5fc6e8)
2018-05-24 15:19:26 +08:00
Avi Kivity
9db2ff36f2 dist: redhat: fix binutils dependency on alternatives
Since /sbin is a symlink, file dependencies on /sbin/alternatives
don't work.

Change to /usr/bin to fix.
2018-04-26 12:00:42 +03:00
Avi Kivity
378029b8da schema_tables: discard [[nodiscard]] attribute
Not supported on gcc 5.3, which is used to build this branch.
2018-04-25 18:37:17 +03:00
Mika Eloranta
2b7644dc36 build: fix rpm build script --jobs N handling
Fixes argument misquoting in the $SRPM_OPTS expansion for the mock commands
and makes the --jobs argument work as intended.

Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212904.85907-1-mel@aiven.io>
(cherry picked from commit 7266446227)
2018-04-25 17:47:09 +03:00
Duarte Nunes
4bd931ba59 db/schema_tables: Only drop UDTs after merging tables
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.

When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.

Fixes #3068

Tests: unit-tests (release)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
(cherry picked from commit 1e3fae5bef)
2018-04-25 01:15:45 +03:00
Avi Kivity
78eebe74c7 loading_cache: adjust code for older compilers
Need this-> qualifier in a generic lambda, and static_assert()s need
a message.
2018-04-23 16:12:25 +03:00
Avi Kivity
30e21afb13 streaming: adjust code for older compilers
Need this-> qualifier in a generic lambda.
2018-04-23 16:12:25 +03:00
Shlomi Livne
e8616b10e5 release: prepare for 2.0.4
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-04-23 09:11:35 +03:00
Avi Kivity
0cb842dde1 Update seastar submodule
* seastar e7facd4...f5162dc (1):
  > tls: Ensure we always pass through semaphores on shutdown

Fixes #3358.
2018-04-14 20:52:57 +03:00
Gleb Natapov
7945f5edda cql_server: fix a race between closing of a connection and notifier registration
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete, a stale pointer to the connection will remain in the notification
list, since the attempt to unregister the connection will happen too early.
The fix is to move notifier unregistration to after the connection's gate
is closed, which ensures that there is no outstanding registration
request. But this means that now a connection with a closed gate can be in
the notifier list, so with_gate() may throw and abort the notifier loop. Fix
that by replacing with_gate() with a call to is_closed().

Fixes: #3355
Tests: unit(release)

Message-Id: <20180412134744.GB22593@scylladb.com>
(cherry picked from commit 1a9aaece3e)
2018-04-12 16:57:30 +03:00
Asias He
9c2a328000 gossip: Relax generation max difference check
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node

I saw node2 could not bootstrap, stuck at waiting for schema information to complete forever:

On node1, node3

    [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704

On node2

    [shard 0] storage_service - JOINING: waiting for schema information to complete

This is because during the nodetool removenode operation, the generation of node2 was increased from 0 to 2.

   gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
   gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();

Each force_newer_generation_unsafe increases the generation by 1.

Here is an example,

Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
   "addrs": "127.0.0.2",
   "generation": 0,
   "is_alive": false,
   "update_time": 1521778757334,
   "version": 0
   },
```

After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
 {
     "addrs": "127.0.0.2",
     "application_state": [
         {
             "application_state": 0,
             "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
             "version": 214
         },
         {
             "application_state": 6,
             "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
             "version": 153
            }
     ],
     "generation": 2,
     "is_alive": false,
     "update_time": 1521779276246,
     "version": 0
 },
```

In gossiper::apply_state_locally, we have this check:

```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
    logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",
                ep, local_generation, remote_generation);
}
```
to skip the gossip update.
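The shape of the problem and of the relaxation can be sketched as follows. The constant value and the relaxed predicate here are assumptions mirroring the commit text, not the exact code in gossiper::apply_state_locally:

```cpp
#include <cassert>

// Assumed one-year bound for the MAX_GENERATION_DIFFERENCE described above.
constexpr long MAX_GENERATION_DIFFERENCE = 86400L * 365;

// Old check: bounded by the local generation, which is tiny (2) for a node
// that went through `nodetool removenode`, so a rebootstrapped node's
// epoch-based generation looks "unbelievable" and the update is skipped.
bool old_check_rejects(long local_generation, long remote_generation) {
    return local_generation != 0 &&
           remote_generation > local_generation + MAX_GENERATION_DIFFERENCE;
}

// Relaxed idea (hypothetical): bound by the current wall-clock epoch instead,
// which a legitimate epoch-based generation will not exceed by a year.
bool relaxed_check_rejects(long now_epoch, long remote_generation) {
    return remote_generation > now_epoch + MAX_GENERATION_DIFFERENCE;
}
```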

To fix, we relax generation max difference check to allow the generation
of a removed node.

After this patch, the removed node bootstraps successfully.

Tests: dtest:update_cluster_layout_tests.py
Fixes #3331

Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
(cherry picked from commit f539e993d3)
2018-03-29 12:10:37 +03:00
Tomasz Grabiec
98498c679b test.py: set BOOST_TEST_CATCH_SYSTEM_ERRORS=no
This will make boost UTF abort execution on SIGABRT rather than trying
to continue running other test cases, which doesn't work well with the
seastar integration: the suite will hang.
Message-Id: <1516205469-16378-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit ab6ec571cb)
2018-03-27 22:42:00 +03:00
Duarte Nunes
b147b5854b view_schema_test: Retry failed queries
Due to the asynchronous nature of view update propagation, results
might still be absent from views when we query them. To be able to
deterministically assert on view rows, this patch retries a query a
bounded number of times until it succeeds.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718212646.2958-1-duarte@scylladb.com>
(cherry picked from commit ab72132cb1)
2018-03-27 22:42:00 +03:00
Duarte Nunes
226095f4db types: Implement hash() for collections
This patch provides a rather trivial implementation of hash() for
collection types.

It is needed for view building, where we hold mutations in a map
indexed by partition keys (and frozen collection types can be part of
the key).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718192107.13746-1-duarte@scylladb.com>
(cherry picked from commit 3bfcf47cc6)
2018-03-27 22:42:00 +03:00
Avi Kivity
3dd1f68590 tests: fix view_schema_test with clang
Clang is happy to create a vector<data_value> from a {}, a {1, 2}, but not a {1}.
No doubt it is correct, but sheesh.

Make the data_value explicit to humor it.
Message-Id: <20170713074315.9857-1-avi@scylladb.com>

(cherry picked from commit 162d9aa85d)
2018-03-27 22:25:17 +03:00
Avi Kivity
e08e4c75d7 Merge "Fix abort during counter table read-on-delete" from Tomasz
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to an attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.

Fixes #3304.
"

* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test reads with no clustering ranges and no static columns
  tests: simple_schema: Allow creating schema with no static column
  clustering_ranges_walker: Stop after static row in case no clustering ranges

(cherry picked from commit 054854839a)
2018-03-22 18:16:37 +02:00
Vlad Zolotarov
8bcb4e7439 test.py: limit the tests to run on 2 shards with 4GB of memory
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 57a6ed5aaa)
2018-03-22 12:47:12 +02:00
Avi Kivity
97369adb1c Merge "streaming backport for 2.0 branch" from Asias
"
Backport streaming improvements and bug fixes from the 2.1 branch. With this
series, it is more robust to add a node or decommission a node, because the single
big stream plan is split into multiple smaller stream plans. In
addition, failed stream plans will be retried automatically.
"

Fixes #3285, Fixes #3310, Fixes #3065, Fixes #1743, Fixes #3311.

* 'asias/ticket-352' of github.com:scylladb/seastar-dev:
  range_streamer: Stream 10% of ranges instead of 10 ranges per time
  Revert "streaming: Do not abort session too early in idle detection"
  dht: Fix log in range_streamer
  streaming: One cf per time on sender
  messaging_service: Get rid of timeout and retry logic for streaming verb
  storage_service: Remove rpc client on all shards in on_dead
  Merge "streaming error handling improvement" from Asias
  streaming: Fix streaming not streaming all ranges
  Merge "Use range_streamer everywhere" from Asias
2018-03-22 10:39:08 +02:00
Duarte Nunes
c89ead5e55 gms/gossiper: Synchronize endpoint state destruction
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.

This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).

Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.

Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands in
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one was
iterating over.

Fixes #3299.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
(cherry picked from commit 810db425a5)
2018-03-18 14:55:50 +02:00
Asias He
46fd96d877 range_streamer: Stream 10% of ranges instead of 10 ranges per time
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plans to be created to stream the data,
each carrying very little data. This causes each stream plan to have low
transfer bandwidth, so the total time to complete the streaming increases.

It makes more sense to send a percentage of the total ranges per stream
plan than a fixed number of ranges.

Here is an example to stream a keyspace with 513 ranges in
total, 10 ranges v.s. 10% ranges:

Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds

After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds
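The sizing rule can be sketched as follows; the helper name is hypothetical, not the actual range_streamer API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: stream a fixed percentage of the total ranges per
// stream plan instead of a fixed count of 10, with a floor of one range.
size_t ranges_per_stream_plan(size_t total_ranges, unsigned percent = 10) {
    return std::max<size_t>(1, total_ranges * percent / 100);
}
```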

Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
(cherry picked from commit 9b5585ebd5)
2018-03-15 10:11:08 +08:00
Asias He
19806fc056 Revert "streaming: Do not abort session too early in idle detection"
This reverts commit f792c78c96.

With the "Use range_streamer everywhere" (7217b7ab36) series,
all users of streaming now stream relatively small ranges
and can retry streaming at a higher level.

This reduces the time-to-recover from 5 hours to 10 minutes per stream session.

Even if the 10-minute idle detection might cause more false positives,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with:
whenever the stream initiator goes away, the stream slave goes away.

Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
(cherry picked from commit ad7b132188)
2018-03-15 10:10:58 +08:00
Asias He
0b314a745f dht: Fix log in range_streamer
The address and keyspace should be swapped.

Before:
  range_streamer - Bootstrap with ks3 for keyspace=127.0.0.1 succeeded,
  took 56 seconds

After:
  range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded,
  took 56 seconds

Message-Id: <5c49646f1fbe45e3a1e7545b8470e04b166922c4.1520416042.git.asias@scylladb.com>
(cherry picked from commit 73d8e2743f)
2018-03-13 15:07:13 +08:00
Asias He
73870751d9 streaming: One cf per time on sender
In the case there are a large number of column families, the sender will
send all the column families in parallel. We allow 20% of shard memory
for streaming on the receiver, so each column family will have 1/N of that
memory for its memtable, where N is the number of in-flight column families.
A large N causes a lot of small sstables to be generated.

It is possible there are multiple senders to a single receiver, e.g.,
when a new node joins the cluster; then the maximum number of in-flight
column families is the number of peer nodes. The column families are sent
in the order of cf_id. It is not guaranteed that all peers have the same
speed, so they are not necessarily sending the same cf_id at the same time.
We still have a chance that some of the peers are sending the same cf_id.

Fixes #3065

Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
(cherry picked from commit a9dab60b6c)
2018-03-13 12:20:41 +08:00
Asias He
c8983034c0 messaging_service: Get rid of timeout and retry logic for streaming verb
With the "Use range_streamer everywhere" (7217b7ab36) series, all
users of streaming now stream relatively small ranges and
can retry streaming at a higher level.

There are problems with timeout and retry at RPC verb level in streaming:
1) A timeout can be a false negative.
2) We cannot cancel the send operations which are already called. When
the user aborts the streaming, the retry logic keeps running for a long
time.

This patch removes all the timeout and retry logic for streaming verbs.
After this, the timeout is the job of TCP, the retry is the job of the
upper layer.

Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>
(cherry picked from commit 8fa35d6ddf)
2018-03-13 12:20:41 +08:00
Asias He
77d14a6256 storage_service: Remove rpc client on all shards in on_dead
We should close connections to nodes that are down on all shards instead
of the shard which runs the on_dead gossip callback.

Found by Gleb.
Message-Id: <527a14105a07218066e9f1da943693d9de6993e5.1505894260.git.asias@scylladb.com>

(cherry picked from commit 173cba67ba)
2018-03-13 12:20:41 +08:00
Avi Kivity
21259bcfb3 Merge "streaming error handling improvement" from Asias
"This series improves the streaming error handling so that when one side of the
streaming fails, it will propagate the error to the other side and the peer
will close the failed session accordingly. This removes the unnecessary wait and
timeout time for the peer to discover the failed session and fail eventually.

Fix it by:

- Use the complete message to notify peer node local session is failed
- Listen on the shutdown gossip callback so that we can detect that the peer
  is shut down and close the session with the peer

Fixes #1743"

* tag 'asias/streaming/error_handling_v2' of github.com:cloudius-systems/seastar-dev:
  streaming: Listen on shutdown gossip callback
  gms: Add is_shutdown helper for endpoint_state class
  streaming: Send complete message with failed flag when session is failed
  streaming: Handle failed flag in complete message
  streaming: Do not fail the session when failed to send complete message
  streaming: Introduce send_failed_complete_message
  streaming: Do not send complete message when session is successful
  streaming: Introduce the failed parameter for complete message
  streaming: Remove unused session_failed function
  streaming: Less verbose in logging
  streaming: Better stats

(cherry picked from commit d5aba779d4)
2018-03-13 12:20:41 +08:00
Tomasz Grabiec
9f02b44537 streaming: Fix streaming not streaming all ranges
It skipped one sub-range in each 10-range batch, and
tried to access the range vector using the end() iterator.

Fixes sporadic failures of
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test.

Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 741ec61269)
2018-03-13 10:36:19 +08:00
Avi Kivity
9dc7a63014 Merge "Use range_streamer everywhere" from Asias
"With this series, all the following cluster operations:

- bootstrap
- rebuild
- decommission
- removenode

will use the same code to do the streaming.

The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.

The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....

Later, we can introduce api for user to decide when to stop retrying and
the retry interval.

The benefits:

 - All the cluster operations share the same code to stream
 - We can know the operation progress, e.g., we can know the total number of
   ranges that need to be streamed and the number of ranges finished in
   bootstrap, decommission, etc.
 - All the cluster operations can survive a peer node going down during the
   operation, which usually takes a long time to complete. E.g., when adding
   a new node, currently if any of the existing nodes which stream data to
   the new node has an issue sending data to the new node, the whole bootstrap
   process will fail. After this patch, we can fix the problematic node
   and restart it, and the joining node will retry streaming from that node
   again.
 - We can fail streaming early, time out early, and retry less, because
   all the operations that use streaming can survive the failure of a single
   stream_plan. It is not that important for now to make a single
   stream_plan successful. Note that another user of streaming, repair, now
   uses small stream_plans as well and can rerun the repair for the
   failed ranges too.

This is one step closer to supporting resumable add/remove node
operations."
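The retry schedule described in the cover letter above (at most 5 attempts, sleeping 1 minute, then 1.5^2, 1.5^3, ... minutes) can be sketched as follows; the helper name and exact formula are assumptions based on that text, not the actual range_streamer code:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch of the backoff described in the merge cover letter.
double retry_sleep_minutes(int attempt /* 1-based, at most 5 */) {
    return attempt <= 1 ? 1.0 : std::pow(1.5, attempt);
}
```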

* tag 'asias/use_range_streamer_everywhere_v4' of github.com:cloudius-systems/seastar-dev:
  storage_service: Use the new range_streamer interface for removenode
  storage_service: Use the new range_streamer interface for decommission
  storage_service: Use the new range_streamer interface for rebuild
  storage_service: Use the new range_streamer interface for bootstrap
  dht: Extend range_streamer interface

(cherry picked from commit 7217b7ab36)
2018-03-13 10:34:10 +08:00
Asias He
5dcef25f6f storage_service: Add missing return in pieces empty check
If pieces is empty, it is bogus to access pieces[0]:

   sstring move_name = pieces[0];

Fix by adding the missing return.
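The fix pattern, in illustrative form (names are hypothetical; the real handler lives in storage_service):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: return early on empty input instead of indexing
// pieces[0] on an empty vector.
std::string first_piece_or_empty(const std::vector<std::string>& pieces) {
    if (pieces.empty()) {
        return {};   // the missing early return
    }
    std::string move_name = pieces[0];
    return move_name;
}
```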

Spotted by Vlad Zolotarov <vladz@scylladb.com>

Fixes #3258
Message-Id: <bcb446f34f953bc51c3704d06630b53fda82e8d2.1520297558.git.asias@scylladb.com>

(cherry picked from commit 8900e830a3)
2018-03-06 09:58:39 +02:00
Duarte Nunes
f763bf7f0d Merge 'backport of the "loading_shared_values and size limited and evicting prepared statements cache" series' from Vlad
This backport includes changes from the "loading_shared_values and size limited and evicting prepared statements cache" series,
"missing bits from loading_shared_values series" series and the "cql_transport::cql_server: fix the distributed prepared statements cache population"
patch (the last one is squashed inside the "cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache"
patch in order to avoid the bisect breakage).

* 'branch-2-0-backport-prepared-cache-fixes-v1' of https://github.com/vladzcloudius/scylla:
  tests: loading_cache_test: initial commit
  utils + cql3: use a functor class instead of std::function
  cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
  transport::server::process_prepare() don't ignore errors on other shards
  cql3: prepared statements cache on top of loading_cache
  utils::loading_cache: make the size limitation more strict
  utils::loading_cache: added static_asserts for checking the callbacks signatures
  utils::loading_cache: add a bunch of standard synchronous methods
  utils::loading_cache: add the ability to create a cache that would not reload the values
  utils::loading_cache: add the ability to work with not-copy-constructable values
  utils::loading_cache: add EntrySize template parameter
  utils::loading_cache: rework on top of utils::loading_shared_values
  sstables::shared_index_list: use utils::loading_shared_values
  utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
2018-02-22 10:04:41 +00:00
Tomasz Grabiec
9af9ca0d60 tests: mutation_source_tests: Fix use-after-scope on partition range
Message-Id: <1506096881-3076-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit d11d696072)
2018-02-19 14:30:47 +00:00
Avi Kivity
fbc30221b5 row_cache_test: remove unused overload populate_range()
Breaks the build.
2018-02-11 17:01:31 +02:00
Shlomi Livne
d17aa3cd1c release: prepare for 2.0.3
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-02-08 18:11:35 +02:00
Vlad Zolotarov
f7e79322f1 tests: loading_cache_test: initial commit
Test utils::loading_shared_values and utils::loading_cache.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 15:28:45 -05:00
Vlad Zolotarov
e31331bdb2 utils + cql3: use a functor class instead of std::function
Define value_extractor_fn as a functor class instead of std::function.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 15:28:45 -05:00
Vlad Zolotarov
6873e26060 cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
- Transition the prepared statements caches for both CQL and Thrift to the cql3::prepared_statements_cache class.
   - Add the corresponding metrics to the query_processor:
      - Evictions count.
      - Current entries count.
      - Current memory footprint.

Fixes #2474

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 15:28:35 -05:00
Vlad Zolotarov
24bee2c887 transport::server::process_prepare() don't ignore errors on other shards
If storing of the statement fails on any shard we should fail the whole PREPARE
request.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1502325392-31169-13-git-send-email-vladz@scylladb.com>
2018-02-02 13:27:34 -05:00
Vlad Zolotarov
8bba15a709 cql3: prepared statements cache on top of loading_cache
This is a template class that implements caching of prepared statements for a given ID type:
   - Each cache instance is given 1/256 of the total shard memory. If a new entry is going to overflow
     this memory limit, the least recently used entries are evicted so that the new entry can
     be added.
   - The memory consumption of a single prepared statement is defined by a cql3::prepared_cache_entry_size
     functor class that returns the number of bytes for a given prepared statement (currently returns 10000
     bytes for any statement).
   - A cache entry is evicted if not used for 60 minutes or more.
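A minimal sketch of a size-limited LRU cache in the spirit described above. All names, the byte budget, and the fixed 10000-byte per-entry cost are illustrative assumptions, not the actual loading_cache implementation:

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical sketch: bounded-memory cache that evicts the least recently
// used entries when a new entry would overflow the budget.
class prepared_cache_sketch {
    size_t _max_size;                                     // memory budget, bytes
    size_t _size = 0;
    std::list<std::pair<std::string, std::string>> _lru;  // front = most recent
    std::unordered_map<std::string, decltype(_lru)::iterator> _index;
    static constexpr size_t entry_size = 10000;           // fixed cost per statement
public:
    explicit prepared_cache_sketch(size_t max_size) : _max_size(max_size) {}

    void put(const std::string& id, const std::string& stmt) {
        if (auto it = _index.find(id); it != _index.end()) {
            _lru.erase(it->second);                       // replace existing entry
            _index.erase(it);
            _size -= entry_size;
        }
        while (_size + entry_size > _max_size && !_lru.empty()) {
            _index.erase(_lru.back().first);              // evict least recently used
            _lru.pop_back();
            _size -= entry_size;
        }
        _lru.emplace_front(id, stmt);
        _index[id] = _lru.begin();
        _size += entry_size;
    }

    bool contains(const std::string& id) const { return _index.count(id) != 0; }
    size_t size() const { return _index.size(); }
};
```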

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:37:00 -05:00
Vlad Zolotarov
ad68d3ecfd utils::loading_cache: make the size limitation more strict
Ensure that the size of the cache is never bigger than the "max_size".

Before this patch the size of the cache could have been indefinitely bigger than
the requested value during the refresh time period, which is clearly undesirable
behaviour.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:36:56 -05:00
Vlad Zolotarov
707ac9242e utils::loading_cache: added static_asserts for checking the callbacks signatures
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:36:53 -05:00
Vlad Zolotarov
dae0563ff8 utils::loading_cache: add a bunch of standard synchronous methods
Add a few standard synchronous methods to the cache, e.g. find(), remove_if(), etc.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:36:43 -05:00
Vlad Zolotarov
7ba50b87f1 utils::loading_cache: add the ability to create a cache that would not reload the values
Sometimes we don't want the cached values to be periodically reloaded.
This patch adds the ability to control this using a ReloadEnabled template parameter.

In case the reloading is not needed the "loading" function is not given to the constructor
but rather to the get_ptr(key, loader) method (currently it's the only method that is used, we may add
the corresponding get(key, loader) method in the future when needed).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:35:25 -05:00
Vlad Zolotarov
1da277c78e utils::loading_cache: add the ability to work with not-copy-constructable values
The current get(...) interface restricts the cache to work only with copy-constructible
values (it returns future<Tp>).
To make it able to work with non-copyable values we need to introduce an interface that
returns something like a reference to the cached value (like regular containers do).

We can't return future<Tp&> since the caller would have to ensure somehow that the underlying
value is still alive. A much safer and easier-to-use way is to return a shared_ptr-like
pointer to that value.

"Luckily" for us, the value we actually store in the cache is already wrapped in an lw_shared_ptr,
so we may simply return an object that impersonates itself as a smart_pointer<Tp> value while
it keeps a "reference" to the object stored in the cache.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:35:21 -05:00
Vlad Zolotarov
36dfd4b990 utils::loading_cache: add EntrySize template parameter
Allow a variable entry size parameter.
Provide an EntrySize functor that would return a size for a
specific entry.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:35:19 -05:00
Vlad Zolotarov
c5ce2765dc utils::loading_cache: rework on top of utils::loading_shared_values
Get rid of the "proprietary" solution for asynchronous values on-demand loading.
Use utils::loading_shared_values instead.

We would still need to maintain intrusive set and list for efficient shrink and invalidate
operations, but their entry is no longer going to contain the actual key and value,
but rather a loading_shared_values::entry_ptr, which is essentially a shared pointer to a key-value
pair.

In general, we added another level of dereferencing in order to get the key value, but since
we use bi::store_hash<true> in the hook and bi::compare_hash<true> in the bi::unordered_set,
this should not translate into additional set lookup latency.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:35:15 -05:00
Vlad Zolotarov
25ffdf527b sstables::shared_index_list: use utils::loading_shared_values
Since the utils::loading_shared_values API is based on the original shared_index_list,
this change is mostly a drop-in replacement of the corresponding parts.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:33:34 -05:00
Vlad Zolotarov
dbc6d9fe01 utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
Arm the timer with a period that is not greater than either permissions_validity_in_ms
or permissions_update_interval_in_ms, in order to ensure that we are not stuck with
values older than permissions_validity_in_ms.

Fixes #2590

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-02-02 12:17:41 -05:00
Asias He
915683bddd locator: Get rid of assert in token_metadata
In commit 69c81bcc87 (repair: Do not allow repair until node is in
NORMAL status), we saw a coredump due to an assert in
token_metadata::first_token_index.

Throw an exception instead of aborting the whole scylla process.
Message-Id: <c110645cee1ee3897e30a3ae1b7ab3f49c97412c.1504752890.git.asias@scylladb.com>

(cherry picked from commit 0ec574610d)
2018-01-29 15:28:44 +02:00
Avi Kivity
7ca8988d0e Update seastar submodule
* seastar d37cf28...e7facd4 (11):
  > tls_test: Fix echo test not setting server trust store
  > tls: Actually verify client certificate if requested
  > tls: Do not restrict re-handshake to client
  > Work around GCC 5 bug: scylladb/seastar#338, scylladb/seastar#339
  > tls: Make put/push mechanism operate "opportunisticly"
  > tls: Move handshake logic outside connect/accept
  > tls: Make sure handshake exceptions are futurized
  > tls: Guard non-established sockets in sesrefs + more explicit close + states
  > tls: Make vec_push fully exception safe
  > net/tls: explicitly ignore ready future during shutdown
  > tls: remove unneeded lambda captures

Fixes #3072
2018-01-28 14:46:06 +02:00
Paweł Dziepak
383d7e6c91 Update scylla-ami submodule
* scylla-ami be90a3f...fa2461d (1):
  > Update Amazon kernel packages release stream to 2017.09
2018-01-24 13:31:04 +00:00
Avi Kivity
7bef696ee5 Update seastar submodule
* seastar 66eb33f...d37cf28 (1):
  > Update dpdk submodule

Fixes build with glibc 2.25 (see scylladb/dpdk#3).
2018-01-18 13:52:39 +02:00
Avi Kivity
a603111a85 Merge "Fix memory leak on zone reclaim" from Tomek
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120"

* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
  tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
  lsa: Expose max_zone_segments for tests
  lsa: Expose tracker::non_lsa_used_space()
  lsa: Fix memory leak on zone reclaim

(cherry picked from commit 4ad212dc01)
2018-01-16 15:54:54 +02:00
Asias He
d5884d3c7c storage_service: Do not wait for restore_replica_count in handle_state_removing
The call chain is:

storage_service::on_change() -> storage_service::handle_state_removing()
-> storage_service::restore_replica_count() -> streamer->stream_async()

Listeners run as part of gossip message processing, which is serialized.
This means we won't be processing any gossip messages until streaming
completes.

In fact, there is no need to wait for restore_replica_count to complete,
which can take a long time, since when it completes, this node will send
notification to tell the removal_coordinator that the restore process is
finished on this node. This node will be removed from _replicating_nodes
on the removal_coordinator.

Tested with update_cluster_layout_tests.py

Fixes #2886

Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com>
(cherry picked from commit 5107b6ad16)
2018-01-16 11:38:08 +02:00
Tomasz Grabiec
e6cb685178 range_tombstone_list: Fix insert_from()
end_bound was not updated in one of the cases in which end and
end_kind were changed; as a result, later merging decisions using
end_bound were incorrect: end_bound was using the new key, but the old
end_kind.

Fixes #3083.
Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit dfe48bbbc7)
2017-12-21 09:29:59 +01:00
Vlad Zolotarov
cd19e5885a messaging_service: fix multi-NIC support
Don't enforce the outgoing connections from the 'listen_address'
interface only.

If 'local_address' is given to connect() it will enforce it to use a
particular interface to connect from, even if the destination address
should be accessed from a different interface. If we don't specify the
'local_address' the source interface will be chosen according to the
routing configuration.

Fixes #3066

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit be6f8be9cb)
2017-12-17 10:52:19 +02:00
Takuya ASADA
f367031016 dist/common/systemd: specify correct repo file path for housekeeping service on Ubuntu/Debian
Currently scylla-housekeeping-daily.service/-restart.service hardcode
"--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify the CentOS .repo file,
but we use the same .service files for Ubuntu/Debian.
This doesn't work correctly; we need to specify the .list file for Debian variants.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit c2e87f4677)
2017-12-16 22:05:19 +02:00
Glauber Costa
7ae67331ad database: delete created SSTables if streaming writes fail
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.

We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same issue.

Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.

Fixes #3062

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
(cherry picked from commit 1aabbc75ab)
2017-12-13 10:12:35 +02:00
Jesse Haber-Kucharsky
0e6561169b cql3: Add missing return
Since `return` is missing, the "else" branch is also taken and this
results in a user being created from scratch.

Fixes #3058.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bf3ca5907b046586d9bfe00f3b61b3ac695ba9c5.1512951084.git.jhaberku@scylladb.com>
(cherry picked from commit 7e3a344460)
2017-12-11 09:55:46 +02:00
Paweł Dziepak
5b3aa8e90d Merge "Fix range tombstone emitting which led to skipping over data" from Tomasz
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.

Introduced in 2.0

Fixes #3053."

* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add reproducer for issue #3053
  tests: mvcc: Add test for partition_snapshot::range_tombstones()
  mvcc: Optimize partition_snapshot::range_tombstones() for single version case
  mvcc: Fix partition_snapshot::range_tombstones()
  tests: random_mutation_generator: Do not emit dummy entries at clustering row positions

(cherry picked from commit 051cbbc9af)
(cherry picked from commit be5127388d)

[tgrabiec: dropped mvcc_test change, because the file does not exist here]
2017-12-08 14:38:24 +01:00
Tomasz Grabiec
db9d502f82 tests: simple_schema: Add new_tombstone() helper
(cherry picked from commit 204ec9c673)
2017-12-08 14:18:42 +01:00
Amos Kong
cde39bffd0 dist/debian: add scylla-tools-core to depends list
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <db39cbda0e08e501633556ab238d816e357ad327.1512646123.git.amos@scylladb.com>
(cherry picked from commit 8fd5d27508)
2017-12-07 13:42:24 +02:00
Amos Kong
0fbcc852a5 dist/redhat: add scylla-tools-core to requires list
Fixes #3051

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f7013a4fbc241bb4429d855671fee4b845b255cd.1512646123.git.amos@scylladb.com>
(cherry picked from commit eb3b138ee2)
2017-12-07 13:42:17 +02:00
Duarte Nunes
16d5f68886 thrift/server: Handle exception within gate
The exception handling code inspects server state, which could be
destroyed before the handle_exception() task runs since it runs after
exiting the gate. Move the exception handling inside the gate and
avoid scheduling another accept if the server has been stopped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171116122921.21273-1-duarte@scylladb.com>
(cherry picked from commit 34a0b85982)
2017-12-07 13:07:11 +02:00
Pekka Enberg
91540c8181 Update seastar submodule
* seastar 0dbedf0...66eb33f (1):
  > core/gate: Add is_closed() function
2017-12-07 13:06:39 +02:00
Raphael S. Carvalho
eaa8ed929f thrift: fix compilation error
thrift/server.cc:237:6:   required from here
thrift/server.cc:236:9: error: cannot call member function ‘void thrift_server::maybe_retry_accept(int, bool, std::__exception_ptr::exception_ptr)’ without object
         maybe_retry_accept(which, keepalive, std::move(ex));

gcc version: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113184537.10472-1-raphaelsc@scylladb.com>
(cherry picked from commit 564046a135)
2017-12-07 12:55:34 +02:00
Avi Kivity
5ba1621716 Merge "thrift/server: Ensure stop() waits for accepts" from Duarte
"Ensure stop() waits for the accept loop to complete to avoid crashes
during shutdown."

* 'thrift-server-stop/v4' of https://github.com/duarten/scylla:
  thrift/server: Restore code format
  thrift/server: Stopping the server waits for connection shutdown
  thrift/server: Abort listeners on stop()
  thrift/server: Avoid manual memory management
  thrift/server: Add move ctor for connection
  thrift/server: Extract retry logic
  thrift/server: Retry with backoff for some error types
  thrift/server: Retry accept in case of error

(cherry picked from commit 061f6830fa)
2017-12-07 12:54:45 +02:00
Avi Kivity
b4f515035a build: disable -fsanitize-address-use-after-scope on CqlParser.o
The parser generator somehow confuses the use-after-scope sanitizer, causing it
to use large amounts of stack space. Disable that sanitizer on that file.
Message-Id: <20170905110628.18047-1-avi@scylladb.com>

(cherry picked from commit 4751402709)
2017-12-04 12:41:05 +02:00
Avi Kivity
d55e3f6a7f build: fix excessive stack usage in CqlParser in debug mode
The state machines generated by antlr allocate many local variables per function.
In release mode, the stack space occupied by the variables is reused, but in debug
build, it is not, due to Address Sanitizer setting -fstack-reuse=none. This causes
a single function to take above 100k of stack space.

Fix by hacking the generated code to use just one variable.

Fixes #2546
Message-Id: <20170704135824.13225-1-avi@scylladb.com>

(cherry picked from commit a6d9cf09a7)
2017-12-04 12:39:27 +02:00
Shlomi Livne
07b039feab release: prepare for 2.0.2
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-11-30 21:02:42 +02:00
Avi Kivity
35b7353efd Update seastar submodule
* seastar 0489655...0dbedf0 (1):
  > fstream: do not ignore dma_write return value
2017-11-30 17:36:36 +02:00
Duarte Nunes
200e01cc31 compound_compact: Change universal reference to const reference
The universal reference was introduced so we could bind an rvalue to
the argument, but it would have sufficed to make the argument a const
reference. This is also more consistent with the function's other
overload.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171129132758.19654-1-duarte@scylladb.com>
(cherry picked from commit cda3ddd146)
(cherry picked from commit 106c69ad45)
2017-11-30 16:51:16 +02:00
Tomasz Grabiec
b5c4cf2d87 Merge "compact_storage serialization fixes" from Duarte
Fix two issues with serializing non-compound range tombstones as
compound: convert a non-compound clustering element to compound and
actually advertise the issue to other nodes.

* git@github.com:duarten/scylla.git  rt-compact-fixes/v1:
  compound_compact: Allow rvalues in size()
  sstables/sstables: Convert non-compound clustering element to compound
  tests/sstable_mutation_test: Verify we can write/read non-correct RTs
  service/storage_service: Export non-compound RT feature

(cherry picked from commit e9cce59b85)
(cherry picked from commit 740fcc73b8)

Undid changes to size_estimates_virtual_reader
Changed test from memtable::make_flat_reader to memtable::make_reader
2017-11-30 16:50:15 +02:00
Duarte Nunes
f96cb361aa tests: Initialize storage service for some tests
These tests now require having the storage service initialized, which
is needed to decide whether correct non-compound range tombstones
should be emitted or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126152921.5199-1-duarte@scylladb.com>
(cherry picked from commit 922f095f22)
(cherry picked from commit 8567723a7b)
2017-11-30 16:27:43 +02:00
Duarte Nunes
bd59d7c968 cql3/delete_statement: Allow non-range deletions on non-compound schemas
This patch fixes a regression introduced in
1c872e2ddc.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126102333.3736-1-duarte@scylladb.com>
(cherry picked from commit 15fbb8e1ca)
(cherry picked from commit b0b7c73acd)
2017-11-30 16:22:53 +02:00
Tomasz Grabiec
9d923a61e1 Merge "Fixes to sstable files for non-compound schemas" from Duarte
This series mainly fixes issues with the serialization of promoted
index entries for non-compound schemas and with the serialization of
range tombstones, also for non-compound schemas.

We lift the correct cell name writing code into its own function,
and direct all users to it. We also ensure backward compatibility with
incorrectly generated promoted indexes and range tombstones.

Fixes #2995
Fixes #2986
Fixes #2979
Fixes #2992
Fixes #2993

* git@github.com:duarten/scylla.git  promoted-index-serialization/v3:
  sstables/sstables: Unify column name writers
  sstables/sstables: Don't write index entry for a missing row maker
  sstables/sstables: Reuse write_range_tombstone() for row tombstones
  sstables/sstables: Lift index writing for row tombstones
  sstables/sstables: Leverage index code upon range tombstone consume
  sstables/sstables: Move out tombstone check in write_range_tombstone()
  sstables/sstables: A schema with static columns is always compound
  sstables/sstables: Lift column name writing logic
  sstables/sstables: Use schema-aware write_column_name() for
    collections
  sstables/sstables: Use schema-aware write_column_name() for row marker
  sstables/sstables: Use schema-aware write_column_name() for static row
  sstables/sstables: Writing promoted index entry leverages
    column_name_writer
  sstables/sstables: Add supported feature list to sstables
  sstables/sstables: Don't use incorrectly serialized promoted index
  cql3/single_column_primary_key_restrictions: Implement is_inclusive()
  cql3/delete_statement: Constrain range deletions for non-compound
    schemas
  tests/cql_query_test: Verify range deletion constraints
  sstables/sstables: Correctly deserialize range tombstones
  service/storage_service: Add feature for correct non-compound RTs
  tests/sstable_*: Start the storage service for some cases
  sstables/sstable_writer: Prepare to control range tombstone
    serialization
  sstables/sstables: Correctly serialize range tombstones
  tests/sstable_assertions: Fix monotonicity check for promoted indexes
  tests/sstable_assertions: Assert a promoted index is empty
  tests/sstable_mutation_test: Verify promoted index serializes
    correctly
  tests/sstable_mutation_test: Verify promoted index repeats tombstones
  tests/sstable_mutation_test: Ensure range tombstone serializes
    correctly
  tests/sstable_datafile_test: Add test for incorrect promoted index
  tests/sstable_datafile_test: Verify reading of incorrect range
    tombstones
  sstables/sstable: Rename schema-oblivious write_column_name() function
  sstables/sstables: No promoted index without clustering keys
  tests/sstable_mutation_test: Verify promoted index is not generated
  sstables/sstables: Optimize column name writing and indexing
  compound_compat: Don't assume compoundness

(cherry picked from commit bd1efbc25c)

Also added sstables::make_sstable() to preserve source compatibility in tests.
2017-11-30 16:21:13 +02:00
Tomasz Grabiec
0b23bcbe29 sstables: index_reader: Reset lower bound for promoted index lookups from advance_to_next_partition()
_current_pi_idx was not reset from advance_to_next_partition(), which
is used when we skip to the next partition before fully consuming
it. As a result, if we try to skip to a clustering position which is
before the index block used by the last skip in the previous
partition, we would not skip, assuming that the new position is in the
current block. This may result in more data being read from the
sstable than necessary.

Fixes #2984
Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 2113299b61)
2017-11-21 14:25:49 +01:00
Duarte Nunes
b1899f000a db/view: Use view schema for view pk operations
Instead of base schema.

Fixes #2504

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718190703.12972-1-duarte@scylladb.com>
(cherry picked from commit 115ff1095e)
2017-11-15 20:41:24 +00:00
Tomasz Grabiec
7b19167cbd gossiper: Replicate endpoint_state::is_alive()
Broken in f570e41d18.

Not replicating this may cause the coordinator to treat a node which is
down as alive, or vice versa.

Fixes regression in dtest:

  consistency_test.py:TestAvailability.test_simple_strategy

which was expected to get "unavailable" exception but it was getting a
timeout.

Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7323fe76db)
2017-11-14 16:21:33 +02:00
Shlomi Livne
164f97fd88 release: prepare for 2.0.1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-11-10 16:37:29 +02:00
Paweł Dziepak
b3bc82bc77 Merge "Fix exception safety related to range tombstones in cache" from Tomasz
Fixes #2938.

* 'tgrabiec/fix-range-tombstone-list-exception-safety-v1' of github.com:scylladb/seastar-dev:
  tests: range_tombstone_list: Add test for exception safety of apply()
  tests: Introduce range_tombstone_list assertions
  cache: Make range tombstone merging exception-safe
  range_tombstone_list: Introduce apply_monotonically()
  range_tombstone_list: Make reverter::erase() exception-safe
  range_tombstone_list: Fix memory leaks in case of bad_alloc
  mutation_partition: Fix abort in case range tombstone copying fails
  managed_bytes: Declare copy constructor as allocation point
  Integrate with allocation failure injection framework

(cherry picked from commit 5a4b46f555)
2017-11-10 13:10:26 +01:00
Tomasz Grabiec
8f6ffb0487 Update seastar submodule
* seastar 124467d...0489655 (9):
  > alloc_failure_injector: Fix compilation error with gcc 7.1
  > alloc_failure_injector: Replace set_alloc_failure_callback() with run_with_callback()
  > alloc_failure_injector: Log backtrace of failures
  > alloc_failure_injector: Extract fail()
  > util: Introduce support for allocation failure injection
  > noncopyable_function: improve support for capturing mutable lambdas
  > noncopyable_function add bool operator
  > test.py: fix typo in noncopyable_function_test
  > utils: introduce noncopyable_function
2017-11-10 13:10:26 +01:00
Raphael S. Carvalho
4bba0c403e compaction: Make resharding go through compaction manager
Two reasons for this change:
1) every compaction should be multiplexed to manager which in turn
will make decision when to schedule. improvements on it will
immediately benefit every existing compaction type.
2) active tasks metric will now track ongoing reshard jobs.

Fixes #2671.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
(cherry picked from commit 10eaa2339e)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171103012758.19428-1-raphaelsc@scylladb.com>
2017-11-07 15:23:45 +02:00
Calle Wilund
59aae504ae storage_service: Only replicate token metadata iff modified in on_change
Fixes #2869

Message-Id: <20171101105629.22104-1-calle@scylladb.com>
(cherry picked from commit 8c257c40b4)
2017-11-07 14:46:36 +02:00
Avi Kivity
941e5eef4f Merge "gossip backport for 2.0" from Asias
"This series backports both the cleanup series from Duarte and large cluster
fixes series from Tomek."

* tag 'gossip-2.0-backport-v2' of github.com:scylladb/seastar-dev:
  Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz
  utils: introduce loading_shared_values
  tests/serialized_action: add missing forced defers
  utils: Introduce serialized_action
  Merge "gms/gossiper: Multiple cleanups" from Duarte
  storage_service: Do not use c_str() in the logger
  gms/gossiper: Introduce copy-less endpoint_state::get_application_state_ptr()
  Merge "gossiper: Optimize endpoint_state lookup" from Duarte
  gms: Add is_shutdown helper for endpoint_state class
  gossip: Better check for gossip stabilization on startup
  Merge "Fix miss opportunity to update gossiper features" from Asias
  gossip: Fix a log message typo in compare_endpoint_startup
  gossip: Do not use c_str() in the logger
  Revert "gossip: Make bootstrap more robust"
  gossip: Switch to seastar::lowres_system_clock
  gossip: Use unordered_map for _unreachable_endpoints and _shadow_unreachable_endpoints
  gossip: Introduce the shadow_round_ms option
2017-11-07 13:44:20 +02:00
Duarte Nunes
b12a2e6b08 Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having an up-to-date view of the
ring, and serving incorrect data.

Fixes #2855."

* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
  gms/gossiper: Remove periodic replication of endpoint state map
  gossiper: Check for features in the change listener
  gms/gossiper: Replicate changes incrementally to other shards
  gms/gossiper: Document validity of endpoint_state properties
  storage_service: Update token_metadata after changing endpoint_state
  gms/gossiper: Process endpoints in parallel
  gms/gossiper: Serialize state changes and notifications for given node
  utils/loading_shared_values: Allow Loader to return non-future result
  gms/gossiper: Encapsulate lookup of endpoint_state
  storage_service: Batch token metadata and endpoint state replication
  utils/serialized_action: Introduce trigger_later()
  gossiper: Add and improve logging
  gms/gossiper: Don't fire change listeners when there is no change
  gms/gossiper: Allow parallel apply_state_locally()
  gms/gossiper: Avoid copies in endpoint_state::add_application_state()
  gms/failure_detector: Ignore short update intervals

(cherry picked from commit 044b8deae4)
2017-11-07 19:21:30 +08:00
Vlad Zolotarov
eb34937ff6 utils: introduce loading_shared_values
This class implements a key-value container that is populated
using the provided asynchronous callback.

The value is loaded when there are active references to the value for the given key.

The container ensures that only one entry is loaded per key at any given time.

The returned value is a lw_shared_ptr to the actual value.

The value for a specific key is immediately evicted when there are no
more references to it.

The container is based on the boost::intrusive::unordered_set and is rehashed (grown) if needed
every time a new value is added (asynchronously loaded).

The container has a rehash() method that would grow or shrink the container as needed
in order to get the load factor into the [0.25, 0.75] range.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit ec3fed5c4d)
2017-11-07 19:21:29 +08:00
Paweł Dziepak
127ffff9e5 tests/serialized_action: add missing forced defers
serialized_action_tests depends on the fact that first part of the
serialized_action is executed at certain points (in which it reads a
global variable that is later updated by the main thread).
This worked well in release mode, before ready continuations were
inlined and run immediately, but not in debug mode, since inlining
was not happening and the main seastar::thread was missing some yield
points.
Message-Id: <20170731103013.26542-1-pdziepak@scylladb.com>

(cherry picked from commit e970630272)
2017-11-07 19:21:29 +08:00
Tomasz Grabiec
f06bb656a4 utils: Introduce serialized_action
(cherry picked from commit 6a3703944b)

Conflicts:
	configure.py
2017-11-07 19:21:29 +08:00
Pekka Enberg
cc84cd60c5 Merge "gms/gossiper: Multiple cleanups" from Duarte
"Based on the functions get_endpoint_state_for_endpoint_ptr(),
get_application_state_ptr() and
endpoint_state::get_application_state_ptr(), this series
cleans up miscellaneous functions related to the gossiper.

It not only removes duplicated code, but also avoids many copies.

All pointer usages have been audited for safety."

Acked-by: Asias He <asias@scylladb.com>
Acked-by: Tomasz Grabiec <tgrabiec@scylladb.com>

* 'gossiper-cleanup/v2' of github.com:duarten/scylla: (27 commits)
  gms/endpoint_state: Remove get_application_state()
  service/storage_service: Avoid copies in prepare_replacement_info()
  service/storage_service: Cleanup get_application_state_value()
  service/storage_service: Cleanup handle_state_removing()
  service/storage_service: Cleanup get_rpc_address()
  locator/reconnectable_snitch_helper: Avoid versioned_value copies
  locator/production_snitch_base: Cleanup get_endpoint_info()
  service/migration_manager: Avoid copies in is_ready_for_bootstrap()
  service/migration_manager: Cleanup has_compatible_schema_tables_version()
  service/migration_manager: Fix usages of get_application_state()
  cache_hit_rate: Avoid copies in get_hit_rate()
  gms/endpoint_state: Avoid copies in is_shutdown()
  service/load_broadcaster: Avoid copy in on_join()
  gms/gossiper: Cleanup get_supported_features()
  gms/gossiper: Cleanup get_gossip_status()
  gms/gossiper: Cleanup seen_any_seed()
  gms/gossiper: Cleanup get_host_id()
  gms/gossiper: Removed dead uses_vnodes() function
  gms/gossiper: Cleanup uses_host_id()
  gms/gossiper: Add get_application_state_ptr()
  ...

(cherry picked from commit 1701fc2e50)
2017-11-07 19:21:29 +08:00
Asias He
bb6ee1e4b1 storage_service: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <d2f15007a54554b58e29fd05331c06ae030d582f.1504832296.git.asias@scylladb.com>
(cherry picked from commit bb9dbc5ade)
2017-11-07 19:21:28 +08:00
Tomasz Grabiec
ce72299a38 gms/gossiper: Introduce copy-less endpoint_state::get_application_state_ptr()
Message-Id: <1507642411-28680-3-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 66a15ccd18)
2017-11-07 19:21:28 +08:00
Avi Kivity
ef39bc4216 Merge "gossiper: Optimize endpoint_state lookup" from Duarte
"gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive. This
series introduces a function that returns a pointer and avoids
the copy.

Fixes #764"

* 'endpoint-state/v2' of https://github.com/duarten/scylla:
  gossiper: Avoid endpoint_state copies
  endpoint_state: const-qualify functions
  storage_service: Remove duplicate endpoint state check

(cherry picked from commit 4ad3900d8d)
2017-11-07 19:21:28 +08:00
Asias He
fb03f9a73c gms: Add is_shutdown helper for endpoint_state class
It will be used by streaming manager to check if a node is in shutdown
status.

(cherry picked from commit ed7e6974d5)
2017-11-07 19:21:27 +08:00
Asias He
025af9d297 gossip: Better check for gossip stabilization on startup
This is a backport of Apache CASSANDRA-9401
(2b1e6aba405002ce86d5badf4223de9751bf867d)

It is better to check that the number of nodes in the endpoint_state_map
is not changing when waiting for gossip stabilization.

Fixes #2853
Message-Id: <e9f901ac9cadf5935c9c473433dd93e9d02cb748.1506666004.git.asias@scylladb.com>

(cherry picked from commit c0b965ee56)
2017-11-07 19:21:27 +08:00
Tomasz Grabiec
09da4f3d08 Merge "Fix miss opportunity to update gossiper features" from Asias
The gossiper checks if features should be enabled from its timer
callback when it detects that endpoint_state_map changed, that is
different than shadow_endpoint_state_map.

shadow_endpoint_state_map is also assigned from endpoint_state_map in
storage_service::replicate_tm_and_ep_map(), called from
storage_service::on_change()

Call gossiper:maybe_enable_features() in replicate_tm_and_ep_map so
that we won't miss gossip feature update.

Fixes #2824

* git@github.com:scylladb/seastar-dev asias/gossip_miss_feature_update_v1:
  gossip: Move the _features_condvar signal code to
    maybe_enable_features
  gossip: Make maybe_enable_features public
  storage_service: Check gossip feature update in
    replicate_tm_and_ep_map

(cherry picked from commit 02d41864af)
2017-11-07 19:21:27 +08:00
Asias He
0d22b0c949 gossip: Fix a log message typo in compare_endpoint_startup
Message-Id: <c4958950e1108082b63e08ab81ee2177edc9b232.1505286843.git.asias@scylladb.com>
(cherry picked from commit fa9d47c7f3)
2017-11-07 19:21:27 +08:00
Asias He
5c5296a683 gossip: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <52c24d7dfe082ee926f065a6268d83fcb31ddc28.1504832289.git.asias@scylladb.com>
(cherry picked from commit 57dd3cb2c5)
2017-11-07 19:21:27 +08:00
Asias He
5ecb07bbc4 Revert "gossip: Make bootstrap more robust"
This reverts commit b56ba02335.

After commit 8fa35d6ddf (messaging_service: Get rid of timeout and retry
logic for streaming verb), streaming verb in rpc does not check if a
node is in gossip membership since all the retry logic is removed.

Remove the extra wait before removing the joining node from gossip
membership.

Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>
(cherry picked from commit cc18da5640)
2017-11-07 19:21:26 +08:00
Asias He
a90023a119 gossip: Switch to seastar::lowres_system_clock
The newly added lowres_system_clock is good enough for gossip
resolution. Switch to use it.

Message-Id: <fe0e7a9ef1ea0caffaa8364afe5c78b6988613bf.1503971833.git.asias@scylladb.com>
(cherry picked from commit a36141843a)
2017-11-07 19:21:26 +08:00
Asias He
258f0a383b gossip: Use unordered_map for _unreachable_endpoints and _shadow_unreachable_endpoints
The _unreachable_endpoints map will soon be accessed in the fast path by the
hinted handoff code.

Message-Id: <500d9cbb2117ab7b070fd1bd111c5590f46c3c3a.1503971826.git.asias@scylladb.com>
(cherry picked from commit 2701bfd1f8)
2017-11-07 19:21:26 +08:00
Asias He
3455bfaa44 gossip: Introduce the shadow_round_ms option
It specifies the maximum gossip shadow round time. It can be used to
reduce the gossip feature check time during node boot up.
For instance, when the first node in the cluster, which lists both
itself and other nodes as seeds in the yaml config, boots up, it will try
to talk to other seed nodes which have not started yet. The gossip shadow
round is used to fetch the feature info of the cluster. Since no other
seed node is up yet, the shadow round will fail. Users can lower the
default shadow_round_ms option to reduce the boot time.

Fixes #2615
Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>

(cherry picked from commit cf6f4a5185)
2017-11-07 19:21:26 +08:00
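As a hedged illustration of the option introduced above, a reduced shadow round timeout would be set in scylla.yaml along these lines (the value shown is an arbitrary example, not the shipped default):

```yaml
# scylla.yaml -- illustrative example only; check your release for the
# actual default. Maximum time, in milliseconds, that the gossip shadow
# round may take during node boot before giving up.
shadow_round_ms: 5000
```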
Avi Kivity
cd0b4903e9 Update seastar submodule
* seastar 7ebbb26...124467d (4):
  > peering_sharded_service: prevent over-run the container
  > sharded: fix move constructor for peering_sharded_service services
  > sharded: improve support for cooperating sharded<> services
  > sharded: support for peer services

Includes change to batchlog_manager constructor to adapt it to
seastar::sharded::start() change.

Needed for gossip backport.
2017-11-07 10:49:01 +02:00
Avi Kivity
4d76f564f8 Update seastar submodule
* seastar e4fcb6c...7ebbb26:
  Warn: seastar doesn't contain commit e4fcb6c27cc5dce70d44472522166abe1af29af6

The submodule link pointed nowhere; point it at the tip of scylla-seastar/branch-2.0.
2017-11-06 13:15:16 +02:00
Tomasz Grabiec
a05d3280c4 Update seastar submodule
* seastar b85b0fa...e4fcb6c (1):
  > log: Print nested exceptions

Refs #1011
2017-11-03 10:22:46 +01:00
Glauber Costa
68c41b2346 dist/redhat: do not use s3 addresses in mock profile
They are uglier and less future-proof.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171019142034.30139-1-glauber@scylladb.com>
2017-10-19 18:11:59 +03:00
Glauber Costa
9c907222c5 dist: point mock profile to the right stream
2.0 builds are currently failing because of that.

Message-Id: <20171018182457.5180-1-glauber@scylladb.com>
2017-10-19 11:37:31 +03:00
Tomasz Grabiec
ba02e2688a cache_streamed_mutation: Read static row with cache region locked
_snp->static_row() allocates and needs reference stability.
Message-Id: <1507555031-11567-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 44faaafc29)

Fixes #2880.
2017-10-10 15:55:08 +02:00
Avi Kivity
fa540581e8 Update ami submodule
* dist/ami/files/scylla-ami 5ffa449...be90a3f (1):
  > amazon kernel: enable updates

Still tracking master branch.
2017-10-02 17:11:30 +03:00
Shlomi Livne
e265c91616 release: prepare for 2.0.0
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-09-29 21:12:18 +03:00
Tomasz Grabiec
8a9f8970e4 Update seastar submodule
Refs #2770.

* seastar d763623...b85b0fa (1):
  > scollectd: increment the metadata iterator with the values
2017-09-28 15:32:26 +02:00
Tomasz Grabiec
1e3c777a10 Update seastar submodule
* seastar c853473...d763623 (1):
  > rpc: make sure that _write_buf stream is always properly closed
2017-09-28 15:03:12 +02:00
Tomasz Grabiec
43d785a177 migration_manager: Make sure schema pulls eventually happen when schema_tables_v3 is enabled
We don't pull schema during a rolling upgrade, that is, until the
schema_tables_v3 feature is enabled on all nodes.

Because features are enabled from the gossiper timer, there is a race
between feature enablement and the processing of endpoint states which
may trigger a schema pull. It can happen that we first try to pull, but
only later enable the feature. In that case the schema pull will not
happen until the next schema change.

The fix is to ensure that pulls abandoned because the feature was not yet
enabled are retried once it is enabled.

Fixes sporadic failure in dtest:

  repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit b704710954)
2017-09-27 12:06:45 +01:00
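The fix described above — remembering an abandoned pull and retrying it once the feature turns on — can be sketched in Python (a toy model; the `FeatureGate` name and API are illustrative, not Scylla's actual code):

```python
class FeatureGate:
    """Toy model of a gossip feature flag that re-runs work which was
    abandoned because the feature was not yet enabled cluster-wide."""

    def __init__(self):
        self.enabled = False
        self._pending = []            # actions abandoned while disabled

    def run_or_defer(self, action):
        if self.enabled:
            action()
        else:
            self._pending.append(action)   # remember it; don't drop it

    def enable(self):
        self.enabled = True
        pending, self._pending = self._pending, []
        for action in pending:             # retry every abandoned pull
            action()

pulls = []
gate = FeatureGate()
gate.run_or_defer(lambda: pulls.append("schema pull"))  # deferred for now
assert pulls == []
gate.enable()                                           # retried here
assert pulls == ["schema pull"]
```

Without the pending list, a pull attempted before `enable()` would simply be lost until the next schema change — the race the commit describes.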
Tomasz Grabiec
b53d3d225d gossiper: Allow waiting for feature to be enabled
Message-Id: <1506428715-8182-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7a58fb5767)
2017-09-27 12:06:37 +01:00
Paweł Dziepak
f9864686d2 Merge "Fix cache reader skipping rows in some cases" from Tomasz
"Fixes the problem of concurrent populations of clustering row ranges
leading to some readers skipping over some of the rows.
Spotted during code review.

Fixes #2834."

* tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev:
  tests: mvcc: Add test for partition_snapshot_row_cursor
  tests: row_cache: Add test for concurrent population
  tests: row_cache: Make populate_range() accept partition_range
  tests: Add simple_schema::make_ckey_range()
  cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
  mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
  mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()
  mvcc: Keep track of all iterators in partition_snapshot_row_cursor
  mvcc: Make partition_snapshot_row_cursor printable

(cherry picked from commit af1976bc30)

[tgrabiec: resolved conflicts]
2017-09-26 19:18:29 +02:00
Tomasz Grabiec
454b90980a streamed_mutation: Allow setting buffer capacity
Needed in tests to limit amount of prefetching done by readers, so
that it's easier to test interleaving of various events.

(cherry picked from commit cb16b038ef)
2017-09-26 19:18:29 +02:00
Tomasz Grabiec
13c66b7145 Update seastar submodule
Fixes #2738.

* seastar e380a07...c853473 (1):
  > httpd: handle exception when shutting down
2017-09-26 18:36:50 +02:00
Asias He
df04418fa4 gossip: Print SCHEMA_TABLES_VERSION correctly
Found this when debugging gossip with debug print. The application state
SCHEMA_TABLES_VERSION was printed as UNKNOWN.
Message-Id: <d7616920d2e6516b5470a758bcf9c88f3d857381.1506391495.git.asias@scylladb.com>

(cherry picked from commit 98e9049820)
2017-09-26 08:39:30 +02:00
Tomasz Grabiec
6e2858a47d storage_service: Register features before joining
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registered, and thus can be enabled, before the
bootstrapping of a non-seed node happens. Otherwise the bootstrap will
hang waiting on a schema sync which will never happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 8e46d15f91)
2017-09-25 09:40:22 +01:00
Tomasz Grabiec
0f79503cf1 storage_service: Extract register_features()
Message-Id: <1505893837-27876-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b92dcb0284)
2017-09-25 09:40:22 +01:00
Tomasz Grabiec
056f6df859 Update seastar submodule
* seastar 06790c0...e380a07 (1):
  > configure: disable exception scalability hack on debug build
2017-09-25 10:13:01 +02:00
Avi Kivity
7833129ab4 Merge "row_cache: Call fast_forward_to() outside allocating section" from Tomasz
"On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.

We shouldn't call fast_forward_to() with the cache region locked.

Fixes #2791."

* 'tgrabiec/dont-ffwd-in-alloc-section' of github.com:scylladb/seastar-dev:
  cache_streamed_mutation: De-futurize cursor movement
  cache_streamed_mutation: Call fast_forward_to() outside allocating section
  cache_streamed_mutation: Switch from flags to explicit state machine

(cherry picked from commit 5b0cb28af9)

[tgrabiec: resolved minor conflicts]
2017-09-20 10:23:08 +02:00
Asias He
c3c5ec1d4a gossip: Fix indentation in apply_state_locally
Message-Id: <2bdefa8d982ad8da7452b41e894f41d865b83b0b.1505356245.git.asias@scylladb.com>
(cherry picked from commit 5ff0b113c9)
2017-09-19 22:59:46 +08:00
Asias He
7839cebc6c gossip: Use boost::copy_range in apply_state_locally
boost::copy_range is better because the vector is allocated with the
correct size instead of growing when the inserter is called.

[avi: also crashes less]

Message-Id: <b19ca92d56ad070fca1e848daa67c00c024e3a4d.1505291199.git.asias@scylladb.com>
(cherry picked from commit c84dcabb8f)
2017-09-19 22:59:46 +08:00
Pekka Enberg
e428d06f40 Merge "gossip: optimize apply_state_locally for large cluster" from Asias
"This series tries to improve the bootstrap of a node in a large cluster by
improving how gossip applies node state. In #2404, the joining node
failed to bootstrap because it did not see the seed node when
storage_service::bootstrap ran. After this series, we apply the whole gossip
state contained in one gossip ack/ack2 message before applying the next one,
and we apply the state of seed nodes earlier than that of non-seed nodes so
we have the seed nodes' state faster. We also add some randomness to the
order of applying gossip node state to prevent some nodes' state from always
being applied earlier than the others.

This series improves apply_state_locally for large cluster:

 - Tune the order of applying endpoint_state
 - Serialize apply_state_locally
 - Avoid copying of the gossip state map

Fixes #2404"

* tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev:
  gossip: Avoid copying with apply_state_locally
  gossip: Serialize apply_state_locally
  gossip: Tune the order of applying endpoint_state in apply_state_locally
  gossip: Introduce is_seed helper
  gossip: Pass const endpoint_state& in notify_failure_detector
  gossip: Pass reference in notify_failure_detector

(cherry picked from commit d2632ddf1d)
2017-09-19 22:59:45 +08:00
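The ordering tweak described in the merge above can be sketched in Python (illustrative only; the real code is C++ inside the gossiper): seed nodes' states are applied first, with randomness inside each group so no node's state is consistently applied last.

```python
import random

def apply_order(endpoints, seeds):
    """Return endpoints with seeds first; shuffle within each group so
    no particular node's state is always applied before the others."""
    seed_part = [ep for ep in endpoints if ep in seeds]
    other_part = [ep for ep in endpoints if ep not in seeds]
    random.shuffle(seed_part)
    random.shuffle(other_part)
    return seed_part + other_part

order = apply_order(["a", "b", "c", "d"], seeds={"c"})
assert order[0] == "c"                       # the only seed comes first
assert sorted(order) == ["a", "b", "c", "d"] # nothing lost or duplicated
```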
Asias He
6834ba16a3 gossip: Do not wait for echo message in mark_alive
gossiper::apply_state_locally() calls handle_major_state_change() for
each endpoint, in a seastar thread, which calls mark_alive() for new
nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously
waits for each node to respond before it moves on to the next entry. As
a result it may take a while before the whole state is processed.

Apache (tm) Cassandra (tm) sends echoes in the background.

In a large cluster, we see that by the time the joining node starts
streaming, it hasn't managed to apply all the endpoint_state for peer
nodes, so the joining node does not know some of the nodes yet, which
results in the joining node skipping streaming from some of the existing
nodes.

Fixes #2787
Fixes #2797

Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>
(cherry picked from commit 8f8273969d)
2017-09-19 17:12:26 +03:00
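The change above — not blocking the state-application loop on each echo round-trip — can be sketched with plain Python futures (a toy model, not seastar code; `send_echo` is a stand-in for the real RPC):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def send_echo(node):
    time.sleep(0.05)          # stand-in for a network round trip
    return node

def mark_alive_all(nodes, pool):
    """Dispatch all echoes in the background instead of .get()-ing
    each one before moving on to the next endpoint."""
    return [pool.submit(send_echo, n) for n in nodes]

with ThreadPoolExecutor() as pool:
    start = time.monotonic()
    futures = mark_alive_all(["n1", "n2", "n3", "n4"], pool)
    dispatch_time = time.monotonic() - start   # returns almost immediately
    results = [f.result() for f in futures]

assert dispatch_time < 0.05                    # loop did not wait per node
assert results == ["n1", "n2", "n3", "n4"]
```

The synchronous version would take at least `0.05 * len(nodes)` seconds before the loop finished; dispatching in the background lets state application proceed while the echoes are in flight.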
Shlomi Livne
0b49cfcf12 release: prepare for 2.0.rc5
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-09-17 17:05:53 +03:00
Duarte Nunes
5e8c9a369e Merge 'Fix schema version mismatch during rolling upgrade from 1.7' from Tomasz
"When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema
for some reason, reads or writes which involve both 1.7 and 2.0 nodes may
start to fail with the following error logged:

    storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d

The situation should heal after whole cluster is upgraded.

Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes
due to change in the schema tables format. Mismatch is meant to be avoided by
having 2.0 nodes calculate the old digest on schema migration during upgrade,
and use that version until next time the table is altered. It is thus not
allowed to alter tables during the rolling upgrade.

Two 2.0 nodes may exchange schema, if they detect through gossip that their
schema versions don't match. They may not match temporarily during boot, until
the upgraded node completes the bootstrap and propagates its new schema
through gossip. One source of such temporary mismatch is construction of new
tracing tables, which didn't exist on 1.7. Such schema pull will result in a
schema merge, which causes all tables to be altered and their schema version to
be recalculated. The new schema will not match the one used by 1.7 nodes,
causing reads and writes to fail, because schema requesting won't work during
rolling upgrade from 1.7 to 2.0.

The main fix employed here is to hold schema pulls, even among 2.0 nodes,
until rolling upgrade is complete."

Fixes #2802.

* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev:
  tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case
  tests: cql_test_env: Enable all features in tests
  schema_tables: Make make_scylla_tables_mutation() visible
  migration_manager: Disable pulls during rolling upgrade from 1.7
  storage_service: Introduce SCHEMA_TABLES_V3 feature
  schema_tables: Don't alter tables which differ only in version
  schema_mutations: Use mutation_opt instead of stdx::optional<mutation>

(cherry picked from commit 8378fe190a)
2017-09-15 12:07:56 +02:00
Avi Kivity
8567762339 Merge "Refuse to load non-Scylla counter sstables" from Paweł
"These patches make Scylla refuse to load counter sstables that may
contain unsupported counter shards. They are recognised by the lack of
the Scylla component.

Fixes #2766."

* tag 'reject-non-scylla-counter-sstables/v1' of https://github.com/pdziepak/scylla:
  db: reject non-Scylla counter sstables in flush_upload_dir
  db: disallow loading non-Scylla counter sstables
  sstable: add has_scylla_component()

(cherry picked from commit fe019ad84d)
2017-09-11 13:29:32 +03:00
Avi Kivity
f698496ab2 Merge "Fix Scylla upgrades when counters are used" from Paweł
"Scylla 1.7.4 and older use incorrect ordering of counter shards, this
was fixed in 0d87f3dd7d ("utils::UUID:
operator< should behave as comparison of hex strings/bytes"). However,
that patch was not backported to 1.7 branch until very recently. This
means that versions 1.7.4 and older emit counter shards in an incorrect
order and expect them to be so. This is particularly bad when dealing
with imported correct sstables in which case some shards may become
duplicated.

The solution implemented in this patch is to allow any order of counter
shards and automaticly merge all duplicates. The code is written in a
way so that the correct ordering is expected in the fast path in order
not to excessively punish unaffected deployments.

A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless
upgrade from 1.7.4 to later Scylla versions. If that feature is not
available Scylla still writes sstables and sends on-wire counters using
the old ordering so that it can be correctly understood by 1.7.4, once
the flag becomes available Scylla switches to the correct order.

Fixes #2752."

* tag 'fix-upgrade-with-counters/v2' of https://github.com/pdziepak/scylla:
  tests/counter: verify counter_id ordering
  counter: check that utils::UUID uses int64_t
  mutation_partition_serializer: use old counter ordering if necessary
  mutation_partition_view: do not expect counter shards to be sorted
  sstables: write counter shards in the order expected by the cluster
  tests/sstables: add storage_service_for_tests to counter write test
  tests/sstables: add test for reading wrong-order counter cells
  sstables: do not expect counter shards to be sorted
  storage_service: introduce CORRECT_COUNTER_ORDER feature
  tests/counter: test 1.7.4 compatible shard ordering
  counters: add helper for retrieving shards in 1.7.4 order
  tests/counter: add tests for 1.7.4 counter shard order
  counters: add counter id comparator compatible with Scylla 1.7.4
  tests/counter: verify order of counter shards
  tests/counter: add test for sorting and deduplicating shards
  counters: add function for sorting and deduplicating counter cells
  counters: add counter_id::operator>

(cherry picked from commit 31706ba989)
2017-09-05 14:25:36 +03:00
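The "counters: add function for sorting and deduplicating counter cells" patch listed above can be sketched in Python (a toy model; real counter shards are C++ structs, and the merge rule shown — highest logical clock wins — is an illustrative simplification):

```python
def sort_and_dedupe_shards(shards):
    """Sort counter shards by id and merge duplicates, keeping the
    copy with the highest logical clock (illustrative semantics)."""
    best = {}
    for shard_id, clock, value in shards:
        if shard_id not in best or clock > best[shard_id][0]:
            best[shard_id] = (clock, value)
    return [(sid, c, v) for sid, (c, v) in sorted(best.items())]

# A duplicated shard id 'a', as produced by the 1.7.4 ordering bug:
# the higher-clock copy wins and the shards come out sorted by id.
shards = [("b", 1, 10), ("a", 2, 7), ("a", 1, 5)]
assert sort_and_dedupe_shards(shards) == [("a", 2, 7), ("b", 1, 10)]
```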
Shlomi Livne
6e6de348ea release: prepare for 2.0.rc4
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-09-03 14:58:01 +03:00
Avi Kivity
117db58531 Update AMI submodule
* dist/ami/files/scylla-ami b41e5eb...5ffa449 (3):
  > amzn-main.repo: stick to Amazon Linux 2017.03 kernel (4.9.x)
  > Prevent dependency error on 'yum update'
  > scylla_create_devices: don't raise error when no disks found

Fixes #2751.

Still tracking master branch.
2017-08-31 15:15:42 +03:00
Vlad Zolotarov
086f8b7af2 service::storage_service: initialize auth and tracing after we joined the ring
Initialize the system_auth and system_traces keyspaces and their tables after
the node joins the token ring, because as part of system_auth initialization
SELECT and possibly INSERT CQL statements are going to be issued.

This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.

Fixes #2273

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit e98adb13d5)
2017-08-30 09:33:05 +02:00
Avi Kivity
bee9fbe3fc Merge "Fix sstable reader not working for empty set of clustering ranges" from Tomasz
"Fixes #2734."

* 'tgrabiec/make-sstable-reader-work-with-empty-range-set' of github.com:scylladb/seastar-dev:
  tests: Introduce clustering_ranges_walker_test
  tests: simple_schema: Add missing include
  sstables: reader: Make clustering_ranges_walker work with empty range set
  clustering_ranges_walker: Make adjacency more accurate

(cherry picked from commit 5224ab9c92)
2017-08-29 15:54:58 +02:00
Tomer Sandler
b307a36f1e node_health_check: Various updates
- Removed text from Report's "PURPOSE" section, which was referring to the "MANUAL CHECK LIST" (not needed anymore).
- Removed curl command (no longer using the api_address), instead using scylla --version
- Added -v flag in iptables command, for more verbosity
- Added support for OEL (Oracle Enterprise Linux) - minor fix
- Some text changes - minor
- OEL support indentation fix + collecting all files under /etc/scylla
- Added line separation under cp output message

Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828131429.4212-1-tomer@scylladb.com>
(cherry picked from commit f1eb6a8de3)
2017-08-29 15:17:16 +03:00
Tomer Sandler
527e12c432 node_health_check: added line separation under cp output message
Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828124307.2564-1-tomer@scylladb.com>
(cherry picked from commit 83f249c15d)
2017-08-29 15:17:07 +03:00
Shlomi Livne
c57cc55aa6 release: prepare for 2.0.rc3
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-08-28 14:57:54 +03:00
Avi Kivity
5d3c015d27 Merge "Fixes for skipping in sstable reader" from Tomasz
Ref #2733.

* 'tgrabiec/fix-fast-forwarding' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Add more tests for fast forwarding across partitions
  sstables: Fix abort in mutation reader for certain skip pattern
  sstables: Fix reader returning partition past the query range in some cases
  sstables: Introduce data_consume_context::eof()

(cherry picked from commit 4e67bc9573)
2017-08-28 12:48:50 +03:00
Avi Kivity
428831b16a Merge "consider the pre-existing cpuset.conf when configuring networking mode" from Vlad
"Preserve the networking configuration mode during the upgrade by generating the /etc/scylla.d/perftune.yaml
file and using it."

Fixes #2725.

* 'dist_respect_cpuset_conf-v3' of https://github.com/vladzcloudius/scylla:
  scylla_prepare: respect the cpuset.conf when configuring the networking
  scylla_cpuset_setup: rm perftune.yaml
  scylla_cpuset_setup: add a missing "include" of scylla_lib.sh

(cherry picked from commit 40aeb00151)
2017-08-24 18:59:37 +03:00
Paweł Dziepak
918339cf2e mvcc: allow invoking maybe_merge_versions() inside allocating section
Message-Id: <20170823083544.4225-1-pdziepak@scylladb.com>
(cherry picked from commit 1006a946e8)
2017-08-24 14:31:00 +02:00
Paweł Dziepak
6c846632e4 abstract_read_executor: make make_requests() exception safe
Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>
(cherry picked from commit 9d82a1ebfd)
2017-08-24 14:29:32 +02:00
Paweł Dziepak
af7b7f1eff shared_index_lists: restore indentation
Message-Id: <20170821162934.25386-4-pdziepak@scylladb.com>
(cherry picked from commit 31afc2f242)
2017-08-24 14:28:49 +02:00
Paweł Dziepak
701128f8a1 sstables: make shared_index_lists::get_or_load exception safe
Message-Id: <20170821162934.25386-3-pdziepak@scylladb.com>
(cherry picked from commit 93eaa95378)
2017-08-24 14:28:49 +02:00
Avi Kivity
c03118fbe9 Update seastar submodule
* seastar 2993cae...06790c0 (3):
  > scripts: posix_net_conf.sh: allow passing a perftune.py configuration file as a parameter
  > scripts: perftune.py: add the possibility to pass the parameters in a configuration file and print the YAML file with the current configuration
  > scripts: perftune.py: actually use the number of Rx queues when comparing to the number of CPU threads
2017-08-24 11:37:38 +03:00
Piotr Jastrzebski
a98e3aec45 Make streamed_mutation more exception safe
Make sure that push_mutation_fragment leaves
_buffer_size with a correct value if an exception
is thrown from emplace_back.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <83398412aa78332d88d91336b79140aecc988602.1503474403.git.piotr@scylladb.com>
(cherry picked from commit 477068d2c3)
2017-08-23 19:10:49 +03:00
Avi Kivity
29baf7966c Merge "repair: Do not allow repair until node is in NORMAL status" from Asias
Fixes #2723.

* tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev:
  repair: Do not allow repair until node is in NORMAL status
  gossip: Add is_normal helper

(cherry picked from commit 2f41ed8493)
2017-08-23 09:45:34 +03:00
Amnon Heiman
ba63f74d7e Add configuration to disable per keyspace and column family metrics
The number of keyspace and column family metrics reported is
proportional to the number of shards times the number of keyspace/column
families.

This can cause a performance issue both on the reporting system and on
the collecting system.

This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.

Fixes #2701

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
(cherry picked from commit abbd78367c)
2017-08-22 19:20:50 +03:00
Alexys Jacob
1733f092ef dist: Fix Gentoo Linux scylla-jmx and scylla-tools packages detection
These two admin related packages will be packaged under the "app-admin"
category and not the "dev-db" one.

This fixes the detection path of the packages for scylla_setup.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817094756.21550-1-ultrabug@gentoo.org>
(cherry picked from commit e5ff8efea3)
2017-08-17 15:44:02 +03:00
Paweł Dziepak
179ff956ee sstables: initialise index metrics on all shards
Fixes #2702.

Message-Id: <20170816085454.21554-1-pdziepak@scylladb.com>
(cherry picked from commit 784dcbf1ca)
2017-08-16 15:44:58 +03:00
Avi Kivity
61c2e8c7e2 Update seastar submodule
* seastar d67c344...2993cae (1):
  > fstream: do not ignore unresolved future

Fixes #2697.
2017-08-16 15:11:10 +03:00
Shlomi Livne
8370e1bc2c release: prepare for 2.0.rc2
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-08-15 15:52:35 +03:00
Pekka Enberg
3a244a4734 docker: Switch to Scylla 2.0 RPM repository 2017-08-15 13:27:41 +03:00
Avi Kivity
3261c927d2 Update seastar submodule
* seastar 2383d60...d67c344 (1):
  > Merge "Fix crash in rpc due to access to already destroyed server socket" from Gleb

Fixes #2690
2017-08-14 16:24:05 +03:00
Avi Kivity
4577a89982 Update seastar submodule
* seastar cfe280c...2383d60 (1):
  > tls: Only recurse once in shutdown code

Fixes #2691
2017-08-14 15:10:27 +03:00
Avi Kivity
6ea306f898 Update seastar submodule
* seastar b9f4568...cfe280c (1):
  > scripts: perftune.py: change the network module mode auto selection heuristic
2017-08-14 10:30:42 +03:00
Avi Kivity
2afcc684b4 Update seastar submodule
* seastar 867b7c7...b9f4568 (4):
  > http: removed unneeded lamda captures
  > Merge "Prometheus to use output stream" from Amnon
  > http_test: Fix an http output stream test
  > Merge "Add output stream to http message reply" from Amnon

Fixes #2475
2017-08-10 12:05:14 +03:00
Avi Kivity
bdb8c861c7 Fork seastar submodule for 2.0 2017-08-10 12:00:31 +03:00
Takuya ASADA
ea933b4306 dist/debian: append postfix '~DISTRIBUTION' to scylla package version
We are moving to aptly to release .deb packages, that requires debian repository
structure changes.
After the change, we will share 'pool' directory between distributions.
However, our .deb package name on specific release is exactly same between
distributions, so we have file name confliction.
To avoid the problem, we need to append distribution name on package version.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502312935-22348-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 8e115d69a9)
2017-08-10 10:54:17 +03:00
Raphael S. Carvalho
19391cff14 sstables: close index file when sstable writer fails
The index's file output stream uses write-behind, but it's not closed
when an sstable write fails, and that may lead to a crash.
This happened before for the data file (for which it is obviously easier
to reproduce) and was fixed by 0977f4fdf8.

Fixes #2673.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>
(cherry picked from commit dddbd34b52)
2017-08-08 09:53:36 +03:00
Glauber Costa
87d9a4f1f1 add active streaming reads metric
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.

This patch adds this metric so we don't fly blind.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
(cherry picked from commit 4a911879a3)
2017-08-05 11:07:18 +03:00
Pekka Enberg
c0f894ccef docker: Disable stall detector
Fixes #2162

Message-Id: <1501759957-4380-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 90872ffa1f)
2017-08-03 14:53:03 +03:00
Takuya ASADA
2c892488ef dist/debian: check scylla user/group existance before adding them
To prevent the install failing in an environment which already has a scylla
user/group, an existence check is needed.

Fixes #2389

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1495023805-14905-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 91ade1a660)
2017-08-03 13:01:30 +03:00
Avi Kivity
911608e9c4 database: prevent streaming reads from blocking normal reads
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.

Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.

Fixes #2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>

(cherry picked from commit f38e4ff3f9)
2017-08-03 12:27:48 +03:00
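The fix above can be sketched in Python (illustrative only; the real code uses seastar semaphores): giving streaming its own semaphore means exhausting the streaming slots leaves the normal-read slots untouched.

```python
import threading

# Separate slot pools, as in the fix: streaming can no longer starve
# normal reads by consuming a shared semaphore.
normal_reads = threading.Semaphore(2)     # slots for user-facing reads
streaming_reads = threading.Semaphore(2)  # separate slots for streaming

# Streaming grabs every one of its own slots...
assert streaming_reads.acquire(blocking=False)
assert streaming_reads.acquire(blocking=False)
assert not streaming_reads.acquire(blocking=False)  # streaming pool is full

# ...but a normal read still gets a slot, which would have been denied
# under the old single shared semaphore.
assert normal_reads.acquire(blocking=False)
```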
Avi Kivity
d8ab07de37 database: remove streaming read queue length limit
If we fail a streaming read due to queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.

Fixes #2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>

(cherry picked from commit 911536960a)
2017-08-03 12:27:46 +03:00
Duarte Nunes
15eefbc434 tests/sstable_mutation_test: Don't use moved-from object
Fix a bug introduced in dbbb9e93d and exposed by gcc6 by not using a
moved-from object. Twice.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170802161033.4213-1-duarte@scylladb.com>
(cherry picked from commit 4c9206ba2f)
2017-08-03 09:46:18 +03:00
Vlad Zolotarov
4bb6ba6d58 utils::loading_cache: cancel the timer after closing the gate
The timer is armed inside the section guarded by the _timer_reads_gate;
therefore it has to be canceled after the gate is closed.

Otherwise we may end up with the armed timer after stop() method has
returned a ready future.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 4b28ea216d)
2017-08-01 17:23:53 +01:00
Avi Kivity
6dd4a9a5a2 Merge "Ensure correct EOC for PI block cell names" from Duarte
"This series ensures we always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.

Fixes #2333"

* 'pi-cell-name/v1' of github.com:duarten/scylla:
  tests/sstable_mutation_test: Test promoted index blocks are monotonic
  sstables: Consider eoc when flushing pi block
  sstables: Extract out converting bound_kind to eoc

(cherry picked from commit db7329b1cb)
2017-08-01 18:09:54 +03:00
Gleb Natapov
222e85d502 cql transport: run accept loop in the foreground
It was meant to be run in the foreground since it is waited upon during
stop(), but as it is now, from the stop() perspective it is completed
after the first connection is accepted.

Fixes #2652

Message-Id: <20170801125558.GS20001@scylladb.com>
(cherry picked from commit 1da4d5c5ee)
2017-08-01 17:06:08 +03:00
Takuya ASADA
8d4a30e852 dist/ami: follow scylla-tools package name change on RedHat variants
Since scylla-tools generates two .rpm packages, we need to copy them to our AMI.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170722090002.9850-1-syuu@scylladb.com>
(cherry picked from commit a998b7b3eb)
2017-07-31 18:57:29 +03:00
Avi Kivity
4710ee229d Merge "Reduce the effect of the latency metrics" from Amnon
"This series reduces that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
   stop the last bucket."

Fixes #2650.

* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
  database: remove latency from the system table
  estimated histogram: return a smaller histogram

(cherry picked from commit 3fe6731436)
2017-07-31 16:01:05 +03:00
Vlad Zolotarov
93cb78f21d utils::loading_cache: add stop() method
loading_cache invokes a timer that may issue asynchronous operations
(queries) that would end with writing into the internal fields.

We have to ensure that these operations are over before we can destroy
the loading_cache object.

Fixes #2624

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501208345-3687-1-git-send-email-vladz@scylladb.com>
2017-07-31 15:55:46 +03:00
Avi Kivity
af8151c4b7 Update scylla-ami submodule
* dist/ami/files/scylla-ami 2bd1481...b41e5eb (1):
  > Fix incorrect scylla-server sysconfig file edit for i3 memflush controller
2017-07-31 09:41:56 +03:00
Takuya ASADA
846d9da9c2 dist/debian: refuse upgrade if current scylla < 1.7.3 && commitlog remains
Commitlog replay fails when upgrading from <1.7.3 to 2.0, so we need to
refuse to upgrade the package if the current scylla < 1.7.3 && a commitlog
remains.

Note: We have the problem on the scylla-server package, but to prevent the
scylla-conf package upgrade, %pretrans should be defined on scylla-conf.

Fixes #2551

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501187555-4629-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 714540cd4c)
2017-07-31 09:18:58 +03:00
Paweł Dziepak
aaa59d3437 streamed_mutation: do not call fill_buffer() ahead of time
consume_mutation_fragments_until() allows consuming mutation fragments
until a specified condition happens. This patch reorganises its
implementation so that we avoid situations when fill_buffer() is called
with stop condition being true.
Message-Id: <20170727122218.7703-1-pdziepak@scylladb.com>

(cherry picked from commit f02bef7917)
2017-07-27 17:48:26 +02:00
Tomasz Grabiec
66dd817582 mutation_partition: Always mark static row as continuous when no static columns
To avoid unnecessary cache misses after static columns are added.

Message-Id: <1500650057-26036-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 136d205855)
2017-07-27 14:59:33 +02:00
Tomasz Grabiec
5857a4756d Merge "Some fixes for performance regressions in perf_fast_forward" from Paweł
These patches contain some minor fixes for performance regressions reported
by perf_fast_forward after the partial cache was merged. The solution is still
far from perfect, there is one case that still has 30% degradation, but
there is some improvement so there is no reason to hold these changes back.

Refs #2582.

Some numbers:
before - before cache changes were merged
(555621b537)

cache - at the commit that introduced the partial cache
(9b21a9bfb6)

after - recent master + this series
(based on e988121dbb)

Differences are shown relative to "before".

Testing effectiveness of caching of large partition, single-key slicing reads:
Large partitions, range [0, 500000], populating cache
  before      cache      after
 1636840    1013688    1234606
              -38%        -25%

Large partitions, range [0, 500000], reading from cache
  before      cache      after
 2012615    3076812    3035423
               +53%       +51%

Testing scanning small partitions with skips.
reading small partitions (skip 0)
 before      cache      after
 227060     165261     200639
              -27%       -11%

skipping small partitions (skip 1)
 before      cache      after
  29813      27312      38210
               -8%       +28%

Testing slicing small partitions:
slicing small partitions (offset 0, read 4096)
 before      cache      after
 195282     149695     180497
              -23%        -8%

* https://github.com/pdziepak/scylla.git perf_fast_forward-regression/v3:
  sstables: make sure that fill_buffer() actually fills buffer
  mutation_merger: improve handling of non-deferring fill_buffer()s
  partition_snapshot_row_cursor: avoid apply() in single-version cases
  sstables: introduce decorated_key_view
  ring_position_comparator: accept sstables::decorated_key_view
  sstable: keep a pre-computed token in summary_entry
  sstables: cache token in index entries
  index_reader: advance_and_check_if_present() use index_comparator
  ring_position_comparator: drop unused overloads
  cache_streamed_mutation: avoid moving clustering_row
  streamed_mutation: introduce consume_mutation_fragments_until()
  cache_streamed_mutation: use consumer based read_context reader
  rows_entry: make position() inlineable
  mutation_fragment: make destructor always_inline
  keys: introduce compound_wrapper::from_exploded_view()
  sstables: avoid copying key components
  compound_compat: explode: reserve some elements in a vector
  cache: short-circut static row logic if there are no static columns
  cache: use equality comparators instead of tri_compare
  sstables: avoid indirect calls to abstract_type::is_multi_cell()

(cherry picked from commit e9fc0b0491)
2017-07-27 13:58:23 +02:00
Takuya ASADA
f199047601 dist/redhat: limit metapackage dependencies to specific version of scylla packages
When we install the scylla metapackage with a version (ex: scylla-1.7.1),
it always installs the newest scylla-server/-jmx/-tools from the repo
instead of the specified version of the packages.

To install packages of the same version as the metapackage, limit the
dependencies to the current package version.

Fixes #2642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170726193321.7399-1-syuu@scylladb.com>
(cherry picked from commit 91a75f141b)
2017-07-27 14:21:55 +03:00
Tomasz Grabiec
a8dcbb6bd0 row_cache: Fix potential timeout or deadlock due to sstable read concurrency limit
database::make_sstable_reader() creates a reader which will need to
obtain a semaphore permit when invoked. Therefore, each read may
create at most one such reader in order to be guaranteed to make
progress. If the reader tries to create another reader, that may
deadlock (or for non-system tables, timeout), if enough number of such
readers tries to do the same thing at the same time.

Avoid the problem by dropping previous reader before creating a new
one.

Refs #2644.

Message-Id: <1501152454-4866-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 22948238b6)
2017-07-27 13:58:40 +03:00
Duarte Nunes
bfd99d4e74 db/schema_tables: Drop dropped columns when dropping tables
Fixes #2633

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726150228.2593-2-duarte@scylladb.com>
(cherry picked from commit 50ad0003c6)
2017-07-26 18:48:59 +02:00
Duarte Nunes
d40df89271 db/schema_tables: Store column_name in text form
As does Cassandra.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726150228.2593-1-duarte@scylladb.com>
(cherry picked from commit 3425403126)
2017-07-26 18:48:58 +02:00
Duarte Nunes
3da54ffff0 schema_builder: Replace type when re-dropping column
Fixes #2634

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725183933.5311-1-duarte@scylladb.com>
(cherry picked from commit e988121dbb)
2017-07-26 16:26:59 +02:00
Duarte Nunes
804793e291 tests/schema_change_test: Add test case for add+drop notification
Reproduces #2616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-2-duarte@scylladb.com>
(cherry picked from commit 472f32fb06)
2017-07-26 16:26:59 +02:00
Duarte Nunes
83ea9b6fc0 db/schema_tables: Consider differing dropped columns
If a node is notified of a schema change where the schema's dropped
columns have changed, that node will miss the changes to the dropped
columns. A scenario where this can happen is where a column c is
dropped, then added with a different type, and then dropped again, with
a node n having seen the first drop and being notified of the
subsequent add and drop.

Fixes #2616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-1-duarte@scylladb.com>
(cherry picked from commit 33e18a1779)
2017-07-26 16:26:59 +02:00
Asias He
b45855fc1c gossip: Fix nr_live_nodes calculation
We need to consider the _live_endpoints size. The nr_live_nodes should
not be larger than the _live_endpoints size, otherwise the loop to
collect live nodes can run forever.

It is a regression introduced in commit 437899909d
(gossip: Talk to more live nodes in each gossip round).

Fixes #2637

Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>
(cherry picked from commit 515a744303)
2017-07-26 16:48:51 +03:00
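The invariant this fix restores — never try to collect more live nodes than _live_endpoints actually holds — can be sketched in a few lines (illustrative Python, not the Scylla code; the function and its shape are invented):

```python
import random

def pick_live_nodes(requested, live_endpoints):
    """Pick up to `requested` distinct live nodes to gossip with.

    The min() clamp is the point of the fix: without it, asking for
    more nodes than live_endpoints contains makes the collection loop
    below spin forever, since it can never gather enough distinct nodes.
    """
    target = min(requested, len(live_endpoints))
    chosen = set()
    while len(chosen) < target:
        chosen.add(random.choice(live_endpoints))
    return chosen
```

With the clamp removed, `pick_live_nodes(5, ["a", "b"])` would never terminate; with it, the call simply returns both live endpoints.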
Duarte Nunes
3900babff2 schema: Remove unnecessary print
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725174000.71061-1-duarte@scylladb.com>
(cherry picked from commit 9c831b4e97)
2017-07-26 16:07:41 +03:00
Tomasz Grabiec
7c805187a9 Merge fixes related to row cache from Raphael
* git@github.com:raphaelsc/scylla.git row_cache_fixes:
  db: atomically synchronize cache with changes to the snapshot
  db: refresh row cache's underlying data source after compaction

(cherry picked from commit 18be42f71a)
2017-07-25 15:37:40 +02:00
Paweł Dziepak
345a91d55d tests/row_cache: test queries with no clustering ranges
Reproducer for #2604.
Message-Id: <20170725131220.17467-3-pdziepak@scylladb.com>

(cherry picked from commit 79a1ad7a37)
2017-07-25 15:37:32 +02:00
Paweł Dziepak
fda8b35cda tests: do not overload the meaning of empty clustering range
Empty clustering key range is perfectly valid and signifies that the
reader is not interested in anything but the static row. Let's not
make it mean anything else.
Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>

(cherry picked from commit 1ea507d6ae)
2017-07-25 15:37:29 +02:00
Paweł Dziepak
08ac0f1100 cache: fix aborts if no clustering range is specified
cache_streamed_mutation assumed that at least one clustering range was
specified. That was wrong since the readers are allowed to query just
for a static row (e.g. counter update that modifies only static
columns).

Fixes #2604.
Message-Id: <20170725131220.17467-1-pdziepak@scylladb.com>

(cherry picked from commit 6572f38450)
2017-07-25 15:37:28 +02:00
Calle Wilund
db455305a2 system_keyspace: Make sure "system" is written to keyspaces (visible)
Fixes #2514

Bug in schema version 3 update: We failed to write "system" to the
schema tables. Only visible on an empty instance of course.

Message-Id: <1500469809-23546-2-git-send-email-calle@scylladb.com>
(cherry picked from commit 7a583585a2)
2017-07-24 11:33:25 +02:00
Avi Kivity
e1a3052e76 tests: fix sstable_datafile_test build with boost 1.55
Boost 1.55 accidentally removed support for "range for" on
recursive_directory_iterator (previous and latter versions do
support it). Use old-style iteration instead.

Message-Id: <20170724080128.8824-1-avi@scylladb.com>
(cherry picked from commit c21bb5ae05)
2017-07-24 11:20:53 +03:00
Tomasz Grabiec
50fa3f3b89 schema_registry: Keep unused entries around for 1 second
This is in order to avoid frequent misses which have a relatively high
cost. A miss means we need to fetch the schema definition from another
node and, in the case of writes, do a schema merge.

If the schema is kept alive only by the incoming request, then it
will be forgotten immediately when the request is done, and the next
request using the same schema version will miss again.

Refs #2608.
Message-Id: <1500632447-10104-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 29a82f5554)
2017-07-24 10:12:09 +02:00
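The behaviour described — an entry stays alive for a short grace period after its last use, so bursts of requests for the same schema version pay the miss cost only once — can be sketched like this (illustrative Python, not Scylla's loading_cache; all names are invented):

```python
import time

class GraceCache:
    """Cache that keeps entries for `grace` seconds after their last use,
    so back-to-back requests for the same key do not each pay the miss cost."""

    def __init__(self, load, grace=1.0, clock=time.monotonic):
        self._load = load       # expensive miss handler (e.g. remote fetch)
        self._grace = grace
        self._clock = clock
        self._entries = {}      # key -> (value, last_used)

    def get(self, key):
        self._evict_expired()
        if key in self._entries:
            value, _ = self._entries[key]       # hit within the grace period
        else:
            value = self._load(key)             # the costly path to avoid
        self._entries[key] = (value, self._clock())
        return value

    def _evict_expired(self):
        now = self._clock()
        for k in [k for k, (_, t) in self._entries.items()
                  if now - t > self._grace]:
            del self._entries[k]
```

An entry kept alive only by an in-flight request would normally vanish as soon as the request completes; the grace period lets the next request for the same version hit instead of missing again.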
Tomasz Grabiec
8474b7a725 legacy_schema_migrator: Don't snapshot empty legacy tables
Otherwise we will create a new (empty) snapshot each time we boot.
Message-Id: <1500573920-31478-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit ecc85988dd)
2017-07-24 09:56:22 +02:00
Tomasz Grabiec
0fc874e129 database: Allow disabling auto snapshots during drop/truncate
Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 408cea66cd)
2017-07-24 09:56:19 +02:00
Duarte Nunes
5cf1a19f3f Merge 'Fix possible inconsistency of table schema version' from Tomasz
"Fixes issues uncovered in longevity test (#2608).

Main problem is that, due to time drift, the scylla_tables.version column
may not get deleted on all nodes doing the schema merge, which will
make some nodes come up with a different table schema version than others.

The inconsistency will not heal because scylla_tables doesn't
take part in the schema sync. This is fixed by the last patch.

This will cause nodes to constantly try to sync the schema, which under
some conditions triggers #2617."

* tag 'tgrabiec/fix-table-schema-version-inconsistency-v1' of github.com:scylladb/seastar-dev:
  schema_tables: Add scylla_tables to ALL
  schema: Make schema_mutations equality consistent with digest
  schema_tables: Extract compact_for_schema_digest()
  schema_tables: Always drop scylla_tables::version

(cherry picked from commit 937fe80a1a)
2017-07-24 09:54:45 +02:00
Tomasz Grabiec
f48466824f schema_registry: Ensure schema_ptr is always synced on the other core
global_schema_ptr ensures that schema object is replicated to other
cores on access. It was replicating the "synced" state as well, but
only when the shard didn't know about the schema. It could happen that
the other shard has the entry, but it's not yet synced, in which case
we would fail to replicate the "synced" state. This will result in
exception from mutate(), which rejects attempts to mutate using an
unsynced schema.

The fix is to always replicate the "synced" state. If the entry is
syncing, we will preemptively mark it as synced earlier. The syncing
code is already prepared for this.

Refs #2617.
Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 65c64614aa)
2017-07-24 09:52:31 +02:00
Avi Kivity
914f6f019f Update ami submodule
* dist/ami/files/scylla-ami 5dfe42f...2bd1481 (1):
  > Enable support for experimental CPU controller in i3 instances
2017-07-24 10:27:35 +03:00
Shlomi Livne
f5bb363f96 release: prepare for 2.0.rc1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-07-23 09:47:11 +03:00
Duarte Nunes
61ba56f628 schema: Support compaction enabled attribute
Fixes #2547

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170721132206.3037-1-duarte@scylladb.com>
(cherry picked from commit 7eecda3a61)
2017-07-21 15:39:48 +02:00
Tomasz Grabiec
f4d3e5cdcf Merge "Drop mutations that raced with truncate" from Duarte
Instead of retrying, just drop mutations that raced with a truncate.

* git@github.com:duarten/scylla.git truncate-reorder/v1:
  database: Rename replay_position_reordered_exception
  database: Drop mutations that raced with truncate

(cherry picked from commit 63caa58b70)
2017-07-21 15:39:20 +02:00
Avi Kivity
0291a4491e Merge "restrict background writers with scheduling groups" from Glauber
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.

The said maximum is recommended to be set  50 % - it is default
disabled, but can be adjusted through a configuration option until we
are able to auto-tune this.

The second patch in this series provides a preview on how such auto-tune
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.

Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."

* 'memtable-controller-v5' of https://github.com/glommer/scylla:
  simple controller for memtable/streaming writer shares.
  restrict background writers to 50 % of CPU.

(cherry picked from commit c5ee62a6a4)
2017-07-20 15:13:39 +03:00
Duarte Nunes
83cc640c6a Merge 'Revert back to 1.7 schema layout in memory' from Tomasz
"Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555)
by reverting back to using the old layout in memory and thus also
in across-node requests. We still use the new v3 layout in schema
tables (needed by drivers and external tools). Translations happen
when converting to/from schema mutations."

* tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev:
  schema: Revert back to the 1.7 layout of static compact tables in memory
  schema: Use v3 column layout when converting to/from schema mutations
  schema: Encapsulate column layout translations in the v3_columns class

(cherry picked from commit 1daf1bc4bb)
2017-07-19 19:49:43 +03:00
Duarte Nunes
2f06c54033 thrift/handler: Remove leftover debug artifacts
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705161156.2307-1-duarte@scylladb.com>
(cherry picked from commit d583ef6860)
2017-07-19 19:49:35 +03:00
Calle Wilund
9abe7651f7 system_schema: Fix remaining places not handling two system keyspaces
Some places remained where code looked directly at
system_keyspace::NAME to determine if a ks is
considered special/system/protected, including
schema digest calculation.

Export "is_system_keyspace" and use accordingly.

Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 247c36e048)
2017-07-19 19:48:30 +03:00
Amos Kong
784aea12e7 scylla_raid_setup: fix syntax error
/usr/lib/scylla/scylla_raid_setup: line 132: syntax error
near unexpected token `fi'

Fixes #2610

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <af3a5bc77c5ba2b49a8f48a5aaa19afffb787886.1500430021.git.amos@scylladb.com>
(cherry picked from commit 2bdcad5bc3)
2017-07-19 11:10:43 +03:00
Avi Kivity
3a98959eba dist: tolerate sysctl failures
sysctl may fail in a container environment if /proc is not virtualized
properly.

Fixes #1990
Message-Id: <20170625145930.31619-1-avi@scylladb.com>

(cherry picked from commit 08488a75e0)
2017-07-18 15:45:41 +03:00
Duarte Nunes
2c7d597307 wrapping_range: Fix lvalue transform()
Instead of copying and moving the bound, pass it by reference so the
transformer can decide whether it wants to copy or not. The only
caller so far doesn't want a copy and takes the value by reference,
which would be capturing a temporary value. Caught by the
view_schema_test with gcc7.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705210255.29669-1-duarte@scylladb.com>
(cherry picked from commit 3dd0397700)
2017-07-18 14:35:58 +03:00
Duarte Nunes
8d46c4e049 thrift: Fail when mixed CFs are detected
Fixes #2588

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170717222612.7429-1-duarte@scylladb.com>
(cherry picked from commit d9fa3bf322)
2017-07-18 10:21:45 +03:00
Asias He
b1c080984f gossip: Implement the missing fd_max_interval_ms and fd_initial_value_ms option
It is useful for larger clusters with larger gossip message latency. By
default the fd_max_interval_ms is 2 seconds, which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in larger clusters, the gossip message update
interval can be larger than 2 seconds.

Fixes #2603.

Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
(cherry picked from commit adc5f0bd21)
2017-07-17 13:29:30 +03:00
Duarte Nunes
e1706c36b7 Merge 'Fixes around migration to v3 schema tables' from Tomasz
branch 'tgrabiec/schema-migration-fixes' of github.com:scylladb/seastar-dev:
  schema: Use proper name comparator
  legacy_schema_migrator: Properly migrate non-UTF8 named columns
  schema_tables: Store column_name in text form
  legacy_schema_migrator: Migrate columns like Cassandra
  schema_builder: Add factory method for default_names
  legacy_schema_migrator: Simplify logic
  thrift: Don't set regular_column_name_type
  schema: Use proper column name type for static columns
  schema: Fix column_name_type() for static compact tables
  schema: Introduce clustering_column_at()
  thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type
  partition_slice_builder: Use proper column's type instead of regular_column_name_type()

(cherry picked from commit 13caccf1cf)
2017-07-17 12:42:19 +03:00
Avi Kivity
63c8306733 Update seastar submodule
* seastar b812cee...867b7c7 (1):
  > rpc: start server's send loop only after protocol negotiation

Fixes #2600.

Still tracking upstream.
2017-07-17 10:41:59 +03:00
Avi Kivity
a7dfdc0155 tests: move tmpdir to /tmp
Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk
with write-back cache, and forever on a spinning disk.
Message-Id: <20170716081653.10018-1-avi@scylladb.com>

(cherry picked from commit d9c64ef737)
2017-07-17 08:47:17 +03:00
Avi Kivity
70be29173a tests: copy the sstable with an unknown component to the data directory
We will be creating links to those sstable's files, and those don't work
if the data directory and the test sstable are on different devices.

Copying the files to the same directory fixes the problem.
Message-Id: <20170716090405.14307-1-avi@scylladb.com>

(cherry picked from commit 9116dd91cb)
2017-07-17 08:47:08 +03:00
Avi Kivity
e09d4a9b75 Update seastar submodule
* seastar 844bcfb...b812cee (1):
  > Update dpdk submodule

Fixes #2595 (again).

Still tracking master.
2017-07-16 17:01:48 +03:00
Avi Kivity
67f25e56a6 Update seastar submodule
* seastar ff34c42...844bcfb (1):
  > Update dpdk submodule

Still tracking master.

Fixes #2595.
2017-07-15 19:18:10 +03:00
Tomasz Grabiec
74c4651b95 Merge "Fixes for memtable flushing and replay positions" from Duarte
We don't ensure mutations are applied in memory following the order of their
replay positions. A memtable can thus be flushed with replay position rp,
with the new one being at replay position rp', where rp' < rp. This breaks
an intrinsic assumption in the code, which this series addresses.

Fixes #2074

branch memtable-flush/v3 of git@github.com:duarten/scylla.git:
  commitlog: Always flush latest memtable
  column_family: More precise count of switched memtables
  column_family: Fix typo in pending_tasks metric name
  column_family: More precise count of pending flushes
  dirty_memory_manager: Remove unnecessary check from flush_one()
  column_family: Don't rely on flush_queue to guarantee flushes finished
  column_family: Don't bother closing the flush_queue on stop()
  column_family: Stop using flush_queue
  column_family: Remove outdated comment about the flush_queue
  memtable: Stop tracking the highest flushed rp

(cherry picked from commit caa62f7f05)
2017-07-14 19:07:33 +02:00
Duarte Nunes
58bfb86d73 storage_proxy: Preserve replica order across mutations
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.

There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.

This patch fixes this and enforces the expected order.

Fixes #2531
Fixes #2593

Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
(cherry picked from commit b8235f2e88)
2017-07-14 12:11:50 +03:00
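The arrangement this patch enforces — one row per partition key, one column per replica, so that all of a replica's mutations stay in a single column — can be sketched as follows (illustrative Python; the data shapes and names are invented, not storage_proxy's types):

```python
def arrange_by_replica(replica_ids, responses):
    """responses: {replica_id: {partition_key: mutation}}.

    Returns one row per partition key; column j always holds the mutation
    (or None) from replica_ids[j], so reconciliation code may assume each
    replica's mutations live in exactly one column.
    """
    keys = sorted({k for r in responses.values() for k in r})
    return [[responses.get(rid, {}).get(k) for rid in replica_ids]
            for k in keys]
```

Fixing the column order per replica is what lets the reconciliation code scan a single column and see everything one replica sent.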
Tomasz Grabiec
cb94c66823 legacy_schema_migrator: Fix calculation of is_dense
The current algorithm marked tables with regular columns not named
"value" as not dense, which doesn't have to be the case. It can be
either way.

It should be enough to look at clustering components. If there is a
clustering key, then table is dense if and only if all comparator
components belong to the clustering key.

If there is no clustering key, then if there are any regular columns
we're sure it's not dense.

Fixes #2587.

Message-Id: <1499877777-7083-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 30ec4af949)
2017-07-13 17:28:25 +03:00
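The rule stated above can be written down directly (an illustrative Python sketch of the decision, not the migrator code; a None result stands for "not decided by this rule alone"):

```python
def is_dense(comparator_components, clustering_key, regular_columns):
    """Decide the legacy 'dense' property per the rule in the message.

    All arguments are collections of column names.
    """
    if clustering_key:
        # Dense if and only if every comparator component belongs
        # to the clustering key.
        return set(comparator_components) <= set(clustering_key)
    if regular_columns:
        # No clustering key, but regular columns exist: surely not dense.
        return False
    return None  # this rule alone does not decide
```

Note that the broken heuristic — inspecting whether regular columns are named "value" — appears nowhere in the decision.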
Tomasz Grabiec
5aa3e23fcd gdb: Fix "scylla columnfamilies" command
Broken in 0e4d5bc2f3.

Message-Id: <1499951956-26206-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 54953c8d27)
2017-07-13 16:33:50 +03:00
Takuya ASADA
aac1d5d54d dist/common/systemd: move scylla-server.service to be after network-online.target instead of network.target
To make sure start Scylla after network is up, we need to move from
network.target to network-online.target.

Fixes #2337

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493661832-9545-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 0c81974bc4)
2017-07-12 13:36:52 +03:00
Glauber Costa
a371b8a5bf change task quota's default
The default of 2ms is somewhat arbitrary. Now that we have a lot more
mileage deploying Scylla applications in production, it sounds not
only arbitrary but also high.

In particular, it is really hard to achieve 1ms latencies in the face of
CPU-heavy workloads with it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1499354495-27173-1-git-send-email-glauber@scylladb.com>
(cherry picked from commit 780a6e4d2e)
2017-07-12 10:21:35 +03:00
Avi Kivity
a69fb8a8ed Update seastar submodule
* seastar 89cc97c...ff34c42 (6):
  > tls: Wrap all IO in semaphore (Fixes #2575)
  > tests/lowres_clock_test.cc: Declare helper static
  > tests/lowres_clock_test.cc: fix compilation error for older GCC
  > configure.py: verifies boost version
  > pkg-config: Eliminate spaces in include path arguments
  > allow applications to override task-quota-ms

Still tracking seastar master.
2017-07-12 10:20:49 +03:00
Avi Kivity
00b9640b2c Merge "Preserve table schema digest on schema tables migration" from Tomasz
"Currently new nodes calculate digests based on v3 schema mutations,
which are very different from v2 mutations. As a result they will
use schemas with a different table_schema_version than the old nodes.
The old nodes will not recognize the version and will try to request
its definition. That will fail, because old nodes don't understand
v3 schema mutations.

To fix this problem, let's preserve the digests during migration,
so that they're the same on new and old nodes. This will allow
requests to proceed as usual.

This does not solve the problem of schema being changed during
the rolling upgrade. This is not allowed, as it would bring the
same problem back.

Fixes #2549."

* tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev:
  tests: Add test for concurrent column addition
  legacy_schema_migrator: Set digest to one compatible with the old nodes
  schema_tables: Persist table_schema_version
  schema_tables: Introduce system_schema.scylla_tables
  schema_tables: Simplify read_table_mutations()
  schema_tables: Resurrect v2 read_table_mutations()
  system_keyspace: Forward-declare legacy schemas
  legacy_schema_migrator: Take storage_proxy as dependency

(cherry picked from commit a397889c81)
2017-07-11 17:23:21 +03:00
Gleb Natapov
59d608f77f consistency_level: report less live endpoints in Unavailable exception if there are pending nodes
DowngradingConsistencyRetryPolicy uses the live replica count from the
Unavailable exception to adjust CL for a retry, but when there are pending
nodes CL is increased internally by the coordinator, and that may prevent
the retried query from succeeding. Adjust the live replica count when
pending nodes are present so that the retried query will be able to proceed.

Fixes #2535

Message-Id: <20170710085238.GY2324@scylladb.com>
(cherry picked from commit 739dd878e3)
2017-07-11 17:16:46 +03:00
Botond Dénes
1717922219 Fix crash in the out-of order restrictions error msg composition
Use the name of the existing preceding column with a restriction
(last_column) instead of assuming that the column right after the
current column already has restrictions.
This will yield an error message that is different from that of
Cassandra, albeit still a correct one.

Fixes #2421

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>
(cherry picked from commit 33bc62a9cf)
2017-07-11 17:15:45 +03:00
Paweł Dziepak
7cd4bb0c4a transport: send correct type id for counter columns
CQL reply may contain metadata that describes columns present in the
response including the information about their type.

However, Scylla incorrectly reports counter types as bigint. The
serialised format of counters and bigint is exactly the same, which
could explain why the problem hasn't been noticed earlier but it is a
bug nevertheless.

Fixes #2569.
Message-Id: <20170711130520.27603-1-pdziepak@scylladb.com>

(cherry picked from commit 5aa523aaf9)
2017-07-11 16:37:24 +03:00
Tomasz Grabiec
588ae935e7 legacy_schema_migrator: Use separate joinpoint instance for each table
Otherwise we may deadlock, as explained in commit 5e8f0efc8:

Table drop starts with creating a snapshot on all shards. All shards
must use the same snapshot timestamp which, among other things, is
part of the snapshot name. The timestamp is generated using supplied
timestamp generating function (joinpoint object). The joinpoint object
will wait for all shards to arrive and then generate and return the
timestamp.

However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
Message-Id: <1499762663-21967-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 310d2a54d2)
2017-07-11 12:31:21 +03:00
Avi Kivity
c292e86b3c sstables: fix use-after-free in read_simple()
`r` is moved-from, and later captured in a different lambda. The compiler may
choose to perform the move before the other capture, resulting in a use-after-free.

Fix by copying `r` instead of moving it.

Discovered by sstable_test in debug mode.
Message-Id: <20170702082546.20570-1-avi@scylladb.com>

(cherry picked from commit 07b8adce0e)
2017-07-10 15:32:57 +03:00
Asias He
3dc0d734b0 repair: Do not store the failed ranges
The number of failed ranges can be large, so they can consume a lot of
memory. We already log the failed ranges, so there is no need to store
them in memory.

Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com>
(cherry picked from commit b2a2fbcf73)
2017-07-10 14:37:47 +03:00
Takuya ASADA
2d612022ba dist/common/scripts/scylla_cpuscaling_setup: skip configuration when cpufreq driver isn't loaded
Configuring the cpufreq service on VMs/IaaS causes an error because they
don't support cpufreq. To prevent the error, skip the whole configuration
when the driver is not loaded.

Fixes #2051

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498809504-27029-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 1c35549932)
2017-07-10 14:08:54 +03:00
Nadav Har'El
5f6100c0aa repair: further limit parallelism of checksum calculation
Repair today has a semaphore limiting the number of ongoing checksum
comparisons running in parallel (on one shard) to 100. We needed this
number to be fairly high, because a "checksum comparison" can involve
high latency operations - namely, sending an RPC request to another node
in a remote DC and waiting for it to calculate a checksum there, and while
waiting for a response we need to proceed calculating checksums in parallel.

But as a consequence, in the current code, we can end up with as many as
100 fibers all at the same stage of reading partitions to checksum from
sstables. This requires tons of memory, to hold at least 128K of buffer
(even more with read-ahead) for each of these fibers, plus partition data
for each. But doing 100 reads in parallel is pointless - one (or very few)
should be enough.

So this patch adds another semaphore to limit the number of checksum
*calculations* (including the read and checksum calculation) on each shard
to just 2. There may still be 100 ongoing checksum *comparisons*, in
other stages of the comparison (sending the checksum requests to other
nodes and waiting for them to return), but only 2 will ever be in the stage of
reading from disk and checksumming them.

The limit of 2 checksum calculations (per shard) applies on the repair
slave, not just to the master: The slave may receive many checksum
requests in parallel, but will only actually work on 2 at a time.

Because the parallelism=100 now rate-limits operations which use very little
memory, in the future we can safely increase it even more, to support
situations where the disk is very fast but the link between nodes has
very high latency.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170703151329.25716-1-nyh@scylladb.com>
(cherry picked from commit d177ec05cb)
2017-07-10 14:08:28 +03:00
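The two-level limiting described above — up to 100 comparisons in flight, but only 2 of them doing the expensive read-and-checksum step at a time — can be sketched with two counting semaphores (illustrative Python with threads; only the limits 100 and 2 come from the message, everything else is invented):

```python
import threading

COMPARISONS = threading.Semaphore(100)   # in-flight checksum comparisons
CALCULATIONS = threading.Semaphore(2)    # concurrent read + checksum work

def compare_range(rng, local_checksum, remote_checksum):
    """Compare a range's local checksum against a remote one."""
    with COMPARISONS:
        # Holding CALCULATIONS is expensive (disk read + hashing),
        # so it is released before the long remote wait below.
        with CALCULATIONS:
            local = local_checksum(rng)
        remote = remote_checksum(rng)    # high-latency RPC, outside the limit
        return local == remote
```

The outer semaphore keeps overall pipelining high so RPC latency is hidden, while the inner one caps the memory- and disk-hungry read phase per shard.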
Avi Kivity
d475a44b01 Merge "Silence schema pull errors during upgrade from 1.7 to 2.0" from Tomasz
"Old and new nodes will advertise different schema version because
of different format of schema tables. This will result in attempts
to sync the schema by each of the node.

Currently this will result in scary error messages in logs about
sync failing due to not being able to find schema of given version.
It's benign, but may scare users. In the future, incompatibilities
could result in more subtle errors. Better to inhibit it completely."

* 'tgrabiec/fix-schema-pull-errors-during-upgrade' of github.com:cloudius-systems/seastar-dev:
  migration_manager: Give empty response to schema pulls from incompatible nodes
  migration_manager: Don't pull schema from incompatible nodes
  service: Advertise schema tables format version through gossip

(cherry picked from commit 91221e020b)
2017-07-10 14:04:41 +03:00
Pekka Enberg
e02d4935ee idl: Fix frozen_schema version numbers
The IDL changes will appear in 2.0 so fix up the version numbers.

Message-Id: <1499680669-6757-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 8112d7c5c0)
2017-07-10 14:02:37 +03:00
Botond Dénes
25f8d365b5 Add text(sstring) version of count, max and min functions
Fixes #2459

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6abb97f21c0caea8e36c7590b92a12d148195db.1499666251.git.bdenes@scylladb.com>
(cherry picked from commit 66cbc45321)
2017-07-10 12:48:29 +03:00
Tomasz Grabiec
de7cb7bfa4 tests: commitlog: Check there are no segments left on disk after clean shutdown
Reproduces #2550.

Message-Id: <1499358825-17855-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 72e01b7fe8)
2017-07-09 19:25:44 +03:00
Tomasz Grabiec
b8eb4ed9cd commitlog: Discard active but unused segments on shutdown
So that they are not left on disk even though we did a clean shutdown.

The first part of the fix is to ensure that closed segments are
recognized as not allocating (the _closed flag). Not doing this
prevents them from being collected by discard_unused_segments(). The
second part is to actually call discard_unused_segments() on shutdown,
after all segments were shut down, so that those whose positions are
cleared can be removed.

Fixes #2550.

Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 6555a2f50b)
2017-07-09 19:25:42 +03:00
Tomasz Grabiec
fcc05e8ae9 legacy_schema_migrator: Drop tables instead of truncate()+remove()
It achieves a similar effect, but is safer than the non-standard
remove() path. The latter was missing unregistration from the
compaction manager.

Fixes #2554.

Message-Id: <1499447165-30253-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit d33d29ad95)
2017-07-09 18:36:56 +03:00
Botond Dénes
05e2ac80af cql3: Add K_FROZEN and K_TUPLE to basic_unreserved_keyword
To allow the non-reserved keywords "frozen" and "tuple" to be used as
column names without double-quotes.

Fixes #2507

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9ae17390662aca90c14ae695c9b4a39531c6cde6.1499329781.git.bdenes@scylladb.com>
(cherry picked from commit c4277d6774)
2017-07-06 18:20:22 +03:00
Avi Kivity
8fa1add26d Update seastar submodule
* seastar 0ab7ae5...89cc97c (4):
  > future-utils: fix do_for_each exception reporting
  > core/thread: Fix unwind information for seastar threads
  > build: export full cflags in pkgconfig file
  > configure: Avoid putting tmp file on /tmp

Still tracking seastar master.
2017-07-06 17:31:06 +03:00
Takuya ASADA
c0a2ca96dd dist/common/scripts/scylla_raid_setup: prevent renaming MDRAID device after reboot
On Debian variants, mdadm.conf should be placed in /etc/mdadm instead of /etc.
Also, it seems we need update-initramfs to fix the renaming issue.

Fixes #2502

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499179912-14125-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 71624d7919)
2017-07-04 18:07:33 +03:00
Avi Kivity
c9ed522fa8 Merge "Adjust row cache metrics for row granularity" from Tomasz
* tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev:
  row_cache: Switch _stats.hits/misses to row granularity
  row_cache: Rename num_entries() to partitions() for clarity
  row_cache: Track mispopulations also at row level
  row_cache: Track row insertions
  row_cache: Track row hits and misses
  row_cache: Make mispopulation counter also apply for continuity information
  row_cache: Add partition_ prefix to current counters
  misc_services: Switch to using reads_with[_no]_misses counters
  row_cache: Add metrics for operations on underlying reader
  row_cache: Add reader-related metrics
  row_cache: Remove dead code

(cherry picked from commit b1a0e37fcb)
2017-07-04 15:21:00 +03:00
Tomasz Grabiec
9078433a7f row_cache: Restore update of concurrent_misses_same_key
It was lost in action in 6f6575f456.

Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e720b317c9)
2017-07-04 14:51:19 +03:00
Avi Kivity
7893a3aad2 Merge "Use selective_token_range_sharder in repair" from Asias
"This series introduces selective_token_range_sharder and uses it in repair to
generate dht::token_range objects that belong to a specific shard."

* tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev:
  repair: Use selective_token_range_sharder
  tests: Add test_selective_token_range_sharder
  dht: Add selective_token_range_sharder

(cherry picked from commit 66e56511d6)
2017-07-04 14:18:08 +03:00
Nadav Har'El
e467eef58d Fix test to use non-wrapping range
The test put a wrapping range into a non-wrapping range variable.
This was harmless at the time this test was written, but newer code
may not be as forgiving, so it's better to use a non-wrapping range as
intended.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170704103128.29689-1-nyh@scylladb.com>
(cherry picked from commit d95f908586)
2017-07-04 14:18:01 +03:00
Tomasz Grabiec
19a07143eb row_cache: Drop not very useful prefixes from metric names
This drops the "total_operations_" and "objects_" prefixes. There is no
convention of adding them in other parts of the system, and they don't
add much value.

Fixes scylladb/scylla-grafana-monitoring#169.

Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 1d6fec0755)
2017-07-04 13:37:24 +03:00
Raphael S. Carvalho
a619b978c4 database: fix potential use-after-free in sstable cleanup
When do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, the sstable object will be used
after it is freed, because it was captured by reference and the
container it lives in was destroyed prematurely.

Let's fix it with a do_with, also making code nicer.

Fixes #2537.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
(cherry picked from commit b9d0645199)
2017-07-03 12:49:13 +03:00
Gleb Natapov
2c66b40a69 main: wait for wait_for_gossip_to_settle() to complete during boot
Boot should not continue until a future returned by
wait_for_gossip_to_settle() is resolved. Commit 991ec4a16 mistakenly
broke that, so restore it. Also fix the supervisor::notify() calls to
be in the right places.

Message-Id: <20170702082355.GQ14563@scylladb.com>
(cherry picked from commit d23111312f)
2017-07-02 11:33:04 +03:00
Tomasz Grabiec
079844a51d row_cache: Fix compilation errors with gcc 5
Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 97005825bf)
2017-06-29 16:35:02 +03:00
Avi Kivity
ea59e1fbd6 Update ami submodule
* dist/ami/files/scylla-ami f10db69...5dfe42f (1):
  > don't fetch perf from amazon repo

(cherry picked from commit 1317c4a03e)
2017-06-29 09:39:29 +03:00
Tomasz Grabiec
089b58ddfe row_cache: Use continuity information to decide whether to populate
If the cache is missing a given key but the range is marked as
continuous, it means the sstables don't have that entry and we can
insert it without asking the (bloom-filter based) presence checker. The
latter is more expensive and gives false positives. So this improves
update performance and hit ratio.

Another positive effect is that we don't have to clear continuity now.

Fixes #1999.

Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 786e75dbf7)
2017-06-28 13:33:34 +03:00
Tomasz Grabiec
d76e9e4026 lsa: Fix performance regression in eviction and compact_on_idle
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).

This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slow down as well.

Introduced in 11b5076b3c.

Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3489c68a68)
2017-06-28 12:33:11 +03:00
Glauber Costa
7709b885c4 disable defragment-memory-on-idle-by-default
It's been linked with various performance issues, either by causing
them or making them worse. One example is #1634; also, I recently
investigated continuous performance degradation that was likewise
linked to defrag-on-idle activity.

Until we can figure out how to reduce its impact, we should disable it.

Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
(cherry picked from commit f3742d1e38)
2017-06-28 00:21:35 +03:00
Avi Kivity
3de701dbe1 Merge "Fix compilation issues in older environments" from Tomasz
* 'tgrabiec/fix-compilation-issues' of github.com:cloudius-systems/seastar-dev:
  tests: streamed_mutation_test: Avoid using boost::size() on row ranges
  tests: row_cache: Remove unused method

(cherry picked from commit ff7be8241f)
2017-06-27 16:31:42 +03:00
Shlomi Livne
9912b7d1eb release: prepare for 2.0-rc0 2017-06-27 12:37:59 +03:00
Takuya ASADA
1e86196ed5 dist/debian: unofficial support of Ubuntu non-LTS versions / Debian non-stable versions
Currently our build script only supports Ubuntu 14.04/16.04 and Debian 8;
this change extends support to Ubuntu non-LTS versions and Debian
non-stable versions. Note that this is unofficial support; users should
build the package for these distributions themselves.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498491473-28691-1-git-send-email-syuu@scylladb.com>
2017-06-26 18:55:55 +03:00
Asias He
cc02a62756 repair: Prefer nodes in local dc when streaming
When peer nodes have the same partition data, i.e., with the same
checksum, we currently choose to stream from any of them randomly.
To improve streaming performance, prefer a peer within the same DC.
This patch is supposed to improve repair performance with multiple DCs.

Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>
2017-06-26 18:34:21 +03:00
Avi Kivity
1170f56447 Merge "Speed up gossip dissemination in large cluster" from Asias
Fixes #2528.

* tag 'asias/gossip_talk_to_more_nodes/v3' of github.com:cloudius-systems/seastar-dev:
  gossip: Use vector for _live_endpoints
  gossip: Talk to more live nodes in each gossip round
2017-06-26 17:59:43 +03:00
Asias He
e31d4a3940 gossip: Use vector for _live_endpoints
To speed up the random access in get_random_node. Switch to use vector
instead of set.
2017-06-26 22:49:59 +08:00
Asias He
437899909d gossip: Talk to more live nodes in each gossip round
In large clusters with a multi-DC deployment, it is observed that
gossip updates take a long time to disseminate in the cluster.

To speed up, talk to more live nodes in each gossip round.

Fixes #2528
2017-06-26 22:49:59 +08:00
Nadav Har'El
6cf44f6817 Optimize column_family::make_sstable_reader() for one partition
This patch does the same thing to column_family::make_sstable_reader() as
commit 186f031 did to sstable::as_mutation_source().

Although usually one can fast_forward_to() on the result of a
column_family::make_sstable_reader(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.

With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumes this optimization exists. In particular,
column_family::data_query() does a single partition read but does not
specify forwarding::no explicitly.
So this patch restores this optimization, despite this meaning that we
blatantly ignore the fwd_mr flag in that case.

Fixes #2524.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170626141121.30322-1-nyh@scylladb.com>
2017-06-26 17:13:03 +03:00
Avi Kivity
9b21a9bfb6 Merge "Implement partial cache" from Tomasz and Piotr
"This series enables cache to keep partial partitions.
Reads no longer have to read whole partition from sstables
in order to cache the result.

The 10MB threshold for partition size in cache is lifted.

Known issues:

 - There is no partial eviction yet, whole partitions are still evicted,
   and partition snapshots held by active reads are not evictable at all
 - Information about range continuity is not recorded if that
   would require inserting a dummy entry, or if previous entry
   doesn't belong to the latest snapshot
 - Cache update after memtable flush happening concurrently with reads
   may inhibit that reads' ability to populate cache (new issue)
 - Cache update from flushed memtables has partition granularity,
   so may cause latency problems with large partition
 - Schema is still tracked per-partition, so after schema changes
   reads may induce high latency due to whole partition needing
   to be converted atomically
 - Range tombstones are repeated in the stream for every range between
   cache entries they cover (new issue)
 - Populating scans for both small and large partitions (perf_fast_forward)
   experienced a 40% reduction of throughput, CPU bound

How was this tested:

 - test.py --mode release
 - row_cache_stress_test -c1 -m1G
 - perf_fast_forward, passes except for the test case checking range continuity population
   which would require inserting a dummy entry (mentioned above)
 - perf_simple_query (-c1 -m1G --duration 32):
     before: 90k [ops/s] stdev: 4k [ops/s]
     after:  94k [ops/s] stdev: 2k [ops/s]"

* tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits)
  tests: row_cache: Add test_tombstone_merging_in_partial_partition test case
  tests: Introduce row_cache_stress_test
  utils: Add helpers for dealing with nonwrapping_range<int>
  tests: simple_schema: Allow passing the tombstone to make_range_tombstone()
  tests: simple_schema: Accept value by reference
  tests: simple_schema: Make add_row() accept optional timestamp
  tests: simple_schema: Make new_timestamp() public
  tests: simple_schema: Introduce make_ckeys()
  tests: simple_schema: Introduce get_value(const clustered_row&) helper
  tests: simple_schema: Fix comment
  tests: simple_schema: Add missing include
  row_cache: Introduce evict()
  tests: Add cache_streamed_mutation_test
  tests: mutation_assertions: Allow expecting fragments
  mutation_fragment: Implement equality check
  tests: row_cache: Add test for population of random partitions
  tests: row_cache: Add test for partition tombstone population
  tests: row_cache: Test reading randomly populated partition
  tests: row_cache: Add test_single_partition_update()
  tests: row_cache: Add test_scan_with_partial_partitions
  ...
2017-06-26 14:54:37 +03:00
Avi Kivity
555621b537 Disentangle memtables from sstables
Remove sstable::write_components(memtable), replacing it with a helper.

Fixes #2354
Message-Id: <20170624142639.16662-1-avi@scylladb.com>
2017-06-26 09:37:11 +02:00
Avi Kivity
236a8370e4 Remove use of std::random_shuffle()
It was removed in C++17. Replace with std::shuffle().
Message-Id: <20170626063809.7563-1-avi@scylladb.com>
2017-06-26 09:36:38 +02:00
Avi Kivity
c4ae2206c7 messaging: respect inter_dc_tcp_nodelay configuration parameter
We respect it partially (client side only) for now.

Fixes #6.
Message-Id: <20170623172048.23103-1-avi@scylladb.com>
2017-06-24 21:49:27 +02:00
Duarte Nunes
2dfd7040eb CMakeLists.txt: Add boost support
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170623172236.15507-1-duarte@scylladb.com>
2017-06-24 21:49:27 +02:00
Avi Kivity
801b5220d6 Merge seastar upstream
* seastar 9e2b7ec...0ab7ae5 (4):
  > Update fmt submodule
  > rpc: add options to control tcp_nodelay
  > core: Fix compilation for older versions of Boost
  > tests/lowres_clock_test: Fix compilation issues
2017-06-24 20:47:52 +03:00
Tomasz Grabiec
b0bcf2be53 tests: row_cache: Add test_tombstone_merging_in_partial_partition test case 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
23c6f517cb tests: Introduce row_cache_stress_test
Runs readers, updates and eviction concurrently and verifies the
following property of reads:

  - reads see all past writes

  - reads see no partial writes within a single partition
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
4b4aef789e utils: Add helpers for dealing with nonwrapping_range<int> 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5c9f87fb27 tests: simple_schema: Allow passing the tombstone to make_range_tombstone() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
edf4a3494c tests: simple_schema: Accept value by reference 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5f70df472f tests: simple_schema: Make add_row() accept optional timestamp 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
53867c4328 tests: simple_schema: Make new_timestamp() public 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
51b5814ec2 tests: simple_schema: Introduce make_ckeys() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
074c67fe4d tests: simple_schema: Introduce get_value(const clustered_row&) helper 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8ffc776e06 tests: simple_schema: Fix comment 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
ecacd2e84a tests: simple_schema: Add missing include 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
b56232b216 row_cache: Introduce evict() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
c4e8effffa tests: Add cache_streamed_mutation_test
[tgrabiec:
  - extracted from a larger commit
  - removed coupling with how cache_streamed_mutation is created (the
    code went out of sync), used more stable make_reader(). it's simpler too.
  - replaced false/true literals with is_continuous/is_dummy where appropriate
  - dropped tests for cache::underlying (class is gone)
  - reused streamed_mutation_assertions, it has better error messages
  - fixed the tests to not create tombstones with missing timestamps
  - relaxed range tombstone assertions to only check information relevant for the query range
  - print cache on failure for improved debuggability
]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
44fdee3f2e tests: mutation_assertions: Allow expecting fragments 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1f23130b07 mutation_fragment: Implement equality check 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
116bcb8b30 tests: row_cache: Add test for population of random partitions 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
930a1415fe tests: row_cache: Add test for partition tombstone population 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
9bfece6f82 tests: row_cache: Test reading randomly populated partition 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
0358334579 tests: row_cache: Add test_single_partition_update()
[tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8bb76e2f12 tests: row_cache: Add test_scan_with_partial_partitions 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
896bf2e5de Remove unused methods from MVCC
Some apply methods were replaced by apply_to_incomplete().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6f6575f456 row_cache: Enable partial partition population 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5a0ae55f6d Introduce schema_upgrader 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1828e28bbb database: Invalidate cache atomically with attaching streaming sstables
Not doing so may cause reads to see partial writes, if another
update+read happens in between.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
896196b841 database: Invalidate cache from seal_active_streaming_memtable_immediate()
Cache must be synchronized atomically with changing the underlying
mutation source, otherwise write atomicity may not hold.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
7ae40d7045 tests: Add test for update_invalidating() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e792220c3a row_cache: Introduce update_invalidating() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c29878f49f row_cache: Extract memtable walking logic from update() into do_update()
So that it can be reused in update_invalidating().
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6ebfb730ee partition_entry: Introduce partition_tombstone() getter 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
fb62dfab02 tests: mvcc: Introduce test_schema_upgrade_preserves_continuity 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
164989a574 tests: mvcc: Add test for partition_entry::apply_to_incomplete() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e433e68610 partition_entry: Make squashed() and upgrade() work with not fully continuous versions
Those methods first create a neutral mutation_partition, and left-fold
it with the versions. The problem is that there is no neutral element
for static row continuity, the flag from the first addend always
wins. We have to copy the flag from the first version to preserve
the logical value.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
b680de930c partition_entry: Introduce apply_to_incomplete()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec:
  - extracted from a larger commit
  - fix heap comparator in apply_incomplete_target to order versions properly
  - extracted partition_version detaching into
    partition_entry::with_detached_versions()
  - dropped unnecessary rows_iterator::_version field
  - dropped unnecessary allocation of rows_entry and key copies
    in rows_iterator
  - dropped row_pointer
  - replaced apply_reversibly() with weaker and faster apply()
  - added handling of dummy entries at any position
  - fixed exception safety issue in apply_to_incomplete() which may
    result in data loss. We cannot move data out of applied versions
    into a new synthetic row and then apply it, because if exception
    happens in the middle, the data which was moved from the source
    will be lost. To fix that, row_iterator::consume_row() is
    introduced which allows in-place consumption of data without
    construction of temporary deletable_row.
  ]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
b6ce963200 partition_version: Introduce partition_entry::with_detached_versions() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
2d8f024e4d partition_version: Document version merging rules on partition_entry 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
0770845a23 mutation_partition: Introduce r-value accepting deletable_row::apply() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
48a5b1d3ab converting_mutation_partition_applier: Expose cell upgrade logic 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
04aebaa2cb streamed_mutation: Introduce transform() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
3755641c4b row_cache: Introduce cache_streamed_mutation
This streamed mutation populates the cache with the rows requested by
the read. It takes whatever it can find in the cache and fetches the
remainder from the underlying source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec:

  - fixed maybe_add_to_cache_and_update_continuity() leaking entries if
    the key already exists in the snapshot

  - fixed a problem where population race could result in a read
    missing some rows, because cache_streamed_mutation was advancing
    the cursor, then deferring, and then checking continuity. We
    should check continuity atomically with advancing.

  - fixed rows_handle.maybe_refresh() being accessed outside of update
    section in read_from_underlying() (undefined behavior)

  - fixed a problem in start_reading_from_underlying() where we would
    use incorrect start if lower_bound ended with a range tombstone
    starting before a key.

  - range tombstone trimming in add_to_buffer() could create a
    tombstone which has too low start bound if last_rt.end was a
    prefix and had inclusive end. invert_kind(end_kind) should be used
    instead of unconditional inc_start.

  - range tombstone trimming incorrectly assumed it is fine to trim
    the tombstone from underlying to the previous fragment's end and
    emit such tombstone. That would mean the stream can't emit any
    fragments which start before previous tombstone's end. Solve with
    range_tombstone_stream.

  - split add_to_buffer() into overloads for clustering_row, and
    range_tombstone. Better than wrapping into mutation_fragment
    before the call and having add_to_buffer() rediscover the
    information.

  - changed maybe_add_to_cache_and_update_continuity() to not set
    continuity to false for existing entries, it's not necessary

  - moved range tombstone trimming to range_tombstone class
  - moved range tombstone slicing code to range_tombstone_list and partition_snapshot
  - can_populate::can_use_cache was unused, dropped
  - dropped assumption that dummy entries are only at the end
  - renamed maybe_add_to_cache_and_update_continuity() to maybe_add_to_cache()
  - dropped no longer needed lower_bound class
  - extracted row_handle to a separate patch
  - made the copy-from-cache loop preemptable
  - split maybe_add_next_to_buffer_and_update_continuity(bool)
  - dropped cache_populator
  - replaced "underlying" class with use of read_context
  - replaced can_populate class with a function
  - simplified lsa_manager methods to avoid moves
]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
58a8022462 intrusive_set_external_comparator: Introduce insert_check() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
509a0d8a83 row_cache: Allow reading from underlying through read_context
The interaction will be as follows:

  - Before creating cache_streamed_mutation for given partition, cache
    mutation reader sets up read_context for current partition (in one
    of two ways) so that the matching underlying streamed_mutation can
    be accessed at any time by cache_streamed_mutation.

  - cache_streamed_mutation assumes that read_context is set up for
    current partition and invokes fast_forward_to() and
    get_next_fragment() to access the underlying
    streamed_mutation.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
69ea29131f row_cache: Allow specifying desired snapshot in autoupdating_underlying_reader
When reading from incomplete partition entry, we may discover we need
to read something from the underlying mutation source. In such case we
will fast forward this reader to that partition. But we must do it
using a specific snapshot, the one we obtained when entering the
partition, not the latest one.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a1d3e0318c row_cache: Store autoupdating_underlying_reader in read_context
Will be reused for reading of incomplete partition entries.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
3f2320c377 row_cache: Store information whether query is a range query in read_context
We will need to use this information later in yet another place, when
creating a reader for incomplete cache entry. This refactors the code
so that there is a single place which determines this fact.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a2207ee9a6 row_cache: Move autoupdating_underlying_reader to read_context.hh 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
ca920bd0ef row_cache: Keep only one streamed_mutation in scanning_and_populating_reader
Currently scanning_and_populating_reader asks
just_cache_scanning_reader for the next partition from cache, together
with information about whether the range is continuous. If it's not, it saves the
partition it got from it and moves on to reading from the underlying
reader up to that partition. When that's done, it emits the stored
partition.

This approach won't work well with upcoming changes for storing
partial partitions. We won't have whole partitions any more, so
streamed_mutation returned for the entry needs to be prepared for
reading from the underlying mutation source. We want to reuse the same
underlying reader as much as possible, so all streamed_mutations for
given read (read_context) will share the state of the underlying
reader. Construction of a streamed_mutation will depend on the fact
that the shared state is set up for it, so we cannot have two
streamed_mutations prepared at the same time (one for entry from
primary, and one for the earlier entry being populated). This change
defers the creation of a streamed_mutation for the entry present in
cache until the whole reader reaches it to avoid this problem.

This will also have another potentially beneficial effect. Since we
defer the decision about which snapshot to use until we reach the
entry, there is a higher chance that the current snapshot of the entry
will match the one used last by the populating read, and that we will
be able to reuse the reader.

It's implemented by utilizing a stable partition cursor which tracks
its current position so that it's possible to revisit the cache entry
(if it's still there) after population ends. The functionality of
just_cache_scanning_reader was inlined into
scanning_and_populating_reader.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1b041298fe range: Introduce trim_front() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
045888d5f3 row_cache: Introduce partition_range_cursor 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e989d65539 dht: Make ring_position_view copyable
dht::token needs to be stored as a pointer now and not a reference so
that validity of old pointers is not impacted by in-place object
construction which would occur in the copy-assignment operator.

[1] says that old pointers can be used to access the new object only
if the type "does not contain any non-static data member whose type is
const-qualified or a reference type".

[1] http://en.cppreference.com/w/cpp/language/lifetime#Storage_reuse
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c3905bf235 row_cache: Print position instead of key of cache_entry 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
f0cc86e5db row_cache: Introduce cache_entry::position() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e3272526a1 row_cache: Allow comparing with ring_position views in row_cache::compare 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5bfecaad99 row_cache: Switch invalidate_unwrapped() to use ring_position_view ranges
It's needed before switching cache_entry ordering to rely solely on
cache_entry::position() so that invalidate_unwrapped() never removes
the dummy entry at the end. Currently if the range has upper bound
like this:

  { ring_position::max(), inclusive=true }

The code which selects entries for removal would include the dummy row
at the end. It uses upper_bound() to get the end iterator, and the
dummy entry has a position which is equal to the position in the
bound.

ring_position_view ranges are end-exclusive, so it's impossible to
create a partition range which would include a dummy entry.

The code is also simpler.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
64626b32b0 row_cache: Make printable 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
639af55a78 partition_version: Add versions() getter
[tgrabiec: Use explicit return type]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1d3fec43eb partition_version: Make return type of versions() explicit 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
22a0e301f1 partition_version: Make is_referenced() const-qualified 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
a17fa5726f Introduce streamed_mutation_from_forwarding_streamed_mutation
This will allow conversion from streamed_mutation that
supports fast forwarding to streamed_mutation that does not.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
11191b7aef streamed_mutation: Introduce make_empty_streamed_mutation() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
7c9569ec95 Introduce partition_snapshot_row_cursor 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
54b3da1910 row_cache: Introduce find_or_create() helper 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
f2d2c221d4 row_cache: Return cache_entry reference from do_find_or_create_entry
Will be useful when additional action needs to be done on the entry
after it was created or constructed.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
0db6bdc916 row_cache: Introduce cache_entry constructor which constructs incomplete entry 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1c642219c9 row_cache: Ensure there is always a dummy entry after all clustered rows
Algorithms will assume that.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
bbfa52822e row_cache: Switch readers to use per-entry snapshots
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.

Consider the following case:

  flush write(ck=1), write(ck=2) -> snapshot_1
  cache reader 1 reads and inserts ck=1 @snapshot_1
  flush write(ck=1), write(ck=2) -> snapshot_2
  cache reader 2 reads and inserts ck=2 @snapshot_2

Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.

To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.

There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.

After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.

Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
81e7b561da dht: Add ring_position min()/max() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8ba6366610 row_cache: Switch to using snapshot_source
Currently every time cache needs to create reader for missing data it
obtains a reader which is most up to date. That reader includes writes
from later populate phases, for which update() was not yet
called. This will be problematic once we allow partitions to be
partially populated, because different parts of the partition could be
partially populated using readers using different sets of writes, and break
write atomicity.

The solution will be to always populate given partition using the same
set of writes, using reader created from the current snapshot. The
snapshot changes only on update(), with update() gradually converting
each partition to the new snapshot.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
446bcdb00d database: Add missing cache invalidation after attaching sstables
This violation of the contract is currently benign, because there are
no reads from those tables before they are populated. If there were,
the cache would mark the whole (empty) range as continuous and the
table would appear empty.

It will cause similar problem once cache starts using snapshots of the
underlying mutation source. Then this lack of invalidate() will also
result in cache thinking that the table is still empty.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e23c7e2f34 row_cache: Rework invalidate() implementation
1) Reduce duplication by delegating to more general overloads

 2) Improve documentation to not mention effects in terms of
    population (a detail) but rather write visibility

 3) Rename clear() to invalidate() and merge with the range variant,
    it has the same semantics
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c82c6ec6ed database: Allow obtaining snapshot_source for sstables 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
bd023b6161 tests: Introduce memtable_snapshot_source
Snapshottable in-memory mutation source for use in row_cache tests.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
ddfcf64966 mutation_source: Make copying cheaper
Cache readers will need to take snapshots by copying the
mutation_source. That's going to happen quite often, so make copying
cheaper.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
58d5e1393b mutation_reader: Introduce make_combined_mutation_source() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1e2463a382 mutation_reader: Introduce make_empty_*_source() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
289d01c2cc mutation_reader: Introduce concept of snapshot_source 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
2d73c193e7 row_cache: Introduce read_context
This object stores all the read-relevant context that is required all
over the place. This leads to cleaner code.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec:
  - made read_context shareable to allow storing shared
    mutable state later
  - added range and cache getters
]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
a3ff8db323 row_cache: Introduce autoupdating_underlying_reader
This is an abstraction that represents a reader
to the underlying source and auto updates itself
to make sure the reader reflects the latest state
of the underlying source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec: Add range getter to avoid friendships]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
b6d349728f range_tombstone_list: Introduce slice() working with position range 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6ce08f2f9a range_tombstone: Introduce trim_front() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
271dfc2eac position_in_partition: Introduce for_range_start()/for_range_end() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
3b52afa4a3 position_in_partition: Introduce no_clustering_row_between() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
9c2b3e1167 position_in_partition: Introduce as_start_bound_view() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
b47c8f1df7 partition_snapshot: Add const-qualified overload of version()
[tgrabiec: Extracted from a different patch]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
dd9d35c166 partition_snapshot: Add getter for range tombstones 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
60c3c0a471 partition_entry: Add squashed() overload with a single schema 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
98f7671553 partition_snapshot: Introduce squashed() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
87b0f11be3 partition_snapshot: Add getters for static row and partition tombstone
[tgrabiec:
  - Extracted from a different patch
  - Renamed concept names to more familiar Map and Reduce
  - Renamed aggregate() to squashed() to match the existing nomenclature
  - Uncommented the concepts
  ]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
ea59b9475e partition_version: Add const-qualified variant of operator->
[tgrabiec: Extracted from a different patch]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
f6fe0acea4 partition_version: Make operator bool() const-qualified
[tgrabiec: Extracted from a different patch]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
efc75b0bc3 mutation_partition: Add rows_entry constructor which accepts full contents
[tgrabiec: Extracted from different patch]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
7f8620d4a7 tests: mutation_source: Relax expectations about range tombstones
In preparation for having a partial cache which trims range tombstones
to the lower bound of the query.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
3a9212e0f2 tests: mutation_assertions: Add ability to limit verification to given clustering_row_ranges
Currently mutation sources are free to return range tombstones
covering a range which is larger than the query range. The cache
mutation source will soon become more eager about trimming such
tombstones. To account for such differences, allow telling the
assertions to only care about differences relevant for the given
clustering ranges.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
f925b26241 tests: mutation_reader_assertions: Simplify 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1d5d5e26a2 mutation: Introduce sliced() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
92d6456070 range_tombstone_list: Introduce equal() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1594ace4d3 range_tombstone_stream: Make printable 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
19edb0b535 range_tombstone_list: Make printable 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
2e75595ecf range_tombstone_list: Introduce trim() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
5a29c70f3e mutation_fragment: make mutation_fragment copyable
This will be needed by the implementation of cache_streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
2fdabcaa9b Track population phase in partition_snapshot
This will be used by partial cache in later patches.

[tgrabiec:
  - changed title,
  - documented meaning of the variable,
  - renamed the variable,
  - introduced open_version(),
  - fixed continuity of the static row not being preserved in case
    a new version is created]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
a841c77c54 Introduce maybe_merge_versions
This will be used in the following patches by partial cache.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
9642f806ab partition_version: Introduce version() getter 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
9380dd1ee3 mutation_source: make sure we never ignore fast forwarding
Mutation sources sometimes ignore the fast forwarding parameter, so
this change adds an assertion to check that the parameter
can be safely ignored.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
ab72241e22 mutation_reader: Accept forwarding flag in make_reader_returning()
By default make_reader_returning creates a reader that does not
support fast forwarding, but the second parameter can be used to
make it support fast forwarding.

[tgrabiec: Improve title]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
ac03331490 row_cache_test: improve test_sliced_read_row_presence
Remove unused parameter and add checks to make sure
all expected rows have been received.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
db053ef902 tests: Add test for continuity merging rules 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
2edf08d36a tests: random_mutation_generator: Generate random continuity 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8873a443db tests: mutation: Generate mutations with continuity 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
dce293e11c tests: row_cache: Apply only fully continuous mutations to underlying mutation source
Cache currently assumes that mutations coming from outside are fully
continuous.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
e86f74edd8 tests: row_cache: Add missing apply() to test_mvcc test case
[tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
95dcfa859b tests: row_cache: Improve test_mvcc()
assert_that().is_equal_to() gives better error message.

Also, there is code which can be replaced with
assert_that_stream().has_monotonic_positions()
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
05b56fcfb0 mutation_partition: Add support for specifying continuity
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.

Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.

[tgrabiec:
  - based on the following commits:
     4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
     773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
  - documented that partition tombstone is always complete
  - require specifying the partition tombstone when creating an incomplete entry
  - replaced rows_entry(dummy_tag, ...) constructor with more general
    rows_entry(position_in_partition, ...)
  - documented continuity semantics on mutation_partition
  - fixed _static_row_cached being lost by mutation_partition copy constructors
  - fixed conversion to streamed_mutation to ignore dummy entries
  - fixed mutation_partition serializer to drop dummy entries
  - documented semantics of continuity on mutation_partition level
  - dropped assumptions that dummy entries can be only at the last position
  - changed equality to ignore continuity completely, rather than
    partially (it was not ignoring dummy entries, but ignoring
    continuity flag)
  - added printout of continuity information in mutation_partition
  - fixed handling of empty entries in apply_reversibly() with regards
    to continuity; we no longer can remove empty entries before
    merging, since that may affect continuity of the right-hand
    mutation. Added _erased flag.
  - fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
  - fixed partition_builder to not ignore continuity
  - renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
  - standardized all APIs on is_dummy and is_continuous bool_class:es
  - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
  - dropped unused remove_dummy_entry()
  - simplified and inlined cache_entry::add_dummy_entry()
  - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
  ]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
063b37f352 partition_snapshot_reader: Be prepared for skipping some row entries
If some row entries may have to be skipped by the reader then it could
be that _clustering_rows is not empty, but read_next() will return a
disengaged optional because there are no more rows in the current
range. The code assumed that this is never the case, and that if read_next()
returns a disengaged optional then we have exhausted all ranges. Before
introducing dummy entries this needs to be refactored.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
2cfe23a35e partition_snapshot_reader: Use rows_entry::position() for comparing rows
key() will not be valid for dummy entries.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
50641b2849 partition_snapshot_reader: Reuse rows_entry comparator 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
d1a1fdfd57 partition_snapshot_reader: Encapsulate row walking to simplify read_next() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a77734952d mutation_partition: Make rows_entry comparable with position_in_partition 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
65b3123516 mutation_partition: Use rows_entry::position() in comparators
key() will not be valid for dummy entries, but position() is always
valid.

[tgrabiec: Extracted from other commits]
[tgrabiec: Added missing change to range_tombstone_stream::get_next]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
660f3127a6 mutation_partition: Introduce rows_entry::position()
In preparation for enabling dummy entries with position past all
clustering rows.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
9d705bc1c6 position_in_partition_view: Add key component getter 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
206a5f2bf5 position_in_partition_view: Add is_clustering_row() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
874b34ac09 position_in_partition_view: Add converting constructor from a key reference 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
794f9b21ef position_in_partition: Add is_after_all_clustered_rows() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
14fbf4409c position_in_partition: Introduce after_key() in the view 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
1457be4de8 position_in_partition: Introduce for_key()
[tgrabiec: Take the key by reference]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
94c957c2ff Extract position_in_partition to separate header
This will allow its use in mutation_partition.hh

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
0fd4dedc6a position_in_partition: Add after_all_clustered_rows() to view
This is a position that's always in the end after any
other position. It will be used for dummy rows_entry.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
60346a2819 row_cache: remove unused read overload
This will simplify the following patches and unused
code should be removed anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
77f944880c cache: Remove support for wide partitions
This will be handled by row cache now.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
fbe8c24ebe tests: row_cache_alloc_stress: Make eviction detection more reliable
It can happen that touch() will trigger eviction on entry to
allocating section, and drop in occupancy around insertion
will not happen. As a result, we may evict a lot without detecting that.

Extend the check to include touch() and use more reliable eviction counters.
2017-06-24 18:06:11 +02:00
Jesse Haber-Kucharsky
0791e90424 Further improve CMakeLists.txt for CLion
We support situations where `seastar.pc` is available, and when it isn't.

When `seastar.pc` is available, we grab the compilation flags from it in
addition to the defaults.

Some DPDK files are statically available in the source repository. We prefer
those to files placed during compilation in case modifications are made during
development (that would be lost during a build).

We always disable GCC6 concepts for the IDE, even if they are enabled while
configuring the project for compilation. CLion's parser doesn't understand them.

One final benefit to this revision is that now only target-specific flags are
modified rather than global flags.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <75457652201c5ed89d05081ec4b56b4340721cf5.1498237756.git.jhaberku@scylladb.com>
2017-06-23 19:21:28 +02:00
Avi Kivity
aebc77507c Merge "Clock efficiency and organization" from Jesse
"This patch series makes two noteworthy changes:

- `gc_clock` now uses `seastar::lowres_system_clock` for better performance.

- Code common to all clocks is properly refactored and shared.

Otherwise, there are some small improvements that should have no functional impact."

* 'jhk/lowres_gc_clock/v1' of https://github.com/hakuch/scylla:
  Seal clock definitions
  `timestamp_clock::now()` is not `noexcept`
  Make `db_clock` `time_t` conversions `constexpr`
  Move common clock implementation helpers
  Simplify clock implementations
  db_clock.hh: Clean preprocessor directives
  Make `gc_clock` a model of `Clock`
  Use `lowres_system_clock` to back `gc_clock`
  Add `time_t` conversions for `gc_clock`
2017-06-23 18:47:56 +03:00
Jesse Haber-Kucharsky
b0ad1ff447 Seal clock definitions 2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky
09954c45f1 timestamp_clock::now() is not noexcept
The problem is that `std::chrono::duration_cast` is not `noexcept`. As a result,
`timestamp_clock` is actually a model of `Clock` and not `TrivialClock`.
2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky
28169fabca Make db_clock time_t conversions constexpr 2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky
e045dddae8 Move common clock implementation helpers
This change fixes the dependencies between the clock implementation headers. All
the clocks share the common clock offset, but are otherwise independent (though
the `db_clock` does depend on `gc_clock` for time point conversions).
2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky
2d184f27af Simplify clock implementations 2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky
51c767c1c7 db_clock.hh: Clean preprocessor directives 2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky
050ece6f74 Make gc_clock a model of Clock
It was missing `is_steady`.
2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky
00bcd568a6 Use lowres_system_clock to back gc_clock
`seastar::lowres_system_clock` is more efficient than
`std::chrono::system_clock` and `gc_clock` has very coarse granularity
requirements.

Fixes #1957.
2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky
73020685ee Add time_t conversions for gc_clock
`gc_clock` reports system time, and these conversion functions allow for
manipulating time points produced by the clock without making assumptions about
its epoch.
2017-06-23 11:35:34 -04:00
Duarte Nunes
4ef25e8e38 db/schema_tables: Add note to make_update_view_mutations
Document that a new view schema passed to make_update_view_mutations()
might be based on base schema that hasn't yet been loaded.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170618200558.96036-1-duarte@scylladb.com>
2017-06-23 15:24:35 +02:00
Duarte Nunes
bc1f1fa88a Merge branch "Some fixes for clang++ trunk" from Avi
"Fix minor issues found while building with clang trunk."

* 'clang' of https://github.com/avikivity/scylla:
  seastarx: don't make seastar namespace inline
  seastarx: add missing make_shared forward declaration
  tests: fix call to seastar::sleep()
  dht: fix bad to_sstring() call
2017-06-22 17:31:27 +02:00
Avi Kivity
f3366d8ae6 seastarx: don't make seastar namespace inline
It's apparently not legal to re-declare an existing namespace
inline. Use "using" instead.
2017-06-22 18:16:13 +03:00
Avi Kivity
c423330917 seastarx: add missing make_shared forward declaration
Without this, clang (correctly) complains that it can't deduce the type when
it is not explicitly mentioned.
2017-06-22 18:16:13 +03:00
Avi Kivity
672de608bf tests: fix call to seastar::sleep()
It's not in the global namespace.
2017-06-22 18:16:13 +03:00
Avi Kivity
f9f2f18145 dht: fix bad to_sstring() call
to_sstring() is part of seastar, not the global namespace.
2017-06-22 17:51:27 +03:00
Avi Kivity
6f8dba3fa9 Merge "small fixes and cleanup for leveled strategy" from Raphael
* 'lcs_improvements_v1' of github.com:raphaelsc/scylla:
  lcs: remove useless code for choosing L0 candidates
  lcs: remove some dead code
  lcs: make logger static
  lcs: actually prefer oldest sstables of L0 when it falls behind
  lcs: remove useless expensive check for overlapping L1 sstables
2017-06-22 15:45:34 +03:00
Raphael S. Carvalho
4351e0a996 compaction: introduce new compaction type for reshard
so now user can look at nodetool compactionstats and determine
whether or not resharding is running, for example:
$ ./bin/nodetool compactionstats
pending tasks: 3
       id   compaction type   keyspace                table   completed   total   unit   progress
   <none>           RESHARD     system   compaction_history          11     256   keys      4.30%
   <none>           RESHARD     system   compaction_history           2     256   keys      0.78%
   <none>           RESHARD     system   compaction_history          10     256   keys      3.91%
   <none>           RESHARD     system   compaction_history           8     256   keys      3.12%
   <none>           RESHARD     system   compaction_history           7     256   keys      2.73%

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170620175733.25882-1-raphaelsc@scylladb.com>
2017-06-22 14:48:38 +03:00
Gleb Natapov
9b8499df0e cache_hitrate_calculator: filter cfs based on replication strategy instead of a name
The code filters CFs by name to not include system keyspace, but v3
schema added yet another system namespace. Better filter according to
replication strategy to accommodate for schema v4 adding even more
system keyspaces.

Fixes: #2516

Message-Id: <20170621073816.GB3944@scylladb.com>
2017-06-22 11:26:34 +03:00
Tzach Livyatan
9e6337f330 Add a commented-out experimental line to scylla.yaml
Making it easier for users to enable experimental features

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170621191720.13575-1-tzach@scylladb.com>
2017-06-22 09:06:19 +03:00
Jesse Haber-Kucharsky
8174a40098 Improve CMakeLists.txt for CLion
This version optionally reads the include paths for Seastar from pkg-config and
uses file globbing to register all source and header files.

In comparison to the previous version, I see all files in the project explorer
view are "active" (rather than just .cc files). I believe there are also fewer
errors reported by the editor.
2017-06-21 16:34:47 -04:00
Avi Kivity
f0b20be14d Revert "system_keyspace: Make sure "system" is written to keyspaces (visible)"
This reverts commit 89ef69c4b3. Prevents nodes
from joining the cluster.
2017-06-21 16:58:04 +03:00
Avi Kivity
8585a356eb Revert "Revert "db: prevent latency spikes during streaming/repair""
This reverts commit 399d219cab. Turns out it
was not the culprit.
2017-06-21 16:58:04 +03:00
Takuya ASADA
aa77ac1138 dist/debian: Debian 9(stretch) support
Add support for Debian's new stable release.
Also includes the following changes:
 - update libthrift, since it fails to compile on Debian 9
 - drop dist/debian/supported_release, since the distribution check code moved to pbuilderrc
 - add libssl-dev to build-depends
 - add sudo to the pbuilder extra packages (Debian doesn't include it in the default install)

Signed-off-by: syuu <syuu@dokukino.com>
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498047515-19972-1-git-send-email-syuu@scylladb.com>
2017-06-21 15:30:22 +03:00
Avi Kivity
399d219cab Revert "db: prevent latency spikes during streaming/repair"
This reverts commit bdfa2ed923245e236837f58925c797e26df32361; prevents nodes
from joining.
2017-06-21 11:28:29 +03:00
Calle Wilund
89ef69c4b3 system_keyspace: Make sure "system" is written to keyspaces (visible)
Fixes #2514
Bug in schema version 3 update: We failed to write "system" to the
schema tables. Only visible on an empty instance of course.
Message-Id: <1497966982-10044-1-git-send-email-calle@scylladb.com>
2017-06-20 20:59:47 +02:00
Avi Kivity
bdfa2ed923 db: prevent latency spikes during streaming/repair
The memtable destructor can take a long time if the memtable is full; use
clear_gently() to clear it without impacting latency.

Fixes #2477.
Message-Id: <20170620093550.16121-1-avi@scylladb.com>
2017-06-20 13:03:43 +02:00
Nadav Har'El
186f031187 Optimize sstable::as_mutation_source() for one partition
Although usually one can fast_forward_to() on the result of a
sstable::as_mutation_source(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.

With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumed this optimization exists... So this patch restores
this optimization, despite this meaning that we blatantly ignore
the fwd_mr flag in that case.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170620081107.14335-1-nyh@scylladb.com>
2017-06-20 10:32:10 +02:00
Avi Kivity
ba0ba87bf9 Merge seastar upstream
* seastar 621b7ed...9e2b7ec (8):
  > Merge "Low-resolution clocks" from Jesse
  > build: disable -Wattributes when gcc -fvisibility=hidden bug strikes
  > build: work around ragel 7 generated code bug
  > rpc: make unmarshall_exception() inline
  > future-utils: make functions global
  > fix reactor stall detector rate limiting on an mostly idle system
  > prometheus: Add ability to add /metrics to any http_server
  > prometheus: fix memory leak in http_server_control
2017-06-20 11:01:40 +03:00
Tomasz Grabiec
358bf88cf8 mutation_reader: Fix abort when streaming more than one range
multi_range_mutation_reader uses fast_forward_to() to skip between
ranges, so we always need to create the underlying reader with
mutation_reader::forwarding::yes if there is more than one range,
irrespective of whether multi_range_mutation_reader itself will be
forwarded or not.

Fixes #2510.

Introduced in commit 3018df1.

Message-Id: <1497943032-18696-1-git-send-email-tgrabiec@scylladb.com>
2017-06-20 10:29:45 +03:00
Amos Kong
92731eff4f common/scripts: fix node_exporter url
Commit ff3d83bc2f updated node_exporter
from 0.12.0 to 0.14.0, and introduced a bug in the install-file download URL.

node_exporter started adding a 'v' prefix to release tags[1] from 0.13.0,
so we need to fix the URL.

[1] https://github.com/prometheus/node_exporter/tags

Fixes #2509

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <42b0a7612539a34034896d404d63a0a31ce79e10.1497919368.git.amos@scylladb.com>
2017-06-20 09:25:39 +03:00
Raphael S. Carvalho
82048e6f77 lcs: remove useless code for choosing L0 candidates
The code being removed could be used if parallel compaction were
allowed for LCS, but the current code doesn't even allow that.
At the moment, it's only wasting cycles.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 21:19:29 -03:00
Raphael S. Carvalho
81f20068d6 lcs: remove some dead code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 21:02:46 -03:00
Raphael S. Carvalho
b26dc6db1a lcs: make logger static
otherwise, there will be one instance per shard.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 21:01:21 -03:00
Raphael S. Carvalho
4bb27cbd6f lcs: actually prefer oldest sstables of L0 when it falls behind
The strategy prefers promoting the oldest sstables in L0. Because the sort
procedure incorrectly sorts elements in descending order, the
newest sstables will be promoted first *if and only if* L0 falls
behind (more than 32 sstables). If L0 doesn't fall behind, all
L0 sstables will be compacted with the overlapping ones in L1.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 20:45:39 -03:00
Raphael S. Carvalho
90db2d7eba lcs: remove useless expensive check for overlapping L1 sstables
There's no way an L1 sstable will be in the candidate set, which was
previously built from the list of L0 sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 19:33:52 -03:00
Takuya ASADA
71600eb298 dist/debian: Use pbuilder for Ubuntu/Debian debs
Enable pbuilder for Ubuntu/Debian to prevent build-environment-dependent issues.
Also support cross building with pbuilder.
(cross-building from Fedora 25 and Ubuntu 16.04 has been tested)

closes #629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497895661-26376-1-git-send-email-syuu@scylladb.com>
2017-06-19 21:15:06 +03:00
Nadav Har'El
984da1d8d7 Make forwarding_tag local to streamed_mutation
As Avi noticed, the "forwarding_tag", which was meant to be local to
streamed_mutation, became global. If another class copied the same trick,
it would share the same type instead of the two being distinct types as intended.

The problem is that in:

	using forwarding = bool_class<class forwarding_tag>;

Apparently, the "class forwarding_tag" forward-declares a global type - it
does not create a local-scope type as intended, which the following apparently
does (even though no actual definition is given for that class):

	class forwarding_tag;
	using forwarding = bool_class<forwarding_tag>;
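The scoping difference can be demonstrated standalone (using a minimal stand-in for Seastar's bool_class; illustrative only, not the actual commit code):

```cpp
#include <type_traits>

// Minimal stand-in for seastar::bool_class (the real definition differs).
template <typename Tag>
struct bool_class {
    bool value = false;
};

struct reader_a {
    // The elaborated-type-specifier "class forwarding_tag" inside a template
    // argument declares the tag at the enclosing *namespace* scope...
    using forwarding = bool_class<class forwarding_tag>;
};

struct reader_b {
    // ...so repeating the trick here names the very same global tag.
    using forwarding = bool_class<class forwarding_tag>;
};

struct reader_c {
    // The fix: declare the tag as a member first, making it a distinct
    // nested type.
    class forwarding_tag;
    using forwarding = bool_class<forwarding_tag>;
};

// The bug: reader_a and reader_b accidentally share one flag type.
constexpr bool tags_shared =
    std::is_same_v<reader_a::forwarding, reader_b::forwarding>;
// The fix gives reader_c a flag type of its own.
constexpr bool tag_distinct =
    !std::is_same_v<reader_a::forwarding, reader_c::forwarding>;

static_assert(tags_shared);
static_assert(tag_distinct);
```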

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619153933.13116-1-nyh@scylladb.com>
2017-06-19 20:04:47 +03:00
Nadav Har'El
3018df11b5 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
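The last_end choice described above boils down to the following (a hedged sketch with illustrative names, not the actual data_consume_rows() signature):

```cpp
#include <cstdint>

// If fast_forward_to() past the current range is allowed, the stream must
// stay open to the end of the file; otherwise there is no point in ever
// reading past the range's end.
uint64_t choose_last_end(uint64_t range_end, uint64_t file_end,
                         bool forwarding_allowed) {
    return forwarding_allowed ? file_end : range_end;
}
```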

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
2017-06-19 18:31:32 +03:00
Pekka Enberg
98fb2c0b56 docs: Fix Docker Hub documentation logo
The URL got broken when www.scylladb.com changed. Fix it up.

Message-Id: <1497360648-19210-1-git-send-email-penberg@scylladb.com>
2017-06-19 13:11:59 +03:00
Takuya ASADA
e1459dc9ef dist/debian: provides 3rdparty packages for Debian jessie
We now provide prebuilt 3rdparty packages for jessie, allowing builds to run without --rebuild-dep.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497863340-19726-1-git-send-email-syuu@scylladb.com>
2017-06-19 13:11:20 +03:00
Takuya ASADA
3d671baf3b dist/debian: define dh_auto_configure task correctly
We mistakenly placed ./configure.py in dh_auto_build, but it should be placed in
dh_auto_configure.
This bug causes issue #2505, since we hadn't defined a dh_auto_configure task yet (running cmake on top of the directory seems to be one of the default behaviors of dh_auto_configure).

Fixes #2505

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497866068-32097-1-git-send-email-syuu@scylladb.com>
2017-06-19 12:58:59 +03:00
Duarte Nunes
7c17eba8e8 cql3/cql3_type: Don't quote tuple types
A regression introduced in 08b2ceb28e
quoted tuple type names, which, being of the form tuple<t1, ..., tn>,
would always be quoted. The quoted name would then not be found in any
internal data structures.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497813816-39956-1-git-send-email-duarte@scylladb.com>
2017-06-18 22:03:57 +02:00
Duarte Nunes
ffcd4c76c2 ide: Add CMakeLists.txt for cmake-based IDEs
This patch adds a CMakeLists.txt file for cmake-based IDEs, like
CLion.

This file assumes the existence of a build/release/gen directory,
containing generated files.

Refs #867

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170618151333.94714-1-duarte@scylladb.com>
2017-06-18 18:47:55 +03:00
Avi Kivity
58fd3dd006 Merge "cql3: Quote type name when needed" from Duarte
"This patch set ensures we quote the name of a UDT when it
contains characters that may cause parsing by the CQL parser
to fail.

Fixes #2491"

* 'cql3-quote-type/v1' of https://github.com/duarten/scylla:
  cql3/util: Make maybe_quote() take argument by const reference
  cql3/cql3_type: Quote UDT name if needed
  schema: Lift maybe_quote() into cql3/util
2017-06-18 17:59:47 +03:00
Gleb Natapov
72a4554dd9 storage_proxy: Fix compilation on older (1.55) boost
Boost 1.55 (Ubuntu 14) fails to compile because the iterator produced by
boost::adaptors::transformed(), when a std::ref to a lambda is passed to
it, does not satisfy the iterator concept: it cannot be default-constructed
because std::reference_wrapper is not default-constructible.
boost::range::min_element() never actually default-constructs it, but the
concept is checked anyway. The patch fixes this by providing an explicit
functor that is default-constructible.

Message-Id: <20170618131836.GD3944@scylladb.com>
2017-06-18 16:54:41 +03:00
Duarte Nunes
b2c5aca4cf db/schema_tables: View mutations shouldn't always include base ones
When making the schema mutations for a view update, we should only
include the base table schema mutations (in case the target node
doesn't contain them) when the view is being directly updated. When it
is being updated as a side effect of updating the base table, then
including the base schema mutations will hide the actual changes being
performed on the base.

Fixes #2500

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>
2017-06-18 16:29:59 +03:00
Avi Kivity
6e2c9ef9fb Revert "Allow reading exactly desired byte ranges and fast_forward_to"
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2).  It causes crashes
during range scans, reported by Gleb:

"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.

Backtrace:
    at /home/gleb/work/seastar/seastar/core/apply.hh:36
    rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
    range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
    at ./seastar/core/future.hh:890
    at /home/gleb/work/seastar/seastar/core/future-util.hh:119
    at /home/gleb/work/seastar/seastar/core/future-util.hh:142
2017-06-18 16:10:21 +03:00
Amnon Heiman
ff3d83bc2f node_exporter_install script update version to 0.14
Fixes #2097

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170612125724.7287-1-amnon@scylladb.com>
2017-06-18 12:25:58 +03:00
Calle Wilund
3464422051 commitlog_test: Fix reader test dropping rp handles
The test wants data in live segments to read from, so it should
not just drop the handles returned from allocate.
Message-Id: <1497344532-2616-1-git-send-email-calle@scylladb.com>
2017-06-16 22:45:46 +01:00
Etienne Kruger
be0a947596 tests: perf_simple_query: Add delete perf test
Add a performance test for deletion in addition to the existing update
and query tests. The deletion performance test is executed using the
'--delete' argument to perf_simple_query.

Fixes #2417.

Signed-off-by: Etienne Kruger <el@loadavg.io>
Message-Id: <20170615232500.26987-1-el@loadavg.io>
2017-06-16 14:51:00 +01:00
Duarte Nunes
b993124d94 cql3/util: Make maybe_quote() take argument by const reference
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-15 19:55:52 +00:00
Duarte Nunes
08b2ceb28e cql3/cql3_type: Quote UDT name if needed
This patch ensures we properly quote a UDT name, which may contain
characters like ".", which can lead the name to be interpreted as a
keyspace qualified name when parsed by the CQL parser.
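The quoting rule can be sketched roughly as follows (a hedged approximation, not the actual cql3::util::maybe_quote() implementation):

```cpp
#include <cctype>
#include <string>

// Keep plain lowercase identifiers as-is; otherwise wrap the name in double
// quotes, doubling any embedded quote characters (the assumed CQL rule).
std::string maybe_quote(const std::string& name) {
    bool plain = !name.empty() &&
                 !std::isdigit(static_cast<unsigned char>(name[0]));
    for (char c : name) {
        unsigned char u = static_cast<unsigned char>(c);
        if (!(std::islower(u) || std::isdigit(u) || c == '_')) {
            plain = false;  // e.g. a "." would parse as a keyspace qualifier
        }
    }
    if (plain) {
        return name;
    }
    std::string quoted = "\"";
    for (char c : name) {
        quoted += c;
        if (c == '"') {
            quoted += '"';  // escape embedded quotes by doubling them
        }
    }
    quoted += '"';
    return quoted;
}
```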

Fixes #2491

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-15 19:55:52 +00:00
Duarte Nunes
4886b7ed5e schema: Lift maybe_quote() into cql3/util
It's a more natural place given its current and future usages.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-15 19:55:52 +00:00
Avi Kivity
2c57ab84b2 mutation_reader: fix typo in forwarding_tag
The typo went unnoticed since the compiler picked up the global scope's
forwarding_tag.  The bug made streamed_mutation::forwarding and
mutation_reader::forwarding the same type, but fortunately there were
no type mixups due to this.
2017-06-15 20:13:01 +03:00
Avi Kivity
9cf6db3de5 Merge 2017-06-15 19:11:07 +03:00
Nadav Har'El
317d7fc253 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
2017-06-15 13:22:46 +01:00
Avi Kivity
da24bd7c34 Merge "Balance read requests according to CF's cache hit ratio" from Gleb
"During read query with CL<ALL not all replicas are contacted. It is
possible for some replicas to cache less data for some CF's (for instance
because of node restart), so the replica choice may have a big impact
on request's completion latency and on amount of work it generates in
a cluster.

This patch series keep track of per CF cached hit ratio and uses this
information to choose best replicas for a request. Nodes with lower
hit ratios are still contacted in order to populate their cache, but
less frequently."

* 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: load balance read requests according to cache hit rates
  choose extra replica for speculation in filter_for_query()
  consistency_level: drop filter_for_query_dc_local function
  database: reset node's hit rate information on connection drop
  messaging_service: connection drop notifier
  Store cluster wide cache hit statistics in CF
  messaging_service: return cache hit ratio as part of data read
  Distribute cache temperature over gossiper.
  periodically calculate avg cache hit rate between all shards
  database: introduce cache_temperature class
  Rename load_broadcaster.cc to misc_services.cc
  storage_proxy: use db::count_local_endpoints function instead open code it
2017-06-15 14:33:08 +03:00
Avi Kivity
7dffe7f933 Merge "parallel repair and more memory usage fix" from Asias
"This series reduces repair memory usage and improves repair speed."

* tag 'asias/fix-repair-2430-branch-master-v4.1' of github.com:cloudius-systems/seastar-dev:
  repair: Repair on all shards
  repair: Allow one stream plan in flight
2017-06-15 14:00:19 +03:00
Duarte Nunes
5736468a71 mutation_partition_serializer: Assume range tombstone support
Range tombstones were introduced in version 1.3 and there exists
no direct upgrade from 1.2 to vnext, so we can retire the code
enforcing backwards compatibility.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170614211654.82501-1-duarte@scylladb.com>
2017-06-15 09:54:05 +03:00
Avi Kivity
e11f1c9cc3 tests: fix partitioner_test build on gcc 5 2017-06-14 17:22:01 +03:00
Gleb Natapov
4fdfa2dbb7 gdb: Fix "scylla heapprof" command
Message-Id: <20170612084241.GF21915@scylladb.com>
2017-06-14 15:41:39 +02:00
Gleb Natapov
c7a59ab7ff do not calculate serialized size of commitlog_entry_writer before final format is known
Currently the commitlog_entry_writer constructor calculates the serialized size
before it is known whether a schema should be included in the entry. The
result is never used, since it is recalculated when the schema information is
supplied. The patch removes the needless calculation.

Message-Id: <20170614114607.GA21915@scylladb.com>
2017-06-14 14:53:07 +03:00
Gleb Natapov
a032078410 intern also tuple and user defined types
Currently, each time a UDT or tuple is parsed a new object is created. If
those objects are used to repeatedly create a container type, it will cause
a memory leak, since container types are interned but the cache lookup is
done using a pointer to the contained type (which will always be different
for UDTs and tuples). This patch also interns UDTs and tuples, so each time
the same type is parsed the same pointer is returned.
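The interning fix can be sketched as a cache keyed by a canonical textual form of the type instead of by contained-type pointers (illustrative only, not Scylla's actual code):

```cpp
#include <map>
#include <memory>
#include <string>

struct type_info {
    std::string name;
};

// Re-parsing the same canonical name returns the same shared instance, so
// pointer-keyed caches built on top of it see a stable key.
std::shared_ptr<const type_info> intern_type(const std::string& canonical_name) {
    static std::map<std::string, std::shared_ptr<const type_info>> cache;
    auto it = cache.find(canonical_name);
    if (it != cache.end()) {
        return it->second;  // cache hit: same pointer as the first parse
    }
    auto t = std::make_shared<const type_info>(type_info{canonical_name});
    cache.emplace(canonical_name, t);
    return t;
}
```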

Refs #2469
Fixes #2487

Message-Id: <20170612142942.GO21915@scylladb.com>
2017-06-14 14:41:17 +03:00
Asias He
47345078ec repair: Repair on all shards
Currently, shard zero is the coordinator of the repair. All the work of
checksumming the local node and sending the repair checksum RPC
verb is done on shard zero only. This leaves the other shards
underutilized.

With this patch, we split the ranges that need to be repaired into at least
smp::count ranges, so that ranges.size() / smp::count ranges are assigned to
each shard. For example, with 8 shards and 256 ranges, each shard
will repair 32 ranges. Each shard repairs its 32 ranges
sequentially. There will be at most 8 (smp::count) range repairs in
parallel.
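The per-shard split described above can be sketched as follows (illustrative only, not the actual repair code):

```cpp
#include <vector>

// Distribute n_ranges repair ranges across smp_count shards round-robin;
// each shard then repairs its share sequentially, giving at most smp_count
// range repairs in parallel.
std::vector<std::vector<int>> assign_ranges(int n_ranges, int smp_count) {
    std::vector<std::vector<int>> per_shard(smp_count);
    for (int i = 0; i < n_ranges; ++i) {
        per_shard[i % smp_count].push_back(i);
    }
    return per_shard;
}
```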
2017-06-14 17:52:49 +08:00
Asias He
54831a344c repair: Allow one stream plan in flight
In "repair: Use more stream_plan" (commit 2043ffc064), we
switched to streaming while doing the checksum, instead of streaming only
after the checksum phase is completed. We take a parallelism_semaphore
before we do the checksum; if there are more than sub_ranges_to_stream
(1024) ranges, we start a stream_plan and wait for the streaming to
complete (still under the parallelism_semaphore). So at most
parallelism_semaphore (100) stream_plans can run in parallel.

The parallelism_semaphore limits the parallelism of both the checksum and the
stream plan. However, it is not necessary to have the same
parallelism for both checksum and streaming, because 1) a streaming
operation itself runs in parallel (handling ranges on all shards in
parallel, sending mutations in parallel), and 2) more stream plans
(in the worst case 100) means we can write to 100 memtables at the same time
and flush 100 memtables to disk at the same time, which can take a lot of
memory.

With this patch, we only allow one stream plan in flight.
2017-06-14 17:52:36 +08:00
Calle Wilund
525730e135 database: Fix assert in truncate to handle empty memtables+sstables
If we do two truncates in a row, the second will have neither memtable
nor sstable data. Thus we will not write/remove sstables, and thus
get no resulting truncation replay position.
Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>
2017-06-14 11:21:21 +02:00
Gleb Natapov
87094849fa storage_proxy: load balance read requests according to cache hit rates
This patch makes the storage proxy choose replicas to read from based on
their cache hit rates. Replicas with higher cache hit rates will see
more requests while replicas with lower hit rates will see fewer. The local
node has a special bonus and will get more requests even if another node
has a slightly higher cache hit rate (the same goes for the local vs. the
remote DC), but after the patch it is no longer guaranteed that the
coordinator node will be chosen as a replica for the read (if the feature is enabled).
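A hedged sketch of such hit-rate-weighted replica selection (the constants and structure here are invented for illustration, not Scylla's actual algorithm):

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct replica {
    double hit_rate;  // 0.0 (cold cache) .. 1.0 (hot cache)
    bool local;       // local node / local DC gets a bonus
};

std::size_t pick_replica(const std::vector<replica>& candidates,
                         std::mt19937& rng) {
    std::vector<double> weights;
    weights.reserve(candidates.size());
    for (const auto& r : candidates) {
        // Cold replicas still see some reads, so their caches get populated.
        double w = 0.01 + r.hit_rate;
        if (r.local) {
            w *= 1.5;  // illustrative local-node bonus
        }
        weights.push_back(w);
    }
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return dist(rng);
}
```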
2017-06-13 09:57:14 +03:00
Gleb Natapov
bc8aa1b4ee choose extra replica for speculation in filter_for_query()
Currently the storage proxy has to loop over the remaining replicas to search
for a suitable extra replica, but doing it in filter_for_query() is
extremely easy, so do it there instead.
2017-06-13 09:57:14 +03:00
Gleb Natapov
8437ea3b99 consistency_level: drop filter_for_query_dc_local function
Merge filter_for_query_dc_local() functionality into filter_for_query().
This is more efficient since filter_for_query_dc_local() partitions
endpoints into 'local' and 'remote' set but filter_for_query() already
does it for CL=LOCAL so for such queries we needlessly do it twice.
2017-06-13 09:57:14 +03:00
Gleb Natapov
ca812a8ea0 database: reset node's hit rate information on connection drop
A node may go down, so after it restarts its cache hit rate info will be
incorrect and it can be overwhelmed with traffic until a new,
up-to-date cache hit rate arrives. Solve this by dropping the node's
information on connection reset; this is more accurate than relying on
gossip, which may be slow and miss a node's reboot.
2017-06-13 09:57:14 +03:00
Gleb Natapov
23c51b3e57 messaging_service: connection drop notifier
Allow registering callbacks that will be called when a connection is going
down.
2017-06-13 09:57:14 +03:00
Gleb Natapov
0e4d5bc2f3 Store cluster wide cache hit statistics in CF 2017-06-13 09:57:14 +03:00
Gleb Natapov
69c5526301 messaging_service: return cache hit ratio as part of data read 2017-06-13 09:57:14 +03:00
Gleb Natapov
8ca1432b04 Distribute cache temperature over gossiper.
When a node starts it does not have any information about the cache temperature
of other nodes in the cluster, and it is hard (if not impossible) to make
the right guess. During cluster startup all nodes have cold caches, so there
is no point in redirecting reads to other nodes even though the local cache is
cold; but if only that one node restarted, then the other nodes have populated
caches and reads should be redirected.

The node will get up-to-date information about other nodes' caches,
but only after receiving the first reply; until then it does not have the
information to make the right decisions, which may cause unwanted spikes
immediately after restart. Having the cache temperature in the gossiper helps
solve the problem.
2017-06-13 09:57:14 +03:00
Gleb Natapov
991ec4a16c periodically calculate avg cache hit rate between all shards
This patch adds a new class, cache_hitrate_calculator, whose responsibility
is to periodically calculate the average cache hit rate across all shards
for each CF.
2017-06-13 09:57:14 +03:00
Gleb Natapov
fab18c0c5a database: introduce cache_temperature class
The class will represent cache hit rate for a column family and is
serializable for use with RPC.
2017-06-13 09:57:14 +03:00
Gleb Natapov
f59ecc2687 Rename load_broadcaster.cc to misc_services.cc
load_broadcaster is a very small class; move it into a generic file so that
we can put other small services there to save on compilation time.
2017-06-13 09:57:14 +03:00
Gleb Natapov
7bcf4c690f storage_proxy: use db::count_local_endpoints function instead of open-coding it 2017-06-13 09:57:14 +03:00
Gleb Natapov
21197981a5 Fix use after free in nonwrapping_range::intersection
end_bound() returns a temporary object (end_bound_ref), so it cannot be
taken by reference here and used later. Copy it instead.

Message-Id: <20170612132328.GJ21915@scylladb.com>
2017-06-12 15:34:36 +01:00
Tomasz Grabiec
20095d7ed6 gdb: Fix "scylla column_families" command
Apparently some GDB versions (7.11.1-86.fc24) don't parse double '>'
in a type name, so this:

 std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family>>

should be this:

 std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family> >

Message-Id: <1497256644-4335-1-git-send-email-tgrabiec@scylladb.com>
2017-06-12 11:39:50 +03:00
Tomasz Grabiec
9e7a040f0c gdb: Fix "scylla keyspaces" command
The problem is that 'key' is a 'bytes' object now, which doesn't have __format__.

Fixes the following error:

  Traceback (most recent call last):
    File "~/src/scylla/scylla-gdb.py", line 184, in invoke
  TypeError: non-empty format string passed to object.__format__
  Error occurred in Python command: non-empty format string passed to object.__format__

Message-Id: <1497253433-374-2-git-send-email-tgrabiec@scylladb.com>
2017-06-12 11:22:59 +03:00
Tomasz Grabiec
230683bdfa gdb: Add missing seastar namespace qualifier
Message-Id: <1497253433-374-1-git-send-email-tgrabiec@scylladb.com>
2017-06-12 11:22:53 +03:00
Asias He
2bcb368a13 repair: Fix range use after free
Capture it by value.

scylla:  [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed)
scylla:  [shard 0] repair - Failed sync of range ==<runtime_exception
(runtime error: Invalid token. Should have size 8, has size 0#012)>: streaming::stream_exception (Stream failed)

Message-Id: <7fda4432e54365f64b556e7e4c26e36d3a9bb1b7.1497238229.git.asias@scylladb.com>
2017-06-12 11:00:57 +03:00
Avi Kivity
419ad9d6cb Merge "repair memory usage fix" from Asias
"This series switches repair to use more stream plans to stream the mismatched
sub ranges and use a range generator to produce sub ranges.

Tests show that no huge amount of memory is used for repair with a large data set.

In addition, we now log a progress report of how many ranges have been processed.

   Jun 06 14:18:22  [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
   Jun 06 14:19:55  [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]

Fixes #2430."

* tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev:
  repair: Remove unused sub_ranges_max
  repair: Reduce parallelism in repair_ranges
  repair: Tweak the log a bit
  repair: Use more stream_plan
  repair: iterator over subranges instead of list
2017-06-08 14:19:08 +03:00
Tomasz Grabiec
9b7f170121 gdb: Improve error message
Message-Id: <1496849069-21750-1-git-send-email-tgrabiec@scylladb.com>
2017-06-07 18:26:31 +03:00
Tomasz Grabiec
0dfe1ad431 Merge "Relax replay position ordering requirement" from Calle
From seastar-dev.git calle/concorde

Normally, we require that all mutations applied to a column family
have replay positions higher than all previously flushed.
The main reason for this is to be able to determine when to drop a
commit log segment, i.e. determine that all replay positions less
than X are now in sstables.

This patch series, small as it is, relaxes this by instead of just
keeping track of high rp applied, keep a reference count to each
segment per CF in memtables, and on flush, release this very count.

The only case where we need to keep a water mark for RP is then
for table truncation, for which we simply say that the highest RP
applied to the column family is the lowest allowed henceforth,
and use the old reordering logic for this instead. I.e. very rare.

There is of course one (big?) downside to all this, and this is
"normal" commit log replay on startup after crash/shutdown.
Since we relax RP ordering, we cannot use RP:s in sstables as
low marks for replay start, since it is now allowed to exist
non-persisted mutations in commitlog with lower RP:s than
previously flushed. I.e. we more or less always have to replay
the full commit log.
It is worth noting though that due to compaction and the non-
propagation of RP marks to new sstables, we end up often
doing this anyway, so it is hard to say how much of a regression
this is.
2017-06-07 14:51:28 +02:00
Calle Wilund
18806989b6 database: remove hard rp ordering requirement, set low rp mark on
truncate

With commitlog keeping use-count per CF id, we can ease the ordering
restriction on replay positiontion. Previously we required that all
added mutations have a position > previously flushed. However, if
we accept that replay must now be all data, by keeping track instead
per CF of highest RP ever entered, we can instead just set a
low mark on truncation, since this is the only remaining hard
RP divider.
2017-06-07 12:07:01 +00:00
Calle Wilund
d9b8c79eb9 commitlog_replayer: Ignore sstable replay positions
With relaxed position ordering, we cannot use existing sstables as a
watermark for replay. We must replay everything above the truncation
marks.
2017-06-07 12:07:01 +00:00
Calle Wilund
2913241df1 memtable/commitlog: Change bookkeeping to track individual segments
Use a per-CF-id reference count instead, and use handles as the result of
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction.

Note: this does _not_ remove the replay position ordering requirement
for mutations. It just removes it as a means to track segment liveness.
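The handle scheme can be sketched as an RAII type (illustrative only, not the actual commitlog API):

```cpp
// Each handle holds one reference to its segment; releasing or destroying
// the handle drops the count, and a segment whose count reaches zero is
// free to be recycled.
struct segment {
    int refs = 0;
};

class rp_handle {
    segment* _s;
public:
    explicit rp_handle(segment& s) : _s(&s) { ++_s->refs; }
    rp_handle(rp_handle&& other) noexcept : _s(other._s) { other._s = nullptr; }
    rp_handle(const rp_handle&) = delete;
    rp_handle& operator=(const rp_handle&) = delete;
    void release() {
        if (_s) {
            --_s->refs;
            _s = nullptr;
        }
    }
    ~rp_handle() { release(); }  // dropping the handle releases the segment
};
```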
2017-06-07 12:07:01 +00:00
Calle Wilund
0c598e5645 commitlog_test: Fix test_commitlog_delete_when_over_disk_limit
The test should:
a.) wait for the flush semaphore
b.) only compare segment sets between start and end, not start,
    end and in between; i.e., the test sort of assumed we started
    with < 2 (or so) segments, which is not always the case (timing).
Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
2017-06-07 12:44:02 +03:00
Avi Kivity
07ff3f68e0 Merge seastar upstream
* seastar b1f69cc...621b7ed (8):
  > net/api: Remove outdated comments
  > Merge "Fixes for Clang 5" from Paweł
  > Merge "Metrics: Safely transfer metadata between shared" from Amnon
  > posix: add missing #include
  > build: add cmake dependency
  > build: add -Wno-maybe-uninitialized
  > rpc: handle messages larger than memory limit
       (Fixes #2453)
  > doxygen: enable macro expansion
2017-06-07 11:04:56 +03:00
Takuya ASADA
7fe63c539a dist/debian: install gdebi when it does not exist
Since we started to use gdebi to install the build-dep metapackage generated by
mk-build-deps, we need to install gdebi in build_deb.sh too.

Fixes #2451

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496819209-30318-1-git-send-email-syuu@scylladb.com>
2017-06-07 10:24:22 +03:00
Asias He
3fdb8a3d3f repair: Remove unused sub_ranges_max
With the sub range iterator, it is not used anymore. Drop it.
2017-06-07 08:52:45 +08:00
Asias He
ca00c10b35 repair: Reduce parallelism in repair_ranges
We currently repair all the ranges in parallel.

1) All the ranges contend for parallelism_semaphore. Instead of
processing multiple ranges in parallel and calculating the sub-ranges
(which take memory) for each range in parallel, we can handle the ranges
one by one.

We still have enough parallelism because the checksums are calculated on
all the shards.

2) If we handle the ranges one by one and the repair fails for some reason,
we can log which range repairs were successful and ignore them next time.
If we start the ranges in parallel, there is a high chance that no single
range completes, because all the ranges are ongoing.

Refs #1912
2017-06-07 08:50:57 +08:00
Asias He
3852665156 repair: Tweak the log a bit
- Count n out of m ranges the repair is running on (a kind of progress report)
- Make the 'Found differing range' log debug-level, because there can be
  millions of such entries
- Print the failed ranges
2017-06-07 08:50:57 +08:00
Asias He
2043ffc064 repair: Use more stream_plan
In the very beginning, we used a stream_plan for each checksum range.
Later, we changed to a single stream_plan for all the checksum
ranges. That pushes memory pressure onto streaming, e.g., millions of ranges
in a vector to send over RPC.

To fix this, we do checksum and streaming in parallel, limiting the number of
checksum ranges stored in memory.

Fixes #2430
2017-06-07 08:50:56 +08:00
Nadav Har'El
b3ff37e67f repair: iterator over subranges instead of list
When starting a repair, we divided the large token ranges (vnodes) into small
subranges of a desired length (around 100 partitions), and built a huge list
of those subranges - to iterate over them later and compare checksums of
those chunks.

However, building this list up front is completely unnecessary and wastes
a lot of memory: in a test with 1 TB of data, as much as 3 gigabytes was
spent on this list. Instead, what we do in this patch is find the next
chunk with a DFS-like splitting algorithm, using only the token range
midpoint() function (as before). The amount of memory needed for this is
O(log N), instead of O(N) in the previous implementation.

Refs #2430.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2017-06-07 08:50:56 +08:00
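The O(log N) splitting idea above can be sketched outside Scylla with plain integer tokens. This is an illustrative sketch only: `range`, `subrange_iterator`, and the size threshold are hypothetical names, and the real code splits token ranges via the midpoint() function rather than integer arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Emit sub-ranges lazily, DFS-style: keep only the stack of ranges still
// to be split (O(log N) memory) instead of materializing every sub-range
// up front (O(N) memory).
struct range { int64_t start, end; };  // half-open [start, end)

class subrange_iterator {
    std::vector<range> _stack;  // pending ranges; depth is O(log N)
    int64_t _max_size;          // stop splitting below this size
public:
    subrange_iterator(range whole, int64_t max_size) : _max_size(max_size) {
        _stack.push_back(whole);
    }
    std::optional<range> next() {
        while (!_stack.empty()) {
            range r = _stack.back();
            _stack.pop_back();
            if (r.end - r.start <= _max_size) {
                return r;  // small enough: emit as a leaf
            }
            int64_t mid = r.start + (r.end - r.start) / 2;  // midpoint()
            _stack.push_back({mid, r.end});    // right half, visited later
            _stack.push_back({r.start, mid});  // left half, visited first
        }
        return std::nullopt;
    }
};
```

Calling next() repeatedly yields contiguous, ascending sub-ranges that cover the whole range, while never holding more than a logarithmic number of pending ranges in memory.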
Raphael S. Carvalho
0ca1e5cca3 sstables: fix report of disk space used by bloom filter
After a change in boot, read_filter is called by the distributed
loader, so its update to _filter_file_size is lost; the load variant
which receives foreign components must now do it. We were also
not updating it for newly created sstables.

Fixes #2449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170606151129.5477-1-raphaelsc@scylladb.com>
2017-06-06 18:20:28 +03:00
Takuya ASADA
a4c392c113 dist/debian: use gdebi instead of mk-build-deps -i
At least on Debian 8, mk-build-deps -i silently finishes with return code 0
even when it fails to install dependencies.
To prevent this, we should manually install the metapackage generated by
mk-build-deps using gdebi.

Fixes #2445

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496737502-10737-2-git-send-email-syuu@scylladb.com>
2017-06-06 11:37:34 +03:00
Takuya ASADA
5608842e96 dist/debian/dep: install texlive from jessie-backports to prevent gdb build fail on jessie
Installing openjdk-8-jre-headless from jessie-backports breaks texlive from
the jessie main repo.
It causes an 'Unmet build dependencies' error when building the gdb package.
To prevent this, force installing texlive from jessie-backports before
starting to build gdb.

Fixes #2444

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496737502-10737-1-git-send-email-syuu@scylladb.com>
2017-06-06 11:37:33 +03:00
Paweł Dziepak
b2b78158f6 mutation_partition: restore formatting
No functional change.

Message-Id: <20170526104119.22075-2-pdziepak@scylladb.com>
2017-06-06 11:20:57 +03:00
Gleb Natapov
f5679e0416 database: remove remnants of no longer existing db::serializer.
Message-Id: <20170604100552.GD8248@scylladb.com>
2017-06-04 13:07:17 +03:00
Raphael S. Carvalho
dcbeb42f67 sstables: explicitly close file in fsync_directory
Otherwise, close is called in the reactor thread when destroying the
file object.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170602024346.7803-1-raphaelsc@scylladb.com>
2017-06-02 21:09:58 +03:00
Pekka Enberg
a6dc21615b Merge "Fixes to thrift/server" from Duarte
"This series fixes some issues with the thrift_server, namely
ensuring that streams and sockets are properly closed.

Fixes #499
Fixes #2437"

* 'thrift-server-fixes/v1' of github.com:duarten/scylla:
  thrift/server: Close connections when stopping server
  thrift/server: Move connection class to header
  thrift/server: Shutdown connection
  thrift/server: Close output_stream when connection is done
2017-06-02 08:15:22 +03:00
Duarte Nunes
c525331e60 thrift/server: Close connections when stopping server
Fixes #499

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-02 00:15:20 +02:00
Duarte Nunes
315c69b830 thrift/server: Move connection class to header
No changes in functionality. Required for an upcoming patch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-02 00:15:20 +02:00
Duarte Nunes
22fafd5034 thrift/server: Shutdown connection
This patch adds the shutdown() function to thrift_server::connection,
and calls it after a connection is done.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-02 00:15:20 +02:00
Duarte Nunes
0a5ec97b7f thrift/server: Close output_stream when connection is done
Fixes #2437

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-02 00:15:20 +02:00
Jesse Haber-Kucharsky
376c661823 Eliminate duplicate definition of sstable column mask values
The column mask identifies the kind of atom in a row in an sstable. Two
definitions of these values were present: one as a C-style enumeration and one
as a C++11-style enumeration.

The C++11-style definition is used elsewhere in `sstables.cc`. It also offers
additional type-safety.

Therefore, this commit removes the inlined C-style enumeration.

Fixes #2214.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <c525b4ae7fad3b54480e133921aa4ffe0dd5d9ce.1496352711.git.jhaberku@scylladb.com>
2017-06-02 00:06:31 +02:00
Michał Matczuk
04da4dbf83 docker support for api-address
Message-Id: <1b5fb2bbba1b879aae825094a0f1b77c865be139.1496318996.git.michal@scylladb.com>
2017-06-01 15:31:45 +03:00
Takuya ASADA
22339bba44 dist/debian: depends to collectd-core instead of collectd, to reduce dependencies
To reduce unwanted dependencies, we need to replace the dependency on
collectd with collectd-core.
However, collectd provides /etc/collectd/collectd.conf, so without that
package we need to install the configuration file ourselves.
So install the file in the .postinst script.

Fixes #2426

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496231743-7828-1-git-send-email-syuu@scylladb.com>
2017-06-01 13:20:37 +03:00
Takuya ASADA
909a9ebf97 dist/debian: provide prebuilt 3rdparty packages for Ubuntu 16.04
Currently we only offer a 14.04 prebuilt, but we have a 16.04 one on S3, so use it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496301544-15251-1-git-send-email-syuu@scylladb.com>
2017-06-01 10:37:52 +03:00
Duarte Nunes
15a62701f2 test.py: Ensure view_schema_test runs with only one cpu
In the write path we don't wait for view updates, as they happen in
the background.

The view schema tests can fail when running with more than one cpu due
to this inherent race condition: the write to the base table returns
while the view updates are still being processed, after which we issue
a query to the view table. The shard handling the view data is not
guaranteed to finish processing the mutation before handling the query.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170531165726.9212-1-duarte@scylladb.com>
2017-05-31 19:17:51 +01:00
Raphael S. Carvalho
b8091799ca lcs: fix off-by-one comparison
The invariant is broken if the size of the L0 candidates is equal to
the max sstable size, because the overlapping L1 sstables will not be
added to the compacting set, and they will be promoted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170530143708.3775-1-raphaelsc@scylladb.com>
2017-05-30 17:39:51 +03:00
Avi Kivity
15af6acc8b dist: redirect stdout/stderr to the journal on systemd systems
Fixes #2408.

Message-Id: <20170524080729.10085-1-avi@scylladb.com>
2017-05-30 08:47:17 +03:00
Avi Kivity
1c84aae0c1 Merge seastar upstream
* seastar 68dbf60...b1f69cc (10):
  > metrics: fix namespace in documentation
  > add special logger for memory allocation failures
  > xen: remove
  > Merge "sanity checks, fixes and extensions in the perftune.py" from Vlad
  > tutorial: more "seastar" namespace
  > execution_stages: fix build errors in comments
  > tutorial: more "seastar" namespace additions
  > tutorial: more minor changes
  > tutorial: minor changes to the introduction
  > tutorial: start overhauling the examples to use "seastar" namespace
2017-05-29 19:02:02 +03:00
Calle Wilund
3512ed4596 storage_service/config: Add "native_transport_port_ssl" option
Mimic Origin behaviour: iff TLS encryption is enabled, and
native_transport_port_ssl is set and different from
native_transport_port, start both TLS and non-TLS
listeners.

Message-Id: <1496061600-24454-2-git-send-email-calle@scylladb.com>
2017-05-29 15:53:56 +03:00
Calle Wilund
1b387a1f56 cql server: Allow multiple listeners on different ports
Need to separate the "notifiers" per port/address and manage
their life spans accordingly.

Message-Id: <1496061600-24454-1-git-send-email-calle@scylladb.com>
2017-05-29 15:53:50 +03:00
Avi Kivity
ef98afa748 build: make swagger generated code depend on the code generator
Fixes failures when moving between branches due to the seastar namespace
change.
Message-Id: <20170528100052.29131-1-avi@scylladb.com>
2017-05-29 13:17:42 +02:00
Avi Kivity
8979d7abf0 Deprecate non-murmur3 partitioners
Removing non-murmur3 partitioners will allow us to reduce memory footprint
and speed up some code by utilizing the properties of the murmur3 partitioner
token.
Message-Id: <20170528172536.16079-1-avi@scylladb.com>
2017-05-28 19:35:56 +02:00
Takuya ASADA
36ccbc1539 dist/ami: follow rpm output dir path change
CentOS mock support in build_rpm.sh changed the rpm output directory, so follow it.

Fixes #2406

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1495573343-13912-1-git-send-email-syuu@scylladb.com>
2017-05-28 13:02:36 +03:00
Amos Kong
f655639e5a scylla_setup: fix deadloop in inputting invalid option
example: # scylla_setup --invalid-opt

Fixes #2305

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <9a4f631b126d8eaaae479fa99137db7a61a7c869.1493135357.git.amos@scylladb.com>
2017-05-28 13:02:10 +03:00
Takuya ASADA
bdec38d23c dist/common/scripts/scylla_setup: skip SELinux setup when it's already disabled
It doesn't make sense to ask "Do you want to disable SELinux?" when SELinux is
already disabled, so skip the whole question.

Fixes #2411

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1495652423-20806-1-git-send-email-syuu@scylladb.com>
2017-05-28 13:00:10 +03:00
Avi Kivity
c4faa1e202 Merge "tracing: tracing spans and time series helper table" from Vlad
"
 - Introduce a parent span IP and span ID paradigm.
 - Introduce time series tables to simplify traces processing.
 - Add the "How to get traces?" chapter to the tracing.md.
"

* 'tracing-span-ids-and-time-series-helpers-v4' of github.com:cloudius-systems/seastar-dev:
  docs: tracing.md: add a "how to get traces" chapter
  tracing::trace_keyspace_helper: introduce a time series helper tables
  tracing: cleanup: use nullptr instead of trace_state_ptr()
  tracing: introduce a span ID and parent span ID
2017-05-28 12:01:35 +03:00
Paweł Dziepak
d9dd798c4f counter_write_query: avoid use-after-free on partition range
Message-Id: <20170526104119.22075-1-pdziepak@scylladb.com>
2017-05-28 11:41:30 +03:00
Raphael S. Carvalho
41137c7fb6 compaction: use sstable::bytes_on_disk for calculating start and end size
Currently, start and end size of compaction are calculated using the
uncompressed size of data component. bytes_on_disk() returns size
used by all components.

NOTE: start and end size are written to compaction history, so users
who monitor it should be aware of this change.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525212129.6758-1-raphaelsc@scylladb.com>
2017-05-28 11:33:24 +03:00
Raphael S. Carvalho
3b5ad23532 db: fix computation of live disk usage stat after compaction
sstable::data_size() is used by rebuild_statistics(), but it only
returns the uncompressed data size, while the function called by it
expects the actual disk space used by all components.
Boot uses add_sstable(), which correctly updates the stat with
sstable::bytes_on_disk(). That's what needs to be used by
rebuild_statistics() too.

Fixes #1592

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
2017-05-28 10:38:32 +03:00
Vlad Zolotarov
1ae40ee91a utils::timestamped_val: fix the touch() comment
The current comment was written when the function was not yet a
timestamped_val member.

Let's adjust it to the current code.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1495555659-10881-1-git-send-email-vladz@scylladb.com>
2017-05-26 19:26:56 +03:00
Vlad Zolotarov
0619c2cb71 utils::serialization: remove not used deserialization_xxx() functions
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1495556124-16672-1-git-send-email-vladz@scylladb.com>
2017-05-26 19:26:20 +03:00
Tomasz Grabiec
de70d942a9 memtable: Decouple from sstable
We can make the dependency more abstract by using mutation_source
instead of an sstable.

Will be useful in some stress tests which want to avoid the disk, but
is also good for the sake of decoupling.
Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>
2017-05-25 19:30:21 +03:00
Tomasz Grabiec
f3a6d94398 sstables: Introduce sstable::as_mutation_source()
Adaptors extracted from existing testing code.
Message-Id: <1495729508-30081-1-git-send-email-tgrabiec@scylladb.com>
2017-05-25 19:30:20 +03:00
Glauber Costa
3d3afd8f11 node_exporter: add interrupt information
Information about interrupts is invaluable when debugging performance
problems with Scylla in the field. node_exporter doesn't include that in
the list of collectors enabled by default, so we suggest we do it here.

The list that goes into this file is the default list as shown by
node_exporter, with "interrupts" added to it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170525153008.26720-1-glauber@scylladb.com>
2017-05-25 19:11:30 +03:00
Avi Kivity
a8a82433e5 Merge "improve lcs promotion decision with compression enabled" from Raphael
"lcs' current behavior will make it hard to reduce number of
levels by increasing sstable size because it uses uncompressed
length when deciding whether or not a level needs promotion.
Demotion process is slower because of that."

* 'lcs_promotion_improvement_2' of github.com:raphaelsc/scylla:
  lcs: use sstable compressed length when computing level size
  sstables: introduce sstable::ondisk_data_size
  sstables: unconditionally set sstable data file size
2017-05-25 12:37:24 +03:00
Tomasz Grabiec
848ca035a2 gdb: Adjust scylla-gdb.py for the namespace change in seastar
Message-Id: <1495700444-29269-1-git-send-email-tgrabiec@scylladb.com>
2017-05-25 11:52:42 +03:00
Raphael S. Carvalho
b7e1575ad4 db: remove partial sstable created by memtable flush which failed
Partial sstable files aren't being removed after each failed attempt
to flush a memtable, which happens periodically. If the cause of the
failure is ENOSPC, memtable flush will be attempted forever, and
as a result, the column family may be left with a huge number of partial
files, which will overwhelm a subsequent boot when removing temporary
TOCs. In the past, this led to OOM because removal of temporary TOCs
took place in parallel.

Fixes #2407.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
2017-05-25 11:50:02 +03:00
Raphael S. Carvalho
0a105473df lcs: use sstable compressed length when computing level size
lcs uses uncompressed length of sstables when computing size of a
level, and that may result in unnecessary promotion when the goal
is to reduce the number of levels after an increase in sstable
size, for example.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-24 20:10:02 -03:00
Raphael S. Carvalho
b2dc0b2db5 sstables: introduce sstable::ondisk_data_size
This new function is an alternative to data_size() and returns the
on-disk size of the data component; data_size() returns the uncompressed
size of the data component if compression is enabled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-24 20:10:00 -03:00
Raphael S. Carvalho
e02bf6da58 sstables: unconditionally set sstable data file size
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-24 20:09:58 -03:00
Piotr Jastrzebski
6528f3a963 Make sure mutation_reader for sstables can be fast-forwarded
Fixes #2145.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec: Extracted from a series, fixed title]
Message-Id: <1495639745-19387-1-git-send-email-tgrabiec@scylladb.com>
2017-05-24 16:36:24 +01:00
Tomasz Grabiec
6cf2841654 mvcc: Extract partition_snapshot_reader to separate header
Right now the whole world includes it transitively, which results in
painful recompiles when the code changes.

Relax dependencies.
Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>
2017-05-24 12:13:15 +01:00
Asias He
f792c78c96 streaming: Do not abort session too early in idle detection
Streaming usually takes a long time to complete. Aborting it on a
false-positive idle detection can be very wasteful.

Increase the abort timeout from 10 minutes to a very large timeout, 300
minutes. A genuinely idle session will still be aborted eventually if
other mechanisms (e.g., the streaming manager's gossip callbacks for the
on_remove and on_restart events) do not abort it first.

Fixes #2197

Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>
2017-05-24 12:29:50 +03:00
Paweł Dziepak
3b9c0a6ae2 Merge "loading_cache: fix the known complexity issue in the shrink() method" from Vlad
Use the boost::intrusive containers in order to achieve a O(1) complexity
for both "LRU list" update and to minimize the memory overhead in the hash
table item to "LRU list" item connection:
   - Make the timestamped_val be both a bi::list and a bi::unordered_set
     item.
   - Make a bi::unordered_set be a cache backend instead of the
     std::unordered_map.

As a result, dropping k LRU items becomes an O(k) operation instead of
O(N log N), where N is the total number of cached items:
   - Every time a value is read - move it to the front of the "LRU list"
     (O(1)).
   - When we need to remove k LRU items:
      - Repeat k times:
         - Take an element from the back of the "LRU list". (O(1)).
         - Remove it from the bi::unordered_set and dispose. (O(1)).
           We use an auto-unlink configuration for bi::list, therefore
           disposing an item is going to auto unlink it from the list.

* 'permissions_cache_move_to_intrusive-v1' of github.com:scylladb/seastar-dev:
  utils::loading_cache: cleanup
  utils/loading_cache.hh: use intrusive list to store the lru entry
  utils::loading_cache: implement automatic rehashing
  utils::loading_cache: make the underlying map to be an intrusive unordered_set
2017-05-23 16:18:16 +01:00
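The complexity claim above can be illustrated with a simplified, non-intrusive analogue. The actual patch uses boost::intrusive containers so a single object links into both the hash table and the LRU list; this `lru_cache` is a hypothetical stand-in built from std::list plus a map of list iterators, which gives the same asymptotics: O(1) per read, O(k) to drop the k least-recently-read entries.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Simplified LRU cache: the list front holds the most recently read
// entry, and the map gives O(1) access to each entry's list position.
template <typename K, typename V>
class lru_cache {
    std::list<std::pair<K, V>> _lru;  // front = most recently read
    std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> _map;
public:
    void put(const K& k, V v) {       // assumes k is not already present
        _lru.emplace_front(k, std::move(v));
        _map[k] = _lru.begin();
    }
    V* get(const K& k) {              // O(1): splice the entry to the front
        auto it = _map.find(k);
        if (it == _map.end()) {
            return nullptr;
        }
        _lru.splice(_lru.begin(), _lru, it->second);
        return &it->second->second;
    }
    void shrink(std::size_t n) {      // O(k): pop LRU entries from the back
        while (_lru.size() > n) {
            _map.erase(_lru.back().first);
            _lru.pop_back();
        }
    }
    std::size_t size() const { return _lru.size(); }
};
```

Every read splices the entry to the head of the list in constant time, so shrink() can simply evict from the tail, mirroring the O(1)/O(k) behaviour the merge describes.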
Tomasz Grabiec
cec8d7f38c gdb: Fix error about gdb.Value not being convertible to int by %x format
Message-Id: <1495538843-27777-1-git-send-email-tgrabiec@scylladb.com>
2017-05-23 15:38:58 +03:00
Avi Kivity
fd0e1eb1e2 Merge "Fixes for mutation algebra" from Tomasz
"Enforces commutativity of addition:

 m1 + m2 == m2 + m1

and consistency of difference and addition with equality:

 m1 + (m2 - m1) == m1 + m2"

* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
  mutation: Make compare_*_for_merge() consistent with equals()
  tests: mutation: Improve assertion failure message
  tests: Use default equality in test_mutation_diff_with_random_generator
  mutation: Make counter cell difference consistent with apply
  tests: range_tombstone_list_test: Improve error message
  tests: range_tombstone_list: Check adjacent range merging
  range_tombstone_list: Merge adjacent range tombstones in apply()
  tests: mutation: Check commutativity of mutation addition
  range_tombstone_list: Avoid violating set invariant
  range_tombstone_list: Make tombstone merging commutative
  range_tombstone_list: Add erase() operation to the reverter
  range_tombstone_list: Make all undo operations ordered relative to each other
  utils: Extract to_boost_visitor() to a separate header
  allocating_strategy: Introduce alloc_strategy_unique_ptr<>
2017-05-23 15:20:38 +03:00
Tomasz Grabiec
804f46f684 mutation: Make compare_*_for_merge() consistent with equals()
equals() considers expiring cells to be different from non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
2017-05-23 13:35:03 +02:00
Tomasz Grabiec
c1475a8eb2 tests: mutation: Improve assertion failure message 2017-05-23 13:16:03 +02:00
Tomasz Grabiec
d15880b3b7 tests: Use default equality in test_mutation_diff_with_random_generator 2017-05-23 13:16:03 +02:00
Tomasz Grabiec
9dbae279ad mutation: Make counter cell difference consistent with apply
The case when both cells are dead was not handled properly, the diff
was always empty, whereas the cell with higher timestamp should win.

Caused test_mutation_diff_with_random_generator to fail.
2017-05-23 13:16:03 +02:00
Tomasz Grabiec
951da421db tests: range_tombstone_list_test: Improve error message 2017-05-23 13:16:03 +02:00
Tomasz Grabiec
bee40b4628 tests: range_tombstone_list: Check adjacent range merging 2017-05-23 13:16:03 +02:00
Tomasz Grabiec
3c509308ab range_tombstone_list: Merge adjacent range tombstones in apply()
Needed for equivalence to work correctly with difference and addition:

  m1 + (m2 - m1) = m1 + m2

Fixes #2158.
2017-05-23 13:16:03 +02:00
Tomasz Grabiec
ef4c7c458c tests: mutation: Check commutativity of mutation addition 2017-05-23 12:11:12 +02:00
Tomasz Grabiec
1dea251ca2 range_tombstone_list: Avoid violating set invariant
The code was inserting an entry with the same key as its successor,
and only later adjusting the key of the old entry. This is violating
set's invariant of unique keys, and insertion may cause rebalancing. I
don't know if this violation actually causes problems currently, but
it's safer not to.

Fix by first updating the existing entry and then inserting the new
one.
2017-05-23 12:11:12 +02:00
Tomasz Grabiec
a2a22e5f00 range_tombstone_list: Make tombstone merging commutative
Example of non-commutative case:

 a = [1, 5]@t2
 b = {[2, 3]@t1, [4, 5]@t1}

 a + b = [1, 5]@t2
 b + a = [1, 4)@t2, [4, 5]@t2

After this patch, both will yield [1, 5]@t2.

The patch also changes the logic so that overlaps of tombstones with
equal timestamps are handled symmetrically. They are now merged
instead of being split on either boundary.

Refs #2158.
2017-05-23 12:11:12 +02:00
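The representational side of this can be sketched with a toy normalization pass. This is illustrative only, not the actual range_tombstone_list code; half-open integer ranges stand in for clustering ranges, and `rt`/`normalize` are hypothetical names. In the example above, a + b = [1, 5]@t2 while b + a = {[1, 4)@t2, [4, 5]@t2}: both cover every point identically, but the lists only compare equal once adjacent equal-timestamp ranges are merged.

```cpp
#include <cstdint>
#include <vector>

// A tombstone covering the half-open range [start, end) at timestamp ts.
struct rt { int64_t start, end, ts; };

// Merge adjacent ranges that share a timestamp, producing a canonical
// representation so that equivalent lists compare equal.
static std::vector<rt> normalize(std::vector<rt> v) {
    std::vector<rt> out;
    for (auto& t : v) {
        if (!out.empty() && out.back().end == t.start && out.back().ts == t.ts) {
            out.back().end = t.end;  // merge adjacent, same-timestamp ranges
        } else {
            out.push_back(t);
        }
    }
    return out;
}
```

After normalization, the split shape produced by b + a collapses back into the single range produced by a + b, which is the invariant the apply() change restores.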
Tomasz Grabiec
c4dac7c80f range_tombstone_list: Add erase() operation to the reverter 2017-05-23 12:11:12 +02:00
Tomasz Grabiec
935709cddc range_tombstone_list: Make all undo operations ordered relative to each other
Later operation may depend on the result of previous operation. Same dependency
is present when reverting the operations.

Fixes assertion failure in update reverter.
2017-05-23 12:11:12 +02:00
Vlad Zolotarov
2d4d198fb9 utils::loading_cache: cleanup
- Remove "_" at the beginning of the type names.
   - s/Pred/EqualPred/

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-22 23:02:18 -04:00
Vlad Zolotarov
fd59a548c0 utils/loading_cache.hh: use intrusive list to store the lru entry
Fix the shrink() O(n log n) complexity issue by constantly pushing the corresponding intrusive
list entry to the head of the list every time the values are read.

This will keep the list ordered by the last read time from the most recently read
to the least recently read entry.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-22 23:00:18 -04:00
Vlad Zolotarov
0c4e9efce7 utils::loading_cache: implement automatic rehashing
- Start the cache with 256 buckets - the minimum number of buckets.
   - Limit the maximal number of buckets by 1M buckets.
   - Keep the load factor between 0.25 and 1.0 as long as the number of buckets is
     between the minimum and the maximum values mentioned above.
   - Grow and shrink the hash every "refresh" period if needed.
   - Enable bi::power_2_buckets and bi::compare_hash bi::unordered_set options.
   - Enable bi::unordered_set_base_hook's bi::store_hash option.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-22 22:57:44 -04:00
Vlad Zolotarov
2be3596a4f utils::loading_cache: make the underlying map to be an intrusive unordered_set
Make the underlying map to be a boost::intrusive::unordered_set<timestamped_val>
instead of std::unordered_set<Key, timestamped_val>.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-22 18:45:13 -04:00
Tomasz Grabiec
5aeb9eb70c utils: Extract to_boost_visitor() to a separate header 2017-05-22 19:30:02 +02:00
Tomasz Grabiec
69e2eccf68 allocating_strategy: Introduce alloc_strategy_unique_ptr<> 2017-05-22 19:30:02 +02:00
Raphael S. Carvalho
4b4a1883aa refresh: do not use default priority for loading new sstables
Metadata is read using default priority class, which can significantly
slow down the process under high load. Compaction class can be used,
and if it turns out to be a problem, we can switch to a special class
for it.

Fixes #1859.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170517184546.17497-1-raphaelsc@scylladb.com>
2017-05-22 19:03:17 +03:00
Avi Kivity
ef428d008c Merge "reduce memory requirement for loading sstables" from Rapahel
"fixes a problem in which memory requirement for loading in-memory
components of sstables is very high due to unlimited parallelism."

* 'mem_requirement_sstable_load_v2_2' of github.com:raphaelsc/scylla:
  database: fix indentation of distributed_loader::open_sstable
  database: reduce memory requirement to load sstables
  sstables: loads components for a sstable in parallel
  sstables: enable read ahead for read of in-memory components
  sstables: make random_access_reader work with read ahead
2017-05-22 18:23:03 +03:00
Raphael S. Carvalho
28206993a4 database: fix indentation of distributed_loader::open_sstable
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-22 11:52:52 -03:00
Raphael S. Carvalho
a4e414cb3b database: reduce memory requirement to load sstables
SSTable load temporarily uses more space than needed to store metadata,
due to:
1) All components are read using read_simple(), which uses a 128k buffer.
file::dma_read_bulk() will allocate 128k, and may potentially allocate
another big buffer (128k - read) for file::read_maybe_eof().
2) read_filter() may use double the space it needs to.

Because sstable loading parallelism is unlimited, Scylla may require
much more memory to load all sstables, and that may lead to OOM.
The higher the number of sstables, the higher the memory overhead.

To confirm this problem, I wrote a test[1] which loads 30k sstables in
parallel and reports the memory usage peak in the end.
When loading 30k sstables, each of which metadata is ~300kb, memory
usage peak was ~18G. When loading completed, only ~9GB were needed to
store all the metadata.
[1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d

To fix this problem, we need to set a limit on load parallelism (let's
start with a small number like 3 and adjust later if needed) and rely
on readahead so that the requirement drops considerably without
increasing boot time. Actually, boot time is improved by it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-05-22 11:52:51 -03:00
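The fix amounts to bounding concurrency with a semaphore. In Seastar this is done with futures rather than threads, but the same shape can be sketched with a hand-rolled counting semaphore; `semaphore`, `load_sstables`, and the limit of 3 are illustrative and mirror the small starting number the commit suggests.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore used to cap how many loads run at once.
class semaphore {
    std::mutex _m;
    std::condition_variable _cv;
    int _units;
public:
    explicit semaphore(int units) : _units(units) {}
    void wait() {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [this] { return _units > 0; });
        --_units;
    }
    void signal() {
        { std::lock_guard<std::mutex> lk(_m); ++_units; }
        _cv.notify_one();
    }
};

// Simulate loading n sstables with at most `limit` loads in flight;
// returns the highest concurrency actually observed (always <= limit).
inline int load_sstables(int n, int limit) {
    semaphore sem(limit);
    std::atomic<int> in_flight{0}, peak{0};
    std::vector<std::thread> loaders;
    for (int i = 0; i < n; ++i) {
        loaders.emplace_back([&] {
            sem.wait();                // bound the parallelism
            int cur = ++in_flight;
            int prev = peak.load();
            while (cur > prev && !peak.compare_exchange_weak(prev, cur)) {}
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            --in_flight;
            sem.signal();
        });
    }
    for (auto& t : loaders) {
        t.join();
    }
    return peak.load();
}
```

The semaphore guarantees the in-flight count never exceeds the limit, so the peak memory needed for component buffers is bounded regardless of how many sstables exist.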
Raphael S. Carvalho
043fae2ef5 sstables: loads components for a sstable in parallel
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-05-22 11:52:49 -03:00
Raphael S. Carvalho
0ac729fd57 sstables: enable read ahead for read of in-memory components
Read ahead 4 is used. Let's adjust it later if needed. File size is
used to prevent file_input_stream from issuing useless reads beyond
file size with read ahead enabled. We can switch to variant without
length once file_input_stream handles it properly.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-22 11:52:37 -03:00
Raphael S. Carvalho
77b8870cf3 sstables: make random_access_reader work with read ahead
Scylla crashes if read ahead is enabled by file_random_access_reader
because a call to seek() destroys the existing input stream without
closing it first.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-22 11:52:33 -03:00
Duarte Nunes
6ac73b57fb cql3/statements/select_statement: Remove dead code
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170522100230.17393-1-duarte@scylladb.com>
2017-05-22 14:32:12 +03:00
Avi Kivity
5828ddcca4 Merge seastar upstream
* seastar 4af898c...68dbf60 (4):
  > dpdk: follow namespace changes to fix compile error
  > perftune.py: fix regression introduced in df5f74ac
  > doc: typo in README.md
  > posix_net: load-balance connections
2017-05-22 12:39:48 +03:00
Asias He
b56ba02335 gossip: Make bootstrap more robust
The bootstrapping node will be a gossip-only member until the streaming
finishes and the node reaches NORMAL state. If during this time the
bootstrapping node is overwhelmed with streaming, it is possible the node
will delay updating the gossip heartbeat. Be forgiving to the bootstrapping
node and do not remove it from gossip too fast. Otherwise, streaming rpc
verbs will not be resent because the node is not in gossip membership
anymore.

Fixes #2150

Message-Id: <286d7035d854f2a48abf4e1e2e3bfcb8b22b9ca2.1494553580.git.asias@scylladb.com>
2017-05-21 19:25:40 +03:00
Takuya ASADA
7777b558c4 dist/redhat: Use mock for CentOS/RHEL rpms
Enable mock for CentOS/RHEL, also support cross building by mock.

Fixes #630

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170513171200.14926-1-syuu@scylladb.com>
2017-05-21 19:22:54 +03:00
Avi Kivity
2f23648b9e Revert "dist: add conflict with Cassandra"
This reverts commit da55aecca3.  Instead of an
install-time conflict, we'll add a run-time conflict.
2017-05-21 18:37:59 +03:00
Alexys Jacob
c8116b4252 scylla_raid_setup: fix typo on print_usage
Simple typo fix on the usage message output, the script name was not correct.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170519145851.6205-1-ultrabug@gentoo.org>
2017-05-21 18:01:28 +03:00
Avi Kivity
5b182537db Merge seastar upstream
* seastar 8aef5f5...4af898c (4):
  > memory: fix debug build
  > tests: fix slab_test build
  > xen: fix fallouts from seastar namespace change
  > build: make swagger generated files depend on the code generator
2017-05-21 13:48:24 +03:00
Alexys Jacob
8dbad4f34a scylla_sysconfig_setup: fix typo on print_usage
Simple typo fix on the usage message output, the script name was not correct.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170519143227.2741-1-ultrabug@gentoo.org>
2017-05-21 13:41:43 +03:00
Alexys Jacob
c0756d97b8 scylla_setup: fix typos on cpu scaling messages
This fixes typos on CPU scaling related messages.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170519143703.3574-1-ultrabug@gentoo.org>
2017-05-21 13:41:42 +03:00
Glauber Costa
5f99158889 api: return correct values for bloom filter statistics
We are currently suspecting that the bloom filter false positive ratio
is not being respected. While trying to debug that, I found out that we
have a more basic problem:

The numbers are all meaningless, because the stats are wrong.  We are
accumulating by summing the ratios together. It's easy to see how this
doesn't work, if we look at an example where the ratio for some CFs is
zero:

SST1: false = 1, total = 2. ratio = 0.5
SST2: false = 0, total = 98 . ratio = 0.

The real ratio in this example is 1 / (98 + 2) = 1 %, but the displayed
ratio will be 0.5 + 0 = 0.5.

This patch will map reduce all the sstables together keeping both
numerator and denominator, yielding the right value at the end. To do
that, we'll reuse the existing ratio_holder class, which already does
exactly what we want.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170518222333.16307-1-glauber@scylladb.com>
2017-05-21 13:11:22 +03:00
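The arithmetic in the example can be checked directly. This `ratio_holder` is a hypothetical analogue of the class the commit says it reuses: accumulate numerator and denominator separately and divide only when reporting, instead of summing per-sstable ratios.

```cpp
// Hypothetical analogue of the ratio_holder the commit reuses: keep the
// false-positive count and the total separately, divide only at the end.
struct ratio_holder {
    double numerator;
    double denominator;
    ratio_holder(double n = 0, double d = 0) : numerator(n), denominator(d) {}
    ratio_holder& operator+=(const ratio_holder& o) {
        numerator += o.numerator;
        denominator += o.denominator;
        return *this;
    }
    double ratio() const {
        return denominator == 0 ? 0 : numerator / denominator;
    }
};
```

With the example's numbers, summing the two per-sstable ratios yields 0.5, while aggregating numerators and denominators first yields the real 1% ratio.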
Avi Kivity
ebaeefa02b Merge seastar upstream (seastar namespace)
 - introduced the "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with the logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Avi Kivity
dab2783b58 Merge seastar upstream
* seastar 45b718b...f726938 (2):
  > memory: add --mbind option to supress warning message when running Seastar apps on container
  > Add support for Gentoo Linux irqbalance configuration detection.
2017-05-20 21:15:46 +03:00
Avi Kivity
c8cb3d6ff5 Merge "Materialized views: bug fixes and unit tests" from Duarte
"This series fixes bugs related to materialized views, most pertaining
to column filtering in the where clause."

* 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla:
  tests/view_schema_test: Add more test cases
  tests/cql_assertions: Add assertion for row set equality
  single_column_relation: Correctly print IN relation
  statement_restrictions: Allow filtering regular columns for views
  statement_restrictions: Relax clustering restrictions for views
  statement_restrictions: Relax partition restrictions for views
  cql3/statements: Prevent setting default ttl on view
  cql3/restrictions: Complete implementation of is_satisfied_by()
  db/view: Re-implement clustering_prefix_matches()
  db/view: Re-implement partition_key_matches()
  db/view: Generate regular tombstone for base deletions
  db/view: Consider cell liveness when generating updates
  db/view: Don't generate view updates for static rows
2017-05-20 13:52:56 +03:00
Tomasz Grabiec
cd4d15672b utils: estimated_histogram: Fix clear()
It was a no-op. It doesn't seem currently used, but I will have a use
for it soon.
Message-Id: <1495198172-1969-1-git-send-email-tgrabiec@scylladb.com>
2017-05-19 14:34:34 +01:00
Paweł Dziepak
c560cf9d9d Merge "fixes and improvements in the permissions cache implementation" from Vlad
"There are numerous issues in the current implementation of permissions
cache starting from the logical errors and bugs and ending with the
suboptimal implementation described in the issue #2262."

* 'permissions_cache_fixes-v4' of github.com:scylladb/seastar-dev:
  utils::loading_cache: avoid the reads storm when the key is not in the cache
  utils::loading_cache: cleanup
  utils::loading_cache: align the constraints in the constructor with the parameters description
  utils::loading_cache: refresh in the background
  auth::auth: add operator<<() for a permission_cache key
  auth::auth::permissions_cache: use the values from the configuration - don't try to be smart
  db::config: define a saner default value for permissions_validity_in_ms
2017-05-18 13:33:05 +01:00
Vlad Zolotarov
6a63c87a9f utils::loading_cache: avoid the reads storm when the key is not in the cache
Use a mutex to serialize producers when the key is not present in the cache.

Fixes #2262

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-18 07:55:48 -04:00
Tomasz Grabiec
3fc1703ccf range: Fix SFINAE rule for picking the best do_lower_bound()/do_upper_bound() overload
mutation_partition has a slicing constructor which is supposed to copy
only the rows from the query range. The rows are located using
nonwrapping_range::lower_bound() and
nonwrapping_range::upper_bound(). Those two have two different
implementations chosen with SFINAE. One is using std::lower_bound(),
and one is using the container's built-in lower_bound(), should it
exist. We're using an intrusive tree in mutation_partition, so
container's lower_bound() is preferred. It's O(log N) whereas
std::lower_bound() is O(N), because tree's iterator is not random
access.

However, the current rule for picking container's lower_bound() never
triggers, because lower_bound() has two overloads in the container:

  ./range.hh:618:14: error: decltype cannot resolve address of overloaded function
              typename = decltype(&std::remove_reference<Range>::type::upper_bound)>
              ^~~~~~~~

As a result, the overload which uses std::lower_bound() is used.

Spotted when running perf_fast_forward with wide partition limit in
cache lifted off. It's so slow that I timed out waiting for the result
(> 16 min).

Fixes #2395.

Message-Id: <1495048614-9913-1-git-send-email-tgrabiec@scylladb.com>
2017-05-18 13:28:10 +03:00
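The kind of overload selection described above can be sketched like this (hypothetical `my_lower_bound`/`do_lower_bound` names, not the actual range.hh code). The key point matches the quoted error: detect the member by *invoking* it in the SFINAE context, rather than taking its address with `decltype(&T::lower_bound)`, which breaks as soon as the member is overloaded.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Preferred overload: viable only if c.lower_bound(k) compiles,
// i.e. the container has a usable member lower_bound (O(log N)).
template <typename Container, typename Key>
auto do_lower_bound(Container& c, const Key& k, int)
    -> decltype(c.lower_bound(k)) {
    return c.lower_bound(k);
}

// Fallback: generic std::lower_bound over iterators (O(N) for
// non-random-access iterators such as an intrusive tree's).
template <typename Container, typename Key>
auto do_lower_bound(Container& c, const Key& k, long)
    -> decltype(c.begin()) {
    return std::lower_bound(c.begin(), c.end(), k);
}

template <typename Container, typename Key>
auto my_lower_bound(Container& c, const Key& k) {
    return do_lower_bound(c, k, 0);  // int overload wins when viable
}
```

A std::set takes the member path; a std::vector falls back to std::lower_bound.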
Avi Kivity
ba31619594 tests: fix partitioner_test for g++ 5
It can't make the leap from dht::ring_position to
stdx::optional<range_bound<dht::ring_position>> for some reason.
2017-05-18 13:09:41 +03:00
Pekka Enberg
30b5933db2 Merge "Add Gentoo Linux support to utility and setup scripts" from Alexys
"These patches add support for setting up and operating ScyllaDB on Gentoo Linux.

 * scylla_setup and related scripts
 * node_health_check

 I have kept them as simple as possible and tested them to successfully
 set up and operate a three-node cluster running on Gentoo Linux."

* 'gentoo_linux_support' of github.com:ultrabug/scylla:
  scylla_setup: add gentoo linux installation detection
  prometheus node_exporter install: add support for gentoo linux
  raid setup: add support for gentoo linux
  ntp setup: add support for gentoo linux
  kernel check: add support for gentoo linux
  cpuscaling setup: add support for gentoo linux
  coredump setup: add support for gentoo linux
  detect gentoo linux on selinux setup
  add gentoo_variant detection and SYSCONFIG setup
2017-05-18 09:41:13 +03:00
Vlad Zolotarov
1ef22f84c1 utils::loading_cache: cleanup
- Fix a callback signature: receive a const ref.
- White spaces.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 15:03:14 -04:00
Vlad Zolotarov
87ce0b2d47 utils::loading_cache: align the constraints in the constructor with the parameters description
According to the description of permissions_validity_in_ms, the permissions_cache is enabled if this
value is set to a non-zero value. Otherwise the permissions_cache is disabled.

According to the permissions_update_interval_in_ms description it must have a non-zero value if permissions_cache
is enabled.

permissions_cache_max_entries description doesn't explicitly state it but it makes no sense to allow it to be zero
if permissions_cache is enabled.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 15:03:14 -04:00
Vlad Zolotarov
e286828472 utils::loading_cache: refresh in the background
This patch changes the way a loading_cache works.

Before this patch:
   1) If a permissions key is not in the cache it's loaded in the foreground and the original
      query is blocked till the permissions are loaded.
   2) Every _period the timer does the following:
      1) If a value was loaded more than _expiry time ago it is removed from the cache.
      2) If the cache is too big - the less recently loaded values are removed till the cache
         fits the requested size.

After this patch:
   1) If a permissions key is not in the cache it's loaded in the foreground and the original
      query is blocked till the permissions are loaded.
   2) Every _period the timer does the following:
      1) If a value in the cache was loaded or read for the last time more than _expiry time ago - it's removed from the cache.
      2) If the cache is too big - the less recently read values are removed till the cache fits the requested size.
      3) The values that were loaded more than _refresh time ago are re-read in the background.

The new implementation reduces the foreground reads for a frequently used value to a single
event (when the value is loaded for the first time).

It also ensures we do not reload values we no longer need.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 15:03:06 -04:00
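The per-tick behavior described above (leaving out the size trimming) can be modeled as a pure decision function. This is a hypothetical sketch with invented names, not the actual loading_cache code: drop entries neither loaded nor read within `_expiry`, and background-refresh entries loaded more than `_refresh` ago.

```cpp
#include <cassert>
#include <chrono>

enum class timer_action { keep, remove, refresh };

// Decision made for each cache entry on every _period tick:
// - removed if it was neither loaded nor read within the expiry window;
// - refreshed in the background if its load is older than the refresh
//   window;
// - kept untouched otherwise.
timer_action on_timer(std::chrono::seconds since_load,
                      std::chrono::seconds since_read,
                      std::chrono::seconds expiry,
                      std::chrono::seconds refresh) {
    if (since_load > expiry && since_read > expiry) {
        return timer_action::remove;   // last touch is too old
    }
    if (since_load > refresh) {
        return timer_action::refresh;  // reload in the background
    }
    return timer_action::keep;
}
```

A frequently read entry never expires, and as long as it keeps being refreshed in the background it never triggers a foreground load again.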
Alexys Jacob
fa0944ac19 scylla_setup: add gentoo linux installation detection 2017-05-17 18:06:54 +02:00
Alexys Jacob
9bb1bda466 prometheus node_exporter install: add support for gentoo linux 2017-05-17 18:06:34 +02:00
Alexys Jacob
1d235e5012 raid setup: add support for gentoo linux 2017-05-17 18:06:14 +02:00
Alexys Jacob
fdd5944ab2 ntp setup: add support for gentoo linux 2017-05-17 18:05:59 +02:00
Alexys Jacob
412f96a1bf kernel check: add support for gentoo linux 2017-05-17 18:05:45 +02:00
Alexys Jacob
a198f2b1af cpuscaling setup: add support for gentoo linux 2017-05-17 18:05:24 +02:00
Alexys Jacob
6a1807a7d8 coredump setup: add support for gentoo linux 2017-05-17 18:05:08 +02:00
Alexys Jacob
bc63e501db detect gentoo linux on selinux setup 2017-05-17 18:04:20 +02:00
Vlad Zolotarov
4edb336ac5 auth::auth: add operator<<() for a permission_cache key
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 12:03:56 -04:00
Vlad Zolotarov
d780818cac auth::auth::permissions_cache: use the values from the configuration - don't try to be smart
Our configuration already has the default values for the permission cache parameters.
Therefore, if the user provides bad parameters, we'd rather fail the load and inform them
about the bad parameters instead of trying to silently "fix" them.

In addition, the original code wasn't passing the parameters correctly: it switched the "expiry" and "refresh" parameters in
the utils::loading_cache constructor.

Add to this that the original code was doing really strange things in the permission_cache::expiry(cfg) method.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 12:03:56 -04:00
Vlad Zolotarov
ea1cfabe28 db::config: define a saner default value for permissions_validity_in_ms
It makes little sense to have the same value for permissions_update_interval_in_ms and permissions_validity_in_ms.
This may cause the values to be invalidated only because of some minor delays in the timer scheduling.

It makes a lot more sense to make the permissions_update_interval_in_ms value smaller than permissions_validity_in_ms.
This way we would minimize the chances of "false invalidation" due to some small delays in the timer scheduling.

In addition, 2s seems too small a value for permissions_validity_in_ms since our default read_request_timeout_in_ms is 5s.
This means that a single system_auth read failure would guarantee that the following queries are going to read system_auth data
in the foreground.

Setting it to 10s would allow a second read attempt before we enforce the foreground read.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 12:03:56 -04:00
Alexys Jacob
2ca0380d06 add gentoo_variant detection and SYSCONFIG setup 2017-05-17 18:03:53 +02:00
Avi Kivity
2aa5b3e20c Merge "Improve perf_fast_forward test" from Tomasz
"Notably:
  - add validation of the results (e.g. fragment count, expectations about disk activity)
  - add cache-specific tests"

* 'tgrabiec/add-cache-tests-to-perf-fast-forward' of github.com:cloudius-systems/seastar-dev:
  tests: perf_fast_forward: Report cache stats
  row_cache: Keep counters in a struct
  tests: perf_fast_forward: Add cache-specific tests
  tests: perf_fast_forward: Extract test_reading_all()
  tests: perf_fast_forward: Add validation of the results
  tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments
  tests: perf_fast_forward: Allow the test to be interrupted
  tests: perf_fast_forward: Allow testing with cache enabled
  row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
2017-05-17 18:06:02 +03:00
Calle Wilund
29b20d410a schema_tables: Remove "class" attribute from strategy options
Not 100% proper, but in line with how we still store the info.
Ensures (or at least helps) that schema loaded from tables
and schema from the builder stay comparable.

Fixes schema_changes_test error.

Message-Id: <1495030581-2138-2-git-send-email-calle@scylladb.com>
2017-05-17 17:56:11 +03:00
Calle Wilund
6ca07f16c1 scylla: fix compilation errors on gcc 5
Message-Id: <1495030581-2138-1-git-send-email-calle@scylladb.com>
2017-05-17 17:56:06 +03:00
Paweł Dziepak
3ecceaee48 Merge "Fix fast_forward_to() on sstable reader being ignored in some cases" from Tomasz
"When mutation reader enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request."

* 'tgrabiec/fix-initial-fast-forward-to-for-single-key-sstable-readers' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test forwarding in single-key readers
  sstables: Remove unused code
  sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
  sstables: Fix verify_end_state() to tolerate ATOM_START_2 state
2017-05-17 15:35:30 +01:00
Avi Kivity
eb69fe78a4 Merge "Adding private repository to housekeeping" from Amnon
"This series adds private repository support to scylla-housekeeping"

* 'amnon/housekeeping_private_repo_v3' of github.com:cloudius-systems/seastar-dev:
  scylla-housekeeping service: Support private repositories
  scylla-housekeeping-upstart: Use repository id, when checking for version
  scylla-housekeeping: support private repositories
2017-05-17 15:56:46 +03:00
Tomasz Grabiec
777ffa3a27 tests: perf_fast_forward: Report cache stats 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
d1bde3036e row_cache: Keep counters in a struct
So that taking a snapshot of all stats is easy.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
7a81f5e980 tests: perf_fast_forward: Add cache-specific tests 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
1a7b03004a tests: perf_fast_forward: Extract test_reading_all() 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
a38fd16f89 tests: perf_fast_forward: Add validation of the results 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
3c3ea51657 tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments
make_pkeys() needs to be invoked with n equal to the number of keys
which the table was populated with. Otherwise the extra keys, which
are missing in the table, may be placed anywhere in the vector due to
ring order sorting, and break the assumption that the table contains
all keys from the array up to index n. This resulted in the test
reading slightly fewer fragments than would follow from the desired
count.

Another problem is that we should not skip the fast_forward_to() call
for the initial range (workaround for a bug in sstable mutation
reader), otherwise we will read slightly less than expected as well.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
49a0bc3847 tests: perf_fast_forward: Allow the test to be interrupted 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
5c7f5643a6 tests: perf_fast_forward: Allow testing with cache enabled 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
35c9dfecc2 row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
Needed to make perf_fast_forward work with cache enabled.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
84648f73ef Merge "Fix performance problems with high shard counts tag" from Avi
From http://github.com/avikivity/scylla exponential-sharder/v3.

The sharder, which takes a range of tokens and splits it among shards, is
slow with large shard count and the default
murmur3_partitioner_ignore_msb_bits.

This patchset fixes excessive iteration in sstable sharding metadata writer and
nonsingular range scans.

Without this patchset, sealing a memtable takes > 60 ms on a 48-shard
system.  With the patchset, it drops below the latency tracker threshold I
used (5 ms).
2017-05-17 14:03:33 +02:00
Avi Kivity
68034604e1 dht: murmur3_partitioner: simplify moving to and from the zero-based token range 2017-05-17 13:50:30 +03:00
Avi Kivity
1a99ebaa65 storage_proxy: switch to the exponential sharder for nonsingular queries
Nonsingular queries used exponential expansion of the token space to
avoid spending too much cpu time on near-empty tables, but the generation
of the search space was itself exponential.  Switch to the exponential sharder
which has linear cost.
2017-05-17 13:50:30 +03:00
Avi Kivity
00f48f96cb sstables: select just the shard we want when writing sharding metadata
On a system with many shards, this saves many useless iterations where
we just skip the unwanted shard.
2017-05-17 13:50:30 +03:00
Avi Kivity
44a1a51987 tests: add tests for dht::split_range_to_single_shard() 2017-05-17 13:50:30 +03:00
Avi Kivity
76f12a8842 dht: add split_range_to_single_shard()
Intersects a shard's owning range with a ring position range, and return
the sorted result.
2017-05-17 13:50:27 +03:00
Tomasz Grabiec
1da3daa4f4 range: Use more standard notation for singular range
Reuse notation for a single-element set.

Message-Id: <1494923827-10097-1-git-send-email-tgrabiec@scylladb.com>
2017-05-17 13:28:42 +03:00
Avi Kivity
a65e8bd215 dht: add a ring-position-range-vector variant of the exponential sharder
The "exponentiality" is not carried over from one range to another, because
we expect one or two ranges (two ranges result from a wrapped around thrift
token range).
2017-05-17 13:18:52 +03:00
Avi Kivity
6eb6f12909 tests: add test for ring_position_exponential_sharder 2017-05-17 13:18:52 +03:00
Avi Kivity
f671ac13b4 dht: add an exponential ring_position range sharder
Like the regular sharder, the exponential sharder divides a range into
subranges owned by individual ranges.  Unlike the regular sharder, it
generates ever-increasing subranges, spanning more and more shards, and
eventually returns several subranges per shard.  To avoid using
exponential cpu and memory, subranges belonging to a single shard are merged,
and a flag is set to indicate the subranges are not ordered wrt. each other.
2017-05-17 13:18:49 +03:00
Avi Kivity
025c6b45b2 dht: extend i_partitioner::next_token_for_shard()
Right now, next_token_for_shard() only allows iterating linearly in shard
order.  Add the ability to select a specific shard to skip to (in case we're
only interested in a single shard), and to select larger ranges (so that
exponential increases are not implemented by iteration).
2017-05-17 12:30:03 +03:00
Avi Kivity
7156ea8804 dht: make ring_position_range_sharder more independent of global_partitioner
Useful for testing.
2017-05-17 12:30:03 +03:00
Avi Kivity
302fec8293 dht: make i_partitioner::name() const 2017-05-17 12:30:03 +03:00
Avi Kivity
f462c4327e dht: make i_partitioner keep track of the number of shards it was configured with
Useful for testing classes layered on top of the partitioner (the sharders).
2017-05-17 12:30:03 +03:00
Avi Kivity
04b16ae8ec dht: fix partitioner initialization for tests
The partitioners now depend on smp::count to be initialized correctly,
but smp::count isn't available at static initialization time.

The scylla executable isn't affected because it calls set_global_partitioner()
after smp::count has been initialized.

Fix by deferring initialization to the first global_partitioner() call.
2017-05-17 12:30:03 +03:00
Avi Kivity
1c6cecd9d0 utils: introduce div_ceil()
Divides integrals but rounds up rather than down.
2017-05-17 12:30:03 +03:00
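A helper like the one introduced above can be sketched in a couple of lines. This is a minimal version assuming non-negative operands and a positive divisor (the actual utils implementation may differ in constraints and naming):

```cpp
#include <cassert>

// Integer division rounding up instead of down.
// Assumes a >= 0 and b > 0; a + b - 1 must not overflow T.
template <typename T>
constexpr T div_ceil(T a, T b) {
    return (a + b - 1) / b;
}
```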
Avi Kivity
f1dbb951da Merge "Materialized views: implement read before write" from Duarte
"This patch ensures we read the base table rows that an update
is modifying, in order to correctly calculate the set of
materialized view updates.

The read-before-write is performed on the shard applying the
update and attempts to do a precise read of the rows being modified,
which can be more than one in case of ranged deletions or a
batch update."

* 'materialized-views/read-existing/v2' of https://github.com/duarten/scylla:
  database: Read existing base mutations
  db/view: Calculate clustering ranges for MV read-before-write query
  db/view: Replace entry if cells don't match
  view_info: Store base regular col in the view's PK as column_id
  compound_view_wrapper: Add tri_compare
  bound_view: Build range bound from bound_view
  clustering_bounds_comparator: Enable Range concept
  range: Add lvalue version of transform()
  tests: Add test case for nonwrapping_range::intersection()
  nonwrapping_range: Add intersection() function
2017-05-17 12:26:26 +03:00
Duarte Nunes
ef252036ba tests/view_schema_test: Add more test cases
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 11:21:58 +02:00
Duarte Nunes
983af595e9 database: Read existing base mutations
When generating updates for a materialized view we need to read the
existing base row, to be able to determine the primary key of the view
row the new base update will supplant, in case the view includes a
base non-primary key column in its own primary key. That old view row
will be tombstoned or updated, if it exists, depending on the difference
between the new base row and the existing one, if any.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
0861a66853 tests/cql_assertions: Add assertion for row set equality
For row set equality, the order of the actual rows doesn't need to
match the order of the expected rows.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
8a77bfe35b db/view: Calculate clustering ranges for MV read-before-write query
Introduce the calculate_affected_clustering_ranges() function to
calculate the smallest subset of affected clustering ranges that we
need to query for.

The update_requires_read_before_write() function checks whether
a view is potentially affected by the base update.

The patch also cleans up the may_be_affected_by() function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
9115862419 single_column_relation: Correctly print IN relation
So that the output of a set of relations can be fed back into the CQL
parser; useful for materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
ec681060a8 db/view: Replace entry if cells don't match
If a base table regular column is part of the view's pk, and if that
column changes, we should replace the entry, by deleting the row(s)
with the old value and inserting a new one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
3ef1a825c9 statement_restrictions: Allow filtering regular columns for views
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
0170c743d3 statement_restrictions: Relax clustering restrictions for views
In process_clustering_columns_restrictions(), don't require all
clustering columns to be restricted if we're dealing with a
materialized view's where clause.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
4f90b19cc2 statement_restrictions: Relax partition restrictions for views
In process_partition_key_restrictions(), don't require all partition
key columns to be restricted if we're dealing with a materialized
view's where clause.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
99b234d717 cql3/statements: Prevent setting default ttl on view
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
d480daffca cql3/restrictions: Complete implementation of is_satisfied_by()
This patch implements the is_satisfied_by() function for the remaining
types of restrictions, lifting the function declaration to
abstract_restrictions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
bad0edb23b db/view: Re-implement clustering_prefix_matches()
This patch implements clustering_prefix_matches() in terms of
abstract_restriction::is_satisfied_by() instead of ranges, which
supports filtering just a subset of the clustering columns.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
b0d1ea76a2 db/view: Re-implement partition_key_matches()
This patch implements partition_key_matches() in terms of
abstract_restriction::is_satisfied_by() instead of ranges, which
supports filtering just a component of a compound partition key.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
38be85a21d db/view: Generate regular tombstone for base deletions
Instead of shadowable tombstones, which only apply to updates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
1fd8b8e723 db/view: Consider cell liveness when generating updates
This patch ensures we take into account the liveness of the base's
regular column in the view's pk when generating view updates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
c421da6825 db/view: Don't generate view updates for static rows
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
f41a5e554d view_info: Store base regular col in the view's PK as column_id
This patch stores the base_non_pk_column_in_view column as column_id,
which is more convenient, and it also stores a two-level optional to
encode both lazy initialization and the absence of such a column.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
257eaa0d05 compound_view_wrapper: Add tri_compare
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
06a6679826 bound_view: Build range bound from bound_view
We introduce the bound_view::to_range_bound() function, which builds a
wrapping_range or nonwrapping_range bound from a bound_view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
8288e504fb clustering_bounds_comparator: Enable Range concept
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
fb1e966137 range: Add lvalue version of transform()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
f365b7f1f7 tests: Add test case for nonwrapping_range::intersection()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
1f9359efba nonwrapping_range: Add intersection() function
intersection() returns an optional range with the intersection of
this range and the other, specified range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
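The semantics described above can be modeled over closed integer intervals. This is a hedged sketch only: the real nonwrapping_range also handles open, mixed, and infinite bounds, which this toy `interval` omits.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Closed integer interval [lo, hi], standing in for a nonwrapping_range.
struct interval {
    int lo, hi;
};

// Intersection of two non-wraparound intervals: the overlap if it is
// non-empty, std::nullopt otherwise.
std::optional<interval> intersection(const interval& a, const interval& b) {
    int lo = std::max(a.lo, b.lo);
    int hi = std::min(a.hi, b.hi);
    if (lo > hi) {
        return std::nullopt;  // disjoint ranges
    }
    return interval{lo, hi};
}
```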
Avi Kivity
f5dae826ce Merge "Migrate schema tables to v3 format" from Calle
"Defines origin v3-format for system/schema tables, and uses them for
schema storage/retrieval.

Includes a legacy_schema_migrator implementation/port from origin. Note
that since we don't support features like triggers, functions and
aggregates, it will bail if encountering such a feature used.

Note also that this patch set does not convert the "hints" and
"backlog" tables, even though these have changed in v3 as well.
That will be a separate patch set.

Tested against dtests. Note that patches for dtest + ccm
will follow."

* 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits)
  legacy_schema_migrator: Actually truncate legacy schema tables on finish
  database: Extract "remove" from "drop_columnfamily"
  v3 schema test fixes
  thrift: Update CQL mapping of static CFs
  schema_tables: Use v3 schema tables and formats
  type_parser: Origin expects empty string -> bytes_type
  cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)
  types_test: v3 style schemas enforce explicit "frozen" in tuples/UDTs
  cql3_type: v3 to_string
  cql_types: Introduce cql3_type::empty and associate with empty data_type
  schema: rename column accessors to be in line with origin
  schema: Add "is_static_compact_table"
  schema_builder: Add helper to generate unique column names akin origin
  schema: Add utility functions for static columns
  schema: Use heterogeneous comparator for columns bounds
  cql3_type_parser: Resolve from cql3 names/expressions
  cql3_type: Add "prepare_internal" and "references_user_type"
  cql3::cql3_type: Add prepare_internal path using only "local" holders
  cql3_type: Add virtual destructor.
  database/main: encapsulate system CF dir touching
  ...
2017-05-17 11:25:52 +03:00
Asias He
0abfe39d8f database: Log compaction strategy setting on shard 0 only
The compaction strategy is per node not per shard. Do not duplicate the
same log on all shards.

Message-Id: <1494835519.git.asias@scylladb.com>
2017-05-17 11:17:41 +03:00
Avi Kivity
f09f056515 Merge seastar upstream
* seastar 4a3118c...45b718b (7):
  > tests: make connect_test use a random port
  > log: Introduce log.info0
  > configure.py: link to DPDK PMD drivers which are already built on build/dpdk and enabled by default on DPDK config
  > Update fmt submodule
  > perftune: fix perftune.py IndexError when NIC uses less IRQs than requested.
  > build: Add more required build dependencies to the Dockerfile
  > Prometheus: Reserve in protobuf object before iterating
2017-05-17 11:16:58 +03:00
Raphael S. Carvalho
a58699cc92 sstables: kill sstable::mark_for_deletion_on_disk
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170515233021.21223-1-raphaelsc@scylladb.com>
2017-05-17 11:15:59 +03:00
Raphael S. Carvalho
deabf06d49 lcs: log invariant restoration
It will be useful for understanding the strategy behavior after
invariant is possibly broken by resharding.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170515234925.22793-1-raphaelsc@scylladb.com>
2017-05-17 11:15:41 +03:00
Avi Kivity
2eef7cd395 Merge "compress the tracing session ID when compression is requested" from Vlad
"Tested with:
   - test.py --mode release
   - debug/test-serialization
   - c-s with both debug and release compiled scylla with authentication enabled:
     cassandra-stress write  n=10000 no-warmup -rate threads=10 -mode native unprepared cql3 user='cassandra' password='cassandra'"

* 'compress_tracing_session_id-v6' of github.com:cloudius-systems/seastar-dev:
  cql_server::response: rework the tracing session ID insertion
  utils::UUID: align the UUID serialization API with the similar API of other classes in the project
  utils: serialization: unify the variety of serialize_XXX(...)
  cql_server::response: rework the compress(...) method
  cql_server::response: store the frame flags inside the class
2017-05-17 09:48:49 +03:00
Pekka Enberg
374c3d66ab Merge "Fixes for CQL regressions" from Duarte
"This series fixes a set of regressions introduced by
 f7bc88734a, resulting in two failed
 tests:

   testDenseNonCompositeTable(org.apache.cassandra.cql3.validation.operations.CreateTest)

 and

   testStaticColumnsWithDistinct(org.apache.cassandra.cql3.validation.entities.StaticColumnsTest)"

* 'cql-fixes/v1' of github.com:duarten/scylla:
  update_statement: Reject empty values for dense clustering key
  modification_statement: Fix detection of clustering keys
  cql3/restrictions/statement_restrictions: Consider statement type
  cql3/statements/modification_statement: Extract statement_type
2017-05-17 09:29:24 +03:00
Vlad Zolotarov
a0737abdc5 cql_server::response: rework the tracing session ID insertion
Insert the tracing session ID into the response body in the cql_server::response constructor.

Fixes #2356

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:57:28 -04:00
Vlad Zolotarov
494ea82a88 utils::UUID: align the UUID serialization API with the similar API of other classes in the project
The standard serialization API (e.g. in data_value) includes the following methods:

size_t serialized_size() const;
void serialize(bytes::iterator& it) const;
bytes serialize() const;

Align the utils::UUID API with the pattern above.

The only addition is that we are going to make the output iterator parameter of the second method above
a template so that we may serialize into different output sinks.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:56:03 -04:00
Vlad Zolotarov
7706775a63 utils: serialization: unify the variety of serialize_XXX(...)
Use the same templated implementation for all different serialize_XXX(...).

The chosen implementation is based on the std::copy_n(char*, size, OutputIterator),
which is heavily optimized and will be using memcpy/memmove where possible.

This patch also removes the not needed specializations that accept signed integer
values since we were casting them to unsigned value anyway.

The std::ostream based specializations are also removed since they are not used
anywhere except test-serialization.cc, and adapting the ostream to the iterator
is a one-liner.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:56:03 -04:00
Vlad Zolotarov
a33fe5b775 cql_server::response: rework the compress(...) method
Cleanup the compress(...) method interface:
   - Encapsulate the technical details inside the method:
      - Re-write the _body inside the method instead of returning it.
      - Set the response::_flags inside the method.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:53:35 -04:00
Vlad Zolotarov
c00814383d cql_server::response: store the frame flags inside the class
It makes a lot more sense to keep the flags mask inside the response and update it each time
the corresponding feature is set, instead of holding separate components such as the tracing
state pointer.

This patch adds this ability to set the flags.
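The approach described above — a single flags mask that gets a bit OR-ed in whenever a feature is enabled — can be sketched roughly like this (the struct and bit values are illustrative, not Scylla's actual cql_server::response or the real CQL frame flag values):

```cpp
#include <cstdint>

// Illustrative flag bits for the sketch only.
constexpr std::uint8_t FLAG_COMPRESSED = 0x01;
constexpr std::uint8_t FLAG_TRACING    = 0x02;

struct response {
    std::uint8_t flags = 0;

    // OR the bit in when the corresponding feature is enabled, instead
    // of keeping separate per-feature state (e.g. a tracing pointer).
    void set_flag(std::uint8_t bit) { flags |= bit; }
    bool has_flag(std::uint8_t bit) const { return (flags & bit) != 0; }
};
```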

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 14:31:54 -04:00
Takuya ASADA
da55aecca3 dist: add conflict with Cassandra
Cassandra and Scylla cannot be installed on a single instance, so add
cassandra to 'Conflicts'.

Fixes #2157

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1494856314-9322-1-git-send-email-syuu@scylladb.com>
2017-05-16 19:18:27 +03:00
Gleb Natapov
c7ad3b9959 database: remove temporary sstables sequentially
The code that removes each sstable runs in a thread. Removing many
sstables in parallel may start many threads, each of which takes 128k
for its stack. There is not much benefit in running deletion in
parallel anyway, so fix it by deleting sstables sequentially.

Fixes #2384

Message-Id: <20170516103018.GQ3874@scylladb.com>
2017-05-16 15:06:10 +03:00
Tomasz Grabiec
bdf3c536aa tests: mutation_source_test: Test forwarding in single-key readers 2017-05-16 13:36:10 +02:00
Tomasz Grabiec
e07cc44af2 sstables: Remove unused code 2017-05-16 13:31:01 +02:00
Tomasz Grabiec
0e23f8aa9b sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
When a mutation read enters the partition using the index, the
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again, as that may
override the user's fast_forward_to() request.
2017-05-16 13:31:01 +02:00
Tomasz Grabiec
a1dea3c4fc sstables: Fix verify_end_state() to tolerate ATOM_START_2 state
We would be in that state if consume_row_start() returns proceed::yes
and the stream ends after that. This can happen if slicing using the
promoted index determined that there are no cells in the partition in
the range.
2017-05-16 13:31:01 +02:00
Raphael S. Carvalho
706ce5a27b sstables: do not swallow system error exception in read_simple
If the error code is different from ENOENT, the exception is swallowed.
That can lead to a variety of problems down the road.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170515225309.19185-1-raphaelsc@scylladb.com>
2017-05-16 08:47:34 +02:00
Alexys Jacob
9ddc05899d Fix scylla-housekeeping version detection to work with newer setuptools
Newer setuptools parse_version() doesn't like dashed version strings,
so we should trim them to avoid false-negative version_compare() checks.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170511162646.22129-1-ultrabug@gentoo.org>
2017-05-15 12:41:49 +03:00
Gleb Natapov
385645e8df storage_proxy: Fix mutation logging
Log mutation type only if mutation set is not empty.

Message-Id: <20170510142406.GA30426@scylladb.com>
2017-05-11 15:49:52 +01:00
Tomasz Grabiec
7b6be7e188 row_cache: Add missing propagation of the forwarding flag in handle_large_partition()
Message-Id: <1494503145-25622-1-git-send-email-tgrabiec@scylladb.com>
2017-05-11 15:47:19 +01:00
Vlad Zolotarov
a855e82eff service::client_state: don't allow dropping the system_auth and system_traces objects
Prevent the accidental dropping of system_auth and system_traces objects (keyspaces and tables)
but allow their modification (including tables).

We need to be able to modify keyspaces in order to set/modify the replication strategy and its parameters.
We need to be able to ALTER the tables in order to allow rolling upgrades when some of the tables have changed.

Fixes #2346
Fixes #2338

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1494363335-20424-1-git-send-email-vladz@scylladb.com>
2017-05-11 13:03:30 +01:00
Tomasz Grabiec
0351ab8bc6 row_cache: Fix undefined behavior in read_wide()
_underlying is created with _range, which is captured by
reference. But range_and_underlying_reader is moved after being
constructed by do_with(), so the _range reference is invalidated.

Fixes #2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>
2017-05-11 09:43:43 +01:00
Duarte Nunes
a69039df03 tests/batchlog_manager_test: Fix failure
Since a9f6e5f8da, metrics can't be
duplicated. This patch works around that by not creating a new
batchlog_manager (one is already created by the cql_test_env).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170510191047.6154-1-duarte@scylladb.com>
2017-05-11 08:28:08 +02:00
Duarte Nunes
ec35cc33f1 update_statement: Reject empty values for dense clustering key
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 19:54:42 +02:00
Duarte Nunes
03f765c468 modification_statement: Fix detection of clustering keys
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 19:54:42 +02:00
Duarte Nunes
d7701087af cql3/restrictions/statement_restrictions: Consider statement type
Now that update_statement uses statement_restrictions, we need our
validation logic to take the statement type into account, in
particular to deal with insertion statements which only set static
columns but specify clustering values.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 19:54:42 +02:00
Duarte Nunes
c2041753c9 cql3/statements/modification_statement: Extract statement_type
This patch extracts the statement_type into its own file. The type
will be later passed to statement_restrictions for validation
purposes.

Further along, we could add methods to it that currently live in other
statements so we can move more validation into statement_restrictions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 19:54:42 +02:00
Calle Wilund
c8f92536c1 legacy_schema_migrator: Actually truncate legacy schema tables on finish 2017-05-10 16:44:48 +00:00
Calle Wilund
3514123677 database: Extract "remove" from "drop_columnfamily" 2017-05-10 16:44:48 +00:00
Calle Wilund
66991a7ccb v3 schema test fixes 2017-05-10 16:44:48 +00:00
Duarte Nunes
6260f31e08 thrift: Update CQL mapping of static CFs
This patch updates the mapping of static CFs so that their CQL
representation is a non-compound, non-dense schema with static
columns, instead of regular ones. This matches the representation of
static CFs in Cassandra 3.x.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 16:44:48 +00:00
Calle Wilund
6c8b5fc09d schema_tables: Use v3 schema tables and formats
Switches system/schema_* to system_schema/*, and updates schema/schema
builder and their users to hold/expect v3-style info (i.e. types & dropped).
2017-05-10 16:44:48 +00:00
Calle Wilund
f9b83e299e type_parser: Origin expects empty string -> bytes_type 2017-05-10 16:44:48 +00:00
Calle Wilund
97c54d254b cf_prop_defs: Add crc_check_chance as recognized (even if we don't use) 2017-05-10 16:44:48 +00:00
Calle Wilund
3d90152dc5 types_test: v3 style schemas enforce explicit "frozen" in tuples/user types 2017-05-10 16:44:48 +00:00
Calle Wilund
7969a156d5 cql3_type: v3 to_string 2017-05-10 16:44:48 +00:00
Calle Wilund
c572a8c83c cql_types: Introduce cql3_type::empty and associate with empty data_type 2017-05-10 16:44:48 +00:00
Calle Wilund
0e6ae8dec2 schema: rename column accessors to be in line with origin
More pointedly: expose columns as-is (currently
all_columns_in_select_order), and expose the name->column mapping
under a more appropriate name.

Renaming like this is not strictly necessary, but there is a point to
trying to keep nomenclature similar-ish with origin, esp. when the
select-order columns need to become filtered (spoiler alert).
2017-05-10 16:44:48 +00:00
Calle Wilund
d2dc7898aa schema: Add "is_static_compact_table" 2017-05-10 16:44:48 +00:00
Calle Wilund
1c328a4166 schema_builder: Add helper to generate unique column names akin origin 2017-05-10 16:44:48 +00:00
Duarte Nunes
5387ac98f8 schema: Add utility functions for static columns
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 16:44:48 +00:00
Duarte Nunes
0439d83d1e schema: Use heterogeneous comparator for columns bounds
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 16:44:47 +00:00
Calle Wilund
b1c5447ab5 cql3_type_parser: Resolve from cql3 names/expressions
Cassandra 3 uses cql names for column/field types, thus
we need to parse these out-of-line and resolve them in a
manner more akin to the cql parser.

Also wrap building user types similarly to origin, using
a "builder" wrapper, and usage graph resolving.
2017-05-10 16:44:47 +00:00
Calle Wilund
fcfea4c121 cql3_type: Add "prepare_internal" and "references_user_type"
Allows localized use of cql type + parsing + resolving
2017-05-10 16:44:47 +00:00
Calle Wilund
8b3e7bbe05 cql3::cql3_type: Add prepare_internal path using only "local" holders 2017-05-10 16:44:47 +00:00
Calle Wilund
2f791f5c3d cql3_type: Add virtual destructor.
It should be there.
2017-05-10 16:44:47 +00:00
Calle Wilund
48ddcbb77b database/main: encapsulate system CF dir touching 2017-05-10 16:44:47 +00:00
Calle Wilund
9eb91bc30b main: Add legacy schema migration to startup 2017-05-10 16:44:47 +00:00
Calle Wilund
3964055d98 legacy_schema_migrator: Add schema table converter
Initial. Does not actually write anything.
2017-05-10 16:44:47 +00:00
Paweł Dziepak
ba6b74e305 storage_service: counters are no longer experimental
Message-Id: <20170510124552.23558-1-pdziepak@scylladb.com>
2017-05-10 17:18:32 +03:00
Gleb Natapov
ab92406585 storage_proxy: optimize reconcile logic for CL=ONE
A regular single-key query will never reconcile with CL=ONE since there
will be no digest mismatch, but range queries do not have a digest stage,
so they always go through the reconcile code. For CL=ONE there will be only
one result though, so there is no need to run the complicated reconciliation
logic, and the only result can be returned directly.
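The shape of that optimization is just an early return when there is a single replica result. A simplified stand-in for the storage_proxy logic (hypothetical code, with ints standing in for partitions):

```cpp
#include <algorithm>
#include <vector>

// Simplified stand-in for reconciliation: with CL=ONE there is exactly
// one replica result, so return it directly; otherwise merge and dedup.
std::vector<int> reconcile(std::vector<std::vector<int>> results) {
    if (results.size() == 1) {
        return std::move(results.front());   // CL=ONE fast path
    }
    std::vector<int> merged;
    for (auto& r : results) {
        merged.insert(merged.end(), r.begin(), r.end());
    }
    std::sort(merged.begin(), merged.end());
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
    return merged;
}
```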

Message-Id: <20170509100334.GQ28272@scylladb.com>
2017-05-10 17:09:34 +03:00
Pekka Enberg
b63d33526d cql3: Fix variable_specifications class get_partition_key_bind_indexes()
The "_specs" array contains column specifications that have the bind
marker name if there is one. That results in
get_partition_key_bind_indices() not being able to look up a column
definition for such columns. Fix the issue by keeping track of the
actual column specifications passed to add() like Cassandra does.

Fixes #2369
Message-Id: <1494397358-24795-1-git-send-email-penberg@scylladb.com>
2017-05-10 12:38:18 +03:00
Calle Wilund
8066efb710 system_keyspace: Add getter/setter for built index status
Even though we have none.
2017-05-09 13:48:55 +00:00
Calle Wilund
061ef16562 system_tables/schema_tables: Remove special format case of "execute_cql"
Having a variadic parameter used in an implicit sprint is not
very readable, and it becomes less intuitive when the system keyspace
suddenly becomes more than one -> multiple sprints in the chain -> more
confusion or more execution paths.
It's not that horrible with some spread-out sprints.
2017-05-09 13:48:55 +00:00
Calle Wilund
f5fcadf0b1 schema: Add "as_cql_string" for column_def + quote-wrapper 2017-05-09 13:48:55 +00:00
Calle Wilund
cb7ee98217 json: Add convenience ability to generate unordered_maps 2017-05-09 13:48:55 +00:00
Calle Wilund
e960724724 caching_options: Add from/to map methods 2017-05-09 13:48:55 +00:00
Calle Wilund
2e1c23f2f2 database: Relax rp ordering check to allow non-commitlog mutations
Allow replay to come after certain operations, such as schema migration.
2017-05-09 13:48:55 +00:00
Calle Wilund
27fdc5cfef schema_tables/system_tables: Add v3 tables to "ALL" and handle in init
I.e. deal with more than one keyspace in system_keyspace::make
2017-05-09 13:48:55 +00:00
Calle Wilund
afcf0372df cql3::untyped_result_set: Add more getter methods 2017-05-09 13:48:55 +00:00
Calle Wilund
539b65fc90 client_state: Make "has_access" auth check schema ks name independent 2017-05-09 13:48:55 +00:00
Calle Wilund
815aa8ba9f schema_tables: Add schema definitions for v3 tables 2017-05-09 13:48:55 +00:00
Calle Wilund
4378dca6e1 schema_tables: Hide/abstract schema keyspace name 2017-05-09 13:48:55 +00:00
Calle Wilund
2fb36e3bf8 system_keyspace: Add query overloads with named keyspace 2017-05-09 13:48:55 +00:00
Calle Wilund
32909d4c84 system_keyspace: Add v3+legacy schema definitions 2017-05-09 13:48:55 +00:00
Calle Wilund
b522b2bf22 Merge branch 'master' of https://github.com/scylladb/scylla 2017-05-09 13:48:47 +00:00
Avi Kivity
8af2b7c418 transport: honor the skip_metadata flag
Reduces processing overhead and network traffic.

We can't use the NO_METADATA flag in the metadata object, because this
is a request attribute; different executions of the same prepared statement
can have different settings for skip_metadata.
Message-Id: <20170419175145.19766-1-avi@scylladb.com>
2017-05-09 14:52:03 +03:00
Tomasz Grabiec
e56711a54d sstables: mutation_reader: Avoid reading index when restrictions cover whole partition
The check for is_static_row() used to be enough, but it no longer is
after optimization made in commit 3e06065, which avoids reading the
static row.
Message-Id: <1494241164-25810-1-git-send-email-tgrabiec@scylladb.com>
2017-05-09 11:03:18 +01:00
Pekka Enberg
5b931268d4 cql3: Move variable_specifications implementation to source file
Move the class implementation to source file to reduce the need to
recompile everything when the implementation changes...

Message-Id: <1494312003-8428-1-git-send-email-penberg@scylladb.com>
2017-05-09 12:44:18 +03:00
Calle Wilund
c30e515c70 Merge branch 'master' of https://github.com/scylladb/scylla 2017-05-09 09:29:27 +00:00
Duarte Nunes
65d96421da tests/sstable_datafile_test: Fix regression
This patch fixes a regression introduced in 9e88b60, where the wrong
clustering key was being specified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170509091621.2682-1-duarte@scylladb.com>
2017-05-09 12:18:47 +03:00
Gleb Natapov
2d5a7c8058 storage_proxy: make read repair stats accessible through Prometheus
Currently they can be read only through JMX.

Message-Id: <20170509075546.GN28272@scylladb.com>
2017-05-09 11:23:38 +03:00
Calle Wilund
780a7c8641 Merge branch 'master' of https://github.com/scylladb/scylla 2017-05-08 15:39:46 +00:00
Avi Kivity
8c5c5d3004 Merge "CQL front-end for secondary indices" from Pekka
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."

* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
  schema: Kill index_type enum
  schema: Kill index_info class
  cql3/statements/create_index_statement: Use database::existing_index_names() in validation
  cql3/statements: Use secondary index manager in alter_table_statement class
  index: Add secondary_index_manager
  thrift/handler: Use index_metadata
  db/schema_tables: Index persistence
  schema: Add all_indices() to schema class
  schema: Remove add_default_index_names() from schema_builder class
  db/schema_tables: Add system table for indices
  cql3/Cql.g: DROP INDEX
  cql3/statements: Add drop_index_statement class
  database: Add find_indexed_table() to database class
  cql3: Return change event from announce_migration()
  cql3/statements: Multiple index targets for CREATE INDEX
  cql3/statements: Use index_metadata in create_index_statement class
  cql3/statements: Use feature flag in create_index_statement class
  service/storage_service: Add feature flag for secondary indices
  database: Add get_available_index_name() to database class
  schema: Add get_default_index_name() to index_metadata class
  ...
2017-05-08 17:04:40 +03:00
Calle Wilund
2049303399 query_pagers: bugfix: must count pk only/pk + static rows as 1
Previously, only clustered/regular rows were counted.
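The counting rule the fix enforces — a partition with only a key or static row still contributes one row to the page — can be sketched as (a hypothetical helper, not the actual query_pagers code):

```cpp
// A partition that has no clustering rows but does have a static row
// (or just a partition key) must still count as one row for paging.
int rows_for_paging(int clustering_rows, bool has_static_or_key_only_row) {
    if (clustering_rows == 0 && has_static_or_key_only_row) {
        return 1;
    }
    return clustering_rows;
}
```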

Message-Id: <1494249013-4069-1-git-send-email-calle@scylladb.com>
2017-05-08 16:35:27 +03:00
Pekka Enberg
dfee4d2bb0 cql3: Fix partition key bind indices for prepared statements
Fix the CQL front-end to populate the partition key bind index array in
result message prepared metadata, which is needed for CQL binary
protocol v4 to function correctly.

Fixes #2355.

Message-Id: <1494247871-3148-1-git-send-email-penberg@scylladb.com>
2017-05-08 16:33:17 +03:00
Calle Wilund
a03d54d9f8 Merge branch 'master' of https://github.com/scylladb/scylla 2017-05-08 11:28:26 +00:00
Pekka Enberg
35bb6dedd8 schema: Kill index_type enum 2017-05-08 10:19:34 +03:00
Pekka Enberg
06564afedb schema: Kill index_info class
It's no longer used. Indices are managed by the index_metadata class.
2017-05-08 10:19:34 +03:00
Pekka Enberg
3f27d12e99 cql3/statements/create_index_statement: Use database::existing_index_names() in validation 2017-05-08 10:19:34 +03:00
Pekka Enberg
b87679821c cql3/statements: Use secondary index manager in alter_table_statement class 2017-05-08 10:03:28 +03:00
Pekka Enberg
4b4e4e6878 index: Add secondary_index_manager 2017-05-08 10:03:28 +03:00
Pekka Enberg
94bc031ca7 thrift/handler: Use index_metadata 2017-05-08 10:03:28 +03:00
Pekka Enberg
11474ed4c6 db/schema_tables: Index persistence 2017-05-08 10:03:28 +03:00
Avi Kivity
9e67bd5aac Merge " Add partial range deletion support" from Duarte
"This series introduces partial support for range deletions. This allows
deletion operations such as

delete from cf where p=1 and c > 0 and c <= 3.

This series only adds support for single-column range restrictions.

We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom or top bound.

We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.

This limitation should stay in place until we can properly represent range
tombstones in the storage format."

* 'range-deletions/v2' of https://github.com/duarten/scylla:
  mutation: Set cell using clustering_key_prefix
  mutation_partition: Harmonize apply_delete overloads
  prefix_compound_view_wrapper: Add is_full and is_empty functions
  tests/cql_query_test: Add range deletion tests
  cql3: Partially support ranged deletions
  single_column_primary_key_restrictions: Implement has_bound()
  modification_statement: Use statement_restrictions for where clause
  statement_restrictions: Expose primary key restrictions
  to_string: Add missing include
2017-05-07 19:27:09 +03:00
Takuya ASADA
b574100075 dist/common/scripts/scylla_selinux_setup: keep symlink on /etc/sysconfig/selinux
The current script has a bug that overwrites the symlink at
/etc/sysconfig/selinux with a real file; because of this, the script
is not able to disable SELinux.

So keep symlink after modifying the file.

Fixes #2279

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493663263-10573-1-git-send-email-syuu@scylladb.com>
2017-05-07 17:30:05 +03:00
Tomer Sandler
9a1aa6c1d3 node_health_check: Major rework
This is a folded version of the following rework on the node health
check script:

  - Added support for non-default cql + nodetool ports

  - Script will not exit if either Scylla-server / Scylla-jmx / Both
    services are not up and running. It will alert the user about it and
    which output cannot be collected, but continue collecting everything
    else.

  - Removed lshw installation and non-needed use in commands

  - Script supports RHEL/CentOS/Ubuntu14/Ubuntu16/Debian (tested on all
    beside Debian, should behave the same as Ubuntu14/16)

  - All Indentation issues fixed -> using only tab (no spaces)
    consistently.

  - >> vs. >  was fixed as well in the needed places.

  - Changes the ${VAR_NAME} instances to $VAR_NAME, and kept the {} only
    where needed.

  - Check Scylla service as Vlad recommended using 'ps -C'

  - Fixed the CQL not listening error message.

  - Added Sanity check if script is attempted to run on non-Fedora and
    non-Debian OS -> alert the user and exit.

  - Removed the MANUAL CHECK LIST section (moved to Google Forms)

  - Added date in head of the report.

  - Removed text from Report's "PURPOSE" section, which was referring to
    the "MANUAL CHECK LIST" (not needed anymore).

[ penberg: Fold into a single commit and add proper license. ]
Acked-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1493900076-29170-1-git-send-email-penberg@scylladb.com>
2017-05-06 08:38:12 +03:00
Paweł Dziepak
798cfbc68f Merge "Fixes for gcc 7" from Avi
"gcc 7 doesn't like some of our code, so adjust to make it happy."

* 'gcc7' of http://github.com/avikivity/scylla:
  Remove exception specifications
  commitlog: handle noexcept conflict between unlink and function object
  thrift: change generated code namespace
2017-05-05 15:42:56 +01:00
Avi Kivity
a592573491 Remove exception specifications
C++17 removed exception specifications from the language, and gcc 7 warns
about them even in C++14 mode.  Remove them from the code base.
2017-05-05 17:02:31 +03:00
Avi Kivity
5278e1a14d commitlog: handle noexcept conflict between unlink and function object
::unlink is declared as noexcept, but the function object it is passed into
is not.  gcc 7 warns, so wrap ::unlink in a lambda to make it happy.
2017-05-05 17:02:30 +03:00
Avi Kivity
d542cdddf6 thrift: change generated code namespace
org::apache::cassandra (the generated namespace name) gets confused with
apache::cassandra (the thrift runtime library namespace), either due to
changes in gcc 7 or in thrift 0.10.  Either way, the problem is fixed
by changing the generated namespace to plain cassandra.
2017-05-05 05:26:20 +03:00
Paweł Dziepak
c9470b5c94 Merge "Fix abort in advance_and_check_if_present()" from Tomasz
"Fixes abort which happens when making a single-key query for a key which
is after all keys present in the sstable."

* 'tgrabiec/fix-abort-in-index-reader' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Add test cases for single-key out of range reads
  sstables: index_reader: Remove redundant function
  sstables: index_reader: Fix abort in advance_and_check_if_present()
2017-05-04 17:19:47 +01:00
Duarte Nunes
9e88b60ef5 mutation: Set cell using clustering_key_prefix
Change the clustering key argument in mutation::set_cell from
exploded_clustering_prefix to clustering_key_prefix, which allows for
some overall code simplification and fewer copies. This mostly affects
the cql3 layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
db63ffdbb4 mutation_partition: Harmonize apply_delete overloads
This patch ensures the different mutation_partition::apply_delete()
overloads behave similarly, so that, for example, an empty clustering
key is treated the same way as an empty
exploded_clustering_key_prefix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
07e648251b prefix_compound_view_wrapper: Add is_full and is_empty functions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
ef138bdd2c tests/cql_query_test: Add range deletion tests
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
42873189d4 cql3: Partially support ranged deletions
This patch introduces partial support for range deletions. This allows
deletion operations such as

delete from cf where p=1 and c > 0 and c <= 3.

We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom or top bound.

We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.

This limitation should stay in place until we can properly represent range
tombstones in the storage format.
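The restriction that both range bounds be specified amounts to a check like the following (a hypothetical sketch, not the actual cql3 validation code; ints stand in for clustering values):

```cpp
#include <optional>

// A range restriction on a clustering column; nullopt models an
// infinite (unspecified) bound.
struct clustering_range {
    std::optional<int> start;
    std::optional<int> end;
};

// Reject open-ended range deletions, since the current sstable format
// cannot represent an infinite range-tombstone bound.
bool deletable(const clustering_range& r) {
    return r.start.has_value() && r.end.has_value();
}
```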

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
169cc41251 single_column_primary_key_restrictions: Implement has_bound()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:49 +02:00
Duarte Nunes
f7bc88734a modification_statement: Use statement_restrictions for where clause
This patch replaces the custom where clause processing by adding and
using a statement_restrictions field to modification_statement.

This improves code reuse and also moves some checks to prepare-time.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:49 +02:00
Duarte Nunes
aff23f93b4 statement_restrictions: Expose primary key restrictions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:49 +02:00
Duarte Nunes
8b7d7c4e6d to_string: Add missing include
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:49 +02:00
Tomasz Grabiec
e71771d019 tests: mutation_source_test: Add test cases for single-key out of range reads 2017-05-04 14:59:08 +02:00
Tomasz Grabiec
297b4b0cf5 sstables: index_reader: Remove redundant function 2017-05-04 14:59:08 +02:00
Tomasz Grabiec
ec45f1e51d sstables: index_reader: Fix abort in advance_and_check_if_present()
Happens when the key is missing and is after all keys in the sstable.

Fixes #2345.
2017-05-04 14:59:08 +02:00
Pekka Enberg
25e2777344 schema: Add all_indices() to schema class 2017-05-04 14:59:12 +03:00
Pekka Enberg
830591b092 schema: Remove add_default_index_names() from schema_builder class
The add_default_index_names() is part of the old and incomplete
secondary index implementation in Scylla. Drop it as it's no longer
used.
2017-05-04 14:59:12 +03:00
Pekka Enberg
8b943c0ceb db/schema_tables: Add system table for indices 2017-05-04 14:59:12 +03:00
Pekka Enberg
af9015d0d0 cql3/Cql.g: DROP INDEX 2017-05-04 14:59:12 +03:00
Pekka Enberg
a4ee9f9fa1 cql3/statements: Add drop_index_statement class 2017-05-04 14:59:12 +03:00
Pekka Enberg
f26b8d7afb database: Add find_indexed_table() to database class 2017-05-04 14:59:12 +03:00
Pekka Enberg
14391a8ec8 cql3: Return change event from announce_migration()
This changes announce_migration() to return a change event directly in
the schema_altering_statement base class. It's needed for the drop index
statement, which does not know the keyspace or column family until it
looks them up based on the index. The two-stage approach of announcing a
migration and then creating the change event won't work because, in the
latter stage, the lookup will fail. The same change to
announce_migration() has been applied in Apache Cassandra.
2017-05-04 14:59:12 +03:00
Pekka Enberg
82394debe6 cql3/statements: Multiple index targets for CREATE INDEX 2017-05-04 14:59:12 +03:00
Pekka Enberg
fe315bd31a cql3/statements: Use index_metadata in create_index_statement class 2017-05-04 14:59:12 +03:00
Pekka Enberg
651af0f45a cql3/statements: Use feature flag in create_index_statement class 2017-05-04 14:59:12 +03:00
Pekka Enberg
815c91a1b8 service/storage_service: Add feature flag for secondary indices 2017-05-04 14:59:11 +03:00
Pekka Enberg
930fa79aff database: Add get_available_index_name() to database class 2017-05-04 14:59:11 +03:00
Pekka Enberg
ef29520c8e schema: Add get_default_index_name() to index_metadata class 2017-05-04 14:59:11 +03:00
Pekka Enberg
c6e7d4484a database: Make existing_index_names() per-keyspace operation 2017-05-04 14:59:11 +03:00
Pekka Enberg
8c729f0f5f database: Rewrite existing_index_names() to use new index metadata 2017-05-04 14:59:11 +03:00
Pekka Enberg
4391faaf45 cql3/statements: Add constants to index_target 2017-05-04 14:59:11 +03:00
Pekka Enberg
546d1e47dd cql3/statements: Add as_cql_string() to index_target class 2017-05-04 14:59:11 +03:00
Pekka Enberg
56cca3b0d6 cql3: Add to_cql_string() to column_identifier class 2017-05-04 14:59:11 +03:00
Pekka Enberg
58b90655d2 cql3/statements: Add to_string(target_type) 2017-05-04 14:59:11 +03:00
Pekka Enberg
1f5a52d03f cql3/statements: Use namespaces in index_target.cc file 2017-05-04 14:59:11 +03:00
Pekka Enberg
5e8f2f49c3 schema: Add indices() to schema class 2017-05-04 14:59:11 +03:00
Pekka Enberg
5abd4b8041 schema: Add has_index() to schema class 2017-05-04 14:59:11 +03:00
Pekka Enberg
1fb1828aa2 schema: Add index_names() to schema class 2017-05-04 14:59:11 +03:00
Pekka Enberg
1c1a767408 schema: Add find_index_noname() to schema class
This adds a find_index_noname() helper to the schema class, which
searches for an index that is otherwise equal but ignores the index
name in the comparison. This is needed for CREATE INDEX to reject
duplicate index creation.
2017-05-04 14:59:11 +03:00
Pekka Enberg
62fba73a05 schema_builder: Add index_metadata support 2017-05-04 14:59:11 +03:00
Pekka Enberg
05e12a1d2b schema: Add index_metadata maps to raw_schema class 2017-05-04 13:22:12 +03:00
Raphael S. Carvalho
ddc1d80c28 compaction: remove dead function declaration
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170504013046.23522-2-raphaelsc@scylladb.com>
2017-05-04 11:48:51 +03:00
Raphael S. Carvalho
61229ab88c compaction: fix type for cleanup
After the compaction revamp, the compaction type set by cleanup in its
ctor is overwritten in compaction::setup(). Consequently, cleanup
would not be stopped by 'nodetool stop cleanup' and would be listed as
a regular compaction in 'nodetool compactionstats'.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170504013046.23522-1-raphaelsc@scylladb.com>
2017-05-04 11:48:50 +03:00
Avi Kivity
211a337883 Merge seastar upstream
* seastar 194d80f...4a3118c (4):
  > execution_stage: fix wrong exception thrown for non-unique stages
  > metrics: add missing move assignment operators for metric_group, metric_groups
  > Remove unused lambda captures
  > core: lw_shared_ptr::get() should return nullptr for null pointer
2017-05-04 11:47:05 +03:00
Asias He
66e3b73b9c repair: Fix partition estimation
We estimate the number of partitions for a given range of a column family
and split the range into sub-ranges containing fewer partitions, each
used as a checksum unit.

The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.
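Conceptually, the fix replaces a local-shard read of the estimate with a sum across all shards. A minimal sketch (hypothetical, not the actual repair code):

```cpp
#include <numeric>
#include <vector>

// The range's partition estimate must aggregate every shard's estimate,
// not just the local shard's.
long estimated_partitions(const std::vector<long>& per_shard) {
    return std::accumulate(per_shard.begin(), per_shard.end(), 0L);
}
```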

Fixes #2299

Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
2017-05-03 16:25:45 +03:00
Pekka Enberg
1e04731fa0 Merge "gossip mark alive fixes" from Asias
"This series fixes the use-after-free issue in gossip and eliminates the
duplicated / unnecessary mark-alive operations.

Fixes #2341"

* tag 'asias/gossip_fix_mark_alive/v1' of github.com:cloudius-systems/seastar-dev:
  gossip: Ignore callbacks and mark alive operation in shadow round
  gossip: Ignore the duplicated mark alive operation
  gossip: Fix use after free in mark_alive
2017-05-03 12:19:16 +03:00
Jacob Johansen
9616956c16 dist/docker: Add support for experimental flag
Fixes #2188

Message-Id: <20170502180047.24071-1-jacob.johansen@virginpulse.com>
2017-05-03 10:29:55 +03:00
Asias He
3bd9840c01 gossip: Ignore callbacks and mark alive operation in shadow round
In shadow round, we are only interested in the peer's endpoint_state, e.g., gossip
features, host_id, tokens. There is no need to call the on_restart or on_join
callbacks or to go through the mark alive procedure with the EchoMessage gossip
message. We will do that during normal gossip runs anyway.
2017-05-03 07:24:21 +08:00
Asias He
1441ae5cac gossip: Ignore the duplicated mark alive operation
If a node is already being marked as alive with EchoMessage, ignore further
duplicated mark alive operations.
2017-05-03 07:24:21 +08:00
Asias He
d682fbfa28 gossip: Fix use after free in mark_alive
After sending the echo message, the node might no longer be in the
endpoint_state_map, so using the reference to local_state
might cause a use after free.

Fixes #2341
2017-05-03 07:24:20 +08:00
Raphael S. Carvalho
8b0e358d73 tests/sstable_test: fix release-mode compaction_manager_test
In release mode, the compaction task is active right after the request is
submitted, because a ready future may be scheduled immediately.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170502171925.9893-1-raphaelsc@scylladb.com>
2017-05-02 20:48:30 +03:00
Calle Wilund
a37d03cd1d transport::server: ignore socket shutdown future results
These will, as of late, always be ready. Remove the future usage
in preparation for changing the API signature to void(*)() (i.e. to prevent
breakage on a seastar update).
2017-05-02 15:08:47 +00:00
Avi Kivity
7e29dd7066 managed_bytes: improve alignment hygiene
While blob_storage is marked as an unaligned type, the back references also
point to an unaligned type (a pointer to blob_storage), since a back
reference can live in a blob_storage.  This triggers errors from zapcc/clang 4.

Fix by creating a type for the reference, which is marked as unaligned.
Message-Id: <20170502071404.507-1-avi@scylladb.com>
2017-05-02 10:04:13 +01:00
Pekka Enberg
2f83232a02 schema: Add index_metadata class 2017-05-02 10:29:18 +03:00
Avi Kivity
b46f6a4124 build: ignore unused lambda capture warnings from clang
Worthwhile to revisit later.
2017-05-02 10:09:58 +03:00
Raphael S. Carvalho
8dfb5f9c33 tests/sstable_test: fix compaction_manager_test
After 'compaction: make major compaction go through compaction manager',
the test fails because the task is preempted in debug mode before it reaches
the instruction that increments the stat.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170501183255.6191-1-raphaelsc@scylladb.com>
2017-05-02 09:06:41 +03:00
Avi Kivity
1d12d69881 logalloc: define segment_zone::maximum_size
Yields build errors with some compilers if missing.
2017-05-01 16:31:29 +03:00
Amnon Heiman
b59c95359d scylla_setup: Fix conditional when checking for newer version
While changing the way housekeeping checks for a newer version
and warns about it during installation, the UUID part was removed but kept
in the surrounding if.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170426075724.7132-1-amnon@scylladb.com>
2017-05-01 12:13:35 +03:00
Raphael S. Carvalho
3071b9052a compaction: make cleanup_compaction inherit from regular_compaction
Some fields that belong to regular and cleanup compaction aren't needed for
resharding_compaction, such as the incremental selector (which is used
for determining the max purgeable timestamp for a given decorated key).
Better to move those fields to regular and make cleanup inherit from
regular compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428195611.9196-1-raphaelsc@scylladb.com>
2017-04-30 19:37:09 +03:00
Raphael S. Carvalho
687a4bb0c2 dtcs: do not compact fully expired sstable whose ancestor is not deleted yet
Currently, a fully expired sstable[1] is unconditionally chosen for compaction
by DTCS, but that may lead to a compaction loop under certain conditions.

Let's consider that an almost expired sstable is compacted, and it's not
deleted yet, and that the new sstable becomes expired before its ancestor is
deleted.
Because this new sstable is expired, it will be chosen by DTCS, but it will
not be purged because 'compacted undeleted' sstables are taken into account
by the calculation of the max purgeable timestamp, which prevents expired data
from being purged. The problem is that this sequence of events can keep happening
forever, as reported by issue #2260.
NOTE: This problem was easier to reproduce before the improvement to compaction
of expired cells, because a fully expired sstable was being converted into an
sstable full of tombstones, which is also considered fully expired.

Fixes #2260.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>
2017-04-30 19:35:46 +03:00
Paweł Dziepak
24f4dcf9e4 db: make virtual dirty soft limit configurable
Message-Id: <20170428150005.28454-1-pdziepak@scylladb.com>
2017-04-30 19:17:22 +03:00
Avi Kivity
248aa4fc23 Merge "Fix update of counter in static rows" from Paweł
"The logic responsible for converting counter updates to counter shards was
not covered by unit tests and didn't transform counter cells inside static
rows.

This series fixes the problem and makes sure that the tests cover both
static rows and transformation logic."

* tag 'pdziepak/static-counter-updates/v1' of github.com:cloudius-systems/seastar-dev:
  tests/counter: test transform_counter_updates_to_shards
  tests/counter: test static columns
  counters: transform static rows from updates to shards
2017-04-30 19:13:44 +03:00
Avi Kivity
339322517e Merge "sstables: index_reader: Fix advance_to() to include relevant range tombstones" from Tomasz
"Fixes #2326."

* 'tgrabiec/fix-range-tombstones-missing-when-slicing' of github.com:cloudius-systems/seastar-dev:
  tests: mutation_source_test: Cover single-ranged queries in test_streamed_mutation_slicing_returns_only_relevant_tombstones()
  tests: mutation_source_test: Add test for slicing of clustered rows
  tests: mutation_reader_assertions: Log expectations
  tests: mutation_reader_assertions: Add produces_eos_or_empty_mutation()
  tests: sstables: Use read_row() for single-key reads
  tests: sstables: Test more configurations of sstable writer in test_sstable_conforms_to_mutation_source()
  sstables: Improve logging
  sstables: index_reader: Fix advance_to() to include relevant range tombstones
2017-04-30 14:40:41 +03:00
Avi Kivity
831ee80c3c tests: workaround older boost::apply_visitor requiring a result_type member
Older versions of boost::apply_visitor require a result_type member for the
visitor; supply it to make them happy.

Fixes #2312.
2017-04-30 13:56:44 +03:00
Takuya ASADA
a19c1b7f86 dist/redhat: add missing dependencies for Fedora
We only have "%{?rhel:Requires}" for scylla-server; we need a Fedora one as well.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493314367-419-1-git-send-email-syuu@scylladb.com>
2017-04-30 11:06:27 +03:00
Takuya ASADA
fe9f72d2c0 dist/debian: add python3-pyudev to dependencies
pyudev is required for seastar/scripts/perftune.py.

Fixes #2315

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493309116-18074-1-git-send-email-syuu@scylladb.com>
2017-04-30 11:05:15 +03:00
Paweł Dziepak
f5cf86484e lsa: introduce upper bound on zone size
Attempting to create huge zones may introduce significant latency. This
patch introduces the maximum allowed zone size so that the time spent
trying to allocate and initialise a zone is bounded.

Fixes #2335.

Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>
2017-04-30 10:58:11 +03:00
Paweł Dziepak
5c302cf67b tests/counter: test transform_counter_updates_to_shards 2017-04-28 16:29:34 +01:00
Paweł Dziepak
0473750056 tests/counter: test static columns 2017-04-28 16:29:34 +01:00
Paweł Dziepak
0ffdd8d3d0 counters: transform static rows from updates to shards 2017-04-28 16:29:34 +01:00
Tomasz Grabiec
d4df6e214e tests: mutation_source_test: Cover single-ranged queries in test_streamed_mutation_slicing_returns_only_relevant_tombstones() 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
22cce52dff tests: mutation_source_test: Add test for slicing of clustered rows 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
86b693f562 tests: mutation_reader_assertions: Log expectations 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
ece6e107cc tests: mutation_reader_assertions: Add produces_eos_or_empty_mutation() 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
6354acc1a2 tests: sstables: Use read_row() for single-key reads
So that as_mutation_reader() will create the same kind of reader which
database::make_sstable_reader() does.

Before this change, all readers were range readers.
2017-04-27 18:43:49 +02:00
Tomasz Grabiec
fd5dbe04b5 tests: sstables: Test more configurations of sstable writer in test_sstable_conforms_to_mutation_source()
Test different versions of the format, and different promoted index
block sizes.  The size of 1 is especially important, it will put each
fragment in a separate block, exposing various issues with promoted
index handling.
2017-04-27 18:43:49 +02:00
Tomasz Grabiec
c5baeed6d2 sstables: Improve logging 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
b523815ac1 sstables: index_reader: Fix advance_to() to include relevant range tombstones
Fixes #2326.
2017-04-27 18:43:49 +02:00
Glauber Costa
14b9aa2285 reduce kernel scheduler wakeup granularity
We set the scheduler wakeup granularity to 500usec, because that is the
difference in runtime we want to see from a waking task before it
preempts the running task (which will usually be Scylla). Scheduling
other processes less often is usually good for Scylla, but in this case,
one of the "other processes" is also a Scylla thread, the one we have
been using for marking ticks after we have abandoned signals.

However, there is an artifact of the Linux scheduler that causes those
preemptions to be missed if the wakeup granularity is exactly half of
the sched_latency. Our sched_latency is set to 1ms, which
represents the maximum time period in which we will run all runnable
tasks.

We want to keep the sched_latency at 1ms, so we will reduce the wakeup
granularity to something slightly lower than 500usec, to make sure
that this artifact won't affect the scheduler calculations. 499.99usec
would do according to my tests, but we will reduce it further to a round
number.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170427135039.8350-1-glauber@scylladb.com>
2017-04-27 18:11:35 +03:00
Pekka Enberg
9cfb94510f Merge "Fix issues found by PVS-Studio static analyzer" from Vlad
Fix issues found by PVS-Studio as reported by Phillip Khandeliants.

Merge branch 'pvs_analyzer_errors-v1' of github.com:cloudius-systems/seastar-dev

* 'pvs_analyzer_errors-v1' of github.com:cloudius-systems/seastar-dev:
  type_parser: catch exceptions by reference and not by value
  token_metadata::get_host_id(ep): add a missing 'throw'
2017-04-27 11:39:49 +03:00
Vlad Zolotarov
d5b76d5198 type_parser: catch exceptions by reference and not by value
Found by PVS-Studio static analyzer:

Type slicing. An exception should be caught by reference rather than by value.

Fixes #2288

Reported-by: Phillip Khandeliants
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-26 15:12:15 -04:00
Vlad Zolotarov
181c68e97d token_metadata::get_host_id(ep): add a missing 'throw'
Caught by PVS-Studio static analyzer:

The object was created but it is not being used. The 'throw' keyword could be missing: throw runtime_error(FOO);

Reported-by: Phillip Khandeliants
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-26 14:54:34 -04:00
Takuya ASADA
7a59336b8a main.cc: drop FS type check
Since we added support for ext4, we no longer need to limit the filesystem to XFS.

Fixes #1933

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493212525-26264-1-git-send-email-syuu@scylladb.com>
2017-04-26 17:35:55 +03:00
Raphael S. Carvalho
8bae413bcf database: fix format msg for sprint
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170425224920.16607-1-raphaelsc@scylladb.com>
2017-04-26 17:18:58 +03:00
Raphael S. Carvalho
f49bdb6839 compaction_manager: dont go on with major compaction if task was stopped
A column family which was truncated will remove itself from compaction
manager. Any task running a compaction should be interrupted and a
task waiting to run should bail out when it wakes up.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170425224350.15965-3-raphaelsc@scylladb.com>
2017-04-26 17:18:37 +03:00
Takuya ASADA
abf65cb485 dist/debian: skip tunables when kernel = 3.13.0-*-generic, to prevent kernel panic bug
There is kernel panic bug on kernel = 3.13.0-*-generic(Ubuntu 14.04), we have to skip tunables.

Fixes #1724

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493196636-25645-1-git-send-email-syuu@scylladb.com>
2017-04-26 11:54:11 +03:00
Vlad Zolotarov
a9ad762f47 docs: tracing.md: add a "how to get traces" chapter
This chapter describes how to get tracing information.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-25 21:52:29 -04:00
Vlad Zolotarov
f993e85b5f tracing::trace_keyspace_helper: introduce a time series helper tables
Introduce two time series helper tables that will simplify the querying
of traces.

One for querying regular traces:

CREATE TABLE system_traces.sessions_time_idx (
minute timestamp,
started_at timestamp,
session_id uuid,
PRIMARY KEY (minute, started_at, session_id))

and one for querying slow query records:

CREATE TABLE system_traces.node_slow_log_time_idx (
minute timestamp,
started_at timestamp,
session_id uuid,
start_time timeuuid,
node_ip inet,
shard int,
PRIMARY KEY (minute, started_at, session_id))

With these tables one may get the relevant traces like in an example below:

SELECT * from system_traces.sessions_time_idx where minute in ('2016-09-07 16:56:00-0700') and started_at > '2016-09-07 16:56:30-0700'

Fixes #2243

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-25 21:52:28 -04:00
Vlad Zolotarov
81bcc36b16 tracing: cleanup: use nullptr instead of trace_state_ptr()
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-25 21:52:28 -04:00
Vlad Zolotarov
b0f660331a tracing: introduce a span ID and parent span ID
This patch makes the tracing framework follow the general idea of Google's
Dapper paper: traces generated in a context of the same query are forming
a single-rooted acyclic tree where in a ScyllaDB case vertexes are spans running
on each involved replica Node and edges are RPCs sent from one Node to another.

   - Each vertex in the tree above has an ID - "span ID".
   - In order to be able to build the tree from the sessions traces we need
     to know the parent "span ID" - the ID of a span that sent an RPC that created
     the current span.
   - Each span of a tracing session is given a 64-bit random span ID.
   - The root span has a span_id::illegal_id value.

This patch adds:
   - The described above parent span ID and a span ID to the one_session_records
     object.
   - The current span ID is passed in the trace_info struct to the remote replica.
   - Add parent_id and span_id columns to system_traces.events table for the parent
     ID and span ID.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-25 21:52:23 -04:00
Tomasz Grabiec
92dba05f0d sstables: Fix malformed_sstable_exception from single-key reads
After 4742008b70, _read_partial_row is
never set, and we will fail here if the consumer exhausts the
range. That would be the case if the end bound of the slice aligns
with the end of the index page.

Fix by assuming that if we're out of range in the middle of partition,
we sliced.

Message-Id: <1493121249-18847-1-git-send-email-tgrabiec@scylladb.com>
2017-04-25 14:59:08 +03:00
Avi Kivity
628b3092e4 Merge "Reify shadowable tombstones" from Duarte
"This series introduces the row_tombstone class, which represents a
tombstone applied to a clustering row. It distinguishes itself from a
normal tombstone by the fact that it contains a regular tombstone and
a shadowable one, which can be erased by a row marker.

The intent of the series is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be, leading to incorrect results."

* 'materialized-views/shadowable/v5' of https://github.com/duarten/scylla:
  sstables: Read and write shadowable tombstones
  mutation_partition: Use row_tombstone
  mutation_partition: Introduce row_tombstone
  mutation_partition: Introduce shadowable tombstones
  idl-compiler: Support optional fields in views
  tombstone: Extract out relational operators
  row_marker: Mark constructors explicit
2017-04-25 13:05:27 +03:00
Duarte Nunes
d45596ae8e sstables: Read and write shadowable tombstones
This patch serializes shadowable tombstones to sstables by adding a
new, incompatible atom's mask.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Duarte Nunes
4e693383f7 mutation_partition: Use row_tombstone
This patch replaces the current row tombstone representation by a
row_tombstone.

The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.

We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialized view:

1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3

These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though it's dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.

This patch only addresses the memory representation of such
row_tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Duarte Nunes
6a2bccd4ae mutation_partition: Introduce row_tombstone
This patch introduces the row_tombstone class, which represents a
tombstone made up of a regular tombstone and a shadowable one.

The rules for row_tombstones are as follows:

 - The shadowable tombstone is always >= than the regular one;
 - The regular tombstone works as expected;
 - The shadowable tombstone doesn't erase or compact away the regular
   row tombstone, nor dead cells;
 - The shadowable tombstone can erase live cells, but only provided
   they can be recovered (e.g., by including all cells in a MV update,
   both updated cells and pre-existing ones);
 - The shadowable tombstone can be erased or compacted away by a newer
   row marker.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:28 +02:00
Duarte Nunes
3d49c1da01 mutation_partition: Introduce shadowable tombstones
A shadowable tombstone is a tombstone that can be replaced by a
smaller one if provided a row_marker with a bigger timestamp than the
shadowable tombstone.

In the context of a row, it is only valid as long as no newer insert
is done (thus setting a live row marker; note that if the row
timestamp set is lower than the tombstone's, then the tombstone
remains in effect as usual). If a row has a shadowable tombstone with
timestamp Ti and that row is updated with a timestamp Tj, such that
Tj > Ti (and that update sets the row marker), then the shadowable
tombstone is shadowed by that update. A concrete consequence is that
if the update has cells with timestamp lower than Ti, then those cells
are preserved (since the deletion is removed), and this is contrary to
a regular, non-shadowable row tombstone where the tombstone is
preserved and such cells are removed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:22 +02:00
Duarte Nunes
8cc29f84fb idl-compiler: Support optional fields in views
When generating view code, the compiler was ignoring optional fields.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Duarte Nunes
d216c3dbd2 tombstone: Extract out relational operators
This patch extracts out the relational operators in struct tombstone
to a class capable of generating them from a tri-compare function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Duarte Nunes
392403b5b3 row_marker: Mark constructors explicit
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Tomasz Grabiec
f3609fc813 tests: log_histogram_test: Fix compilation on Ubuntu
Some gcc versions incorrectly complain:

  tests/log_histogram_test.cc:87:22: error: ‘opts1’ is not a valid template argument for type ‘const log_histogram_options&’ because object ‘opts1’ has not external linkage
 size_t hist_key<node<opts1>>(const node<opts1>& n) { return n.v; }

Apparently this is a bug in gcc:

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52036

Fixes #2307.

Message-Id: <1493108791-11247-1-git-send-email-tgrabiec@scylladb.com>
2017-04-25 12:15:28 +03:00
Pekka Enberg
940c3f4330 Merge "Clang fixes (part 2)" from Avi
"This series fixes some more errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler.  A single issue remains, but it's
probably in std::experimental::optional::swap(), not in our code."

* tag 'clang/2/v1' of https://github.com/avikivity/scylla:
  sstable_test: avoid passing negative non-type template arguments to unsigned parameters
  UUID: add more comparison operators
  sstable_datafile_test: avoid string_view user-defined literal conversion operator
  mutation_source_test: avoid template function without template keyword
  cql_query_test: define static variable
  cql_query_test: add braces for single-item collection initializers
  storage_service: don't use typeid(temporary)
  logalloc: remove unused max_occupancy_for_compaction
  storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
  storage_proxy: drop unused member access from return value
  storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
  read_repair_decision: fix operator<<(std::ostream&, ...)
2017-04-24 20:32:16 +03:00
Tomasz Grabiec
dfbb9fd8f1 gdb: Workaround for gdb.Value being not accepted by %x
Fixes the following error in "scylla segment-descs" and a similar one in "scylla lsa-segment":

Traceback (most recent call last):
  File "scylla-gdb.py", line 530, in invoke
    gdb.write('0x%x: lsa free=%d region=0x%x zone=0x%x\n' % (addr, desc['_free_space'], desc['_region'], desc['_zone']))
TypeError: %x format: an integer is required, not gdb.Value

Message-Id: <1493029465-6482-1-git-send-email-tgrabiec@scylladb.com>
2017-04-24 13:27:25 +03:00
Avi Kivity
6d9e18fd61 logalloc: reduce descriptor overhead
Every lsa-allocated object is prefixed by a header that contains information
needed to free or migrate it.  This includes its size (for freeing) and
an 8-byte migrator (for migrating).  Together with some flags, the overhead
is 14 bytes (16 bytes if the default alignment is used).

This patch reduces the header size to 1 byte (8 bytes if the default alignment
is used).  It uses the following techniques:

 - ULEB128-like encoding (actually more like ULEB64) so a live object's header
   can typically be stored using 1 byte
 - indirection, so that migrators can be encoded in a small index pointing
   to a migrator table, rather than using an 8-byte pointer; this exploits
   the fact that only a small number of types are stored in LSA
 - moving the responsibility for determining an object's size to its
   migrator, rather than storing it in the header; this exploits the fact
   that the migrator stores type information, and object size is in fact
   information about the type

The patch improves the results of memory_footprint_test as following:

Before:

 - in cache:     976
 - in memtable:  947

After:

mutation footprint:
 - in cache:     880
 - in memtable:  858

A reduction of about 10%.  Further reductions are possible by reducing the
alignment of lsa objects.

logalloc_test was adjusted to free more objects, since with the lower
footprint, rounding errors (to full segments) are different and caused
false errors to be detected.

Missing: adjustments to scylla-gdb.py; will be done after we agree on the
new descriptor's format.
2017-04-24 12:23:12 +02:00
Avi Kivity
b4e897a66d cql3::metadata: fix undefined evaluation order in constructor
We both move names_ to its destination, and call names_.size() in the same
expression; this has undefined evaluation order, and fails with clang.

With this patch as well as the clang build fixes, Scylla starts and is
able to serve requests (light cassandra-stress load).

Message-Id: <20170423121727.1948-1-avi@scylladb.com>
2017-04-24 10:40:12 +03:00
Duarte Nunes
cddf2f4d74 tests: Fix failure virtual_reader_test
This patch fixes a failure of virtual_reader_test, where both the test
itself and the cql_test_env initialize the messaging_service to listen
on the same address and port, triggering an assert in
posix_ap_server_socket_impl::accept().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170423104240.21275-1-duarte@scylladb.com>
2017-04-23 14:06:35 +03:00
Avi Kivity
566c094764 sstable_test: avoid passing negative non-type template arguments to unsigned parameters
Clang complains.  The test looks somewhat bogus, but that's for another patch.
2017-04-22 22:13:55 +03:00
Avi Kivity
dc6ea51ffa UUID: add more comparison operators
Clang wanted them for some unit test; not sure how gcc was able to
synthesize them, but they're clearly needed.
2017-04-22 22:12:33 +03:00
Avi Kivity
5424aca745 sstable_datafile_test: avoid string_view user-defined literal conversion operator
Clang doesn't like it, perhaps because it isn't in the std namespace (it's
still in std::experimental).
2017-04-22 22:11:30 +03:00
Avi Kivity
705ac957a2 mutation_source_test: avoid template function without template keyword
This isn't (yet?) standard C++, and clang rejects it.
2017-04-22 22:10:21 +03:00
Avi Kivity
551fb03476 cql_query_test: define static variable
single_node_cql_env is declared but not defined; define it to make clang
happy.
2017-04-22 22:01:44 +03:00
Avi Kivity
eb700752d8 cql_query_test: add braces for single-item collection initializers
Clang complains that braces are missing; I didn't verify it but I'm sure
it's right.  Add braces to make it happy.
2017-04-22 22:00:49 +03:00
Avi Kivity
6bb8ae7788 storage_service: don't use typeid(temporary)
Clang warns that the expression will be evaluated (doh).  While the warning
seems dubious, keep it and change the code to call the function outside
typeid(), in case it does help someone one day.
2017-04-22 21:09:41 +03:00
Avi Kivity
9303b09a64 logalloc: remove unused max_occupancy_for_compaction
Noticed by clang.
2017-04-22 21:09:41 +03:00
Avi Kivity
6d0811711f storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
Clang's std::abs() doesn't support __int128_t, so use __int64_t instead.  With
this change, it's possible that a read repair 252,700 years after a write
will be interpreted as a recent write and the read repair will incorrectly
be skipped; hopefully by that time __int128_t will be standardized.
2017-04-22 21:09:41 +03:00
Avi Kivity
5ec1742b9a storage_proxy: drop unused member access from return value
Noticed by clang.
2017-04-22 21:09:41 +03:00
Avi Kivity
e4bae0df51 storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
Noticed by clang.
2017-04-22 21:09:41 +03:00
Avi Kivity
944047f039 read_repair_decision: fix operator<<(std::ostream&, ...)
Argument-dependent lookup requires that the operator be declared in the
same namespace as the class; move it there.

While at it, de-static it, it only causes bloat.
2017-04-22 21:09:41 +03:00
Raphael S. Carvalho
4a86dd473d tests: add tests/sstable_resharding_test.cc
Forgot to add file after resolving conflict.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170422172053.3734-1-raphaelsc@scylladb.com>
2017-04-22 21:09:29 +03:00
Benoît Canet
f68049ef5d tests: Fix clang auto universal reference type deduction
Replace it by regular template type deduction.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170421204150.4626-2-benoit@scylladb.com>
2017-04-22 20:04:00 +03:00
Benoit Canet
b902f3b81b tests: Remove parenthesis in variable declaration
The parenthesis prevented clang from compiling these tests.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170421204150.4626-1-benoit@scylladb.com>
2017-04-22 20:04:00 +03:00
Avi Kivity
54ab13eb8e Merge "sstable resharding revamp" from Raphael
"Currently, a shared sstable is rewritten at all shards it belongs to, and only
after that, it's deleted.

This new algorithm adds the ability to reshard a set of sstables together at a
single shard and produce unshared sstable for all shards involved.

That's important for the leveled compaction strategy issue, in which the number
of sstables grew considerably after resharding. What happened is that every
sstable was being split into N ones, so we could end up with tons of small
sstables. Now, we will reshard a set of adjacent sstables together."

* 'sstable_resharding_revamp_v9' of github.com:raphaelsc/scylla:
  tests: add test for new sstable resharding
  database: kill column_family::start_rewrite
  database: wire up new resharding algorithm
  database: implement new sstable resharding algorithm
  database: introduce function to replace new sstables by their ancestors
  prevent regular compaction from choosing shared sstables
  compaction_strategy: implement resharding strategy for compaction strategies
  sstables: store more info in foreign_sstable_open_info
  sstables: make it possible to get open info from loaded sstable
  database: export column family dir
  database: inform if column family has shared tables
  sstables: add method to export ancestors
  lcs: implement get_level_count
  compaction_manager: introduce method to check if manager stopped
  lcs: restore invariant instead of sending overlapping sst to L0
  sstables: extend compaction for new resharding
  sstables: allow shard A to correctly create sstable for shard B
  compaction: rework compacting_sstable_writer to work with multiple writers
  compaction: prepare compacting_sstable_writer to work with writers
  sstables: rework compaction to make it easy to extend
2017-04-22 13:31:54 +03:00
Raphael S. Carvalho
8a37b279ed tests: add test for new sstable resharding
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:34 -03:00
Raphael S. Carvalho
662fe77c11 database: kill column_family::start_rewrite
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:33 -03:00
Raphael S. Carvalho
43ac19eb52 database: wire up new resharding algorithm
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:31 -03:00
Raphael S. Carvalho
cf45333588 database: implement new sstable resharding algorithm
NOTE: it's not wired yet.

Currently, a shared sstable is rewritten at all shards it belongs
to and only after that, it's deleted. With this new algorithm, a
shared sstable will be read only once and N unshared sstables
will be created, each of them with 1/N of the data. After it's
done, each owner shard will receive its new unshared sstable
replacing its ancestors.

Another benefit is that resharding will no longer result in the number
of sstables growing considerably. A full-sized leveled sstable is usually
160MB, so after resharding, we could have N files of 160MB/N. Now, the
leveled strategy will help resharding: N adjacent sstables of the same
level will be resharded together, so we'll end up with N files of
N*160MB/N.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:30 -03:00
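The single-pass split described above can be sketched as follows. This is an illustration only: `owner_shard`, the flat token vectors, and the modulo-based ownership are stand-ins, not Scylla's actual types or token-range logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-in for token ownership; Scylla derives this from
// token ranges, here we just use a modulo.
inline unsigned owner_shard(uint64_t token, unsigned smp_count) {
    return token % smp_count;
}

// Sketch of the new algorithm's shape: read the shared sstable once,
// routing each row to the output for its owner shard, producing N
// unshared outputs with roughly 1/N of the data each.
inline std::vector<std::vector<uint64_t>>
reshard_single_pass(const std::vector<uint64_t>& shared_rows, unsigned smp_count) {
    std::vector<std::vector<uint64_t>> per_shard(smp_count);
    for (uint64_t token : shared_rows) {
        per_shard[owner_shard(token, smp_count)].push_back(token);
    }
    return per_shard;
}
```

Each owner shard then loads only its own output and drops the shared ancestor, so the shared data is read once instead of once per owning shard.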
Raphael S. Carvalho
6513252e91 database: introduce function to replace ancestor sstables with new ones
When resharding, we're working with sstables from all shards. So let's say
we're done with resharding of sstable A, which belongs to shards 0 and 1,
and sstable B, which belongs to shards 1 and 2. SSTables were generated for
shards 0, 1, and 2, so shards 0, 1, and 2 need to load the new sstables
and remove the ancestors. Shard 1, for example, will remove sstables A and
B (its ancestors) and add the new one. That's where this new function comes
in. We'll forward new sstables to their target shards using foreign sstable
open info.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:27 -03:00
Raphael S. Carvalho
c44a2319e6 prevent regular compaction from choosing shared sstables
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That doesn't
affect current resharding, because it marks the sstables as
compacting, but that marking won't work with new resharding, which
operates on sstables from multiple shards.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:26 -03:00
Raphael S. Carvalho
13477075e2 compaction_strategy: implement resharding strategy for compaction strategies
Strategies other than leveled will reshard one shared sstable at
a time, and the target shard (the shard at which the job will run)
for each job will be chosen in a round-robin fashion.

For leveled strategy, we will reshard together smp::count adjacent
sstables that belong to the same level.
The reason is that resharding one sstable at a time may create a
file for each shard, meaning after resharding we could end up with
NO_SSTABLES*NO_SHARDS files.

These resharding strategies will be used for our new resharding
algorithm.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:24 -03:00
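A rough sketch of the two job-building policies described above. The names (`sstable_desc`, `next_target_shard`, `leveled_jobs`) are illustrative, not the actual strategy API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct sstable_desc { int level; };  // illustrative placeholder

// Non-leveled strategies: one shared sstable per job, with the target
// shard for each job chosen in round-robin fashion.
inline unsigned next_target_shard(unsigned& rr, unsigned smp_count) {
    return rr++ % smp_count;
}

// Leveled strategy: group smp_count adjacent same-level sstables into a
// single job, so one job yields ~smp_count full-sized outputs instead
// of every sstable producing one small file per shard.
inline std::vector<std::vector<sstable_desc>>
leveled_jobs(const std::vector<sstable_desc>& same_level, unsigned smp_count) {
    std::vector<std::vector<sstable_desc>> jobs;
    for (std::size_t i = 0; i < same_level.size(); i += smp_count) {
        std::size_t end = std::min(same_level.size(), i + smp_count);
        jobs.emplace_back(same_level.begin() + i, same_level.begin() + end);
    }
    return jobs;
}
```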
Raphael S. Carvalho
bf930476b3 sstables: store more info in foreign_sstable_open_info
We need that info for opening an sstable at a different shard, unlike
the sstable loader, which has everything in entry_descriptor, obtained
from the components in the sstable filename.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:22 -03:00
Raphael S. Carvalho
e5e7037aa4 sstables: make it possible to get open info from loaded sstable
It will be useful for resharding, which needs to move an sstable
across shards. To do that without reloading the sstable at the
target shard, we need to be able to get the open info and move it
to the target shard instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:21 -03:00
Raphael S. Carvalho
405e41e9a8 database: export column family dir
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:19 -03:00
Raphael S. Carvalho
2b774c5bc3 database: inform if column family has shared tables
That will be useful to quickly determine whether it's worth resharding
a column family.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:17 -03:00
Raphael S. Carvalho
2d119287b7 sstables: add method to export ancestors
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:16 -03:00
Raphael S. Carvalho
f2f8a2f5c7 lcs: implement get_level_count
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:14 -03:00
Raphael S. Carvalho
585596cede compaction_manager: introduce method to check if manager stopped
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:12 -03:00
Raphael S. Carvalho
d82a8dfae0 lcs: restore invariant instead of sending overlapping sst to L0
An sstable with a large token span may find its way into a high level due
to resharding, which means the strategy invariant is broken. The invariant
is restored by compacting the first set of overlapping sstables, meaning
that the restoration is done incrementally across multiple overlapping sets.

Invariant is restored by regular compaction after resharding puts new unshared
sstables into their original level, where level > 0.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:09 -03:00
Raphael S. Carvalho
0127309820 sstables: extend compaction for new resharding
Extends compaction for new resharding algorithm. Not wired yet.
New resharding will compact shared sstable(s) and create one
sstable for each owner. It's up to the caller to open these
new unshared sstables at their respective column families.

This new approach will save a lot of bandwidth because we'll
no longer read the entire shared sstable #smp::count times.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:08 -03:00
Raphael S. Carvalho
758bc38e7a sstables: allow shard A to correctly create sstable for shard B
That's possible by shard A explicitly saying that the sstable is created
for shard B. If we don't do that, the sharding metadata isn't correct,
and consequently the sstable will report the wrong owners.

We'll need this for resharding which will create sstables for all
shards that own the shared sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:06 -03:00
Raphael S. Carvalho
2a437ab427 compaction: rework compacting_sstable_writer to work with multiple writers
compacting_sstable_writer only allowed one writer so far, but we will need
multiple ones for resharding.
It's done by moving writer management to compaction.
finish_sstable_writer() is added for the compaction impl to stop all writers,
whereas stop_sstable_writer() will only stop the current writer (needed when
the current sstable reaches its maximum size limit, for example).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:05 -03:00
Raphael S. Carvalho
a35a3a9647 compaction: prepare compacting_sstable_writer to work with writers
No need for compacting_sstable_writer to store items that are available
in compaction class. Also, that's a step towards supporting multiple
writers for compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:03 -03:00
Raphael S. Carvalho
38ed83e2f7 sstables: rework compaction to make it easy to extend
compact_sstables() supported both regular and cleanup compaction,
but with lots of conditions that made it ugly and hard to extend.

In the future, we want to introduce a new type of compaction for
resharding that will create one sstable for every shard owning
the sstable(s) given as input. That will be easier now.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:02 -03:00
Avi Kivity
fdcf64520d Merge seastar upstream
* seastar 2eec212...194d80f (4):
  > removing the collectd tests
  > fix fstream metrics reporting.
  > do_for_each: Make it check for need preempt
  > core/sharded: introduce copy method to foreign_ptr
2017-04-21 22:14:01 +03:00
Avi Kivity
fccbf2c51f Merge "Reduce memory reclamation latency" from Tomasz
"Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if the region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

Statistics for reclamation pauses for a read workload over
larger-than-memory data set:

Before:

 avg = 865.796362
 stdev = 10253.498038
 min = 93.891000
 max = 264078.000000
 sum = 574022.988000
 samples = 663

After:

 avg = 513.685650
 stdev = 275.270157
 min = 212.286000
 max = 1089.670000
 sum = 340573.586000
 samples = 663

Refs #1634."

* tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev:
  lsa: Reduce reclamation latency
  tests: Add test for log_histogram
  log_histogram: Allow non-power-of-two minimum values
  lsa: Use regular compaction threshold in on-idle compaction
  tests: row_cache_test: Induce update failure more reliably
  lsa: Add getter for region's eviction function
2017-04-21 17:47:06 +03:00
Tomasz Grabiec
20f4c9bf23 lsa: Reduce reclamation latency
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if the region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

Statistics for reclamation pauses for a read workload over
larger-than-memory data set:

Before:

 avg = 865.796362
 stdev = 10253.498038
 min = 93.891000
 max = 264078.000000
 sum = 574022.988000
 samples = 663

After:

 avg = 513.685650
 stdev = 275.270157
 min = 212.286000
 max = 1089.670000
 sum = 340573.586000
 samples = 663

Refs #1634.

Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
2017-04-21 12:52:31 +02:00
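The change in stop condition can be sketched like this. Occupancy fractions and the flat segment layout are simplified stand-ins for LSA's real bookkeeping, not the actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct segment { double used; double capacity; };

// Old condition: evict until the occupancy of the *whole region* drops
// below the threshold.
inline double occupancy(const std::vector<segment>& segs) {
    double used = 0, cap = 0;
    for (const auto& s : segs) { used += s.used; cap += s.capacity; }
    return used / cap;
}

// New condition: stop evicting once the *sparsest* segment is below the
// threshold, i.e. once compaction can make progress, rather than once
// the whole region is below it.
inline double sparsest_occupancy(const std::vector<segment>& segs) {
    double best = 1.0;
    for (const auto& s : segs) { best = std::min(best, s.used / s.capacity); }
    return best;
}
```

For a large, dense region the sparsest-segment condition is reached after far less eviction, which is where the reduction in reclamation pauses comes from.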
Tomasz Grabiec
4313641c03 tests: Add test for log_histogram 2017-04-21 12:52:31 +02:00
Tomasz Grabiec
c83768d6bb log_histogram: Allow non-power-of-two minimum values
We will want to reuse the min_size mechanism for the whole compaction
threshold, including the occupancy threshold. That threshold is close
to the segment size and we cannot pick a power of two which would be
close enough to what we need.

Therefore, change log_histogram to support arbitrary minimum base.

bucket_of() was moved into log_histogram_options so that it can be used
in number_of_buckets(), which makes for a simple and much less
error-prone implementation.
2017-04-21 10:54:50 +02:00
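A minimal sketch of log-style bucketing with an arbitrary (non-power-of-two) minimum. This illustrates the idea only; it is not the actual log_histogram code, and `bucket_of` here is a hypothetical simplification:

```cpp
#include <cassert>
#include <cstddef>

// Bucket i covers [min * 2^i, min * 2^(i+1)); values below min land in
// bucket 0. min need not be a power of two, which is the point of the
// change above.
inline unsigned bucket_of(std::size_t value, std::size_t min) {
    unsigned i = 0;
    while (value >= 2 * min) {
        value /= 2;
        ++i;
    }
    return i;
}
```

With min = 1000 (not a power of two), values in [1000, 2000) map to bucket 0, [2000, 4000) to bucket 1, and so on.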
Tomasz Grabiec
7a800c54bf lsa: Use regular compaction threshold in on-idle compaction
Idle-time compaction should not produce non-compactible segments,
because that means we would have to evict a lot when we finally need
to reclaim some memory so that occupancy falls below the regular
compaction threshold. This may cause latency spikes.

Refs #1634.
2017-04-20 15:00:15 +02:00
Tomasz Grabiec
e054ccc037 tests: row_cache_test: Induce update failure more reliably
After changing the region evictability condition to be less strict, the
cache update stopped failing because reclamation was able to compact the
dense region. Induce the failure by installing an evictor which refuses
to evict more than a few elements from the cache.
2017-04-20 14:51:47 +02:00
Tomasz Grabiec
7aa286439f lsa: Add getter for region's eviction function 2017-04-20 14:51:42 +02:00
Vlad Zolotarov
9c1d803157 fix_system_distributed_tables.py: add --node and --port parameters
Allow giving a non-default IP address and a port to connect to the cluster.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1491316458-18420-1-git-send-email-vladz@scylladb.com>
2017-04-20 14:49:26 +03:00
Avi Kivity
68f0df12ee Merge "Optimize reads with clustering restrictions" from Tomasz
"This series makes several optimizations to sstable mutation reader relevant
for large partitions.

Some highlights:

One optimization is to use the index for skipping across clustering restrictions.
Currently we read the whole partition in such cases. That includes the case when
we need to read a static row and then jump to some clustering row in the
middle of the partition. Another case is having more than one clustering
restriction, e.g. selecting multiple single rows from the same partition.

Another optimization is using information from the index for creation of
streamed_mutation. That can save us the cost of reading the partition header
from the data file in case we would not continue reading, but skip to the
middle of that partition. Or we may not even attempt to read anything from
that partition, if after we determine the key that reader will be put behind
other readers, which will exhaust the query limit first.

Another optimization is switching single-partition queries to use the
index_reader infrastructure. Index lookups via index_reader are faster than
find_disk_ranges(). This is also a cleanup, a step towards converting all code
to use the index_reader."

* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits)
  sstables: Remove unused code
  sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
  sstables: mutation_reader: Use index_reader for single-partition reads
  sstables: mutation_reader: Add trace-level logging
  sstables: mutation_reader: Move partition reading code to sstable_data_source
  sstables: mutation_reader: Move definitions out of the class body
  sstables: Move binary_search() to a header
  database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating
  sstables: index_reader: Introduce advance_to_next_partition()
  sstables: index_reader: Introduce advance_and_check_if_present()
  sstables: index_reader: Introduce advance_past()
  sstables: index_reader: Make copyable
  sstables: index_reader: Optimize advancing to extreme positions
  sstables: index_reader: Keep two last pages alive
  dht: ring_position_view: Add key getter
  dht: ring_position_view: Add constructor and factory from ring_position_view
  sstables: mutation_reader: Advance to next partition using index in some cases
  sstables: index_reader: Expose access to partition key and tombstone
  sstables: index_reader: Introduce promoted_index_view
  sstables: mutation_reader: Move _index_in_current to sstable_data_source
  ...
2017-04-20 13:58:37 +03:00
Tomasz Grabiec
3472a74de4 sstables: Remove unused code 2017-04-20 11:23:05 +02:00
Tomasz Grabiec
c1059ca8e4 sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
It's cheaper than a key-based lookup, so use it when we can.
2017-04-20 11:23:05 +02:00
Tomasz Grabiec
4742008b70 sstables: mutation_reader: Use index_reader for single-partition reads
This switches single-partition query to use the index_reader
infrastructure. Index lookups via index_reader are faster than
find_disk_ranges().

perf_fast_forward, rows: 1000000, value size: 100

Before:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.002182         2        916      3        152       2       0        0        1        1  88.1%

After:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.000758         2       2639      3        152       2       0        0        1        1  48.6%

This is also a cleanup, a step towards converting all code to use the
index_reader.
2017-04-20 11:23:05 +02:00
Tomasz Grabiec
9d8795089d sstables: mutation_reader: Add trace-level logging 2017-04-20 11:18:55 +02:00
Tomasz Grabiec
b198c31c46 sstables: mutation_reader: Move partition reading code to sstable_data_source
It will be reused for read_row(), which doesn't create a mutation_reader
instance, only an sstable_data_source.
2017-04-20 11:18:26 +02:00
Tomasz Grabiec
6e4bca0be6 sstables: mutation_reader: Move definitions out of the class body
To make further refactoring easier to review. No functional changes here.
2017-04-20 10:54:38 +02:00
Tomasz Grabiec
4ed7e529db sstables: Move binary_search() to a header
There are instantiations of binary_search() used in sstables.cc, but
defined in partition.cc. The instantiations are explicitly declared in
partition.cc, but the types changed and they became obsolete. The
thing worked because partition.cc also instantiated it with the right
type. But after that code will be removed, it no longer would, and we
would get a linker error. To avoid such problems, define
binary_search() in a header.
2017-04-20 10:54:38 +02:00
Tomasz Grabiec
bedd0ab6f9 database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
0b5ba13230 sstables: index_reader: Introduce advance_to_next_partition() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
4b81844d2e sstables: index_reader: Introduce advance_and_check_if_present() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
b92f095bf0 sstables: index_reader: Introduce advance_past() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
6780756258 sstables: index_reader: Make copyable 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
7db83fa3fe sstables: index_reader: Optimize advancing to extreme positions 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
f66443c01c sstables: index_reader: Keep two last pages alive
The idea behind caching is that when we have two index readers where
one is catching up with the other, each page will be read only
once. Currently that's not always the case. There is a case when
advance_to() may need to read two pages. That's when the target
position is not found in the first page as determined by the summary
index. The second reader which catches up would have to read the first
page as well, but it would not be in cache any more. To avoid this
extra I/O let's keep a reference to the two last pages touched by the
index.
2017-04-20 10:54:38 +02:00
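The idea can be sketched as a tiny two-slot cache. The types (`index_page`, `last_two_pages`) are illustrative, not the actual index reader's cache:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

struct index_page { uint64_t id; /* parsed entries elided */ };

// Keeps the two most recently touched pages alive, so a second reader
// catching up behind the first re-reads neither of them, even when
// advance_to() had to touch two pages.
class last_two_pages {
    std::optional<index_page> _slots[2];
public:
    std::optional<index_page> find(uint64_t id) const {
        for (const auto& s : _slots) {
            if (s && s->id == id) return s;
        }
        return std::nullopt;
    }
    void touch(index_page p) {
        _slots[1] = std::move(_slots[0]);  // drop the older of the two
        _slots[0] = std::move(p);
    }
};
```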
Tomasz Grabiec
c7b9c5dfd3 dht: ring_position_view: Add key getter 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
5b71e0b9ab dht: ring_position_view: Add constructor and factory from ring_position_view 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
3e8795494e sstables: mutation_reader: Advance to next partition using index in some cases
To produce a streamed_mutation for the next partition, we need to read
its key and the tombstone. Currently we always do that by consuming
the partition header from the data file. In some cases that may cause
unnecessary IO.

It's better to obtain partition information from the index if we
already have it. We can save on IO if the user skips past the
front of the partition immediately after.

It is also better to pay the cost of reading the index if we know that
we will need to use the index anyway soon. This patch predicts that by
checking if there are any clustering restrictions. If there are any,
we will almost surely need_skip() and use the index anyway.

This change also lays the ground for unification of multi and single
partition queries without loss of performance.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
e35fe7492c sstables: index_reader: Expose access to partition key and tombstone 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
ae72c159b1 sstables: index_reader: Introduce promoted_index_view
So that we have a nice way of extracting the tombstone out of it. We
don't always need a fully parsed index.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
0ef33b7f29 sstables: mutation_reader: Move _index_in_current to sstable_data_source
sstable_data_source holds a shared state between mutation_reader and
streamed_mutation for sstables. The information whether index is in
current partition will have to be accessed by both in the following
patches.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
885f53d905 sstables: mutation_reader: Avoid resetting the walker
Before the change, the following scenario was happening:

  1) we try to skip based on clustering restrictions
  2) we find the page and fast forward to it, recording walker's
     lower bound counter
  3) we read the first fragment, it's not a tombstone, so we reset the walker,
     and its lower bound counter too
  4) the fragment is not in range (the range starts in the middle of the page)
  5) needs_skip() is true, we redo the index lookup, which wastes some CPU

This change fixes the problem by avoiding resetting the walker. We can
do that because leading tombstones are checked with a non-mutable
contains_tombstone().
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
bf21aa3a1f clustering_ranges_walker: Introduce contains_tombstone() 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
b030ce693d sstables: mutation_reader: Don't try to read index to skip to static row
The static row is always at the beginning, so there's no point in doing
index lookups.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
3e060659f1 sstables: mutation_reader: Don't try to read static row if table doesn't have any 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
b1860a8a24 clustering_ranges_walker: Allow excluding the static row 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
77d3e30239 sstables: mutation_reader: Use index to skip across clustering restrictions
Improves scans with clustering restrictions. Before the change such
scans would scan the whole partition.

Below are results of a test case from perf_fast_forward which selects
few rows from a large partition using query restrictions (not fast forwarding).

Before:

  stride  rows      time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  1000000 1         0.000609         1       1642      3        152       2       1        0        1        1  38.0%
  500000  2         0.242255         2          8    511      64152     398       4        0        1        1  98.6%
  250000  4         0.281592         4         14    749      95832     564       4        0        1        1  98.4%
  125000  8         0.328056         8         24    873     111704     657       4        0        1        1  98.4%
  62500   16        0.306700        16         52    935     119640     751       4        0        1        1  99.4%

After:

  stride  rows      time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  1000000 1         0.000711         1       1406      3        152       2       1        0        1        1  42.1%
  500000  2         0.000910         2       2197      5        216       3       2        0        1        1  39.2%
  250000  4         0.001384         4       2891      9        344       5       4        0        1        1  35.3%
  125000  8         0.003197         8       2502     21        728      13       8        0        1        1  53.1%
  62500   16        0.006664        16       2401     41       1368      25      16        0        1        1  58.2%
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
05a1f92cbc clustering_ranges_walker: Introduce lower_bound_change_counter()
Allows detecting changes of lower_bound().

The result of advance_to() is not enough: when we get false from
advance_to() twice in a row, the lower bound may or may not have changed.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
461f2af0a1 sstables: mutation_reader: Avoid index lookups when out of range 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
10c92d37d1 sstables: mutation_reader: Simplify fast_forward_to() 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
bfb6858e55 sstables: mutation_reader: Let clustering_ranges_walker handle the _fwd_range start
Simplifies the code a bit, but also will make it easier to calculate
the next position we should skip to after forwarding, taking into
consideration both the position forwarded to as well as clustering
ranges of the query. That will be just calling
_ck_ranges_walker->lower_bound() after it is trimmed.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
d056d9c31b sstables: mutation_reader: Let mp_row_consumer decide about position passed to the index
In general mp_row_consumer has better information about the next
position to read. It could be after the position we forward to if
there are clustering restrictions. This will be exploited in the
following patches.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
a37712e9ae sstables: mutation_reader: Move mp_row_consumer::fast_forward_to() out of line 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
bb3e683783 clustering_ranges_walker: Support trimming
Makes implementing fast_forward_to() easier. mp_row_consumer emulates
this currently; this change will allow simplifying that.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
652d04e78a clustering_ranges_walker: Generalize to work on position ranges
It will include the static row by default. This will allow simplifying
users, which work with position ranges already.
2017-04-20 10:54:36 +02:00
Tomasz Grabiec
c85fe3183c position_range: Allow stealing of bounds 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
503c68de44 position_in_partition: Add more factory methods 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
6c1dc642ee sstables: mutation_reader: Create index on-demand 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
434fda3577 sstables: mutation_reader: Keep priority_class by reference
To indicate that it is not optional.
2017-04-20 10:54:36 +02:00
Tomasz Grabiec
a8c126c82a sstables: Expose get_index_reader() 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
e1af5a406d sstables: Make sstable::get_index_reader() return unique_ptr<>
Makes callers a bit simpler
2017-04-20 10:54:36 +02:00
Tomasz Grabiec
7dc3fe7d3f tests: perf_fast_forward: Add test case for forwarding with clustering restrictions in a large partition 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
eed864690b tests: perf_fast_forward: Add test case for slicing of large partition using a single-partition reader 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
81fc7977a4 tests: perf_fast_forward: Add test for selecting few rows from large partition 2017-04-20 10:54:36 +02:00
Raphael S. Carvalho
3286f7aaa6 compaction: make major compaction go through compaction manager
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor and major compaction
at a given time. The only issue is number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.

Fixes #1156.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417233125.14092-1-raphaelsc@scylladb.com>
2017-04-19 15:44:21 +03:00
Duarte Nunes
e06bafdc6c alter_type_statement: Fix signed to unsigned conversion
This could allow us to alter a non-existing field of an UDT.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170419114254.5582-1-duarte@scylladb.com>
2017-04-19 14:48:12 +03:00
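The general shape of this bug class can be illustrated as follows; the code is hypothetical, not the actual alter_type_statement logic. A negative "not found" result stored in an unsigned variable makes the guard dead code, so a non-existing field passes the check:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the field's index, or -1 if absent.
inline int find_field(const std::vector<std::string>& fields, const std::string& name) {
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (fields[i] == name) return static_cast<int>(i);
    }
    return -1;
}

// Buggy: -1 silently converts to SIZE_MAX, and `idx < 0` can never be
// true for an unsigned type, so every field appears to exist.
inline bool field_exists_buggy(const std::vector<std::string>& fs, const std::string& n) {
    std::size_t idx = find_field(fs, n);   // signed-to-unsigned conversion
    if (idx < 0) {                         // dead code for unsigned idx
        return false;
    }
    return true;
}

// Fixed: keep the signed type until after the "not found" check.
inline bool field_exists_fixed(const std::vector<std::string>& fs, const std::string& n) {
    int idx = find_field(fs, n);
    return idx >= 0;
}
```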
Tomasz Grabiec
02da3ba316 tests: perf_fast_forward: Fix use-after-free in scan_with_stride_partitions()
partition_range must live as long as the reader is used.
2017-04-19 08:37:56 +02:00
Raphael S. Carvalho
e78db43b79 compaction_manager: fix crash when dropping a resharding column family
The problem is that the column family field of the task wasn't being set
for resharding, so the column family wasn't being properly removed from
the compaction manager. In addition to fixing this issue, we'll also
interrupt ongoing compactions when dropping a column family, exactly
like we do on shutdown.

Fixes #2291.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170418125807.7712-1-raphaelsc@scylladb.com>
2017-04-18 17:39:27 +03:00
Duarte Nunes
af37a3fdbf logalloc: Fix compilation error
This patch moves a function using the region_impl type after the type
has been defined.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170418124551.25369-1-duarte@scylladb.com>
2017-04-18 15:56:26 +03:00
Raphael S. Carvalho
11b74050a1 partitioned_sstable_set: fix quadratic space complexity
Streaming generates lots of small sstables with large token ranges,
which triggers O(N^2) space usage in the interval map. Level 0
sstables will now be stored in a structure that has O(N) space
complexity and which will be included in every read.

Fixes #2287.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
2017-04-18 13:04:38 +03:00
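The shape of the fix can be sketched like this, with a flat list standing in for the interval map and illustrative types throughout:

```cpp
#include <cassert>
#include <vector>

struct sst { int level; int first_token; int last_token; };

// Level-0 sstables (many small ones with huge token spans after
// streaming) go into a flat O(N) vector that every read includes;
// higher levels stay range-indexed (an interval map in the real code,
// a linear scan here).
class sstable_set {
    std::vector<sst> _level0;
    std::vector<sst> _leveled;
public:
    void insert(const sst& s) {
        (s.level == 0 ? _level0 : _leveled).push_back(s);
    }
    std::vector<sst> select(int token) const {
        std::vector<sst> out = _level0;            // always included
        for (const auto& s : _leveled) {
            if (s.first_token <= token && token <= s.last_token) {
                out.push_back(s);
            }
        }
        return out;
    }
};
```

Keeping wide level-0 tables out of the interval map trades a per-read scan of level 0 for O(N) space, instead of the O(N^2) blow-up from overlapping intervals.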
Takuya ASADA
86e464ab26 dist/offline_installer: support Ubuntu/Debian
Moved the existing script to dist/offline_installer/redhat, and added a
.deb version under dist/offline_installer/debian.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1492474821-9907-1-git-send-email-syuu@scylladb.com>
2017-04-18 10:56:50 +03:00
Pekka Enberg
b31c45d8af Merge "clang fixes (part 1)" from Avi
"This series fixes some errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler.  A few more fixes are needed to produce
a binary."

* tag 'clang/1/v1' of https://github.com/avikivity/scylla:
  logalloc: avoid auto in function argument declaration
  thrift: avoid auto in function argument declaration
  streamed_mutation: fix non-POD argument to C-style variadic function
  mutation_partition_serializer: avoid auto in function argument declaration
  date: use correct casts for years
  streaming: avoid auto in function argument declaration
  repair: avoid auto in function argument declaration
  gms: expose gms::inet_address streaming operator
  murmur3_partitioner: fix build on clang
  i_partitioner: remove unused function
  byte_ordered_partitioner: fix bad operator precedence
  result_set: pass comparator by reference to std::sort()
  to_string: move standard container overloads of to_string to std:: namespace
  cql_type: fix bad enum syntax on clang
  build: disable more warnings for clang
  build: fix detection of unsupported warnings on clang
2017-04-18 08:49:25 +03:00
Avi Kivity
844529fe33 logalloc: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.

Since the right type is private, add some friendship.
2017-04-17 23:18:44 +03:00
Avi Kivity
54add19ca2 thrift: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:18:44 +03:00
Avi Kivity
f0c25fc20f streamed_mutation: fix non-POD argument to C-style variadic function
Clang warns that passing a non-POD to a C-style variadic function will
result in an abort().  That happens to be exactly what we want, but to
silence the warning, use a template instead.  Since templates aren't
allowed in local classes, move the containing class to namespace scope.
2017-04-17 23:18:44 +03:00
Avi Kivity
635c32eb32 mutation_partition_serializer: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:18:44 +03:00
Avi Kivity
a0858dda3e date: use correct casts for years
Our date implementation uses int64_t for years, but some of the code was
not changed; clang complains, so use the correct casts to make it happy.
2017-04-17 23:03:15 +03:00
Avi Kivity
ca69a04969 streaming: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:03:15 +03:00
Avi Kivity
ae7d7ae20f repair: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:03:15 +03:00
Avi Kivity
c885c468a9 gms: expose gms::inet_address streaming operator
The standard says, and clang enforces, that declaring a function via
a friend declaration is not sufficient for ADL to kick in.  Add a namespace
level declaration so ADL works.
2017-04-17 23:03:15 +03:00
Avi Kivity
af118ab52b murmur3_partitioner: fix build on clang
Don't know what the root cause is, but the fix is harmless.
2017-04-17 23:03:15 +03:00
Avi Kivity
c05f60387b i_partitioner: remove unused function
Found by clang.
2017-04-17 23:03:15 +03:00
Avi Kivity
a496ec7f5b byte_ordered_partitioner: fix bad operator precedence
Found by clang.
2017-04-17 23:03:15 +03:00
Avi Kivity
d9aaa95b29 result_set: pass comparator by reference to std::sort()
Clang complains about some error without it; I could not understand the
error, but I'm not going to argue with it.

Since std::sort() will copy the comparator, it's better to pass using an
std::ref(), and everyone is happy.
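The pattern described above can be sketched like this (the comparator type is illustrative):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct heavy_comparator {
    // imagine expensive-to-copy state here
    bool operator()(int a, int b) const { return a < b; }
};

void sort_with_ref(std::vector<int>& v, heavy_comparator& cmp) {
    // std::sort takes the comparator by value; std::ref wraps it in a
    // cheap std::reference_wrapper, so only the wrapper gets copied.
    std::sort(v.begin(), v.end(), std::ref(cmp));
}
```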
2017-04-17 23:03:15 +03:00
Avi Kivity
a83a24268d to_string: move standard container overloads of to_string to std:: namespace
Argument-dependent lookup will not find to_string() overloads in the global
namespace if the argument and the caller are in other namespaces.

Move these to_string() overloads to std:: so ADL will find them.

Found by clang.
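The ADL mechanism at work here can be illustrated with a self-contained sketch (the `db::token` type and `stringify` helper are hypothetical, not Scylla code):

```cpp
#include <string>

namespace db {
struct token { long value; };
// to_string lives in the same namespace as its argument type, so an
// unqualified call can find it through argument-dependent lookup (ADL).
std::string to_string(const token& t) {
    return "token:" + std::to_string(t.value);
}
}

// Generic code uses the "using std::to_string" idiom: the using-declaration
// covers built-in types, while ADL finds overloads that live next to user
// types. Overloads sitting in the global namespace would be found by neither.
template <typename T>
std::string stringify(const T& v) {
    using std::to_string;
    return to_string(v);
}
```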
2017-04-17 23:03:15 +03:00
Avi Kivity
a7fe7aedbf cql_type: fix bad enum syntax on clang
cql3::type used some gcc extension that is not recognized on clang; use
the standard syntax instead.
2017-04-17 22:35:41 +03:00
Avi Kivity
1faef017e3 build: disable more warnings for clang
We should fix the source and re-enable the warnings, but this will do for
now.
2017-04-17 22:34:59 +03:00
Avi Kivity
78e9b0265b build: fix detection of unsupported warnings on clang
The diagnostic that clang spits out when it sees an unrecognized warning
is itself a warning, so the test compilation succeeds and we don't notice
the warning is not supported.

Adding -Werror turns the warning about the unrecognized warning into an
error, allowing the detection machinery to work.
2017-04-17 22:33:01 +03:00
Takuya ASADA
b8f40a2dff dist/ami/files/.bash_profile: warn user when enhanced networking is not enabled
Show warnings on the following conditions:
 - VPC is not used
 - Driver is not an enhanced networking one

Fixes #1984

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488844756-14935-1-git-send-email-syuu@scylladb.com>
2017-04-15 15:16:55 +03:00
Benoît Canet
8f793905a3 perf_sstable: Change busy loop to futurized loop
The blocked task detector introduced in
113ed9e963 was seeing
the initialization phase of perf_sstable as a blocked
task.

Transform this part of the code into a futurized loop
to make the blocked task detector happy.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170413132506.17806-1-benoit@scylladb.com>
2017-04-13 18:17:28 +03:00
Amnon Heiman
1dfd32f070 scylla-housekeeping service: Support private repositories
This patch adds support for private repositories to scylla-housekeeping.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-04-13 18:13:57 +03:00
Amnon Heiman
5839dc1f20 scylla-housekeeping-upstart: Use repository id when checking for version

This patch allows the version check to use a private repository when
checking for a version.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-04-13 18:12:52 +03:00
Amnon Heiman
622502de7a scylla-housekeeping: support private repositories
This patch allows the version check to support private repositories.

If a repository file is passed as a parameter, the repository id will be
passed as a parameter when checking the version.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-04-13 18:11:15 +03:00
Avi Kivity
7d16cfa5f0 Merge branch 'penberg/create-index-stmt-cleanup/v1' of github.com:cloudius-systems/seastar-dev
"The version of create_index_statement class that was translated to C++
is pretty old by now. This series of cleanups brings it closer to Apache
Cassandra trunk to make it easier to bring over more secondary index
code to Scylla."

* 'penberg/create-index-stmt-cleanup/v1' of github.com:cloudius-systems/seastar-dev:
  cql3/statements/create_index_statement: Move target validation
  cql3/statements/create_index_statement: Remove static column validation
  cql3/statements/create_index_statement: Extract validations
  cql3/statements/create_index_statement: Kill bogus custom validation
  cql3/statements/create_index_statement: Add materialized view to validate()
  cql3/statements/create_index_statement: Remove validation
2017-04-13 13:27:53 +03:00
Asias He
d27b47595b gossip: Fix possible use-after-free of entry in endpoint_state_map
We take a reference to an endpoint_state entry in endpoint_state_map and
access it again after code which defers; the reference can be invalid
after the defer if someone deletes the entry in the meantime.

Fix this by taking the reference again after the deferring code.

I also audited the code to remove unsafe references to endpoint_state_map
entries as much as possible.
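A minimal sketch of the safe pattern (std::map stands in for the real endpoint_state_map, and the defer point is only marked by a comment):

```cpp
#include <map>
#include <optional>
#include <string>

std::map<std::string, int> endpoint_state_map;

// Instead of holding a reference across code that may defer (letting
// another task erase the entry), re-look-up the entry by key afterwards.
std::optional<int> read_after_defer(const std::string& ep) {
    // ... code that defers / yields to other tasks would run here ...
    auto it = endpoint_state_map.find(ep);   // take the reference again
    if (it == endpoint_state_map.end()) {
        return std::nullopt;                 // entry was deleted meanwhile
    }
    return it->second;
}
```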

Fixes the following SIGSEGV:

Core was generated by `/usr/bin/scylla --log-to-syslog 1 --log-to-stdout
0 --default-log-level info --'.
Program terminated with signal SIGSEGV, Segmentation fault.
(this=<optimized out>) at /usr/include/c++/5/bits/stl_pair.h:127
127     in /usr/include/c++/5/bits/stl_pair.h
[Current thread is 1 (Thread 0x7f1448f39bc0 (LWP 107308))]

Fixes #2271

Message-Id: <529ec8ede6da884e844bc81d408b93044610afd2.1491960061.git.asias@scylladb.com>
2017-04-13 13:18:17 +03:00
Takuya ASADA
81c1b07bac dist: add offline installer
This introduces an offline installer generator.
It will generate a self-extracting archive which contains Scylla packages
and dependency packages.
Package installation starts automatically when the archive is executed.

Limitation: only CentOS is supported at this point.

Fixes #2268

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491997091-15323-1-git-send-email-syuu@scylladb.com>
2017-04-13 13:16:09 +03:00
Avi Kivity
ac48767146 Merge "tracing and cql3 patches" from Vlad
"This series was initially meant to only transition the keyspace based backend to work
on top of prepared statements but there were a few potential issues found on the way.

In addition, the original Tracing series has been expanded with a few patches in the cql3 layer
that improve the generic cql3 layer but are not obvious without the context of the
following Tracing patches.

The "main" patch contains a heavy rework of trace_keyspace_helper:
   - Use prepared statements for updating tables instead of manually constructing mutations:
      - We intentionally decrease the level of code robustness from "paranoid" to "normal".
      - The code gets a lot more simple, e.g. we don't need to cache columns definitions any more.
      - We are losing some performance here, but:
          - Tracing write is not in the fast path.
          - Tracing write events should be rare.
          - Currently the performance loss (for the actual write time of all trace records) for a "SELECT" query with a specific key is
            about 45%: 144us vs 99us."

* 'tracing_rework_using_prepared-v6' of github.com:cloudius-systems/seastar-dev:
  tracing: use prepared statement for updating tables
  tracing::trace_keyspace_helper: add a bad_column_family constructor that accepts an std::exception parameter
  tracing::trace_keyspace_helper: introduce a table_helper class
  tracing::trace_keyspace_helper: add static qualifier to  make_monotonic_UUID_tp() and elapsed_to_micros() methods
  tracing::tracing: allow slow query TTL only in the signed 32-bit integer range
  cql3::query_processor::prepare(): futurize the error case
  cql3::query_options: add a factory method for creation of options for a BATCH statement
  cql3::statements::batch_statement: add a constructor that doesn't receive the "bound_terms" value
  cql3::query_processor: use weak_ptr for passing the prepared statements around
2017-04-13 11:07:49 +03:00
Raphael S. Carvalho
a6f8f4fe24 compaction: do not write expired cell as dead cell if it can be purged right away
When compacting a fully expired sstable, we're not allowing that sstable
to be purged because an expired cell is *unconditionally* converted into a
dead cell. Why not check instead whether the expired cell can be purged
right away, using gc_before and the max purgeable timestamp?

Currently, we need two compactions to get rid of a fully expired sstable
whose cells could have always been purged.

look at this sstable with expired cell:
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 120,
        "liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z",
"ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true },
        "cells" : [
          { "name" : "country", "value" : "1" },
        ]

now this sstable data after first compaction:
[shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79
(~65% of original) in 229ms = 0.000328997MB/s.

  {
    ...
    "rows" : [
      {
        "type" : "row",
        "position" : 79,
        "cells" : [
          { "name" : "country", "deletion_info" :
{ "local_delete_time" : "2017-04-09T17:07:12Z" },
            "tstamp" : "2017-04-09T17:07:12.702597Z"
          },
        ]

now another compaction will actually get rid of data:
compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original)
in 1ms = 0MB/s. ~2 total partitions merged to 0

NOTE:
It's a waste of time to wait for the second compaction: the expired
cell could have been purged at the first one, since it satisfied
gc_before and the max purgeable timestamp.
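The purge decision described above can be sketched as follows (the struct and field names are illustrative, not Scylla's actual API):

```cpp
#include <cstdint>

// An expired cell may be dropped entirely, instead of being rewritten as
// a dead cell, when its deletion time is old enough to be GC'ed and no
// sstable outside the compaction holds older data it could shadow.
struct cell {
    int64_t timestamp;            // write timestamp
    int32_t local_deletion_time;  // seconds since epoch
};

bool can_purge(const cell& c, int32_t gc_before,
               int64_t max_purgeable_timestamp) {
    return c.local_deletion_time < gc_before
        && c.timestamp < max_purgeable_timestamp;
}
```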

Fixes #2249, #2253

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>
2017-04-13 10:59:19 +03:00
Vlad Zolotarov
f8956ba01a tracing: use prepared statement for updating tables
In addition to actually moving to using the prepared statements the changes also include:
   - Kill the cache_xxx() methods - the schema is going to be checked during the
     prepared statement creation and during its execution.
   - Move the caching of table ID and the prepared statement to the get_schema_ptr_or_create().
   - Rename: get_schema_ptr_or_create() -> cache_table_info().

After these changes we are less strict in our demands on system_traces table schemas, e.g.
if some column's type is not exactly as we expect but rather only "compatible" in the CQL sense,
we will tolerate this and will continue to write into that table.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 17:12:42 -04:00
Vlad Zolotarov
baf0289951 tracing::trace_keyspace_helper: add a bad_column_family constructor that accepts an std::exception parameter
An object built with this constructor will use the what() message from the
given exception in the final error message.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 17:04:54 -04:00
Vlad Zolotarov
98864c6c30 tracing::trace_keyspace_helper: introduce a table_helper class
This class contains a general table info and implements standard operations on this table:
   - Creation.
   - Info caching.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 17:04:48 -04:00
Pekka Enberg
8c45038729 cql3/statements/create_index_statement: Move target validation
Move index target validation to preserve the same code structure as
Apache Cassandra and simplify support for multiple index targets.
2017-04-12 20:50:09 +03:00
Pekka Enberg
528c33a05b cql3/statements/create_index_statement: Remove static column validation
Apache Cassandra supports secondary indices on static columns since
commit 9e74891 ("Add support for secondary indexes on static columns").
2017-04-12 20:46:29 +03:00
Pekka Enberg
975e1c8fc6 cql3/statements/create_index_statement: Extract validations
Extract specific validations to separate functions to preserve the same
structure as Apache Cassandra code and make it easier to add support for
multiple index targets.
2017-04-12 20:44:15 +03:00
Pekka Enberg
940d6de1b8 cql3/statements/create_index_statement: Kill bogus custom validation
Rejecting custom indices is bogus because it's just a configuration
mechanism like replication strategy, for example. Furthermore, it's
needed for SASI indices, which we likely need to be compatible with.
2017-04-12 20:17:09 +03:00
Pekka Enberg
cfadb70565 cql3/statements/create_index_statement: Add materialized view to validate()
Apache Cassandra does not support secondary indices on materialized
views so neither should we.
2017-04-12 20:15:20 +03:00
Pekka Enberg
1706f346f2 cql3/statements/create_index_statement: Remove validation
The validation was removed in Apache Cassandra commit 0626be8 ("New 2i
API and implementations for built in indexes"). Let's also remove it
from our code so that we remove one dependency to the obsolete db/index/
code.
2017-04-12 20:14:52 +03:00
Vlad Zolotarov
b4bf0735b8 tracing::trace_keyspace_helper: add static qualifier to make_monotonic_UUID_tp() and elapsed_to_micros() methods
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
aa1f8ccea4 tracing::tracing: allow slow query TTL only in the signed 32-bit integer range
Any TTL is eventually converted into the gc_clock::duration value, which
is based on the int32_t type. Limit the user-configurable node_slow_log TTL
to the same value range for consistency.
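A sketch of the validation described above (the function name is illustrative):

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// A user-supplied TTL in seconds must fit the signed 32-bit range that
// gc_clock::duration is based on; reject anything outside it.
int32_t validate_slow_query_ttl(int64_t ttl_seconds) {
    if (ttl_seconds < 0 ||
        ttl_seconds > std::numeric_limits<int32_t>::max()) {
        throw std::invalid_argument("slow query TTL out of int32 range");
    }
    return static_cast<int32_t>(ttl_seconds);
}
```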

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
1685cefa57 cql3::query_processor::prepare(): futurize the error case
Make sure that errors are reported in a form of an exceptional future and
not by a direct exception throwing.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
fcef9d3b05 cql3::query_options: add a factory method for creation of options for a BATCH statement
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
75fbc7c558 cql3::statements::batch_statement: add a constructor that doesn't receive the "bound_terms" value
This constructor should be used when we know that there are no bound terms in the current
batch statement.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
ff55b76562 cql3::query_processor: use weak_ptr for passing the prepared statements around
Use seastar::checked_ptr<weak_ptr<prepared_statement>> instead of shared_ptr for passing prepared statements around.
This allows an easy tracking and handling of statements invalidation.

This implementation will throw an exception every time an invalidated
statement reference is dereferenced.
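A minimal sketch of the idea, using std::weak_ptr and std::shared_ptr rather than seastar's own weak_ptr and checked_ptr:

```cpp
#include <memory>
#include <stdexcept>
#include <utility>

// Accessing an invalidated prepared statement throws instead of touching
// freed memory.
template <typename T>
class checked_ptr {
    std::weak_ptr<T> _ptr;
public:
    explicit checked_ptr(std::weak_ptr<T> p) : _ptr(std::move(p)) {}

    // Throws if the pointed-to object has been invalidated.
    std::shared_ptr<T> get() const {
        auto sp = _ptr.lock();
        if (!sp) {
            throw std::runtime_error("prepared statement was invalidated");
        }
        return sp;
    }
};
```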

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:03 -04:00
Pekka Enberg
ecb8ee4efd cql3/statements: Cleanup create_index_statement.cc
Use namespaces and fix formatting issues to make the source file easier
on the eyes.

Message-Id: <1491977964-26629-1-git-send-email-penberg@scylladb.com>
2017-04-12 16:48:45 +03:00
Avi Kivity
db73cc045f Merge seastar upstream
* seastar e899c0b...2eec212 (4):
  > Merge "add seastar::checked_ptr class" from Vlad
  > resource: reduce default_reserve_memory size to fit low memory environment
  > scripts: posix_net_conf.sh: add --tune net parameter to perftune.py invocation
  > core: weak_ptr: add a weak_ptr(std::nullptr_t) constructor
2017-04-12 13:49:21 +03:00
Paweł Dziepak
0318dccafd lsa: avoid unnecessary segment migrations during reclaim
segment_zone::migrate_all_segments() was trying to migrate all segments
inside a zone to the other one hoping that the original one could be
completely freed. This was an attempt to optimise for throughput.

However, this may unnecessarily hurt latency if the zone is large but
only a few segments are required to satisfy the reclaimer's demands.
Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>
2017-04-11 08:55:29 +02:00
Gleb Natapov
b4c368a6bc storage_proxy: update correct statistics on range reads
Fixes #2167

Message-Id: <20170405094119.GM8197@scylladb.com>
2017-04-09 18:16:06 +03:00
Glauber Costa
a808c32676 scylla_util: fix issues with cpuset handling
While cpuset.conf is supposed to be set before this is used, not having
a cpuset.conf at all is a valid configuration. The current code will
raise an exception in this case, but it shouldn't.

Also, as noted by Amos, atoi() is not available as a global symbol. Most
invocations were safe, calling string.atoi(), but one of them wasn't.

This patch replaces all usages of atoi() with int(), which is more
portable anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170407172937.17562-1-glauber@scylladb.com>
2017-04-09 16:04:02 +03:00
Glauber Costa
f842aeb07a mark i3 as a supported instance during login
We have recently fixed the ami init scripts to mark i3 as a supported
instance. However, the code to detect whether or not the instance is
supported is duplicated, and called from multiple locations. That means
that when the user logs in, they will see the instance as not supported,
as the test is coming from a different source.

This patch moves it to the scylla_lib.sh utilities script, so we can
share it, and make sure it is right for all locations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406201137.8921-1-glauber@scylladb.com>
2017-04-09 12:35:35 +03:00
Glauber Costa
7ef8c6aaec centos: add runtime dependency for perftune script
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406172949.13264-1-glauber@scylladb.com>
2017-04-09 11:36:26 +03:00
Glauber Costa
ca8ca3b823 scylla_io_setup: change permissions
The script ended up with the default permissions, which lack the
executable bit. Change it, so it can be executed directly.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170407173109.17870-1-glauber@scylladb.com>
2017-04-09 11:36:13 +03:00
Pekka Enberg
a9ad5cc560 dist/redhat: Add node_health_check to package 2017-04-08 08:12:57 +03:00
Tomer Sandler
c49f944483 dist: Add node_health_check script
This patch adds a script for running an automated node health check.

[ penberg: Fold into single commit ]
2017-04-08 08:06:30 +03:00
Avi Kivity
3b19ea8796 Update scylla-ami submodule
* dist/ami/files/scylla-ami 5d73a71...f10db69 (2):
  > use latest repo for amzn kernel
  > do not use duplicated code to test for instance support status
2017-04-07 12:28:45 +03:00
Avi Kivity
f579e4cc46 Merge seastar upstream
* seastar c5dd395...e899c0b (5):
  > perftune: remove psutil dependency
  > metrics: change push_back({...}) to emplace_back(...)
  > prometheus: return the metric prefix
  > Merge "scripts/perftune.py: add disks tuning" from Vlad
  > Use safe name for metric family
2017-04-06 17:10:19 +03:00
Glauber Costa
f4502bfb79 ami: update packer to 1.0.0 so ENA gets enabled
Older versions of packer do not support ENA, and when faced
with the option

      "enhanced_networking": true,

will only actually enable it for the older 82599 VF instances.
Fortunately, packer 1.0 already supports it, and all we have to do is
update it.

While we are at it, let's check if the file is legit before using a
random file we have downloaded from the internet, to avoid breaching our
build process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406123952.14708-1-glauber@scylladb.com>
2017-04-06 16:39:19 +03:00
Glauber Costa
039f8d5994 dist/redhat: install the new scylla_util.py file
For the debian package, the files don't have to be listed individually,
but for RPMs they do.

The rpm builds are currently failing with:

error: Installed (but unpackaged) file(s) found:
  /usr/lib/scylla/scylla_util.py

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406012336.21081-1-glauber@scylladb.com>
2017-04-06 10:20:32 +03:00
Avi Kivity
ecf5781597 Update scylla-ami submodule
* dist/ami/files/scylla-ami 9e8c36d...5d73a71 (1):
  > support i3 instances
2017-04-05 18:52:46 +03:00
Amnon Heiman
6c1858b275 API:storage_service should support metrics load
Following C* API there are two APIs for getting the load from
storage_service:
/storage_service/metrics/load
/storage_service/load

This patch adds the implementation for
/storage_service/metrics/load

The alternative is to drop one of the APIs and modify the JMX
implementation to use the same API.

Fixes #2245

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170401181520.19506-1-amnon@scylladb.com>
2017-04-05 18:14:19 +03:00
Tomasz Grabiec
d523c60629 sstables: Push fragments from mp_row_consumer so that parser is interrupted less
Currently we return proceed::no after every mutation_fragment which is
to be consumed. This forces the parser to save and reload its state
often. This can be avoided if we pushed the fragments directly from
mp_row_consumer; then we would return proceed::no only when the buffer
fills up.
tests/perf/perf_fast_forward shows 15% increase in throughput of a large partition scan,
from 1.34M frag/s to 1.55M frag/s.
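The pattern described above can be sketched like this (the types are illustrative stand-ins for the sstable reader's):

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

enum class proceed { yes, no };

// Instead of stopping the parser after every fragment, push fragments
// into the reader's buffer and stop only once the buffer is full.
struct fragment_pusher {
    std::deque<std::string> buffer;
    std::size_t max_buffered = 8;

    proceed on_fragment(std::string frag) {
        buffer.push_back(std::move(frag));
        // proceed::no forces the parser to save and reload its state,
        // so return it only when we really have to stop.
        return buffer.size() < max_buffered ? proceed::yes : proceed::no;
    }
};
```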

Message-Id: <1490882700-22684-1-git-send-email-tgrabiec@scylladb.com>
2017-04-05 18:10:54 +03:00
Takuya ASADA
da75ce694a dist: migrate python2 scripts to python3
Now we can package python3 .py script on CentOS, no need to keep using
python2 so migrate them to python3.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491402479-7703-1-git-send-email-syuu@scylladb.com>
2017-04-05 17:35:35 +03:00
Takuya ASADA
610bc31b04 dist/redhat/scylla.spec.in: stop compiling .py on rpm
rpmbuild tries to compile any *.py by default, but it causes compilation
errors on python3 code when python2 is the system default (CentOS, RHEL).

So skip compiling it, drop .pyc / .pyo from the package.

Fixes #2235

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1490859768-18900-1-git-send-email-syuu@scylladb.com>
2017-04-05 12:15:22 +03:00
Avi Kivity
58318db4bb Merge "Rewrite scylla_io_setup in python" from Glauber
"Also, this new version supports i3.
Number of requests for i3 is obtained similarly to i2: I have run tests
for a single disk, and then we'll take the amount of disks into account. Other
possible limits are also taken into account, like the max per-shard seastar limit
of 128 in-flight request, and the per-disk limit obtained by sysfs."

* 'python-io-setup' of https://github.com/glommer/scylla:
  rewrite scylla_io_setup in python
  scripts: add python module with common utilities
2017-04-05 10:39:54 +03:00
Avi Kivity
88637c86c2 Update ami submodule
* dist/ami/files/scylla-ami 407e8f3...9e8c36d (1):
  > Switch to Amazon Linux's kernel
2017-04-04 21:55:36 +03:00
Glauber Costa
2fa698ee95 rewrite scylla_io_setup in python
We do it using the new scylla_util.py library. As we do it, we also
enable i3 support.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-04-04 14:44:52 -04:00
Glauber Costa
ba7010b7a5 scripts: add python module with common utilities
As we convert more stuff to python, we'll have more opportunities for
sharing code between them. We already do that for the bash scripts with
a file "scylla_lib.sh". We'll do the same for python.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-04-04 14:42:19 -04:00
Vlad Zolotarov
c26799c9b0 config: enforce the 'stop' value for commit_failure_policy/disk_failure_policy
Fixes #2246

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1491246164-26612-1-git-send-email-vladz@scylladb.com>
2017-04-04 16:46:36 +03:00
Takuya ASADA
4262edd843 dist: use distribution standard fstrim script instead of our custom one
We recently introduced an fstrim cronjob / systemd timer unit, but some
distributions already have their own.
So let's use them when possible.

Fixes #2233

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491226727-10507-1-git-send-email-syuu@scylladb.com>
2017-04-04 16:44:45 +03:00
Takuya ASADA
72696eff22 dist/common/scripts/scylla_setup: add 'cancel' feature on RAID disk selector prompt
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491217320-10211-2-git-send-email-syuu@scylladb.com>
2017-04-04 15:35:23 +03:00
Takuya ASADA
4eee3dd778 dist/common/scripts/scylla_setup: skip listing DVD drive on block device list for RAID
Fixes #2230

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491217320-10211-1-git-send-email-syuu@scylladb.com>
2017-04-04 15:35:22 +03:00
Takuya ASADA
b087616a6c dist/debian/debian/scylla-server.upstart: export SCYLLA_CONF, SCYLLA_HOME
We are sourcing the sysconfig file on upstart, but forgot to load its
variables into the environment.
So export them.

Fixes #2236

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491209505-32293-1-git-send-email-syuu@scylladb.com>
2017-04-03 16:34:20 +03:00
Pekka Enberg
57c4bed420 Revert "dist: add --options-file /etc/scylla/scylla.yaml on sysconfig"
This reverts commit 58e628eb3d. Takuya says it
does not fix issue #2236.
2017-04-03 16:33:58 +03:00
Takuya ASADA
58e628eb3d dist: add --options-file /etc/scylla/scylla.yaml on sysconfig
After RAID devices are mounted to /var/lib/scylla, scylla-server isn't able to
find the ./conf/ directory since we haven't created symlinks on the volume.

Instead of creating a symlink in scylla_raid_setup, let's specify the scylla.yaml
path as a program argument.

Fixes #2236

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491056110-1078-1-git-send-email-syuu@scylladb.com>
2017-04-02 11:28:27 +03:00
Vlad Zolotarov
2d8fcde695 init: add a proper message when there is a bad 'seeds' configuration
Fixes #2193

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490912678-32004-1-git-send-email-vladz@scylladb.com>
2017-04-02 10:41:52 +03:00
Avi Kivity
d03207e939 Merge seastar upstream
* seastar 2ebe842...c5dd395 (3):
  > configure.py: Fix unrecognized option error
  > Merge fixes for fstream slow start from Paweł
  > Merge "A cleaner safer metrics layer" from Amnon
2017-04-02 10:17:40 +03:00
Tzach Livyatan
4efee3432b dist/ami: run nodetool status on each login
Running nodetool status on each login to Scylla AMI helps in three ways:
- give the user a quick view of the node and cluster status beyond the current "scylla is active"
- hint to the user about the nodetool and how to use it
- move the first, slow, run of nodetool to the login phase, making the second interactive run much faster

On the down side, it does slow the login by a few seconds.

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
[ penberg: fix formatting ]
Message-Id: <20170330081711.22038-1-tzach@scylladb.com>
2017-03-30 13:45:48 +03:00
Vlad Zolotarov
761a5eb72f scripts: add fix_system_distributed_tables.py
This script validates schemas of distributed system keyspaces:
system_traces and system_auth.

It tries to add the missing columns and checks that the existing columns
have expected types.

In case of any problem the corresponding info message is printed and non-zero exit
status is returned.

Extra columns are ignored.

The validation function may also be called from another Python program.
It returns True in case of success and False otherwise.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490195645-29905-1-git-send-email-vladz@scylladb.com>
2017-03-29 19:15:05 +03:00
Avi Kivity
5b530aa464 Merge "Use promoted index for skipping in sstable mutation readers" from Tomasz
"sstable_streamed_mutation::fast_forward_to() is changed to use promoted index
(via index_reader) to optimize skipping in large partitions.

In addition to that, sstable mutation_reader is changed to use the index
to skip to the next partition.

Performance impact was evaluated using newly added tests/perf/perf_fast_forward

What's beyond this series:

  - Using index_reader for single-partition reads as well

  - Using index_reader for skipping across ranges in clustering restrictions"

* tag 'tgrabiec/skip-within-partition-using-index-v2' of github.com:cloudius-systems/seastar-dev: (47 commits)
  tests: Add performance test for fast forwarding of sstable readers
  tests: Allow starting cql_test_env on pre-existing data
  config: Allow specifying source when setting value
  tests: sstable: Add test for fast forwarding within partition using index
  sstables: sstable_streamed_mutation: use index in fast_forward_to()
  sstables: Store parsed promoted index in index_entry
  sstables: Add trace-level logging for sstable consumption
  sstables: Define deletion_time earlier
  sstables: Make parsing throw exception on malformed promoted index
  tests: Add tests for ordering of position_in_partition relative to composites
  position_range: Introduce all_clustered_rows() factory method
  position_in_partition: Introduce for_key()/after_key() factory methods
  position_in_partition: Add factory methods for positions around all rows
  position_in_partition: Introduce for_range_start()/for_range_end()
  position_in_partition: Fix friendship declaration
  keys: Introduce is_empty() for prefixes
  position_in_partition: Make comparable with composites
  types: Enhance lexicographical comparators
  compound_compat: Accept marker value in serialize_value()
  compound_compat: Add trichotomic comparator
  ...
2017-03-29 19:01:12 +03:00
Raphael S. Carvalho
023031b0c8 compaction: lcs: fix functionality to feed starved levels
quick introduction to level starvation:
high levels may be left uncompacted (thus starved) for a long time if user
makes something that make they contain little data, such as cleanup or change
of max sstable size (default 160M). Leveled strategy handles this problem as
follow: consider we're compacting L1 to L2. If L3 is starved, we look for one
of its sstable that is fully contained in token range of candidates L1->L2,
so that we won't end up with an overlapping in L2.

now the problem:
the functionality isn't working properly now because range of candidates is
being incorrectly calculated due to an accident when converting the code to
C++. It won't cause an overlap because it's actually being more restrictive
about which sstable from starved level can be used.
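The containment rule described above reduces to a simple interval check; a hedged sketch (token ranges simplified to closed integer intervals, which is not Scylla's real representation):

```cpp
#include <cstdint>
#include <utility>

using token_range = std::pair<int64_t, int64_t>;  // [first, last]

// An sstable from a starved level may join the compaction only if its
// token range is fully contained in the range covered by the current
// candidates, so the output level stays non-overlapping.
bool fully_contained(const token_range& candidates,
                     const token_range& starved_sst) {
    return candidates.first <= starved_sst.first
        && starved_sst.second <= candidates.second;
}
```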

A test case was added to confirm the problem.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170328223753.15398-1-raphaelsc@scylladb.com>
2017-03-29 18:59:46 +03:00
Takuya ASADA
5e0cb39db6 dist: follow DPDK tools directory name change
Fixes #2234

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1490776630-6626-1-git-send-email-syuu@scylladb.com>
2017-03-29 11:38:39 +03:00
Tomasz Grabiec
97742fd4c2 test.py: Enable stack trace on UBSAN errors
Message-Id: <1490769716-10217-1-git-send-email-tgrabiec@scylladb.com>
2017-03-29 11:08:05 +03:00
Tomasz Grabiec
7fd724821b tests: Add performance test for fast forwarding of sstable readers 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
543a484d78 tests: Allow starting cql_test_env on pre-existing data 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
2c775bbb6e config: Allow specifying source when setting value
So that is_set() will be true for that option. Needed in tests which
set some config options in a higher layer, where lower layers then detect
whether the option was set before applying its default.
2017-03-28 18:34:55 +02:00
Tomasz Grabiec
f1aca6d116 tests: sstable: Add test for fast forwarding within partition using index 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
3fbc0bed6e sstables: sstable_streamed_mutation: use index in fast_forward_to() 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5b36976bf0 sstables: Store parsed promoted index in index_entry 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
a2a8312c78 sstables: Add trace-level logging for sstable consumption 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5af815bf20 sstables: Define deletion_time earlier 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5e34743882 sstables: Make parsing throw exception on malformed promoted index
Will be easier to propagate failure to upper layers once parsing is
reused in the index_reader.

The old behavior of ignoring parsing failures is preserved, but the
error is logged now.
2017-03-28 18:34:55 +02:00
Tomasz Grabiec
b40b20387a tests: Add tests for ordering of position_in_partition relative to composites 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5b813898bc position_range: Introduce all_clustered_rows() factory method 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
622713be60 position_in_partition: Introduce for_key()/after_key() factory methods 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e27fa712f5 position_in_partition: Add factory methods for positions around all rows 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
b90275f8e3 position_in_partition: Introduce for_range_start()/for_range_end() 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5c29f4dd04 position_in_partition: Fix friendship declaration 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
212a021fc6 keys: Introduce is_empty() for prefixes 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
2ff6c1705b position_in_partition: Make comparable with composites 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
4e7fe40a70 types: Enhance lexicographical comparators
They now accept optional lexicographical_relation which can be used
to alter position of the element relative to elements prefixed by it.

Example. Let's consider lexicographical ordering on strings. The
position of "bc" in a sample sequence is affected by
lexicographical_relation as follows:

   aa
   aaa
   b
   ba
 --> before_all_prefixed
   bc
 --> before_all_strictly_prefixed
   bca
   bcd
 --> after_all_prefixed
   bd
   bda
   c
   ca
2017-03-28 18:10:39 +02:00
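A minimal sketch of such a comparator (in Python, with hypothetical names; the real code operates on byte sequences, not strings): the relation decides where a key sorts relative to elements it is a prefix of.

```python
from enum import Enum


class Relation(Enum):
    BEFORE_ALL_PREFIXED = 0
    BEFORE_ALL_STRICTLY_PREFIXED = 1
    AFTER_ALL_PREFIXED = 2


def tri_compare(key, relation, element):
    """Return -1 if (key, relation) sorts before element, 1 if after.

    When element is prefixed by key, the relation decides the order;
    otherwise plain lexicographical comparison applies.
    """
    if element.startswith(key):
        if relation is Relation.BEFORE_ALL_PREFIXED:
            return -1
        if relation is Relation.BEFORE_ALL_STRICTLY_PREFIXED:
            # after the key itself, before its strict extensions
            return 1 if element == key else -1
        return 1  # AFTER_ALL_PREFIXED: after every element prefixed by key
    return -1 if key < element else 1
```

With the sample sequence above, ("bc", BEFORE_ALL_PREFIXED) sorts before "bc" itself, ("bc", BEFORE_ALL_STRICTLY_PREFIXED) after "bc" but before "bca", and ("bc", AFTER_ALL_PREFIXED) after "bcd" but before "bd".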
Tomasz Grabiec
6c6be0f7e4 compound_compat: Accept marker value in serialize_value() 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
1331d10811 compound_compat: Add trichotomic comparator 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
2121c2ee8a compound_compat: Make composites printable 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
38e14ab3c8 compound_compat: Introduce composite::serialize_static()
Generalized from static_prefix().
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
bef677b57d compound_compat: Introduce composite_view::last_eoc() 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
18a057aa81 compound_compat: Return composite from serialize_value()
To make the code more type-safe. Also, mark constructor from bytes
explicit.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
680ffd20f5 compound: Use const bytes_view as iterator's value type
The iterator doesn't really allow modifying the underlying component.

This change enables using the iterator with boost::make_iterator_range() and
boost::range::join(), which get utterly confused otherwise.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
123b102dd6 sstables: Skip to next partition using index
Slicing front of a very large partition:

Before:

 offset  read      time [s]     frags     frag/s    aio      [KiB] blocked dropped    cpu
 0       1         0.110960         1          9    992     126956     924       0  92.4%

After:

 offset  read      time [s]     frags     frag/s    aio      [KiB] blocked dropped    cpu
 0       1         0.000784         1       1276      3        344       2       1  37.3%
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
a9252dfc58 sstables: Use separate index readers for lower and upper bounds
So that lower bound can be advanced within the range.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
27d86dfe18 sstables: Enable skipping to cells at data_consume_context level 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
aad943523a sstables: index_reader: Add trace-level logging 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
388315c1ff sstables: Expose index metrics 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
1dbd2e239e sstables: index_reader: Share index lists among other index readers
Direct motivation for this is to be able to use two index readers from
a single mutation reader, one for lower bound of the range and one for
the upper bound of the range, without sacrificing optimization of
avoiding index reads when forwarding to partition ranges which are
close by. After the change, all index readers of a given sstable will
share index buffers, so lower bound reader can reuse the page read by
the upper bound reader.

The reason for using two readers will be so that we are able to skip
inside the partition range, not only outside of it. This is not
possible if we use the same index reader to locate the upper bound of
the range, because we may only advance the cursor.
2017-03-28 18:10:39 +02:00
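The sharing can be sketched like this (a simplified illustration with hypothetical names; the real implementation shares index buffers between readers of one sstable): two cursors consult one shared page cache, so the lower-bound cursor can reuse a page the upper-bound cursor already read.

```python
class shared_index_pages:
    """A page cache shared by all index readers of one sstable."""

    def __init__(self, read_page):
        self._read_page = read_page  # performs the actual (expensive) I/O
        self._pages = {}
        self.reads = 0               # counts real reads, for illustration

    def get(self, page_no):
        # Only the first reader to touch a page pays for the I/O;
        # subsequent readers reuse the cached copy.
        if page_no not in self._pages:
            self.reads += 1
            self._pages[page_no] = self._read_page(page_no)
        return self._pages[page_no]


class index_cursor:
    """One of possibly several cursors over the same shared pages."""

    def __init__(self, pages):
        self._pages = pages

    def entries_in_page(self, page_no):
        return self._pages.get(page_no)
```

Two cursors (lower and upper bound) built over the same shared_index_pages trigger only one read for a page they both visit.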
Tomasz Grabiec
0635d74e17 sstables: Make index_entry copyable
Needed to make the index_list copyable, which is going to be needed to
implement legacy get_index_entries() which returns by value, after
index sharing is implemented.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e36979da47 sstables: index_reader: Use sstable's schema
Makes for a simpler interface.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e3e2f037bb sstables: index_reader: Refactor around the concept of a cursor
Index reader already can be queried only with monotonic positions, so
the concept of a cursor is ingrained. Making it explicit will make it easier
to define behavior for forwarding within the partition.

After the change:

 - lower_bound() is renamed to advance_to() and doesn't return
   the position, only advances the cursor

 - data file position for partition under cursor can be obtained
   at any time with data_file_position()
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
27862fa8f6 sstables: index_reader: Narrow down summary range during lookup
Positions passed to lower_bound() must be non-decreasing, so summary
indexes as well.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
02ace99798 sstables: index_reader: Change lookup to work on ring_position_view
In preparation for changing the interface to work not only with ranges.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
cd295e9926 sstables: Avoid moving an sstable
In preparation for adding non-movable members.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5edb427873 sstables: Remove private constructor
To reduce duplication.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
705bd6da1a sstables: Remove unused method 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
a7301a702f tests: Add missing blocking on fast_forward_to() 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5fe14735e8 tests: dht: Test ring_position_comparator 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
ff6cca6e9e tests: Add utility for checking total orders 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
d4b6e430ed dht: Introduce ring_position_view 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
55a7cceef5 dht: Move comparison logic from ring_position::tri_compare() to ring_position_comparator
It will soon define common ordering for many objects, not just
ring_position.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
d5e704ca1e sstables: Make key_view constructor from bytes_view explicit 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
65a8920b25 dht: Make min/max tokens capturable by reference
So that they can be later used in views.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
f6d2a07422 config: Warn on use of [[deprecated]] instead of failing 2017-03-28 18:10:39 +02:00
Calle Wilund
b12b65db92 commitlog/replayer: Bugfix: minimum rp broken, and cl reader offset too
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as an sstable at all, we must tag it as zero-rp, and
make the whole shard for it start at the same zero).

This is bad in itself, because it can cause data loss. It does not cause
crashing, however. But it did uncover another old, lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream at the file offset, then tried
to read the file header and magic number from there -> boom, error.

Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.

I.e. three fixes:
* Reinstate min position guarding for unencountered CF:s
* Fix stream creation in CL reader
* Fix segment map iterator use.

v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>
2017-03-28 10:32:28 +02:00
Duarte Nunes
53014bd762 mutation_source_test: Ensure unique collection elements
Duplicate elements are illegal in collections, so we ensure they only
contain unique ones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170327161149.8938-4-duarte@scylladb.com>
2017-03-27 18:44:11 +02:00
Duarte Nunes
94d568924d mutation_source_test: Sort collection elements
Ref #1607

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170327161149.8938-3-duarte@scylladb.com>
2017-03-27 18:43:58 +02:00
Duarte Nunes
4963902922 mutation_source_test: Remove extra randomness source
This patch ensures we generate UUIDs using the same randomness source
as all the other values we randomly generate, so that we can get a
deterministic run from the seeds we print.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170327161149.8938-2-duarte@scylladb.com>
2017-03-27 18:43:44 +02:00
Takuya ASADA
b84828b487 dist/common/scripts/scylla_fstrim: don't abort the program when a disk doesn't support TRIM
Do not abort the program until fstrim has been run on all directories.

Fixes #2220

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1490595397-19130-1-git-send-email-syuu@scylladb.com>
2017-03-27 11:43:19 +03:00
Avi Kivity
27c42359bc Merge seastar upstream
* seastar 6b21197...2ebe842 (6):
  > Merge "Various improvements to execution stages" from Paweł
  > app-template: allow apps to specify a name for help message
  > bool_class: avoid initializing object of incomplete type
  > app-template: make sure we can still get help with required options
  > prometheus: Http handler that returns prometheus 0.4 protobuf or text format
  > Update DPDK to 17.02

Includes patch from Pawel to adjust to updated execution_stage interface.
2017-03-26 10:50:21 +03:00
Takuya ASADA
e7697c37b2 dist/common/scripts/scylla_setup: warn twice before constructing RAID volume
Since RAID construction has the potential to destroy user data, warn twice
before executing scylla_raid_setup.

Fixes #1346

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487891797-13109-1-git-send-email-syuu@scylladb.com>
2017-03-26 10:36:23 +03:00
Glauber Costa
f7f187a7f3 docker: do not touch yaml during startup
Users sometimes need to run their own yaml configuration files, and it
is currently annoying to deploy modified files on docker.

One possible solution is to bind mount the file into the docker
container using the -v switch, just like we already do for the data
volume.

The problem with the aforementioned approach is that we have to change
the yaml file to insert the addresses, and that will change the file in
the host (or fail to happen, if we bind mount it read-only).

The solution I am proposing is to avoid touching the yaml file inside
the container altogether. Instead, we can deploy the address-related
arguments that we currently write to the yaml file as Scylla options.

Fixes #2113

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1490195141-19940-1-git-send-email-glauber@scylladb.com>
2017-03-23 16:52:40 +02:00
Pekka Enberg
c5b1908e03 dist/docker: Use stdout as logging output
If early startup fails in docker-entrypoint.py, the container does not
start. It's therefore not very helpful to log to a file _within_ the
container...

Message-Id: <1490275943-23590-1-git-send-email-penberg@scylladb.com>
2017-03-23 16:48:34 +02:00
Calle Wilund
c3a510a08d commitlog_replayer: Do proper const-lookup of min positions for shards
Fixes #2173

Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.

Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.

Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
2017-03-22 17:57:09 +02:00
Amnon Heiman
064f5e1b63 row_cache: switch to the metrics layer registration
This patch moves the row_cache metrics registration from collectd to the
metric layer.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170321143812.785-3-amnon@scylladb.com>
2017-03-21 16:42:58 +02:00
Amnon Heiman
a6a13865bf API: remove unneeded refrences to collectd
This patch removes left over references to the collectd from the API.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170321143812.785-2-amnon@scylladb.com>
2017-03-21 16:42:57 +02:00
Vlad Zolotarov
79978c156e transport::server: don't report a Tracing session ID unless requested
Don't report a Tracing session ID unless the current query had a Tracing bit in its
flags.

Although the current master's behaviour is legal, it's suboptimal, and some clients are sensitive to it.
Let's fix that.

Fixes #2179

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490063752-8915-1-git-send-email-vladz@scylladb.com>
2017-03-21 13:57:08 +00:00
Vlad Zolotarov
9dd5b5762d dist: install seastar/scripts/perftune.py together with posix_net_conf.sh
posix_net_conf.sh is currently a wrapper for the perftune.py script, and
perftune.py has to be in the same directory as posix_net_conf.sh.


Fixes #2176.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490020882-32361-1-git-send-email-vladz@scylladb.com>
2017-03-21 12:31:50 +02:00
Amnon Heiman
7572addfbf column_family: metrics should be registered once
The column_family constructor uses delegation; as such, only the actual
constructor implementation should contain a call to register the
metrics.
The current implementation ends up re-registering the metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170320140817.22214-1-amnon@scylladb.com>
2017-03-21 11:20:29 +02:00
Pekka Enberg
85a127bc78 dist/docker: Expose Prometheus port by default
This patch exposes Scylla's Prometheus port by default. You can now use
the Scylla Monitoring project with the Docker image:

  https://github.com/scylladb/scylla-grafana-monitoring

To configure the IP addresses, use the 'docker inspect' command to
determine Scylla's IP address (assuming your running container is called
'some-scylla'):

  docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla

and then use that IP address in the prometheus/scylla_servers.yml
configuration file.

Fixes #1827

Message-Id: <1490008357-19627-1-git-send-email-penberg@scylladb.com>
2017-03-20 15:29:52 +02:00
Amos Kong
468df7dd5f scylla_setup: match '-p' option of lsblk with strict pattern
On Ubuntu 14.04, lsblk doesn't have the '-p' option, but
`scylla_setup` tries to get the block list with `lsblk -pnr` and
triggers an error.

The current simple pattern matches anywhere in the help content, so it
might match the wrong options.
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e -p
   -m, --perms          output info about permissions
   -P, --pairs          use key="value" output format

Let's use strict pattern to only match option at the head. Example:
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e '^\s*-D'
   -D, --discard        print discard capabilities

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <4f0f318353a43664e27da8a66855f5831457f061.1489712867.git.amos@scylladb.com>
2017-03-20 08:10:35 +02:00
Raphael S. Carvalho
7deeffc953 database: serialize sstable cleanup
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.

Fixes #192.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>
2017-03-19 12:33:03 +02:00
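The change can be illustrated with a small sketch (hypothetical helper names; Scylla itself uses seastar primitives, not Python threads): a width-1 gate turns the previously parallel per-shard cleanups into a serial queue, so at most one sstable is being rewritten, and therefore temporarily duplicated on disk, at a time.

```python
import threading


def cleanup_all(sstables, cleanup_one):
    """Run cleanup_one over each sstable, serialized by a width-1 gate.

    The semaphore bounds the extra disk space needed to roughly one
    sstable rewrite at a time, instead of all rewrites at once.
    """
    gate = threading.Semaphore(1)
    results = []

    def run(sst):
        with gate:
            results.append(cleanup_one(sst))

    threads = [threading.Thread(target=run, args=(s,)) for s in sstables]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Even if many cleanups are submitted concurrently, only one runs inside the gate at any moment.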
Tomasz Grabiec
bb0ce5d8fe Merge "Ensure base and view schema versions match" from Duarte
The mapping between a base table update and a view update is schema
dependent, so we need to ensure the view schema versions match the
base schema version. For example, we match base columns to view
columns by name, so we need to ensure the base and view schemas we're
using for writting are isolated with respect to a previous alter
table statement.

We thus need to match base schema versions with view schema versions,
and we need to do so atomically to ensure that when one fiber sees a
schema, it also sees the complete set of corresponding view schemas.
This series ensures the schemas modified as a result of an alter
table statement are published atomically, under the schema lock. This
way, all the schemas referenced by the database are consistent with
each other when they are observed by other fibers.

Finally, we upgrade the mutation schema before generating the view
updates, to ensure it matches the most recent view schemas the base
replica knows about, registered in the database.

The db::view::view class was replaced by a set of non-member
functions, with its state, which used to reflect only the most recent
schema version, being moved to a new view_info class.
2017-03-17 12:40:00 +01:00
Duarte Nunes
b27da688f9 mutation: Remove dead get_cell() function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170316234843.23130-1-duarte@scylladb.com>
2017-03-17 11:18:23 +02:00
Pekka Enberg
91b9e0d914 Update scylla-ami submodule
* dist/ami/files/scylla-ami eedd12f...407e8f3 (1):
  > scylla_create_devices: check block device is exists

Fixes #2171
2017-03-17 11:13:07 +02:00
Tomasz Grabiec
3609665b19 lsa: Fix debug-mode compilation error
By moving definitions of setters out of #ifdef
2017-03-16 18:23:05 +01:00
Tomasz Grabiec
88e7b3ff79 lsa: Ensure can_allocate_more_memory() always leaves a gap above seastar's min_free_memory()
One of the goals of can_allocate_more_memory() is to prevent depleting
seastar's free memory close to its minimum, leaving a head room above
that minimum so that standard allocations will not cause reclamation
immediately. Currently the function doesn't take into accoutn actual
threshold used by the seastar allocator, so there could be no gap or
even could go below the minimum.

Fix that by ensuring there's always a gap above min_free_memory().

min_gap was reduced to 1 MiB so that low memory setups are not
impacted significantly by the change.
Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>
2017-03-16 12:42:50 +00:00
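The check described above can be sketched like this (a simplified illustration with hypothetical names; the real function lives in Scylla's LSA code): a segment allocation is allowed only if it leaves the stated gap above the allocator's low water mark.

```python
MIN_GAP = 1 << 20  # 1 MiB of head room above the allocator's minimum


def can_allocate_more_memory(free_memory, min_free_memory, segment_size):
    # Only allow an LSA segment allocation if, afterwards, free memory
    # still sits at least MIN_GAP above the seastar allocator's low water
    # mark, so standard allocations don't immediately trigger reclamation.
    return free_memory - segment_size >= min_free_memory + MIN_GAP
```

With a 20 MiB low water mark and 1 MiB segments, an allocation is refused once it would leave less than 21 MiB free.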
Tomasz Grabiec
17ede24a77 Update seastar submodule
* seastar 4d25b85...6b21197 (3):
  > core: memory: Expose control of the free memory low water mark
  > scripts: add perftune.py
  > tutorial: make network examples work on multi-core
2017-03-16 13:32:45 +01:00
Pekka Enberg
3afd7f39b5 cql3: Wire up functions for floating-point types
Fixes #2168
Message-Id: <1489661748-13924-1-git-send-email-penberg@scylladb.com>
2017-03-16 11:04:59 +00:00
Avi Kivity
434a4fee28 Merge "tests: Use allocating_section in lsa_async_eviction_test" from Tomasz
"The test allocates objects in batches (allocation is always under a reclaim
lock) of ~3MiB and assumes that it will always succeed because if we cross the
low water mark for free memory (20MiB) in seastar, reclamation will be
performed between the batches, asynchronously.

Unfortunately that's prevented by can_allocate_more_memory(), which fails
segment allocation when we're below the low water mark. LSA currently doesn't
allow allocating below the low water mark.

The solution which is employed across the code base is to use allocating_section,
so use it here as well.

Exposed by recent consistent failures on branch-1.7."

* 'tgrabiec/fix-lsa-async-eviction-test' of github.com:cloudius-systems/seastar-dev:
  tests: lsa_async_eviction_test: Allocate objects under allocating section
  lsa: Allow adjusting reserves in allocating_section
2017-03-16 12:44:14 +02:00
Tomasz Grabiec
cefb6b604a tests: lsa_async_eviction_test: Allocate objects under allocating section 2017-03-16 10:21:10 +01:00
Tomasz Grabiec
4ab8b255da lsa: Allow adjusting reserves in allocating_section 2017-03-16 10:21:10 +01:00
Raphael S. Carvalho
6b6bb38f38 compaction_manager: stop manager after storage io error
Manager will stop itself if a compaction fails due to storage io
error, which unconditionally results in stop of transportation
services.

Fixes #2147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170316054538.23423-1-raphaelsc@scylladb.com>
2017-03-16 10:37:47 +02:00
Duarte Nunes
876a514743 database: Upgrade mutation to current schema to push view updates
This patch ensures we upgrade the mutation to the current schema when
generating and pushing view updates, so that it matches the most
up-to-date views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 18:15:27 +01:00
Duarte Nunes
be12a2bf0a db/schema_tables: Atomically publish base and view changes
This patch ensures that the schema merging atomically publishes
schema changes. In particular, it ensures that when a base schema
and a subset of its views are modified together (i.e., upon an alter
table or alter type statement), then they are published together as
well, without any deferring in-between.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 16:35:07 +01:00
Duarte Nunes
e215f25b11 migration_manager: Atomically migrate table and views
This patch changes the migration path for table updates such that the
base table mutations are sent and applied atomically with the view
schema mutations.

This ensures that after schema merging, we have a consistent mapping
of base table versions to view table versions, which will be used in
later patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 16:03:56 +01:00
Duarte Nunes
bfb8a3c172 materialized views: Replace db::view::view class
The write path uses a base schema at a particular version, and we
want it to use the materialized views at the corresponding version.

To achieve this, we need to map the state currently in db::view::view
to a particular schema version, which this patch does by introducing
the view_info class to hold the state previously in db::view::view,
and by having a view schema directly point to it.

The changes in the patch are thus:

1) Introduce view_info to hold the extra view state;
2) Point to the view_info from the schema;
3) Make the functions in the now stateless db::view::view non-member;
4) Remove the db::view::view class.

All changes are structural and don't affect current behavior.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:50:05 +01:00
Duarte Nunes
a64c47f315 schema: Move raw_view_info outside of raw_schema
In preparation of an upcoming patch, where the schema
won't directly store the raw_view_info.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:38:31 +01:00
Duarte Nunes
4b209be8b8 view_info: Rename to raw_view_info
In preparation for upcoming patches, which will deal with
moving the state in db::view::view to view_info.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:38:31 +01:00
Paweł Dziepak
dc99197318 Merge "Correctly handle tombstoned collections" from Duarte
"The current implementations of collection_type_impl::is_empty() and
collection_type_impl::difference() don't handle tombstoned collection
mutations correctly. In particular:

- is_empty() considers a collection mutation with a tombstone (and no
  entries) as empty;
- difference() doesn't do set difference between the cells' tombstones,
  and always returns the highest.

Fixes #2152"

* 'collection-diff/v4' of github.com:duarten/scylla:
  mutation_test: Add more test cases for difference()
  mutation_source_test: Randomly generate collection cells
  collection_type_impl: Use set difference for tombstones
  collection_type_impl: A mutation with a tombstone is not empty
2017-03-15 13:39:55 +00:00
Duarte Nunes
143136647a mutation_test: Add more test cases for difference()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 14:34:01 +01:00
Duarte Nunes
005e4741e3 mutation_source_test: Randomly generate collection cells
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 14:34:01 +01:00
Duarte Nunes
61741a69b6 collection_type_impl: Use set difference for tombstones
This patch fixes collection_type_impl::difference() so it does set
difference for tombstones instead of just returning the larger
one, as difference() is supposed to return only the information in
mutation A that supersedes that in B, given difference(A, B).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 14:34:01 +01:00
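The intended semantics of the two fixes can be sketched together (a simplified model where a tombstone is just its deletion timestamp, and a higher timestamp supersedes; the real code carries more state): difference(A, B) keeps A's tombstone only if it supersedes B's, and a mutation carrying a tombstone is not empty even with no cells.

```python
def tombstone_difference(a, b):
    """Return a's tombstone if it supersedes b's, else None.

    a and b are deletion timestamps, or None for no tombstone.
    """
    if a is None:
        return None
    if b is None or a > b:
        return a
    return None  # b already covers a; nothing in a supersedes b


def is_empty(cells, tombstone):
    # A collection mutation with a tombstone is not empty,
    # even if it has no cell entries.
    return not cells and tombstone is None
```

So difference no longer just returns the larger tombstone unconditionally, and a cell-less tombstoned mutation is correctly treated as carrying information.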
Duarte Nunes
19fcd2d140 collection_type_impl: A mutation with a tombstone is not empty
This patch changes the collection_type_impl::is_empty() function so
that it doesn't consider empty a collection_mutation which has a
tombstone.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 14:34:01 +01:00
Takuya ASADA
b65d58e90e dist/common/scripts/scylla_raid_setup: don't discard blocks at mkfs time
Discarding blocks on a large RAID volume takes too much time, and the user
may suspect the script isn't working correctly, so it's better to skip it
and discard directly on each volume instead.

Fixes #1896

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1489533460-30127-1-git-send-email-syuu@scylladb.com>
2017-03-15 13:13:57 +02:00
Calle Wilund
078589c508 commitlog_replayer: Make replay parallel per shard
Fixes #2098

Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequentially within each shard.

v2:
* Fixed whitespace errors

Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
2017-03-15 13:07:17 +02:00
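The distribution described above can be sketched as follows (hypothetical names; the real replayer dispatches seastar tasks): segments are partitioned round-robin across shards, and each shard then replays its own share sequentially.

```python
def distribute_segments(segments, n_shards):
    """Round-robin partition of commitlog segments across shards.

    Each shard replays its own list sequentially, instead of shard 0
    replaying every segment in parallel, which spreads the memory
    footprint across shards.
    """
    shards = [[] for _ in range(n_shards)]
    for i, seg in enumerate(segments):
        shards[i % n_shards].append(seg)
    return shards
```

For example, five segments over two shards yield one shard with three segments and one with two, each replayed in order.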
Amnon Heiman
0a2eba1b94 database: requests_blocked_memory metric should be unique
Metrics name should be unique per type.

requests_blocked_memory was registered twice, one as a gauge and one as
derived.

This is not allowed.

Fixes #2165

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314162826.25521-1-amnon@scylladb.com>
2017-03-14 19:36:45 +02:00
Avi Kivity
ed4b5f5a18 Merge seastar upstream
* seastar fd29fd0...4d25b85 (2):
  > core/file: fix EOF detection for file with custom impl
  > tutorial: fix echo server example

Includes patch from Raphael updating checked_file_impl:

"Now file_impl requires dma_read_bulk to be implemented, and for
checked_file_impl it's only a matter of calling dma_read_bulk on
the posix file it wraps."
2017-03-14 13:38:38 +02:00
Takuya ASADA
d016dd4b74 dist: schedule daily fstrim for data directory and commitlog directory
Schedule daily fstrim for the data directory and commitlog directory, which is
recommended by the Scylla doc:
http://www.scylladb.com/doc/admin/#schedule-fstrim

Fixes #1347

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1489447472-2981-1-git-send-email-syuu@scylladb.com>
2017-03-14 11:51:53 +02:00
Amnon Heiman
295a981c61 storage_proxy: metrics should have unique name
Metrics should have unique names. This patch renames the queue-length
metric throttled_writes to current_throttled_writes.

Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.

This could be related to scylladb/seastar#250

Fixes #2163.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
2017-03-14 11:19:39 +02:00
Tomasz Grabiec
ed530dfb3a tests: sstables: Add test for skipping within a compressed stream
Refs #2143.
2017-03-13 13:08:24 +01:00
Tomasz Grabiec
1e0af2efc3 Update seastar submodule
* seastar 84a0b70...fd29fd0 (4):
  > Fix smp::submit_to() with function reference
  > execution_stage: add concept restraint for operator()
  > core/temporary_buffer: Add operator==()
  > map_reduce: allow reducer to take accumulated value by rref
2017-03-13 10:13:03 +01:00
Paweł Dziepak
60c6b9a240 Merge "Implement sstable_streamed_mutation::fast_forward_to()" from Tomasz
"This replaces use of a generic forwarding wrapper in sstable reader with
specialized implementation. Forwarding doesn't yet utilize indexes in this
series, only integrates it with mp_row_consumer, which is a prerequisite.

It's still an optimization, since mp_row_consumer will not try to consume
past the range as it used to.

Sending early for easier consumption."

* tag 'tgrabiec/forwarding-of-mp-row-consumer-v2' of github.com:scylladb/seastar-dev:
  sstables: Remove use of forwarding wrapper
  sstables: Implement sstable_streamed_mutation::fast_forward_to()
  sstables: Extract and use clustering_ranges_walker
  tests: sstables: Add test for handling of repeated tombstones
  sstables: Extract writer parameters into config objects
  tests: Move as_mutation_source() helper to header
  tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions
  streamed_mutation: Add streamed_mutation_returning() helper
  tests: mutation_source_test: Add test case for forwarding to a full range
  tests: simple_schema: Add fragment factories
  tests: Extract simple_schema
  sstables: Move workaround for out-of-order range tombstones to mp_row_consumer
  sstables: Drop default mp_row_consumer constructor
  sstables: Swap order of values in "proceed" so that "no" is assigned 0
  util/optimized_optional: Make printable
  position_in_partition: Add is_static_row() in the view
  range_tombstone_stream: Add reset()
  range_tombstone_stream: Add get_next(position_in_partition_view)
  sstables: streamed_mutation: Stop reading when end of slice reached
  sstables: Switch is_in_range() to position_in_partition
2017-03-10 13:55:46 +00:00
Tomasz Grabiec
1f1b516b31 sstables: Remove use of forwarding wrapper 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
d7afab21e7 sstables: Implement sstable_streamed_mutation::fast_forward_to()
Handling of forwarding is done inside mp_row_consumer, because it
allows us to filter out irrelevant data sooner and thus more
efficiently.

Because the static row can now be skipped as well, _skip_clustering_row
was renamed to the more generic _skip_in_progress.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
4750216387 sstables: Extract and use clustering_ranges_walker
Extracted from mp_row_consumer.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
88ccc99017 tests: sstables: Add test for handling of repeated tombstones 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
124dde30db sstables: Extract writer parameters into config objects
Also enables users to change the default promoted index block size.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
ad1e69c4c5 tests: Move as_mutation_source() helper to header 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
6f409d367b tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
dc7b93a326 streamed_mutation: Add streamed_mutation_returning() helper 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
06a964b3a0 tests: mutation_source_test: Add test case for forwarding to a full range 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
929842ad3f tests: simple_schema: Add fragment factories 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
d98f013b07 tests: Extract simple_schema 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
01374c41f2 sstables: Move workaround for out-of-order range tombstones to mp_row_consumer
This is a preliminary step before adding support for fast-forwarding
to mp_row_consumer, so that range handling can be solely in
mp_row_consumer rather than split between it and
sstable_streamed_mutation.

This also alleviates #2080 by reading all tombstones only up to the
first row, after that range tombstones are treated like other
fragments.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
d41a7c5eb4 sstables: Drop default mp_row_consumer constructor 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
56f1ad7841 sstables: Swap order of values in "proceed" so that "no" is assigned 0 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
58c29be45c util/optimized_optional: Make printable 2017-03-10 14:42:21 +01:00
Tomasz Grabiec
a32cf6c4cc position_in_partition: Add is_static_row() in the view 2017-03-10 14:42:21 +01:00
Tomasz Grabiec
e4db643730 range_tombstone_stream: Add reset() 2017-03-10 14:42:21 +01:00
Tomasz Grabiec
48ad2e2d64 range_tombstone_stream: Add get_next(position_in_partition_view) 2017-03-10 14:42:21 +01:00
Tomasz Grabiec
084747b1ee sstables: streamed_mutation: Stop reading when end of slice reached
As part of this change, skip detection is refactored. This
simplifies reasoning about mp_row_consumer's state a bit, because now
is_mutation() is not reset externally and only depends on the current
position of the reader.

It will prove useful when we extend the mutation reader to decide up
front whether it should skip to the next partition before calling
_context.read(), so that we can, for instance, skip using the index instead.

Fixes #2088.
2017-03-10 14:42:19 +01:00
Duarte Nunes
16bcf8d085 db/schema_tables: Avoid copying keyspace name
This patch changes a lambda argument type so the keyspace name is
passed by reference instead of copying it, in
read_schema_for_keyspaces().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213134.10331-1-duarte@scylladb.com>
2017-03-10 11:03:56 +02:00
Duarte Nunes
d32c848d73 utils/logalloc: Change linkage of hist_options to external
Change linkage of segment_descriptor_hist_options to external to keep
good old GCC5 happy, despite C++11 allowing static linkage of non-type
template arguments.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213206.10383-1-duarte@scylladb.com>
2017-03-10 11:02:51 +02:00
Tomasz Grabiec
55358cacc5 sstables: Switch is_in_range() to position_in_partition
Makes it immune to #1446 and is a prerequisite for implementing
forwarding in mp_row_consumer.
2017-03-09 21:15:11 +01:00
Paweł Dziepak
aaae8db033 loggers should not have external linkage
Message-Id: <20170309111034.20929-1-pdziepak@scylladb.com>
2017-03-09 12:27:20 +01:00
Gleb Natapov
d34f3a0440 batchlog: introduce batch_size_fail_threshold_in_kb option
Add batch_size_fail_threshold_in_kb to prevent huge batches from being
applied and causing trouble. Also do not warn or fail if only one
partition is affected.

Fixes: #2128

Message-Id: <20170309111247.GE8197@scylladb.com>
2017-03-09 12:20:17 +01:00
Amnon Heiman
7b04841dda main: Name the http servers
In main there are two http servers that start: the API server and prometheus.
This patch names them accordingly, so that their metrics are more
meaningful.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1489055282-10887-1-git-send-email-amnon@scylladb.com>
2017-03-09 12:30:49 +02:00
Glauber Costa
a7b0a899a3 dist: don't execute dpdk scripts if not in dpdk mode
The scripts do not cope well with being executed inside docker.
Since we don't really need those variables set outside DPDK scenarios,
just don't set them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488823691-9014-1-git-send-email-glauber@scylladb.com>
2017-03-09 11:40:08 +02:00
Avi Kivity
efd96a448c Merge "Add execution stages" from Paweł
"These patches introduce execution stages to Scylla in order to improve
icache friendliness. The places where stages are added were not chosen
very carefully; rather, they are introduced at the boundaries between
different subsystems: cql, storage proxy and database. This already results in a rather
significant improvement and can be tuned later if necessary.

Performance results:
perf_simple_query -c4 --duration 60
(medians)
          before       after      diff
write   83017.75   242876.04   +192.6%
read    61709.16   168258.26   +172.7%

The real-life improvements aren't as good, because it is much harder
to collect a sufficiently high number of operations in a batch."

Additional benchmarking from Paweł:

"I did some tests on my local setup.

* Latency at light loads

Scylla running on 16 logical CPUs (8 cores) with 64 GB of RAM.
cassandra-stress -rate threads=32

write latency
        master      seda
median   1.2         0.6
95th     1.6         0.8
99th     1.7         0.9
99.9th   2.5         1.3
max     26.4        24.2

Flags '--poll-mode' and '--defragment-memory-on-idle false' didn't improve the situation for master.

See also attached graph write_99.svg and write_999.svg.

read latency
        master      seda
median   0.8         0.6
95th     1.0         0.9
99th     1.1         1.0
99.9th   1.4         1.2
max     18.5        18.0

See also attached graph read_99.svg and read_999.svg.

* Server 100% loaded, dataset fitting in memory (throughput)

Scylla running on 2 cores with 64 GB of RAM.
4x scylla-bench with the uniform workload
(concurrency of each s-b: 512  for writes, 256 for reads).

There were no cache misses during reads.

          master        seda      diff
writes  107722.4   168482.26    +56.4%
reads   51049.48    76158.19    +49.2%

* Server 100% loaded, writes being flushed and compacted (throughput)

Scylla running on 2 cores with 4 GB of RAM.
4x scylla-bench with the uniform workload, concurrency 256 each.

          master        seda      diff
writes  79575.77   114206.11    +43.5%

See attached graph: writes_with_flushes_and_compaction.png (first run: master, second: seda)."

* tag 'pdziepak/scylla-execution-stages/v1-rebased' of github.com:cloudius-systems/seastar-dev:
  transport: make process_request_one() an execution stage
  mutation_query: add an execution stage
  db: make database::query() an execution stage
  db: make apply an execution stage
  storage_proxy: make mutate() an execution stage
  cql3: make batch statement an execution stage
  cql3: make modification statement an execution stage
  cql3: make select statement an execution stage
  mutation_reader: make mutation_source nothrow movable
2017-03-09 11:29:43 +02:00
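The batching idea behind the execution stages merged above can be sketched as follows. This is a rough, synchronous illustration of the concept only; seastar's actual execution_stage is asynchronous, returns futures, and flushes automatically, and all names here are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Rough, synchronous sketch of the execution-stage idea: instead of
// calling a handler immediately, enqueue its argument and later run the
// whole batch back to back, so the same code stays hot in the icache.
template <typename Arg>
class execution_stage {
    std::function<void(Arg&)> _fn;
    std::vector<Arg> _queue;
public:
    explicit execution_stage(std::function<void(Arg&)> fn) : _fn(std::move(fn)) {}

    // Defer the call: just remember the argument.
    void enqueue(Arg a) { _queue.push_back(std::move(a)); }

    // Run the handler over every queued item; returns the batch size.
    std::size_t flush() {
        std::size_t n = _queue.size();
        for (auto& a : _queue) {
            _fn(a);
        }
        _queue.clear();
        return n;
    }
};
```

The win comes from amortizing instruction-cache misses: the same handler code runs repeatedly over a batch instead of being evicted between calls, which matches the perf numbers quoted in the merge message.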
Paweł Dziepak
74f35864ef transport: make process_request_one() an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
a78501c206 mutation_query: add an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
b5f0e590be db: make database::query() an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
38c1501f4d db: make apply an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
cfde2ad5b4 storage_proxy: make mutate() an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
827357cb08 cql3: make batch statement an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
dce785089a cql3: make modification statement an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
d005b20071 cql3: make select statement an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
12135dbe21 mutation_reader: make mutation_source nothrow movable 2017-03-09 09:27:43 +00:00
Amnon Heiman
4e8d73098f main: Prometheus should start as early as possible
There is no need to wait when starting the prometheus server, as it is
up to each of the modules to register its metrics when it is ready.

This is especially important when debugging boot issues.

This patch moves the prometheus initialization to an early
stage of the boot sequence.

Fixes #2144

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1489041986-28974-1-git-send-email-amnon@scylladb.com>
2017-03-09 11:26:51 +02:00
Asias He
39d2e59e7e repair: Fix midpoint is not contained in the split range assertion in split_and_add
We have:

  auto halves = range.split(midpoint, dht::token_comparator());

We saw a case where midpoint == range.start. As a result, range.split
will assert, because range.start is marked non-inclusive, so the
midpoint doesn't appear to be contain()ed in the range - hence the
assertion failure.

Fixes #2148

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>
2017-03-09 09:09:17 +01:00
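The contain() semantics behind the assertion above can be illustrated with a simplified integer range; plain ints stand in for dht::token, and all names below are hypothetical:

```cpp
#include <cassert>

// With an exclusive start bound, a midpoint equal to range.start is not
// contain()ed, so splitting there must be avoided.
struct token_range {
    int start;
    int end;
    bool start_inclusive;

    bool contains(int v) const {
        bool after_start = start_inclusive ? v >= start : v > start;
        return after_start && v <= end;
    }
};

// The guard the fix needs: only split when the midpoint is truly inside.
bool can_split_at(const token_range& r, int midpoint) {
    return r.contains(midpoint);
}
```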
Avi Kivity
b8e4113dba Merge seastar upstream
* seastar 5861f99...84a0b70 (13):
  > build: don't error out on [[deprecated]] APIs
  > Merge "Introduce execution stages" from Paweł
  > Remove unused include statement
  > http: catch and count errors in read and respond
  > Merge "Adding metrics configuration" from Amnon
  > future: add concepts for map_reduce(), when_all_succeed()
  > doxygen: exclude c-ares directory
  > scripts/posix_net_conf.sh: add --use-cpu-mask option
  > file: take flush into account when calculating size for truncate in optimize_queue()
  > Fixing the prometheus cleanup patch
  > Merge "posix_net_conf.sh: better distribute ingress processing" from Vlad
  > prometheus: code clean up
  > future: relax finally() constraints even more
2017-03-08 20:02:05 +02:00
Tomasz Grabiec
abf8e83c8d gdb: Cast gdb.Values to int
Fails with newer GDB with:

  TypeError: %x format: an integer is required, not gdb.Value

Message-Id: <1488981412-22279-1-git-send-email-tgrabiec@scylladb.com>
2017-03-08 19:43:48 +02:00
Paweł Dziepak
6db6d25f66 Merge "Avoid losing changes to keyspace parameters of system_auth and tracing keyspaces" from Tomek
"If a node is bootstrapped with auto_bootstrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with the schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have a higher timestamp.

To prevent that, let's use the minimum timestamp so that the default schema
always loses to manual modifications. This is what Cassandra does.

Fixes #2129."

* tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev:
  db: Create default auth and tracing keyspaces using lowest timestamp
  migration_manager: Append actual keyspace mutations with schema notifications
2017-03-08 10:59:47 +00:00
Nadav Har'El
506e074ba4 sstable decompression: fix skip() to end of file
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.

Fixes #2143

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
2017-03-08 12:35:05 +02:00
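The special case described in the commit above can be sketched with a simplified model; locate(), the chunk size, and the field names are illustrative, not the actual sstable code:

```cpp
#include <cassert>

// Skipping to one-past-end must only advance the logical position and
// never look up a compressed chunk, since locate() is only defined for
// positions of actual bytes.
struct compressed_stream {
    long pos = 0;       // logical (uncompressed) position
    long file_len = 0;  // uncompressed length

    // Chunk index for a byte position; undefined (here: -1) for the
    // one-past-end position, because no chunk holds that byte.
    long locate(long p) const {
        if (p >= file_len) return -1;
        return p / 4096;
    }

    bool skip(long n) {
        long target = pos + n;
        pos = target;
        if (target == file_len) {
            return true;            // end of file: just move the pointer
        }
        return locate(target) >= 0; // otherwise we must find the chunk
    }
};
```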
Tomasz Grabiec
d6425e7646 db: Create default auth and tracing keyspaces using lowest timestamp
If the node is bootstrapped with auto_bootstrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with the schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have a higher timestamp.

To prevent that, let's use the minimum timestamp so that the default schema
always loses to manual modifications. This is what Cassandra does.

Fixes #2129.
2017-03-07 19:19:15 +01:00
Tomasz Grabiec
06d4ad1bdd migration_manager: Append actual keyspace mutations with schema notifications
There is a workaround for a notification race, which attaches keyspace
mutations to other schema changes in case the target node missed the
keyspace creation. Currently, that generates keyspace mutations on the
spot instead of using the ones stored in the schema tables. Those
mutations would have the current timestamp, as if the keyspace had
just been modified. This is problematic because it may generate an
overwrite of keyspace parameters with a newer timestamp but with stale
values, if the node is not up to date with the keyspace metadata.

That's especially the case when booting up a node without enabling
auto_bootstrap. In such a case the node will not wait for schema sync
before creating auth tables. Such table creation will attach
potentially out-of-date mutations for keyspace metadata, which may
overwrite changes made to keyspace parameters earlier in the
cluster.

Refs #2129.
2017-03-07 19:19:15 +01:00
Avi Kivity
1b5ba63676 sstable: fix unhandled exception in atomic_deletion_manager::delete_atomically()
The current code is asymmetric: the first N-1 shards to delete a set receive
a synthetic future to wait on, while the last deletion receives the result
of the delete operation (which also broadcasts completion to the first N-1
operations).  This results, in case of an error, in the Nth future being
reported as an unhandled error.
Fix by making everything symmetric: all N callers receive a synthetic
future.  Nobody waits for the deletion operation (which still broadcasts its
completion to all waiters, so errors are not lost).

Message-Id: <20170305151607.14264-1-avi@scylladb.com>
2017-03-07 12:41:12 +02:00
Avi Kivity
439b38f5ab Merge "Improvements to counter implementation" from Paweł
"This series adds various optimisations to the counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features, such as tracing and dropping timed-out queries.

Performance was tested using:
perf-simple-query -c4 --counters --duration 60

The following results are medians.
          before       after      diff
write   18640.41    33156.81    +77.9%
read    58002.32    62733.93     +8.2%"

* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
  cell_locker: add metrics for lock acquisition
  storage_proxy: count counter updates for which the node was a leader
  storage_proxy: use counter-specific timeout for writes
  storage_proxy: transform counter timeouts to mutation_write_timeout_exception
  db: avoid allocations in do_apply_counter_update()
  tests/counters: add test for apply reversibility
  counters: attempt to apply in place
  atomic_cell: add COUNTER_IN_PLACE_REVERT flag
  counters: add equality operators
  counters: implement decrement operators for shard_iterator
  counters: allow using both views and mutable_views
  atomic_cell: introduce atomic_cell_mutable_view
  managed_bytes: add cast to mutable_view
  bytes: add bytes_mutable_view
  utils: introduce mutable_view
  db: add more tracing events for counter writes
  db: propagate tracing state for counter writes
  tests/cell_locker: add test for timing out lock acquisition
  counter_cell_locker: allow setting timeouts
  db: propagate timeout for counter writes
  ...
2017-03-07 11:48:13 +02:00
Tomasz Grabiec
ecfa9e40de Merge 'duarte/lsa/hist-cleanup/v2' from github.com:duarten/scylla
histogram cleanups from Duarte.
2017-03-07 10:33:50 +01:00
Gleb Natapov
5c4158daac memtable: do not yield while holding reclaim_lock
Holding reclaim_lock while yielding may cause memory allocations to
fail.

Fixes #2139

Message-Id: <20170306153151.GA5902@scylladb.com>
2017-03-06 17:24:22 +01:00
Gleb Natapov
d7bdf16a16 memtable: do not open code logalloc::reclaim_lock use
logalloc::reclaim_lock prevents reclaim from running, which may cause
a regular allocation to fail although there is enough free memory.
To solve that there is allocation_section, which acquires reclaim_lock
and, if an allocation fails, runs the reclaimer outside of the lock and retries
the allocation. This patch uses allocation_section instead of
using reclaim_lock directly in the memtable code.

Fixes #2138.

Message-Id: <20170306160050.GC5902@scylladb.com>
2017-03-06 17:24:22 +01:00
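The allocation_section pattern described in the commit above can be sketched roughly as follows; the toy allocator and all names are hypothetical, not Scylla's actual logalloc API:

```cpp
#include <cassert>

// Try the allocation while reclaim is locked out; on failure, run the
// reclaimer outside the lock and retry instead of failing outright.
struct allocator_state {
    int free_blocks = 0;   // blocks immediately available
    int reclaimable = 0;   // blocks that reclaim could recover
};

bool try_alloc(allocator_state& st) {
    if (st.free_blocks > 0) {
        --st.free_blocks;
        return true;
    }
    return false;
}

void run_reclaimer(allocator_state& st) {
    // Must run with the reclaim lock released; here it just frees blocks.
    st.free_blocks += st.reclaimable;
    st.reclaimable = 0;
}

// Mimics allocation_section: first attempt under the lock, then a
// reclaim-and-retry cycle with the lock released in between.
bool alloc_with_retry(allocator_state& st) {
    if (try_alloc(st)) {       // first attempt, reclaim locked out
        return true;
    }
    run_reclaimer(st);         // lock released: reclaim may run
    return try_alloc(st);      // retry under the lock
}
```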
Avi Kivity
1af9e3a5cb Merge "database: fix the 'nodetool clearsnapshot'" from Vlad
"Work on this series started with fixing the  'nodetool clearsnapshot'.
The current master code  ignores the snapshots in deleted keyspaces (issue #2045).

I noticed that in many places where our code has to build the path to some directory/file,
it simply used sstring(<path1>) + "/" + sstring(<path2>) constructs, which may cause us issues
if somebody decides to compile/run scylla on a non-Unix-based OS, like Microsoft Windows.

I understand that this is a long shot, but if we can make it right now - why not.
The answer is the boost::filesystem::path class - its synchronous parts, of course.

I decided to take an initiative and fix the issues above and then use the fixed code for
fixing the issue #2045:
   - Fix some minor issues in the existing code.
   - Extend the lister class and move it into the separate files outside database.cc.

On the way I've found an issue in the existing code (issue #2071).
This series fixes this one too (PATCH2)."
2017-03-06 16:45:31 +02:00
Glauber Costa
2d620a25fb raid script: improve test for mounted filesystem
The current test for whether or not the filesystem is mounted is weak
and will fail if multiple pieces of the hierarchy are mounted.

util-linux ships with a mountpoint command that does exactly that,
so we'll use that instead.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488742801-4907-1-git-send-email-glauber@scylladb.com>
2017-03-06 15:59:29 +02:00
Gleb Natapov
7f5923f510 storage_service: handle empty token list correctly
boost::split() returns one empty string if called on an empty input.
Trying to cast an empty string to a token value results in a bad_lexical_cast
exception. Fix it by handling the empty token list explicitly.

Message-Id: <20170302125405.GU11471@scylladb.com>
2017-03-06 15:31:33 +02:00
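The pitfall and the fix above can be illustrated with a simple comma splitter that mirrors boost::split()'s behavior on empty input; the helper names are hypothetical:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simple comma splitter mirroring boost::split's behavior on empty
// input: it yields a single empty string rather than an empty vector.
std::vector<std::string> split_tokens(const std::string& s) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : s) {
        if (c == ',') {
            out.push_back(cur);
            cur.clear();
        } else {
            cur += c;
        }
    }
    out.push_back(cur);   // always pushes, even when s is empty
    return out;
}

// The fix described above: handle the empty token list explicitly
// instead of trying to parse "" as a token.
std::vector<std::string> parse_token_list(const std::string& s) {
    if (s.empty()) {
        return {};        // explicit empty-list handling
    }
    return split_tokens(s);
}
```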
Takuya ASADA
6602221442 dist/redhat: enables discard on CentOS/RHEL RAID0
Since the CentOS/RHEL raid module disables discard by default, we need to
enable it again in order to use it.

Fixes #2033

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488407037-4795-1-git-send-email-syuu@scylladb.com>
2017-03-06 12:21:42 +02:00
Duarte Nunes
ca4f5cabd4 lsa: Extract log_histogram class
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-04 14:47:19 +01:00
Avi Kivity
24d6560fbc Update scylla-ami submodule
* dist/ami/files/scylla-ami d5a4397...eedd12f (3):
  > Rewrite disk discovery to handle EBS and NVMEs.
  > add --developer-mode option
  > trivial cleanup: replace tab in indent
2017-03-04 13:29:32 +02:00
Duarte Nunes
5c73978b68 thrift/handler: Enable Aggregator concept with GCC6_CONCEPT
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170303172509.16844-1-duarte@scylladb.com>
2017-03-04 13:27:16 +02:00
Duarte Nunes
2b6abd5a91 lsa: Make log_histogram more generic
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-03 17:59:17 +01:00
Duarte Nunes
3819e6d55f lsa: log_histogram cleanups
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-03 17:09:07 +01:00
Tomasz Grabiec
22199abf50 gc_clock: Remove orphaned comment
Message-Id: <1488381379-8618-1-git-send-email-tgrabiec@scylladb.com>
2017-03-02 12:56:09 +02:00
Tomasz Grabiec
6a83fe5534 Merge 'pdziepak/optimise-commitlog-entry-writer/v1' from seastar-dev.git
From Paweł:

These patches optimise commitlog_entry_writer so that it avoids copying
column mapping, which is a particularly expensive operation.

perf_simple_query -c4 --write --duration 60
(medians)
          before       after      diff
write   79434.35    89247.54    +12.3%

Tested with:

  commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup
  commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table
  commitlog_test.py:TestCommitLog.test_commitlog_replay_with_counters
2017-03-02 11:37:42 +01:00
Paweł Dziepak
04b80272f2 cell_locker: add metrics for lock acquisition 2017-03-02 09:05:12 +00:00
Paweł Dziepak
00b42c477f storage_proxy: count counter updates for which the node was a leader 2017-03-02 09:05:12 +00:00
Paweł Dziepak
cf193f4b41 storage_proxy: use counter-specific timeout for writes 2017-03-02 09:05:12 +00:00
Paweł Dziepak
d177160f90 storage_proxy: transform counter timeouts to mutation_write_timeout_exception 2017-03-02 09:05:12 +00:00
Paweł Dziepak
f93a766db4 db: avoid allocations in do_apply_counter_update() 2017-03-02 09:05:12 +00:00
Paweł Dziepak
8457f407ef tests/counters: add test for apply reversibility 2017-03-02 09:05:11 +00:00
Paweł Dziepak
3bccca67b9 counters: attempt to apply in place
It is expected that most counter updates just modify the values of
existing shards and can be done in place.
2017-03-02 09:05:11 +00:00
Paweł Dziepak
7604b926e1 atomic_cell: add COUNTER_IN_PLACE_REVERT flag
The general algorithm for merging counter cells involves allocating a
new buffer for the shards. However, it is expected that most of the
applies are just updating the values of existing shards and not adding
new ones, and can therefore be done in place.
Reverting the general and in-place applies, however, requires different
logic, hence the need for an additional flag to differentiate between
them.
2017-03-02 09:05:11 +00:00
Paweł Dziepak
1e3fbddb3a counters: add equality operators 2017-03-02 09:05:11 +00:00
Paweł Dziepak
772c9078d0 counters: implement decrement operators for shard_iterator 2017-03-02 09:05:11 +00:00
Paweł Dziepak
edad5202f3 counters: allow using both views and mutable_views 2017-03-02 09:05:11 +00:00
Paweł Dziepak
2db92e92b2 atomic_cell: introduce atomic_cell_mutable_view 2017-03-02 09:05:11 +00:00
Paweł Dziepak
1293073019 managed_bytes: add cast to mutable_view 2017-03-02 09:05:11 +00:00
Paweł Dziepak
29430ba970 bytes: add bytes_mutable_view 2017-03-02 09:05:11 +00:00
Paweł Dziepak
0ed2352ade utils: introduce mutable_view
std::basic_string_view does not allow modifying the underlying buffer.
This patch introduces a mutable_view which permits that.
2017-03-02 09:05:10 +00:00
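A minimal sketch of the mutable_view idea from the commit above, assuming a plain pointer/length pair; this illustrates the concept only, not Scylla's actual utils class:

```cpp
#include <cassert>
#include <cstddef>

// Like a string view, but over non-const bytes, so the underlying
// buffer can be modified through it.
struct mutable_view {
    char* _data;
    std::size_t _size;

    mutable_view(char* data, std::size_t size) : _data(data), _size(size) {}

    char& operator[](std::size_t i) { return _data[i]; }  // writable access
    std::size_t size() const { return _size; }

    // Drop n bytes from the front, as view types commonly allow.
    void remove_prefix(std::size_t n) { _data += n; _size -= n; }
};
```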
Paweł Dziepak
774241648d db: add more tracing events for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
277501f42f db: propagate tracing state for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
2b5c4386b5 tests/cell_locker: add test for timing out lock acquisition 2017-03-02 09:05:10 +00:00
Paweł Dziepak
5af780360f counter_cell_locker: allow setting timeouts 2017-03-02 09:05:10 +00:00
Paweł Dziepak
25173f8095 db: propagate timeout for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
c122f3b2f8 cell_locker: use internal storage for hashtable 2017-03-02 09:05:10 +00:00
Paweł Dziepak
4702ebf80f counters: use c_c_builder::from_single_shard() when possible 2017-03-02 09:05:10 +00:00
Paweł Dziepak
13ec22ad9a counters: drop tombstone handling in transform update to shards
Encountering tombstones while transforming a counter update from deltas to
shards is expected to be rare, because counter cells cannot
be recreated once removed.
This assumption makes it unnecessary to care much about removed cells
during the delta->shard transformation, as doing so adds complexity to the
code and is not required to produce correct results.
2017-03-02 09:05:10 +00:00
Paweł Dziepak
37485e5b29 counters: optimise counter_cell_builder
This patch attempts to avoid excessive allocations and copies when
constructing counter cells using counter_cell_builder. That involves
adding serializer interface to atomic_cell so that the counter cell can
be directly serialized to the buffer allocated for atomic cell.

counter_cell_builder::from_single_shard() is added as well to avoid
std::vector<> overhead when creating a counter cell from a single shard.
2017-03-02 09:05:10 +00:00
Glauber Costa
9e61a73654 setup: support mount points in raid script
By default, behavior is kept the same. There are deployments in which we
would like to mount data and commitlog in different places - as much as
we have avoided this up until now.

One example is EC2, where users may want to have the commitlog mounted
in the SSD drives for faster writes but keep the data in larger, less
expensive and durable EBS volumes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488258215-2592-1-git-send-email-glauber@scylladb.com>
2017-03-01 19:23:59 +02:00
Tomasz Grabiec
4b6e77e97e db: Fix overflow of gc_clock time point
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtracting
gc_grace_seconds() from query_time will overflow.

I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in the uses of
to_data_query_result(), tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.

Fixes the following UBSAN report:

  /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'

Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
2017-03-01 18:49:56 +02:00
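The overflow in the UBSAN report above can be sketched with a bare 32-bit seconds counter; the clamping helper below is one possible guard, offered as an illustration rather than the actual fix applied in Scylla:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// INT32_MIN - 604800 wraps around; a guarded subtraction clamps at the
// minimum instead. Assumes grace >= 0.
int32_t saturating_sub_seconds(int32_t t, int32_t grace) {
    // Would t - grace underflow the representable range?
    if (t < std::numeric_limits<int32_t>::min() + grace) {
        return std::numeric_limits<int32_t>::min();   // clamp at the minimum
    }
    return t - grace;
}
```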
Paweł Dziepak
bdac487b5a do not use long_type for counter update 2017-03-01 16:33:37 +00:00
Paweł Dziepak
f25fa6566f db: avoid deserialization when applying counter mutation
In the later stages of counter write path a mutation is produced that
already has all cells transformed to counter shards and can be applied
to the memtable and written to the commitlog.
The current interface expects a frozen mutation, which is suboptimal
for counters. The freeze itself is unavoidable -- it is required by the
commitlog -- but we can avoid the later deserialization of the frozen_mutation
when it is applied to the memtable if we pass the unfrozen mutation
along.
2017-03-01 16:33:37 +00:00
Paweł Dziepak
582d397c41 introduce counter_write_query()
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.

Standard mutation queries are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of a new, counter-specific kind of query.
2017-03-01 16:33:36 +00:00
Paweł Dziepak
426345e1d4 storage_proxy: avoid excessive mutation freezes 2017-03-01 16:33:36 +00:00
Paweł Dziepak
f10eb952d0 coordinator: do not apply counter write twice on leader 2017-03-01 16:33:36 +00:00
Paweł Dziepak
910bff297a to_string: add operator<< overload for std::array<> 2017-03-01 16:33:36 +00:00
Takuya ASADA
ba323e2074 dist/debian/dep: fix broken link of gcc-5, update it to 5.4.1-5
Since gcc-5/stretch=5.4.1-2 was removed from the apt repository, we are no
longer able to build gcc-5.

To avoid the dead link, use the launchpad.net archives instead of apt-get source.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488189378-5607-1-git-send-email-syuu@scylladb.com>
2017-03-01 17:13:14 +02:00
Tomasz Grabiec
0c84f00b16 query: Fix invalid initialization of _memory_tracker by moving-from-self
Fixes the following UBSAN warning:

  core/semaphore.hh:293:74: runtime error: reference binding to misaligned address 0x0000006c55d7 for type 'struct basic_semaphore', which requires 8 byte alignment

Since the field was not initialized properly, this probably also fixes some
user-visible bug.
Message-Id: <1488368222-32009-1-git-send-email-tgrabiec@scylladb.com>
2017-03-01 11:38:28 +00:00
Duarte Nunes
c0e5964462 database: Explicitly use discard_result()
Values returned from the lambda passed to finally() are immediately
destroyed, so make that explicit by using discard_result().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235541.28330-1-duarte@scylladb.com>
2017-02-28 18:41:19 +02:00
Duarte Nunes
11b5076b3c lsa: Use log histogram for closed segments
This patch replaces the current heap with a logarithmic histogram
to hold the closed segment descriptors.

This histogram stores elements in different buckets according to
their size. Values are mapped to a sequence of power-of-two ranges
that are split into N sub-buckets. Values less than a minimum value
are placed in bucket 0, whereas values bigger than a maximum value
are not admitted.

There is some loss of precision, as segments are now not totally
ordered, and precision decreases the sparser a segment is. This
reduces the cost of the computations needed when freeing
from a closed segment.

Performance results for perf_simple_query -c4 --duration 60
           before       after       diff
read     43954.27    45246.10      +2.9%
write    48911.54    52807.76      +7.9%

Fixes #1442

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235328.27937-1-duarte@scylladb.com>
2017-02-28 18:40:38 +02:00
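The bucket mapping described in the commit above can be sketched as follows; the constants and the exact indexing are illustrative, not logalloc's actual log_histogram layout:

```cpp
#include <cassert>
#include <cstdint>

// Each power-of-two range [2^k, 2^(k+1)) is split into kSubBuckets
// sub-buckets, and values below kMinValue all collapse into bucket 0.
constexpr uint64_t kMinValue = 16;
constexpr unsigned kSubBuckets = 4;

unsigned floor_log2(uint64_t v) {
    unsigned r = 0;
    while (v >>= 1) {
        ++r;
    }
    return r;
}

unsigned bucket_of(uint64_t value) {
    if (value < kMinValue) {
        return 0;                        // small values share bucket 0
    }
    unsigned exp = floor_log2(value);    // which power-of-two range
    unsigned base = floor_log2(kMinValue);
    // Position within [2^exp, 2^(exp+1)), scaled to kSubBuckets sub-buckets.
    unsigned sub = static_cast<unsigned>(
        ((value - (uint64_t(1) << exp)) * kSubBuckets) >> exp);
    return 1 + (exp - base) * kSubBuckets + sub;
}
```

Note how precision degrades gracefully: within one power-of-two range, values that land in the same sub-bucket are indistinguishable, which is the "segments are not totally ordered" trade-off the commit mentions.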
Avi Kivity
359fc68283 Merge seastar upstream
* seastar 4d4a58d...5861f99 (9):
  > future: adjust finally constraint to allow any future to be returned from the continuation
  > build: allow specifying the C compiler
  > socket: Change signature (and impls) of socket shutdown to void
  > reactor: give names to OS threads
  > Concepts support
  > core/file: Fix short-read in read_maybe_eof()
  > core/fstream: Avoid issuing read requests beyond _remain
  > tests: Improve assertion failure message
  > reactor: Expose IO stats in a public API
2017-02-28 13:13:35 +02:00
Avi Kivity
c1aac6fa87 build: accept and pass seastar's --c-compiler option 2017-02-28 13:13:02 +02:00
Duarte Nunes
a3873423d6 configure.py: Enable concepts support
This patch enables conditional concept support by propagating
seastar's --enable-gcc6-concepts flag.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235028.27490-1-duarte@scylladb.com>
2017-02-28 11:56:22 +02:00
Paweł Dziepak
5d66031b7a sstable: make input_stream_history initializers in-class
sstable has two constructors but only one of them was creating input
stream history objects.
Message-Id: <20170227151734.16928-1-pdziepak@scylladb.com>
2017-02-28 09:22:11 +01:00
Paweł Dziepak
374c8a56ac commitlog: avoid copying column_mapping
It is safe to copy column_mapping across shards. Such a guarantee comes at
the cost of performance.

This patch makes commitlog_entry_writer use the IDL-generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.

Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
          before       after      diff
write   79434.35    89247.54    +12.3%
2017-02-27 17:05:58 +00:00
Paweł Dziepak
4df4994b71 idl: fix generated writers when member functions are used
When using a member name in an identifier of a generated class or method,
the idl compiler should strip the trailing '()'.
2017-02-27 17:05:58 +00:00
Paweł Dziepak
018d16d315 idl: add start_frame() overload for seastar::simple_output_stream 2017-02-27 17:05:58 +00:00
Paweł Dziepak
0198d8e470 Merge "Introduce streamed_mutation::fast_forward_to()" from Tomasz
"This introduces an API which allows forward navigation in a stream of mutation
fragments. It allows one to consume only a subset of the stream by iteratively
specifying sub-ranges from which fragments should be returned.

API outline:

  When in forwarding mode, the stream does not return all fragments right away,
  but only those belonging to the current range. Initially the current range only
  covers the static row. The stream can be forwarded, even before reaching
  end-of-stream for the current range, to a later range with fast_forward_to().
  Forwarding doesn't change the initial restrictions of the stream; it can only be
  used to skip over data.

  Monotonicity of positions is preserved by forwarding. That is fragments
  emitted after forwarding will have greater positions than any fragments
  emitted before forwarding.

  For any range, all range tombstones relevant for that range which are present
  in the original stream will be emitted. Range tombstones emitted before
  forwarding which overlap with the new range are not necessarily re-emitted.

  When not in forwarding mode, the stream acts as if the current range was equal
  to the full range. This implies that fast_forward_to() cannot be
  used.

  Whether stream is in forwarding mode or not is specified when the stream
  is created, typically via mutation_source interface.

What's left for later series:

  Optimization by providing specialized implementations. This series implements
  forwarding support in all mutation sources via a generic wrapper which simply
  drops fragments."

* tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev:
  tests: mutation_source_tests: Verify monotonicity of positions
  tests: random_mutation_generator: Spread the keys more
  tests: mutation_source_test: Make blobs more easily distinguishable
  tests: streamed_mutation: Test that merged stream passes mutation source tests
  tests: mutation_source_test: Add tests for forwarding of streamed_mutation
  tests: streamed_mutation_assertions: Add methods for navigating the stream
  tests: Add range generators to random_mutation_generator
  partition_slice_builder: Add with_ranges()
  query: Introduce full_clustering_range
  streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation()
  db: Enable creating forwardable readers via mutation_source
  mutation_source: Document liveness requirements
  mutation_source: Cleanup
  db: Replace virtual_reader_type with mutation_source_opt
  partition_version: Refactor make_partition_snapshot_reader() overloads
  database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
  memtable: Accept all mutation_source parameters
  streamed_mutation: Implement fast_forward_to() in stream merger
  streamed_mutation: Add generic implementation of forwardable streamed_mutation
  streamed_mutation: Add fast_forward_to() API
  position_in_partition: Introduce position_range
  position_in_partition: Introduce position constructor for right after the static row
  streamed_mutation: Make cast to view non-explicit
  streamed_mutation: Make schema() getter non-copying
2017-02-24 10:37:51 +00:00
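The forwarding contract described in the merge message above can be sketched in miniature. This is a hypothetical, simplified model (plain integers as positions, not the real streamed_mutation API) showing how fast_forward_to() skips fragments before the new range while keeping positions monotonic:

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Simplified sketch (not the real Scylla API): fragments are just integer
// positions, and a forwardable stream yields only fragments inside the
// current range. fast_forward_to() moves to a later range; positions stay
// monotonic because the cursor never moves backwards.
class forwardable_stream {
    std::vector<int> _fragments;   // underlying stream, sorted by position
    std::size_t _idx = 0;          // cursor into _fragments
    int _range_start = 0;
    int _range_end = 0;            // current range is [_range_start, _range_end)
public:
    explicit forwardable_stream(std::vector<int> fragments)
        : _fragments(std::move(fragments)) {}

    // Forward to a later range; earlier fragments are skipped, not emitted.
    void fast_forward_to(int start, int end) {
        _range_start = start;
        _range_end = end;
        while (_idx < _fragments.size() && _fragments[_idx] < _range_start) {
            ++_idx;   // drop fragments that fall before the new range
        }
    }

    // Next fragment in the current range, or nullopt at end-of-range.
    std::optional<int> next() {
        if (_idx < _fragments.size() && _fragments[_idx] < _range_end) {
            return _fragments[_idx++];
        }
        return std::nullopt;
    }
};
```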
Tomasz Grabiec
0798ea22c8 tests: mutation_source_tests: Verify monotonicity of positions 2017-02-23 18:50:54 +01:00
Tomasz Grabiec
d0421ba545 tests: random_mutation_generator: Spread the keys more
The deviation was very low so most ranges were very close. Spread them
to test more cases.
2017-02-23 18:50:54 +01:00
Tomasz Grabiec
27ff169b6b tests: mutation_source_test: Make blobs more easily distinguishable
It's easier to compare them if they differ only by a few most
significant bits, than by all bits.
2017-02-23 18:50:53 +01:00
Tomasz Grabiec
182e3f981b tests: streamed_mutation: Test that merged stream passes mutation source tests 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
122562c1cc tests: mutation_source_test: Add tests for forwarding of streamed_mutation 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
1d7e84f770 tests: streamed_mutation_assertions: Add methods for navigating the stream 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
f2feb54fb0 tests: Add range generators to random_mutation_generator 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
f56308597c partition_slice_builder: Add with_ranges() 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
0073df30aa query: Introduce full_clustering_range 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
cbf4601e31 streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation() 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
892d4a2165 db: Enable creating forwardable readers via mutation_source
Right now all mutation source implementations will use
make_forwardable() wrapper.
2017-02-23 18:50:44 +01:00
Tomasz Grabiec
b1d1091906 mutation_source: Document liveness requirements 2017-02-23 18:23:52 +01:00
Tomasz Grabiec
15db80188b mutation_source: Cleanup
- Combine telescopic overloads into one method with default parameters.
- Introduce func_type for a full handler to avoid some duplication.
2017-02-23 18:23:52 +01:00
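The "telescopic overloads" cleanup pattern mentioned above, in a hypothetical miniature (names are illustrative, not the real mutation_source interface): a chain of overloads, each forwarding to the next with one more argument filled in, collapses into a single method with default parameters.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration, not the real mutation_source code.
// Before: three forwarding overloads. After: one method with defaults.
struct reader_opts {
    static std::string make(const std::string& range = "full",
                            bool forwarding = false) {
        // Default arguments replace the forwarding overload chain.
        return range + (forwarding ? "+fwd" : "");
    }
};
```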
Tomasz Grabiec
586dbaa8d3 db: Replace virtual_reader_type with mutation_source_opt
Virtual reader is a mutation_source.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
acfad565f0 partition_version: Refactor make_partition_snapshot_reader() overloads
So that streamed_mutation is created in only one of the overloads and
others delegate to that one. Later, common logic will be added to the
construction, and doing this will help avoid duplication.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
f46ae8128d database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
It was using the state passed via as_mutation_source(). Let's respect
the mutation_source contract instead, and use the state passed via the
mutation_source invocation.

Technically just a cleanup. Also a prerequisite for more cleanup.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
2cc27f72ca memtable: Accept all mutation_source parameters 2017-02-23 18:23:52 +01:00
Tomasz Grabiec
53b1a257cc streamed_mutation: Implement fast_forward_to() in stream merger 2017-02-23 18:23:52 +01:00
Tomasz Grabiec
e0a7ed48b0 streamed_mutation: Add generic implementation of forwardable streamed_mutation
A generic but not very efficient wrapper which simply drops
fragments from the original stream.
2017-02-23 18:23:51 +01:00
Tomasz Grabiec
301cd4912b streamed_mutation: Add fast_forward_to() API 2017-02-23 18:23:28 +01:00
Gleb Natapov
2dc56013f8 commitlog: handle cycle() error
Do not ignore a future<> retuned by cycle() since it will produce a
warning in case of an error. Log it instead.

Message-Id: <20170219151811.GN11471@scylladb.com>
2017-02-22 19:15:14 +01:00
Calle Wilund
d5f57bd047 messaging_service: Move log printout to actual listen start
Fixes #1845
The log printout happened before we had actually evaluated the endpoint
to create, and thus never included SSL info.
Message-Id: <1487766738-27797-1-git-send-email-calle@scylladb.com>
2017-02-22 17:08:21 +01:00
Avi Kivity
9b113ffd3e config: enable new sharding algorithm for new deployments
Set murmur3_partitioner_ignore_msb_bits to 12 (enabling the new sharding
algorithm), but do this in scylla.yaml rather than the built-in defaults.
This avoids changing the configuration for existing clusters, as their
scylla.yaml file will not be updated during the upgrade.
Message-Id: <20170214123253.3933-1-avi@scylladb.com>
2017-02-22 11:23:12 +01:00
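For reference, the shipped scylla.yaml change amounts to a single line (the value is taken from the commit message above):

```yaml
# Shipped in scylla.yaml for new deployments; existing clusters keep their
# old file on upgrade, so their sharding behavior is unchanged.
murmur3_partitioner_ignore_msb_bits: 12
```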
Calle Wilund
0a4edca756 counters/cql: allow wormholing actual counter values (with shards) via cql
Adds yet another magic function, "SCYLLA_COUNTER_SHARD_LIST", indicating that
the argument value, which must be a list of tuples <int, UUID, long, long>,
should be inserted as an actual counter value, not as an update.

This of course to allow counters to be read from sstable loader.

Note that we also need to allow timestamps for counter mutations,
as well as convince the counter code itself to treat the data as
already baked. So ugly wormhole galore.

v2:
* Changed flag names
* More explicit wormholing, bypassing normal counter path, to
  avoid read-before-write etc
* throw exceptions on unhandled shard types in marshalling
v3:
* Added counter id ordering check
* Added batch statement check for mixing normal and raw counter updates
Message-Id: <1487683665-23426-2-git-send-email-calle@scylladb.com>
2017-02-22 09:19:46 +00:00
Calle Wilund
0d87f3dd7d utils::UUID: operator< should behave as comparison of hex strings/bytes
I.e., it needs to be an unsigned comparison.
Message-Id: <1487683665-23426-1-git-send-email-calle@scylladb.com>
2017-02-22 09:19:22 +00:00
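The signedness pitfall behind this fix, in a self-contained sketch (hypothetical helpers, not the real utils::UUID code): bytes 0x80..0xff compare as negative when treated as signed, which disagrees with the ordering of the hex-string representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compare n bytes as signed values: 0x80..0xff sort *before* 0x00..0x7f,
// which is the buggy behavior (disagrees with hex-string order).
inline int compare_bytes_signed(const std::int8_t* a, const std::int8_t* b,
                                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return a[i] < b[i] ? -1 : 1;
        }
    }
    return 0;
}

// Compare n bytes as unsigned values: matches the hex-string order,
// which is what operator< should do (memcmp behaves the same way).
inline int compare_bytes_unsigned(const std::uint8_t* a, const std::uint8_t* b,
                                  std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return a[i] < b[i] ? -1 : 1;
        }
    }
    return 0;
}
```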
Tomasz Grabiec
2b2d5c4c7a Update seastar submodule
* seastar 5088065...4d4a58d (3):
  > reactor utilization should return the utilization in 0-1 range
  > collectd should ignore type label in name creation
  > fix append_challenged_posix_file_impl::process_queue() to handle recursion
2017-02-22 09:40:25 +01:00
Calle Wilund
e20b804a65 commitlog/database: Add "release" method to ensure we free segments
On database stop, we do flush memtables and clean up commit log segment usage.
However, since we never actually destroy the distributed<database>, we
don't actually free the commitlog either, and thus never clear out
the remaining (clean) segments. Thus we leave perfectly clean segments
on disk.

This just adds a "release" method to commitlog, and calls it from
database::stop, after flushing CF:s.
Message-Id: <1485784950-17387-1-git-send-email-calle@scylladb.com>
2017-02-21 18:17:47 +01:00
Gleb Natapov
0977f4fdf8 sstable: close sstable_writer's file if writing of sstable fails.
Failing to close a file properly before destroying the file object causes
crashes.

[tgrabiec: fixed typo]

Message-Id: <20170221144858.GG11471@scylladb.com>
2017-02-21 18:17:47 +01:00
Tomasz Grabiec
8fd19a71ff position_in_partition: Introduce position_range 2017-02-21 16:49:36 +01:00
Tomasz Grabiec
78c563ea6a position_in_partition: Introduce position constructor for right after the static row 2017-02-21 16:43:09 +01:00
Tomasz Grabiec
ce58706b50 streamed_mutation: Make cast to view non-explicit 2017-02-21 16:43:09 +01:00
Paweł Dziepak
274bcd415a tests/cql_test_env: wait for storage service initialization
Message-Id: <20170221121130.14064-1-pdziepak@scylladb.com>
2017-02-21 17:05:45 +02:00
Paweł Dziepak
359c617821 db: restore call to check_valid_rp()
5a0955e89d "db: add operations for
applying counter updates" merged two column_family::apply() overloads
into do_apply() in order to reduce code duplication. Unfortunately,
a call to check_valid_rp() didn't survive that change.
Message-Id: <20170221133800.30411-1-pdziepak@scylladb.com>
2017-02-21 15:26:04 +01:00
Tomasz Grabiec
b4fd3a08e6 streamed_mutation: Make schema() getter non-copying 2017-02-21 14:18:57 +01:00
Duarte Nunes
65b21e3a99 schema_registry: Don't leak schemas
When loading a schema asynchronously, we're leaving a strong
reference to the loaded schema in the entry's shared future. This
patch fixes this by storing a shared_promise, which is reset when the
schema is loaded.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170220193654.17439-1-duarte@scylladb.com>
2017-02-21 09:56:21 +01:00
Tomasz Grabiec
33457cc9a9 sstables: Fix detection of repeated tombstones
The check was not catching range tombstone repeated immediately after
itself.
Message-Id: <1487596098-17409-1-git-send-email-tgrabiec@scylladb.com>
2017-02-20 15:35:15 +00:00
Tomasz Grabiec
cc439df542 Revert "sstables: Simplify sstable_streamed_mutation::read_next()"
This reverts commit 1e2c01ff49.

We do not detect repeated tombstone if it follows an in-range
tombstone following a skipped clustering row, because _in_progress
will be disengaged after such tombstone is emitted.
Message-Id: <1487596080-21480-1-git-send-email-tgrabiec@scylladb.com>
2017-02-20 15:34:58 +00:00
Vlad Zolotarov
978241d473 database: move lister class into separate files
Move lister class away from database.cc.
This is a preparation for moving it to the seastar library.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:40 -05:00
Vlad Zolotarov
34cafa71c3 database: make 'clearsnapshot' delete the snapshots of deleted keyspaces if requested
The current implementation of 'nodetool clearsnapshot' command only deletes the snapshots
of the keyspaces that are alive at the time the command is issued (issue #2045).

This, besides not implementing the spec, prevents users from being able to clear
the disk space occupied by snapshots of deleted keyspaces that are no longer needed
(e.g. snapshots created when KS is deleted).

This patch fixes this issue by making the database::clear_snapshot() scan the data directories
looking for the snapshots to be deleted instead of relying on in-memory data structures.

This patch makes column_family::clear_snapshot() method not needed any more.

Fixes #2045

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:40 -05:00
Vlad Zolotarov
e1ee669aff database: lister: add the rmdir() static method
Removes the directory with all its contents (like 'rm -rf <dir name>' shell command).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:40 -05:00
Vlad Zolotarov
53532ba5ff database: lister: pass the parent path object to callbacks
Pass a parent directory boost::filesystem::path object
to the walker and filter callbacks.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:37 -05:00
Vlad Zolotarov
b4c970dfc6 database: lister: make the "filter" callback receive directory_entry instead of sstring
Filter should get all information that the caller has in hand that may be used for filtering.

directory_entry has the following information:
   - Type of the entry
   - Its name

For the code that has used lister filters so far this would be enough; however,
it's not hard to imagine a filter that may need the parent directory as well.

We will add the parent directory path in the follow up patches to make the interface
complete.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:46:59 -05:00
Vlad Zolotarov
6f9f0e1b3f database: lister: add "show_hidden" parameter
If show_hidden parameter is set to show_hidden::yes - list hidden entries, otherwise skip them.
By default set to show_hidden::no.

This patch also completely removes default parameters in lister::scan_dir() and replaces them
with a few lister::scan_dir() overloads that ensure that lambdas are always going to
be the last parameter in the parameter list.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:46:58 -05:00
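A minimal sketch of the show_hidden strong-bool idea (hypothetical names; the real code uses a seastar bool_class-style type): a strong enum instead of a bare bool makes call sites self-documenting and defaults to skipping hidden entries.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Strong enum in place of a bare bool parameter.
enum class show_hidden { no, yes };

// Hypothetical stand-in for lister-style filtering: return only the
// entries allowed by the show_hidden policy (dot-files count as hidden).
inline std::vector<std::string>
list_entries(const std::vector<std::string>& names,
             show_hidden hidden = show_hidden::no) {
    std::vector<std::string> out;
    for (const auto& n : names) {
        if (hidden == show_hidden::no && !n.empty() && n.front() == '.') {
            continue;   // skip hidden entries unless explicitly requested
        }
        out.push_back(n);
    }
    return out;
}
```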
Vlad Zolotarov
9aedb191f6 database: lister: if entries' types set is empty - list everything
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:46:58 -05:00
Vlad Zolotarov
cb614f9be4 database: lister::guarantee_type: handle the case when entry type may not be read
There is a possibility that the type of the given entry is not available,
which manifests as an ENOENT or ENOTDIR value set in errno
by the fstat() call for this entry. In this case engine().file_type() will return
a disengaged optional<directory_entry_type> value.

Return the future with the std::runtime_error exception in this case.
This prevents any further use of the disengaged optional value by the code
in the normal flow.

The exception is going to be propagated to the caller and it's the caller's responsibility
to handle it.

Fixes #2071

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:46:55 -05:00
Tomasz Grabiec
9f63e172fb tests: compaction_manager_test: Fix abort on exception
Message-Id: <1487343901-12745-1-git-send-email-tgrabiec@scylladb.com>
2017-02-17 15:53:55 +00:00
Avi Kivity
113ed9e963 Merge seastar upstream
* seastar 28a143a...5088065 (8):
  > configure.py: switch cmake to build c-ares to do out-of-source-tree build
  > iotune: make sure help is working
  > collectd: send double correctly for gauge
  > tls: make shutdown/close do "clean" handshake shutdown in background
  > tls: Make sink/source (i.e. streams) first class channel owners
  > native-stack: Make sink/source (i.e. streams) first class channel owners
  > posix-stack: Make sink/source (i.e. streams) first class channel owners
  > Merge "Detector for tasks blocked" from Glauber

Fixes #2085.

Packaging updated to require cmake, drop libtool and automake.
2017-02-16 19:34:28 +02:00
Vlad Zolotarov
25502149cf database: lister::scan_dir(): std::move() all that needs to be moved
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-16 11:56:44 -05:00
Raphael S. Carvalho
53d9008052 sstables/deletion_manager: kill dead code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2b24d9e622238030a737fbbe12b8439853d5d075.1487095059.git.raphaelsc@scylladb.com>
2017-02-16 18:38:54 +02:00
Vlad Zolotarov
f2e4629254 main.cc: expose scylla version as a gauge metrics
Add a new metric that exposes the current ScyllaDB version as a
gauge metrics.

The version is exposed as a label with the "version" key.

Fixes #1979

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1487083703-27929-1-git-send-email-vladz@scylladb.com>
2017-02-16 16:57:55 +02:00
Piotr Jastrzebski
2b8e340761 Replace deprecated BOOST_MESSAGE with BOOST_TEST_MESSAGE
Boost.Test deprecated BOOST_MESSAGE as early as 1.34, and it has since
been permanently removed. This patch replaces all uses of BOOST_MESSAGE
with BOOST_TEST_MESSAGE.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f1732018912a864cea229b0f7cd48170fd927dc2.1487238426.git.piotr@scylladb.com>
2017-02-16 10:49:03 +01:00
Avi Kivity
b8c4b35b57 Merge "Fixes for counter cell locking" from Paweł
"This series contains some fixes and a unit test for the logic responsible
for locking counter cells."

* 'pdziepak/cell-locking-fixes/v1' of github.com:cloudius-systems/seastar-dev:
  tests: add test for counter cell locker
  cell_locking: fix schema upgrades
  cell_locker: make locker non-movable
  cell_locking: allow to be included by anyone
2017-02-15 17:36:48 +02:00
Paweł Dziepak
f7f89df782 tests: add test for counter cell locker 2017-02-15 15:09:40 +00:00
Paweł Dziepak
2eb3e35815 cell_locking: fix schema upgrades
* cell_entry destructor may be called when the former is unlinked
 * update pointer to schema in partition_entry on schema upgrade
 * use correct bucket count when creating a new hash table
2017-02-15 15:09:40 +00:00
Paweł Dziepak
fa9c712263 cell_locker: make locker non-movable
The locker keeps iterators to data that is potentially stored internally,
and moving the object would invalidate them.
2017-02-15 13:48:47 +00:00
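Why a self-referential type must be non-movable, in a hypothetical miniature (not the real cell_locker): a type holding a pointer or iterator into its own storage would, after a move, be left referring into the moved-from object.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical sketch: _current points into this object's own storage,
// so moving the object would leave _current dangling. Deleting the move
// operations makes that impossible at compile time.
class locker {
    int _state = 0;
    int* _current = &_state;           // self-reference into own storage
public:
    locker() = default;
    locker(locker&&) = delete;         // moving would invalidate _current
    locker& operator=(locker&&) = delete;
    int& current() { return *_current; }
};
```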
Paweł Dziepak
fc9145671d cell_locking: allow to be included by anyone 2017-02-15 13:48:47 +00:00
Tomasz Grabiec
9da078a18a tests: logalloc_test: Print debugging info and abort on failure
The test started to fail sporadically on jenkins after
7a00dd6985 due to quiesce() timing out. It's not
clear, though, whether this is a regression: before that series such timeouts
did not cause a test failure if the future eventually resolved; the timeout
was only logged.

I was not able to reproduce it on my setup nor on jenkins, so let's add more
debugging output and trigger a coredump next time the test fails.

Message-Id: <1487089576-27147-1-git-send-email-tgrabiec@scylladb.com>
2017-02-15 14:41:49 +02:00
Takuya ASADA
9c8515eeed dist/redhat: stop backporting ninja-build from Fedora, install it from EPEL instead
ninja-build-1.6.0-2.fc23.src.rpm was deleted from the Fedora web site for some
reason, but ninja-build-1.7.2-2 is available on EPEL, so we no longer need to
backport it from Fedora.

Fixes #2087

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487155729-13257-1-git-send-email-syuu@scylladb.com>
2017-02-15 12:58:00 +02:00
Paweł Dziepak
a5476b4e7d Merge "Emit only range tombstones relevant for query restrictions" from Tomasz
"Immediate reason to do this is to ensure that forwarding of streamed_mutation
will give the same mutations as slicing would, and have unit tests which
verify that those two access methods are consistent with each other.

Secondary reason is performance, to avoid processing unnecessary data.

Note that this should not cause digest mismatch of data queries during rolling
upgrade, because data queries are checksumming only tombstones affecting rows
in the results, so only relevant tombstones.

Fixes #1254."

* tag 'tgrabiec/only-relevant-range-tombstones-v2' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test that slicing returns only relevant range tombstones
  tests: Pass all mutation source parameters
  tests: mutation_source_tests: Ensure timestamps are strictly monotonic
  tests: streamed_mutation_assertions: Add more expectation methods
  tests: streamed_mutation_assertions: Make produces_end_of_stream() give better error messages
  sstables: Simplify sstable_streamed_mutation::read_next()
  sstables: Emit only relevant range tombstones
  range_tombstone: Introduce end_position()
  position_in_partition: Print position when printing fragment
  position_in_partition: Make printable
  position_in_partition: Add cast to view
  position_in_partition: Generalize from-bound_view constructor
  bound_view: Extract converters for range start and end bounds
  mutation_partition: Drop unneeded range tombstones
  mutation_partition: Simplify row removal
  range_tombstone_list: Introduce erase()
  partition_snapshot_reader: Emit only relevant tombstones
  range_tombstone_stream: Add slicing apply() overload
  range_tombstone_list: Introduce slice()
2017-02-14 11:18:51 +00:00
Tomasz Grabiec
7ec8c4cf54 tests: mutation_source_test: Test that slicing returns only relevant range tombstones 2017-02-13 20:52:50 +01:00
Tomasz Grabiec
2b8bd10dca tests: Pass all mutation source parameters 2017-02-13 20:52:49 +01:00
Tomasz Grabiec
25dffef6ae tests: mutation_source_tests: Ensure timestamps are strictly monotonic 2017-02-13 16:19:32 +01:00
Tomasz Grabiec
e6a95fd8cc tests: streamed_mutation_assertions: Add more expectation methods 2017-02-13 16:19:32 +01:00
Tomasz Grabiec
62843175ea tests: streamed_mutation_assertions: Make produces_end_of_stream() give better error messages 2017-02-13 16:19:32 +01:00
Tomasz Grabiec
1e2c01ff49 sstables: Simplify sstable_streamed_mutation::read_next()
mp_row_consumer doesn't split row fragments on repeated range
tombstones any more.
2017-02-13 16:12:16 +01:00
Tomasz Grabiec
6324876f24 sstables: Emit only relevant range tombstones 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
72d74b7b40 range_tombstone: Introduce end_position() 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
79d982cd86 position_in_partition: Print position when printing fragment 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
f2e1f2938b position_in_partition: Make printable 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
fb67dab548 position_in_partition: Add cast to view 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
9ddb5a5173 position_in_partition: Generalize from-bound_view constructor
We will need to create positions corresponding to general range
bounds, not only corresponding to range tombstones.
2017-02-13 16:12:16 +01:00
Tomasz Grabiec
69911a87d3 bound_view: Extract converters for range start and end bounds 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
2489a0f82e mutation_partition: Drop unneeded range tombstones
Fixes #1254.
2017-02-13 16:12:16 +01:00
Tomasz Grabiec
884858078a mutation_partition: Simplify row removal 2017-02-13 16:12:15 +01:00
Tomasz Grabiec
fb42366552 range_tombstone_list: Introduce erase() 2017-02-13 16:12:15 +01:00
Tomasz Grabiec
fcf3391785 partition_snapshot_reader: Emit only relevant tombstones
Refs #1254.
2017-02-13 16:12:15 +01:00
Tomasz Grabiec
440e50b76a range_tombstone_stream: Add slicing apply() overload 2017-02-13 16:12:15 +01:00
Tomasz Grabiec
8b7f93175c range_tombstone_list: Introduce slice() 2017-02-13 16:12:15 +01:00
Paweł Dziepak
8b1d34f39d mutation_partition_serializer: avoid creating atomic_cell object
write_{live, counter, expiring, dead}_cell() take a const reference to
an atomic cell as argument. However, their caller (which is
write_row_cells) passes them an atomic_cell_view.
There is an appropriate implicit constructor, so instead of a compiler
complaint we get atomic_cell objects being constructed from views, which
involves an allocation and a copy.
Message-Id: <20170213100106.9071-1-pdziepak@scylladb.com>
2017-02-13 11:23:23 +01:00
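The bug pattern described above, reduced to a generic sketch (hypothetical names, using std::string/std::string_view in place of atomic_cell/atomic_cell_view): the implicit converting constructor silently materializes a copy at every call, while taking the view type directly avoids it.

```cpp
#include <cassert>
#include <string>
#include <string_view>

static int g_copies = 0;   // counts how many owning cells get constructed

// Heavy owning type with an *implicit* constructor from the view type,
// mirroring atomic_cell's implicit construction from atomic_cell_view.
struct cell {
    std::string data;
    cell(std::string_view v) : data(v) { ++g_copies; }  // allocates + copies
};

// Taking const cell& triggers the implicit conversion when called
// with a view: a hidden allocation and copy per call.
inline std::size_t write_cell_by_value(const cell& c) { return c.data.size(); }

// Taking the view type directly avoids any copy.
inline std::size_t write_cell_by_view(std::string_view v) { return v.size(); }
```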
Avi Kivity
bcf34b9a58 Merge seastar upstream
* seastar 83a41c8...28a143a (5):
  > prometheus: send one MetricFamily per unique metric name
  > tests: Add test for circular_buffer::erase()
  > circular_buffer: Introduce erase()
  > protect against infinite do_until loop
  > metrics: alternative metrics creation with labels
2017-02-12 21:56:11 +02:00
Gleb Natapov
bb72425b61 storage_proxy: fix send_to_endpoint() to use correct create_write_response_handler() overload
There are several problems with storage_proxy::send_to_endpoint() right
now: it uses the create_write_response_handler() overload that is specific
to read repair, which is suboptimal and creates incorrect logs; it does not
process errors; and it does not hold the storage_proxy object until the write
is complete. This patch fixes all of these problems.

Message-Id: <20170208101949.GA19474@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-02-12 10:46:13 +02:00
Tomasz Grabiec
c70ebc7ca5 lsa: Make reclaim_timer enclose segment_pool::reclaim_segments()
LSA timing did not include segment migration. It does after this
change.
Message-Id: <1486657046-9378-1-git-send-email-tgrabiec@scylladb.com>
2017-02-09 17:07:59 +00:00
Avi Kivity
5f15388e7a Merge "Size-based buffering of mutation_fragments" from Paweł
"This series changes buffering of mutation fragments in streamed mutations
so that the size of the fragments is taken into account.
The original implementation buffered up to 16 fragments, which was pretty
much meaningless, since it could be far too much if the fragments were
large or nowhere near enough if they were small.

Fixes #2036."

* 'pdziepak/buffer-mfs-by-size/v1' of github.com:cloudius-systems/seastar-dev:
  streamed_mutation: size-based mutation_fragment buffer limit
  mutation_fragment: cache size in memory
  mutation_fragment: make write access more explicit
2017-02-09 16:42:42 +02:00
Paweł Dziepak
3079b1661e streamed_mutation: size-based mutation_fragment buffer limit
Currently, streamed mutations buffer up to 16 mutation fragments. This
may be too much, not enough or a perfect choice depending on the
mutation fragment size.

This patch makes the streamed mutation choose how many mutation fragments to
keep in the buffer depending on their size, so that we avoid using too
much memory in case of large mutation fragments and are able to buffer a
lot of fragments if they are small.
2017-02-09 10:51:11 +00:00
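The size-based policy, in a hypothetical miniature (not the real streamed_mutation buffer): instead of capping the buffer at a fixed fragment count, cap the total buffered bytes, so many small fragments fit while a few large ones already fill the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Simplified sketch: fragments are strings, fullness is decided by total
// buffered bytes rather than by a fixed fragment count.
class fragment_buffer {
    std::deque<std::string> _buf;
    std::size_t _buffered_bytes = 0;
    std::size_t _max_bytes;
public:
    explicit fragment_buffer(std::size_t max_bytes) : _max_bytes(max_bytes) {}

    bool is_full() const { return _buffered_bytes >= _max_bytes; }

    void push(std::string fragment) {
        _buffered_bytes += fragment.size();
        _buf.push_back(std::move(fragment));
    }

    std::string pop() {
        std::string f = std::move(_buf.front());
        _buf.pop_front();
        _buffered_bytes -= f.size();
        return f;
    }

    std::size_t size() const { return _buf.size(); }
};
```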
Paweł Dziepak
cd0dc7734a mutation_fragment: cache size in memory 2017-02-09 10:50:51 +00:00
Paweł Dziepak
354ce0b2c7 mutation_fragment: make write access more explicit
mutation_fragments are going to be caching their size in memory. In
order to be able to invalidate that correctly, they need to know when
that size may change (but avoid invalidation when it is not necessary).
2017-02-09 10:49:46 +00:00
Avi Kivity
9530bac2d6 Merge "Adding metrics using histogram and labels" from Amnon
"This series uses the newly added histogram and label support to add metrics to
the storage_proxy and to the column_family.

This would add latency and histogram and the missing metrics from column family."

* 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev:
  database: add metrics registration for the column family
  storage_proxy: add read and write latency histogram
  estimated_histogram: returns a metrics histogram
2017-02-09 11:40:57 +02:00
Avi Kivity
9e4ae0763d Merge "Disallow mixed schemas" from Paweł
"This series makes sure that schemas containing both counter and non-counter
regular or static columns are not allowed."

* 'pdziepak/disallow-mixed-schemas/v1' of github.com:cloudius-systems/seastar-dev:
  schema: verify that there are not both counter and non-counter columns
  test/mutation_source: specify whether to generate counter mutations
  tests/canonical_mutation: don't try to upgrade incompatible schemas
2017-02-07 18:03:28 +02:00
Paweł Dziepak
4cbbbc67f0 schema: verify that there are not both counter and non-counter columns 2017-02-07 15:17:14 +00:00
Paweł Dziepak
4ffe0401ee test/mutation_source: specify whether to generate counter mutations
Tests using the random mutation generator should be provided with both
counter and non-counter mutations to ensure that both cases are
sufficiently covered. However, mixed schemas (with both counter and
non-counter columns) are not allowed so the RMG has to be explicitly
told whether to use counter or non-counter schema.
2017-02-07 15:17:14 +00:00
Paweł Dziepak
294bf0bb7a tests/canonical_mutation: don't try to upgrade incompatible schemas
Test case test_reading_with_different_schemas uses randomly generated
pairs of mutations and tries to upgrade one to the schema of the other.

However, there are cases when one schema cannot be upgraded to another,
for example, counter and non-counter schemas.
2017-02-07 15:17:14 +00:00
Amnon Heiman
292c08f598 database: add metrics registration for the column family
This patch adds a metrics registration to the column_family.

Using labels, each column family metric is labeled with its keyspace and
column family name.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-02-06 18:27:01 +02:00
Amnon Heiman
2cf13c26e2 storage_proxy: add read and write latency histogram
Register the read and write latency histogram on the metrics layer.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-02-06 17:54:47 +02:00
Amnon Heiman
1e3cfe7396 estimated_histogram: returns a metrics histogram
The metrics histogram is a struct that describes a histogram.
This patch adds a getter method that lets estimated_histogram return
a metrics::histogram; this allows registering it as a histogram
metric.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-02-06 17:34:43 +02:00
Avi Kivity
afe98f8572 Merge "Materialized views: support new entries" from Duarte
"This patchset adds bits of the MV write-path, enough to support
new entries to be added. Note that this is still limited, as only
adding new rows to a base table will work correctly."

* 'materialized-views/insert-path/v4' of https://github.com/duarten/scylla: (30 commits)
  database: Apply mutation to views
  column_family: Push view replica update
  materialized views: partial mutate_MV
  materialized views: function to send a mutation to endpoint
  materialized views: add VIEW write type
  database: Ensure new write_type is correctly printed
  materialized views: match base and view replicas
  column_family: Generate view updates
  column_family: Adds affected_views() function
  view: Add view_update_builder class
  range_tombstone_accumulator: Expose current tombstone
  range_tombstone_accumulator: apply() takes value
  view_updates: Generate updates
  view_updates: Adds function to replace row
  view_updates: Update view entry
  view_updates: Delete old view entry
  mutation_partition: Introduce shadowable tombstone
  view_updates: Create view entry
  view_updates: Compute row marker
  view: Introduce view_updates class
  ...
2017-02-06 15:10:38 +02:00
Duarte Nunes
0eca6301d3 database: Apply mutation to views
This patch changes the database apply path so that it also
generates the mutations for the column family's views and
sends them to the paired view replicas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:37:33 +01:00
Duarte Nunes
4777172348 column_family: Push view replica update
This patch adds a function to push updates to the view replicas of a
particular base table.
2017-02-06 13:36:45 +01:00
Nadav Har'El
3ae73164a4 materialized views: partial mutate_MV
This adds a function mutate_MV() which takes view mutations and sends
them to the appropriate nodes (this may be the current node, or a
remote node).

This is only a partial implementation - we still don't do the local
batch log (to survive reboots and failures) and some other stuff which
is left commented out.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Nadav Har'El
f2fd81ece0 materialized views: function to send a mutation to endpoint
Add a function for sending one mutation to one remote replica owning
this mutation. This is needed for materialized views, where each
base replica sends each view mutation to one particular view replica.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2017-02-06 13:36:45 +01:00
Nadav Har'El
92fc7386f6 materialized views: add VIEW write type
This adds to the "write_type" enum also the "VIEW" write type.
To be honest, I don't understand why the "write_type" distinction
is important.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
11bd3bd29f database: Ensure new write_type is correctly printed
By removing the default case in the switch statement over a write_type
variable, we ensure the compiler warns us about lack of exhaustiveness
in case we add a value to the enum but forget to change the
corresponding operator<<().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
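The exhaustiveness technique described above, as a small self-contained sketch (hypothetical enumerators): with no `default:` label, compilers warn (e.g. GCC/Clang `-Wswitch`) whenever a new enumerator like VIEW is added but not handled in the switch.

```cpp
#include <cassert>
#include <string>

// Hypothetical write_type enum; the real one lives in the database code.
enum class write_type { simple, batch, counter, view };

inline std::string to_string(write_type t) {
    // No default: case — adding an enumerator without a case here
    // triggers a -Wswitch warning instead of silently falling through.
    switch (t) {
    case write_type::simple:  return "SIMPLE";
    case write_type::batch:   return "BATCH";
    case write_type::counter: return "COUNTER";
    case write_type::view:    return "VIEW";
    }
    return "?";   // unreachable; silences missing-return warnings
}
```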
Nadav Har'El
365df8f900 materialized views: match base and view replicas
A function to find the appropriate replica to send a view update to.

This patch creates a new source file db/view/view.cc. We should
eventually move a lot more of the materialized views code there.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2017-02-06 13:36:45 +01:00
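The pairing described above (each base replica sends each view mutation to exactly one view replica) can be sketched as follows. This is an illustrative Python model, not the actual code in db/view/view.cc; the index-based pairing scheme and the function name are assumptions made for clarity:

```python
def paired_view_replica(base_replicas, view_replicas, this_node):
    """Pair each base replica with the view replica at the same position
    in the ordered replica lists, so that every view update is sent by
    exactly one base replica to exactly one view replica."""
    return view_replicas[base_replicas.index(this_node)]

base = ["node1", "node2", "node3"]
view = ["node4", "node5", "node6"]
# node2 (a base replica) sends its view updates only to its pair, node5.
target = paired_view_replica(base, view, "node2")
```

Note that with a positional pairing every view replica receives updates from exactly one base replica, which keeps the update fan-out constant.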
Duarte Nunes
16206e9f15 column_family: Generate view updates
This patch adds the generate_view_updates() function to the
column_family class, which will use the view_update_builder to
generate updates to the column_family's materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
90cb35db04 column_family: Adds affected_views() function
This patch adds the affected_views() function to determine which of the
column family's views a given update affects.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
d5a61a8c48 view: Add view_update_builder class
This patch adds the view_update_builder class, which is responsible
for calculating the mutations to apply to a column family's
materialized views, given a streamed_mutation representing an update
to the base table and a streamed_mutation representing the
pre-existing rows which the update covers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
2ab9ba995a range_tombstone_accumulator: Expose current tombstone
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
f3c5ea392a range_tombstone_accumulator: apply() takes value
range_tombstone_accumulator::apply() now takes a value so the caller
can decide whether to move or copy the argument.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
3991a58f08 view_updates: Generate updates
This patch adds the view_updates::generate_update() function to
generate view updates given a base row update and the corresponding,
pre-existing row. This function will decide which of the previously
introduced functions to call based on whether there is a pre-existing
row and whether there exists a regular base column that's part of the
view's PK.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
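The dispatch this message describes can be modeled roughly as below. This is an illustrative Python sketch based only on the commit message; the function name, argument names, and return labels are assumptions, not the real C++ interface:

```python
def choose_view_operation(existing_row, has_regular_base_column_in_view_pk):
    """Pick which view-update operation applies to a base row update:
    no pre-existing row -> create a new view entry; a pre-existing row
    whose view key cannot change -> update the entry in place; a view
    key derived from a regular base column -> the key may have moved,
    so delete the old entry and create a new one."""
    if existing_row is None:
        return "create_entry"
    if not has_regular_base_column_in_view_pk:
        return "update_entry"
    return "delete_old_entry + create_entry"
```

The third branch corresponds to the "replace row" patch above, which is itself a deletion of the old entry followed by the creation of a new one.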
Duarte Nunes
861d2dfb61 view_updates: Adds function to replace row
This patch adds a function to replace a view row given a base
table update and the pre-existing row, which simply deletes the old
view entry and adds a new one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
7901ce7de4 view_updates: Update view entry
This patch introduces the view_updates::update_entry function,
which creates the updates to apply to the existing view entry given
the base table row before and after the update.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
b34ae6d6da view_updates: Delete old view entry
This patch introduces the view_updates::delete_old_entry function,
which creates a view row mutation to delete an entry given an updated
base table row.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
7e150a18eb mutation_partition: Introduce shadowable tombstone
This patch introduces shadowable row tombstones. A shadowable row
tombstone is valid only if the row has no live marker. In other words,
the row tombstone is only valid as long as no newer insert is done
(thus setting a live row marker; note that if the row timestamp set
is lower than the tombstone's, then the tombstone remains in effect
as usual).

If a row has a shadowable tombstone with timestamp Ti and that row
is updated with a timestamp Tj, such that Tj > Ti (and that update
sets the row marker), then the shadowable tombstone is shadowed by
that update. A concrete consequence is that if the update has cells
with timestamp lower than Ti, then those cells are preserved (since
the deletion is removed), and this is contrary to a regular,
non-shadowable row tombstone where the tombstone is preserved and
such cells are removed.

Currently, only Materialized Views require shadowable row tombstones,
which solve a problem with view row deletions. Consider a base row with
columns p, v1, v2, PRIMARY KEY (p) denormalized into a view row consisting
of columns p, v1, v2 PRIMARY KEY (p, v1), and the following operations:

1) INSERT INTO base (p, v1, v2) VALUES (0, 0, 1) USING TIMESTAMP 0;
2) UPDATE base SET v1 = 1 USING TIMESTAMP 1 WHERE p = 0;
3) UPDATE base SET v1 = 0 USING TIMESTAMP 2 WHERE p = 0;

Without shadowable tombstones, the view contains:

At 1), pk = (0, 0), row_marker@T0, v2=1@T0
At 2), pk = (0, 0), row_marker@T0, row_tombstone@T1, v2=1@T0
       pk = (0, 1), row_marker@T1, v2=1@T0
At 3), pk = (0, 0), row_marker@T2, row_tombstone@T1, v2=1@T0
       pk = (0, 1), row_marker@T1, row_tombstone@T2, v2=1@T0

Notice how, if we read row (0, 0), the value of v2 will be shadowed by
the row tombstone we previously inserted.

With a view's row tombstone becoming shadowable, at 3) the row (0, 0)
will look like pk = (0, 0), row_marker@T2, shadowable_tombstone@T1, v2=1@T0,
which is equivalent to pk = (0, 0), row_marker@T2, v2=1@T0.

Since the shadowable tombstone is shadowed by the new row marker (T0 <
T2), now v2 would be taken into account.

Finally, note that this patch doesn't generalize the idea of
shadowable tombstone, instead taking advantage of the fact that they
are only needed by Materialized Views. This saves changing the
tombstone representation to account for an extra flag, the bits such
representation would require, and also avoids changes to the storage
format.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
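The rule in the example above can be modeled in a few lines. This is an illustrative Python sketch of the semantics only, not Scylla's tombstone representation: a regular tombstone always drops cells with older timestamps, while a shadowable one stops applying once a newer row marker exists:

```python
def visible_cells(marker_ts, tomb_ts, shadowable, cells):
    """cells maps column name -> write timestamp. A regular row
    tombstone drops every cell with timestamp <= tomb_ts; a shadowable
    tombstone is itself dropped once the row marker is newer than it."""
    if shadowable and marker_ts > tomb_ts:
        return dict(cells)  # tombstone shadowed by the newer insert
    return {name: ts for name, ts in cells.items() if ts > tomb_ts}

# Step 3) of the example: row (0, 0) with row_marker@T2, a tombstone@T1
# and v2=1@T0.
regular = visible_cells(marker_ts=2, tomb_ts=1, shadowable=False, cells={"v2": 0})
shadow = visible_cells(marker_ts=2, tomb_ts=1, shadowable=True, cells={"v2": 0})
```

With a regular tombstone the v2 cell is lost; with a shadowable one the newer row marker cancels the tombstone and v2 survives, matching the example.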
Duarte Nunes
e0f642180f view_updates: Create view entry
This patch introduces the view_updates::create_entry function, which
creates a view row mutation given a new base table row.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:31 +01:00
Duarte Nunes
b8b8a8099c view_updates: Compute row marker
This patch adds a function to compute the row marker of a view row
given the base row. There are two cases to consider when building the
row marker: 1) there is a column C that is a regular base column but
is in the view PK; and 2) the columns for the base and the view PKs
are the same.

For 1), the view row marker timestamp will be the biggest between the
base's row marker and C. The TTL will be that of C. This means that if
C expires, the view row marker will expire as well (and the row, if no
other column is keeping it alive). Note that if the base row marker
expires but not C, then the base row will still be live due to C and
we shouldn't expire the view row.

For 2), the view row timestamp will be the same as the base row
timestamp. The TTL should be set in such a way that both base and view
rows live for the same time. We thus set the view row TTL to be the
max of any other TTL in the base row. This is particularly important
in the case where the base row marker has a TTL, but a column *absent*
from the view holds a greater one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:31 +01:00
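The two cases can be summarized in a short sketch. This is illustrative Python, not the real implementation; `math.inf` stands in for "no TTL" and all names are assumptions:

```python
import math

NO_TTL = math.inf  # stands in for a cell that never expires

def view_row_marker(base_marker, cells, view_pk_base_column=None):
    """base_marker and each cells[name] are (timestamp, ttl) pairs.
    Case 1: a regular base column C is in the view PK -> the marker
    timestamp is max(base marker, C) and the TTL is C's, so the view
    row expires together with C.
    Case 2: base and view PK columns are the same -> the base marker
    timestamp, with the max TTL of the base row, so base and view rows
    live equally long."""
    ts, ttl = base_marker
    if view_pk_base_column is not None:
        c_ts, c_ttl = cells[view_pk_base_column]
        return max(ts, c_ts), c_ttl
    return ts, max([ttl] + [c_ttl for _, c_ttl in cells.values()])

# Case 1: v1 (in the view PK) was written later, with a 100s TTL.
case1 = view_row_marker((5, NO_TTL), {"v1": (7, 100)}, "v1")
# Case 2: a column absent from the view holds a greater TTL (600s).
case2 = view_row_marker((5, 60), {"v2": (5, 600)})
```

In the second call the view row TTL is taken from v2 even though v2 is not in the view, which is exactly the situation the last paragraph highlights.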
Duarte Nunes
7321938bcf view: Introduce view_updates class
This patch introduces the view_updates class, which is responsible
for generating and storing updates to a particular materialized view.

The updates will be generated from an updated base row and the
pre-existing one (if any), in later patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:31 +01:00
Duarte Nunes
0f8dbc9243 collection_type_impl: Iterate over collection cells
This patch introduces the collection_type_impl::for_each_cell()
function, which allows the caller to iterate over the cells of a
particular collection_mutation_view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:31 +01:00
Duarte Nunes
082ef56df1 view: Store pk view column that's non-pk in the base
To help calculate the view mutations from a base update, we store in
the view class the column that's part of the view's primary key but
not part of the base's, if such a column exists.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
734ad80390 view: Add matches_view_filter() function
This patch adds the matches_view_filter() function which specifies
whether a given base row matches the view filter. Unlike
may_be_affected_by(), this function has no false positives.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
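The relationship between the two checks can be illustrated like this. A hedged Python sketch, not the real API; representing restrictions as predicate functions is an assumption for illustration:

```python
def may_be_affected_by(filter_columns, updated_columns):
    """Conservative pre-check (false positives allowed): the update
    touches at least one column the view depends on."""
    return bool(filter_columns & updated_columns)

def matches_view_filter(view_filter, row):
    """Exact check (no false positives): every restriction in the
    view filter holds for this base row."""
    return all(pred(row.get(col)) for col, pred in view_filter.items())

# A view filtering on v1 = 0.
vf = {"v1": lambda x: x == 0}
```

The cheap conservative check prunes irrelevant updates early; the exact check then decides precisely which rows to denormalize.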
Duarte Nunes
7be0f319d4 single_column_restriction: Filter clustering rows
This patch adds the is_satisfied_by() function to
single_column_restriction, which given a clustering row returns
whether the restriction applies or not.

This is useful for secondary indexing such as materialized views,
where filters on regular columns precisely select which base table
rows to denormalize.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
3b52440ff3 statement_restrictions: Expose non-pk restrictions
This patch exposes the non-primary key column restrictions in a given
select statement, exposing them as single_column_restrictions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
e987d87ab1 collection_type_impl: Identify concrete types
This patch adds the is_set() and is_list() functions to
collection_type_impl, which identify the concrete collection
type.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
71faa4a4eb abstract_restriction: Rename uses_function()
This patch renames abstract_restriction::uses_function() to
term_uses_function(), as it was previously hiding a function with the
same name in the restriction base class.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
21d1bbb527 view: Add may_be_affected_by function
This patch adds the may_be_affected_by() function to the view class,
which is responsible for determining whether an update to a base table
affects one of its views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
c35d14e285 column_family: Store a pointer to view
Instead of storing the view in the column_family's map of materialized
views, store a lw_shared_ptr so that the view can be removed while it
is being updated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
69171c28f0 cql3/util: Fix use-after-free
This patch fixes a use-after-free error in
rename_column_in_where_clause(), where we were creating a boost
adaptor on an rvalue.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Avi Kivity
3896c27e5f Merge "DNS use in scylla" from Calle
"Fixes #1531

Adds lookup to gms::inet_address and uses it in (hopefully all) the
salient places where configured symbolic names are interpreted.

Removes the dummy dns modula in scylla in favour of the seastar one."

* 'calle/use-dns' of github.com:cloudius-systems/seastar-dev:
  remove scylla dns code
  service::storage_service: Remove dependency on scylla dns
  main.cc: remove scylla dns dependency
  main/init: Lookup inet addresses from config by dns lookup
  db::system_keyspace: Find rpc_address by lookup
  gms::inet_address: Add lookup functionality.
  scylla tls: Add option support for client auth and tls opts
2017-02-06 13:50:42 +02:00
Avi Kivity
da8d00199e Merge 2017-02-06 13:43:07 +02:00
Avi Kivity
fdfabbf8bb Merge seastar upstream
* seastar f07f8ed...83a41c8 (8):
  > Cleaning the metrics API
  > tutorial: pick the name "asynchronous function".
  > tutorial: explain the difference between exception and exception future
  > tutorial: abstract
  > ninja: don't bother building c-ares shared libraries
  > ninja: unbreak build ordering
  > ninja: unbreak "ninja -t clean"
  > Add libtool to dependencies
2017-02-06 13:42:38 +02:00
Calle Wilund
44503f8253 remove scylla dns code
Use seastar facilities instead.
2017-02-06 11:36:57 +00:00
Calle Wilund
ab800c225a service::storage_service: Remove dependency on scylla dns
Use seastar facilities instead
2017-02-06 11:36:57 +00:00
Calle Wilund
c4c4eb06c4 main.cc: remove scylla dns dependency
Use seastar facilities instead.
2017-02-06 11:36:57 +00:00
Avi Kivity
b18e54307f tests: add --operations-per-shard option to perf_simple_query
This helps achieve more repeatable runs that can then be compared via the
Linux perf tool.  The option overrides duration-based testing and runs the
test for a specific number of iterations.
Message-Id: <20170204172937.8462-1-avi@scylladb.com>
2017-02-06 12:08:04 +01:00
Gleb Natapov
3c372525ed storage_proxy: use storage_proxy clock instead of explicit lowres_clock
Merge commit 45b6070832 used a butchered version of the storage_proxy
patch to adjust to the rpc timer change, instead of the one I sent. This
patch fixes the differences.

Message-Id: <20170206095237.GA7691@scylladb.com>
2017-02-06 12:51:36 +02:00
Calle Wilund
feffc2bbe1 main/init: Lookup inet addresses from config by dns lookup
That is, allow symbolic names in addition to IP addresses.
2017-02-06 09:45:37 +00:00
Calle Wilund
ef26ab0e1b db::system_keyspace: Find rpc_address by lookup 2017-02-06 09:45:37 +00:00
Calle Wilund
0a740b5ccf gms::inet_address: Add lookup functionality.
To find addresses by name.
2017-02-06 09:45:37 +00:00
Calle Wilund
ff8f82f21c scylla tls: Add option support for client auth and tls opts
Refs #1813 (fixes scylla part)

Added require_client_auth and priority_string options to
server_encryption_options/client_encryption_options and process them.

Allows TLS method/algorithm specification. Also enables enforcing
known-certificate authentication for both node-to-node and client
communication.
2017-02-06 09:45:09 +00:00
Avi Kivity
6e9e28d5a3 cell_locking: work around for missing boost::container::small_vector
small_vector doesn't exist on Ubuntu 14.04's boost, use std::vector
instead.
2017-02-05 20:48:36 +02:00
Avi Kivity
2510b756fc dist: add build dependency on automake
Needed by seastar's c-ares.
2017-02-05 20:16:27 +02:00
Takuya ASADA
e82932b774 dist/common/systemd: introduce scylla-housekeeping restart mode
scylla-housekeeping needs to run in 'restart mode' to check the version during
a scylla-server restart, which wasn't triggered by the systemd timer, so add it.

The existing scylla-housekeeping.timer is renamed to scylla-housekeeping-daily.timer,
since it runs 'daily mode'.

Fixes #1953

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1486180031-18093-1-git-send-email-syuu@scylladb.com>
2017-02-05 10:46:04 +02:00
Avi Kivity
4175f40da1 dist: add libtool build dependency for seastar/c-ares 2017-02-05 10:42:53 +02:00
Takuya ASADA
12b5e7288d dist/common/scripts/scylla_setup: show restart message when SELinux was disabled on the script
Disabling SELinux requires a server restart, so warn the user to restart
before running Scylla.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485817393-25919-2-git-send-email-syuu@scylladb.com>
2017-02-05 10:10:18 +02:00
Takuya ASADA
c28a574b9e dist/common/scripts: stop setting hugepages boot parameter
Stop setting hugepages boot parameter since we don't use it on default
configuration (posix mode), but keep scylla_bootparam_setup to setup clocksource
on AMI.

Fixes #1758

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485817393-25919-1-git-send-email-syuu@scylladb.com>
2017-02-05 10:10:18 +02:00
Paweł Dziepak
37b0c71f1d cell_locking: fix partition_entry::equal_compare
The comparator constructor took the schema by value instead of by const
l-value reference and, consequently, later tried to access an object
that had been destroyed long ago.
Message-Id: <20170202135853.8190-1-pdziepak@scylladb.com>
2017-02-03 19:49:18 +01:00
Avi Kivity
7a00dd6985 Merge "Avoid avalanche of tasks after memtable flush" from Tomasz
"Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The number of scheduled
tasks is at most one; there is a single releasing thread.

However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when they contain sparse segments). Compaction may start many request releasing
threads due to group size updates.

Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.

I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."

* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
  tests: lsa: Add test for reclaimer starting and stopping
  tests: lsa: Add request releasing stress test
  lsa: Avoid avalanche releasing of requests
  lsa: Move definitions to .cc
  lsa: Simplify hard pressure notification management
  lsa: Do not start or stop reclaiming on hard pressure
  tests: lsa: Adjust to take into account that reclaimers are run synchronously
  lsa: Document and annotate reclaimer notification callbacks
  tests: lsa: Use with_timeout() in quiesce()
2017-02-02 17:49:31 +02:00
Tomasz Grabiec
2fd339787b tests: lsa: Add test for reclaimer starting and stopping 2017-02-01 17:41:56 +01:00
Tomasz Grabiec
f943296da0 tests: lsa: Add request releasing stress test 2017-02-01 17:41:55 +01:00
Tomasz Grabiec
e40fb438f5 lsa: Avoid avalanche releasing of requests
Before, the logic for releasing writes blocked on dirty worked like
this:

  1) When region group size changes and it is not under pressure and
     there are some requests blocked, then schedule request releasing
     task

  2) request releasing task, if no pressure, runs one request and if
     there are still blocked requests, schedules next request
     releasing task

If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The number of scheduled tasks is at most one; there is a
single thread of execution.

However, if requests themselves would change the size of the group,
then each such change would schedule yet another request releasing
thread, growing the task queue size by one.

The group size can also change when memory is reclaimed from the
groups (e.g. when they contain sparse segments). Compaction may start
many request releasing threads due to group size updates.

Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency because timed-out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in
the system. When the amount of scheduled tasks reaches 1000, polling
stops and server becomes unresponsive until all of the released
requests are done, which is either when they start to block on dirty
memory again or run out of blocked requests. It may take a while to
reach pressure condition after memtable flush if it brings virtual
dirty much below the threshold, which is currently the case for
workloads with overwrites producing sparse regions.

Refs #2021.

Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.

The logic for notification across the hierarchy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
2017-02-01 17:41:55 +01:00
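The invariant this fix establishes (at most one releasing task per region group, no matter how many size changes occur) can be modeled with a small synchronous sketch. The class and member names are illustrative assumptions, not the real LSA code:

```python
from collections import deque

class RegionGroup:
    """Toy model: blocked requests are drained by a single release
    loop, and re-entrant size updates never schedule a second one."""

    def __init__(self, limit):
        self.limit = limit
        self.size = 0
        self.blocked = deque()
        self.releasing = 0          # releasing tasks currently active
        self.max_releasing = 0      # high-water mark, for checking

    def under_pressure(self):
        return self.size >= self.limit

    def update(self, delta):
        self.size += delta
        # Key point of the fix: wake the releasing "fiber" only if one
        # is not already running, instead of scheduling a new task on
        # every relieving size change.
        if not self.under_pressure() and self.blocked and self.releasing == 0:
            self.releasing += 1
            self.max_releasing = max(self.max_releasing, self.releasing)
            self._release_loop()

    def _release_loop(self):
        # Execute blocked requests until pressure returns or none remain.
        while self.blocked and not self.under_pressure():
            request = self.blocked.popleft()
            request()               # may call update() re-entrantly
        self.releasing -= 1

g = RegionGroup(limit=5)
g.size = 5                          # start under pressure
for _ in range(3):
    g.blocked.append(lambda: g.update(-1))  # each request frees memory
g.update(-1)                        # relief: wakes the single releaser
```

Even though every executed request changes the group size re-entrantly, only one release loop ever runs, which is exactly the avalanche the old per-update task scheduling produced.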
Tomasz Grabiec
d55baa0cd1 lsa: Move definitions to .cc 2017-02-01 17:41:55 +01:00
Tomasz Grabiec
8f8b111b33 lsa: Simplify hard pressure notification management
The hard pressure was only signalled on region group when
run_when_memory_available() was called after the pressure condition
was met.

So the following loop is always an infinite loop rather than stopping
when enough is allocated to cause pressure:

   while (!gr.under_pressure()) {
       region.allocate(...);
   }

It's cleaner if pressure notification works not only if
run_when_memory_available() is used but whenever the condition changes,
like we do for the soft pressure.

There is comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:

 - we already notify on soft pressure conditions from update(), and if
   that is safe, notifying about hard pressure should also be safe. I
   checked and it looks safe to me.

 - avoiding notification in the rare case when we stopped writing
   right after crossing the threshold doesn't seem beneficial. It's
   unlikely in the first place, and one could argue it's better to
   actually flush now so that when writes resume they will not block.
2017-02-01 17:41:55 +01:00
Tomasz Grabiec
9aa1be5d08 lsa: Do not start or stop reclaiming on hard pressure
We already call these when crossing the soft threshold. We shouldn't
stop reclaiming when hard pressure is gone because soft pressure may
still be present. Calling start_reclaiming() on hard pressure is
unnecessary because soft pressure also starts it, and when there is
hard pressure there is also soft pressure.
2017-02-01 17:40:15 +01:00
Tomasz Grabiec
f053b48f7c tests: lsa: Adjust to take into account that reclaimers are run synchronously 2017-01-30 19:18:07 +01:00
Tomasz Grabiec
ed9ff19467 lsa: Document and annotate reclaimer notification callbacks
They are called from region_group::update(), so must be alloc-free and
noexcept.
2017-01-30 19:18:07 +01:00
Tomasz Grabiec
2ec6fe415e tests: lsa: Use with_timeout() in quiesce()
The current construct doesn't interrupt the test; the timeout failure
will only be logged.
2017-01-30 19:18:07 +01:00
563 changed files with 42635 additions and 12115 deletions

.gitmodules
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

CMakeLists.txt (new file)
View File

@@ -0,0 +1,140 @@
##
## For best results, first compile the project using the Ninja build-system.
##
cmake_minimum_required(VERSION 3.7)
project(scylla)
if (NOT DEFINED ENV{CLION_IDE})
message(FATAL_ERROR "This CMakeLists.txt file is only valid for use in CLion")
endif()
# Default value. A more accurate list is populated through `pkg-config` below if `seastar.pc` is available.
set(SEASTAR_INCLUDE_DIRS "seastar")
# These paths are always available, since they're included in the repository. Additional DPDK headers are placed while
# Seastar is built, and are captured in `SEASTAR_INCLUDE_DIRS` through parsing the Seastar pkg-config file (below).
set(SEASTAR_DPDK_INCLUDE_DIRS
seastar/dpdk/lib/librte_eal/common/include
seastar/dpdk/lib/librte_eal/common/include/generic
seastar/dpdk/lib/librte_eal/common/include/x86
seastar/dpdk/lib/librte_ether)
find_package(PkgConfig REQUIRED)
set(ENV{PKG_CONFIG_PATH} "${CMAKE_SOURCE_DIR}/seastar/build/release:$ENV{PKG_CONFIG_PATH}")
pkg_check_modules(SEASTAR seastar)
find_package(Boost COMPONENTS filesystem program_options system thread)
##
## Populate the names of all source and header files in the indicated paths in a designated variable.
##
## When RECURSIVE is specified, directories are traversed recursively.
##
## Use: scan_scylla_source_directories(VAR my_result_var [RECURSIVE] PATHS [path1 path2 ...])
##
function (scan_scylla_source_directories)
set(options RECURSIVE)
set(oneValueArgs VAR)
set(multiValueArgs PATHS)
cmake_parse_arguments(args "${options}" "${oneValueArgs}" "${multiValueArgs}" "${ARGN}")
set(globs "")
foreach (dir ${args_PATHS})
list(APPEND globs "${dir}/*.cc" "${dir}/*.hh")
endforeach()
if (args_RECURSIVE)
set(glob_kind GLOB_RECURSE)
else()
set(glob_kind GLOB)
endif()
file(${glob_kind} var
${globs})
set(${args_VAR} ${var} PARENT_SCOPE)
endfunction()
## Although Seastar is an external project, it is common enough to explore the sources while doing
## Scylla development that we'll treat the Seastar sources as part of this project for easier navigation.
scan_scylla_source_directories(
VAR SEASTAR_SOURCE_FILES
RECURSIVE
PATHS
seastar/core
seastar/http
seastar/json
seastar/net
seastar/rpc
seastar/tests
seastar/util)
scan_scylla_source_directories(
VAR SCYLLA_ROOT_SOURCE_FILES
PATHS .)
scan_scylla_source_directories(
VAR SCYLLA_SUB_SOURCE_FILES
RECURSIVE
PATHS
api
auth
cql3
db
dht
exceptions
gms
index
io
locator
message
repair
service
sstables
streaming
tests
thrift
tracing
transport
utils)
scan_scylla_source_directories(
VAR SCYLLA_GEN_SOURCE_FILES
RECURSIVE
PATHS build/release/gen)
set(SCYLLA_SOURCE_FILES
${SCYLLA_ROOT_SOURCE_FILES}
${SCYLLA_GEN_SOURCE_FILES}
${SCYLLA_SUB_SOURCE_FILES})
add_executable(scylla
${SEASTAR_SOURCE_FILES}
${SCYLLA_SOURCE_FILES})
# Note that since CLion does not understand GCC6 concepts, we always disable them (even if users configure otherwise).
# CLion seems to have trouble with `-U` (macro undefinition), so we do it this way instead.
list(REMOVE_ITEM SEASTAR_CFLAGS "-DHAVE_GCC6_CONCEPTS")
# If the Seastar pkg-config information is available, append to the default flags.
#
# For ease of browsing the source code, we always pretend that DPDK is enabled.
target_compile_options(scylla PUBLIC
-std=gnu++14
-DHAVE_DPDK
-DHAVE_HWLOC
"${SEASTAR_CFLAGS}")
# The order matters here: prefer the "static" DPDK directories to any dynamic paths from pkg-config. Some files are only
# available dynamically, though.
target_include_directories(scylla PUBLIC
.
${SEASTAR_DPDK_INCLUDE_DIRS}
${SEASTAR_INCLUDE_DIRS}
${Boost_INCLUDE_DIRS}
build/release/gen)

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=2.0.4
if test -f version
then

View File

@@ -29,6 +29,7 @@
#include "utils/histogram.hh"
#include "http/exception.hh"
#include "api_init.hh"
#include "seastarx.hh"
namespace api {

View File

@@ -252,13 +252,13 @@ void set_cache_service(http_context& ctx, routes& r) {
// In origin row size is the weighted size.
// We currently do not support weights, so we use num entries instead
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return cf.get_row_cache().num_entries();
return cf.get_row_cache().partitions();
}, std::plus<uint64_t>());
});
cs::get_row_entries.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return cf.get_row_cache().num_entries();
return cf.get_row_cache().partitions();
}, std::plus<uint64_t>());
});

View File

@@ -182,15 +182,6 @@ static int64_t max_row_size(column_family& cf) {
return res;
}
static double update_ratio(double acc, double f, double total) {
if (f && !total) {
throw bad_param_exception("total should include all elements");
} else if (total) {
acc += f / total;
}
return acc;
}
static integral_ratio_holder mean_row_size(column_family& cf) {
integral_ratio_holder res;
for (auto i: *cf.get_sstables() ) {
@@ -283,6 +274,16 @@ static std::vector<uint64_t> concat_sstable_count_per_level(std::vector<uint64_t
return a;
}
ratio_holder filter_false_positive_as_ratio_holder(const sstables::shared_sstable& sst) {
double f = sst->filter_get_false_positive();
return ratio_holder(f + sst->filter_get_true_positive(), f);
}
ratio_holder filter_recent_false_positive_as_ratio_holder(const sstables::shared_sstable& sst) {
double f = sst->filter_get_recent_false_positive();
return ratio_holder(f + sst->filter_get_recent_true_positive(), f);
}
void set_column_family(http_context& ctx, routes& r) {
cf::get_column_family_name.set(r, [&ctx] (const_req req){
vector<sstring> res;
@@ -604,39 +605,27 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], double(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {
double f = sst->filter_get_false_positive();
return update_ratio(s, f, f + sst->filter_get_true_positive());
});
}, std::plus<double>());
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_all_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, double(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {
double f = sst->filter_get_false_positive();
return update_ratio(s, f, f + sst->filter_get_true_positive());
});
}, std::plus<double>());
return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], double(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {
double f = sst->filter_get_recent_false_positive();
return update_ratio(s, f, f + sst->filter_get_recent_true_positive());
});
}, std::plus<double>());
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_all_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, double(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {
double f = sst->filter_get_recent_false_positive();
return update_ratio(s, f, f + sst->filter_get_recent_true_positive());
});
}, std::plus<double>());
return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

View File

@@ -26,7 +26,6 @@
namespace api {
using namespace scollectd;
namespace cm = httpd::compaction_manager_json;
using namespace json;

View File

@@ -24,7 +24,6 @@
namespace api {
using namespace scollectd;
using namespace json;
namespace hh = httpd::hinted_handoff_json;

View File

@@ -29,11 +29,11 @@
namespace api {
static logging::logger logger("lsa-api");
static logging::logger alogger("lsa-api");
void set_lsa(http_context& ctx, routes& r) {
httpd::lsa_json::lsa_compact.set(r, [&ctx](std::unique_ptr<request> req) {
logger.info("Triggering compaction");
alogger.info("Triggering compaction");
return ctx.db.invoke_on_all([] (database&) {
logalloc::shard_tracker().reclaim(std::numeric_limits<size_t>::max());
}).then([] {

View File

@@ -27,7 +27,7 @@
#include <sstream>
using namespace httpd::messaging_service_json;
using namespace net;
using namespace netw;
namespace api {
@@ -120,13 +120,13 @@ void set_messaging_service(http_context& ctx, routes& r) {
}));
get_version.set(r, [](const_req req) {
return net::get_local_messaging_service().get_raw_version(req.get_query_param("addr"));
return netw::get_local_messaging_service().get_raw_version(req.get_query_param("addr"));
});
get_dropped_messages_by_ver.set(r, [](std::unique_ptr<request> req) {
shared_ptr<std::vector<uint64_t>> map = make_shared<std::vector<uint64_t>>(num_verb);
return net::get_messaging_service().map_reduce([map](const uint64_t* local_map) mutable {
return netw::get_messaging_service().map_reduce([map](const uint64_t* local_map) mutable {
for (auto i = 0; i < num_verb; i++) {
(*map)[i]+= local_map[i];
}

View File

@@ -802,10 +802,8 @@ void set_storage_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(json_void());
});
ss::get_metrics_load.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(0);
ss::get_metrics_load.set(r, [&ctx](std::unique_ptr<request> req) {
return get_cf_stats(ctx, &column_family::stats::live_disk_space_used);
});
ss::get_exceptions.set(r, [](const_req req) {

View File

@@ -29,10 +29,11 @@
#include "net/byteorder.hh"
#include <cstdint>
#include <iosfwd>
#include <seastar/util/gcc6-concepts.hh>
template<typename T>
template<typename T, typename Input>
static inline
void set_field(managed_bytes& v, unsigned offset, T val) {
void set_field(Input& v, unsigned offset, T val) {
reinterpret_cast<net::packed<T>*>(v.begin() + offset)->raw = net::hton(val);
}
@@ -58,6 +59,7 @@ private:
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t REVERT_FLAG = 0x04; // transient flag used to efficiently implement ReversiblyMergeable for atomic cells.
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr int8_t COUNTER_IN_PLACE_REVERT = 0x10;
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
@@ -67,6 +69,7 @@ private:
static constexpr unsigned deletion_time_size = 4;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(bytes_view cell) {
return cell[0] & COUNTER_UPDATE_FLAG;
@@ -74,10 +77,17 @@ private:
static bool is_revert_set(bytes_view cell) {
return cell[0] & REVERT_FLAG;
}
static bool is_counter_in_place_revert_set(bytes_view cell) {
return cell[0] & COUNTER_IN_PLACE_REVERT;
}
template<typename BytesContainer>
static void set_revert(BytesContainer& cell, bool revert) {
cell[0] = (cell[0] & ~REVERT_FLAG) | (revert * REVERT_FLAG);
}
template<typename BytesContainer>
static void set_counter_in_place_revert(BytesContainer& cell, bool flag) {
cell[0] = (cell[0] & ~COUNTER_IN_PLACE_REVERT) | (flag * COUNTER_IN_PLACE_REVERT);
}
static bool is_live(const bytes_view& cell) {
return cell[0] & LIVE_FLAG;
}
@@ -91,13 +101,30 @@ private:
static api::timestamp_type timestamp(const bytes_view& cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
template<typename BytesContainer>
static void set_timestamp(BytesContainer& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
static bytes_view value(bytes_view cell) {
private:
template<typename BytesView>
static BytesView do_get_value(BytesView cell) {
auto expiry_field_size = bool(cell[0] & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static bytes_view value(bytes_view cell) {
return do_get_value(cell);
}
static bytes_mutable_view value(bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(bytes_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(const bytes_view& cell) {
assert(is_dead(cell));
@@ -130,12 +157,12 @@ private:
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, bytes_view value) {
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
set_field(b, value_offset, value);
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value, gc_clock::time_point expiry, gc_clock::duration ttl) {
@@ -148,6 +175,31 @@ private:
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
// make_live_from_serializer() is intended for users that need to serialise
// some object or objects to the format used in atomic_cell::value().
// With just make_live() the pattern would look as follows:
// 1. allocate a buffer and write the serialised objects to it
// 2. pass that buffer to make_live()
// 3. make_live() needs to prepend some metadata to the cell value so it
// allocates a new buffer and copies the content of the original one
//
// The allocation and copy of a buffer can be avoided.
// make_live_from_serializer() allows the user code to specify the timestamp
// and size of the cell value as well as provide the serialiser function
// object, which would write the serialised value of the cell to the buffer
// given to it by make_live_from_serializer().
template<typename Serializer>
GCC6_CONCEPT(requires requires(Serializer serializer, bytes::iterator it) {
serializer(it);
})
static managed_bytes make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
serializer(b.begin() + value_offset);
return b;
}
template<typename ByteContainer>
friend class atomic_cell_base;
friend class atomic_cell;
@@ -167,6 +219,9 @@ public:
bool is_revert_set() const {
return atomic_cell_type::is_revert_set(_data);
}
bool is_counter_in_place_revert_set() const {
return atomic_cell_type::is_counter_in_place_revert_set(_data);
}
bool is_live() const {
return atomic_cell_type::is_live(_data);
}
@@ -189,10 +244,17 @@ public:
api::timestamp_type timestamp() const {
return atomic_cell_type::timestamp(_data);
}
void set_timestamp(api::timestamp_type ts) {
atomic_cell_type::set_timestamp(_data, ts);
}
// Can be called on live cells only
bytes_view value() const {
auto value() const {
return atomic_cell_type::value(_data);
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return atomic_cell_type::counter_update_value(_data);
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? atomic_cell_type::deletion_time(_data) : expiry() - ttl();
@@ -215,6 +277,9 @@ public:
void set_revert(bool revert) {
atomic_cell_type::set_revert(_data, revert);
}
void set_counter_in_place_revert(bool flag) {
atomic_cell_type::set_counter_in_place_revert(_data, flag);
}
};
class atomic_cell_view final : public atomic_cell_base<bytes_view> {
@@ -226,6 +291,14 @@ public:
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
};
class atomic_cell_mutable_view final : public atomic_cell_base<bytes_mutable_view> {
atomic_cell_mutable_view(bytes_mutable_view data) : atomic_cell_base(std::move(data)) {}
public:
static atomic_cell_mutable_view from_bytes(bytes_mutable_view data) { return atomic_cell_mutable_view(data); }
friend class atomic_cell;
};
class atomic_cell_ref final : public atomic_cell_base<managed_bytes&> {
public:
atomic_cell_ref(managed_bytes& buf) : atomic_cell_base(buf) {}
@@ -254,12 +327,9 @@ public:
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value) {
return make_live(timestamp, bytes_view(value));
}
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, bytes_view value) {
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, const bytes& value) {
return atomic_cell_type::make_live_counter_update(timestamp, bytes_view(value));
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl)
{
@@ -277,6 +347,10 @@ public:
return atomic_cell_type::make_live(timestamp, value, gc_clock::now() + *ttl, *ttl);
}
}
template<typename Serializer>
static atomic_cell make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
return atomic_cell_type::make_live_from_serializer(timestamp, size, std::forward<Serializer>(serializer));
}
friend class atomic_cell_or_collection;
friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);
};
@@ -314,11 +388,6 @@ collection_mutation::operator collection_mutation_view() const {
return { data };
}
namespace db {
template<typename T>
class serializer;
}
class column_definition;
int compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);

View File

@@ -39,10 +39,14 @@ public:
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell() const { return atomic_cell_view::from_bytes(_data); }
atomic_cell_ref as_atomic_cell_ref() { return { _data }; }
atomic_cell_mutable_view as_mutable_atomic_cell() { return atomic_cell_mutable_view::from_bytes(_data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm.data)) {}
explicit operator bool() const {
return !_data.empty();
}
bool can_use_mutable_view() const {
return !_data.is_fragmented();
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) {
return std::move(data.data);
}

View File

@@ -61,7 +61,7 @@ const sstring auth::auth::USERS_CF("users");
static const sstring USER_NAME("name");
static const sstring SUPER("super");
static logging::logger logger("auth");
static logging::logger alogger("auth");
// TODO: configurable
using namespace std::chrono_literals;
@@ -114,7 +114,7 @@ struct hash<auth::authenticated_user> {
class auth::auth::permissions_cache {
public:
typedef utils::loading_cache<std::pair<authenticated_user, data_resource>, permission_set, utils::tuple_hash> cache_type;
typedef utils::loading_cache<std::pair<authenticated_user, data_resource>, permission_set, utils::loading_cache_reload_enabled::yes, utils::simple_entry_size<permission_set>, utils::tuple_hash> cache_type;
typedef typename cache_type::key_type key_type;
permissions_cache()
@@ -123,25 +123,14 @@ public:
}
permissions_cache(const db::config& cfg)
: _cache(cfg.permissions_cache_max_entries(), expiry(cfg),
std::chrono::milliseconds(
cfg.permissions_validity_in_ms()),
[](const key_type& k) {
logger.debug("Refreshing permissions for {}", k.first.name());
return authorizer::get().authorize(::make_shared<authenticated_user>(k.first), k.second);
}) {
}
static std::chrono::milliseconds expiry(const db::config& cfg) {
auto exp = cfg.permissions_update_interval_in_ms();
if (exp == 0 || exp == std::numeric_limits<uint32_t>::max()) {
exp = cfg.permissions_validity_in_ms();
}
return std::chrono::milliseconds(exp);
}
: _cache(cfg.permissions_cache_max_entries(), std::chrono::milliseconds(cfg.permissions_validity_in_ms()), std::chrono::milliseconds(cfg.permissions_update_interval_in_ms()), alogger,
[] (const key_type& k) {
alogger.debug("Refreshing permissions for {}", k.first.name());
return authorizer::get().authorize(::make_shared<authenticated_user>(k.first), k.second);
}) {}
future<> stop() {
return make_ready_future<>();
return _cache.stop();
}
future<permission_set> get(::shared_ptr<authenticated_user> user, data_resource resource) {
@@ -152,6 +141,15 @@ private:
cache_type _cache;
};
namespace std { // for ADL, yuch
std::ostream& operator<<(std::ostream& os, const std::pair<auth::authenticated_user, auth::data_resource>& p) {
os << "{user: " << p.first.name() << ", data_resource: " << p.second << "}";
return os;
}
}
static distributed<auth::auth::permissions_cache> perm_cache;
/**
@@ -178,7 +176,7 @@ struct waiter {
tmr.cancel();
done.set_exception(std::runtime_error("shutting down"));
}
logger.trace("Deleting scheduled task");
alogger.trace("Deleting scheduled task");
}
void kill() {
}
@@ -192,7 +190,7 @@ static std::vector<waiter_ptr> & thread_waiters() {
}
void auth::auth::schedule_when_up(scheduled_func f) {
logger.trace("Adding scheduled task");
alogger.trace("Adding scheduled task");
auto & waiters = thread_waiters();
@@ -208,7 +206,7 @@ void auth::auth::schedule_when_up(scheduled_func f) {
waiters.erase(i);
}
}).then([f = std::move(f)] {
logger.trace("Running scheduled task");
alogger.trace("Running scheduled task");
return f();
}).handle_exception([](auto ep) {
return make_ready_future();
@@ -246,7 +244,8 @@ future<> auth::auth::setup() {
std::map<sstring, sstring> opts;
opts["replication_factor"] = "1";
auto ksm = keyspace_metadata::new_keyspace(AUTH_KS, "org.apache.cassandra.locator.SimpleStrategy", opts, true);
f = service::get_local_migration_manager().announce_new_keyspace(ksm, false);
// We use min_timestamp so that default keyspace metadata will lose to any manual adjustments. See issue #2129.
f = service::get_local_migration_manager().announce_new_keyspace(ksm, api::min_timestamp, false);
}
return f.then([] {
@@ -267,12 +266,12 @@ future<> auth::auth::setup() {
auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
AUTH_KS, USERS_CF, USER_NAME, SUPER);
cql3::get_local_query_processor().process(query, db::consistency_level::ONE, {DEFAULT_SUPERUSER_NAME, true}).then([](auto) {
logger.info("Created default superuser '{}'", DEFAULT_SUPERUSER_NAME);
alogger.info("Created default superuser '{}'", DEFAULT_SUPERUSER_NAME);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception&) {
logger.warn("Skipped default superuser setup: some nodes were not ready");
alogger.warn("Skipped default superuser setup: some nodes were not ready");
}
});
}
@@ -330,14 +329,13 @@ future<bool> auth::auth::is_super_user(const sstring& username) {
});
}
future<> auth::auth::insert_user(const sstring& username, bool is_super)
throw (exceptions::request_execution_exception) {
future<> auth::auth::insert_user(const sstring& username, bool is_super) {
return cql3::get_local_query_processor().process(sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
AUTH_KS, USERS_CF, USER_NAME, SUPER),
consistency_for_user(username), { username, is_super }).discard_result();
}
future<> auth::auth::delete_user(const sstring& username) throw(exceptions::request_execution_exception) {
future<> auth::auth::delete_user(const sstring& username) {
return cql3::get_local_query_processor().process(sprint("DELETE FROM %s.%s WHERE %s = ?",
AUTH_KS, USERS_CF, USER_NAME),
consistency_for_user(username), { username }).discard_result();

View File

@@ -50,11 +50,10 @@
#include "exceptions/exceptions.hh"
#include "permission.hh"
#include "data_resource.hh"
#include "authenticated_user.hh"
namespace auth {
class authenticated_user;
class auth {
public:
class permissions_cache;
@@ -91,7 +90,7 @@ public:
* @param isSuper User's new status.
* @throws RequestExecutionException
*/
static future<> insert_user(const sstring& username, bool is_super) throw(exceptions::request_execution_exception);
static future<> insert_user(const sstring& username, bool is_super);
/**
* Deletes the user from AUTH_KS.USERS_CF.
@@ -99,7 +98,7 @@ public:
* @param username Username to delete.
* @throws RequestExecutionException
*/
static future<> delete_user(const sstring& username) throw(exceptions::request_execution_exception);
static future<> delete_user(const sstring& username);
/**
* Sets up Authenticator and Authorizer.
@@ -122,3 +121,5 @@ public:
static void schedule_when_up(scheduled_func);
};
}
std::ostream& operator<<(std::ostream& os, const std::pair<auth::authenticated_user, auth::data_resource>& p);

View File

@@ -43,6 +43,7 @@
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include "seastarx.hh"
namespace auth {

View File

@@ -72,7 +72,7 @@ sstring auth::authenticator::option_to_string(option opt) {
static std::unique_ptr<auth::authenticator> global_authenticator;
future<>
auth::authenticator::setup(const sstring& type) throw (exceptions::configuration_exception) {
auth::authenticator::setup(const sstring& type) {
if (auth::auth::is_class_type(type, ALLOW_ALL_AUTHENTICATOR_NAME)) {
class allow_all_authenticator : public authenticator {
public:
@@ -88,16 +88,16 @@ auth::authenticator::setup(const sstring& type) throw (exceptions::configuration
option_set alterable_options() const override {
return option_set();
}
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) override {
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override {
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
}
future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {
future<> create(sstring username, const option_map& options) override {
return make_ready_future();
}
future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {
future<> alter(sstring username, const option_map& options) override {
return make_ready_future();
}
future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {
future<> drop(sstring username) override {
return make_ready_future();
}
const resource_ids& protected_resources() const override {

View File

@@ -92,7 +92,7 @@ public:
* For example, use this method to create any required keyspaces/column families.
* Note: Only call from main thread.
*/
static future<> setup(const sstring& type) throw(exceptions::configuration_exception);
static future<> setup(const sstring& type);
/**
* Returns the system authenticator. Must have called setup before calling this.
@@ -129,7 +129,7 @@ public:
*
* @throws authentication_exception if credentials don't match any known user.
*/
virtual future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) = 0;
virtual future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const = 0;
/**
* Called during execution of CREATE USER query (also may be called on startup, see seedSuperuserOptions method).
@@ -141,7 +141,7 @@ public:
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;
virtual future<> create(sstring username, const option_map& options) = 0;
/**
* Called during execution of ALTER USER query.
@@ -154,7 +154,7 @@ public:
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;
virtual future<> alter(sstring username, const option_map& options) = 0;
/**
@@ -164,7 +164,7 @@ public:
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;
virtual future<> drop(sstring username) = 0;
/**
* Set of resources that should be made inaccessible to users and only accessible internally.
@@ -177,9 +177,9 @@ public:
class sasl_challenge {
public:
virtual ~sasl_challenge() {}
virtual bytes evaluate_response(bytes_view client_response) throw(exceptions::authentication_exception) = 0;
virtual bytes evaluate_response(bytes_view client_response) = 0;
virtual bool is_complete() const = 0;
virtual future<::shared_ptr<authenticated_user>> get_authenticated_user() const throw(exceptions::authentication_exception) = 0;
virtual future<::shared_ptr<authenticated_user>> get_authenticated_user() const = 0;
};
/**

View File

@@ -51,6 +51,8 @@
#include "permission.hh"
#include "data_resource.hh"
#include "seastarx.hh"
namespace auth {
class authenticated_user;

View File

@@ -115,16 +115,14 @@ auth::data_resource auth::data_resource::get_parent() const {
}
}
const sstring& auth::data_resource::keyspace() const
throw (std::invalid_argument) {
const sstring& auth::data_resource::keyspace() const {
if (is_root_level()) {
throw std::invalid_argument("ROOT data resource has no keyspace");
}
return _ks;
}
const sstring& auth::data_resource::column_family() const
throw (std::invalid_argument) {
const sstring& auth::data_resource::column_family() const {
if (!is_column_family_level()) {
throw std::invalid_argument(sprint("%s data resource has no column family", name()));
}

View File

@@ -45,6 +45,7 @@
#include <iosfwd>
#include <set>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace auth {
@@ -117,13 +118,13 @@ public:
* @return keyspace of the resource.
* @throws std::invalid_argument if it's the root-level resource.
*/
const sstring& keyspace() const throw(std::invalid_argument);
const sstring& keyspace() const;
/**
* @return column family of the resource.
* @throws std::invalid_argument if it's not a cf-level resource.
*/
const sstring& column_family() const throw(std::invalid_argument);
const sstring& column_family() const;
/**
* @return Whether or not the resource has a parent in the hierarchy.

View File

@@ -62,7 +62,7 @@ static const sstring RESOURCE_NAME = "resource";
static const sstring PERMISSIONS_NAME = "permissions";
static const sstring PERMISSIONS_CF = "permissions";
static logging::logger logger("default_authorizer");
static logging::logger alogger("default_authorizer");
auth::default_authorizer::default_authorizer() {
}
@@ -107,7 +107,7 @@ future<auth::permission_set> auth::default_authorizer::authorize(
}
return make_ready_future<permission_set>(permissions::from_strings(res->one().get_set<sstring>(PERMISSIONS_NAME)));
} catch (exceptions::request_execution_exception& e) {
logger.warn("CassandraAuthorizer failed to authorize {} for {}", user->name(), resource);
alogger.warn("CassandraAuthorizer failed to authorize {} for {}", user->name(), resource);
return make_ready_future<permission_set>(permissions::NONE);
}
});
@@ -196,7 +196,7 @@ future<> auth::default_authorizer::revoke_all(sstring dropped_user) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
logger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", dropped_user, e);
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", dropped_user, e);
}
});
}
@@ -217,13 +217,13 @@ future<> auth::default_authorizer::revoke_all(data_resource resource) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
logger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
}
});
});
} catch (exceptions::request_execution_exception& e) {
logger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
return make_ready_future();
}
});

View File

@@ -61,7 +61,7 @@ static const sstring DEFAULT_USER_NAME = auth::auth::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = auth::auth::DEFAULT_SUPERUSER_NAME;
static const sstring CREDENTIALS_CF = "credentials";
static logging::logger logger("password_authenticator");
static logging::logger plogger("password_authenticator");
auth::password_authenticator::~password_authenticator()
{}
@@ -169,7 +169,7 @@ future<> auth::password_authenticator::init() {
USER_NAME, SALTED_HASH
),
db::consistency_level::ONE, {DEFAULT_USER_NAME, hashpw(DEFAULT_USER_PASSWORD)}).then([](auto) {
logger.info("Created default user '{}'", DEFAULT_USER_NAME);
plogger.info("Created default user '{}'", DEFAULT_USER_NAME);
});
}
});
@@ -201,8 +201,7 @@ auth::authenticator::option_set auth::password_authenticator::alterable_options(
}
future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::authenticate(
const credentials_map& credentials) const
throw (exceptions::authentication_exception) {
const credentials_map& credentials) const {
if (!credentials.count(USERNAME_KEY)) {
throw exceptions::authentication_exception(sprint("Required key '%s' is missing", USERNAME_KEY));
}
@@ -241,9 +240,7 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
}
future<> auth::password_authenticator::create(sstring username,
const option_map& options)
throw (exceptions::request_validation_exception,
exceptions::request_execution_exception) {
const option_map& options) {
try {
auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));
auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
@@ -256,9 +253,7 @@ future<> auth::password_authenticator::create(sstring username,
}
future<> auth::password_authenticator::alter(sstring username,
const option_map& options)
throw (exceptions::request_validation_exception,
exceptions::request_execution_exception) {
const option_map& options) {
try {
auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));
auto query = sprint("UPDATE %s.%s SET %s = ? WHERE %s = ?",
@@ -270,9 +265,7 @@ future<> auth::password_authenticator::alter(sstring username,
}
}
future<> auth::password_authenticator::drop(sstring username)
throw (exceptions::request_validation_exception,
exceptions::request_execution_exception) {
future<> auth::password_authenticator::drop(sstring username) {
try {
auto query = sprint("DELETE FROM %s.%s WHERE %s = ?",
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME);
@@ -308,9 +301,8 @@ const auth::resource_ids& auth::password_authenticator::protected_resources() co
* would expect
* @throws javax.security.sasl.SaslException
*/
bytes evaluate_response(bytes_view client_response)
throw (exceptions::authentication_exception) override {
logger.debug("Decoding credentials from client token");
bytes evaluate_response(bytes_view client_response) override {
plogger.debug("Decoding credentials from client token");
sstring username, password;
@@ -347,8 +339,7 @@ const auth::resource_ids& auth::password_authenticator::protected_resources() co
bool is_complete() const override {
return _complete;
}
future<::shared_ptr<authenticated_user>> get_authenticated_user() const
throw (exceptions::authentication_exception) override {
future<::shared_ptr<authenticated_user>> get_authenticated_user() const override {
return _authenticator.authenticate(_credentials);
}
private:

View File

@@ -58,10 +58,10 @@ public:
bool require_authentication() const override;
option_set supported_options() const override;
option_set alterable_options() const override;
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) override;
future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;
future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;
future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override;
future<> create(sstring username, const option_map& options) override;
future<> alter(sstring username, const option_map& options) override;
future<> drop(sstring username) override;
const resource_ids& protected_resources() const override;
::shared_ptr<sasl_challenge> new_sasl_challenge() const override;

View File

@@ -44,6 +44,7 @@
#include <unordered_set>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include "enum_set.hh"
namespace auth {

View File

@@ -21,14 +21,17 @@
#pragma once
#include "seastarx.hh"
#include "core/sstring.hh"
#include "hashing.hh"
#include <experimental/optional>
#include <iosfwd>
#include <functional>
#include "utils/mutable_view.hh"
using bytes = basic_sstring<int8_t, uint32_t, 31>;
using bytes_view = std::experimental::basic_string_view<int8_t>;
using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
using bytes_opt = std::experimental::optional<bytes>;
using sstring_view = std::experimental::string_view;

538
cache_streamed_mutation.hh Normal file
View File

@@ -0,0 +1,538 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <vector>
#include "row_cache.hh"
#include "mutation_reader.hh"
#include "streamed_mutation.hh"
#include "partition_version.hh"
#include "utils/logalloc.hh"
#include "query-request.hh"
#include "partition_snapshot_reader.hh"
#include "partition_snapshot_row_cursor.hh"
#include "read_context.hh"
namespace cache {
class lsa_manager {
row_cache& _cache;
public:
lsa_manager(row_cache& cache) : _cache(cache) { }
template<typename Func>
decltype(auto) run_in_read_section(const Func& func) {
return _cache._read_section(_cache._tracker.region(), [&func] () {
return with_linearized_managed_bytes([&func] () {
return func();
});
});
}
template<typename Func>
decltype(auto) run_in_update_section(const Func& func) {
return _cache._update_section(_cache._tracker.region(), [&func] () {
return with_linearized_managed_bytes([&func] () {
return func();
});
});
}
template<typename Func>
void run_in_update_section_with_allocator(Func&& func) {
return _cache._update_section(_cache._tracker.region(), [this, &func] () {
return with_linearized_managed_bytes([this, &func] () {
return with_allocator(_cache._tracker.region().allocator(), [this, &func] () mutable {
return func();
});
});
});
}
logalloc::region& region() { return _cache._tracker.region(); }
logalloc::allocating_section& read_section() { return _cache._read_section; }
};
class cache_streamed_mutation final : public streamed_mutation::impl {
enum class state {
before_static_row,
// Invariants:
// - position_range(_lower_bound, _upper_bound) covers all not yet emitted positions from current range
// - _next_row points to the nearest row in cache >= _lower_bound
// - _next_row_in_range = _next.position() < _upper_bound
reading_from_cache,
// Starts reading from underlying reader.
// The range to read is position_range(_lower_bound, min(_next_row.position(), _upper_bound)).
// Invariants:
// - _next_row_in_range = _next.position() < _upper_bound
move_to_underlying,
// Invariants:
// - Upper bound of the read is min(_next_row.position(), _upper_bound)
// - _next_row_in_range = _next.position() < _upper_bound
// - _last_row_key contains the key of last emitted clustering_row
reading_from_underlying,
end_of_stream
};
lw_shared_ptr<partition_snapshot> _snp;
position_in_partition::tri_compare _position_cmp;
query::clustering_key_filter_ranges _ck_ranges;
query::clustering_row_ranges::const_iterator _ck_ranges_curr;
query::clustering_row_ranges::const_iterator _ck_ranges_end;
lsa_manager _lsa_manager;
stdx::optional<clustering_key> _last_row_key;
// We must be prepared to receive overlapping and out-of-order
// range tombstones. Since fragments have to be emitted with strictly monotonic
// positions, we can't just trim such tombstones to the position of the last fragment.
// To solve that, range tombstones are first accumulated in a range_tombstone_stream
// and emitted once we have a fragment with a larger position.
range_tombstone_stream _tombstones;
// Holds the lower bound of a position range which hasn't been processed yet.
// Only fragments with positions < _lower_bound have been emitted.
position_in_partition _lower_bound;
position_in_partition_view _upper_bound;
state _state = state::before_static_row;
lw_shared_ptr<read_context> _read_context;
partition_snapshot_row_cursor _next_row;
bool _next_row_in_range = false;
future<> do_fill_buffer();
void copy_from_cache_to_buffer();
future<> process_static_row();
void move_to_end();
void move_to_next_range();
void move_to_current_range();
void move_to_next_entry();
// Emits all delayed range tombstones with positions smaller than upper_bound.
void drain_tombstones(position_in_partition_view upper_bound);
// Emits all delayed range tombstones.
void drain_tombstones();
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
void add_to_buffer(mutation_fragment&&);
future<> read_from_underlying();
future<> start_reading_from_underlying();
bool after_current_range(position_in_partition_view position);
bool can_populate() const;
void maybe_update_continuity();
void maybe_add_to_cache(const mutation_fragment& mf);
void maybe_add_to_cache(const clustering_row& cr);
void maybe_add_to_cache(const range_tombstone& rt);
void maybe_add_to_cache(const static_row& sr);
void maybe_set_static_row_continuous();
public:
cache_streamed_mutation(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges&& crr,
lw_shared_ptr<read_context> ctx,
lw_shared_ptr<partition_snapshot> snp,
row_cache& cache)
: streamed_mutation::impl(std::move(s), dk, snp->partition_tombstone())
, _snp(std::move(snp))
, _position_cmp(*_schema)
, _ck_ranges(std::move(crr))
, _ck_ranges_curr(_ck_ranges.begin())
, _ck_ranges_end(_ck_ranges.end())
, _lsa_manager(cache)
, _tombstones(*_schema)
, _lower_bound(position_in_partition::before_all_clustered_rows())
, _upper_bound(position_in_partition_view::before_all_clustered_rows())
, _read_context(std::move(ctx))
, _next_row(*_schema, cache._tracker.region(), *_snp)
{ }
cache_streamed_mutation(const cache_streamed_mutation&) = delete;
cache_streamed_mutation(cache_streamed_mutation&&) = delete;
virtual future<> fill_buffer() override;
virtual ~cache_streamed_mutation() {
maybe_merge_versions(_snp, _lsa_manager.region(), _lsa_manager.read_section());
}
};
inline
future<> cache_streamed_mutation::process_static_row() {
if (_snp->version()->partition().static_row_continuous()) {
_read_context->cache().on_row_hit();
row sr = _lsa_manager.run_in_read_section([this] {
return _snp->static_row();
});
if (!sr.empty()) {
push_mutation_fragment(mutation_fragment(static_row(std::move(sr))));
}
return make_ready_future<>();
} else {
_read_context->cache().on_row_miss();
return _read_context->get_next_fragment().then([this] (mutation_fragment_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
push_mutation_fragment(std::move(*sr));
}
maybe_set_static_row_continuous();
});
}
}
inline
future<> cache_streamed_mutation::fill_buffer() {
if (_state == state::before_static_row) {
auto after_static_row = [this] {
if (_ck_ranges_curr == _ck_ranges_end) {
_end_of_stream = true;
_state = state::end_of_stream;
return make_ready_future<>();
}
_state = state::reading_from_cache;
_lsa_manager.run_in_read_section([this] {
move_to_current_range();
});
return fill_buffer();
};
if (_schema->has_static_columns()) {
return process_static_row().then(std::move(after_static_row));
} else {
return after_static_row();
}
}
return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this] {
return do_fill_buffer();
});
}
inline
future<> cache_streamed_mutation::do_fill_buffer() {
if (_state == state::move_to_underlying) {
_state = state::reading_from_underlying;
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}).then([this] {
return read_from_underlying();
});
}
if (_state == state::reading_from_underlying) {
return read_from_underlying();
}
// assert(_state == state::reading_from_cache)
return _lsa_manager.run_in_read_section([this] {
auto same_pos = _next_row.maybe_refresh();
// FIXME: If continuity changed anywhere between _lower_bound and _next_row.position()
// we need to redo the lookup with _lower_bound. There is no eviction yet, so not yet a problem.
assert(same_pos);
while (!is_buffer_full() && _state == state::reading_from_cache) {
copy_from_cache_to_buffer();
if (need_preempt()) {
break;
}
}
return make_ready_future<>();
});
}
inline
future<> cache_streamed_mutation::read_from_underlying() {
return consume_mutation_fragments_until(_read_context->get_streamed_mutation(),
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
_read_context->cache().on_row_miss();
maybe_add_to_cache(mf);
add_to_buffer(std::move(mf));
},
[this] {
_state = state::reading_from_cache;
_lsa_manager.run_in_update_section([this] {
auto same_pos = _next_row.maybe_refresh();
assert(same_pos); // FIXME: handle eviction
if (_next_row_in_range) {
maybe_update_continuity();
add_to_buffer(_next_row);
move_to_next_entry();
} else {
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
} else {
// FIXME: Insert dummy entry at _upper_bound.
_read_context->cache().on_mispopulate();
}
move_to_next_range();
}
});
return make_ready_future<>();
});
}
inline
void cache_streamed_mutation::maybe_update_continuity() {
if (can_populate() && _next_row.is_in_latest_version()) {
if (_last_row_key) {
if (_next_row.previous_row_in_latest_version_has_key(*_last_row_key)) {
_next_row.set_continuous(true);
}
} else if (!_ck_ranges_curr->start()) {
_next_row.set_continuous(true);
}
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const mutation_fragment& mf) {
if (mf.is_range_tombstone()) {
maybe_add_to_cache(mf.as_range_tombstone());
} else {
assert(mf.is_clustering_row());
const clustering_row& cr = mf.as_clustering_row();
maybe_add_to_cache(cr);
}
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const clustering_row& cr) {
if (!can_populate()) {
_read_context->cache().on_mispopulate();
return;
}
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
mutation_partition& mp = _snp->version()->partition();
rows_entry::compare less(*_schema);
// FIXME: If _next_row is up to date, but latest version doesn't have iterator in
// current row (could be far away, so we'd do this often), then this will do
// the lookup in mp. This is not necessary, because _next_row has iterators for
// next rows in each version, even if they're not part of the current row.
// They're currently buried in the heap, but you could keep a vector of
// iterators per each version in addition to the heap.
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));
new_entry->set_continuous(false);
auto it = _next_row.has_valid_row_from_latest_version()
? _next_row.get_iterator_in_latest_version() : mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
if (insert_result.second) {
_read_context->cache().on_row_insert();
new_entry.release();
}
it = insert_result.first;
rows_entry& e = *it;
if (_last_row_key) {
if (it == mp.clustered_rows().begin()) {
// FIXME: check whether entry for _last_row_key is in older versions and if so set
// continuity to true.
_read_context->cache().on_mispopulate();
} else {
auto prev_it = it;
--prev_it;
clustering_key_prefix::equality eq(*_schema);
if (eq(*_last_row_key, prev_it->key())) {
e.set_continuous(true);
}
}
} else if (!_ck_ranges_curr->start()) {
e.set_continuous(true);
} else {
// FIXME: Insert dummy entry at _ck_ranges_curr->start()
_read_context->cache().on_mispopulate();
}
});
}
inline
bool cache_streamed_mutation::after_current_range(position_in_partition_view p) {
return _position_cmp(p, _upper_bound) >= 0;
}
inline
future<> cache_streamed_mutation::start_reading_from_underlying() {
_state = state::move_to_underlying;
return make_ready_future<>();
}
inline
void cache_streamed_mutation::copy_from_cache_to_buffer() {
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto&& rts : _snp->range_tombstones(*_schema, _lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
add_to_buffer(std::move(rts));
if (is_buffer_full()) {
return;
}
}
if (_next_row_in_range) {
add_to_buffer(_next_row);
move_to_next_entry();
} else {
move_to_next_range();
}
}
inline
void cache_streamed_mutation::move_to_end() {
drain_tombstones();
_end_of_stream = true;
_state = state::end_of_stream;
}
inline
void cache_streamed_mutation::move_to_next_range() {
++_ck_ranges_curr;
if (_ck_ranges_curr == _ck_ranges_end) {
move_to_end();
} else {
move_to_current_range();
}
}
inline
void cache_streamed_mutation::move_to_current_range() {
_last_row_key = std::experimental::nullopt;
_lower_bound = position_in_partition::for_range_start(*_ck_ranges_curr);
_upper_bound = position_in_partition_view::for_range_end(*_ck_ranges_curr);
auto complete_until_next = _next_row.advance_to(_lower_bound) || _next_row.continuous();
_next_row_in_range = !after_current_range(_next_row.position());
if (!complete_until_next) {
start_reading_from_underlying();
}
}
// _next_row must be inside the range.
inline
void cache_streamed_mutation::move_to_next_entry() {
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
} else {
if (!_next_row.next()) {
move_to_end();
return;
}
_next_row_in_range = !after_current_range(_next_row.position());
if (!_next_row.continuous()) {
start_reading_from_underlying();
}
}
}
inline
void cache_streamed_mutation::drain_tombstones(position_in_partition_view pos) {
while (auto mfo = _tombstones.get_next(pos)) {
push_mutation_fragment(std::move(*mfo));
}
}
inline
void cache_streamed_mutation::drain_tombstones() {
while (auto mfo = _tombstones.get_next()) {
push_mutation_fragment(std::move(*mfo));
}
}
inline
void cache_streamed_mutation::add_to_buffer(mutation_fragment&& mf) {
if (mf.is_clustering_row()) {
add_clustering_row_to_buffer(std::move(mf));
} else {
assert(mf.is_range_tombstone());
add_to_buffer(std::move(mf).as_range_tombstone());
}
}
inline
void cache_streamed_mutation::add_to_buffer(const partition_snapshot_row_cursor& row) {
if (!row.dummy()) {
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(row.row());
}
}
inline
void cache_streamed_mutation::add_clustering_row_to_buffer(mutation_fragment&& mf) {
auto& row = mf.as_clustering_row();
drain_tombstones(row.position());
_last_row_key = row.key();
_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(mf));
}
inline
void cache_streamed_mutation::add_to_buffer(range_tombstone&& rt) {
// This guarantees that rt starts after any emitted clustering_row
if (!rt.trim_front(*_schema, _lower_bound)) {
return;
}
_lower_bound = position_in_partition(rt.position());
_tombstones.apply(std::move(rt));
drain_tombstones(_lower_bound);
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const range_tombstone& rt) {
if (can_populate()) {
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().row_tombstones().apply_monotonically(*_schema, rt);
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_streamed_mutation::maybe_add_to_cache(const static_row& sr) {
if (can_populate()) {
_read_context->cache().on_row_insert();
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().static_row().apply(*_schema, column_kind::static_column, sr.cells());
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_streamed_mutation::maybe_set_static_row_continuous() {
if (can_populate()) {
_snp->version()->partition().set_static_row_continuous(true);
} else {
_read_context->cache().on_mispopulate();
}
}
inline
bool cache_streamed_mutation::can_populate() const {
return _snp->at_latest_version() && _read_context->cache().phase_of(_read_context->key()) == _read_context->phase();
}
} // namespace cache
inline streamed_mutation make_cache_streamed_mutation(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges crr,
row_cache& cache,
lw_shared_ptr<cache::read_context> ctx,
lw_shared_ptr<partition_snapshot> snp)
{
return make_streamed_mutation<cache::cache_streamed_mutation>(
std::move(s), std::move(dk), std::move(crr), std::move(ctx), std::move(snp), cache);
}
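The `_tombstones` buffering described in the class comments above — overlapping, out-of-order range tombstones held back and only released once a fragment with a larger position arrives — can be sketched in miniature. Everything here is an illustrative stand-in: `tombstone_stream`, integer positions, and the `[start, end)` model are hypothetical simplifications of `range_tombstone_stream` and `position_in_partition`, not the real types:

```cpp
#include <queue>
#include <vector>

// Simplified model of out-of-order tombstone buffering: tombstones are
// accumulated in a min-heap keyed by start position, and drained only up to
// a given upper bound, so emitted start positions are always monotonic.
struct tombstone {
    int start;
    int end; // half-open [start, end)
};

class tombstone_stream {
    struct by_start {
        bool operator()(const tombstone& a, const tombstone& b) const {
            return a.start > b.start; // min-heap on start
        }
    };
    std::priority_queue<tombstone, std::vector<tombstone>, by_start> _q;
public:
    // Analogue of range_tombstone_stream::apply(): accept in any order.
    void apply(tombstone t) { _q.push(t); }

    // Analogue of drain_tombstones(pos): emit everything starting before
    // upper_bound, in position order.
    std::vector<tombstone> get_next(int upper_bound) {
        std::vector<tombstone> out;
        while (!_q.empty() && _q.top().start < upper_bound) {
            out.push_back(_q.top());
            _q.pop();
        }
        return out;
    }
};
```

Draining before pushing each clustering row (as `add_clustering_row_to_buffer()` does via `drain_tombstones(row.position())`) is what keeps the merged output stream monotonic.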


@@ -24,6 +24,7 @@
#include <boost/lexical_cast.hpp>
#include "exceptions/exceptions.hh"
#include "json.hh"
#include "seastarx.hh"
class schema;
@@ -58,30 +59,34 @@ class caching_options {
caching_options() : _key_cache(default_key), _row_cache(default_row) {}
public:
sstring to_sstring() const {
return json::to_json(std::map<sstring, sstring>({{ "keys", _key_cache }, { "rows_per_partition", _row_cache }}));
std::map<sstring, sstring> to_map() const {
return {{ "keys", _key_cache }, { "rows_per_partition", _row_cache }};
}
static caching_options from_sstring(const sstring& str) {
auto map = json::to_map(str);
if (map.size() > 2) {
throw exceptions::configuration_exception("Invalid map: " + str);
}
sstring k;
sstring r;
if (map.count("keys")) {
k = map.at("keys");
} else {
k = default_key;
}
sstring to_sstring() const {
return json::to_json(to_map());
}
if (map.count("rows_per_partition")) {
r = map.at("rows_per_partition");
} else {
r = default_row;
template<typename Map>
static caching_options from_map(const Map & map) {
sstring k = default_key;
sstring r = default_row;
for (auto& p : map) {
if (p.first == "keys") {
k = p.second;
} else if (p.first == "rows_per_partition") {
r = p.second;
} else {
throw exceptions::configuration_exception("Invalid caching option: " + p.first);
}
}
return caching_options(k, r);
}
static caching_options from_sstring(const sstring& str) {
return from_map(json::to_map(str));
}
bool operator==(const caching_options& other) const {
return _key_cache == other._key_cache && _row_cache == other._row_cache;
}
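The refactored `from_map()` above changes the parsing contract: instead of only checking the map's size, it now rejects any key other than `keys` and `rows_per_partition`, while missing keys fall back to defaults. A minimal standalone sketch of those rules — `caching_opts`, `parse_caching`, and the placeholder default strings are all hypothetical names, not the real `caching_options` API or its actual default values:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Illustrative restatement of the from_map() rules: defaults for absent keys,
// configuration error for unknown keys.
struct caching_opts {
    std::string keys = "ALL";               // placeholder default
    std::string rows_per_partition = "NONE"; // placeholder default
};

inline caching_opts parse_caching(const std::map<std::string, std::string>& m) {
    caching_opts o;
    for (auto& p : m) {
        if (p.first == "keys") {
            o.keys = p.second;
        } else if (p.first == "rows_per_partition") {
            o.rows_per_partition = p.second;
        } else {
            // Mirrors the configuration_exception thrown in the diff above.
            throw std::invalid_argument("Invalid caching option: " + p.first);
        }
    }
    return o;
}
```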


@@ -22,13 +22,28 @@
#pragma once
#include <boost/intrusive/unordered_set.hpp>
#if __has_include(<boost/container/small_vector.hpp>)
#include <boost/container/small_vector.hpp>
template <typename T, size_t N>
using small_vector = boost::container::small_vector<T, N>;
#else
#include <vector>
template <typename T, size_t N>
using small_vector = std::vector<T>;
#endif
#include "fnv1a_hasher.hh"
#include "streamed_mutation.hh"
#include "mutation_partition.hh"
class cells_range {
using ids_vector_type = boost::container::small_vector<column_id, 5>;
using ids_vector_type = small_vector<column_id, 5>;
position_in_partition_view _position;
ids_vector_type _ids;
@@ -121,7 +136,17 @@ public:
class locked_cell;
struct cell_locker_stats {
uint64_t lock_acquisitions = 0;
uint64_t operations_waiting_for_lock = 0;
};
class cell_locker {
public:
using timeout_clock = lowres_clock;
private:
using semaphore_type = basic_semaphore<default_timeout_exception_factory, timeout_clock>;
class partition_entry;
struct cell_address {
@@ -133,7 +158,7 @@ class cell_locker {
public enable_lw_shared_from_this<cell_entry> {
partition_entry& _parent;
cell_address _address;
semaphore _semaphore { 0 };
semaphore_type _semaphore { 0 };
friend class cell_locker;
public:
@@ -147,7 +172,7 @@ class cell_locker {
// temporarily removed from its parent partition_entry.
// Returns true if the cell_entry still exist in the new schema and
// should be reinserted.
bool upgrade(const schema& from, const schema& to, column_kind kind) {
bool upgrade(const schema& from, const schema& to, column_kind kind) noexcept {
auto& old_column_mapping = from.get_column_mapping();
auto& column = old_column_mapping.column_at(kind, _address.id);
auto cdef = to.get_column_definition(column.name());
@@ -162,15 +187,17 @@ class cell_locker {
return _address.position;
}
future<> lock() {
return _semaphore.wait();
future<> lock(timeout_clock::time_point _timeout) {
return _semaphore.wait(_timeout);
}
void unlock() {
_semaphore.signal();
}
~cell_entry() {
assert(is_linked());
if (!is_linked()) {
return;
}
unlink();
if (!--_parent._cell_count) {
delete &_parent;
@@ -219,14 +246,14 @@ class cell_locker {
bi::hash<cell_entry::hasher>,
bi::constant_time_size<false>>;
static constexpr size_t initial_bucket_count = 64;
static constexpr size_t initial_bucket_count = 16;
using max_load_factor = std::ratio<3, 4>;
dht::decorated_key _key;
cell_locker& _parent;
size_t _rehash_at_size = compute_rehash_at_size(initial_bucket_count);
std::unique_ptr<cells_type::bucket_type[]> _buckets; // TODO: start with internal storage?
size_t _cell_count = 0; // cells_type::empty() is not O(1) if the hook is auto-unlink
cells_type::bucket_type _internal_buckets[initial_bucket_count];
cells_type _cells;
schema_ptr _schema;
@@ -250,8 +277,7 @@ class cell_locker {
partition_entry(schema_ptr s, cell_locker& parent, const dht::decorated_key& dk)
: _key(dk)
, _parent(parent)
, _buckets(std::make_unique<cells_type::bucket_type[]>(initial_bucket_count))
, _cells(cells_type::bucket_traits(_buckets.get(), initial_bucket_count),
, _cells(cells_type::bucket_traits(_internal_buckets, initial_bucket_count),
cell_entry::hasher(*s), cell_entry::equal_compare(*s))
, _schema(s)
{ }
@@ -286,10 +312,9 @@ class cell_locker {
};
class equal_compare {
schema_ptr _schema;
dht::decorated_key_equals_comparator _cmp;
public:
explicit equal_compare(const schema s) : _cmp(s) { }
explicit equal_compare(const schema& s) : _cmp(s) { }
bool operator()(const dht::decorated_key& dk, const partition_entry& pe) {
return _cmp(dk, pe._key);
}
@@ -319,6 +344,7 @@ class cell_locker {
// partitions_type uses equality comparator which keeps a reference to the
// original schema, we must ensure that it doesn't die.
schema_ptr _original_schema;
cell_locker_stats& _stats;
friend class locked_cell;
private:
@@ -339,12 +365,13 @@ private:
}
}
public:
explicit cell_locker(schema_ptr s)
explicit cell_locker(schema_ptr s, cell_locker_stats& stats)
: _buckets(std::make_unique<partitions_type::bucket_type[]>(initial_bucket_count))
, _partitions(partitions_type::bucket_traits(_buckets.get(), initial_bucket_count),
partition_entry::hasher(), partition_entry::equal_compare(*s))
, _schema(s)
, _original_schema(std::move(s))
, _stats(stats)
{ }
~cell_locker() {
@@ -359,7 +386,8 @@ public:
}
// partition_cells_range is required to be in cell_locker::schema()
future<std::vector<locked_cell>> lock_cells(const dht::decorated_key& dk, partition_cells_range&& range);
future<std::vector<locked_cell>> lock_cells(const dht::decorated_key& dk, partition_cells_range&& range,
timeout_clock::time_point timeout);
};
@@ -386,46 +414,51 @@ struct cell_locker::locker {
partition_cells_range _range;
partition_cells_range::iterator _current_ck;
cells_range _cells_range;
cells_range::const_iterator _current_cell;
timeout_clock::time_point _timeout;
std::vector<locked_cell> _locks;
cell_locker_stats& _stats;
private:
void update_ck() {
if (!is_done()) {
_cells_range = *_current_ck;
_current_cell = _cells_range.begin();
_current_cell = _current_ck->begin();
}
}
future<> lock_next();
bool is_done() const { return _current_ck == _range.end(); }
std::vector<locked_cell> get() && { return std::move(_locks); }
public:
explicit locker(const ::schema& s, partition_entry& pe, partition_cells_range&& range)
explicit locker(const ::schema& s, cell_locker_stats& st, partition_entry& pe, partition_cells_range&& range, timeout_clock::time_point timeout)
: _hasher(s)
, _eq_cmp(s)
, _partition_entry(pe)
, _range(std::move(range))
, _current_ck(_range.begin())
, _timeout(timeout)
, _stats(st)
{
update_ck();
}
future<std::vector<locked_cell>> lock_all() && {
locker(const locker&) = delete;
locker(locker&&) = delete;
future<> lock_all() {
// Cannot defer before first call to lock_next().
return lock_next().then([this] {
return do_until([this] { return is_done(); }, [this] {
return lock_next();
}).then([&] {
return std::move(*this).get();
});
});
}
std::vector<locked_cell> get() && { return std::move(_locks); }
};
future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range) {
inline
future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range, timeout_clock::time_point timeout) {
partition_entry::hasher pe_hash;
partition_entry::equal_compare pe_eq(*_schema);
@@ -447,6 +480,7 @@ future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_ke
}
for (auto&& c : r) {
auto cell = make_lw_shared<cell_entry>(*partition, position_in_partition(r.position()), c);
_stats.lock_acquisitions++;
partition->insert(cell);
locks.emplace_back(std::move(cell));
}
@@ -460,14 +494,17 @@ future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_ke
return make_ready_future<std::vector<locked_cell>>(std::move(locks));
}
return do_with(locker(*_schema, *it, std::move(range)), [] (auto& locker) mutable {
return std::move(locker).lock_all();
auto l = std::make_unique<locker>(*_schema, _stats, *it, std::move(range), timeout);
auto f = l->lock_all();
return f.then([l = std::move(l)] {
return std::move(*l).get();
});
}
inline
future<> cell_locker::locker::lock_next() {
while (!is_done()) {
if (_current_cell == _cells_range.end() || _cells_range.empty()) {
if (_current_cell == _current_ck->end()) {
++_current_ck;
update_ck();
continue;
@@ -475,35 +512,37 @@ future<> cell_locker::locker::lock_next() {
auto cid = *_current_cell++;
cell_address ca { position_in_partition(_cells_range.position()), cid };
cell_address ca { position_in_partition(_current_ck->position()), cid };
auto it = _partition_entry.cells().find(ca, _hasher, _eq_cmp);
if (it != _partition_entry.cells().end()) {
return it->lock().then([this, ce = it->shared_from_this()] () mutable {
_stats.operations_waiting_for_lock++;
return it->lock(_timeout).then([this, ce = it->shared_from_this()] () mutable {
_stats.operations_waiting_for_lock--;
_stats.lock_acquisitions++;
_locks.emplace_back(std::move(ce));
});
}
auto cell = make_lw_shared<cell_entry>(_partition_entry, position_in_partition(_cells_range.position()), cid);
auto cell = make_lw_shared<cell_entry>(_partition_entry, position_in_partition(_current_ck->position()), cid);
_stats.lock_acquisitions++;
_partition_entry.insert(cell);
_locks.emplace_back(std::move(cell));
}
return make_ready_future<>();
}
inline
bool cell_locker::partition_entry::upgrade(schema_ptr new_schema) {
if (_schema == new_schema) {
return true;
}
auto buckets = std::make_unique<cells_type::bucket_type[]>(initial_bucket_count);
auto buckets = std::make_unique<cells_type::bucket_type[]>(_cells.bucket_count());
auto cells = cells_type(cells_type::bucket_traits(buckets.get(), _cells.bucket_count()),
cell_entry::hasher(*new_schema), cell_entry::equal_compare(*new_schema));
while (!_cells.empty()) {
auto it = _cells.begin();
auto& cell = *it;
_cells.erase(it);
_cells.clear_and_dispose([&] (cell_entry* cell_ptr) noexcept {
auto& cell = *cell_ptr;
auto kind = cell.position().is_static_row() ? column_kind::static_column
: column_kind::regular_column;
auto reinsert = cell.upgrade(*_schema, *new_schema, kind);
@@ -512,9 +551,16 @@ bool cell_locker::partition_entry::upgrade(schema_ptr new_schema) {
} else {
_cell_count--;
}
}
});
// bi::unordered_set move assignment is actually a swap.
// Original _buckets cannot be destroyed before the container using them is
// so we need to explicitly make sure that the original _cells is no more.
_cells = std::move(cells);
auto destroy = [] (auto) { };
destroy(std::move(cells));
_buckets = std::move(buckets);
_schema = new_schema;
return _cell_count;
}
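The hunks above thread a `timeout_clock::time_point` through `lock_cells()` down to `cell_entry::lock(timeout)`, which waits on a semaphore with a deadline. A rough synchronous analogue of that per-cell timed lock — `cell_lock` is a hypothetical stand-in; seastar's `basic_semaphore::wait(timeout)` resolves a future and fails it with a timeout exception rather than returning `false` as sketched here:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Minimal timed lock: acquisition gives up at the deadline instead of
// blocking forever, mirroring the semaphore_type::wait(_timeout) call above.
class cell_lock {
    std::mutex _m;
    std::condition_variable _cv;
    bool _held = false;
public:
    bool lock(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_m);
        if (!_cv.wait_for(lk, timeout, [this] { return !_held; })) {
            return false; // deadline passed while the cell stayed locked
        }
        _held = true;
        return true;
    }
    void unlock() {
        std::lock_guard<std::mutex> lk(_m);
        _held = false;
        _cv.notify_one();
    }
};
```

Bounding the wait is what lets a write stuck behind a long-held cell lock fail cleanly instead of piling up indefinitely (the new `operations_waiting_for_lock` counter tracks exactly those waiters).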


@@ -112,6 +112,11 @@ public:
});
}
virtual future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) override {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->dma_read_bulk(offset, range_size, pc);
});
}
private:
const io_error_handler& _error_handler;
file _file;


@@ -19,6 +19,6 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "gc_clock.hh"
#include "clocks-impl.hh"
std::atomic<int64_t> clocks_offset;

clocks-impl.hh (new file, 49 lines)

@@ -0,0 +1,49 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
extern std::atomic<int64_t> clocks_offset;
template<typename Duration>
static inline void forward_jump_clocks(Duration delta)
{
auto d = std::chrono::duration_cast<std::chrono::seconds>(delta).count();
clocks_offset.fetch_add(d, std::memory_order_relaxed);
}
static inline std::chrono::seconds get_clocks_offset()
{
auto off = clocks_offset.load(std::memory_order_relaxed);
return std::chrono::seconds(off);
}
// Returns a time point which is earlier from t by d, or minimum time point if it cannot be represented.
template<typename Clock, typename Duration, typename Rep, typename Period>
inline
auto saturating_subtract(std::chrono::time_point<Clock, Duration> t, std::chrono::duration<Rep, Period> d) -> decltype(t) {
return std::max(t, decltype(t)::min() + d) - d;
}
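`saturating_subtract` clamps instead of overflowing: subtracting a duration from a time point near the clock's minimum yields `min()` rather than wrapping around. The template below is the one from the hunk above (minus `inline`), shown standalone to make the clamping behavior concrete:

```cpp
#include <algorithm>
#include <chrono>

// Returns a time point which is earlier than t by d, or the minimum time
// point if that cannot be represented (no signed underflow).
template<typename Clock, typename Duration, typename Rep, typename Period>
auto saturating_subtract(std::chrono::time_point<Clock, Duration> t,
                         std::chrono::duration<Rep, Period> d) -> decltype(t) {
    // If t < min() + d, plain subtraction would underflow; clamp t up first.
    return std::max(t, decltype(t)::min() + d) - d;
}
```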


@@ -54,8 +54,8 @@ static inline bound_kind flip_bound_kind(bound_kind bk)
}
class bound_view {
const static thread_local clustering_key empty_prefix;
public:
const static thread_local clustering_key empty_prefix;
const clustering_key_prefix& prefix;
bound_kind kind;
bound_view(const clustering_key_prefix& prefix, bound_kind kind)
@@ -133,20 +133,33 @@ public:
static bound_view top() {
return {empty_prefix, bound_kind::incl_end};
}
/*
template<template<typename> typename T, typename U>
concept bool Range() {
return requires (T<U> range) {
{ range.start() } -> stdx::optional<U>;
{ range.end() } -> stdx::optional<U>;
};
};*/
template<template<typename> typename Range>
static std::pair<bound_view, bound_view> from_range(const Range<clustering_key_prefix>& range) {
return {
range.start() ? bound_view(range.start()->value(), range.start()->is_inclusive() ? bound_kind::incl_start : bound_kind::excl_start) : bottom(),
range.end() ? bound_view(range.end()->value(), range.end()->is_inclusive() ? bound_kind::incl_end : bound_kind::excl_end) : top(),
};
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )
static bound_view from_range_start(const R<clustering_key_prefix>& range) {
return range.start()
? bound_view(range.start()->value(), range.start()->is_inclusive() ? bound_kind::incl_start : bound_kind::excl_start)
: bottom();
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )
static bound_view from_range_end(const R<clustering_key_prefix>& range) {
return range.end()
? bound_view(range.end()->value(), range.end()->is_inclusive() ? bound_kind::incl_end : bound_kind::excl_end)
: top();
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )
static std::pair<bound_view, bound_view> from_range(const R<clustering_key_prefix>& range) {
return {from_range_start(range), from_range_end(range)};
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )
static stdx::optional<typename R<clustering_key_prefix_view>::bound> to_range_bound(const bound_view& bv) {
if (&bv.prefix == &empty_prefix) {
return {};
}
bool inclusive = bv.kind != bound_kind::excl_end && bv.kind != bound_kind::excl_start;
return {typename R<clustering_key_prefix_view>::bound(bv.prefix.view(), inclusive)};
}
friend std::ostream& operator<<(std::ostream& out, const bound_view& b) {
return out << "{bound: prefix=" << b.prefix << ", kind=" << b.kind << "}";


@@ -54,6 +54,7 @@ public:
auto end() const { return _ref.end(); }
bool empty() const { return _ref.empty(); }
size_t size() const { return _ref.size(); }
const clustering_row_ranges& ranges() const { return _ref; }
static clustering_key_filter_ranges get_ranges(const schema& schema, const query::partition_slice& slice, const partition_key& key) {
const query::clustering_row_ranges& ranges = slice.row_ranges(schema, key);

clustering_ranges_walker.hh (new file, 219 lines)

@@ -0,0 +1,219 @@
/*
* Copyright (C) 2017 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "schema.hh"
#include "query-request.hh"
#include "streamed_mutation.hh"
// Utility for in-order checking of overlap with position ranges.
class clustering_ranges_walker {
const schema& _schema;
const query::clustering_row_ranges& _ranges;
query::clustering_row_ranges::const_iterator _current;
query::clustering_row_ranges::const_iterator _end;
bool _in_current; // next position is known to be >= _current_start
bool _with_static_row;
position_in_partition_view _current_start;
position_in_partition_view _current_end;
stdx::optional<position_in_partition> _trim;
size_t _change_counter = 1;
private:
bool advance_to_next_range() {
_in_current = false;
if (!_current_start.is_static_row()) {
if (_current == _end) {
return false;
}
++_current;
}
++_change_counter;
if (_current == _end) {
_current_end = _current_start = position_in_partition_view::after_all_clustered_rows();
return false;
}
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
return true;
}
public:
clustering_ranges_walker(const schema& s, const query::clustering_row_ranges& ranges, bool with_static_row = true)
: _schema(s)
, _ranges(ranges)
, _current(ranges.begin())
, _end(ranges.end())
, _in_current(with_static_row)
, _with_static_row(with_static_row)
, _current_start(position_in_partition_view::for_static_row())
, _current_end(position_in_partition_view::before_all_clustered_rows())
{
if (!with_static_row) {
if (_current == _end) {
_current_start = position_in_partition_view::before_all_clustered_rows();
} else {
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
}
}
}
clustering_ranges_walker(clustering_ranges_walker&& o) noexcept
: _schema(o._schema)
, _ranges(o._ranges)
, _current(o._current)
, _end(o._end)
, _in_current(o._in_current)
, _with_static_row(o._with_static_row)
, _current_start(o._current_start)
, _current_end(o._current_end)
, _trim(std::move(o._trim))
, _change_counter(o._change_counter)
{ }
clustering_ranges_walker& operator=(clustering_ranges_walker&& o) {
if (this != &o) {
this->~clustering_ranges_walker();
new (this) clustering_ranges_walker(std::move(o));
}
return *this;
}
// Excludes positions smaller than pos from the ranges.
// pos should be monotonic.
// No constraints between pos and positions passed to advance_to().
//
// After the invocation, when !out_of_range(), lower_bound() returns the smallest position still contained.
void trim_front(position_in_partition pos) {
position_in_partition::less_compare less(_schema);
do {
if (!less(_current_start, pos)) {
break;
}
if (less(pos, _current_end)) {
_trim = std::move(pos);
_current_start = *_trim;
_in_current = false;
++_change_counter;
break;
}
} while (advance_to_next_range());
}
// Returns true if given position is contained.
// Must be called with monotonic positions.
// Idempotent.
bool advance_to(position_in_partition_view pos) {
position_in_partition::less_compare less(_schema);
do {
if (!_in_current && less(pos, _current_start)) {
break;
}
// All subsequent clustering keys are larger than the start of this
// range so there is no need to check that again.
_in_current = true;
if (less(pos, _current_end)) {
return true;
}
} while (advance_to_next_range());
return false;
}
// Returns true if the range expressed by start and end (as in position_range) overlaps
// with clustering ranges.
// Must be called with monotonic start position. That position must also be greater than
// the last position passed to the other advance_to() overload.
// Idempotent.
bool advance_to(position_in_partition_view start, position_in_partition_view end) {
position_in_partition::less_compare less(_schema);
do {
if (!less(_current_start, end)) {
break;
}
if (less(start, _current_end)) {
return true;
}
} while (advance_to_next_range());
return false;
}
// Returns true if the range tombstone expressed by start and end (as in position_range) overlaps
// with clustering ranges.
// No monotonicity restrictions on argument values across calls.
// Does not affect lower_bound().
// Idempotent.
bool contains_tombstone(position_in_partition_view start, position_in_partition_view end) const {
position_in_partition::less_compare less(_schema);
if (_trim && less(end, *_trim)) {
return false;
}
auto i = _current;
while (i != _end) {
auto range_start = position_in_partition_view::for_range_start(*i);
if (less(end, range_start)) {
return false;
}
auto range_end = position_in_partition_view::for_range_end(*i);
if (less(start, range_end)) {
return true;
}
++i;
}
return false;
}
// Returns true if advanced past all contained positions. Any later advance_to() until reset() will return false.
bool out_of_range() const {
return !_in_current && _current == _end;
}
// Resets the state of the walker so that advance_to() can be now called for new sequence of positions.
// Any range trimmings still hold after this.
void reset() {
auto trim = std::move(_trim);
auto ctr = _change_counter;
*this = clustering_ranges_walker(_schema, _ranges, _with_static_row);
_change_counter = ctr + 1;
if (trim) {
trim_front(std::move(*trim));
}
}
// Can be called only when !out_of_range()
position_in_partition_view lower_bound() const {
return _current_start;
}
// Changes whenever lower_bound() changes.
// Always > 0.
size_t lower_bound_change_counter() const {
return _change_counter;
}
};


@@ -39,6 +39,7 @@ class compaction_strategy_impl;
class sstable;
class sstable_set;
struct compaction_descriptor;
struct resharding_descriptor;
class compaction_strategy {
::shared_ptr<compaction_strategy_impl> _compaction_strategy_impl;
@@ -54,6 +55,8 @@ public:
// Return a list of sstables to be compacted after applying the strategy.
compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<lw_shared_ptr<sstable>> candidates);
std::vector<resharding_descriptor> get_resharding_jobs(column_family& cf, std::vector<lw_shared_ptr<sstable>> candidates);
// Some strategies may look at the compacted and resulting sstables to
// get some useful information for subsequent compactions.
void notify_completion(const std::vector<lw_shared_ptr<sstable>>& removed, const std::vector<lw_shared_ptr<sstable>>& added);


@@ -130,10 +130,10 @@ public:
bytes decompose_value(const value_type& values) {
return serialize_value(values);
}
class iterator : public std::iterator<std::input_iterator_tag, bytes_view> {
class iterator : public std::iterator<std::input_iterator_tag, const bytes_view> {
private:
bytes_view _v;
value_type _current;
bytes_view _current;
private:
void read_current() {
size_type len;
@@ -220,6 +220,9 @@ public:
assert(AllowPrefixes == allow_prefixes::yes);
return std::distance(begin(v), end(v)) == (ssize_t)_types.size();
}
bool is_empty(bytes_view v) const {
return begin(v) == end(v);
}
void validate(bytes_view v) {
// FIXME: implement
warn(unimplemented::cause::VALIDATION);


@@ -184,6 +184,8 @@ bytes to_legacy(CompoundType& type, bytes_view packed) {
return legacy_form;
}
class composite_view;
// Represents a value serialized according to Origin's CompositeType.
// If is_compound is true, then the value is one or more components encoded as:
//
@@ -202,7 +204,7 @@ public:
, _is_compound(is_compound)
{ }
composite(bytes&& b)
explicit composite(bytes&& b)
: _bytes(std::move(b))
, _is_compound(true)
{ }
@@ -239,7 +241,7 @@ public:
using component_view = std::pair<bytes_view, eoc>;
private:
template<typename Value, typename = std::enable_if_t<!std::is_same<const data_value, std::decay_t<Value>>::value>>
static size_t size(Value& val) {
static size_t size(const Value& val) {
return val.size();
}
static size_t size(const data_value& val) {
@@ -304,23 +306,36 @@ public:
return f(const_cast<bytes&>(_bytes));
}
// marker is ignored if !is_compound
template<typename RangeOfSerializedComponents>
static bytes serialize_value(RangeOfSerializedComponents&& values, bool is_compound = true) {
static composite serialize_value(RangeOfSerializedComponents&& values, bool is_compound = true, eoc marker = eoc::none) {
auto size = serialized_size(values, is_compound);
bytes b(bytes::initialized_later(), size);
auto i = b.begin();
serialize_value(std::forward<decltype(values)>(values), i, is_compound);
return b;
if (is_compound && !b.empty()) {
b.back() = eoc_type(marker);
}
return composite(std::move(b), is_compound);
}
template<typename RangeOfSerializedComponents>
static composite serialize_static(const schema& s, RangeOfSerializedComponents&& values) {
// FIXME: Optimize
auto b = bytes(size_t(2), bytes::value_type(0xff));
std::vector<bytes_view> sv(s.clustering_key_size());
b += composite::serialize_value(boost::range::join(sv, std::forward<RangeOfSerializedComponents>(values)), true).release_bytes();
return composite(std::move(b));
}
static eoc to_eoc(int8_t eoc_byte) {
return eoc_byte == 0 ? eoc::none : (eoc_byte < 0 ? eoc::start : eoc::end);
}
class iterator : public std::iterator<std::input_iterator_tag, const component_view> {
bytes_view _v;
component_view _current;
private:
eoc to_eoc(int8_t eoc_byte) {
return eoc_byte == 0 ? eoc::none : (eoc_byte < 0 ? eoc::start : eoc::end);
}
void read_current() {
size_type len;
{
@@ -406,6 +421,10 @@ public:
return _bytes;
}
bytes release_bytes() && {
return std::move(_bytes);
}
size_t size() const {
return _bytes.size();
}
@@ -426,26 +445,20 @@ public:
return _is_compound;
}
// The following factory functions assume this composite is a compound value.
template <typename ClusteringElement>
static composite from_clustering_element(const schema& s, const ClusteringElement& ce) {
return serialize_value(ce.components(s));
return serialize_value(ce.components(s), s.is_compound());
}
static composite from_exploded(const std::vector<bytes_view>& v, eoc marker = eoc::none) {
static composite from_exploded(const std::vector<bytes_view>& v, bool is_compound, eoc marker = eoc::none) {
if (v.size() == 0) {
return bytes(size_t(1), bytes::value_type(marker));
return composite(bytes(size_t(1), bytes::value_type(marker)), is_compound);
}
auto b = serialize_value(v);
b.back() = eoc_type(marker);
return composite(std::move(b));
return serialize_value(v, is_compound, marker);
}
static composite static_prefix(const schema& s) {
static bytes static_marker(size_t(2), bytes::value_type(0xff));
std::vector<bytes_view> sv(s.clustering_key_size());
return static_marker + serialize_value(sv);
return serialize_static(s, std::vector<bytes_view>());
}
explicit operator bytes_view() const {
@@ -456,6 +469,15 @@ public:
friend inline std::ostream& operator<<(std::ostream& os, const std::pair<Component, eoc>& c) {
return os << "{value=" << c.first << "; eoc=" << sprint("0x%02x", eoc_type(c.second) & 0xff) << "}";
}
friend std::ostream& operator<<(std::ostream& os, const composite& v);
struct tri_compare {
const std::vector<data_type>& _types;
tri_compare(const std::vector<data_type>& types) : _types(types) {}
int operator()(const composite&, const composite&) const;
int operator()(composite_view, composite_view) const;
};
};
class composite_view final {
@@ -476,14 +498,15 @@ public:
, _is_compound(true)
{ }
std::vector<bytes> explode() const {
std::vector<bytes_view> explode() const {
if (!_is_compound) {
return { to_bytes(_bytes) };
return { _bytes };
}
std::vector<bytes> ret;
std::vector<bytes_view> ret;
ret.reserve(8);
for (auto it = begin(), e = end(); it != e; ) {
ret.push_back(to_bytes(it->first));
ret.push_back(it->first);
auto marker = it->second;
++it;
if (it != e && marker != composite::eoc::none) {
@@ -505,6 +528,15 @@ public:
return { begin(), end() };
}
composite::eoc last_eoc() const {
if (!_is_compound || _bytes.empty()) {
return composite::eoc::none;
}
bytes_view v(_bytes);
v.remove_prefix(v.size() - 1);
return composite::to_eoc(read_simple<composite::eoc_type>(v));
}
auto values() const {
return components() | boost::adaptors::transformed([](auto&& c) { return c.first; });
}
@@ -527,4 +559,46 @@ public:
bool operator==(const composite_view& k) const { return k._bytes == _bytes && k._is_compound == _is_compound; }
bool operator!=(const composite_view& k) const { return !(k == *this); }
friend inline std::ostream& operator<<(std::ostream& os, composite_view v) {
return os << "{" << ::join(", ", v.components()) << ", compound=" << v._is_compound << ", static=" << v.is_static() << "}";
}
};
inline
std::ostream& operator<<(std::ostream& os, const composite& v) {
return os << composite_view(v);
}
inline
int composite::tri_compare::operator()(const composite& v1, const composite& v2) const {
return (*this)(composite_view(v1), composite_view(v2));
}
inline
int composite::tri_compare::operator()(composite_view v1, composite_view v2) const {
// See org.apache.cassandra.db.composites.AbstractCType#compare
if (v1.empty()) {
return v2.empty() ? 0 : -1;
}
if (v2.empty()) {
return 1;
}
if (v1.is_static() != v2.is_static()) {
return v1.is_static() ? -1 : 1;
}
auto a_values = v1.components();
auto b_values = v2.components();
auto cmp = [&](const data_type& t, component_view c1, component_view c2) {
// First by value, then by EOC
auto r = t->compare(c1.first, c2.first);
if (r) {
return r;
}
return static_cast<int>(c1.second) - static_cast<int>(c2.second);
};
return lexicographical_tri_compare(_types.begin(), _types.end(),
a_values.begin(), a_values.end(),
b_values.begin(), b_values.end(),
cmp);
}


@@ -89,6 +89,15 @@ listen_address: localhost
# For security reasons, you should not expose this port to the internet. Firewall it if needed.
native_transport_port: 9042
# Enabling native transport encryption in client_encryption_options allows you to either use
# encryption for the standard port or to use a dedicated, additional port along with the unencrypted
# standard native_transport_port.
# Enabling client encryption and keeping native_transport_port_ssl disabled will use encryption
# for native_transport_port. Setting native_transport_port_ssl to a different value
# from native_transport_port will use encryption for native_transport_port_ssl while
# keeping native_transport_port unencrypted.
#native_transport_port_ssl: 9142
# Throttles all outbound streaming file transfers on this node to the
# given total throughput in Mbps. This is necessary because Scylla does
# mostly sequential IO when streaming data during bootstrap or repair, which
@@ -192,6 +201,9 @@ api_address: 127.0.0.1
# Caution should be taken on increasing the size of this threshold as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
# Fail any multiple-partition batch exceeding this value. 50kb (10x warn threshold) by default.
batch_size_fail_threshold_in_kb: 50
# Authentication backend, identifying users
# Out of the box, Scylla provides org.apache.cassandra.auth.{AllowAllAuthenticator,
# PasswordAuthenticator}.
@@ -223,6 +235,9 @@ batch_size_warn_threshold_in_kb: 5
# be set.
# broadcast_rpc_address: 1.2.3.4
# Uncomment to enable experimental features
# experimental: true
###################################################
## Not currently supported, reserved for future use
###################################################
@@ -279,28 +294,6 @@ batch_size_warn_threshold_in_kb: 5
#
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# policy for data disk failures:
# die: shut down gossip and Thrift and kill the JVM for any fs errors or
# single-sstable errors, so the node can be replaced.
# stop_paranoid: shut down gossip and Thrift even for single-sstable errors.
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
# can still be inspected via JMX.
# best_effort: stop using the failed disk and respond to requests based on
# remaining available sstables. This means you WILL see obsolete
# data at CL.ONE!
# ignore: ignore fatal errors and let requests fail, as in pre-1.2 Scylla
# disk_failure_policy: stop
# policy for commit disk failures:
# die: shut down gossip and Thrift and kill the JVM, so the node can be replaced.
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
# can still be inspected via JMX.
# stop_commit: shutdown the commit log, letting writes collect but
# continuing to service reads, as in pre-2.0.5 Scylla
# ignore: ignore fatal errors and let the batches fail
# commit_failure_policy: stop
# Maximum size of the key cache in memory.
#
# Each key cache hit saves 1 seek and each row cache hit saves 2 seeks at the
@@ -727,22 +720,17 @@ commitlog_total_space_in_mb: -1
# certificate: conf/scylla.crt
# keyfile: conf/scylla.key
# truststore: <none, use system trust>
# require_client_auth: False
# priority_string: <none, use default>
# enable or disable client/server encryption.
# client_encryption_options:
# enabled: false
# certificate: conf/scylla.crt
# keyfile: conf/scylla.key
# require_client_auth: false
# Set trustore and truststore_password if require_client_auth is true
# truststore: conf/.truststore
# truststore_password: cassandra
# More advanced defaults below:
# protocol: TLS
# algorithm: SunX509
# store_type: JKS
# cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA]
# truststore: <none, use system trust>
# require_client_auth: False
# priority_string: <none, use default>
# internode_compression controls whether traffic between nodes is
# compressed.
@@ -788,3 +776,23 @@ commitlog_total_space_in_mb: -1
# By default, Scylla binds all interfaces to the prometheus API
# It is possible to restrict the listening address to a specific one
# prometheus_address: 0.0.0.0
# Distribution of data among cores (shards) within a node
#
# Scylla distributes data within a node among shards, using a round-robin
# strategy:
# [shard0] [shard1] ... [shardN-1] [shard0] [shard1] ... [shardN-1] ...
#
# Scylla versions 1.6 and below used just one repetition of the pattern;
# this interfered with data placement among nodes (vnodes).
#
# Scylla versions 1.7 and above use 4096 repetitions of the pattern; this
# provides for better data distribution.
#
# the value below is log (base 2) of the number of repetitions.
#
# Set to 0 to avoid rewriting all data when upgrading from Scylla 1.6 and
# below.
#
# Keep at 12 for new clusters.
murmur3_partitioner_ignore_msb_bits: 12


@@ -34,7 +34,7 @@ for line in open('/etc/os-release'):
os_ids += value.split(' ')
# distribution "internationalization", converting package names.
# Fedora name is key, values is distro -> package name dict.
i18n_xlat = {
'boost-devel': {
'debian': 'libboost-dev',
@@ -48,7 +48,7 @@ def pkgname(name):
for id in os_ids:
if id in dict:
return dict[id]
return name
def get_flags():
with open('/proc/cpuinfo') as f:
@@ -93,7 +93,7 @@ def try_compile(compiler, source = '', flags = []):
def warning_supported(warning, compiler):
# gcc ignores -Wno-x even if it is not supported
adjusted = re.sub('^-Wno-', '-W', warning)
return try_compile(flags = [adjusted], compiler = compiler)
return try_compile(flags = ['-Werror', adjusted], compiler = compiler)
def debug_flag(compiler):
src_with_auto = textwrap.dedent('''\
@@ -175,6 +175,8 @@ scylla_tests = [
'tests/keys_test',
'tests/partitioner_test',
'tests/frozen_mutation_test',
'tests/serialized_action_test',
'tests/clustering_ranges_walker_test',
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
'tests/lsa_sync_eviction_test',
@@ -183,6 +185,9 @@ scylla_tests = [
'tests/perf/perf_hash',
'tests/perf/perf_cql_parser',
'tests/perf/perf_simple_query',
'tests/perf/perf_fast_forward',
'tests/cache_streamed_mutation_test',
'tests/row_cache_stress_test',
'tests/memory_footprint',
'tests/perf/perf_sstable',
'tests/cql_query_test',
@@ -194,6 +199,7 @@ scylla_tests = [
'tests/test-serialization',
'tests/sstable_test',
'tests/sstable_mutation_test',
'tests/sstable_resharding_test',
'tests/memtable_test',
'tests/commitlog_test',
'tests/cartesian_product_test',
@@ -215,6 +221,7 @@ scylla_tests = [
'tests/murmur_hash_test',
'tests/allocation_strategy_test',
'tests/logalloc_test',
'tests/log_histogram_test',
'tests/managed_vector_test',
'tests/crc_test',
'tests/flush_queue_test',
@@ -230,6 +237,8 @@ scylla_tests = [
'tests/virtual_reader_test',
'tests/view_schema_test',
'tests/counter_test',
'tests/cell_locker_test',
'tests/loading_cache_test',
]
apps = [
@@ -260,6 +269,8 @@ arg_parser.add_argument('--ldflags', action = 'store', dest = 'user_ldflags', de
help = 'Extra flags for the linker')
arg_parser.add_argument('--compiler', action = 'store', dest = 'cxx', default = 'g++',
help = 'C++ compiler path')
arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='gcc',
help='C compiler path')
arg_parser.add_argument('--with-osv', action = 'store', dest = 'with_osv', default = '',
help = 'Shortcut for compile for OSv')
arg_parser.add_argument('--enable-dpdk', action = 'store_true', dest = 'dpdk', default = False,
@@ -280,6 +291,10 @@ arg_parser.add_argument('--python', action = 'store', dest = 'python', default =
help = 'Python3 path')
add_tristate(arg_parser, name = 'hwloc', dest = 'hwloc', help = 'hwloc support')
add_tristate(arg_parser, name = 'xen', dest = 'xen', help = 'Xen support')
arg_parser.add_argument('--enable-gcc6-concepts', dest='gcc6_concepts', action='store_true', default=False,
help='enable experimental support for C++ Concepts as implemented in GCC 6')
arg_parser.add_argument('--enable-alloc-failure-injector', dest='alloc_failure_injector', action='store_true', default=False,
help='enable allocation failure injection')
args = arg_parser.parse_args()
defines = []
@@ -343,6 +358,7 @@ scylla_core = (['database.cc',
'cql3/statements/create_view_statement.cc',
'cql3/statements/create_type_statement.cc',
'cql3/statements/create_user_statement.cc',
'cql3/statements/drop_index_statement.cc',
'cql3/statements/drop_keyspace_statement.cc',
'cql3/statements/drop_table_statement.cc',
'cql3/statements/drop_view_statement.cc',
@@ -408,16 +424,22 @@ scylla_core = (['database.cc',
'cql3/selection/selector.cc',
'cql3/restrictions/statement_restrictions.cc',
'cql3/result_set.cc',
'cql3/variable_specifications.cc',
'db/consistency_level.cc',
'db/system_keyspace.cc',
'db/schema_tables.cc',
'db/cql_type_parser.cc',
'db/legacy_schema_migrator.cc',
'db/commitlog/commitlog.cc',
'db/commitlog/commitlog_replayer.cc',
'db/commitlog/commitlog_entry.cc',
'db/config.cc',
'db/heat_load_balance.cc',
'db/index/secondary_index.cc',
'db/marshal/type_parser.cc',
'db/batchlog_manager.cc',
'db/view/view.cc',
'index/secondary_index_manager.cc',
'io/io.cc',
'utils/utils.cc',
'utils/UUID_gen.cc',
@@ -438,6 +460,7 @@ scylla_core = (['database.cc',
'gms/gossip_digest_ack2.cc',
'gms/endpoint_state.cc',
'gms/application_state.cc',
'gms/inet_address.cc',
'dht/i_partitioner.cc',
'dht/murmur3_partitioner.cc',
'dht/byte_ordered_partitioner.cc',
@@ -465,7 +488,7 @@ scylla_core = (['database.cc',
'service/client_state.cc',
'service/migration_task.cc',
'service/storage_service.cc',
'service/load_broadcaster.cc',
'service/misc_services.cc',
'service/pager/paging_state.cc',
'service/pager/query_pagers.cc',
'streaming/stream_task.cc',
@@ -481,12 +504,12 @@ scylla_core = (['database.cc',
'streaming/stream_manager.cc',
'streaming/stream_result_future.cc',
'streaming/stream_session_state.cc',
'gc_clock.cc',
'clocks-impl.cc',
'partition_slice_builder.cc',
'init.cc',
'lister.cc',
'repair/repair.cc',
'exceptions/exceptions.cc',
'dns.cc',
'auth/auth.cc',
'auth/authenticated_user.cc',
'auth/authenticator.cc',
@@ -562,6 +585,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/commitlog.idl.hh',
'idl/tracing.idl.hh',
'idl/consistency_level.idl.hh',
'idl/cache_temperature.idl.hh',
]
scylla_tests_dependencies = scylla_core + api + idls + [
@@ -607,6 +631,8 @@ tests_not_using_seastar_test_framework = set([
'tests/perf/perf_cql_parser',
'tests/message',
'tests/perf/perf_simple_query',
'tests/perf/perf_fast_forward',
'tests/row_cache_stress_test',
'tests/memory_footprint',
'tests/gossip',
'tests/perf/perf_sstable',
@@ -619,30 +645,39 @@ for t in tests_not_using_seastar_test_framework:
for t in scylla_tests:
deps[t] = [t + '.cc']
if t not in tests_not_using_seastar_test_framework:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc']
deps['tests/bytes_ostream_test'] = ['tests/bytes_ostream_test.cc']
deps['tests/bytes_ostream_test'] = ['tests/bytes_ostream_test.cc', 'utils/managed_bytes.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/input_stream_test'] = ['tests/input_stream_test.cc']
deps['tests/UUID_test'] = ['utils/UUID_gen.cc', 'tests/UUID_test.cc']
deps['tests/UUID_test'] = ['utils/UUID_gen.cc', 'tests/UUID_test.cc', 'utils/uuid.cc', 'utils/managed_bytes.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'tests/murmur_hash_test.cc']
deps['tests/allocation_strategy_test'] = ['tests/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/log_histogram_test'] = ['tests/log_histogram_test.cc']
deps['tests/anchorless_list_test'] = ['tests/anchorless_list_test.cc']
warnings = [
'-Wno-mismatched-tags', # clang-only
'-Wno-maybe-uninitialized', # false positives on gcc 5
'-Wno-tautological-compare',
'-Wno-parentheses-equality',
'-Wno-c++11-narrowing',
'-Wno-c++1z-extensions',
'-Wno-sometimes-uninitialized',
'-Wno-return-stack-address',
'-Wno-missing-braces',
'-Wno-unused-lambda-capture',
]
warnings = [w
for w in warnings
if warning_supported(warning = w, compiler = args.cxx)]
warnings = ' '.join(warnings)
warnings = ' '.join(warnings + ['-Wno-error=deprecated-declarations'])
dbgflag = debug_flag(args.cxx) if args.debuginfo else ''
tests_link_rule = 'link' if args.tests_debuginfo else 'link_stripped'
@@ -696,6 +731,9 @@ if not try_compile(compiler=args.cxx, source='''\
print('Installed boost version too old. Please update {}.'.format(pkgname("boost-devel")))
sys.exit(1)
has_sanitize_address_use_after_scope = try_compile(compiler=args.cxx, flags=['-fsanitize-address-use-after-scope'], source='int f() {}')
defines = ' '.join(['-D' + d for d in defines])
globals().update(vars(args))
@@ -718,7 +756,7 @@ scylla_release = file.read().strip()
extra_cxxflags["release.cc"] = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\""
seastar_flags = ['--disable-xen']
seastar_flags = []
if args.dpdk:
# fake dependencies on dpdk, so that it is built before anything else
seastar_flags += ['--enable-dpdk']
@@ -728,9 +766,13 @@ if args.staticcxx:
seastar_flags += ['--static-stdc++']
if args.staticboost:
seastar_flags += ['--static-boost']
if args.gcc6_concepts:
seastar_flags += ['--enable-gcc6-concepts']
if args.alloc_failure_injector:
seastar_flags += ['--enable-alloc-failure-injector']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_flags += ['--compiler', args.cxx, '--cflags=%s' % (seastar_cflags)]
seastar_flags += ['--compiler', args.cxx, '--c-compiler', args.cc, '--cflags=%s' % (seastar_cflags)]
status = subprocess.call([python, './configure.py'] + seastar_flags, cwd = 'seastar')
@@ -825,7 +867,7 @@ with open(buildfile, 'w') as f:
f.write(textwrap.dedent('''\
cxxflags_{mode} = -I. -I $builddir/{mode}/gen -I seastar -I seastar/build/{mode}/gen
rule cxx.{mode}
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} -c -o $out $in
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} $obj_cxxflags -c -o $out $in
description = CXX $out
depfile = $out.d
rule link.{mode}
@@ -843,7 +885,16 @@ with open(buildfile, 'w') as f:
command = thrift -gen cpp:cob_style -out $builddir/{mode}/gen $in
description = THRIFT $in
rule antlr3.{mode}
command = sed -e '/^#if 0/,/^#endif/d' $in > $builddir/{mode}/gen/$in && antlr3 $builddir/{mode}/gen/$in && sed -i 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' build/{mode}/gen/${{stem}}Parser.cpp
# We replace many local `ExceptionBaseType* ex` variables with a single function-scope one.
# Because we add such a variable to every function, and because `ExceptionBaseType` is not a global
# name, we also add a global typedef to avoid compilation errors.
command = sed -e '/^#if 0/,/^#endif/d' $in > $builddir/{mode}/gen/$in $
&& antlr3 $builddir/{mode}/gen/$in $
&& sed -i -e 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' $
-e '1i using ExceptionBaseType = int;' $
-e 's/^{{/{{ ExceptionBaseType\* ex = nullptr;/; $
s/ExceptionBaseType\* ex = new/ex = new/' $
build/{mode}/gen/${{stem}}Parser.cpp
description = ANTLR3 $in
''').format(mode = mode, **modeval))
f.write('build {mode}: phony {artifacts}\n'.format(mode = mode,
@@ -886,7 +937,7 @@ with open(buildfile, 'w') as f:
if binary.startswith('tests/'):
local_libs = '$libs'
if binary not in tests_not_using_seastar_test_framework or binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
# Our code's debugging information is huge, and multiplied
@@ -943,7 +994,7 @@ with open(buildfile, 'w') as f:
f.write('build {}: ragel {}\n'.format(hh, src))
for hh in swaggers:
src = swaggers[hh]
f.write('build {}: swagger {}\n'.format(hh,src))
f.write('build {}: swagger {} | seastar/json/json2code.py\n'.format(hh,src))
for hh in serializers:
src = serializers[hh]
f.write('build {}: serializer {} | idl-compiler.py\n'.format(hh,src))
@@ -960,6 +1011,9 @@ with open(buildfile, 'w') as f:
for cc in grammar.sources('$builddir/{}/gen'.format(mode)):
obj = cc.replace('.cpp', '.o')
f.write('build {}: cxx.{} {} || {}\n'.format(obj, mode, cc, ' '.join(serializers)))
if cc.endswith('Parser.cpp') and has_sanitize_address_use_after_scope:
# Parsers end up using huge amounts of stack space and overflowing their stack
f.write(' obj_cxxflags = -fno-sanitize-address-use-after-scope\n')
f.write('build seastar/build/{mode}/libseastar.a seastar/build/{mode}/apps/iotune/iotune seastar/build/{mode}/gen/http/request_parser.hh seastar/build/{mode}/gen/http/http_response_parser.hh: ninja {seastar_deps}\n'
.format(**locals()))
f.write(' pool = seastar_pool\n')


@@ -22,6 +22,7 @@
#pragma once
#include "mutation_partition_view.hh"
#include "mutation_partition.hh"
#include "schema.hh"
// Mutation partition visitor which applies visited data into
@@ -37,12 +38,12 @@ private:
static bool is_compatible(const column_definition& new_def, const data_type& old_type, column_kind kind) {
return ::is_compatible(new_def.kind, kind) && new_def.type->is_value_compatible_with(*old_type);
}
void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
if (is_compatible(new_def, old_type, kind) && cell.timestamp() > new_def.dropped_at()) {
dst.apply(new_def, atomic_cell_or_collection(cell));
}
}
void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, collection_mutation_view cell) {
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, collection_mutation_view cell) {
if (!is_compatible(new_def, old_type, kind)) {
return;
}
@@ -94,8 +95,8 @@ public:
_p.apply_row_tombstone(_p_schema, rt);
}
virtual void accept_row(clustering_key_view key, tombstone deleted_at, const row_marker& rm) override {
deletable_row& r = _p.clustered_row(_p_schema, key);
virtual void accept_row(position_in_partition_view key, const row_tombstone& deleted_at, const row_marker& rm, is_dummy dummy, is_continuous continuous) override {
deletable_row& r = _p.clustered_row(_p_schema, key, dummy, continuous);
r.apply(rm);
r.apply(deleted_at);
_current_row = &r;
@@ -116,4 +117,14 @@ public:
accept_cell(_current_row->cells(), column_kind::regular_column, *def, col.type(), collection);
}
}
// Appends the cell to dst upgrading it to the new schema.
// Cells must have monotonic names.
static void append_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, const atomic_cell_or_collection& cell) {
if (new_def.is_atomic()) {
accept_cell(dst, kind, new_def, old_type, cell.as_atomic_cell());
} else {
accept_cell(dst, kind, new_def, old_type, cell.as_collection_mutation());
}
}
};


@@ -29,6 +29,15 @@ counter_id counter_id::local()
return counter_id(service::get_local_storage_service().get_local_id());
}
bool counter_id::less_compare_1_7_4::operator()(const counter_id& a, const counter_id& b) const
{
if (a._most_significant != b._most_significant) {
return a._most_significant < b._most_significant;
} else {
return a._least_significant < b._least_significant;
}
}
std::ostream& operator<<(std::ostream& os, const counter_id& id) {
return os << id.to_uuid();
}
@@ -42,10 +51,106 @@ std::ostream& operator<<(std::ostream& os, counter_cell_view ccv) {
return os << "{counter_cell timestamp: " << ccv.timestamp() << " shards: {" << ::join(", ", ccv.shards()) << "}}";
}
void counter_cell_builder::do_sort_and_remove_duplicates()
{
boost::range::sort(_shards, [] (auto& a, auto& b) { return a.id() < b.id(); });
std::vector<counter_shard> new_shards;
new_shards.reserve(_shards.size());
for (auto& cs : _shards) {
if (new_shards.empty() || new_shards.back().id() != cs.id()) {
new_shards.emplace_back(cs);
} else {
new_shards.back().apply(cs);
}
}
_shards = std::move(new_shards);
_sorted = true;
}
std::vector<counter_shard> counter_cell_view::shards_compatible_with_1_7_4() const
{
auto sorted_shards = boost::copy_range<std::vector<counter_shard>>(shards());
counter_id::less_compare_1_7_4 cmp;
boost::range::sort(sorted_shards, [&] (auto& a, auto& b) {
return cmp(a.id(), b.id());
});
return sorted_shards;
}
static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
auto dst_it = dst_shards.begin();
auto src_it = src_shards.begin();
while (src_it != src_shards.end()) {
while (dst_it != dst_shards.end() && dst_it->id() < src_it->id()) {
++dst_it;
}
if (dst_it == dst_shards.end() || dst_it->id() != src_it->id()) {
// Fast-path failed. Revert and fall back to the slow path.
if (dst_it == dst_shards.end()) {
--dst_it;
}
while (src_it != src_shards.begin()) {
--src_it;
while (dst_it->id() != src_it->id()) {
--dst_it;
}
src_it->swap_value_and_clock(*dst_it);
}
return false;
}
if (dst_it->logical_clock() < src_it->logical_clock()) {
dst_it->swap_value_and_clock(*src_it);
} else {
src_it->set_value_and_clock(*dst_it);
}
++src_it;
}
auto dst_ts = dst_ccmv.timestamp();
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(std::max(dst_ts, src_ts));
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(true);
return true;
}
static void revert_in_place_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
assert(dst.can_use_mutable_view() && src.can_use_mutable_view());
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
auto dst_it = dst_shards.begin();
auto src_it = src_shards.begin();
while (src_it != src_shards.end()) {
while (dst_it != dst_shards.end() && dst_it->id() < src_it->id()) {
++dst_it;
}
assert(dst_it != dst_shards.end() && dst_it->id() == src_it->id());
dst_it->swap_value_and_clock(*src_it);
++src_it;
}
auto dst_ts = dst_ccmv.timestamp();
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(src_ts);
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
}
bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
// TODO: optimise for single shard existing in the other
// TODO: optimise for no new shards?
auto dst_ac = dst.as_atomic_cell();
auto src_ac = src.as_atomic_cell();
@@ -58,23 +163,29 @@ bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_
}
if (dst_ac.is_counter_update() && src_ac.is_counter_update()) {
// FIXME: store deltas just as a normal int64_t and get rid of these calls
// to long_type
auto src_v = value_cast<int64_t>(long_type->deserialize_value(src_ac.value()));
auto dst_v = value_cast<int64_t>(long_type->deserialize_value(dst_ac.value()));
auto src_v = src_ac.counter_update_value();
auto dst_v = dst_ac.counter_update_value();
dst = atomic_cell::make_live_counter_update(std::max(dst_ac.timestamp(), src_ac.timestamp()),
long_type->decompose(src_v + dst_v));
src_v + dst_v);
return true;
}
assert(!dst_ac.is_counter_update());
assert(!src_ac.is_counter_update());
auto a_shards = counter_cell_view(dst_ac).shards();
auto b_shards = counter_cell_view(src_ac).shards();
if (counter_cell_view(dst_ac).shard_count() >= counter_cell_view(src_ac).shard_count()
&& dst.can_use_mutable_view() && src.can_use_mutable_view()) {
if (apply_in_place(dst, src)) {
return true;
}
}
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
auto dst_shards = counter_cell_view(dst_ac).shards();
auto src_shards = counter_cell_view(src_ac).shards();
counter_cell_builder result;
combine(a_shards.begin(), a_shards.end(), b_shards.begin(), b_shards.end(),
combine(dst_shards.begin(), dst_shards.end(), src_shards.begin(), src_shards.end(),
result.inserter(), counter_shard_view::less_compare_by_id(), [] (auto& x, auto& y) {
return x.logical_clock() < y.logical_clock() ? y : x;
});
@@ -87,10 +198,12 @@ bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_
void counter_cell_view::revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
if (dst.as_atomic_cell().is_counter_update()) {
auto src_v = value_cast<int64_t>(long_type->deserialize_value(src.as_atomic_cell().value()));
auto dst_v = value_cast<int64_t>(long_type->deserialize_value(dst.as_atomic_cell().value()));
auto src_v = src.as_atomic_cell().counter_update_value();
auto dst_v = dst.as_atomic_cell().counter_update_value();
dst = atomic_cell::make_live(dst.as_atomic_cell().timestamp(),
long_type->decompose(dst_v - src_v));
} else if (src.as_atomic_cell().is_counter_in_place_revert_set()) {
revert_in_place_apply(dst, src);
} else {
std::swap(dst, src);
}
@@ -101,10 +214,11 @@ stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, at
assert(!a.is_counter_update());
assert(!b.is_counter_update());
if (!b.is_live()) {
if (!b.is_live() || !a.is_live()) {
if (b.is_live() || (!a.is_live() && compare_atomic_cell_for_merge(b, a) < 0)) {
return atomic_cell(a);
}
return { };
} else if (!a.is_live()) {
return atomic_cell(a);
}
auto a_shards = counter_cell_view(a).shards();
@@ -139,28 +253,68 @@ stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, at
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset) {
// FIXME: allow current_state to be frozen_mutation
auto transform_new_row_to_shards = [clock_offset] (auto& cr) {
cr.row().cells().for_each_cell([clock_offset] (auto, atomic_cell_or_collection& ac_o_c) {
auto transform_new_row_to_shards = [clock_offset] (auto& cells) {
cells.for_each_cell([clock_offset] (auto, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
auto delta = value_cast<int64_t>(long_type->deserialize_value(acv.value()));
counter_cell_builder ccb;
ccb.add_shard(counter_shard(counter_id::local(), delta, clock_offset + 1));
ac_o_c = ccb.build(acv.timestamp());
auto delta = acv.counter_update_value();
auto cs = counter_shard(counter_id::local(), delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
});
};
if (!current_state) {
transform_new_row_to_shards(m.partition().static_row());
for (auto& cr : m.partition().clustered_rows()) {
transform_new_row_to_shards(cr);
transform_new_row_to_shards(cr.row().cells());
}
return;
}
clustering_key::less_compare cmp(*m.schema());
auto transform_row_to_shards = [clock_offset] (auto& transformee, auto& state) {
std::deque<std::pair<column_id, counter_shard>> shards;
state.for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
counter_cell_view ccv(acv);
auto cs = ccv.local_shard();
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard(*cs)));
});
transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
while (!shards.empty() && shards.front().first < id) {
shards.pop_front();
}
auto delta = acv.counter_update_value();
if (shards.empty() || shards.front().first > id) {
auto cs = counter_shard(counter_id::local(), delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
} else {
auto& cs = shards.front().second;
cs.update(delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
shards.pop_front();
}
});
};
transform_row_to_shards(m.partition().static_row(), current_state->partition().static_row());
auto& cstate = current_state->partition();
auto it = cstate.clustered_rows().begin();
auto end = cstate.clustered_rows().end();
@@ -169,60 +323,10 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
++it;
}
if (it == end || cmp(cr.key(), it->key())) {
transform_new_row_to_shards(cr);
transform_new_row_to_shards(cr.row().cells());
continue;
}
struct counter_shard_or_tombstone {
stdx::optional<counter_shard> shard;
tombstone tomb;
};
std::deque<std::pair<column_id, counter_shard_or_tombstone>> shards;
it->row().cells().for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
counter_shard_or_tombstone cs_o_t { { },
tombstone(acv.timestamp(), acv.deletion_time()) };
shards.emplace_back(std::make_pair(id, cs_o_t));
return; // continue -- we are in lambda
}
counter_cell_view ccv(acv);
auto cs = ccv.local_shard();
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard_or_tombstone { counter_shard(*cs), tombstone() }));
});
cr.row().cells().for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
while (!shards.empty() && shards.front().first < id) {
shards.pop_front();
}
auto delta = value_cast<int64_t>(long_type->deserialize_value(acv.value()));
counter_cell_builder ccb;
if (shards.empty() || shards.front().first > id) {
ccb.add_shard(counter_shard(counter_id::local(), delta, clock_offset + 1));
} else if (shards.front().second.tomb.timestamp == api::missing_timestamp) {
auto& cs = *shards.front().second.shard;
cs.update(delta, clock_offset + 1);
ccb.add_shard(cs);
shards.pop_front();
} else {
// We are applying the tombstone that's already there a second time.
// It is not necessary, but there is no easy way to remove a cell
// from a mutation.
tombstone t = shards.front().second.tomb;
ac_o_c = atomic_cell::make_dead(t.timestamp, t.deletion_time);
shards.pop_front();
return; // continue -- we are in lambda
}
ac_o_c = ccb.build(acv.timestamp());
});
transform_row_to_shards(cr.row().cells(), it->row().cells());
}
}
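The per-shard merge rule used throughout the counter diffs above (in `combine()` and in `counter_shard::do_apply`) is last-writer-wins by logical clock: when two shards share a counter id, the one with the higher logical clock supplies the value. A minimal standalone sketch, with `shard` as a simplified stand-in for the real `counter_shard` class:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for counter_shard: merging keeps the value
// carried by the shard with the higher logical clock.
struct shard {
    int64_t value;
    int64_t logical_clock;
    shard& apply(const shard& other) noexcept {
        if (logical_clock < other.logical_clock) {
            logical_clock = other.logical_clock;
            value = other.value;
        }
        return *this;
    }
};
```

This is why the merge is idempotent and commutative per id, which the in-place fast path in `apply_in_place` relies on when it swaps values and clocks for later revert.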


@@ -36,6 +36,10 @@ class counter_id {
int64_t _least_significant;
int64_t _most_significant;
public:
static_assert(std::is_same<decltype(std::declval<utils::UUID>().get_least_significant_bits()), int64_t>::value
&& std::is_same<decltype(std::declval<utils::UUID>().get_most_significant_bits()), int64_t>::value,
"utils::UUID is expected to work with two signed 64-bit integers");
counter_id() = default;
explicit counter_id(utils::UUID uuid) noexcept
: _least_significant(uuid.get_least_significant_bits())
@@ -49,12 +53,20 @@ public:
bool operator<(const counter_id& other) const {
return to_uuid() < other.to_uuid();
}
bool operator>(const counter_id& other) const {
return other.to_uuid() < to_uuid();
}
bool operator==(const counter_id& other) const {
return to_uuid() == other.to_uuid();
}
bool operator!=(const counter_id& other) const {
return !(*this == other);
}
public:
// (Wrong) Counter ID ordering used by Scylla 1.7.4 and earlier.
struct less_compare_1_7_4 {
bool operator()(const counter_id& a, const counter_id& b) const;
};
public:
static counter_id local();
@@ -67,7 +79,8 @@ static_assert(std::is_pod<counter_id>::value, "counter_id should be a POD type")
std::ostream& operator<<(std::ostream& os, const counter_id& id);
class counter_shard_view {
template<typename View>
class basic_counter_shard_view {
enum class offset : unsigned {
id = 0u,
value = unsigned(id) + sizeof(counter_id),
@@ -75,32 +88,58 @@ class counter_shard_view {
total_size = unsigned(logical_clock) + sizeof(int64_t),
};
private:
bytes_view::const_pointer _base;
typename View::pointer _base;
private:
template<typename T>
T read(offset off) const {
T value;
std::copy_n(_base + static_cast<unsigned>(off), sizeof(T), reinterpret_cast<char*>(&value));
std::copy_n(_base + static_cast<unsigned>(off), sizeof(T), reinterpret_cast<signed char*>(&value));
return value;
}
public:
static constexpr auto size = size_t(offset::total_size);
public:
counter_shard_view() = default;
explicit counter_shard_view(bytes_view::const_pointer ptr) noexcept
basic_counter_shard_view() = default;
explicit basic_counter_shard_view(typename View::pointer ptr) noexcept
: _base(ptr) { }
counter_id id() const { return read<counter_id>(offset::id); }
int64_t value() const { return read<int64_t>(offset::value); }
int64_t logical_clock() const { return read<int64_t>(offset::logical_clock); }
void swap_value_and_clock(basic_counter_shard_view& other) noexcept {
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
typename View::value_type tmp[size];
std::copy_n(_base + off, size, tmp);
std::copy_n(other._base + off, size, _base + off);
std::copy_n(tmp, size, other._base + off);
}
void set_value_and_clock(const basic_counter_shard_view& other) noexcept {
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
std::copy_n(other._base + off, size, _base + off);
}
bool operator==(const basic_counter_shard_view& other) const {
return id() == other.id() && value() == other.value()
&& logical_clock() == other.logical_clock();
}
bool operator!=(const basic_counter_shard_view& other) const {
return !(*this == other);
}
struct less_compare_by_id {
bool operator()(const counter_shard_view& x, const counter_shard_view& y) const {
bool operator()(const basic_counter_shard_view& x, const basic_counter_shard_view& y) const {
return x.id() < y.id();
}
};
};
using counter_shard_view = basic_counter_shard_view<bytes_view>;
std::ostream& operator<<(std::ostream& os, counter_shard_view csv);
class counter_shard {
@@ -110,7 +149,23 @@ class counter_shard {
private:
template<typename T>
static void write(const T& value, bytes::iterator& out) {
out = std::copy_n(reinterpret_cast<const char*>(&value), sizeof(T), out);
out = std::copy_n(reinterpret_cast<const signed char*>(&value), sizeof(T), out);
}
private:
// Shared logic for applying counter_shards and counter_shard_views.
// T is either counter_shard or basic_counter_shard_view<U>.
template<typename T>
GCC6_CONCEPT(requires requires(T shard) {
{ shard.value() } -> int64_t;
{ shard.logical_clock() } -> int64_t;
})
counter_shard& do_apply(T&& other) noexcept {
auto other_clock = other.logical_clock();
if (_logical_clock < other_clock) {
_logical_clock = other_clock;
_value = other.value();
}
return *this;
}
public:
counter_shard(counter_id id, int64_t value, int64_t logical_clock) noexcept
@@ -136,12 +191,11 @@ public:
}
counter_shard& apply(counter_shard_view other) noexcept {
auto other_clock = other.logical_clock();
if (_logical_clock < other_clock) {
_logical_clock = other_clock;
_value = other.value();
}
return *this;
return do_apply(other);
}
counter_shard& apply(const counter_shard& other) noexcept {
return do_apply(other);
}
static size_t serialized_size() {
@@ -156,6 +210,9 @@ public:
class counter_cell_builder {
std::vector<counter_shard> _shards;
bool _sorted = true;
private:
void do_sort_and_remove_duplicates();
public:
counter_cell_builder() = default;
counter_cell_builder(size_t shard_count) {
@@ -166,6 +223,21 @@ public:
_shards.emplace_back(cs);
}
void add_maybe_unsorted_shard(const counter_shard& cs) {
add_shard(cs);
if (_sorted && _shards.size() > 1) {
auto current = _shards.rbegin();
auto previous = std::next(current);
_sorted = current->id() > previous->id();
}
}
void sort_and_remove_duplicates() {
if (!_sorted) {
do_sort_and_remove_duplicates();
}
}
size_t serialized_size() const {
return _shards.size() * counter_shard::serialized_size();
}
@@ -180,10 +252,15 @@ public:
}
atomic_cell build(api::timestamp_type timestamp) const {
bytes b(bytes::initialized_later(), serialized_size());
auto out = b.begin();
serialize(out);
return atomic_cell::make_live(timestamp, b);
return atomic_cell::make_live_from_serializer(timestamp, serialized_size(), [this] (bytes::iterator out) {
serialize(out);
});
}
static atomic_cell from_single_shard(api::timestamp_type timestamp, const counter_shard& cs) {
return atomic_cell::make_live_from_serializer(timestamp, counter_shard::serialized_size(), [&cs] (bytes::iterator out) {
cs.serialize(out);
});
}
class inserter_iterator : public std::iterator<std::output_iterator_tag, counter_shard> {
@@ -210,26 +287,28 @@ public:
// <counter_id> := <int64_t><int64_t>
// <shard> := <counter_id><int64_t:value><int64_t:logical_clock>
// <counter_cell> := <shard>*
class counter_cell_view {
atomic_cell_view _cell;
template<typename View>
class basic_counter_cell_view {
protected:
atomic_cell_base<View> _cell;
private:
class shard_iterator : public std::iterator<std::input_iterator_tag, const counter_shard_view> {
bytes_view::const_pointer _current;
counter_shard_view _current_view;
class shard_iterator : public std::iterator<std::input_iterator_tag, basic_counter_shard_view<View>> {
typename View::pointer _current;
basic_counter_shard_view<View> _current_view;
public:
shard_iterator() = default;
shard_iterator(bytes_view::const_pointer ptr) noexcept
shard_iterator(typename View::pointer ptr) noexcept
: _current(ptr), _current_view(ptr) { }
const counter_shard_view& operator*() const noexcept {
basic_counter_shard_view<View>& operator*() noexcept {
return _current_view;
}
const counter_shard_view* operator->() const noexcept {
basic_counter_shard_view<View>* operator->() noexcept {
return &_current_view;
}
shard_iterator& operator++() noexcept {
_current += counter_shard_view::size;
_current_view = counter_shard_view(_current);
_current_view = basic_counter_shard_view<View>(_current);
return *this;
}
shard_iterator operator++(int) noexcept {
@@ -237,6 +316,16 @@ private:
operator++();
return it;
}
shard_iterator& operator--() noexcept {
_current -= counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
return *this;
}
shard_iterator operator--(int) noexcept {
auto it = *this;
operator--();
return it;
}
bool operator==(const shard_iterator& other) const noexcept {
return _current == other._current;
}
@@ -257,7 +346,7 @@ public:
}
public:
// ac must be a live counter cell
explicit counter_cell_view(atomic_cell_view ac) noexcept : _cell(ac) {
explicit basic_counter_cell_view(atomic_cell_base<View> ac) noexcept : _cell(ac) {
assert(_cell.is_live());
assert(!_cell.is_counter_update());
}
@@ -287,6 +376,17 @@ public:
return get_shard(counter_id::local());
}
bool operator==(const basic_counter_cell_view& other) const {
return timestamp() == other.timestamp() && boost::equal(shards(), other.shards());
}
};
struct counter_cell_view : basic_counter_cell_view<bytes_view> {
using basic_counter_cell_view::basic_counter_cell_view;
// Returns counter shards in an order that is compatible with Scylla 1.7.4.
std::vector<counter_shard> shards_compatible_with_1_7_4() const;
// Reversibly applies two counter cells, at least one of them must be live.
// Returns true iff dst was modified.
static bool apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
@@ -301,6 +401,12 @@ public:
friend std::ostream& operator<<(std::ostream& os, counter_cell_view ccv);
};
struct counter_cell_mutable_view : basic_counter_cell_view<bytes_mutable_view> {
using basic_counter_cell_view::basic_counter_cell_view;
void set_timestamp(api::timestamp_type ts) { _cell.set_timestamp(ts); }
};
// Transforms mutation dst from counter updates to counter shards using state
// stored in current_state.
// If current_state is present it has to be in the same schema as dst.

cpu_controller.hh (new file, 89 lines)

@@ -0,0 +1,89 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastar/core/thread.hh>
#include <seastar/core/timer.hh>
#include <chrono>
// Simple proportional controller to adjust shares of memtable/streaming flushes.
//
// Goal is to flush as fast as we can, but not so fast that we steal all the CPU from incoming
// requests, and at the same time minimize user-visible fluctuations in the flush quota.
//
// What that translates to is we'll try to keep virtual dirty's first derivative at 0 (IOW, we keep
// virtual dirty constant), which means that the rate of incoming writes is equal to the rate of
// flushed bytes.
//
// The exact point at which the controller stops determines the desired flush CPU usage. As we
// approach the hard dirty limit, we need to be more aggressive. We will therefore define two
// thresholds, and increase the constant as we cross them.
//
// 1) the soft limit line
// 2) halfway between soft limit and dirty limit
//
// The constants q1 and q2 are used to determine the proportional factor at each stage.
//
// Below the soft limit, we are in no particular hurry to flush, since it means we're set to
complete flushing before a new memtable is ready. The quota is dirty * q1, and q1 is set to a
// low number.
//
// The first half of the virtual dirty region is where we usually expect to be, so we have a low
// slope corresponding to a sluggish response between q1 * soft_limit and q2.
//
// In the second half, we're getting close to the hard dirty limit so we increase the slope and
// become more responsive, up to a maximum quota of qmax.
//
// For now we'll just set them in the structure not to complicate the constructor. But q1, q2 and
// qmax can easily become parameters if we find another user.
class flush_cpu_controller {
static constexpr float hard_dirty_limit = 0.50;
static constexpr float q1 = 0.01;
static constexpr float q2 = 0.2;
static constexpr float qmax = 1;
float _current_quota = 0.0f;
float _goal;
std::function<float()> _current_dirty;
std::chrono::milliseconds _interval;
timer<> _update_timer;
seastar::thread_scheduling_group _scheduling_group;
seastar::thread_scheduling_group *_current_scheduling_group = nullptr;
void adjust();
public:
seastar::thread_scheduling_group* scheduling_group() {
return _current_scheduling_group;
}
float current_quota() const {
return _current_quota;
}
struct disabled {
seastar::thread_scheduling_group *backup;
};
flush_cpu_controller(disabled d) : _scheduling_group(std::chrono::nanoseconds(0), 0), _current_scheduling_group(d.backup) {}
flush_cpu_controller(std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty);
flush_cpu_controller(flush_cpu_controller&&) = default;
};
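The two-threshold proportional curve described in the comment block can be sketched as a standalone function. This is an illustrative interpolation under the stated constants (q1, q2, qmax, the soft limit, and the hard dirty limit); the exact curve inside `flush_cpu_controller::adjust()` is not shown in this diff, so the piecewise-linear details here are an assumption:

```cpp
#include <algorithm>

// Hypothetical sketch of the flush-quota curve described above:
// below the soft limit the quota stays near q1, rises gently to q2
// halfway between the soft and hard limits, then climbs steeply
// toward qmax as dirty memory approaches the hard limit.
float flush_quota(float dirty, float soft_limit, float hard_limit = 0.50f) {
    constexpr float q1 = 0.01f, q2 = 0.2f, qmax = 1.0f;
    if (dirty < soft_limit) {
        // No particular hurry: low proportional factor.
        return q1 * dirty / soft_limit;
    }
    float mid = (soft_limit + hard_limit) / 2;
    if (dirty < mid) {
        // First half: sluggish slope from q1 up to q2.
        return q1 + (q2 - q1) * (dirty - soft_limit) / (mid - soft_limit);
    }
    // Second half: steeper slope, capped at qmax near the hard limit.
    float q = q2 + (qmax - q2) * (dirty - mid) / (hard_limit - mid);
    return std::min(q, qmax);
}
```

The controller then applies this quota by adjusting the shares of the flush `thread_scheduling_group` on each timer tick.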


@@ -46,6 +46,7 @@ options {
#include "cql3/statements/drop_type_statement.hh"
#include "cql3/statements/alter_type_statement.hh"
#include "cql3/statements/property_definitions.hh"
#include "cql3/statements/drop_index_statement.hh"
#include "cql3/statements/drop_table_statement.hh"
#include "cql3/statements/drop_view_statement.hh"
#include "cql3/statements/truncate_statement.hh"
@@ -318,9 +319,7 @@ cqlStatement returns [shared_ptr<raw::parsed_statement> stmt]
| st10=createIndexStatement { $stmt = st10; }
| st11=dropKeyspaceStatement { $stmt = st11; }
| st12=dropTableStatement { $stmt = st12; }
#if 0
| st13=dropIndexStatement { $stmt = st13; }
#endif
| st14=alterTableStatement { $stmt = st14; }
| st15=alterKeyspaceStatement { $stmt = st15; }
| st16=grantStatement { $stmt = st16; }
@@ -778,12 +777,13 @@ createIndexStatement returns [::shared_ptr<create_index_statement> expr]
auto props = make_shared<index_prop_defs>();
bool if_not_exists = false;
auto name = ::make_shared<cql3::index_name>();
std::vector<::shared_ptr<index_target::raw>> targets;
}
: K_CREATE (K_CUSTOM { props->is_custom = true; })? K_INDEX (K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
(idxName[name])? K_ON cf=columnFamilyName '(' id=indexIdent ')'
(idxName[name])? K_ON cf=columnFamilyName '(' (target1=indexIdent { targets.emplace_back(target1); } (',' target2=indexIdent { targets.emplace_back(target2); } )*)? ')'
(K_USING cls=STRING_LITERAL { props->custom_class = sstring{$cls.text}; })?
(K_WITH properties[props])?
{ $expr = ::make_shared<create_index_statement>(cf, name, id, props, if_not_exists); }
{ $expr = ::make_shared<create_index_statement>(cf, name, targets, props, if_not_exists); }
;
indexIdent returns [::shared_ptr<index_target::raw> id]
@@ -957,16 +957,14 @@ dropViewStatement returns [::shared_ptr<drop_view_statement> stmt]
{ $stmt = ::make_shared<drop_view_statement>(cf, if_exists); }
;
#if 0
/**
* DROP INDEX [IF EXISTS] <INDEX_NAME>
*/
dropIndexStatement returns [DropIndexStatement expr]
@init { boolean ifExists = false; }
: K_DROP K_INDEX (K_IF K_EXISTS { ifExists = true; } )? index=indexName
{ $expr = new DropIndexStatement(index, ifExists); }
dropIndexStatement returns [::shared_ptr<drop_index_statement> expr]
@init { bool if_exists = false; }
: K_DROP K_INDEX (K_IF K_EXISTS { if_exists = true; } )? index=indexName
{ $expr = ::make_shared<drop_index_statement>(index, if_exists); }
;
#endif
/**
* TRUNCATE <CF>;
@@ -1303,6 +1301,10 @@ normalColumnOperation[operations_type& operations, ::shared_ptr<cql3::column_ide
}
add_raw_update(operations, key, make_shared<cql3::operation::addition>(cql3::constants::literal::integer($i.text)));
}
| K_SCYLLA_COUNTER_SHARD_LIST '(' t=term ')'
{
add_raw_update(operations, key, ::make_shared<cql3::operation::set_counter_value_from_tuple_list>(t));
}
;
specializedColumnOperation[std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>,
@@ -1548,6 +1550,8 @@ basic_unreserved_keyword returns [sstring str]
| K_DISTINCT
| K_CONTAINS
| K_STATIC
| K_FROZEN
| K_TUPLE
| K_FUNCTION
| K_AGGREGATE
| K_SFUNC
@@ -1688,6 +1692,7 @@ K_REPLACE: R E P L A C E;
K_DETERMINISTIC: D E T E R M I N I S T I C;
K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T;
// Case-insensitive alpha characters
fragment A: ('a'|'A');


@@ -23,6 +23,8 @@
#include "exceptions/exceptions.hh"
#include "cql3/selection/simple_selector.hh"
#include <regex>
namespace cql3 {
column_identifier::column_identifier(sstring raw_text, bool keep_case) {
@@ -59,6 +61,17 @@ sstring column_identifier::to_string() const {
return _text;
}
sstring column_identifier::to_cql_string() const {
static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
if (std::regex_match(_text.begin(), _text.end(), unquoted_identifier_re)) {
return _text;
}
static const std::regex double_quote_re("\"");
std::string result = _text;
result = std::regex_replace(result, double_quote_re, "\"\"");
return '"' + result + '"';
}
column_identifier::raw::raw(sstring raw_text, bool keep_case)
: _raw_text{raw_text}
, _text{raw_text}
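The quoting rule that `to_cql_string()` implements — identifiers matching `[a-z][a-z0-9_]*` stay bare, everything else is double-quoted with embedded quotes doubled — can be sketched as a free function (a self-contained restatement, not the class member itself):

```cpp
#include <regex>
#include <string>

// CQL identifier quoting: lowercase identifiers may appear unquoted;
// anything else is wrapped in double quotes, and any embedded double
// quote is escaped by doubling it.
std::string quote_cql_identifier(const std::string& text) {
    static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
    if (std::regex_match(text, unquoted_identifier_re)) {
        return text;
    }
    static const std::regex double_quote_re("\"");
    return '"' + std::regex_replace(text, double_quote_re, "\"\"") + '"';
}
```

Note that `std::regex_replace` returns the transformed string rather than modifying its argument in place, which is why the result must be used directly or assigned.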


@@ -80,6 +80,8 @@ public:
sstring to_string() const;
sstring to_cql_string() const;
friend std::ostream& operator<<(std::ostream& out, const column_identifier& i) {
return out << i._text;
}


@@ -159,7 +159,7 @@ constants::literal::prepare(database& db, const sstring& keyspace, ::shared_ptr<
return ::make_shared<value>(cql3::raw_value::make_value(parsed_value(receiver->type)));
}
void constants::deleter::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
void constants::deleter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
if (column.type->is_multi_cell()) {
collection_type_impl::mutation coll_m;
coll_m.tomb = params.make_tombstone();


@@ -197,7 +197,7 @@ public:
public:
using operation::operation;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override {
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
if (value.is_null()) {
m.set_cell(prefix, column, std::move(make_dead_cell(params)));
@@ -210,7 +210,7 @@ public:
struct adder final : operation {
using operation::operation;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override {
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
if (value.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment");
@@ -225,7 +225,7 @@ public:
struct subtracter final : operation {
using operation::operation;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override {
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
if (value.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment");
@@ -246,7 +246,7 @@ public:
: operation(column, {})
{ }
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
};


@@ -19,11 +19,39 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <iostream>
#include <iterator>
#include <regex>
#include "cql3_type.hh"
#include "cql3/util.hh"
#include "ut_name.hh"
namespace cql3 {
sstring cql3_type::to_string() const {
if (_type->is_user_type()) {
return "frozen<" + util::maybe_quote(_name) + ">";
}
if (_type->is_tuple()) {
return "frozen<" + _name + ">";
}
return _name;
}
shared_ptr<cql3_type> cql3_type::raw::prepare(database& db, const sstring& keyspace) {
try {
auto&& ks = db.find_keyspace(keyspace);
return prepare_internal(keyspace, ks.metadata()->user_types());
} catch (no_such_keyspace& nsk) {
throw exceptions::invalid_request_exception("Unknown keyspace " + keyspace);
}
}
bool cql3_type::raw::references_user_type(const sstring& name) const {
return false;
}
class cql3_type::raw_type : public raw {
private:
shared_ptr<cql3_type> _type;
@@ -35,6 +63,9 @@ public:
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) {
return _type;
}
shared_ptr<cql3_type> prepare_internal(const sstring&, lw_shared_ptr<user_types_metadata>) override {
return _type;
}
virtual bool supports_freezing() const {
return false;
@@ -76,7 +107,7 @@ public:
return true;
}
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) override {
virtual shared_ptr<cql3_type> prepare_internal(const sstring& keyspace, lw_shared_ptr<user_types_metadata> user_types) override {
assert(_values); // "Got null values type for a collection";
if (!_frozen && _values->supports_freezing() && !_values->_frozen) {
@@ -93,16 +124,20 @@ public:
}
if (_kind == &collection_type_impl::kind::list) {
return make_shared(cql3_type(to_string(), list_type_impl::get_instance(_values->prepare(db, keyspace)->get_type(), !_frozen), false));
return make_shared(cql3_type(to_string(), list_type_impl::get_instance(_values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
} else if (_kind == &collection_type_impl::kind::set) {
return make_shared(cql3_type(to_string(), set_type_impl::get_instance(_values->prepare(db, keyspace)->get_type(), !_frozen), false));
return make_shared(cql3_type(to_string(), set_type_impl::get_instance(_values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
} else if (_kind == &collection_type_impl::kind::map) {
assert(_keys); // "Got null keys type for a collection";
return make_shared(cql3_type(to_string(), map_type_impl::get_instance(_keys->prepare(db, keyspace)->get_type(), _values->prepare(db, keyspace)->get_type(), !_frozen), false));
return make_shared(cql3_type(to_string(), map_type_impl::get_instance(_keys->prepare_internal(keyspace, user_types)->get_type(), _values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
}
abort();
}
bool references_user_type(const sstring& name) const override {
return (_keys && _keys->references_user_type(name)) || _values->references_user_type(name);
}
virtual sstring to_string() const override {
sstring start = _frozen ? "frozen<" : "";
sstring end = _frozen ? ">" : "";
@@ -132,7 +167,7 @@ public:
_frozen = true;
}
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) override {
virtual shared_ptr<cql3_type> prepare_internal(const sstring& keyspace, lw_shared_ptr<user_types_metadata> user_types) override {
if (_name.has_keyspace()) {
// The provided keyspace is the one of the current statement this is part of. If it's different from the keyspace of
// the UTName, we reject since we want to limit user types to their own keyspace (see #6643)
@@ -144,23 +179,23 @@ public:
} else {
_name.set_keyspace(keyspace);
}
if (!user_types) {
// bootstrap mode.
throw exceptions::invalid_request_exception(sprint("Unknown type %s", _name));
}
try {
auto&& ks = db.find_keyspace(_name.get_keyspace());
try {
auto&& type = ks.metadata()->user_types()->get_type(_name.get_user_type_name());
if (!_frozen) {
throw exceptions::invalid_request_exception("Non-frozen User-Defined types are not supported, please use frozen<>");
}
return make_shared<cql3_type>(_name.to_string(), std::move(type));
} catch (std::out_of_range& e) {
throw exceptions::invalid_request_exception(sprint("Unknown type %s", _name));
auto&& type = user_types->get_type(_name.get_user_type_name());
if (!_frozen) {
throw exceptions::invalid_request_exception("Non-frozen User-Defined types are not supported, please use frozen<>");
}
} catch (no_such_keyspace& nsk) {
throw exceptions::invalid_request_exception("Unknown keyspace " + _name.get_keyspace());
return make_shared<cql3_type>(_name.to_string(), std::move(type));
} catch (std::out_of_range& e) {
throw exceptions::invalid_request_exception(sprint("Unknown type %s", _name));
}
}
bool references_user_type(const sstring& name) const override {
return _name.get_string_type_name() == name;
}
virtual bool supports_freezing() const override {
return true;
}
@@ -191,7 +226,7 @@ public:
}
_frozen = true;
}
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) override {
virtual shared_ptr<cql3_type> prepare_internal(const sstring& keyspace, lw_shared_ptr<user_types_metadata> user_types) override {
if (!_frozen) {
freeze();
}
@@ -200,10 +235,17 @@ public:
if (t->is_counter()) {
throw exceptions::invalid_request_exception("Counters are not allowed inside tuples");
}
ts.push_back(t->prepare(db, keyspace)->get_type());
ts.push_back(t->prepare_internal(keyspace, user_types)->get_type());
}
return make_cql3_tuple_type(tuple_type_impl::get_instance(std::move(ts)));
}
bool references_user_type(const sstring& name) const override {
return std::any_of(_types.begin(), _types.end(), [&name](auto t) {
return t->references_user_type(name);
});
}
virtual sstring to_string() const override {
return sprint("tuple<%s>", join(", ", _types));
}
@@ -271,6 +313,7 @@ thread_local shared_ptr<cql3_type> cql3_type::bigint = make("bigint", long_type,
thread_local shared_ptr<cql3_type> cql3_type::blob = make("blob", bytes_type, cql3_type::kind::BLOB);
thread_local shared_ptr<cql3_type> cql3_type::boolean = make("boolean", boolean_type, cql3_type::kind::BOOLEAN);
thread_local shared_ptr<cql3_type> cql3_type::double_ = make("double", double_type, cql3_type::kind::DOUBLE);
thread_local shared_ptr<cql3_type> cql3_type::empty = make("empty", empty_type, cql3_type::kind::EMPTY);
thread_local shared_ptr<cql3_type> cql3_type::float_ = make("float", float_type, cql3_type::kind::FLOAT);
thread_local shared_ptr<cql3_type> cql3_type::int_ = make("int", int32_type, cql3_type::kind::INT);
thread_local shared_ptr<cql3_type> cql3_type::smallint = make("smallint", short_type, cql3_type::kind::SMALLINT);
@@ -297,8 +340,9 @@ cql3_type::values() {
cql3_type::counter,
cql3_type::decimal,
cql3_type::double_,
cql3_type::empty,
cql3_type::float_,
cql3_type:inet,
cql3_type::inet,
cql3_type::int_,
cql3_type::smallint,
cql3_type::text,
@@ -329,5 +373,23 @@ operator<<(std::ostream& os, const cql3_type::raw& r) {
return os << r.to_string();
}
namespace util {
sstring maybe_quote(const sstring& s) {
static const std::regex unquoted("\\w*");
static const std::regex double_quote("\"");
if (std::regex_match(s.begin(), s.end(), unquoted)) {
return s;
}
std::ostringstream ss;
ss << "\"";
std::regex_replace(std::ostreambuf_iterator<char>(ss), s.begin(), s.end(), double_quote, "\"\"");
ss << "\"";
return ss.str();
}
}
}


@@ -47,6 +47,7 @@
#include "enum_set.hh"
class database;
class user_types_metadata;
namespace cql3 {
@@ -63,19 +64,22 @@ public:
bool is_counter() const { return _type->is_counter(); }
bool is_native() const { return _native; }
data_type get_type() const { return _type; }
sstring to_string() const { return _name; }
sstring to_string() const;
// For UserTypes, we need to know the current keyspace to resolve the
// actual type used, so Raw is a "not yet prepared" CQL3Type.
class raw {
public:
virtual ~raw() {}
bool _frozen = false;
virtual bool supports_freezing() const = 0;
virtual bool is_collection() const;
virtual bool is_counter() const;
virtual bool references_user_type(const sstring&) const;
virtual std::experimental::optional<sstring> keyspace() const;
virtual void freeze();
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) = 0;
virtual shared_ptr<cql3_type> prepare_internal(const sstring& keyspace, lw_shared_ptr<user_types_metadata>) = 0;
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace);
static shared_ptr<raw> from(shared_ptr<cql3_type> type);
static shared_ptr<raw> user_type(ut_name name);
static shared_ptr<raw> map(shared_ptr<raw> t1, shared_ptr<raw> t2);
@@ -98,7 +102,7 @@ private:
public:
enum class kind : int8_t {
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, TINYINT, INET, TEXT, TIMESTAMP, UUID, VARCHAR, VARINT, TIMEUUID, DATE, TIME
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DECIMAL, DOUBLE, EMPTY, FLOAT, INT, SMALLINT, TINYINT, INET, TEXT, TIMESTAMP, UUID, VARCHAR, VARINT, TIMEUUID, DATE, TIME
};
using kind_enum = super_enum<kind,
kind::ASCII,
@@ -108,6 +112,7 @@ public:
kind::COUNTER,
kind::DECIMAL,
kind::DOUBLE,
kind::EMPTY,
kind::FLOAT,
kind::INET,
kind::INT,
@@ -133,6 +138,7 @@ public:
static thread_local shared_ptr<cql3_type> blob;
static thread_local shared_ptr<cql3_type> boolean;
static thread_local shared_ptr<cql3_type> double_;
static thread_local shared_ptr<cql3_type> empty;
static thread_local shared_ptr<cql3_type> float_;
static thread_local shared_ptr<cql3_type> int_;
static thread_local shared_ptr<cql3_type> smallint;


@@ -46,7 +46,7 @@
#include "service/storage_proxy.hh"
#include "cql3/query_options.hh"
namespace transport {
namespace cql_transport {
namespace messages {
@@ -89,7 +89,7 @@ public:
* @param state the current query state
* @param options options for this query (consistency, variables, pageSize, ...)
*/
virtual future<::shared_ptr<transport::messages::result_message>>
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
/**
@@ -97,7 +97,7 @@ public:
*
* @param state the current query state
*/
virtual future<::shared_ptr<transport::messages::result_message>>
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const = 0;


@@ -41,6 +41,7 @@
#pragma once
#include "seastarx.hh"
#include <seastar/core/sstring.hh>
#include <antlr3.hpp>


@@ -67,6 +67,18 @@ functions::init() {
declare(aggregate_fcts::make_max_function<int64_t>());
declare(aggregate_fcts::make_min_function<int64_t>());
declare(aggregate_fcts::make_count_function<float>());
declare(aggregate_fcts::make_max_function<float>());
declare(aggregate_fcts::make_min_function<float>());
declare(aggregate_fcts::make_count_function<double>());
declare(aggregate_fcts::make_max_function<double>());
declare(aggregate_fcts::make_min_function<double>());
declare(aggregate_fcts::make_count_function<sstring>());
declare(aggregate_fcts::make_max_function<sstring>());
declare(aggregate_fcts::make_min_function<sstring>());
//FIXME:
//declare(aggregate_fcts::make_count_function<bytes>());
//declare(aggregate_fcts::make_max_function<bytes>());
@@ -78,15 +90,17 @@ functions::init() {
declare(make_blob_as_varchar_fct());
declare(aggregate_fcts::make_sum_function<int32_t>());
declare(aggregate_fcts::make_sum_function<int64_t>());
declare(aggregate_fcts::make_avg_function<int32_t>());
declare(aggregate_fcts::make_avg_function<int64_t>());
declare(aggregate_fcts::make_sum_function<float>());
declare(aggregate_fcts::make_sum_function<double>());
#if 0
declare(AggregateFcts.sumFunctionForFloat);
declare(AggregateFcts.sumFunctionForDouble);
declare(AggregateFcts.sumFunctionForDecimal);
declare(AggregateFcts.sumFunctionForVarint);
declare(AggregateFcts.avgFunctionForFloat);
declare(AggregateFcts.avgFunctionForDouble);
#endif
declare(aggregate_fcts::make_avg_function<int32_t>());
declare(aggregate_fcts::make_avg_function<int64_t>());
declare(aggregate_fcts::make_avg_function<float>());
declare(aggregate_fcts::make_avg_function<double>());
#if 0
declare(AggregateFcts.avgFunctionForVarint);
declare(AggregateFcts.avgFunctionForDecimal);
#endif


@@ -42,6 +42,7 @@
#pragma once
#include "core/sstring.hh"
#include "seastarx.hh"
#include <experimental/optional>


@@ -111,7 +111,7 @@ lists::literal::test_assignment(database& db, const sstring& keyspace, shared_pt
sstring
lists::literal::to_string() const {
return ::to_string(_elements);
return std::to_string(_elements);
}
lists::value
@@ -242,7 +242,7 @@ lists::precision_time::get_next(db_clock::time_point millis) {
}
void
lists::setter::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
const auto& value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
@@ -270,15 +270,10 @@ lists::setter_by_index::collect_marker_specification(shared_ptr<variable_specifi
}
void
lists::setter_by_index::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
// we should not get here for frozen lists
assert(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
auto index = _idx->bind_and_get(params._options);
if (index.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for list index");
@@ -292,7 +287,7 @@ lists::setter_by_index::execute(mutation& m, const exploded_clustering_prefix& p
}
auto idx = net::ntoh(int32_t(*unaligned_cast<int32_t>(index->begin())));
auto&& existing_list_opt = params.get_prefetched_list(m.key(), std::move(row_key), column);
auto&& existing_list_opt = params.get_prefetched_list(m.key().view(), prefix.view(), column);
if (!existing_list_opt) {
throw exceptions::invalid_request_exception("Attempted to set an element on a list which is null");
}
@@ -327,15 +322,10 @@ lists::setter_by_uuid::requires_read() {
}
void
lists::setter_by_uuid::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
// we should not get here for frozen lists
assert(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
auto index = _idx->bind_and_get(params._options);
auto value = _t->bind_and_get(params._options);
@@ -355,7 +345,7 @@ lists::setter_by_uuid::execute(mutation& m, const exploded_clustering_prefix& pr
}
void
lists::appender::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
lists::appender::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
const auto& value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
@@ -367,7 +357,7 @@ lists::appender::execute(mutation& m, const exploded_clustering_prefix& prefix,
void
lists::do_append(shared_ptr<term> value,
mutation& m,
const exploded_clustering_prefix& prefix,
const clustering_key_prefix& prefix,
const column_definition& column,
const update_parameters& params) {
auto&& list_value = dynamic_pointer_cast<lists::value>(value);
@@ -401,7 +391,7 @@ lists::do_append(shared_ptr<term> value,
}
void
lists::prepender::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to prepend to a frozen list";
auto&& value = _t->bind(params._options);
if (!value || value == constants::UNSET_VALUE) {
@@ -433,15 +423,10 @@ lists::discarder::requires_read() {
}
void
lists::discarder::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
lists::discarder::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to delete from a frozen list";
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
auto&& existing_list = params.get_prefetched_list(m.key(), std::move(row_key), column);
auto&& existing_list = params.get_prefetched_list(m.key().view(), prefix.view(), column);
// We want to call bind before possibly returning to reject queries where the value provided is not a list.
auto&& value = _t->bind(params._options);
@@ -490,7 +475,7 @@ lists::discarder_by_index::requires_read() {
}
void
lists::discarder_by_index::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
lists::discarder_by_index::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to delete an item by index from a frozen list";
auto&& index = _t->bind(params._options);
if (!index) {
@@ -504,11 +489,7 @@ lists::discarder_by_index::execute(mutation& m, const exploded_clustering_prefix
auto cvalue = dynamic_pointer_cast<constants::value>(index);
assert(cvalue);
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
auto&& existing_list_opt = params.get_prefetched_list(m.key(), std::move(row_key), column);
auto&& existing_list_opt = params.get_prefetched_list(m.key().view(), prefix.view(), column);
int32_t idx = read_simple_exactly<int32_t>(*cvalue->_bytes);
if (!existing_list_opt) {
throw exceptions::invalid_request_exception("Attempted to delete an element from a list which is null");


@@ -146,7 +146,7 @@ public:
setter(const column_definition& column, shared_ptr<term> t)
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
class setter_by_index : public operation {
@@ -158,7 +158,7 @@ public:
}
virtual bool requires_read() override;
virtual void collect_marker_specification(shared_ptr<variable_specifications> bound_names);
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
class setter_by_uuid : public setter_by_index {
@@ -167,25 +167,25 @@ public:
: setter_by_index(column, std::move(idx), std::move(t)) {
}
virtual bool requires_read() override;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
class appender : public operation {
public:
using operation::operation;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
static void do_append(shared_ptr<term> value,
mutation& m,
const exploded_clustering_prefix& prefix,
const clustering_key_prefix& prefix,
const column_definition& column,
const update_parameters& params);
class prepender : public operation {
public:
using operation::operation;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
class discarder : public operation {
@@ -194,7 +194,7 @@ public:
: operation(column, std::move(t)) {
}
virtual bool requires_read() override;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
class discarder_by_index : public operation {
@@ -203,7 +203,7 @@ public:
: operation(column, std::move(idx)) {
}
virtual bool requires_read() override;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params);
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params);
};
};


@@ -269,7 +269,7 @@ maps::marker::bind(const query_options& options) {
}
void
maps::setter::execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) {
maps::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
auto value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
@@ -292,7 +292,7 @@ maps::setter_by_key::collect_marker_specification(shared_ptr<variable_specificat
}
void
maps::setter_by_key::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
using exceptions::invalid_request_exception;
assert(column.type->is_multi_cell()); // "Attempted to set a value for a single key on a frozen map";
auto key = _k->bind_and_get(params._options);
@@ -315,7 +315,7 @@ maps::setter_by_key::execute(mutation& m, const exploded_clustering_prefix& pref
}
void
maps::putter::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
maps::putter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to add items to a frozen map";
auto value = _t->bind(params._options);
if (value != constants::UNSET_VALUE) {
@@ -324,7 +324,7 @@ maps::putter::execute(mutation& m, const exploded_clustering_prefix& prefix, con
}
void
maps::do_put(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params,
maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params,
shared_ptr<term> value, const column_definition& column) {
auto map_value = dynamic_pointer_cast<maps::value>(value);
if (column.type->is_multi_cell()) {
@@ -353,7 +353,7 @@ maps::do_put(mutation& m, const exploded_clustering_prefix& prefix, const update
}
void
maps::discarder_by_key::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
maps::discarder_by_key::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to delete a single key in a frozen map";
auto&& key = _t->bind(params._options);
if (!key) {


@@ -116,7 +116,7 @@ public:
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
};
class setter_by_key : public operation {
@@ -126,7 +126,7 @@ public:
: operation(column, std::move(t)), _k(std::move(k)) {
}
virtual void collect_marker_specification(shared_ptr<variable_specifications> bound_names) override;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
class putter : public operation {
@@ -134,10 +134,10 @@ public:
putter(const column_definition& column, shared_ptr<term> t)
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
static void do_put(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params,
static void do_put(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params,
shared_ptr<term> value, const column_definition& column);
class discarder_by_key : public operation {
@@ -145,7 +145,7 @@ public:
discarder_by_key(const column_definition& column, shared_ptr<term> k)
: operation(column, std::move(k)) {
}
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
};


@@ -36,6 +36,7 @@
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <utility>
#include "operation.hh"
#include "operation_impl.hh"
@@ -192,6 +193,78 @@ operation::set_value::prepare(database& db, const sstring& keyspace, const colum
}
}
::shared_ptr <operation>
operation::set_counter_value_from_tuple_list::prepare(database& db, const sstring& keyspace, const column_definition& receiver) {
static thread_local const data_type counter_tuple_type = tuple_type_impl::get_instance({int32_type, uuid_type, long_type, long_type});
static thread_local const data_type counter_tuple_list_type = list_type_impl::get_instance(counter_tuple_type, true);
if (!receiver.type->is_counter()) {
throw exceptions::invalid_request_exception(sprint("Column %s is not a counter", receiver.name_as_text()));
}
// We need to fake a column of list<tuple<...>> to prepare the value term
auto & os = receiver.column_specification;
auto spec = make_shared<cql3::column_specification>(os->ks_name, os->cf_name, os->name, counter_tuple_list_type);
auto v = _value->prepare(db, keyspace, spec);
// Will not be used elsewhere, so make it local.
class counter_setter : public operation {
public:
using operation::operation;
bool is_raw_counter_shard_write() const override {
return true;
}
void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
const auto& value = _t->bind(params._options);
auto&& list_value = dynamic_pointer_cast<lists::value>(value);
if (!list_value) {
throw std::invalid_argument("Invalid input data to counter set");
}
counter_id last(utils::UUID(0, 0));
counter_cell_builder ccb(list_value->_elements.size());
for (auto& bo : list_value->_elements) {
// Failures from the value casts below should be enough type checking here.
auto tuple = value_cast<tuple_type_impl::native_type>(counter_tuple_type->deserialize(*bo));
auto shard = value_cast<int>(tuple[0]);
auto id = counter_id(value_cast<utils::UUID>(tuple[1]));
auto clock = value_cast<int64_t>(tuple[2]);
auto value = value_cast<int64_t>(tuple[3]);
using namespace std::rel_ops;
if (id <= last) {
throw marshal_exception(
sprint("invalid counter id order, %s <= %s",
id.to_uuid().to_sstring(),
last.to_uuid().to_sstring()));
}
last = id;
// TODO: maybe allow more than global values to propagate,
// though we don't (yet at least) in sstable::partition so...
switch (shard) {
case 'g':
ccb.add_shard(counter_shard(id, value, clock));
break;
case 'l':
throw marshal_exception("encountered a local shard in a counter cell");
case 'r':
throw marshal_exception("encountered remote shards in a counter cell");
default:
throw marshal_exception(sprint("encountered unknown shard %d in a counter cell", shard));
}
}
// Note. this is a counter value cell, not an update.
// see counters.cc, we need to detect this.
m.set_cell(prefix, column, ccb.build(params.timestamp()));
}
};
return make_shared<counter_setter>(receiver, v);
};
bool
operation::set_value::is_compatible_with(::shared_ptr <raw_update> other) {
// We don't allow setting multiple time the same column, because 1)


@@ -103,6 +103,10 @@ public:
return _t && _t->uses_function(ks_name, function_name);
}
virtual bool is_raw_counter_shard_write() const {
return false;
}
/**
* @return whether the operation requires a read of the previous value to be executed
* (only lists setterByIdx, discard and discardByIdx requires that).
@@ -126,7 +130,7 @@ public:
/**
* Execute the operation.
*/
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) = 0;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) = 0;
/**
* A parsed raw UPDATE operation.
@@ -193,6 +197,7 @@ public:
};
class set_value;
class set_counter_value_from_tuple_list;
class set_element : public raw_update {
const shared_ptr<term::raw> _selector;


@@ -50,7 +50,7 @@
namespace cql3 {
class operation::set_value : public raw_update {
private:
protected:
::shared_ptr<term::raw> _value;
public:
set_value(::shared_ptr<term::raw> value) : _value(std::move(value)) {}
@@ -67,6 +67,12 @@ public:
virtual bool is_compatible_with(::shared_ptr <raw_update> other) override;
};
class operation::set_counter_value_from_tuple_list : public set_value {
public:
using set_value::set_value;
::shared_ptr <operation> prepare(database& db, const sstring& keyspace, const column_definition& receiver) override;
};
class operation::column_deletion : public raw_deletion {
private:
::shared_ptr<column_identifier::raw> _id;


@@ -44,6 +44,7 @@
#include <cstddef>
#include <iosfwd>
#include "core/sstring.hh"
#include "seastarx.hh"
namespace cql3 {


@@ -0,0 +1,171 @@
/*
* Copyright (C) 2017 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "utils/loading_cache.hh"
#include "cql3/statements/prepared_statement.hh"
namespace cql3 {
using prepared_cache_entry = std::unique_ptr<statements::prepared_statement>;
struct prepared_cache_entry_size {
size_t operator()(const prepared_cache_entry& val) {
// TODO: improve the size approximation
return 10000;
}
};
typedef bytes cql_prepared_id_type;
typedef int32_t thrift_prepared_id_type;
/// \brief The key of the prepared statements cache
///
/// We are going to store CQL and Thrift prepared statements in the same cache, therefore we need to generate a key
/// that is unique in both cases. Thrift uses an int32_t as a prepared statement ID, while CQL uses an MD5 digest.
///
/// We are going to use an std::pair<CQL_PREP_ID_TYPE, int64_t> as a key. For CQL statements we will use {CQL_PREP_ID, std::numeric_limits<int64_t>::max()} as a key
/// and for Thrift - {CQL_PREP_ID_TYPE(0), THRIFT_PREP_ID}. This way CQL and Thrift key values will never collide.
class prepared_cache_key_type {
public:
using cache_key_type = std::pair<cql_prepared_id_type, int64_t>;
private:
cache_key_type _key;
public:
prepared_cache_key_type() = default;
explicit prepared_cache_key_type(cql_prepared_id_type cql_id) : _key(std::move(cql_id), std::numeric_limits<int64_t>::max()) {}
explicit prepared_cache_key_type(thrift_prepared_id_type thrift_id) : _key(cql_prepared_id_type(), thrift_id) {}
cache_key_type& key() { return _key; }
const cache_key_type& key() const { return _key; }
static const cql_prepared_id_type& cql_id(const prepared_cache_key_type& key) {
return key.key().first;
}
static thrift_prepared_id_type thrift_id(const prepared_cache_key_type& key) {
return key.key().second;
}
};
class prepared_statements_cache {
public:
struct stats {
uint64_t prepared_cache_evictions = 0;
};
static stats& shard_stats() {
static thread_local stats _stats;
return _stats;
}
struct prepared_cache_stats_updater {
static void inc_hits() noexcept {}
static void inc_misses() noexcept {}
static void inc_blocks() noexcept {}
static void inc_evictions() noexcept {
++shard_stats().prepared_cache_evictions;
}
};
private:
using cache_key_type = typename prepared_cache_key_type::cache_key_type;
using cache_type = utils::loading_cache<cache_key_type, prepared_cache_entry, utils::loading_cache_reload_enabled::no, prepared_cache_entry_size, utils::tuple_hash, std::equal_to<cache_key_type>, prepared_cache_stats_updater>;
using cache_value_ptr = typename cache_type::value_ptr;
using cache_iterator = typename cache_type::iterator;
using checked_weak_ptr = typename statements::prepared_statement::checked_weak_ptr;
struct value_extractor_fn {
checked_weak_ptr operator()(prepared_cache_entry& e) const {
return e->checked_weak_from_this();
}
};
static const std::chrono::minutes entry_expiry;
public:
using key_type = prepared_cache_key_type;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
/// \note both iterator::reference and iterator::value_type are checked_weak_ptr
using iterator = boost::transform_iterator<value_extractor_fn, cache_iterator>;
private:
cache_type _cache;
value_extractor_fn _value_extractor_fn;
public:
prepared_statements_cache(logging::logger& logger)
: _cache(memory::stats().total_memory() / 256, entry_expiry, logger)
{}
template <typename LoadFunc>
future<value_type> get(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); }).then([] (cache_value_ptr v_ptr) {
return make_ready_future<value_type>((*v_ptr)->checked_weak_from_this());
});
}
iterator find(const key_type& key) {
return boost::make_transform_iterator(_cache.find(key.key()), _value_extractor_fn);
}
iterator end() {
return boost::make_transform_iterator(_cache.end(), _value_extractor_fn);
}
iterator begin() {
return boost::make_transform_iterator(_cache.begin(), _value_extractor_fn);
}
template <typename Pred>
void remove_if(Pred&& pred) {
static_assert(std::is_same<bool, std::result_of_t<Pred(::shared_ptr<cql_statement>)>>::value, "Bad Pred signature");
_cache.remove_if([&pred] (const prepared_cache_entry& e) {
return pred(e->statement);
});
}
size_t size() const {
return _cache.size();
}
size_t memory_footprint() const {
return _cache.memory_footprint();
}
};
}
namespace std { // for prepared_statements_cache log printouts
inline std::ostream& operator<<(std::ostream& os, const typename cql3::prepared_cache_key_type::cache_key_type& p) {
os << "{cql_id: " << p.first << ", thrift_id: " << p.second << "}";
return os;
}
inline std::ostream& operator<<(std::ostream& os, const cql3::prepared_cache_key_type& p) {
os << p.key();
return os;
}
}


@@ -82,17 +82,6 @@ query_options::query_options(db::consistency_level consistency,
{
}
query_options::query_options(query_options&& o, std::vector<std::vector<cql3::raw_value_view>> value_views)
: query_options(std::move(o))
{
std::vector<query_options> tmp;
tmp.reserve(value_views.size());
std::transform(value_views.begin(), value_views.end(), std::back_inserter(tmp), [this](auto& vals) {
return query_options(_consistency, {}, vals, _skip_metadata, _options, _cql_serialization_format);
});
_batch_options = std::move(tmp);
}
query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values)
: query_options(
cl,


@@ -41,6 +41,7 @@
#pragma once
#include <seastar/util/gcc6-concepts.hh>
#include "timestamp.hh"
#include "bytes.hh"
#include "db/consistency_level.hh"
@@ -77,6 +78,26 @@ private:
const specific_options _options;
cql_serialization_format _cql_serialization_format;
std::experimental::optional<std::vector<query_options>> _batch_options;
private:
/**
* @brief Batch query_options constructor.
*
* Requirements:
* - @tparam OneMutationDataRange has a begin() and end() iterators.
* - The values of @tparam OneMutationDataRange are of either raw_value_view or raw_value types.
*
* @param o Base query_options object. query_options objects for each statement in the batch will derive the values from it.
* @param values_ranges a vector of values ranges for each statement in the batch.
*/
template<typename OneMutationDataRange>
GCC6_CONCEPT( requires requires (OneMutationDataRange range) {
std::begin(range);
std::end(range);
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> raw_value_view; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> raw_value; } ) )
explicit query_options(query_options&& o, std::vector<OneMutationDataRange> values_ranges);
public:
query_options(query_options&&) = default;
query_options(const query_options&) = delete;
@@ -94,8 +115,25 @@ public:
specific_options options,
cql_serialization_format sf);
// Batch query_options constructor
explicit query_options(query_options&&, std::vector<std::vector<cql3::raw_value_view>> value_views);
/**
* @brief Batch query_options factory.
*
* Requirements:
* - @tparam OneMutationDataRange has a begin() and end() iterators.
* - The values of @tparam OneMutationDataRange are of either raw_value_view or raw_value types.
*
* @param o Base query_options object. query_options objects for each statement in the batch will derive the values from it.
* @param values_ranges a vector of values ranges for each statement in the batch.
*/
template<typename OneMutationDataRange>
GCC6_CONCEPT( requires requires (OneMutationDataRange range) {
std::begin(range);
std::end(range);
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> raw_value_view; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> raw_value; } ) )
static query_options make_batch_options(query_options&& o, std::vector<OneMutationDataRange> values_ranges) {
return query_options(std::move(o), std::move(values_ranges));
}
// It can't be const because of prepare()
static thread_local query_options DEFAULT;
@@ -130,4 +168,21 @@ private:
void fill_value_views();
};
template<typename OneMutationDataRange>
GCC6_CONCEPT( requires requires (OneMutationDataRange range) {
std::begin(range);
std::end(range);
} && ( requires (OneMutationDataRange range) { { *range.begin() } -> raw_value_view; } ||
requires (OneMutationDataRange range) { { *range.begin() } -> raw_value; } ) )
query_options::query_options(query_options&& o, std::vector<OneMutationDataRange> values_ranges)
: query_options(std::move(o))
{
std::vector<query_options> tmp;
tmp.reserve(values_ranges.size());
std::transform(values_ranges.begin(), values_ranges.end(), std::back_inserter(tmp), [this](auto& values_range) {
return query_options(_consistency, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
});
_batch_options = std::move(tmp);
}
}


@@ -54,14 +54,17 @@
namespace cql3 {
using namespace statements;
using namespace transport::messages;
using namespace cql_transport::messages;
logging::logger log("query_processor");
logging::logger prep_cache_log("prepared_statements_cache");
distributed<query_processor> _the_query_processor;
const sstring query_processor::CQL_VERSION = "3.3.1";
const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono::minutes(60);
class query_processor::internal_state {
service::query_state _qs;
public:
@@ -95,6 +98,7 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy,
, _proxy(proxy)
, _db(db)
, _internal_state(new internal_state())
, _prepared_cache(prep_cache_log)
{
namespace sm = seastar::metrics;
@@ -130,6 +134,15 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy,
sm::make_derive("batches_unlogged_from_logged", _cql_stats.batches_unlogged_from_logged,
sm::description("Counts a total number of LOGGED batches that were executed as UNLOGGED batches.")),
sm::make_derive("prepared_cache_evictions", [] { return prepared_statements_cache::shard_stats().prepared_cache_evictions; },
sm::description("Counts a number of prepared statements cache entries evictions.")),
sm::make_gauge("prepared_cache_size", [this] { return _prepared_cache.size(); },
sm::description("A number of entries in the prepared statements cache.")),
sm::make_gauge("prepared_cache_memory_footprint", [this] { return _prepared_cache.memory_footprint(); },
sm::description("Size (in bytes) of the prepared statements cache.")),
});
service::get_local_migration_manager().register_listener(_migration_subscriber.get());
@@ -179,7 +192,7 @@ query_processor::process_statement(::shared_ptr<cql_statement> statement,
statement->validate(_proxy, client_state);
auto fut = make_ready_future<::shared_ptr<transport::messages::result_message>>();
auto fut = make_ready_future<::shared_ptr<cql_transport::messages::result_message>>();
if (client_state.is_internal()) {
fut = statement->execute_internal(_proxy, query_state, options);
} else {
@@ -196,80 +209,34 @@ query_processor::process_statement(::shared_ptr<cql_statement> statement,
});
}
future<::shared_ptr<transport::messages::result_message::prepared>>
query_processor::prepare(const std::experimental::string_view& query_string, service::query_state& query_state)
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, service::query_state& query_state)
{
auto& client_state = query_state.get_client_state();
return prepare(query_string, client_state, client_state.is_thrift());
return prepare(std::move(query_string), client_state, client_state.is_thrift());
}
future<::shared_ptr<transport::messages::result_message::prepared>>
query_processor::prepare(const std::experimental::string_view& query_string,
const service::client_state& client_state,
bool for_thrift)
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, const service::client_state& client_state, bool for_thrift)
{
auto existing = get_stored_prepared_statement(query_string, client_state.get_raw_keyspace(), for_thrift);
if (existing) {
return make_ready_future<::shared_ptr<transport::messages::result_message::prepared>>(existing);
using namespace cql_transport::messages;
if (for_thrift) {
return prepare_one<result_message::prepared::thrift>(std::move(query_string), client_state, compute_thrift_id, prepared_cache_key_type::thrift_id);
} else {
return prepare_one<result_message::prepared::cql>(std::move(query_string), client_state, compute_id, prepared_cache_key_type::cql_id);
}
auto prepared = get_statement(query_string, client_state);
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Too many markers(?). %d markers exceed the allowed maximum of %d", bound_terms, std::numeric_limits<uint16_t>::max()));
}
assert(bound_terms == prepared->bound_names.size());
return store_prepared_statement(query_string, client_state.get_raw_keyspace(), std::move(prepared), for_thrift);
}
::shared_ptr<transport::messages::result_message::prepared>
::shared_ptr<cql_transport::messages::result_message::prepared>
query_processor::get_stored_prepared_statement(const std::experimental::string_view& query_string,
const sstring& keyspace,
bool for_thrift)
{
using namespace cql_transport::messages;
if (for_thrift) {
auto statement_id = compute_thrift_id(query_string, keyspace);
auto it = _thrift_prepared_statements.find(statement_id);
if (it == _thrift_prepared_statements.end()) {
return ::shared_ptr<result_message::prepared>();
}
return ::make_shared<result_message::prepared::thrift>(statement_id, it->second);
return get_stored_prepared_statement_one<result_message::prepared::thrift>(query_string, keyspace, compute_thrift_id, prepared_cache_key_type::thrift_id);
} else {
auto statement_id = compute_id(query_string, keyspace);
auto it = _prepared_statements.find(statement_id);
if (it == _prepared_statements.end()) {
return ::shared_ptr<result_message::prepared>();
}
return ::make_shared<result_message::prepared::cql>(statement_id, it->second);
}
}
future<::shared_ptr<transport::messages::result_message::prepared>>
query_processor::store_prepared_statement(const std::experimental::string_view& query_string,
const sstring& keyspace,
::shared_ptr<statements::prepared_statement> prepared,
bool for_thrift)
{
#if 0
// Concatenate the current keyspace so we don't mix prepared statements between keyspace (#5352).
// (if the keyspace is null, queryString has to have a fully-qualified keyspace so it's fine.
long statementSize = measure(prepared.statement);
// don't execute the statement if it's bigger than the allowed threshold
if (statementSize > MAX_CACHE_PREPARED_MEMORY)
throw new InvalidRequestException(String.format("Prepared statement of size %d bytes is larger than allowed maximum of %d bytes.",
statementSize,
MAX_CACHE_PREPARED_MEMORY));
#endif
prepared->raw_cql_statement = query_string.data();
if (for_thrift) {
auto statement_id = compute_thrift_id(query_string, keyspace);
_thrift_prepared_statements.emplace(statement_id, prepared);
auto msg = ::make_shared<result_message::prepared::thrift>(statement_id, prepared);
return make_ready_future<::shared_ptr<result_message::prepared>>(std::move(msg));
} else {
auto statement_id = compute_id(query_string, keyspace);
_prepared_statements.emplace(statement_id, prepared);
auto msg = ::make_shared<result_message::prepared::cql>(statement_id, prepared);
return make_ready_future<::shared_ptr<result_message::prepared>>(std::move(msg));
return get_stored_prepared_statement_one<result_message::prepared::cql>(query_string, keyspace, compute_id, prepared_cache_key_type::cql_id);
}
}
@@ -286,22 +253,22 @@ static sstring hash_target(const std::experimental::string_view& query_string, c
return keyspace + query_string.to_string();
}
bytes query_processor::compute_id(const std::experimental::string_view& query_string, const sstring& keyspace)
prepared_cache_key_type query_processor::compute_id(const std::experimental::string_view& query_string, const sstring& keyspace)
{
return md5_calculate(hash_target(query_string, keyspace));
return prepared_cache_key_type(md5_calculate(hash_target(query_string, keyspace)));
}
int32_t query_processor::compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace)
prepared_cache_key_type query_processor::compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace)
{
auto target = hash_target(query_string, keyspace);
uint32_t h = 0;
for (auto&& c : hash_target(query_string, keyspace)) {
h = 31*h + c;
}
return static_cast<int32_t>(h);
return prepared_cache_key_type(static_cast<int32_t>(h));
}
::shared_ptr<prepared_statement>
std::unique_ptr<prepared_statement>
query_processor::get_statement(const sstring_view& query, const service::client_state& client_state)
{
#if 0
@@ -340,7 +307,7 @@ query_processor::parse_statement(const sstring_view& query)
}
}
query_options query_processor::make_internal_options(::shared_ptr<statements::prepared_statement> p,
query_options query_processor::make_internal_options(const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>& values,
db::consistency_level cl)
{
@@ -362,7 +329,7 @@ query_options query_processor::make_internal_options(::shared_ptr<statements::pr
return query_options(cl, bound_values);
}
::shared_ptr<statements::prepared_statement> query_processor::prepare_internal(const sstring& query_string)
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string)
{
auto& p = _internal_statements[query_string];
if (p == nullptr) {
@@ -370,7 +337,7 @@ query_options query_processor::make_internal_options(::shared_ptr<statements::pr
np->statement->validate(_proxy, *_internal_state);
p = std::move(np); // inserts it into map
}
return p;
return p->checked_weak_from_this();
}
future<::shared_ptr<untyped_result_set>>
@@ -380,17 +347,16 @@ query_processor::execute_internal(const sstring& query_string,
if (log.is_enabled(logging::log_level::trace)) {
log.trace("execute_internal: \"{}\" ({})", query_string, ::join(", ", values));
}
auto p = prepare_internal(query_string);
return execute_internal(p, values);
return execute_internal(prepare_internal(query_string), values);
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(::shared_ptr<statements::prepared_statement> p,
query_processor::execute_internal(statements::prepared_statement::checked_weak_ptr p,
const std::initializer_list<data_value>& values)
{
auto opts = make_internal_options(p, values);
return do_with(std::move(opts), [this, p = std::move(p)](auto& opts) {
return p->statement->execute_internal(_proxy, *_internal_state, opts).then([p](auto msg) {
return p->statement->execute_internal(_proxy, *_internal_state, opts).then([stmt = p->statement](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
@@ -402,27 +368,30 @@ query_processor::process(const sstring& query_string,
const std::initializer_list<data_value>& values,
bool cache)
{
auto p = cache ? prepare_internal(query_string) : parse_statement(query_string)->prepare(_db.local(), _cql_stats);
if (!cache) {
if (cache) {
return process(prepare_internal(query_string), cl, values);
} else {
auto p = parse_statement(query_string)->prepare(_db.local(), _cql_stats);
p->statement->validate(_proxy, *_internal_state);
auto checked_weak_ptr = p->checked_weak_from_this();
return process(std::move(checked_weak_ptr), cl, values).finally([p = std::move(p)] {});
}
return process(p, cl, values);
}
future<::shared_ptr<untyped_result_set>>
query_processor::process(::shared_ptr<statements::prepared_statement> p,
query_processor::process(statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
const std::initializer_list<data_value>& values)
{
auto opts = make_internal_options(p, values, cl);
return do_with(std::move(opts), [this, p = std::move(p)](auto & opts) {
return p->statement->execute(_proxy, *_internal_state, opts).then([p](auto msg) {
return p->statement->execute(_proxy, *_internal_state, opts).then([](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
}
future<::shared_ptr<transport::messages::result_message>>
future<::shared_ptr<cql_transport::messages::result_message>>
query_processor::process_batch(::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options)
@@ -522,7 +491,7 @@ void query_processor::migration_subscriber::on_drop_view(const sstring& ks_name,
void query_processor::migration_subscriber::remove_invalid_prepared_statements(sstring ks_name, std::experimental::optional<sstring> cf_name)
{
_qp->invalidate_prepared_statements([&] (::shared_ptr<cql_statement> stmt) {
_qp->_prepared_cache.remove_if([&] (::shared_ptr<cql_statement> stmt) {
return this->should_invalidate(ks_name, cf_name, stmt);
});
}


@@ -57,6 +57,7 @@
#include "statements/prepared_statement.hh"
#include "transport/messages/result_message.hh"
#include "untyped_result_set.hh"
#include "prepared_statements_cache.hh"
namespace cql3 {
@@ -64,9 +65,32 @@ namespace statements {
class batch_statement;
}
class prepared_statement_is_too_big : public std::exception {
public:
static constexpr int max_query_prefix = 100;
private:
sstring _msg;
public:
prepared_statement_is_too_big(const sstring& query_string)
: _msg(seastar::format("Prepared statement is too big: {}", query_string.substr(0, max_query_prefix)))
{
// mark that we clipped the query string
if (query_string.size() > max_query_prefix) {
_msg += "...";
}
}
virtual const char* what() const noexcept override {
return _msg.c_str();
}
};
class query_processor {
public:
class migration_subscriber;
private:
std::unique_ptr<migration_subscriber> _migration_subscriber;
distributed<service::storage_proxy>& _proxy;
@@ -127,10 +151,8 @@ private:
}
};
#endif
std::unordered_map<bytes, ::shared_ptr<statements::prepared_statement>> _prepared_statements;
std::unordered_map<int32_t, ::shared_ptr<statements::prepared_statement>> _thrift_prepared_statements;
std::unordered_map<sstring, ::shared_ptr<statements::prepared_statement>> _internal_statements;
prepared_statements_cache _prepared_cache;
std::unordered_map<sstring, std::unique_ptr<statements::prepared_statement>> _internal_statements;
#if 0
// A map for prepared statements used internally (which we don't want to mix with user statement, in particular we don't
@@ -221,21 +243,14 @@ private:
}
#endif
public:
::shared_ptr<statements::prepared_statement> get_prepared(const bytes& id) {
auto it = _prepared_statements.find(id);
if (it == _prepared_statements.end()) {
return ::shared_ptr<statements::prepared_statement>{};
statements::prepared_statement::checked_weak_ptr get_prepared(const prepared_cache_key_type& key) {
auto it = _prepared_cache.find(key);
if (it == _prepared_cache.end()) {
return statements::prepared_statement::checked_weak_ptr();
}
return it->second;
return *it;
}
::shared_ptr<statements::prepared_statement> get_prepared_for_thrift(int32_t id) {
auto it = _thrift_prepared_statements.find(id);
if (it == _thrift_prepared_statements.end()) {
return ::shared_ptr<statements::prepared_statement>{};
}
return it->second;
}
#if 0
public static void validateKey(ByteBuffer key) throws InvalidRequestException
{
@@ -275,7 +290,7 @@ public:
}
#endif
public:
future<::shared_ptr<transport::messages::result_message>> process_statement(::shared_ptr<cql_statement> statement,
future<::shared_ptr<cql_transport::messages::result_message>> process_statement(::shared_ptr<cql_statement> statement,
service::query_state& query_state, const query_options& options);
#if 0
@@ -286,7 +301,7 @@ public:
}
#endif
future<::shared_ptr<transport::messages::result_message>> process(const std::experimental::string_view& query_string,
future<::shared_ptr<cql_transport::messages::result_message>> process(const std::experimental::string_view& query_string,
service::query_state& query_state, query_options& options);
#if 0
@@ -340,23 +355,23 @@ public:
}
#endif
private:
query_options make_internal_options(::shared_ptr<statements::prepared_statement>, const std::initializer_list<data_value>&, db::consistency_level = db::consistency_level::ONE);
query_options make_internal_options(const statements::prepared_statement::checked_weak_ptr& p, const std::initializer_list<data_value>&, db::consistency_level = db::consistency_level::ONE);
public:
future<::shared_ptr<untyped_result_set>> execute_internal(
const sstring& query_string,
const std::initializer_list<data_value>& = { });
::shared_ptr<statements::prepared_statement> prepare_internal(const sstring& query);
statements::prepared_statement::checked_weak_ptr prepare_internal(const sstring& query);
future<::shared_ptr<untyped_result_set>> execute_internal(
::shared_ptr<statements::prepared_statement>,
statements::prepared_statement::checked_weak_ptr p,
const std::initializer_list<data_value>& = { });
future<::shared_ptr<untyped_result_set>> process(
const sstring& query_string,
db::consistency_level, const std::initializer_list<data_value>& = { }, bool cache = false);
future<::shared_ptr<untyped_result_set>> process(
::shared_ptr<statements::prepared_statement>,
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level, const std::initializer_list<data_value>& = { });
/*
@@ -434,43 +449,62 @@ public:
}
#endif
future<::shared_ptr<transport::messages::result_message::prepared>>
prepare(const std::experimental::string_view& query_string, service::query_state& query_state);
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare(sstring query_string, service::query_state& query_state);
future<::shared_ptr<transport::messages::result_message::prepared>>
prepare(const std::experimental::string_view& query_string, const service::client_state& client_state, bool for_thrift);
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare(sstring query_string, const service::client_state& client_state, bool for_thrift);
static bytes compute_id(const std::experimental::string_view& query_string, const sstring& keyspace);
static int32_t compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace);
static prepared_cache_key_type compute_id(const std::experimental::string_view& query_string, const sstring& keyspace);
static prepared_cache_key_type compute_thrift_id(const std::experimental::string_view& query_string, const sstring& keyspace);
private:
::shared_ptr<transport::messages::result_message::prepared>
get_stored_prepared_statement(const std::experimental::string_view& query_string, const sstring& keyspace, bool for_thrift);
///
/// \tparam ResultMsgType type of the returned result message (CQL or Thrift)
/// \tparam PreparedKeyGenerator a function that generates the prepared statement cache key for given query and keyspace
/// \tparam IdGetter a function that returns the corresponding prepared statement ID (CQL or Thrift) for a given prepared statement cache key
/// \param query_string
/// \param client_state
/// \param id_gen prepared ID generator, called before the first deferring
/// \param id_getter prepared ID getter, passed to deferred context by reference. The caller must ensure its liveness.
/// \return a future that resolves to the prepared result message
template <typename ResultMsgType, typename PreparedKeyGenerator, typename IdGetter>
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare_one(sstring query_string, const service::client_state& client_state, PreparedKeyGenerator&& id_gen, IdGetter&& id_getter) {
return do_with(id_gen(query_string, client_state.get_raw_keyspace()), std::move(query_string), [this, &client_state, &id_getter] (const prepared_cache_key_type& key, const sstring& query_string) {
return _prepared_cache.get(key, [this, &query_string, &client_state] {
auto prepared = get_statement(query_string, client_state);
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Too many markers(?). %d markers exceed the allowed maximum of %d", bound_terms, std::numeric_limits<uint16_t>::max()));
}
assert(bound_terms == prepared->bound_names.size());
prepared->raw_cql_statement = query_string;
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
}).then([&key, &id_getter] (auto prep_ptr) {
return make_ready_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(::make_shared<ResultMsgType>(id_getter(key), std::move(prep_ptr)));
}).handle_exception_type([&query_string] (typename prepared_statements_cache::statement_is_too_big&) {
return make_exception_future<::shared_ptr<cql_transport::messages::result_message::prepared>>(prepared_statement_is_too_big(query_string));
});
});
};
future<::shared_ptr<transport::messages::result_message::prepared>>
store_prepared_statement(const std::experimental::string_view& query_string, const sstring& keyspace, ::shared_ptr<statements::prepared_statement> prepared, bool for_thrift);
template <typename ResultMsgType, typename KeyGenerator, typename IdGetter>
::shared_ptr<cql_transport::messages::result_message::prepared>
get_stored_prepared_statement_one(const std::experimental::string_view& query_string, const sstring& keyspace, KeyGenerator&& key_gen, IdGetter&& id_getter)
{
auto cache_key = key_gen(query_string, keyspace);
auto it = _prepared_cache.find(cache_key);
if (it == _prepared_cache.end()) {
return ::shared_ptr<cql_transport::messages::result_message::prepared>();
}
// Erases the statements for which filter returns true.
template <typename Pred>
void invalidate_prepared_statements(Pred filter) {
static_assert(std::is_same<bool, std::result_of_t<Pred(::shared_ptr<cql_statement>)>>::value,
"bad Pred signature");
for (auto it = _prepared_statements.begin(); it != _prepared_statements.end(); ) {
if (filter(it->second->statement)) {
it = _prepared_statements.erase(it);
} else {
++it;
}
}
for (auto it = _thrift_prepared_statements.begin(); it != _thrift_prepared_statements.end(); ) {
if (filter(it->second->statement)) {
it = _thrift_prepared_statements.erase(it);
} else {
++it;
}
}
return ::make_shared<ResultMsgType>(id_getter(cache_key), *it);
}
::shared_ptr<cql_transport::messages::result_message::prepared>
get_stored_prepared_statement(const std::experimental::string_view& query_string, const sstring& keyspace, bool for_thrift);
#if 0
public ResultMessage processPrepared(CQLStatement statement, QueryState queryState, QueryOptions options)
throws RequestExecutionException, RequestValidationException
@@ -497,10 +531,10 @@ private:
#endif
public:
future<::shared_ptr<transport::messages::result_message>> process_batch(::shared_ptr<statements::batch_statement>,
future<::shared_ptr<cql_transport::messages::result_message>> process_batch(::shared_ptr<statements::batch_statement>,
service::query_state& query_state, query_options& options);
::shared_ptr<statements::prepared_statement> get_statement(const std::experimental::string_view& query,
std::unique_ptr<statements::prepared_statement> get_statement(const std::experimental::string_view& query,
const service::client_state& client_state);
static ::shared_ptr<statements::raw::parsed_statement> parse_statement(const std::experimental::string_view& query);


@@ -94,6 +94,26 @@ public:
return true;
}
/**
* Checks whether the specified row satisfies this restriction.
* Assumes the row itself is live, but not necessarily all of its cells:
* if a cell isn't live and there's a restriction on its column,
* the function returns false.
*
* @param schema the schema the row belongs to
* @param key the partition key
* @param ckey the clustering key
* @param cells the remaining row columns
* @return true if the row satisfies this restriction, false otherwise
*/
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const = 0;
protected:
#if 0
protected static ByteBuffer validateIndexedValue(ColumnSpecification columnSpec,
@@ -113,7 +133,7 @@ protected:
* @param function_name the function name
* @return <code>true</code> if the specified term is using the specified function, <code>false</code> otherwise.
*/
static bool uses_function(::shared_ptr<term> term, const sstring& ks_name, const sstring& function_name) {
static bool term_uses_function(::shared_ptr<term> term, const sstring& ks_name, const sstring& function_name) {
return bool(term) && term->uses_function(ks_name, function_name);
}
@@ -125,9 +145,9 @@ protected:
* @param function_name the function name
* @return <code>true</code> if one of the specified terms uses the specified function, <code>false</code> otherwise.
*/
static bool uses_function(const std::vector<::shared_ptr<term>>& terms, const sstring& ks_name, const sstring& function_name) {
static bool term_uses_function(const std::vector<::shared_ptr<term>>& terms, const sstring& ks_name, const sstring& function_name) {
for (auto&& value : terms) {
if (uses_function(value, ks_name, function_name)) {
if (term_uses_function(value, ks_name, function_name)) {
return true;
}
}


@@ -85,6 +85,20 @@ public:
do_merge_with(as_pkr);
}
bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override {
for (auto&& range : bounds_ranges(options)) {
if (!range.contains(ckey, clustering_key_prefix::prefix_equal_tri_compare(schema))) {
return false;
}
}
return true;
}
protected:
virtual void do_merge_with(::shared_ptr<primary_key_restrictions<clustering_key_prefix>> other) = 0;
@@ -155,7 +169,7 @@ public:
{ }
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return abstract_restriction::uses_function(_value, ks_name, function_name);
return abstract_restriction::term_uses_function(_value, ks_name, function_name);
}
virtual sstring to_string() const override {
@@ -304,11 +318,11 @@ public:
{ }
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return abstract_restriction::uses_function(_values, ks_name, function_name);
return abstract_restriction::term_uses_function(_values, ks_name, function_name);
}
virtual sstring to_string() const override {
return sprint("IN(%s)", ::to_string(_values));
return sprint("IN(%s)", std::to_string(_values));
}
protected:
@@ -428,8 +442,8 @@ public:
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return (_slice.has_bound(statements::bound::START) && abstract_restriction::uses_function(_slice.bound(statements::bound::START), ks_name, function_name))
|| (_slice.has_bound(statements::bound::END) && abstract_restriction::uses_function(_slice.bound(statements::bound::END), ks_name, function_name));
return (_slice.has_bound(statements::bound::START) && abstract_restriction::term_uses_function(_slice.bound(statements::bound::START), ks_name, function_name))
|| (_slice.has_bound(statements::bound::END) && abstract_restriction::term_uses_function(_slice.bound(statements::bound::END), ks_name, function_name));
}
virtual bool is_inclusive(statements::bound b) const override {


@@ -46,6 +46,7 @@
#include "cartesian_product.hh"
#include "cql3/restrictions/primary_key_restrictions.hh"
#include "cql3/restrictions/single_column_restrictions.hh"
#include <boost/algorithm/cxx11/all_of.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/adaptor/filtered.hpp>
@@ -96,6 +97,14 @@ public:
return _in;
}
virtual bool has_bound(statements::bound b) const override {
return boost::algorithm::all_of(_restrictions->restrictions(), [b] (auto&& r) { return r.second->has_bound(b); });
}
virtual bool is_inclusive(statements::bound b) const override {
return boost::algorithm::all_of(_restrictions->restrictions(), [b] (auto&& r) { return r.second->is_inclusive(b); });
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return _restrictions->uses_function(ks_name, function_name);
}
@@ -115,7 +124,7 @@ public:
if (restriction->is_slice()) {
throw exceptions::invalid_request_exception(sprint(
"PRIMARY KEY column \"%s\" cannot be restricted (preceding column \"%s\" is restricted by a non-EQ relation)",
_restrictions->next_column(new_column)->name_as_text(), new_column.name_as_text()));
last_column.name_as_text(), new_column.name_as_text()));
}
}
@@ -331,6 +340,17 @@ public:
sstring to_string() const override {
return sprint("Restrictions(%s)", join(", ", get_column_defs()));
}
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override {
return boost::algorithm::all_of(
_restrictions->restrictions() | boost::adaptors::map_values,
[&] (auto&& r) { return r->is_satisfied_by(schema, key, ckey, cells, options, now); });
}
};
template<>


@@ -49,6 +49,8 @@
#include "schema.hh"
#include "to_string.hh"
#include "exceptions/exceptions.hh"
#include "keys.hh"
#include "mutation_partition.hh"
namespace cql3 {
@@ -105,6 +107,13 @@ public:
class slice;
class contains;
protected:
bytes_view_opt get_value(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
gc_clock::time_point now) const;
};
class single_column_restriction::EQ final : public single_column_restriction {
@@ -117,7 +126,7 @@ public:
{ }
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return abstract_restriction::uses_function(_value, ks_name, function_name);
return abstract_restriction::term_uses_function(_value, ks_name, function_name);
}
virtual bool is_EQ() const override {
@@ -143,6 +152,13 @@ public:
"%s cannot be restricted by more than one relation if it includes an Equal", _column_def.name_as_text()));
}
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
#if 0
@Override
protected boolean isSupportedBy(SecondaryIndex index)
@@ -167,6 +183,13 @@ public:
"%s cannot be restricted by more than one relation if it includes a IN", _column_def.name_as_text()));
}
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
#if 0
@Override
protected final boolean isSupportedBy(SecondaryIndex index)
@@ -186,7 +209,7 @@ public:
{ }
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return abstract_restriction::uses_function(_values, ks_name, function_name);
return abstract_restriction::term_uses_function(_values, ks_name, function_name);
}
virtual std::vector<bytes_opt> values(const query_options& options) const override {
@@ -198,7 +221,7 @@ public:
}
virtual sstring to_string() const override {
return sprint("IN(%s)", ::to_string(_values));
return sprint("IN(%s)", std::to_string(_values));
}
};
@@ -237,8 +260,8 @@ public:
{ }
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return (_slice.has_bound(statements::bound::START) && abstract_restriction::uses_function(_slice.bound(statements::bound::START), ks_name, function_name))
|| (_slice.has_bound(statements::bound::END) && abstract_restriction::uses_function(_slice.bound(statements::bound::END), ks_name, function_name));
return (_slice.has_bound(statements::bound::START) && abstract_restriction::term_uses_function(_slice.bound(statements::bound::START), ks_name, function_name))
|| (_slice.has_bound(statements::bound::END) && abstract_restriction::term_uses_function(_slice.bound(statements::bound::END), ks_name, function_name));
}
virtual bool is_slice() const override {
@@ -310,6 +333,13 @@ public:
virtual sstring to_string() const override {
return sprint("SLICE%s", _slice);
}
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
};
// This holds CONTAINS, CONTAINS_KEY, and map[key] = value restrictions because we might want to have any combination of them.
@@ -403,15 +433,15 @@ public:
}
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return abstract_restriction::uses_function(_values, ks_name, function_name)
|| abstract_restriction::uses_function(_keys, ks_name, function_name)
|| abstract_restriction::uses_function(_entry_keys, ks_name, function_name)
|| abstract_restriction::uses_function(_entry_values, ks_name, function_name);
return abstract_restriction::term_uses_function(_values, ks_name, function_name)
|| abstract_restriction::term_uses_function(_keys, ks_name, function_name)
|| abstract_restriction::term_uses_function(_entry_keys, ks_name, function_name)
|| abstract_restriction::term_uses_function(_entry_values, ks_name, function_name);
}
virtual sstring to_string() const override {
return sprint("CONTAINS(values=%s, keys=%s, entryKeys=%s, entryValues=%s)",
::to_string(_values), ::to_string(_keys), ::to_string(_entry_keys), ::to_string(_entry_values));
std::to_string(_values), std::to_string(_keys), std::to_string(_entry_keys), std::to_string(_entry_values));
}
virtual bool has_bound(statements::bound b) const override {
@@ -426,6 +456,13 @@ public:
throw exceptions::unsupported_operation_exception();
}
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
#if 0
private List<ByteBuffer> keys(const query_options& options) {
return bindAndGet(keys, options);


@@ -75,7 +75,7 @@ private:
* The _restrictions per column.
*/
public:
using restrictions_map = std::map<const column_definition*, ::shared_ptr<restriction>, column_definition_comparator>;
using restrictions_map = std::map<const column_definition*, ::shared_ptr<single_column_restriction>, column_definition_comparator>;
private:
restrictions_map _restrictions;
bool _is_all_eq = true;


@@ -31,6 +31,8 @@
#include "cql3/single_column_relation.hh"
#include "cql3/constants.hh"
#include "stdx.hh"
namespace cql3 {
namespace restrictions {
@@ -88,6 +90,14 @@ public:
sstring to_string() const override {
return "Initial restrictions";
}
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override {
return true;
}
};
template<>
@@ -144,6 +154,7 @@ to_column_definition(const schema_ptr& schema, const ::shared_ptr<column_identif
statement_restrictions::statement_restrictions(database& db,
schema_ptr schema,
statements::statement_type type,
const std::vector<::shared_ptr<relation>>& where_clause,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
@@ -199,7 +210,7 @@ statement_restrictions::statement_restrictions(database& db,
|| nonprimary_key_restrictions->has_supporting_index(secondaryIndexManager);*/
// At this point, the select statement is fully constructed, but we still have a few things to validate
process_partition_key_restrictions(has_queriable_index);
process_partition_key_restrictions(has_queriable_index, for_view);
// Some but not all of the partition key columns have been specified;
// hence we need turn these restrictions into index expressions.
@@ -208,11 +219,18 @@ statement_restrictions::statement_restrictions(database& db,
}
if (selects_only_static_columns && has_clustering_columns_restriction()) {
throw exceptions::invalid_request_exception(
"Cannot restrict clustering columns when selecting only static columns");
if (type.is_update() || type.is_delete()) {
throw exceptions::invalid_request_exception(sprint(
"Invalid restrictions on clustering columns since the %s statement modifies only static columns", type));
}
if (type.is_select()) {
throw exceptions::invalid_request_exception(
"Cannot restrict clustering columns when selecting only static columns");
}
}
process_clustering_columns_restrictions(has_queriable_index, select_a_collection);
process_clustering_columns_restrictions(has_queriable_index, select_a_collection, for_view);
// Covers indexes on the first clustering column (among others).
if (_is_key_range && has_queriable_clustering_column_index)
@@ -254,7 +272,7 @@ statement_restrictions::statement_restrictions(database& db,
_index_restrictions.push_back(_nonprimary_key_restrictions);
}
if (_uses_secondary_indexing) {
if (_uses_secondary_indexing && !for_view) {
fail(unimplemented::cause::INDEXES);
#if 0
validate_secondary_index_selections(selects_only_static_columns);
@@ -289,7 +307,7 @@ bool statement_restrictions::uses_function(const sstring& ks_name, const sstring
|| _nonprimary_key_restrictions->uses_function(ks_name, function_name);
}
void statement_restrictions::process_partition_key_restrictions(bool has_queriable_index) {
void statement_restrictions::process_partition_key_restrictions(bool has_queriable_index, bool for_view) {
// If there is a queriable index, no special conditions are required on the other restrictions.
// But we still need to know 2 things:
// - If we don't have a queriable index, is the query ok
@@ -299,7 +317,7 @@ void statement_restrictions::process_partition_key_restrictions(bool has_queriab
if (_partition_key_restrictions->is_on_token()) {
_is_key_range = true;
} else if (has_partition_key_unrestricted_components()) {
if (!_partition_key_restrictions->empty()) {
if (!_partition_key_restrictions->empty() && !for_view) {
if (!has_queriable_index) {
throw exceptions::invalid_request_exception(sprint("Partition key parts: %s must be restricted as other parts are",
join(", ", get_partition_key_unrestricted_components())));
@@ -315,7 +333,11 @@ bool statement_restrictions::has_partition_key_unrestricted_components() const {
return _partition_key_restrictions->size() < _schema->partition_key_size();
}
void statement_restrictions::process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection) {
bool statement_restrictions::has_unrestricted_clustering_columns() const {
return _clustering_columns_restrictions->size() < _schema->clustering_key_size();
}
void statement_restrictions::process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection, bool for_view) {
if (!has_clustering_columns_restriction()) {
return;
}
@@ -335,7 +357,7 @@ void statement_restrictions::process_clustering_columns_restrictions(bool has_qu
const column_definition* clustering_column = &(*clustering_columns_iter);
++clustering_columns_iter;
if (clustering_column != restricted_column) {
if (clustering_column != restricted_column && !for_view) {
if (!has_queriable_index) {
throw exceptions::invalid_request_exception(sprint(
"PRIMARY KEY column \"%s\" cannot be restricted as preceding column \"%s\" is not restricted",
@@ -392,5 +414,274 @@ void statement_restrictions::validate_secondary_index_selections(bool selects_on
}
}
static bytes_view_opt do_get_value(const schema& schema,
const column_definition& cdef,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
gc_clock::time_point now) {
switch(cdef.kind) {
case column_kind::partition_key:
return key.get_component(schema, cdef.component_index());
case column_kind::clustering_key:
return ckey.get_component(schema, cdef.component_index());
default:
auto cell = cells.find_cell(cdef.id);
if (!cell) {
return stdx::nullopt;
}
assert(cdef.is_atomic());
auto c = cell->as_atomic_cell();
return c.is_dead(now) ? stdx::nullopt : bytes_view_opt(c.value());
}
}
bytes_view_opt single_column_restriction::get_value(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
gc_clock::time_point now) const {
return do_get_value(schema, _column_def, key, ckey, cells, std::move(now));
}
bool single_column_restriction::EQ::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto operand = value(options);
if (operand) {
auto cell_value = get_value(schema, key, ckey, cells, now);
return cell_value && _column_def.type->compare(*operand, *cell_value) == 0;
}
return false;
}
bool single_column_restriction::IN::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto cell_value = get_value(schema, key, ckey, cells, now);
if (!cell_value) {
return false;
}
auto operands = values(options);
return std::any_of(operands.begin(), operands.end(), [&] (auto&& operand) {
return operand && _column_def.type->compare(*operand, *cell_value) == 0;
});
}
static query::range<bytes_view> to_range(const term_slice& slice, const query_options& options) {
using range_type = query::range<bytes_view>;
auto extract_bound = [&] (statements::bound bound) -> stdx::optional<range_type::bound> {
if (!slice.has_bound(bound)) {
return { };
}
auto value = slice.bound(bound)->bind_and_get(options);
if (!value) {
return { };
}
return { range_type::bound(*value, slice.is_inclusive(bound)) };
};
return range_type(
extract_bound(statements::bound::START),
extract_bound(statements::bound::END));
}
bool single_column_restriction::slice::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto cell_value = get_value(schema, key, ckey, cells, now);
if (!cell_value) {
return false;
}
return to_range(_slice, options).contains(*cell_value, _column_def.type->as_tri_comparator());
}
bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
if (!_column_def.type->is_collection()) {
return false;
}
auto col_type = static_pointer_cast<const collection_type_impl>(_column_def.type);
if ((!_keys.empty() || !_entry_keys.empty()) && !col_type->is_map()) {
return false;
}
assert(_entry_keys.size() == _entry_values.size());
auto&& map_key_type = col_type->name_comparator();
auto&& element_type = col_type->is_set() ? col_type->name_comparator() : col_type->value_comparator();
if (_column_def.type->is_multi_cell()) {
auto cell = cells.find_cell(_column_def.id);
auto&& elements = col_type->deserialize_mutation_form(cell->as_collection_mutation()).cells;
auto end = std::remove_if(elements.begin(), elements.end(), [now] (auto&& element) {
return element.second.is_dead(now);
});
for (auto&& value : _values) {
auto val = value->bind_and_get(options);
if (!val) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return element_type->compare(element.second.value(), *val) == 0;
});
if (found == end) {
return false;
}
}
for (auto&& key : _keys) {
auto k = key->bind_and_get(options);
if (!k) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, *k) == 0;
});
if (found == end) {
return false;
}
}
for (uint32_t i = 0; i < _entry_keys.size(); ++i) {
auto map_key = _entry_keys[i]->bind_and_get(options);
auto map_value = _entry_values[i]->bind_and_get(options);
if (!map_key || !map_value) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, *map_key) == 0;
});
if (found == end || element_type->compare(found->second.value(), *map_value) != 0) {
return false;
}
}
} else {
auto cell_value = get_value(schema, key, ckey, cells, now);
if (!cell_value) {
return false;
}
auto deserialized = _column_def.type->deserialize(*cell_value);
for (auto&& value : _values) {
auto val = value->bind_and_get(options);
if (!val) {
continue;
}
auto exists_in = [&](auto&& range) {
auto found = std::find_if(range.begin(), range.end(), [&] (auto&& element) {
return element_type->compare(element.serialize(), *val) == 0;
});
return found != range.end();
};
if (col_type->is_list()) {
if (!exists_in(value_cast<list_type_impl::native_type>(deserialized))) {
return false;
}
} else if (col_type->is_set()) {
if (!exists_in(value_cast<set_type_impl::native_type>(deserialized))) {
return false;
}
} else {
auto data_map = value_cast<map_type_impl::native_type>(deserialized);
if (!exists_in(data_map | boost::adaptors::transformed([] (auto&& p) { return p.second; }))) {
return false;
}
}
}
if (col_type->is_map()) {
auto& data_map = value_cast<map_type_impl::native_type>(deserialized);
for (auto&& key : _keys) {
auto k = key->bind_and_get(options);
if (!k) {
continue;
}
auto found = std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), *k) == 0;
});
if (found == data_map.end()) {
return false;
}
}
for (uint32_t i = 0; i < _entry_keys.size(); ++i) {
auto map_key = _entry_keys[i]->bind_and_get(options);
auto map_value = _entry_values[i]->bind_and_get(options);
if (!map_key || !map_value) {
continue;
}
auto found = std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), *map_key) == 0;
});
if (found == data_map.end() || element_type->compare(found->second.serialize(), *map_value) != 0) {
return false;
}
}
}
}
return true;
}
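The frozen-map branch above checks three kinds of CONTAINS restrictions against a deserialized native map. A minimal sketch of those three checks, using a plain `std::map` in place of the deserialized collection (all names here are hypothetical):

```cpp
#include <algorithm>
#include <map>
#include <string>

// Stand-in for the deserialized native map form of a frozen map column.
using native_map = std::map<std::string, std::string>;

// CONTAINS v: the operand must appear among the map's values.
bool contains_value(const native_map& m, const std::string& v) {
    return std::any_of(m.begin(), m.end(),
                       [&](const auto& e) { return e.second == v; });
}

// CONTAINS KEY k: the operand must appear among the map's keys.
bool contains_key(const native_map& m, const std::string& k) {
    return m.count(k) != 0;
}

// m[k] = v: the key must be present and its value must compare equal.
bool contains_entry(const native_map& m, const std::string& k,
                    const std::string& v) {
    auto it = m.find(k);
    return it != m.end() && it->second == v;
}
```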
bool token_restriction::EQ::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
bool satisfied = false;
auto cdef = _column_definitions.begin();
for (auto&& operand : values(options)) {
if (operand) {
auto cell_value = do_get_value(schema, **cdef, key, ckey, cells, now);
satisfied = cell_value && (*cdef)->type->compare(*operand, *cell_value) == 0;
}
if (!satisfied) {
break;
}
}
return satisfied;
}
bool token_restriction::slice::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const {
bool satisfied = false;
auto range = to_range(_slice, options);
for (auto* cdef : _column_definitions) {
auto cell_value = do_get_value(schema, *cdef, key, ckey, cells, now);
if (!cell_value) {
return false;
}
satisfied = range.contains(*cell_value, cdef->type->as_tri_comparator());
if (!satisfied) {
break;
}
}
return satisfied;
}
}
}


@@ -49,6 +49,7 @@
#include "cql3/restrictions/single_column_restrictions.hh"
#include "cql3/relation.hh"
#include "cql3/variable_specifications.hh"
#include "cql3/statements/statement_type.hh"
namespace cql3 {
@@ -111,6 +112,7 @@ public:
statement_restrictions(database& db,
schema_ptr schema,
statements::statement_type type,
const std::vector<::shared_ptr<relation>>& where_clause,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
@@ -150,8 +152,13 @@ public:
return _uses_secondary_indexing;
}
private:
void process_partition_key_restrictions(bool has_queriable_index);
::shared_ptr<primary_key_restrictions<partition_key>> get_partition_key_restrictions() const {
return _partition_key_restrictions;
}
::shared_ptr<primary_key_restrictions<clustering_key_prefix>> get_clustering_columns_restrictions() const {
return _clustering_columns_restrictions;
}
/**
* Checks if the partition key has some unrestricted components.
@@ -159,6 +166,14 @@ private:
*/
bool has_partition_key_unrestricted_components() const;
/**
* Checks if the clustering key has some unrestricted components.
* @return <code>true</code> if the clustering key has some unrestricted components, <code>false</code> otherwise.
*/
bool has_unrestricted_clustering_columns() const;
private:
void process_partition_key_restrictions(bool has_queriable_index, bool for_view);
/**
* Returns the partition key components that are not restricted.
* @return the partition key components that are not restricted.
@@ -172,7 +187,7 @@ private:
* @param select_a_collection <code>true</code> if the query should return a collection column
* @throws InvalidRequestException if the request is invalid
*/
void process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection);
void process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection, bool for_view);
/**
* Returns the <code>Restrictions</code> for the specified type of columns.
@@ -378,6 +393,13 @@ public:
auto&& restricted = get_restrictions(cdef->kind).get()->get_column_defs();
return std::find(restricted.begin(), restricted.end(), cdef) != restricted.end();
}
/**
* @return the non-primary key restrictions.
*/
const single_column_restrictions::restrictions_map& get_non_pk_restriction() const {
return _nonprimary_key_restrictions->restrictions();
}
};
}


@@ -157,7 +157,7 @@ public:
}
bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return abstract_restriction::uses_function(_value, ks_name, function_name);
return abstract_restriction::term_uses_function(_value, ks_name, function_name);
}
void merge_with(::shared_ptr<restriction>) override {
@@ -173,6 +173,13 @@ public:
sstring to_string() const override {
return sprint("EQ(%s)", _value->to_string());
}
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
};
class token_restriction::slice final : public token_restriction {
@@ -203,11 +210,11 @@ public:
bool uses_function(const sstring& ks_name,
const sstring& function_name) const override {
return (_slice.has_bound(statements::bound::START)
&& abstract_restriction::uses_function(
&& abstract_restriction::term_uses_function(
_slice.bound(statements::bound::START), ks_name,
function_name))
|| (_slice.has_bound(statements::bound::END)
&& abstract_restriction::uses_function(
&& abstract_restriction::term_uses_function(
_slice.bound(statements::bound::END),
ks_name, function_name));
}
@@ -246,6 +253,13 @@ public:
sstring to_string() const override {
return sprint("SLICE%s", _slice);
}
virtual bool is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
};
}


@@ -44,8 +44,10 @@
namespace cql3 {
metadata::metadata(std::vector<::shared_ptr<column_specification>> names_)
: metadata(flag_enum_set(), std::move(names_), names_.size(), {})
{ }
: _flags(flag_enum_set())
, names(std::move(names_)) {
_column_count = names.size();
}
metadata::metadata(flag_enum_set flags, std::vector<::shared_ptr<column_specification>> names_, uint32_t column_count,
::shared_ptr<const service::pager::paging_state> paging_state)


@@ -75,7 +75,7 @@ public:
std::vector<::shared_ptr<column_specification>> names;
private:
const uint32_t _column_count;
uint32_t _column_count;
::shared_ptr<const service::pager::paging_state> _paging_state;
public:
@@ -153,8 +153,8 @@ public:
void trim(size_t limit);
template<typename RowComparator>
void sort(RowComparator&& cmp) {
std::sort(_rows.begin(), _rows.end(), std::forward<RowComparator>(cmp));
void sort(const RowComparator& cmp) {
std::sort(_rows.begin(), _rows.end(), std::ref(cmp));
}
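The `sort()` change above is worth a note: taking the comparator by const reference and passing `std::ref(cmp)` to `std::sort` avoids copying the comparator (`std::sort` takes its comparator by value), which matters when the comparator is expensive to copy or not copyable at all. A standalone sketch of the same pattern, with a hypothetical free function in place of the member:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Sketch of the revised signature: cmp is taken by const reference, and
// std::ref wraps it in a cheap-to-copy std::reference_wrapper, so std::sort
// copies only the wrapper, never the comparator itself.
template <typename RowComparator>
void sort_rows(std::vector<int>& rows, const RowComparator& cmp) {
    std::sort(rows.begin(), rows.end(), std::ref(cmp));
}
```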
metadata& get_metadata();


@@ -39,6 +39,8 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <boost/range/adaptor/transformed.hpp>
#include "cql3/selection/selection.hh"
#include "cql3/selection/selector_factories.hh"
#include "cql3/result_set.hh"
@@ -203,14 +205,10 @@ protected:
};
::shared_ptr<selection> selection::wildcard(schema_ptr schema) {
std::vector<const column_definition*> cds;
auto& columns = schema->all_columns_in_select_order();
cds.reserve(columns.size());
for (auto& c : columns) {
if (!schema->is_dense() || !c.is_regular() || !c.name().empty()) {
cds.emplace_back(&c);
}
}
auto columns = schema->all_columns_in_select_order();
auto cds = boost::copy_range<std::vector<const column_definition*>>(columns | boost::adaptors::transformed([](const column_definition& c) {
return &c;
}));
return simple_selection::make(schema, std::move(cds), true);
}


@@ -224,7 +224,7 @@ sets::marker::bind(const query_options& options) {
}
void
sets::setter::execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) {
sets::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
const auto& value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
@@ -241,7 +241,7 @@ sets::setter::execute(mutation& m, const exploded_clustering_prefix& row_key, co
}
void
sets::adder::execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) {
sets::adder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
const auto& value = _t->bind(params._options);
if (value == constants::UNSET_VALUE) {
return;
@@ -251,7 +251,7 @@ sets::adder::execute(mutation& m, const exploded_clustering_prefix& row_key, con
}
void
sets::adder::do_add(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params,
sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params,
shared_ptr<term> value, const column_definition& column) {
auto set_value = dynamic_pointer_cast<sets::value>(std::move(value));
auto set_type = dynamic_pointer_cast<const set_type_impl>(column.type);
@@ -281,7 +281,7 @@ sets::adder::do_add(mutation& m, const exploded_clustering_prefix& row_key, cons
}
void
sets::discarder::execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) {
sets::discarder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
assert(column.type->is_multi_cell()); // "Attempted to remove items from a frozen set";
auto&& value = _t->bind(params._options);
@@ -305,7 +305,7 @@ sets::discarder::execute(mutation& m, const exploded_clustering_prefix& row_key,
ctype->serialize_mutation_form(mut)));
}
void sets::element_discarder::execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params)
void sets::element_discarder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params)
{
assert(column.type->is_multi_cell() && "Attempted to remove items from a frozen set");
auto elt = _t->bind(params._options);


@@ -112,7 +112,7 @@ public:
setter(const column_definition& column, shared_ptr<term> t)
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
};
class adder : public operation {
@@ -120,8 +120,8 @@ public:
adder(const column_definition& column, shared_ptr<term> t)
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) override;
static void do_add(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params,
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
static void do_add(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params,
shared_ptr<term> value, const column_definition& column);
};
@@ -131,14 +131,14 @@ public:
discarder(const column_definition& column, shared_ptr<term> t)
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
};
class element_discarder : public operation {
public:
element_discarder(const column_definition& column, shared_ptr<term> t)
: operation(column, std::move(t)) { }
virtual void execute(mutation& m, const exploded_clustering_prefix& row_key, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
};
};


@@ -139,7 +139,7 @@ protected:
}
if (is_IN()) {
return sprint("%s IN %s", entity_as_string, ::to_string(_in_values));
return sprint("%s IN (%s)", entity_as_string, join(", ", _in_values));
}
return sprint("%s %s %s", entity_as_string, _relation_type, _value->to_string());
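The hunk above fixes `relation::to_string()` for IN relations: the value list must be rendered inside parentheses (`x IN (1, 2)`), which the old `::to_string(_in_values)` rendering did not guarantee. A minimal standalone sketch of the corrected formatting, with a simplified `join()` standing in for the real helper:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Simplified stand-in for the join() helper used in the patch:
// concatenates the values with the given separator.
std::string join(const std::string& sep, const std::vector<std::string>& vs) {
    std::ostringstream out;
    for (size_t i = 0; i < vs.size(); ++i) {
        if (i) {
            out << sep;
        }
        out << vs[i];
    }
    return out.str();
}

// Renders an IN relation the way the fixed code does: values are
// comma-joined and wrapped in parentheses.
std::string in_relation_to_string(const std::string& entity,
                                  const std::vector<std::string>& values) {
    return entity + " IN (" + join(", ", values) + ")";
}
```

Names here are illustrative; the real code formats via `sprint("%s IN (%s)", ...)`.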


@@ -63,7 +63,7 @@ void cql3::statements::alter_keyspace_statement::validate(distributed<service::s
service::get_local_storage_proxy().get_db().local().find_keyspace(_name); // throws on failure
auto tmp = _name;
std::transform(tmp.begin(), tmp.end(), tmp.begin(), ::tolower);
if (tmp == db::system_keyspace::NAME) {
if (is_system_keyspace(tmp)) {
throw exceptions::invalid_request_exception("Cannot alter system keyspace");
}
@@ -89,21 +89,18 @@ void cql3::statements::alter_keyspace_statement::validate(distributed<service::s
}
}
future<bool> cql3::statements::alter_keyspace_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) {
future<shared_ptr<cql_transport::event::schema_change>> cql3::statements::alter_keyspace_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) {
auto old_ksm = service::get_local_storage_proxy().get_db().local().find_keyspace(_name).metadata();
return service::get_local_migration_manager().announce_keyspace_update(_attrs->as_ks_metadata_update(old_ksm), is_local_only).then([] {
return true;
return service::get_local_migration_manager().announce_keyspace_update(_attrs->as_ks_metadata_update(old_ksm), is_local_only).then([this] {
using namespace cql_transport;
return make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,
keyspace());
});
}
shared_ptr<transport::event::schema_change> cql3::statements::alter_keyspace_statement::change_event() {
return make_shared<transport::event::schema_change>(
transport::event::schema_change::change_type::UPDATED,
keyspace());
}
shared_ptr<cql3::statements::prepared_statement>
std::unique_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_keyspace_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_keyspace_statement>(*this));
return std::make_unique<prepared_statement>(make_shared<alter_keyspace_statement>(*this));
}
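This diff illustrates a refactor repeated across all the schema-altering statements: the two-step API (`announce_migration() -> future<bool>` followed by a separate `change_event()` virtual) is collapsed so the migration itself resolves to the client notification. A minimal synchronous sketch of the new shape, with hypothetical simplified types standing in for `seastar::future` and `cql_transport::event::schema_change`:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, simplified stand-in for cql_transport::event::schema_change.
struct schema_change {
    std::string change_type;  // e.g. "UPDATED"
    std::string keyspace;
};

// Old shape: announce the migration, return bool, and let the server call a
// separate change_event() afterwards. New shape: the announcement resolves
// directly to the event, so the two steps can no longer drift apart.
std::shared_ptr<schema_change> announce_keyspace_update(std::string ks) {
    // ... announce the schema mutation to the cluster here ...
    return std::make_shared<schema_change>(schema_change{"UPDATED", std::move(ks)});
}
```

In the real code this return value is produced inside a `.then([this] { ... })` continuation on the migration future.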


@@ -61,9 +61,8 @@ public:
future<> check_access(const service::client_state& state) override;
void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -40,6 +40,7 @@
*/
#include "cql3/statements/alter_table_statement.hh"
#include "index/secondary_index_manager.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
#include "validation.hh"
@@ -47,6 +48,7 @@
#include <boost/range/adaptor/filtered.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include "cql3/util.hh"
#include "view_info.hh"
namespace cql3 {
@@ -151,12 +153,20 @@ static void validate_column_rename(const schema& schema, const column_identifier
throw exceptions::invalid_request_exception(sprint("Cannot rename non PRIMARY KEY part %s", from));
}
if (def->is_indexed()) {
throw exceptions::invalid_request_exception(sprint("Cannot rename column %s because it is secondary indexed", from));
if (!schema.indices().empty()) {
auto& sim = secondary_index::get_secondary_index_manager();
auto dependent_indices = sim.local().get_dependent_indices(*def);
if (!dependent_indices.empty()) {
auto index_names = ::join(", ", dependent_indices | boost::adaptors::transformed([](const index_metadata& im) {
return im.name();
}));
throw exceptions::invalid_request_exception(
sprint("Cannot rename column %s because it has dependent secondary indexes (%s)", from, index_names));
}
}
}
future<bool> alter_table_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
{
auto& db = proxy.local().get_db().local();
auto schema = validation::validate_column_family(db, keyspace(), column_family());
@@ -178,7 +188,7 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
}
auto& cf = db.find_column_family(schema);
std::vector<schema_ptr> view_updates;
std::vector<view_ptr> view_updates;
switch (_type) {
case alter_table_statement::type::add:
@@ -219,8 +229,13 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
throw exceptions::invalid_request_exception("Cannot use non-frozen collections with super column families");
}
auto it = schema->collections().find(column_name->name());
if (it != schema->collections().end() && !type->is_compatible_with(*it->second)) {
// If there used to be a non-frozen collection column with the same name (that has been dropped),
// we could still have some data using the old type, and so we can't allow adding a collection
// with the same name unless the types are compatible (see #6276).
auto& dropped = schema->dropped_columns();
auto i = dropped.find(column_name->text());
if (i != dropped.end() && !type->is_compatible_with(*i->second.type)) {
throw exceptions::invalid_request_exception(sprint("Cannot add a collection with the name %s "
"because a collection with the same name and a different type has already been used in the past", column_name));
}
@@ -235,7 +250,7 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
if (view->view_info()->include_all_columns()) {
schema_builder builder(view);
builder.with_column(column_name->name(), type);
view_updates.push_back(builder.build());
view_updates.push_back(view_ptr(builder.build()));
}
}
}
@@ -261,7 +276,7 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
schema_builder builder(view);
auto view_type = validate_alter(view, *view_def, *validator);
builder.with_altered_column_type(column_name->name(), std::move(view_type));
view_updates.push_back(builder.build());
view_updates.push_back(view_ptr(builder.build()));
}
}
break;
@@ -347,32 +362,26 @@ future<bool> alter_table_statement::announce_migration(distributed<service::stor
builder.with_view_info(view->view_info()->base_id(), view->view_info()->base_name(),
view->view_info()->include_all_columns(), std::move(new_where));
view_updates.push_back(builder.build());
view_updates.push_back(view_ptr(builder.build()));
}
}
}
break;
}
auto f = service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, is_local_only);
return f.then([is_local_only, view_updates = std::move(view_updates)] {
return parallel_for_each(view_updates, [is_local_only] (auto&& view) {
return service::get_local_migration_manager().announce_view_update(view_ptr(std::move(view)), is_local_only);
});
}).then([] {
return true;
return service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, std::move(view_updates), is_local_only).then([this] {
using namespace cql_transport;
return make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,
event::schema_change::target_type::TABLE,
keyspace(),
column_family());
});
}
shared_ptr<transport::event::schema_change> alter_table_statement::change_event()
{
return make_shared<transport::event::schema_change>(transport::event::schema_change::change_type::UPDATED,
transport::event::schema_change::target_type::TABLE, keyspace(), column_family());
}
shared_ptr<cql3::statements::prepared_statement>
std::unique_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_table_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_table_statement>(*this));
return std::make_unique<prepared_statement>(make_shared<alter_table_statement>(*this));
}
}
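The collection-column change above replaces a lookup in the live `collections()` map with a lookup in `dropped_columns()`: re-adding a collection is rejected only when a *dropped* column of the same name had an incompatible type, since old data may still be on disk in the old format (see the referenced #6276). A minimal sketch of that check, where a type is just a name and "compatible" is plain equality standing in for `abstract_type::is_compatible_with()`:

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Illustrative stand-in for the schema's dropped-columns record.
struct dropped_column {
    std::string type;
};

// Rejects re-adding a collection column whose previously dropped
// incarnation had an incompatible type, mirroring the patched logic.
void check_readd(const std::map<std::string, dropped_column>& dropped,
                 const std::string& name, const std::string& new_type) {
    auto i = dropped.find(name);
    if (i != dropped.end() && i->second.type != new_type) {
        throw std::invalid_argument(
            "Cannot add a collection with the name " + name +
            " because a collection with the same name and a different type"
            " has already been used in the past");
    }
}
```

A column never dropped, or re-added with the same type, passes; only the incompatible re-add throws.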


@@ -78,9 +78,8 @@ public:
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}
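The `prepare()` signature change from `shared_ptr<prepared>` to `std::unique_ptr<prepared>` recurs in nearly every file in this series: a freshly prepared statement has exactly one owner (the caller, which may later hand it to a cache), so `unique_ptr` documents that and drops reference-counting overhead. A tiny sketch of the pattern, with an illustrative `prepared_statement` type:

```cpp
#include <cassert>
#include <memory>

// Illustrative stand-in for cql3::statements::prepared_statement.
struct prepared_statement {
    int bound_terms = 0;
};

// New shape: single ownership is expressed in the return type; callers
// that need shared ownership can still convert explicitly later.
std::unique_ptr<prepared_statement> prepare() {
    return std::make_unique<prepared_statement>();
}
```

A caller can promote to shared ownership with `std::shared_ptr<prepared_statement> sp = prepare();` when caching requires it, but the default stays exclusive.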


@@ -43,6 +43,7 @@
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "boost/range/adaptor/map.hpp"
#include "stdx.hh"
namespace cql3 {
@@ -71,29 +72,19 @@ void alter_type_statement::validate(distributed<service::storage_proxy>& proxy,
// It doesn't really change anything anyway.
}
shared_ptr<transport::event::schema_change> alter_type_statement::change_event()
{
using namespace transport;
return make_shared<transport::event::schema_change>(event::schema_change::change_type::UPDATED,
event::schema_change::target_type::TYPE,
keyspace(),
_name.get_string_type_name());
}
const sstring& alter_type_statement::keyspace() const
{
return _name.get_keyspace();
}
static int32_t get_idx_of_field(user_type type, shared_ptr<column_identifier> field)
static stdx::optional<uint32_t> get_idx_of_field(user_type type, shared_ptr<column_identifier> field)
{
for (uint32_t i = 0; i < type->field_names().size(); ++i) {
if (field->name() == type->field_names()[i]) {
return i;
return {i};
}
}
return -1;
return {};
}
void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, bool is_local_only)
@@ -114,19 +105,19 @@ void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, b
for (auto&& schema : ks.metadata()->cf_meta_data() | boost::adaptors::map_values) {
auto cfm = schema_builder(schema);
bool modified = false;
for (auto&& column : schema->all_columns() | boost::adaptors::map_values) {
auto t_opt = column->type->update_user_type(updated);
for (auto&& column : schema->all_columns()) {
auto t_opt = column.type->update_user_type(updated);
if (t_opt) {
modified = true;
// We need to update this column
cfm.with_altered_column_type(column->name(), *t_opt);
cfm.with_altered_column_type(column.name(), *t_opt);
}
}
if (modified) {
if (schema->is_view()) {
service::get_local_migration_manager().announce_view_update(view_ptr(cfm.build()), is_local_only).get();
} else {
service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, is_local_only).get();
service::get_local_migration_manager().announce_column_family_update(cfm.build(), false, {}, is_local_only).get();
}
}
}
@@ -144,14 +135,19 @@ void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, b
}
}
future<bool> alter_type_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_type_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
{
return seastar::async([this, &proxy, is_local_only] {
auto&& db = proxy.local().get_db().local();
try {
auto&& ks = db.find_keyspace(keyspace());
do_announce_migration(db, ks, is_local_only);
return true;
using namespace cql_transport;
return make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,
event::schema_change::target_type::TYPE,
keyspace(),
_name.get_string_type_name());
} catch (no_such_keyspace& e) {
throw exceptions::invalid_request_exception(sprint("Cannot alter type in unknown keyspace %s", keyspace()));
}
@@ -168,7 +164,7 @@ alter_type_statement::add_or_alter::add_or_alter(const ut_name& name, bool is_ad
user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_update) const
{
if (get_idx_of_field(to_update, _field_name) >= 0) {
if (get_idx_of_field(to_update, _field_name)) {
throw exceptions::invalid_request_exception(sprint("Cannot add new field %s to type %s: a field of the same name already exists", _field_name->name(), _name.to_string()));
}
@@ -185,19 +181,19 @@ user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_
user_type alter_type_statement::add_or_alter::do_alter(database& db, user_type to_update) const
{
uint32_t idx = get_idx_of_field(to_update, _field_name);
if (idx < 0) {
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, _field_name);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", _field_name->name(), _name.to_string()));
}
auto previous = to_update->field_types()[idx];
auto previous = to_update->field_types()[*idx];
auto new_type = _field_type->prepare(db, keyspace())->get_type();
if (!new_type->is_compatible_with(*previous)) {
throw exceptions::invalid_request_exception(sprint("Type %s is incompatible with previous type %s of field %s in user type %s", _field_type->to_string(), previous->as_cql3_type()->to_string(), _field_name->name(), _name.to_string()));
}
std::vector<data_type> new_types(to_update->field_types());
new_types[idx] = new_type;
new_types[*idx] = new_type;
return user_type_impl::get_instance(to_update->_keyspace, to_update->_name, to_update->field_names(), std::move(new_types));
}
@@ -221,25 +217,25 @@ user_type alter_type_statement::renames::make_updated_type(database& db, user_ty
std::vector<bytes> new_names(to_update->field_names());
for (auto&& rename : _renames) {
auto&& from = rename.first;
int32_t idx = get_idx_of_field(to_update, from);
if (idx < 0) {
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, from);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", from->to_string(), _name.to_string()));
}
new_names[idx] = rename.second->name();
new_names[*idx] = rename.second->name();
}
auto&& updated = user_type_impl::get_instance(to_update->_keyspace, to_update->_name, std::move(new_names), to_update->field_types());
create_type_statement::check_for_duplicate_names(updated);
return updated;
}
shared_ptr<cql3::statements::prepared_statement>
std::unique_ptr<cql3::statements::prepared_statement>
alter_type_statement::add_or_alter::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_type_statement::add_or_alter>(*this));
return std::make_unique<prepared_statement>(make_shared<alter_type_statement::add_or_alter>(*this));
}
shared_ptr<cql3::statements::prepared_statement>
std::unique_ptr<cql3::statements::prepared_statement>
alter_type_statement::renames::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_type_statement::renames>(*this));
return std::make_unique<prepared_statement>(make_shared<alter_type_statement::renames>(*this));
}
}
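The `get_idx_of_field()` change above is more than style: the old caller in `do_alter` stored the result in a `uint32_t` and tested `if (idx < 0)`, a comparison that is always false for an unsigned type, so the "Unknown field" branch was unreachable. Returning an empty optional instead of a `-1` sentinel makes the not-found case impossible to ignore. A runnable sketch, with `std::optional` standing in for `stdx::optional`:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Returns the index of `field` in `field_names`, or an empty optional if
// absent. Unlike a -1 sentinel, the empty state survives unsigned storage.
std::optional<uint32_t>
get_idx_of_field(const std::vector<std::string>& field_names,
                 const std::string& field) {
    for (uint32_t i = 0; i < field_names.size(); ++i) {
        if (field == field_names[i]) {
            return i;
        }
    }
    return std::nullopt;
}
```

Callers now write `if (!idx) { throw ...; }` and dereference with `*idx`, exactly as the patched `do_alter` and `renames` code does.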


@@ -61,11 +61,9 @@ public:
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual const sstring& keyspace() const override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
class add_or_alter;
class renames;
@@ -84,7 +82,7 @@ public:
const shared_ptr<column_identifier> field_name,
const shared_ptr<cql3_type::raw> field_type);
virtual user_type make_updated_type(database& db, user_type to_update) const override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
private:
user_type do_add(database& db, user_type to_update) const;
user_type do_alter(database& db, user_type to_update) const;
@@ -101,7 +99,7 @@ public:
void add_rename(shared_ptr<column_identifier> previous_name, shared_ptr<column_identifier> new_name);
virtual user_type make_updated_type(database& db, user_type to_update) const override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -92,7 +92,7 @@ future<> cql3::statements::alter_user_statement::check_access(const service::cli
});
}
future<::shared_ptr<transport::messages::result_message>>
future<::shared_ptr<cql_transport::messages::result_message>>
cql3::statements::alter_user_statement::execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) {
return auth::auth::is_existing_user(_username).then([this](bool exists) {
if (!exists) {
@@ -104,7 +104,7 @@ cql3::statements::alter_user_statement::execute(distributed<service::storage_pro
return auth::auth::insert_user(_username, *_superuser);
});
}
return f.then([] { return make_ready_future<::shared_ptr<transport::messages::result_message>>(); });
return f.then([] { return make_ready_future<::shared_ptr<cql_transport::messages::result_message>>(); });
});
}


@@ -62,7 +62,7 @@ public:
void validate(distributed<service::storage_proxy>&, const service::client_state&) override;
future<> check_access(const service::client_state&) override;
future<::shared_ptr<transport::messages::result_message>> execute(distributed<service::storage_proxy>&
future<::shared_ptr<cql_transport::messages::result_message>> execute(distributed<service::storage_proxy>&
, service::query_state&
, const query_options&) override;
};


@@ -43,6 +43,7 @@
#include "cql3/statements/prepared_statement.hh"
#include "service/migration_manager.hh"
#include "validation.hh"
#include "view_info.hh"
namespace cql3 {
@@ -72,7 +73,7 @@ void alter_view_statement::validate(distributed<service::storage_proxy>&, const
// validated in announce_migration()
}
future<bool> alter_view_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
{
auto&& db = proxy.local().get_db().local();
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
@@ -96,24 +97,27 @@ future<bool> alter_view_statement::announce_migration(distributed<service::stora
"low might cause undelivered updates to expire before being replayed.");
}
return service::get_local_migration_manager().announce_view_update(view_ptr(builder.build()), is_local_only).then([] {
return true;
if (builder.default_time_to_live().count() > 0) {
throw exceptions::invalid_request_exception(
"Cannot set or alter default_time_to_live for a materialized view. "
"Data in a materialized view always expires at the same time as "
"the corresponding data in the parent table.");
}
return service::get_local_migration_manager().announce_view_update(view_ptr(builder.build()), is_local_only).then([this] {
using namespace cql_transport;
return make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,
event::schema_change::target_type::TABLE,
keyspace(),
column_family());
});
}
shared_ptr<transport::event::schema_change> alter_view_statement::change_event()
{
using namespace transport;
return make_shared<event::schema_change>(event::schema_change::change_type::UPDATED,
event::schema_change::target_type::TABLE,
keyspace(),
column_family());
}
shared_ptr<cql3::statements::prepared_statement>
std::unique_ptr<cql3::statements::prepared_statement>
alter_view_statement::prepare(database& db, cql_stats& stats) {
return make_shared<prepared_statement>(make_shared<alter_view_statement>(*this));
return std::make_unique<prepared_statement>(make_shared<alter_view_statement>(*this));
}
}


@@ -63,11 +63,9 @@ public:
virtual void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
virtual future<bool> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual shared_ptr<transport::event::schema_change> change_event() override;
virtual shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};
}


@@ -46,9 +46,9 @@ uint32_t cql3::statements::authentication_statement::get_bound_terms() {
return 0;
}
::shared_ptr<cql3::statements::prepared_statement> cql3::statements::authentication_statement::prepare(
std::unique_ptr<cql3::statements::prepared_statement> cql3::statements::authentication_statement::prepare(
database& db, cql_stats& stats) {
return ::make_shared<prepared>(this->shared_from_this());
return std::make_unique<prepared>(this->shared_from_this());
}
bool cql3::statements::authentication_statement::uses_function(
@@ -75,7 +75,7 @@ future<> cql3::statements::authentication_statement::check_access(const service:
return make_ready_future<>();
}
future<::shared_ptr<transport::messages::result_message>> cql3::statements::authentication_statement::execute_internal(
future<::shared_ptr<cql_transport::messages::result_message>> cql3::statements::authentication_statement::execute_internal(
distributed<service::storage_proxy>& proxy,
service::query_state& state, const query_options& options) {
// Internal queries are exclusively on the system keyspace and makes no sense here


@@ -54,7 +54,7 @@ class authentication_statement : public raw::parsed_statement, public cql_statem
public:
uint32_t get_bound_terms() override;
::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
@@ -66,7 +66,7 @@ public:
void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
future<::shared_ptr<transport::messages::result_message>>
future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) override;
};


@@ -46,9 +46,9 @@ uint32_t cql3::statements::authorization_statement::get_bound_terms() {
return 0;
}
::shared_ptr<cql3::statements::prepared_statement> cql3::statements::authorization_statement::prepare(
std::unique_ptr<cql3::statements::prepared_statement> cql3::statements::authorization_statement::prepare(
database& db, cql_stats& stats) {
return ::make_shared<parsed_statement::prepared>(this->shared_from_this());
return std::make_unique<parsed_statement::prepared>(this->shared_from_this());
}
bool cql3::statements::authorization_statement::uses_function(
@@ -75,7 +75,7 @@ future<> cql3::statements::authorization_statement::check_access(const service::
return make_ready_future<>();
}
future<::shared_ptr<transport::messages::result_message>> cql3::statements::authorization_statement::execute_internal(
future<::shared_ptr<cql_transport::messages::result_message>> cql3::statements::authorization_statement::execute_internal(
distributed<service::storage_proxy>& proxy,
service::query_state& state, const query_options& options) {
// Internal queries are exclusively on the system keyspace and makes no sense here


@@ -58,7 +58,7 @@ class authorization_statement : public raw::parsed_statement, public cql_stateme
public:
uint32_t get_bound_terms() override;
::shared_ptr<prepared> prepare(database& db, cql_stats& stats) override;
std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
@@ -70,7 +70,7 @@ public:
void validate(distributed<service::storage_proxy>&, const service::client_state& state) override;
future<::shared_ptr<transport::messages::result_message>>
future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) override;
protected:


@@ -40,6 +40,7 @@
#include "batch_statement.hh"
#include "raw/batch_statement.hh"
#include "db/config.hh"
#include <seastar/core/execution_stage.hh>
namespace {
@@ -77,6 +78,14 @@ batch_statement::batch_statement(int bound_terms, type type_,
{
}
batch_statement::batch_statement(type type_,
std::vector<shared_ptr<modification_statement>> statements,
std::unique_ptr<attributes> attrs,
cql_stats& stats)
: batch_statement(-1, type_, std::move(statements), std::move(attrs), stats)
{
}
bool batch_statement::uses_function(const sstring& ks_name, const sstring& function_name) const
{
return _attrs->uses_function(ks_name, function_name)
@@ -149,6 +158,13 @@ void batch_statement::validate()
| boost::adaptors::uniqued) != 1))) {
throw exceptions::invalid_request_exception("Batch with conditions cannot span multiple tables");
}
std::experimental::optional<bool> raw_counter;
for (auto& s : _statements) {
if (raw_counter && s->is_raw_counter_shard_write() != *raw_counter) {
throw exceptions::invalid_request_exception("Cannot mix raw and regular counter statements in batch");
}
raw_counter = s->is_raw_counter_shard_write();
}
}
void batch_statement::validate(distributed<service::storage_proxy>& proxy, const service::client_state& state)
@@ -200,7 +216,12 @@ future<std::vector<mutation>> batch_statement::get_mutations(distributed<service
}
void batch_statement::verify_batch_size(const std::vector<mutation>& mutations) {
if (mutations.size() <= 1) {
return; // We only warn for batch spanning multiple mutations
}
size_t warn_threshold = service::get_local_storage_proxy().get_db().local().get_config().batch_size_warn_threshold_in_kb() * 1024;
size_t fail_threshold = service::get_local_storage_proxy().get_db().local().get_config().batch_size_fail_threshold_in_kb() * 1024;
class my_partition_visitor : public mutation_partition_visitor {
public:
@@ -212,7 +233,7 @@ void batch_statement::verify_batch_size(const std::vector<mutation>& mutations)
size += v.data.size();
}
void accept_row_tombstone(const range_tombstone&) override {}
void accept_row(clustering_key_view, tombstone, const row_marker&) override {}
void accept_row(position_in_partition_view, const row_tombstone&, const row_marker&, is_dummy, is_continuous) override {}
void accept_row_cell(column_id, atomic_cell_view v) override {
size += v.value().size();
}
@@ -230,24 +251,36 @@ void batch_statement::verify_batch_size(const std::vector<mutation>& mutations)
}
if (v.size > warn_threshold) {
std::unordered_set<sstring> ks_cf_pairs;
for (auto&& m : mutations) {
ks_cf_pairs.insert(m.schema()->ks_name() + "." + m.schema()->cf_name());
auto error = [&] (const char* type, size_t threshold) -> sstring {
std::unordered_set<sstring> ks_cf_pairs;
for (auto&& m : mutations) {
ks_cf_pairs.insert(m.schema()->ks_name() + "." + m.schema()->cf_name());
}
return sprint("Batch of prepared statements for %s is of size %d, exceeding specified %s threshold of %d by %d.",
join(", ", ks_cf_pairs), v.size, type, threshold, v.size - threshold);
};
if (v.size > fail_threshold) {
_logger.error(error("FAIL", fail_threshold).c_str());
throw exceptions::invalid_request_exception("Batch too large");
} else {
_logger.warn(error("WARN", warn_threshold).c_str());
}
_logger.warn(
"Batch of prepared statements for {} is of size {}, exceeding specified threshold of {} by {}.{}",
join(", ", ks_cf_pairs), v.size, warn_threshold,
v.size - warn_threshold, "");
}
}
future<shared_ptr<transport::messages::result_message>> batch_statement::execute(
struct batch_statement_executor {
static auto get() { return &batch_statement::do_execute; }
};
static thread_local auto batch_stage = seastar::make_execution_stage("cql3_batch", batch_statement_executor::get());
future<shared_ptr<cql_transport::messages::result_message>> batch_statement::execute(
distributed<service::storage_proxy>& storage, service::query_state& state, const query_options& options) {
++_stats.batches;
return execute(storage, state, options, false, options.get_timestamp(state));
return batch_stage(this, seastar::ref(storage), seastar::ref(state),
seastar::cref(options), false, options.get_timestamp(state));
}
future<shared_ptr<transport::messages::result_message>> batch_statement::execute(
future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_execute(
distributed<service::storage_proxy>& storage,
service::query_state& query_state, const query_options& options,
bool local, api::timestamp_type now)
@@ -266,8 +299,8 @@ future<shared_ptr<transport::messages::result_message>> batch_statement::execute
return get_mutations(storage, options, local, now, query_state.get_trace_state()).then([this, &storage, &options, tr_state = query_state.get_trace_state()] (std::vector<mutation> ms) mutable {
return execute_without_conditions(storage, std::move(ms), options.get_consistency(), std::move(tr_state));
}).then([] {
return make_ready_future<shared_ptr<transport::messages::result_message>>(
make_shared<transport::messages::result_message::void_message>());
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(
make_shared<cql_transport::messages::result_message::void_message>());
});
}
@@ -305,7 +338,7 @@ future<> batch_statement::execute_without_conditions(
return storage.local().mutate_with_triggers(std::move(mutations), cl, mutate_atomic, std::move(tr_state));
}
future<shared_ptr<transport::messages::result_message>> batch_statement::execute_with_conditions(
future<shared_ptr<cql_transport::messages::result_message>> batch_statement::execute_with_conditions(
distributed<service::storage_proxy>& storage,
const query_options& options,
service::query_state& state)
@@ -358,7 +391,7 @@ future<shared_ptr<transport::messages::result_message>> batch_statement::execute
#endif
}
future<shared_ptr<transport::messages::result_message>> batch_statement::execute_internal(
future<shared_ptr<cql_transport::messages::result_message>> batch_statement::execute_internal(
distributed<service::storage_proxy>& proxy,
service::query_state& query_state, const query_options& options)
{
@@ -377,12 +410,22 @@ future<shared_ptr<transport::messages::result_message>> batch_statement::execute
namespace raw {
shared_ptr<prepared_statement>
std::unique_ptr<prepared_statement>
batch_statement::prepare(database& db, cql_stats& stats) {
auto&& bound_names = get_bound_variables();
stdx::optional<sstring> first_ks;
stdx::optional<sstring> first_cf;
bool have_multiple_cfs = false;
std::vector<shared_ptr<cql3::statements::modification_statement>> statements;
for (auto&& parsed : _parsed_statements) {
if (!first_ks) {
first_ks = parsed->keyspace();
first_cf = parsed->column_family();
} else {
have_multiple_cfs = first_ks.value() != parsed->keyspace() || first_cf.value() != parsed->column_family();
}
statements.push_back(parsed->prepare(db, bound_names, stats));
}
@@ -392,8 +435,13 @@ batch_statement::prepare(database& db, cql_stats& stats) {
     cql3::statements::batch_statement batch_statement_(bound_names->size(), _type, std::move(statements), std::move(prep_attrs), stats);
     batch_statement_.validate();
-    return ::make_shared<prepared>(make_shared(std::move(batch_statement_)),
-                                                     bound_names->get_specifications());
+    std::vector<uint16_t> partition_key_bind_indices;
+    if (!have_multiple_cfs && batch_statement_.get_statements().size() > 0) {
+        partition_key_bind_indices = bound_names->get_partition_key_bind_indexes(batch_statement_.get_statements()[0]->s);
+    }
+    return std::make_unique<prepared>(make_shared(std::move(batch_statement_)),
+            bound_names->get_specifications(),
+            std::move(partition_key_bind_indices));
 }
 }

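The `prepare()` change above only computes partition-key bind indices when every statement in the batch targets the same keyspace and table. A minimal standalone sketch of that single-table check follows; the type and function names are simplified stand-ins, not Scylla's, and the flag here is made sticky so a later matching statement cannot clear a mismatch seen earlier:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Simplified stand-in for a parsed batch statement: just the table it targets.
struct parsed_statement {
    std::string keyspace;
    std::string column_family;
};

// Mirrors the single-table check in batch_statement::prepare(): remember the
// first keyspace/table seen and flag any statement that targets a different one.
bool targets_multiple_tables(const std::vector<parsed_statement>& stmts) {
    std::optional<std::string> first_ks;
    std::optional<std::string> first_cf;
    bool have_multiple_cfs = false;
    for (const auto& parsed : stmts) {
        if (!first_ks) {
            first_ks = parsed.keyspace;
            first_cf = parsed.column_family;
        } else if (first_ks.value() != parsed.keyspace || first_cf.value() != parsed.column_family) {
            have_multiple_cfs = true;
        }
    }
    return have_multiple_cfs;
}
```

Only when this check fails does `prepare()` ask `bound_names` for the first statement's partition-key bind indexes; a driver can then route the whole prepared batch to a replica that owns the partition.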
cql3/statements/batch_statement.hh

@@ -87,6 +87,11 @@ public:
                     std::unique_ptr<attributes> attrs,
                     cql_stats& stats);
+    batch_statement(type type_,
+                    std::vector<shared_ptr<modification_statement>> statements,
+                    std::unique_ptr<attributes> attrs,
+                    cql_stats& stats);
     virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
     virtual bool depends_on_keyspace(const sstring& ks_name) const override;
@@ -115,10 +120,11 @@ public:
      */
     static void verify_batch_size(const std::vector<mutation>& mutations);
-    virtual future<shared_ptr<transport::messages::result_message>> execute(
+    virtual future<shared_ptr<cql_transport::messages::result_message>> execute(
             distributed<service::storage_proxy>& storage, service::query_state& state, const query_options& options) override;
 private:
-    future<shared_ptr<transport::messages::result_message>> execute(
+    friend class batch_statement_executor;
+    future<shared_ptr<cql_transport::messages::result_message>> do_execute(
             distributed<service::storage_proxy>& storage,
             service::query_state& query_state, const query_options& options,
             bool local, api::timestamp_type now);
@@ -129,12 +135,12 @@ private:
             db::consistency_level cl,
             tracing::trace_state_ptr tr_state);
-    future<shared_ptr<transport::messages::result_message>> execute_with_conditions(
+    future<shared_ptr<cql_transport::messages::result_message>> execute_with_conditions(
             distributed<service::storage_proxy>& storage,
             const query_options& options,
             service::query_state& state);
 public:
-    virtual future<shared_ptr<transport::messages::result_message>> execute_internal(
+    virtual future<shared_ptr<cql_transport::messages::result_message>> execute_internal(
             distributed<service::storage_proxy>& proxy,
             service::query_state& query_state, const query_options& options) override;

cql3/statements/cf_prop_defs.cc

@@ -41,6 +41,8 @@
 #include "cql3/statements/cf_prop_defs.hh"
+#include <boost/algorithm/string/predicate.hpp>
 namespace cql3 {
 namespace statements {
@@ -61,9 +63,12 @@ const sstring cf_prop_defs::KW_MEMTABLE_FLUSH_PERIOD = "memtable_flush_period_in
 const sstring cf_prop_defs::KW_COMPACTION = "compaction";
 const sstring cf_prop_defs::KW_COMPRESSION = "compression";
+const sstring cf_prop_defs::KW_CRC_CHECK_CHANCE = "crc_check_chance";
 const sstring cf_prop_defs::COMPACTION_STRATEGY_CLASS_KEY = "class";
+const sstring cf_prop_defs::COMPACTION_ENABLED_KEY = "enabled";
 void cf_prop_defs::validate() {
     // Skip validation if the comapction strategy class is already set as it means we've alreayd
     // prepared (and redoing it would set strategyClass back to null, which we don't want)
@@ -76,7 +81,7 @@ void cf_prop_defs::validate() {
             KW_GCGRACESECONDS, KW_CACHING, KW_DEFAULT_TIME_TO_LIVE,
             KW_MIN_INDEX_INTERVAL, KW_MAX_INDEX_INTERVAL, KW_SPECULATIVE_RETRY,
             KW_BF_FP_CHANCE, KW_MEMTABLE_FLUSH_PERIOD, KW_COMPACTION,
-            KW_COMPRESSION,
+            KW_COMPRESSION, KW_CRC_CHECK_CHANCE
     });
     static std::set<sstring> obsolete_keywords({
         sstring("index_interval"),
@@ -187,6 +192,13 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder) {
     builder.set_min_compaction_threshold(min_compaction_threshold);
     builder.set_max_compaction_threshold(max_compaction_threshold);
+    if (has_property(KW_COMPACTION)) {
+        if (get_compaction_options().count(COMPACTION_ENABLED_KEY)) {
+            auto enabled = boost::algorithm::iequals(get_compaction_options().at(COMPACTION_ENABLED_KEY), "true");
+            builder.set_compaction_enabled(enabled);
+        }
+    }
     builder.set_default_time_to_live(gc_clock::duration(get_int(KW_DEFAULT_TIME_TO_LIVE, DEFAULT_DEFAULT_TIME_TO_LIVE)));
     if (has_property(KW_SPECULATIVE_RETRY)) {

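The `apply_to_builder()` hunk above only toggles compaction when the options map actually carries the `enabled` key, and it compares the value against "true" case-insensitively via `boost::algorithm::iequals`. A small standalone sketch of that decision, using a plain `std::map` and a local `iequals` stand-in rather than Scylla's property classes:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Stand-in for boost::algorithm::iequals as used in the diff:
// case-insensitive string equality.
bool iequals(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(), [](unsigned char x, unsigned char y) {
               return std::tolower(x) == std::tolower(y);
           });
}

// Mirrors the new logic: compaction stays enabled unless the options map
// carries an "enabled" key whose value is not (case-insensitively) "true".
bool compaction_enabled(const std::map<std::string, std::string>& compaction_options) {
    auto it = compaction_options.find("enabled");
    if (it == compaction_options.end()) {
        return true;  // key absent: the builder default is left untouched
    }
    return iequals(it->second, "true");
}
```

This is the plumbing behind disabling automatic compaction per table from CQL, e.g. a schema option along the lines of `compaction = {'class': ..., 'enabled': 'false'}`.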
cql3/statements/cf_prop_defs.hh

@@ -70,8 +70,10 @@ public:
     static const sstring KW_COMPACTION;
     static const sstring KW_COMPRESSION;
+    static const sstring KW_CRC_CHECK_CHANCE;
     static const sstring COMPACTION_STRATEGY_CLASS_KEY;
+    static const sstring COMPACTION_ENABLED_KEY;
     // FIXME: In origin the following consts are in CFMetaData.
     static constexpr int32_t DEFAULT_DEFAULT_TIME_TO_LIVE = 0;

Some files were not shown because too many files have changed in this diff.