Compare commits

...

3328 Commits

Author SHA1 Message Date
Gleb Natapov
2d630e068b mutation_query_test: add test for result size calculation
Check that digest only and digest+data query calculate result size to be
the same.

Message-Id: <20180906153800.GK2326@scylladb.com>
(cherry picked from commit 9e438933a2)
2018-09-08 18:55:23 +03:00
Gleb Natapov
5a8e9698d8 mutation_partition: accurately account for result size in digest only queries
When measuring_output_stream is used to calculate result's element size
it incorrectly takes into account not only serialized element size, but
a placeholder that ser::qr_partition__rows/qr_partition__static_row__cells
constructors puts in the beginning. Fix it by taking starting point in a
stream before element serialization and subtracting it afterwords.

Fixes #3755

Message-Id: <20180906153609.GJ2326@scylladb.com>
(cherry picked from commit d7674288a9)
2018-09-08 18:55:23 +03:00
Gleb Natapov
64f1aa8d99 mutation_partition: correctly measure static row size when doing digest calculation
The code uses incorrect output stream in case only digest is requested
and thus getting incorrect data size. Failing to correctly account
for static row size while calculating digest may cause digest mismatch
between digest and data query.

Fixes #3753.

Message-Id: <20180905131219.GD2326@scylladb.com>
(cherry picked from commit 98092353df)
2018-09-06 16:51:31 +03:00
Eliran Sinvani
280e6eedb9 cql3: ensure repeated values in IN clauses don't return repeated rows
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated since the assumption
is that each value in the IN list corresponds to a different partition.
this results in the same row appearing in the result number times
corresponding to the duplication of the partition value.

Added queries for the in restriction unitest and fixed with a bad result check.

Fixes #2837
Tests: Queries as in the usecase from the GitHub issue in both forms ,
prepared and plain (using python driver),Unitest.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
(cherry picked from commit d734d316a6)
2018-08-26 15:52:18 +03:00
Tomasz Grabiec
f80f15a6af Merge 'Fix multi-cell static list updates in the presence of ckeys' from Duarte
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.

This is the code that was removed from some list operations:

 std::experimental::optional<clustering_key> row_key;
 if (!column.is_static()) {
   row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
 }
 ...
 auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);

Put it back, in the form of common code in the update_parameters class.

Fixes #3703

* https://github.com/duarten/scylla cql-list-fixes/v1:
  tests/cql_query_test: Test multi-cell static list updates with ckeys
  cql3/lists: Fix multi-cell static list updates in the presence of ckeys
  keys: Add factory for an empty clustering_key_prefix_view

(cherry picked from commit 6937cc2d1c)
2018-08-21 17:37:36 +01:00
Duarte Nunes
d0eb0c0b90 cql3/query_options: Use _value_views in prepare()
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called
transport::request::read_options() leaves _values completely empty.

In query_options::prepare() we were, however, using _values to
associated values to the client-specified column names, and not
_value_views. Fix this by using _value_views instead.

As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x04 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.

Fixes #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
(cherry picked from commit a4355fe7e7)
2018-08-21 18:24:06 +03:00
Jesse Haber-Kucharsky
1427c4d428 auth: Don't use unsupported hashing algorithms
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.

This is consistent with the documentation of the function in its man
page.

As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".

The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):

    Some implementations return `NULL` on failure, and others return an
    _invalid_ hashed passphrase, which will begin with a `*` and will
    not be the same as SALT.

Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.

With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.

Fixes #3637.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
(cherry picked from commit fce10f2c6e)
2018-08-05 10:30:58 +03:00
Gleb Natapov
034f2cb42d cache_hitrate_calculator: fix race when new table is added during calculations
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.

Fixes #3636

Message-Id: <20180801093147.GD23569@scylladb.com>
(cherry picked from commit 44a6afad8c)
2018-08-01 14:30:58 +03:00
Amos Kong
e043a5c276 scylla_setup: fix conditional statement of silent mode
Commit 300af65555 introdued a problem in
conditional statement, script will always abort in silent mode, it doesn't
care about the return value.

Fixes #3485

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <1c12ab04651352964a176368f8ee28f19ae43c68.1528077114.git.amos@scylladb.com>
(cherry picked from commit 364c2551c8)
2018-07-25 12:34:11 +03:00
Takuya ASADA
5da9bd3a6e dist/common/scripts/scylla_setup: abort running script when one of setup failed in silent mode
Current script silently continues even one of setup fails, need to
abort.

Fixes #3433

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180522180355.1648-1-syuu@scylladb.com>
(cherry picked from commit 300af65555)
2018-07-25 12:34:11 +03:00
Avi Kivity
3578027e2e Merge "row_cache: Fix violation of continuity on concurrent eviction and population" from Tomasz
"
The problem happens under the following circumstances:

  - we have a partially populated partition in cache, with a gap in the middle

  - a read with no clustering restrictions trying to populate that gap

  - eviction of the entry for the lower bound of the gap concurrent with population

The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.

Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.

The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise, we're populating since _last_row and should
consult it.

Fixes #3608.
"

* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
  row_cache: Fix violation of continuity on concurrent eviction and population
  position_in_partition: Introduce is_before_all_clustered_rows()

(cherry picked from commit 31151cadd4)
2018-07-25 12:34:11 +03:00
Shlomi Livne
7d2150a057 release: prepare for 2.1.6
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-07-01 22:35:26 +03:00
Avi Kivity
afd3c571cc Merge "Backport Disable sstable filtering based on min/max clustering key components" to 2.1" from Tomasz
"
Changes made:
  - switched the test to use do_with_cql_env_thread due to lack of SEASTAR_TEST_CASE_THREAD macro
  - imported make_local_key() from master, needed for the database_test to pass
"

* tag 'tgrabiec/disable-min-max-sstable-filtering-v1-branch-2.1' of github.com:tgrabiec/scylla:
  Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz
  tests: simple_schema: Generate local keys form make_pkeys()
  tests: Import make_local_key() from master
2018-06-28 12:41:00 +03:00
Avi Kivity
093c8512db Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost. Until node restart or eviction of the
partition entry.

There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.

Disable until a more elaborate fix is implemented.

Fixes #3552
Fixes #3553
"

* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
  tests: Add test for slicing a mutation source with date tiered compaction strategy
  tests: Check that database conforms to mutation source
  database: Disable sstable filtering based on min/max clustering key components

(cherry picked from commit e1efda8b0c)
2018-06-28 11:10:41 +02:00
Tomasz Grabiec
9c0b8ec736 tests: simple_schema: Generate local keys form make_pkeys()
Extracted from commit 2b0b703615
2018-06-28 11:10:41 +02:00
Tomasz Grabiec
1794b732b0 tests: Import make_local_key() from master
Imported from master at 8a25bd467c69df94ea3f3638b42d36beee20adf0
2018-06-28 11:10:41 +02:00
Avi Kivity
c1ac4fb8b0 Update seastar submodule
* seastar 2a2c1d2...c89c8b8 (1):
  > tests/test-utils: Add macro for running tests within a seastar thread

Needed for tests in the following patch.
2018-06-28 10:00:05 +03:00
Asias He
2e7e59fb50 gossip: Fix tokens assignment in assassinate_endpoint
The tokens vector is defined a few lines above and is needed outsie the
if block.

Do not redefine it again in the if block, otherwise the tokens will be empty.

Found by code inspection.

Fixes #3551.

Message-Id: <c7a06375c65c950e94236571127f533e5a60cbfd.1530002177.git.asias@scylladb.com>
(cherry picked from commit c3b5a2ecd5)
2018-06-27 12:00:58 +03:00
Vladimir Krivopalov
af29d4bed3 Fix Scylla compilation with Crypto++ v6.
In Crypto++ v6, the `byte` typedef has been moved from the global
namespace to the `CryptoPP::` namespace.

This fix brings in the CryptoPP namespace so that the `byte` typedef is
seen with both old and new versions of Crypto++.

Fixes #3252.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <799d055be710231884d101a52c0be8ed8b0a9806.1520125889.git.vladimir@scylladb.com>
(cherry picked from commit 99bd5180ba)
2018-06-25 17:49:32 +03:00
Shlomi Livne
72494bbe05 release: prepare for 2.1.5
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-06-19 09:05:55 +03:00
Avi Kivity
5784823888 Update scylla-ami submodule
* dist/ami/files/scylla-ami c5d9e96...0df779d (1):
  > scylla_install_ami: Update CentOS to latest version

Fixes #3523.
2018-06-17 12:12:21 +03:00
Takuya ASADA
a7633be1a9 Revert "dist/ami: update CentOS base image to latest version"
This reverts commit 69d226625a.
Since ami-4bf3d731 is Market Place AMI, not possible to publish public AMI based on it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523112414.27307-1-syuu@scylladb.com>
(cherry picked from commit 55d6be9254)
2018-06-17 11:33:55 +03:00
Takuya ASADA
e78ded74ce dist/debian: add --jobs <njobs> option just like build_rpm.sh
On some build environment we may want to limit number of parallel jobs since
ninja-build runs ncpus jobs by default, it may too many since g++ eats very
huge memory.
So support --jobs <njobs> just like on rpm build script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180425205439.30053-1-syuu@scylladb.com>
(cherry picked from commit 782ebcece4)
2018-06-14 15:05:09 +03:00
Avi Kivity
6615c2a6a9 database: stop using incremental selectors
There is a bug in incremental_selector for partitioned_sstable_set, so
until it is found, stop using it.

This degrades scan performance of Leveled Compaction Strategy tables.

Fixes #3513. (as a workaround)
Introduced: 2.1
Message-Id: <20180613131547.19084-1-avi@scylladb.com>

(cherry picked from commit aeffbb6732)
2018-06-14 10:52:39 +03:00
Vlad Zolotarov
11500ccd3a locator::ec2_multi_region_snitch: don't call for ec2_snitch::gossiper_starting()
ec2_snitch::gossiper_starting() calls for the base class (default) method
that sets _gossip_started to TRUE and thereby prevents to following
reconnectable_snitch_helper registration.

Fixes #3454

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 2dde372ae6)
2018-06-14 10:52:39 +03:00
Shlomi Livne
955f3eeb56 release: prepare for 2.1.4
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-06-06 11:27:01 +03:00
Avi Kivity
08bfd96774 Update seastar submodule
* seastar 675acd5...2a2c1d2 (1):
  > tls: Ensure handshake always drains output before return/throw

Fixes #3461.
2018-05-31 12:06:13 +03:00
Mika Eloranta' via ScyllaDB development
f6c4d558eb build: fix rpm build script --jobs N handling
Fixes argument misquoting at $SRPM_OPTS expansion for the mock commands
and makes the --jobs argument work as supposed.

Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212904.85907-1-mel@aiven.io>
(cherry picked from commit 7266446227)
2018-05-27 10:25:26 +03:00
Avi Kivity
0040ff6de2 Update seastar submodule
* seastar 0e6dcd5...675acd5 (1):
  > net/tls: Wait for output to be sent when shutting down

Fixes #3459.
2018-05-24 12:03:10 +03:00
Glauber Costa
c238bc7a81 commitlog: don't move pointer to segment
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.

The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.

Probably #3440.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
(cherry picked from commit 596a525950)
2018-05-19 19:13:58 +03:00
Avi Kivity
3b984a4293 dist: redhat: get rid of raid0.devices_discard_performance
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed on 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it cause the module load to fail, with the result that the
storage is not available.

Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.

Fixes #3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>

(cherry picked from commit 3b8118d4e5)
2018-05-19 19:13:58 +03:00
Takuya ASADA
156761d77e dist/ami: update CentOS base image to latest version
Since we requires updated version of systemd, we need to update CentOS base
image.

Fixes #3184

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1518118694-23770-1-git-send-email-syuu@scylladb.com>

Conflicts:
	dist/ami/build_ami.sh

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180508083521.18661-1-syuu@scylladb.com>
2018-05-19 19:13:58 +03:00
Avi Kivity
8e33e80ad3 release: prepare for 2.1.3 2018-04-25 09:01:30 +03:00
Duarte Nunes
c35dd86c87 db/schema_tables: Only drop UDTs after merging tables
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.

When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.

Fixes #3068

Tests: unit-tests (release)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
(cherry picked from commit 1e3fae5bef)
2018-04-25 01:15:25 +03:00
Pekka Enberg
87cb8a1fa4 release: prepare for 2.1.2 2018-04-17 09:45:00 +03:00
Takuya ASADA
26f3340c32 dist/debian: use ~root as HOME to place .pbuilderrc
When 'always_set_home' is specified on /etc/sudoers pbuilder won't read
.pbuilderrc from current user home directory, and we don't have a way to change
the behavor from sudo command parameter.

So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo executed,
this can work both environment which does specified always_set_home and doesn't
specified.

Fixes #3366

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ace44784e8)
2018-04-17 09:39:15 +03:00
Avi Kivity
aaba093371 Update seastar submodule
* seastar af1b789...0e6dcd5 (1):
  > tls: Ensure we always pass through semaphores on shutdown

Fixes #3358.
2018-04-14 20:52:02 +03:00
Gleb Natapov
a64c6e6be9 cql_server: fix a race between closing of a connection and notifier registration
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete stale pointer to the connection will remain in notification
list since attempt to unregister the connection will happen to early.
The fix is to move notifier unregisteration after connection's gate
is closed which will ensure that there is no outstanding registration
request. But this means that now a connection with closed gate can be in
notifier list, so with_gate() may throw and abort a notifier loop. Fix
that by replacing with_gate() by call to is_closed();

Fixes: #3355
Tests: unit(release)

Message-Id: <20180412134744.GB22593@scylladb.com>
(cherry picked from commit 1a9aaece3e)
2018-04-12 16:57:18 +03:00
Duarte Nunes
c83d2d0d77 db/view: Reject view entries with non-composite, empty partition key
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is: a base table regular column, a
base table clustering key column, or a base table partition key column,
part of a composite key.

Fixes #3262
Refs CASSANDRA-14345

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
(cherry picked from commit ec8960df45)
2018-04-03 19:08:38 +03:00
Asias He
0aa49d0311 gossip: Relax generation max difference check
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node

I saw node2 could not bootstrap stuck at waiting for schema information to compelte for ever:

On node1, node3

    [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704

On node2

    [shard 0] storage_service - JOINING: waiting for schema information to complete

This is becasue in nodetool removenode operation, the generation of node1 was increased from 0 to 2.

   gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
   gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();

Each force_newer_generation_unsafe increases the generation by 1.

Here is an example,

Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
   "addrs": "127.0.0.2",
   "generation": 0,
   "is_alive": false,
   "update_time": 1521778757334,
   "version": 0
   },
```

After nodetool revmoenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
 {
     "addrs": "127.0.0.2",
     "application_state": [
         {
             "application_state": 0,
             "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
             "version": 214
         },
         {
             "application_state": 6,
             "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
             "version": 153
            }
     ],
     "generation": 2,
     "is_alive": false,
     "update_time": 1521779276246,
     "version": 0
 },
```

In gossiper::apply_state_locally, we have this check:

```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
  logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",ep, local_generation, remote_generation);

}
```
to skip the gossip update.

To fix, we relax generation max difference check to allow the generation
of a removed node.

After this patch, the removed node bootstraps successfully.

Tests: dtest:update_cluster_layout_tests.py
Fixes #3331

Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
(cherry picked from commit f539e993d3)
2018-04-03 19:08:38 +03:00
Shlomi Livne
cce455b1f5 release: prepare for 2.1.1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-03-25 09:32:02 +03:00
Avi Kivity
6772f3806b tests: mutation_source_test: fix scattering or partition tombstone
The partition tombstone is not part of a mutation_fragment in the old
streamed_mutation, so it was not scattered correctly by fragment_scatterer.
This causes test failures if the mutations to be scattered have a partition
tombstone.

Fix by calling consume(tombstone) directly. This isn't nice, but the code
is dead anyway.
2018-03-24 15:15:02 +03:00
Avi Kivity
6c9d699835 Merge "Fix abort during counter table read-on-delete" from Tomasz
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.

Fixes #3304.
"

* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test reads with no clustering ranges and no static columns
  tests: simple_schema: Allow creating schema with no static column
  clustering_ranges_walker: Stop after static row in case no clustering ranges

(cherry picked from commit 054854839a)
2018-03-23 10:47:23 +03:00
Vlad Zolotarov
a75e1632c8 test.py: limit the tests to run on 2 shards with 4GB of memory
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 57a6ed5aaa)
2018-03-22 12:45:25 +02:00
Jesse Haber-Kucharsky
c5718bf620 auth: Fix improper sharing of sharded service
This change is backported from 092f2e659c.

Previously, the sharded permissions cache was only accessible to the
implementation of `auth::service` in `auth/service.cc`. The intention
was that invoking `auth::service::get_permissions` on shard `k` would
query the cache on shard `k`, which would in turn depend on
`auth::service` on shard k to check for superuser status.

The problem is in `auth::service::start`.

`seastar::sharded<auth::permissions_cache>::start` is invoked with
`*this` of shard 0, causing all instances of the cache to reference the
same object.

I wasn't able to locally reproduce errors or crashes due to this bug
when I compiled a release build of Scylla. However, running a debug
build meant that the glorious `seastar::debug_shared_ptr_counter_type`
quickly saved the day with its checks that `seastar::shared_ptr` isn't
being misused.

To eliminate this problem, we move ownership of a single instance of
`auth::permissions_cache` to a single instance of `auth::service`. When
`auth::service` is sharded, so is the permissions cache.

I verified interactively that no assertions failed in debug mode with
this change.

Fixes #3296.

Tests: unit (debug, release)
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <280a889f551180db1c00d8a80eddf85b2ff0ac60.1521696176.git.jhaberku@scylladb.com>
2018-03-22 10:04:50 +02:00
Duarte Nunes
2315fcd6cf gms/gossiper: Synchronize endpoint state destruction
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.

This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map  (in all cores).

Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.

Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.

Fixes #3299.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
(cherry picked from commit 810db425a5)
2018-03-18 14:55:32 +02:00
Asias He
8c5464d2fd range_streamer: Stream 10% of ranges instead of 10 ranges per time
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plan to be created to stream data,
each having very few data. This cause each stream plan has low transfer
bandwidth, so that the total time to complete the streaming increases.

It makes more sense to send a percentage of the total ranges per stream
plan than a fixed ranges.

Here is an example to stream a keyspace with 513 ranges in
total, 10 ranges v.s. 10% ranges:

Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds

After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds

Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
(cherry picked from commit 9b5585ebd5)
2018-03-14 10:13:01 +02:00
Asias He
346d2788e3 Revert "streaming: Do not abort session too early in idle detection"
This reverts commit f792c78c96.

With the "Use range_streamer everywhere" (7217b7ab36) series,
all the user of streaming now do streaming with relative small ranges
and can retry streaming at higher level.

Reduce the time-to-recover from 5 hours to 10 minutes per stream session.

Even if the 10 minutes idle detection might cause higher false positive,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with
whenever the stream initiator goes away, the stream slave goes away.

Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
(cherry picked from commit ad7b132188)
2018-03-14 10:12:59 +02:00
Avi Kivity
4f68fede6d Merge "Make reader concurrency dual-restricted by count and memory" from Botond
"
Refs #2692
Fixes #3246

The current restricting algorithm [1] restricts the active-reader queue
based on the memory consumption of the existing active readers. When
this memory consumption is above the limit new readers are not admitted.
The inactive reader queue on the other hand has a fixed length.
This caused performance regressions on two workloads:
* read-only: since the inactive-reader queue length is severly limited
  (compared to the previous situation) reads will timeout at loads
  comfortably handled before.
* mixed: since the memory consumption happens only at admission time
  (already created active readers are not limited) memory consumption
  growed significantly causing problems when compactions kicked in.

The solution is to reintroduce the old limit of 100 active concurrent
user-reads while still keeping the memory-based limit as well. For
workloads that don't consume a lot of memory or on large boxes with lots
of memory the count-based limit will be reached which is reverting to the
old well-known behaviour. For memory-hungry workloads or on small boxes
with little memory the memory based-limit will kick in sooner avoiding
memory overconsumption.

[1] introduced by bdbbfe9390
"

* 'restricted-reader-dual-limit/v3-backport-2.1' of https://github.com/denesb/scylla:
  Modify unit tests so that they test the dual-limits
  Use the reader_concurrency_semaphore to limit reader concurrency
  Add reader_concurrency_semaphore
  Add reader_resource_tracker param to mutation_source
  mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
2018-03-08 19:10:06 +02:00
Botond Dénes
681f9e4f50 Modify unit tests so that they test the dual-limits 2018-03-08 18:54:16 +02:00
Botond Dénes
c503bc7693 Use the reader_concurrency_semaphore to limit reader concurrency 2018-03-08 18:54:15 +02:00
Botond Dénes
de7024251b Add reader_concurrency_semaphore
This semaphore implements the new dual, count and memory based active
reader limiting. As purely memory-based limiting proved to cause
problems on big boxes admitting a large number of readers (more than any
disk could handle) the previous count-based limit is reintroduced in
addition to the existing memory-based limit.
When creating new readers first the count-based limit is checked. If
that clears the memory limit is checked before admitting the reader.
reader_conccurency_semaphore wraps the two semaphores that implement
these limits and enforces the correct order of limit checking.
This class also completely replaces the restricted_reader_config struct,
it encapsulates all data and related functinality of the latter, making
client code simpler.
2018-03-08 18:54:15 +02:00
Botond Dénes
9a0eb2319c Add reader_resource_tracker param to mutation_source
Soon, reader_resource_tracker will only be constructible after the
reader has been admitted. This means that the resource tracker cannot be
preconstructed and just captured by the lambda stored in the mutation
source and instead has to be passed in along the other parameters.
2018-03-08 18:54:12 +02:00
Botond Dénes
9ef462449b mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
In preparation to reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
2018-03-08 15:34:48 +02:00
Amnon Heiman
6271f30716 dist/docker: Add support for housekeeping
This patch takes a modified version of the Ubuntu 14.04 housekeeping
service script and uses it in Docker to validate the current version.

To disable the version validation, pass the --disable-version-check flag
when running the container.

Message-Id: <20180220161231.1630-1-amnon@scylladb.com>
(cherry picked from commit edcfab3262)
2018-03-07 16:17:13 +02:00
Takuya ASADA
8b64e80c88 dist/debian: install scylla-housekeeping upstart script correctly on Ubuntu 14.04
Since we splited scylla-housekeeping service to two different services for systemd, we don't share same service name between systemd and upstart anymore.
So handle it independently for each distribution, try to install
/etc/init/scylla-housekeeping.conf on Ubuntu 14.04.

Fixes #3239

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519852659-10688-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 101e909483)
2018-03-07 16:16:36 +02:00
Amnon Heiman
c5bffcaa68 scylla-housekeeing: need to support both debian/ubuntu variations
Debian and ubuntu list files come in two variations.
The housekeeping should support both.

This patch change the regexp that match the os in the repository file.
After the introduction of the second list variation, the os name can be in the middle of the path not only at the end.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180227092543.19538-1-amnon@scylladb.com>
(cherry picked from commit 57d46c6959)
2018-03-07 16:15:54 +02:00
Tomasz Grabiec
8aa0b60e91 tests: cache: Fix invalidate() not being waited for
Probably responsible for occasional failures of subsequent assertion.
Didn't mange to reproduce.

Message-Id: <1520330967-584-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit d9f0c1f097)
2018-03-06 12:17:16 +02:00
Asias He
dccf762654 storage_service: Add missing return in pieces empty check
If pieces.empty is empty, it is bogus to access pieces[0]:

   sstring move_name = pieces[0];

Fix by adding the missing return.

Spotted by Vlad Zolotarov <vladz@scylladb.com>

Fixes #3258
Message-Id: <bcb446f34f953bc51c3704d06630b53fda82e8d2.1520297558.git.asias@scylladb.com>

(cherry picked from commit 8900e830a3)
2018-03-06 09:58:21 +02:00
Tomasz Grabiec
e5344079d9 intrusive_set_external_comparator: Fix _header having undefined color on move
swap_tree() doesn't change the color of the header, and becasue header
was not initialized, it is undefined (can be both red or black). One
problem this causes is that algo::is_header() expects the header to be
always red. It is used by unlink(), which for trees which have a black
header would infinite-loop.

The fix is to initialize the header.

Fixes #3242.

Message-Id: <1519815091-13111-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 30635510a2)
2018-02-28 13:57:33 +02:00
Paweł Dziepak
7bc8515c48 tests/cql3: increase TTL to avoid spurious failures
The test inserts some values with a TTL of 1 second and then
reads them back expecting them not to be expired yet. That may not
always be the case if the machine is slow and we are running in the
debug mode. Increasising the TTLs by x100 should help avoid these
false positives.

Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com>
(cherry picked from commit d97eebe82d)
2018-02-22 14:14:41 +00:00
Duarte Nunes
1228a41eaa cql3/query_processor: Remove prepared statements upon dropping a view
Fixes #3198

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180209143652.31852-1-duarte@scylladb.com>
(cherry picked from commit d757c87107)
2018-02-22 14:11:08 +00:00
Tomasz Grabiec
58b90ceee0 tests: row_cache: Improve test for snapshot consistency on eviction
Reproduces https://github.com/scylladb/scylla/issues/3215.
Message-Id: <1518710592-21925-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 9c3e56fb16)
2018-02-16 11:42:33 +01:00
Tomasz Grabiec
ef46067606 mvcc: Do not move unevictable snapshots to cache
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted realtive to the merged view, not
within a version, so evicting from some of the versions may mark
reanges as continuous when before they were discontinuous. Also, range
tombtsones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.

The fix is to revert back to full eviciton, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.

Fixes #3215.

Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b0b57b8143)
2018-02-16 11:26:13 +01:00
Shlomi Livne
ffdd0f6392 release: prepare for 2.1.0
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-02-14 15:17:43 +02:00
Paweł Dziepak
3ab1c8abff cql3/select_statement: do not capture stack variables by reference
Default capture by reference considered harmful in async code.

(cherry picked from commit b635fec9bf)
2018-02-08 17:54:00 +02:00
Amnon Heiman
d306c40507 database: correct the label creation for database reads
The labels in database active_reads metrics where not define correctly.

Label should be created so it will be possible to select based on their
value.

The current implementation define a label "class" with three instances:
user, streaming, system.

Fixes: #2770

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180123125206.23660-1-amnon@scylladb.com>
(cherry picked from commit a0a1961b6d)
2018-02-08 15:20:55 +02:00
Paweł Dziepak
b98d5b30de Merge "Do not evict from memtable snapshots" from Tomasz
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.

Tests: unit (release)"

* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
  tests: mvcc: Add test for eviction with non-evictable snapshots
  mutation_partition: Define + operator on tombstones
  tests: mvcc: Check that partition is fully discontinuous after eviction
  tests: row_cache: Add test for memtable readers surviving flush and eviction
  memtable: Make printable
  mvcc: Take partition_entry by const ref in operator<<()
  mvcc: Do not evict from non-evictable snapshots
  mvcc: Drop unnecessary assignment to partition_snapshot::_version
  tests: Use partition_entry::make_evictable() where appropriate
  mvcc: Encapsulate construction of evictable entries

(cherry picked from commit 6ccd317c38)
2018-02-06 19:29:56 +01:00
Tomasz Grabiec
85f5e57502 tests: Introduce mutation_partition_assertions
mutation_assertions are now delegating to mutation_partition_assertions.

(cherry picked from commit c7539f2ed0)
2018-02-06 19:29:56 +01:00
Tomasz Grabiec
19158f3401 mutation_partition: Make check_continuity() const-qualified
(cherry picked from commit bde050835f)
2018-02-06 19:29:56 +01:00
Tomasz Grabiec
a7e40d6acb mutation_partition: Make check_continuity() public
(cherry picked from commit f9257886cb)
2018-02-06 19:29:56 +01:00
Tomasz Grabiec
eedcfedd5a mutation_partition: Extract sliced() from mutation into mutation_partition
So that we can call it on mutation_partition.

(cherry picked from commit b3709047b0)
2018-02-06 19:29:56 +01:00
Tomasz Grabiec
b655fe262b mvcc: Add const-qualified partition_version_ref::operator*()
(cherry picked from commit a6e083ef6f)
2018-02-06 19:29:56 +01:00
Shlomi Livne
cbb3b959e3 release: prepare for 2.1.rc3
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-02-06 12:12:31 +02:00
Raphael S. Carvalho
3dd282f7f0 sstables/compress: Fix race condition in segmented offset reading of shared sstable
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.

So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.

We could serialize access to at() which would only lead to contention issues for
shared sstables, but that can be avoided by moving state out of compress structure
which is expected to be immutable after sstable is loaded and feeded to shards that
own it. Sequential accessor (wraps state and reference to segmented_offset) is
added to prevent at() and push_back() interfaces from being polluted.

Tests: release mode.

Fixes #3148.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
(cherry picked from commit 09f4ee808f)
2018-02-06 12:10:29 +02:00
Tomasz Grabiec
574548e50f Merge 'Fixes for exception safety in memtable range reads' from Paweł
These patches deal with the remaining exception safety issues in the
memtable partition range readers. That includes moving the assignment
to iterator_reader::_last outside of allocating section to avoid
problems caused by exception-unsafe assignment operator. Memory
accotuning code is also moved out of the retryable context to improve
the code robustness and avoid potential problems in the future.

Fixes #3172.

 * https://github.com/pdziepak/scylla.git memtable-range-read-exception-safety-2.1/v1:
  memtable: do not update iterator_reader::_last in alloc section
  memtable: do not change accounting state in alloc section
  tests/memtable: add more reader exception safety tests
2018-02-05 20:51:26 +01:00
Paweł Dziepak
688d58f54a tests/memtable: add more reader exception safety tests 2018-02-05 15:11:55 +00:00
Paweł Dziepak
ea9b0bb4b0 memtable: do not change accounting state in alloc section
Allocating sections can be retried so code that has side effects (like
updating flushed bytes accouting) has no place there.
2018-02-05 15:11:54 +00:00
Paweł Dziepak
6a9b026601 memtable: do not update iterator_reader::_last in alloc section
iterator_reader::_last is a part of the state that survives allocating
section retries, therefore, it should not be modified in the retryable
context.
2018-02-05 15:11:53 +00:00
Amnon Heiman
adc1523aaa scylla_setup support private repo on debian during setup
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170917145248.19677-1-amnon@scylladb.com>
(cherry picked from commit bc356a3c15)
2018-02-01 15:58:00 +02:00
Tomasz Grabiec
5444eead08 Merge "Make memtable reads exception safe" from Paweł
These patches change the memtable reader implementation (in particular
partition_snapshot_reader) so that the existing exception safety
paroblems are fixed, but also in a way that, hopefully, would make it
easier to reason about the error handling and avoid future bugs in that
area.

The main difficulty related to exception safety is that when an
exception is thrown out of an allocating section that code is run again
with increased memory reserved. If the retryable code has side effects
it is very easy to get incorrect behaviour.

In addition to that, entering an allocating section is not exactly cheap
which encourages doing so rarely and having large sections.

The approach taken by this series is to, first, make entering allocating
sections cheaper and then reducing the amount of logic that runs inside
of them to a minimum.

This means that instead of entering a section once per a call to
flat_mutation_reader::fill_buffer() the allocation section is entered
once for each emitted row. The only state modified from within the
section are cached iterators to the current row, which are dropped on
retry. Hopefully, this would make the reader code easier to reason
about.

The optimisations to the allocating sections and managed_bytes
linearised context has successfully eliminated any penalty caused by
much more fine grained allocating sections.

Fixes #3123.
Fixes #3133.

Tests: unit-tests (release)

BEFORE
test                                      iterations      median         mad         min         max
memtable.one_partition_one_row               1155362   869.139ns     0.282ns   868.465ns   873.253ns
memtable.one_partition_many_rows              127252     7.871us    15.252ns     7.851us     7.886us
memtable.many_partitions_one_row               58715    17.109us     2.765ns    17.013us    17.112us
memtable.many_partitions_many_rows              4839   206.717us   212.385ns   206.505us   207.448us

AFTER
test                                      iterations      median         mad         min         max
memtable.one_partition_one_row               1194453   839.223ns     0.503ns   834.952ns   842.841ns
memtable.one_partition_many_rows              133785     7.477us     4.492ns     7.473us     7.507us
memtable.many_partitions_one_row               60267    16.680us    18.027ns    16.592us    16.700us
memtable.many_partitions_many_rows              4975   201.048us   144.929ns   200.822us   201.699us

        ./before_sq  ./after_sq  diff
 read     337373.86   353694.24  4.8%
 write    388759.99   394135.78  1.4%

* https://github.com/pdziepak/scylla.git memtable-exception-safety-2.1/v1:
  flat_mutation_reader: add allocation point in push_mutation_fragment
  linearization_context: remove non-trivial operations from fast path
  lsa: split alloc section into reserving and reclamation-disabled parts
  lsa: optimise disabling reclamation and invalidation counter
  mutation_fragment: allow creating clustering row in place
  paratition_snapshot_reader: minimise amount of retryable code
  memtable: drop memtable_entry::read()
  tests/memtable: add test for reader exception safety
2018-02-01 10:54:35 +01:00
Paweł Dziepak
1e74362ec9 tests/memtable: add test for reader exception safety 2018-02-01 10:54:34 +01:00
Paweł Dziepak
72e52dafba memtable: drop memtable_entry::read() 2018-02-01 10:54:34 +01:00
Paweł Dziepak
29746e1e7b paratition_snapshot_reader: minimise amount of retryable code
Retryable code that has side effects is a recipe for bugs. This patch
reworkds the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has a very limited side effects.
2018-02-01 10:54:34 +01:00
Paweł Dziepak
13cd56774f mutation_fragment: allow creating clustering row in place
Moving clustering_row is expensive due to amount of data stored
internally. Adding a mutation_fragment constructor that builds a
clustering_row in-place saves some of that moving.
2018-02-01 10:54:34 +01:00
Paweł Dziepak
812018479b lsa: optimise disabling reclamation and invalidation counter
Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of a lsa region: region_impl.

However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.

In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
2018-02-01 10:54:34 +01:00
Paweł Dziepak
0ee2462811 lsa: split alloc section into reserving and reclamation-disabled parts
Allocating sections reserves certain amount of memory, then disables
reclamation and attempts to perform given operation. If that fails due
to std::bad_alloc the reserve is increased and the operation is retried.

Reserving memory is expensive while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled as small as possible, even if it means entering and
leaving the section multiple times.

In order to reduce the performance penalty of such solution the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
2018-02-01 10:54:34 +01:00
Paweł Dziepak
c8bc3a7053 linearization_context: remove non-trivial operations from fast path
Since linearization_context is thread_local every time it is accessed
the compiler needs to emit code that checks if it was already
constructed and does so if it wasn't. Moreover, upon leaving the context
from the outermost scope the map needs to be cleared.

All these operations impose some performance overhead and aren't really
necessary if no buffers were linearised (the expected case). This patch
rearranges the code so that lineatization_context is trivially
constructible and the map is cleared only if it was modified.
2018-02-01 10:54:34 +01:00
Paweł Dziepak
9f78799e80 flat_mutation_reader: add allocation point in push_mutation_fragment
Exception safety tests inject a failure at every allocation and verify
whether the error is handled properly.

push_mutation_fragment() adds a mutation fragment to a circular_buffer,
in theory any call to that function can result in a memory allocation,
but in practice that depends on the implementation details. In order to
improve the effectiveness of the exception safety tests this patch adds
an explicit allocation point in push_mutation_fragment().
2018-02-01 10:54:33 +01:00
Calle Wilund
5bba3856ca auth: Fix transitional auth for non-valid credentials
Fixes #3096

The credentials processing for transitional auth was broken
in ba6a41d, "auth: Switch to sharded service which effectively removed
the "virtualization" of underlying auth in the SASL challenge.

As a quick workaround, add the permissive exception handling to
sasl object as well.

Message-Id: <20180103102724.1083-1-calle@scylladb.com>
(cherry picked from commit 35b9ec868a)
2018-02-01 11:36:46 +02:00
Avi Kivity
63e92418dd Update seastar submodule
* seastar 8d254a1...af1b789 (3):
  > tls_test: Fix echo test not setting server trust store
  > tls: Do not restrict re-handshake to client
  > tls: Actually verify client certificate if requested

Fixes #3072
2018-01-28 13:59:04 +02:00
Paweł Dziepak
9eaa6f233e Update scylla-ami submodule
* scylla-ami 3366c93...c5d9e96 (1):
  > Update Amazon kernel packages release stream to 2017.09
2018-01-24 13:27:52 +00:00
Raphael S. Carvalho
6600317b2c sstables: fix wildly inaccurate sstable key estimation after dynamic index sampling
The reason sstable key estimation is inaccurate is that it doesn't account that
index sampling is now dynamic.

The estimation is done as follow:
    uint64_t get_estimated_key_count() const {
        return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) *
                _components->summary.header.min_index_interval;
    }

The biggest problem is that _components->summary.header.min_index_interval isn't
actually the minimum interval, but instead the default interval value set in the
schema.
So the estimation gets worse the larger the average partition, because the larger
the average partition the lower the index sampling interval.
One of the problems is that estimation has a big influence on bloom filter size,
and so for large partitions we were generating bigger filters than we had to.

From now on, size at full sampling is calculated as if sampling were static
(which was the case until commit 8726ee937d which introduced size-based
sampling), using minimum index as a strict sampling interval.

Tests: units (release)

Fixes #3113.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com>
(cherry picked from commit 2c181b69c9)
2018-01-24 11:42:28 +02:00
Vladimir Krivopalov
807acb2dd9 main: Fix warnings when running "scylla --version"
Print Scylla version, if requested, before running Seastar application.

Fixes #3124

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <bbd0f303f612327446ce1f10ebd17ebed8d76048.1516144651.git.vladimir@scylladb.com>
(cherry picked from commit 73b6e9fbb1)
2018-01-17 16:59:28 +02:00
Takuya ASADA
5e44bf97f0 dist/debian: follow gcc-7.2 package naming changes on 3rdparty repo for Debian 9
Switch to renamed gcc-7.2 package on Debian 9, too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516191853-2562-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit f3c8574135)
2018-01-17 14:38:55 +02:00
Takuya ASADA
4003be40b3 dist/debian: fix package name typo on Debian 8
Correct package name is scylla-gcc72-g++-7, not scylla-g++-7.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516189354-5880-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 15e266eea4)
2018-01-17 13:45:39 +02:00
Takuya ASADA
cf059b6ee2 dist/debian: follow renaming of gcc-7.2 packages on Ubuntu 14.04/16.04
Now we applied our scylla-$(pkg)$(ver) style package naming on gcc-7.2,
so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1516103292-26942-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 912a14eb9b)
2018-01-17 13:38:56 +02:00
Shlomi Livne
d96c31ee4d release: prepare for 2.1.rc2
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-01-16 16:22:56 +02:00
Avi Kivity
680ce234b0 Merge "Fix memory leak on zone reclaim" from Tomek
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120"

* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
  tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
  lsa: Expose max_zone_segments for tests
  lsa: Expose tracker::non_lsa_used_space()
  lsa: Fix memory leak on zone reclaim

(cherry picked from commit 4ad212dc01)
2018-01-16 15:54:40 +02:00
Asias He
ad656b2c55 storage_service: Do not wait for restore_replica_count in handle_state_removing
The call chain is:

storage_service::on_change() -> storage_service::handle_state_removing()
-> storage_service::restore_replica_count() -> streamer->stream_async()

Listeners run as part of gossip message processing, which is serialized.
This means we won't be processing any gossip messages until streaming
completes.

In fact, there is no need to wait for restore_replica_count to complete
which can take a long time, since when it completes, this node will send
notification to tell the removal_coordinator that the restore process is
finished on this node. This node will be removed from _replicating_nodes
on the removal_coordinator.

Tested with update_cluster_layout_tests.py

Fixes #2886

Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com>
(cherry picked from commit 5107b6ad16)
2018-01-16 11:37:55 +02:00
Tomasz Grabiec
43101b6bff database: Invalidate only affected ranges from flush_streaming_mutations()
Invalidating whole range causes larger latency spikes.

Regression from 2.0 introduced in d22fdf4261.

Refs #3119

Tests: units (release)

Message-Id: <1516046938-26855-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b5d5bf5bc4)
2018-01-16 11:18:36 +02:00
Asias He
492a5c8886 storage_service: Set NORMAL status after token_metadata is replicated
Commit 2d5fb9d109 (gms/gossiper: Replicate changes incrementally to
other shards) changes the way we replicate _token_metadata and
endpoint_state_map. Before they are replicated at the same time, after
they are not any more. This causes a shard in NORMAL status can still be
with a empty _token_metadata.

We saw errors:

   [shard 12] token_metadata - sorted_tokens is empty in first_token_index!

during CorruptThenRepairNemesis.

Fix by setting the gossip status to NORMAL after replication of
_token_metadata, so that once a node is in NORMAL, we can do repair. The
commit 69c81bcc87 (repair: Do not allow repair until node is in NORMAL
status) prevents the early repair operation by checking if a node is in
NORMAL status.

Fixes #3121

Message-Id: <af6a223733d2e11351f1fa35f59eacfa7d65dd30.1516065564.git.asias@scylladb.com>
(cherry picked from commit 3c8ed255ac)
2018-01-16 09:41:34 +02:00
Raphael S. Carvalho
152747b8fd mutation_reader: Fix use-after-move
Problem introduced in 375ed938b4

Also remove redefinition of schema in dummy incremental selector
which is supposed to use the one in base class instead.

Following tests are fixed:
    ./build/release/tests/mutation_reader_test
    ./build/release/tests/sstable_test -- -c1
    ./build/release/tests/row_cache_test
    ./build/release/tests/cache_flat_mutation_reader_test
    ./build/release/tests/row_cache_stress_test

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180111153831.17462-1-raphaelsc@scylladb.com>
2018-01-11 17:43:41 +02:00
Takuya ASADA
00c08519a7 dist/debian: make pbuilder works on Debian 9
On Debian 9, 'pbuilder create' fails because of lack of GPG key for
3rdparty repo, so we need --allow-untrusted on 'pbuilder create' and
'pbuilder update'.

Also, apt-key adv --fetch-keys does not works correctly on it, but we can use
"curl <URL> | apt-key add -" as workaround.

Fixes #3088

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513797714-18067-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit b68ee98310)
2018-01-11 15:03:49 +02:00
Takuya ASADA
5d47a39b7b dist/debian: follow renaming of gcc-7.2 packages on Debian 8
Now we applied our scylla-$(pkg)$(ver) style package naming on gcc-7.2,
so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1515522920-8266-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 420b61b466)
2018-01-11 15:03:47 +02:00
Takuya ASADA
4f8e8bdc04 dist/debian: rename boost1.63 to scylla-boost163 on Debian 8
We provided "boost1.63" package for Debian 8 since we couldn't build
"scylla-boost163" package witch is available on Ubuntu14/16, but I fixed the
problem and now we have it for Debian 8 too, so switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1514220163-25985-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 51013f561d)
2018-01-11 15:03:44 +02:00
Paweł Dziepak
ef1dab4565 combined_reader: optimise for disjoint partition streams
The legacy mutation_reader/streamed_mutation design allowed very easily
to skip the partition merging logic if there was only one underlying
reader that has emitted it.

That optimisation was lost after conversion to flat mutation readers
which has impacted the performance. This patch mostly recovers it by
bypassing most of mutation_reader_merger logic if there is only a single
active reader for a given partition.

The performance regression was introduced in
8731c1bc66 "Flatten the implementation of
combined_mutation_reader".

perf_simple_query -c4 read results (medians of 60):

original regression
             before 8731c1     after 8731c1   diff
 read            326241.02        300244.09  -8.0%

this patch
                    before            after  diff
 read            313882.59        325148.05  3.6%
Message-Id: <20180103121019.764-1-pdziepak@scylladb.com>

(cherry picked from commit b4a4c04bab)
2018-01-11 10:33:31 +01:00
Tomasz Grabiec
3f602814ba mutation_reader: Move definition of combining mutation reader to source file
So that the whole world doesn't recompile when it changes.

(cherry picked from commit 60ed5d29c0)
2018-01-11 10:33:08 +01:00
Tomasz Grabiec
83d4e85e00 mutation_reader: Use make_combined_reader() to create combined reader
So that we can hide the definition of combined_mutation_reader. It's
also less verbose.

(cherry picked from commit 52285a9e73)
2018-01-11 10:33:06 +01:00
Asias He
857ffeefce streaming: Do send failed message for uninitialized session
The uninitialized session has no peer associated with it yet. There is
no point sending the failed message when abort the session. Sending the
failed message in this case will send to a peer with uninitialized
dst_cpu_id which will casue the receiver to pass a bogus shard id to
smp::submit_to which cases segfault.

In addition, to be safe, initialize the dst_cpu_id to zero. So that
uninitialized session will send message to shard zero instead of random
bogus shard id.

Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test

Fixes #3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>

(cherry picked from commit 774307b3a7)
2018-01-09 16:32:12 +02:00
Piotr Jastrzebski
a845e23702 Fix fast_forward_to(partition_range&) in forwardable flat reader.
Making sure fast_forward_to(const partition_range&) sets _current
correctly.

Fixes #3089

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6c29cf273f191da0e21035bcbe1592042ecffc70.1515490058.git.piotr@scylladb.com>
(cherry picked from commit 945f45f490)
2018-01-09 14:57:53 +02:00
George Tavares
f9b14df3a3 db/view: Consume updated rows regardless of static row
Using Materialized Views, if the base table has static columns,
and the update in base table mutates static and non static rows,
the streamed_mutation is stopped before process non static row.
The patch avoids stopping the stream_mutation and adds a test case.

Message-Id: <20171220173434.25091-1-tavares.george@gmail.com>
(cherry picked from commit ceecd542cd)
2018-01-08 15:39:57 +01:00
Raphael S. Carvalho
ae47dfde7d sstables: cure our blindness on sstable read failure
After 611774b, we're blind again on which sstable caused a compaction
to fail, leaving us with cryptic message as follow:
compaction_manager - compaction failed: std::runtime_error (compressed
chunk failed checksum)

After this change, now both read failure in compaction or regular read
will report the guilty sstable, see:
compaction_manager - compaction failed: std::runtime_error (SSTable reader
found an exception when reading sstable ./data/.../keyspace1-standard1
ka-1-Data.db : std::runtime_error(compressed chunk failed checksum))

Fixes #3006.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com>
(cherry picked from commit 4610e994e1)
2018-01-08 13:43:32 +02:00
Vladimir Krivopalov
cc15a13365 Use CharReaderBuilder/CharReader and StreamWriterBuilder from JsonCpp.
In version 1.8.3 of JsonCpp shipped with Fedora 27, old FastWriter and
Reader classes from JsonCpp have been deprecated in favour of
newer/better ones: CharReaderBuilder/CharReader and
StreamWriterBuilder/StreamWriter.
This fix uses the new classes where available or resorts to old ones for
older versions of the library.

Fixes #2989

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
(cherry picked from commit 76775ddf26)
2018-01-07 14:48:54 +02:00
Avi Kivity
6e14dcb84c Merge "Fix potential infinite recursion in leveled compaction" from Raphael
'"The issue is triggered by compaction of sstables of level higher than 0.

The problem happens when interval map of partitioned sstable set stores
intervals such as follow:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]

When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusivess info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.

This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made."

Fixes #2908.'

* 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla:
  tests: test for infinite recursion bug when doing high-level compaction
  Fix potential infinite recursion when combining mutations for leveled compaction
  dht: make it easier to create ring_position_view from token
  dht: introduce is_min/max for ring_position

(cherry picked from commit 375ed938b4)
2018-01-07 14:47:18 +02:00
Pekka Enberg
9ed64cc11c dist/docker: Switch to Scylla 2.1 repository 2018-01-05 10:43:29 +02:00
Shlomi Livne
d4c46afc50 release: prepare for 2.1.rc1
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2018-01-03 10:48:35 +02:00
Paweł Dziepak
f371d17884 db/schema_tables: do not use moved from shared pointer
Shared pointer view is captured by two continuations, one of which is
moving it away. Using do_with() solves the problem.

Fixes #3092.
Message-Id: <20171221111614.16208-1-pdziepak@scylladb.com>

(cherry picked from commit 4dfddc97c7)
2017-12-21 15:13:53 +01:00
Tomasz Grabiec
0a82a885a4 Merge "Remove memtable::make_reader" from Piotr
Migrate all the places that used memtable::make_reader to use
memtable::make_flat_reader and remove memtable::make_reader.

* seastar-dev.git haaawk/remove_memtable_make_reader_v2_rebased:
  Remove memtable::make_reader
  Stop using memtable::make_reader in row_cache_stress_test
  Stop using memtable::make_reader in row_cache_test
  Stop using memtable::make_reader in mutation_test
  Stop using memtable::make_reader in streamed_mutation_test
  Stop using memtable::make_reader in memtable_snapshot_source.hh
  Stop using memtable::make_reader in memtable::apply
  Add consume_partitions(flat_mutation_reader& reader, Consumer consumer)
  Add default parameter values in make_combined_reader
  Migrate test_virtual_dirty_accounting_on_flush to flat reader
  Migrate test_adding_a_column_during_reading_doesnt_affect_read_result
  Simplify flat_reader_assertions& produces(const mutation& m)
  Migrate test_partition_version_consistency_after_lsa_compaction_happens
  flat_mutation_reader: Allow setting buffer capacity
  Add next_mutation() to flat_mutation_reader_assertions
  cf::for_all_partitions::iteration_state: don't store schema_ptr
  read_mutation_from_flat_mutation_reader: don't take schema_ptr
  Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader

(cherry picked from commit b0a56a91c2)
2017-12-21 14:10:31 +01:00
Tomasz Grabiec
17febfdb0e database: Move operator<<() overloads to appropriate source files
(cherry picked from commit fd7ab5fe99)
2017-12-21 14:10:24 +01:00
Vlad Zolotarov
830bf99528 tests: sstable_datafile_test: fix the compilation error on Power
'char' and int8_t ('unsigned char') are different types. 'bytes' base type
is int8_t - use the correct type for casting.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 22ca5d2596)
2017-12-21 14:09:47 +01:00
Tomasz Grabiec
90000d9861 Merge "Fixes for multi_range_reader" from Paweł
The following patches contain fixes for skipping to the next parititon
in multi_range_reader and completelty dissable support for fast
forwarding inside a single partition, which is not needed and would only
add unnecessary complexity.

* https://github.com/pdziepak/scylla.git fix-multi_range_reader/v1:
  flat_multi_range_mutation_reader: disallow
    streamed_mutation::forwarding
  flat_multi_range_mutation_reader: clear buffer on next_partition()
  tests/flat_multi_range_mutation_reader: test skipping to next
    partition

(cherry picked from commit 71cc63dfa6)
2017-12-21 14:07:15 +01:00
Asias He
46dae42dcd streaming: One cf per time on sender
In the case there are large number of column families, the sender will
send all the column families in parallel. We allow 20% of shard memory
for streaming on the receiver, so each column family will have 1/N, N is
the number of in-flight column families, memory for memtable. Large N
causes a lot of small sstables to be generated.

It is possible there are multiple senders to a single receiver, e.g.,
when a new node joins the cluster, the maximum in-flight column families
is number of peer node. The column families are sent in the order of
cf_id. It is not guaranteed that all peers has the same speed so they
are sending the same cf_id at the same time, though. We still have
chance some of the peers are sending the same cf_id.

Fixes #3065

Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
(cherry picked from commit a9dab60b6c)
2017-12-20 17:07:39 +01:00
Tomasz Grabiec
d6395634ad range_tombstone_list: Fix insert_from()
end_bound was not updated in one of the cases in which end and
end_kind was changed, as a result later merging decision using
end_bound were incorrect. end_bound was using the new key, but the old
end_kind.

Fixes #3083.
Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit dfe48bbbc7)
2017-12-20 15:31:51 +01:00
Avi Kivity
d886b3def4 Merge "Fix read amplification in sstable reads" from Paweł
"4b9a34a85425d1279b471b2ff0b0f2462328929c "Merge sstable_data_source
into sstable_mutation_reader" has introduced unintentional changes, some
of them causing excessive read amplification during empty range reads.
The following patches restore the previous behaviour."

* tag 'fix-read-amplification/v1' of https://github.com/pdziepak/scylla:
  sstables: set _read_enabled to false if possible
  sstables: set _single_partition_read for single parititon reads

(cherry picked from commit 772d1f47d7)
2017-12-19 18:18:06 +02:00
Tomasz Grabiec
bcb06bb043 flat_mutation_reader: Fix make_nonforwardable()
It emitted end-of-stream prematurely if buffer was full.
Message-Id: <1513697716-32634-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 6a6bf58b98)
2017-12-19 16:01:21 +00:00
Tomasz Grabiec
4606300b25 row_cache: Fix single_partition_populating_reader not waiting on create_underlying() to resolve
Results in undefined behavior.
Message-Id: <1513691679-27081-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 7b36c8423c)
2017-12-19 16:12:37 +02:00
Piotr Jastrzebski
282d93de99 Use row_cache::make_flat_reader in column_family::make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ba1659ceed8676f45942ce6e7506158026947345.1513687259.git.piotr@scylladb.com>
(cherry picked from commit 570fc5afed)
2017-12-19 14:42:52 +02:00
Avi Kivity
52d3403cb0 Update scylla-ami submodule
* dist/ami/files/scylla-ami be90a3f...3366c93 (1):
  > scylla_install_ami: skip ec2_check while building AMI

Still tracking master.
2017-12-19 10:12:05 +02:00
Tomasz Grabiec
97f6073699 Merge "Migrate cache to use flat_mutation_reader" from Piotr
(cherry picked from commit 37b19ae6ba)
2017-12-18 20:51:09 +01:00
Glauber Costa
5454e6e168 conf: document listen_on_broadcast_address
That's a supported feature that is listed in our help message, but it
is not present in the yaml file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171215011240.16027-1-glauber@scylladb.com>
(cherry picked from commit b8f49fcc14)
2017-12-18 17:00:46 +02:00
Vlad Zolotarov
498fb11c70 messaging_service: fix a mutli-NIC support
Don't enforce the outgoing connections from the 'listen_address'
interface only.

If 'local_address' is given to connect() it will enforce it to use a
particular interface to connect from, even if the destination address
should be accessed from a different interface. If we don't specify the
'local_address' the source interface will be chosen according to the
routing configuration.

Fixes #3066

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit be6f8be9cb)
2017-12-17 10:51:37 +02:00
Avi Kivity
a6b4881994 Merge "SSTable summary regeneration fixes" from Raphael
"Fixes #3057."

* 'summary_recreation_fixes_v2' of github.com:raphaelsc/scylla:
  tests: sstable summary recreation sanity test
  sstables: make loading of sstable without summary to work again
  sstables: fix summary generation with dynamic index sampling

(cherry picked from commit 11de20fc33)
2017-12-17 09:39:16 +02:00
Takuya ASADA
9848df6667 dist/common/systemd: specify correct repo file path for housekeeping service on Ubuntu/Debian
Currently scylla-housekeeping-daily.service/-restart.service hardcoded
"--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify CentOS .repo file,
but we use same .service for Ubuntu/Debian.
It doesn't work correctly, we need to specify .list file for Debian variants.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit c2e87f4677)
2017-12-16 22:03:42 +02:00
Piotr Jastrzebski
2090a5f8f6 Fix build by removing semicolon after concept
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4504cf47be0a451c58052476bc8cc4f9cba59472.1513248094.git.piotr@scylladb.com>
(cherry picked from commit ac1d2f98e4)
2017-12-14 12:48:29 +02:00
Amos Kong
7634ed39eb Reset default cluster_name back to 'Test Cluster' for compatibility
There are some users used original default cluster_name 'Test Cluster',
they will fail to start the node for cluster_name change if they use
new scylla.yaml.

'ScyllaDB Cluster' isn't more beautiful than 'Test Cluster', reset back
to original old to avoid problem for users.

Fixes #3060

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <8c9dab8a64d0f4ab3a5d6910b87af696c60e5076.1513072453.git.amos@scylladb.com>
(cherry picked from commit b07de93636)
2017-12-13 16:58:10 +02:00
Avi Kivity
fb9b15904a Merge "Convert sstable readers to flat streams" from Paweł
"While aa8c2cbc16 'Merge "Migrate sstables
to flat_mutation_reader" from Piotr' has converted the low-level sstable
reader to the new flat_mutation_reader interface there were still
multiple readers related to sstables that required converting,
including:
 - restricted reader
 - filtering reader
 - single partition sstable reader
This series completes their conversion to the flat stream interface."

* tag 'flat_mutation_reader-sstable-readers/v2' of https://github.com/pdziepak/scylla:
  db: convert single_key_sstalbe_reader to flat streams
  db: fully convert incremental_reader_selector to flat readers
  db: make make_range_sstable_reader() return flat reader
  db: make column_family::make_reader() return flat reader
  db: make column_family::make_sstable_reader() return a flat reader
  filtering_reader: switch to flat mutation fragment streams
  filtering_reader: pass a const dht::decorated_key& to the callback
  mutation_reader: drop make_restricted_reader()
  db: use make_restricted_flat_reader
  mutation_reader: convert restricted reader to flat streams

(cherry picked from commit 6cb3b29168)
2017-12-13 15:38:22 +02:00
Glauber Costa
4e11f05aa7 database: delete created SSTables if streaming writes fail
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.

We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same isse.

Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.

Fixes #3062

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
(cherry picked from commit 1aabbc75ab)
2017-12-13 10:09:43 +02:00
Jesse Haber-Kucharsky
516a1ae834 cql3: Add missing return
Since `return` is missing, the "else" branch is also taken and this
results a user being created from scratch.

Fixes #3058.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bf3ca5907b046586d9bfe00f3b61b3ac695ba9c5.1512951084.git.jhaberku@scylladb.com>
(cherry picked from commit 7e3a344460)
2017-12-11 09:55:27 +02:00
Paweł Dziepak
be5127388d Merge "Fix range tombstone emitting which led to skipping over data" from Tomasz
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.

Introduced in 2.0

Fixes #3053."

* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add reproducer for issue #3053
  tests: mvcc: Add test for partition_snapshot::range_tombstones()
  mvcc: Optimize partition_snapshot::range_tombstones() for single version case
  mvcc: Fix partition_snapshot::range_tombstones()
  tests: random_mutation_generator: Do not emit dummy entries at clustering row positions

(cherry picked from commit 051cbbc9af)
2017-12-08 13:03:32 +01:00
Tomasz Grabiec
6d0679ca72 mvcc: Extract partition_entry::add_version()
(cherry picked from commit 52cabe343c)
2017-12-08 12:33:49 +01:00
Avi Kivity
eb67b427b2 Merge "SSTable resharding fixes" from Raphael
"Didn't affect any release. Regression introduced in 301358e.

Fixes #3041"

* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
  tests: add sstable resharding test to test.py
  tests: fix sstable resharding test
  sstables: Fix resharding by not filtering out mutation that belongs to other shard
  db: introduce make_range_sstable_reader
  rename make_range_sstable_reader to make_local_shard_sstable_reader
  db: extract sstable reader creation from incremental_reader_selector
  db: reuse make_range_sstable_reader in make_sstable_reader

(cherry picked from commit d934ca55a7)
2017-12-07 16:43:28 +02:00
Amos Kong
2931324b34 dist/debian: add scylla-tools-core to depends list
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <db39cbda0e08e501633556ab238d816e357ad327.1512646123.git.amos@scylladb.com>
(cherry picked from commit 8fd5d27508)
2017-12-07 13:40:46 +02:00
Amos Kong
614519c4be dist/redhat: add scylla-tools-core to requires list
Fixes #3051

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f7013a4fbc241bb4429d855671fee4b845b255cd.1512646123.git.amos@scylladb.com>
(cherry picked from commit eb3b138ee2)
2017-12-07 13:40:46 +02:00
Botond Dénes
203b924c76 mutation_reader_merger: don't query the kind of moved-from fragment
Call mutation_fragment_kind() on the fragment *before* it's moved as
there are not guarantees for the state of a moved-from object (apart
from that it's in a valid one).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c47b1e22877bb9499f1fbb9d513093c29ef1901b.1512635422.git.bdenes@scylladb.com>
(cherry picked from commit 1ff65f41fd)
2017-12-07 11:41:04 +01:00
Botond Dénes
f4f957fa53 Add streamed mutation fast-forwarding unit test for the flat combined-reader
Test for the bug fixed by 9661769.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <fc917bae8e9c99f026bf7b366e6e9d39faf466af.1512630741.git.bdenes@scylladb.com>
(cherry picked from commit 9fce51f8a0)
2017-12-07 11:40:53 +01:00
Botond Dénes
39e614a444 combined_mutation_reader: fix fast-fowarding related row-skipping bug
When fast forwarding is enabled and all readers positioned inside the
current partition return EOS, return EOS from the combined-reader
too. Instead of skipping to the next partition if there are idle readers
(positioned at some later partition) available. This will cause rows to
be skipped in some cases.

The fix is to distinguish EOS'd readers that are only halted (waiting
for a fast-forward) from thoose really out of data. To achieve this we
track the last fragment-kind the reader emitted. If that was a
partition-end then the reader is out of data, otherwise it might emit
more fragments after a fast-forward. Without this additional information
it is impossible to determine why a reader reached EOS and the code
later may make the wrong decision about whether the combined-reader as
a whole is at EOS or not.
Also when fast-forwarding between partition-ranges or calling
next_partition() we set the last fragment-kind of forwarded readers
because they should emit a partition-start, otherwise they are out of
data.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <6f0b21b1ec62e1197de6b46510d5508cdb4a6977.1512569218.git.bdenes@scylladb.com>
(cherry picked from commit 9661769313)
2017-12-06 16:42:06 +02:00
Paweł Dziepak
d8521d0fa2 Merge "Flatten combined_mutation_reader" from Botond
"Convert combined_mutation_reader into a flat_mutation_reader impl. For
now - in the name of incremental progress - all consumers are updated to
use the combined reader through the
mutation_reader_from_flat_mutation_reader adaptor. The combined reader also
uses all it's sub mutation_readers through the
flat_mutation_reader_from_mutation_reader adaptor."

* 'bdenes/flatten-combined-reader-v8' of https://github.com/denesb/scylla:
  Add unit tests for the combined reader - selector interactions
  Add flat_mutation_reader overload of make_combined_reader
  Flatten the implementation of combined_mutation_reader
  Add mutation_fragment_merger
  mutation_fragment::apply(): handle partition start and end too
  Add non-const overload of partition_start::partition_tombstone()
  Make combined_mutation_reader a flat_mutation_reader
  Move the mutation merging logic to combined_mutation_reader
  Remove the unnecessary indirection of mutation_reader_merger::next()
  Move the implementation of combined_mutation_reader into mutation_reader_merger
  Remove unused mutation_and_reader::less_compare and operator<

(cherry picked from commit 046991b0b7)
2017-12-06 16:41:42 +02:00
Takuya ASADA
f60696b55f dist/debian: need apt-get update after installing GPG key for 3rdparty repo
We need apt-get update after install GPG key, otherwise we still get
unauthenticated package error on Debian package build.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512556948-29398-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit aeb6ebce5a)
2017-12-06 12:43:42 +02:00
Avi Kivity
1b15a0926a Merge "Make sstable tests use flat_mutation_reader" from Paweł
"This series makes sstable tests use flat stream interface. The main
motivation is to allow eventual removal of mutation_reader and
streamed_mutation and ensuring that the conversion between the
interfaces doesn't hide any bugs that would be otherwise found."

* tag 'flat_mutation_reader-sstable-tests/v1' of https://github.com/pdziepak/scylla:
  sstables: drop read_range_rows()
  tests/mutation_reader: stop using read_range_rows()
  incremental_reader_selector: do not use read_range_rows()
  tests/sstable: stop using read_range_rows()
  sstables: drop read_row()
  tests/sstables: use read_row_flat() instead of read_row()
  database: use read_row_flat() instead of read_row()
  tests/sstable_mutation_test: get flat_mutation_readers from mutation sources
  tests/sstables: make sstable_reader return flat_mutation_reader
  sstable: drop read_row() overload accepting sstable::key
  tests/sstable: stop using read_row() with sstable::key
  tests/flat_mutation_reader_assertions: add has_monotonic_positions()
  tests/flat_mutation_reader_assertions: add produces(Range)
  tests/flat_mutation_reader_assertions: add produces(mutation)
  tests/flat_mutation_reader_assertions: add produces(dht::decorated_key)
  tests/flat_mutation_reader_assertions: add produces(mutation_fragment::kind)
  tests/flat_mutation_reader_assertions: fix fast forwarding

(cherry picked from commit 601a03dda7)
2017-12-06 10:12:36 +02:00
Takuya ASADA
32efd3902c dist/debian: install CA certificates before install repo GPG key
Since pbuilder chroot environment does not install CA certificates by default,
accessing https://download.opensuse.org will cause certificate verification
error.
So we need to install it before installing 3rdparty repo GPG key.

Also, checking existance of gpgkeys_curl is not needed, since it's always
not installed since we are running the script in clean chroot environment.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512517001-27524-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 8f02967a3b)
2017-12-06 10:12:17 +02:00
Avi Kivity
6b2f7f8c39 Merge "enable secure-apt for Ubuntu/Debian pbuilder" from Takuya
* 'debian-secure-apt-3rdparty-v3' of https://github.com/syuu1228/scylla:
  dist/debian: support Ubuntu 18.04LTS
  dist/debian: disable ALLOWUNTRUSTED
  dist/debian: enable secure-apt for Debian
  dist/debian: enable secure-apt for Ubuntu

(cherry picked from commit a25b5e30f8)
2017-12-04 14:47:23 +02:00
Takuya ASADA
370a6482e3 dist/debian: disable entire pybuild actions
Even after 25bc18b commited, we still see the build error similar to #3036 on
some environment, but not on dh_auto_install, it on dh_auto_test (see #3039).

So we need to disable entire pybuild actions, not just dh_auto_install.

Fixes #3039

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512185097-23828-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 8c403ea4e0)
2017-12-02 19:37:01 +02:00
Takuya ASADA
981644167b dist/debian: skip running dh_auto_install on pybuild
We are getting package build error on dh_auto_install which is invoked by
pybuild.
But since we handle all installation on debian/scylla-server.install, we can
simply skip running dh_auto_install.

Fixes #3036

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512065117-15708-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 25bc18b8ff)
2017-12-01 16:06:44 +02:00
Avi Kivity
6f669da227 Update seastar submodule
* seastar 78cd87f...8d254a1 (2):
  > fstream: do not ignore dma_write return value
  > Update dpdk submodule

Fixes dpdk build and missing file write error check.
2017-11-30 10:43:22 +02:00
Avi Kivity
bdf1173075 Point seastar submodule at scylla-seastar.git
This allows fixes to seastar to be cherry-picked into
scylla-seastar.git branch-2.1.
2017-11-30 10:40:51 +02:00
Duarte Nunes
106c69ad45 compound_compact: Change universal reference to const reference
The universal reference was introduced so we could bind an rvalue to
the argument, but it would have sufficed to make the argument a const
reference. This is also more consistent with the function's other
overload.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171129132758.19654-1-duarte@scylladb.com>
(cherry picked from commit cda3ddd146)
2017-11-29 14:42:08 +01:00
Tomasz Grabiec
740fcc73b8 Merge "compact_storage serialization fixes" from Duarte
Fix two issues with serializing non-compound range tombstones as
compound: convert a non-compound clustering element to compound and
actually advertise the issue to other nodes.

* git@github.com:duarten/scylla.git  rt-compact-fixes/v1:
  compound_compact: Allow rvalues in size()
  sstables/sstables: Convert non-compound clustering element to compound
  tests/sstable_mutation_test: Verify we can write/read non-correct RTs
  service/storage_service: Export non-compound RT feature

(cherry picked from commit e9cce59b85)
2017-11-29 14:18:21 +01:00
Raphael S. Carvalho
cefbb0b999 sstables: fix data_consume_context's move operator and ctor
after 7f8b62bc0b, its move operator and ctor broke. That potentially
leads to error because data_consume_context dtor moves sstable ref
to continuation when waiting for in-flight reads from input stream.
Otherwise, sstable can be destroyed meanwhile and file descriptor
would be invalid, leading to EBADF.

Fixes #3020.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171129014917.11841-1-raphaelsc@scylladb.com>
(cherry picked from commit f699cf17ae)
2017-11-29 09:54:27 +01:00
Tomasz Grabiec
02f43f5e4c Merge "Convert memtable flush reader to flat streams" from Paweł
This series converts memtable flush reader to the new flat mutation
readers. Just like the scanning reader, flush reader concatenates
multiple partition snapshot readers in order to provide a stream
of all partitions in the memtable.

* https://github.com/pdziepak/scylla.git flat_mutation_reader-memtable-flush/v1
   tests/flat_mutation_reader_assertion: add produces_partition()
   memtable: make make_flush_reader() return flat_mutation_reader
   flat_mutation_reader: add optimised flat_mutation_reader_opt
   memtable: switch flush reader implementation to flat streams
   tests/memtable: add test for flush reader

(cherry picked from commit 04106b4c96)
2017-11-27 20:29:25 +01:00
Duarte Nunes
8850ef7c59 tests/sstable_mutation_test: Change make_reader to make_flat_reader
A merge conflict between 596ebaed1f and
bd1efbc25c caused the test to fail to
build.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 4a6ffa3f5c)
2017-11-27 09:59:56 +01:00
Duarte Nunes
8567723a7b tests: Initialize storage service for some tests
These tests now require having the storage service initialize, which
is needed to decide whether correct non-compound range tombstones
should be emitted or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126152921.5199-1-duarte@scylladb.com>
(cherry picked from commit 922f095f22)
2017-11-26 17:41:20 +02:00
Duarte Nunes
b0b7c73acd cql3/delete_statement: Allow non-range deletions on non-compound schemas
This patch fixes a regression introduced in
1c872e2ddc.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126102333.3736-1-duarte@scylladb.com>
(cherry picked from commit 15fbb8e1ca)
2017-11-26 12:29:27 +02:00
Takuya ASADA
eb82d66849 dist/debian: link libgcc dynamically
As we discussed on the thread (https://github.com/scylladb/scylla/issues/2941),
since we override symbols on libgcc, we need to link libgcc dynamically for
Ubuntu/Debian too (CentOS already do it).

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511542866-21486-2-git-send-email-syuu@scylladb.com>
(cherry picked from commit 7380a6088b)
2017-11-25 20:10:15 +02:00
Takuya ASADA
eb12fb3733 dist/debian: switch to our PPA verions of gcc-72
Now we have gcc-7.2 on our PPA for Ubuntu 16.04/14.04, let's switch to it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511542866-21486-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit df6546d151)
2017-11-25 20:10:14 +02:00
Tomasz Grabiec
60d011c9c0 Merge "Convert sstable writers to flat mutation readers" from Paweł
The following patches convert sstable writers to use flat mutation
readers instead of the legacy mutation_reader interface.
Writers were already using flat consumer interface and used
consume_flattened_in_thread(), so most of the work was limited to
providing an appropriate equivalent for flat mutation readers.

* https://github.com/pdziepak/scylla.git flat_mutation_reader-sstable-write/v1:
  flat_mutation_reader: move consumer_adapter out of consume()
  flat_mutation_reader: introduce consume_in_thread()
  tests/flat_mutation_reader: test consume_in_thread()
  sstables: switch write_components() to flat_mutation_reader
  streamed_mutation: drop streamed_mutation_returning()
  sstables: convert compaction to flat_mutation_reader
  mutation_reader: drop consume_flattened_in_thread()

(cherry picked from commit 596ebaed1f)
2017-11-24 18:49:32 +01:00
Tomasz Grabiec
7c3390bde8 Merge "Fixes to sstable files for non-compound schemas" from Duarte
This series mainly fixes issues with the serialization of promoted
index entries for non-compound schemas and with the serialization of
range tombstones, also for non-compound schemas.

We lift the correct cell name writing code into its own function,
and direct all users to it. We also ensure backward compatibility with
incorrectly generated promoted indexes and range tombstones.

Fixes #2995
Fixes #2986
Fixes #2979
Fixes #2992
Fixes #2993

* git@github.com:duarten/scylla.git  promoted-index-serialization/v3:
  sstables/sstables: Unify column name writers
  sstables/sstables: Don't write index entry for a missing row maker
  sstables/sstables: Reuse write_range_tombstone() for row tombstones
  sstables/sstables: Lift index writing for row tombstones
  sstables/sstables: Leverage index code upon range tombstone consume
  sstables/sstables: Move out tombstone check in write_range_tombstone()
  sstables/sstables: A schema with static columns is always compound
  sstables/sstables: Lift column name writing logic
  sstables/sstables: Use schema-aware write_column_name() for
    collections
  sstables/sstables: Use schema-aware write_column_name() for row marker
  sstables/sstables: Use schema-aware write_column_name() for static row
  sstables/sstables: Writing promoted index entry leverages
    column_name_writer
  sstables/sstables: Add supported feature list to sstables
  sstables/sstables: Don't use incorrectly serialized promoted index
  cql3/single_column_primary_key_restrictions: Implement is_inclusive()
  cql3/delete_statement: Constrain range deletions for non-compound
    schemas
  tests/cql_query_test: Verify range deletion constraints
  sstables/sstables: Correctly deserialize range tombstones
  service/storage_service: Add feature for correct non-compound RTs
  tests/sstable_*: Start the storage service for some cases
  sstables/sstable_writer: Prepare to control range tombstone
    serialization
  sstables/sstables: Correctly serialize range tombstones
  tests/sstable_assertions: Fix monotonicity check for promoted indexes
  tests/sstable_assertions: Assert a promoted index is empty
  tests/sstable_mutation_test: Verify promoted index serializes
    correctly
  tests/sstable_mutation_test: Verify promoted index repeats tombstones
  tests/sstable_mutation_test: Ensure range tombstone serializes
    correctly
  tests/sstable_datafile_test: Add test for incorrect promoted index
  tests/sstable_datafile_test: Verify reading of incorrect range
    tombstones
  sstables/sstable: Rename schema-oblivious write_column_name() function
  sstables/sstables: No promoted index without clustering keys
  tests/sstable_mutation_test: Verify promoted index is not generated
  sstables/sstables: Optimize column name writing and indexing
  compound_compat: Don't assume compoundness

(cherry picked from commit bd1efbc25c)
2017-11-24 18:49:19 +01:00
Tomasz Grabiec
95b55a0e9d tests: sstable: Make tombstone_purge_test more reliable
TTL of 1 second may cause the cell to expire right after we write it,
if the second component of current time changes right after it. Use
larger ttl to avoid spurious faliures due to this.
Message-Id: <1511463392-1451-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 35e404b1a2)
2017-11-24 18:49:16 +01:00
Amnon Heiman
7785d8f396 estimated_histogram: update the sum and count when merging
When merging histograms the count and the sum should be updated.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20171122154822.23855-1-amnon@scylladb.com>
(cherry picked from commit 3f8d9a87ee)
2017-11-22 16:57:08 +01:00
Glauber Costa
b805e37d30 estimated_histogram: also fill up sum metric
Prometheus histograms have 3 embedded metrics: count, buckets, and sum.
Currently we fill up count and buckets but sum is left at 0. This is
particularly bad, since according to the prometheus documentation, the
best way to calculate histogram averages is to write:

  rate(metric_sum[5m]) / rate(metric_count[5m])

One way of keeping track of the sum is adding the value we sampled,
every time we sample. However, the interface for the estimated histogram
has a method that allows to add a metric while allowing to adjust the
count for missing metrics (add_nano())

That makes acumulating a sum inaccurate--as we will have no values for
the points that were added. To overcome that, when we call add_nano(),
we pretend we are introducing new_count - _count metrics, all with the
same value.

Long term, doing away with sampling may help us provide more accurate
results.

After this patch, we are able to correctly calculate latency averages
through the data exported in prometheus.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171122144558.7575-1-glauber@scylladb.com>
(cherry picked from commit 6c4e8049a0)
2017-11-22 16:57:07 +01:00
Tomasz Grabiec
a790b8cd20 Merge "Remove sstable::read_rows" from Piotr
* seastar-dev.git haaawk/flat_reader_remove_read_rows:
  sstable_mutation_test: use read_rows_flat instead of read_rows
  perf_sstable: use read_rows_flat instead of read_rows
  Remove sstable::read_rows

(cherry picked from commit e9ffe36d65)
2017-11-22 16:11:31 +01:00
Tomasz Grabiec
a10ea80a63 Merge "Migrate sstables to flat_mutation_reader" from Piotr
Introduce sstable::read_row_flat and sstable::read_range_rows_flat methods
and use them in sstable::as_mutation_source.

* https://github.com/scylladb/seastar-dev/tree/haaawk/flat_reader_sstables_v3:
  Introduce conversion from flat_mutation_reader to streamed_mutation
  Add sstables::read_rows_flat and sstables::read_range_rows_flat
  Turn sstable_mutation_reader into a flat_mutation_reader
  sstable: add getter for filter_tracker
  Move mp_row_consumer methods implementations to the bottom
  Remove unused sstable_mutation_reader constructor
  Replace "sm" with "partition" in get_next_sm and on_sm_finished
  Move advance_to_upper_bound above sstable_mutation_reader
  Store sstable_mutation_reader pointer in mp_row_consumer
  Stop using streamed_mutation in consumer and reader
  Stop using streamed_mutation in sstable_data_source
  Delete sstable_streamed_mutation
  Introduce sstable::read_row_flat
  Migrate sstable::as_mutation_source to flat_mutation_reader
  Remove single_partition_reader_adaptor
  Merge data_consume_context::impl into data_consume_context
  Create data_consume_context_opt.
  Merge on_partition_finished into mark_partition_finished
  Check _partition_finished instead of _current_partition_key
  Merge sstable_data_source into sstable_mutation_reader
  Remove sstable_data_source
  Remove get_next_partition and partition_header

(cherry picked from commit aa8c2cbc16)
2017-11-22 15:49:22 +01:00
Takuya ASADA
91a5c9d20c dist/redhat: avoid hardcoding GPG key file path on scylla-epel-7-x86_64.cfg
Since we want to support cross building, we shouldn't hardcode GPG file path,
even these files provided on recent version of mock.

This fixes build error on some older build environment such as CentOS-7.2.

Fixes #3002

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511277722-22917-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit c1b97d11ea)
2017-11-21 17:26:53 +02:00
Takuya ASADA
f846b897bf configure.py: suppress 'nonnull-compare' warning on antlr3
We get following warning from antlr3 header when we compile Scylla with gcc-7.2:
/opt/scylladb/include/antlr3bitset.inl: In member function 'antlr3::BitsetList<AllocatorType>::BitsetType* antlr3::BitsetList<AllocatorType>::bitsetLoad() [with ImplTraits = antlr3::TraitsBase<antlr3::CustomTraitsBase>]':
/opt/scylladb/include/antlr3bitset.inl:54:2: error: nonnull argument 'this' compared to NULL [-Werror=nonnull-compare]

To make it compilable we need to specify '-Wno-nonnull-compare' on cflags.

Message-Id: <1510952411-20722-2-git-send-email-syuu@scylladb.com>
(cherry picked from commit f26cde582f)
2017-11-21 17:26:53 +02:00
Takuya ASADA
8d7c34bf68 dist/debian: switch Debian 3rdparty packages to external build service
Switch Debian 3rdparty packages to our OBS repo
(https://build.opensuse.org/project/subprojects/home:scylladb).

We don't use 3rdparty packages on dist/debian/dep, so dropped them.
Also we switch Debian to gcc-7.2/boost-1.63 on same time.

Due to packaging issues following packages doesn't renamed our 3rdparty
package naming rule for now:
 - gcc-7: renamed as 'xxx-scylla72', instead of scylla-xxx-72.
 - boost1.63: doesn't renamed, also doesn't changed prefix to /opt/scylladb

Message-Id: <1510952411-20722-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ab9d7cdc65)
2017-11-21 17:26:53 +02:00
Duarte Nunes
7449586a26 thrift/server: Handle exception within gate
The exception handling code inspects server state, which could be
destroyed before the handle_exception() task runs since it runs after
exiting the gate. Move the exception handling inside the gate and
avoid scheduling another accept if the server has been stopped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171116122921.21273-1-duarte@scylladb.com>
(cherry picked from commit 34a0b85982)
2017-11-21 15:52:38 +02:00
Daniel Fiala
b601b9f078 utils/big_decimal: Fix compilation issue with converion of cpp_int to uint64_t.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171121134854.16278-1-daniel@scylladb.com>
(cherry picked from commit 21ea05ada1)
2017-11-21 15:52:01 +02:00
Tomasz Grabiec
1ec81cda37 Merge "Convert queries to flat mutation readers" from Paweł
These patches convert queries (data, mutation and counter) to flat
mutation readers. All of them already use consume_flattened() to
consume a flat stream of data, so the only major missing thing
 was adding support for reversed partitions to
flat_mutation_reader::consume().

* pdziepak flat_mutation_reader-queries/v3-rebased:
  flat_mutation_reader: keep reference to decorated key valid
  flat_muation_reader: support consuming reversed partitions
  tests/flat_mutation_reader: add test for
    flat_mutation_reader::consume()
  mutation_partition: convert queries to flat_mutation_readers
  tests/row_cache_stress_test: do not use consume_flattened()
  mutation_reader: drop consume_flattened()
  streamed_mutation: drop reverse_streamed_mutation()

(cherry picked from commit 6969a235f3)
2017-11-21 12:58:41 +01:00
Paweł Dziepak
e87a2bc9c0 streamed_mutation: make emit_range_tombstone() exception safe
For a time range tombstone that was already removed from a tree
is owned by a raw pointer. This doesn't end well if creation of
a mutation fragment or a call to push_mutation_fragment() throw.
Message-Id: <20171121105749.16559-1-pdziepak@scylladb.com>

(cherry picked from commit 1b936876b7)
2017-11-21 12:35:47 +01:00
Tomasz Grabiec
b84d13d325 Merge "Fix reversed queries with range tombstones" from Paweł
This series reworks handling of range tombstones in reversed queries
so that they are applied to correct rows. Additionally, the concept
of flipped range tombstones is removed, since it only made it harder
to reason about the code.

Fixes #2982.

* https://github.com/pdziepak/scylla fix-reverse-query-range-tombstone/v2:
  streamed_mutation: fix reversing range tombstones
  range_tombstone: drop flip()
  tests/cql_query_test: test range tombstones and reverse queries
  tests/range_tombstone_list: add test for range_tombstone_accumulator

(cherry picked from commit cec5b0a5b8)
2017-11-21 12:35:37 +01:00
Botond Dénes
b5abf6541d Add fast-forwarding with no data test to mutation_source_test
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9cb630bf9441e178b2040709f92767d4a740a875.1511180262.git.bdenes@scylladb.com>
(cherry picked from commit f059e71056)
2017-11-21 12:34:46 +01:00
Botond Dénes
8cf869cb37 flat_mutation_reader_assertions: add fast_forward_to(position_range)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7b530909cf188887377aec3985f9f8c0e3b9b1e8.1511180262.git.bdenes@scylladb.com>
(cherry picked from commit a1a0d445d6)
2017-11-21 12:34:43 +01:00
Botond Dénes
df509761b0 flat_mutation_reader_from_mutation_reader(): make ff more resilient
Currently flat_mutation_reader_from_mutation_reader()'s
converting_reader will throw std::runtime_error if fast_forward_to() is
called when its internal streamed_mutation_opt is disengaged. This can
create problems if this reader is a sub-reader of a combined reader as the
latter has no way to determine the source of a sub-reader EOS. A reader
can be in EOS either because it reached the end of the current
position_range or because it doesn't have any more data.
To avoid this, instead of throwing we just silently ignore the fact that
the streamed_mutation_opt is disengaged and set _end_of_stream to true
which is still correct.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <83d309b225950bdbbd931f1c5e7fb91c9929ba1c.1511180262.git.bdenes@scylladb.com>
(cherry picked from commit 8065dca4a1)
2017-11-21 12:34:40 +01:00
Vlad Zolotarov
b90e11264e cql_transport::cql_server: fix the distributed prepared statements cache population
Don't std::move() the "query" string inside the parallel_for_each() lambda.
parallel_for_each is going to invoke the given callback object for each element of the range
and as a result the first call of lambda that std::move()s the "query" is going to destroy it for
all other calls.

Fixes #2998

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1511225744-1159-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 941aa20252)
2017-11-21 10:53:50 +02:00
Shlomi Livne
84b2bff0a6 release: prepare for 2.1.rc0
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2017-11-19 18:53:20 +02:00
Tomasz Grabiec
2113299b61 sstables: index_reader: Reset lower bound for promoted index lookups from advance_to_next_partition()
_current_pi_idx was not reset from advance_to_next_partition(), which
is used when we skip to the next partition before fully consuming
it. As a result, if we try to skip to a clustering position which is
before the index block used by the last skip in the previous
partition, we would not skip assuming that the new position is in the
current block. This may result in more data being read from the
sstable than necessary.

Fixes #2984
Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>
2017-11-17 11:00:26 +00:00
Avi Kivity
6950389a3f Update seastar submodule
* seastar 11ad0b1...78cd87f (3):
  > Merge "http: Use output stream for files" from Amnon
  > tutorial: a section about when_all() and when_all_succeed()
  > Merge "Power8 related changes (what's left of them)" from Vlad
2017-11-16 16:31:46 +02:00
Avi Kivity
f18b3928d0 Merge 2017-11-16 14:55:45 +02:00
Avi Kivity
beffe469af index_entry: add move constructor, assigment operators
As can be seen in one of the traces in #2958, the copy constructor
of index_entry is called in response to std::vector<index_entry>::push_back(index_entry&&).

This is wasteful. Fix by providing the full suite of constructors/assignment operators.
Message-Id: <20171116121608.5580-1-avi@scylladb.com>
2017-11-16 13:54:05 +01:00
Avi Kivity
bbcfc57cb4 Merge "Free auth and its use from global variables" from Jesse
"This patch series addresses #2929. The objective is to eliminate global
state from the implementation and use of all access-control functionlity.

I've made every effort to make these patches logically independent and
incremental, but the final patch is big: this was necessary because
eliminating the global instances themselves is an atomic change."

* 'jhk/non_global_auth/v2' of https://github.com/hakuch/scylla:
  auth: Switch to sharded service
  tracing/trace_keyspace_helper: Use internal `client_state`
  auth: Make the QP an explicit dependency
  auth: Unify Java class name attributes
  auth: Make life-time control more consistent
  auth: Move metadata constants
  auth: Don't expose internal constant
  auth: Extract `permissions_cache`
  utils/loading_cache: Include necessary dependency
  auth: Fix static constant initialization
  auth: Extract `delayed_tasks` from `auth.cc`
2017-11-16 14:52:34 +02:00
Jesse Haber-Kucharsky
ba6a41d397 auth: Switch to sharded service
This change appears quite large, but is logically fairly simple.

Previously, the `auth` module was structured around global state in a
number of ways:

- There existed global instances for the authenticator and the
  authorizer, which were accessed pervasively throughout the system
  through `auth::authenticator::get()` and `auth::authorizer::get()`,
  respectively. These instances needed to be initialized before they
  could be used with `auth::authenticator::setup(sstring type_name)`
  and `auth::authorizer::setup(sstring type_name)`.

- The implementation of the `auth::auth` functions and the authenticator
  and authorizer depended on resources accessed globally through
  `cql3::get_local_query_processor()` and
  `service::get_local_migration_manager()`.

- CQL statements would check for access and manage users through static
  functions in `auth::auth`. These functions would access the global
  authenticator and authorizer instances and depended on the necessary
  systems being started before they were used.

This change eliminates global state from all of these.

The specific changes are:

- Move out `allow_all_authenticator` and `allow_all_authorizer` into
  their own files so that they're constructed like any other
  authenticator or authorizer.

- Delete `auth.hh` and `auth.cc`. Constants and helper functions useful
  for implementing functionality in the `auth` module have moved to
  `common.hh`.

- Remove silent global dependency in
  `auth::authenticated_user::is_super()` on the auth* service in favour
  of a new function `auth::is_super_user()` with an explicit auth*
  service argument.

- Remove global authenticator and authorizer instances, as well as the
  `setup()` functions.

- Expose dependency on the auth* service in
  `auth::authorizer::authorize()` and `auth::authorizer::list()`, which
  is necessary to check for superuser status.

- Add an explicit `service::migration_manager` argument to the
  authenticators and authorizers so they can announce metadata tables.

- The permissions cache now requires an auth* service reference instead
  of just an authorizer since authorizing also requires this.

- The permissions cache configuration can now easily be created from the
  DB configuration.

- Move the static functions in `auth::auth` to the new `auth::service`.
  Where possible, previously static resources like the `delayed_tasks`
  are now members.

- Validating `cql3::user_options` requires an authenticator, which was
  previously accessed globally.

- Instances of the auth* service are accessed through `external`
  instances of `client_state` instead of globally. This includes several
  CQL statements including `alter_user_statement`,
  `create_user_statement`, `drop_user_statement`, `grant_statement`,
  `list_permissions_statement`, `permissions_altering_statement`, and
  `revoke_statement`. For `internal` `client_state`, this is `nullptr`.

- Since the `cql_server` is responsible for instantiating connections
  and each connection gets a new `client_state`, the `cql_server` is
  instantiated with a reference to the auth* service.

- Similarly, the Thrift server is now also instantiated with a reference
  to the auth* service.

- Since the storage service is responsible for instantiating and
  starting the sharded servers, it is instantiated with the sharded
  auth* service which it threads through. All relevant factory functions
  have been updated.

- The storage service is still responsible for starting the auth*
  service it has been provided, and shutting it down.

- The `cql_test_env` is now instantiated with an instance of the auth*
  service, and can be accessed through a member function.

- All unit tests have been updated and pass.

Fixes #2929.
2017-11-15 23:22:42 -05:00
Jesse Haber-Kucharsky
1dd181bd7b tracing/trace_keyspace_helper: Use internal client_state 2017-11-15 23:19:18 -05:00
Jesse Haber-Kucharsky
41612ee577 auth: Make the QP an explicit dependency
Rather than have all uses of the QP in auth reference global variables,
we supply a QP reference to both the authenticator and authorizer on
construction.

The caller still references a global variable when constructing the
instances, but fixing this problem is a much larger task that is out of
scope of this change.
2017-11-15 23:19:13 -05:00
Jesse Haber-Kucharsky
157e22a4f0 auth: Unify Java class name attributes 2017-11-15 23:19:00 -05:00
Jesse Haber-Kucharsky
9aff5d9a77 auth: Make life-time control more consistent 2017-11-15 23:18:44 -05:00
Jesse Haber-Kucharsky
5825e37310 auth: Move metadata constants
This change is motivated partly be aesthetics, but more significantly
due to the future work to refactor `auth` into a sharded service. Since
doing so will require writing `auth::auth` from scratch, these
constants (and other common functionality) need a new home.
2017-11-15 23:18:42 -05:00
Jesse Haber-Kucharsky
22670cae82 auth: Don't expose internal constant 2017-11-15 23:17:52 -05:00
Jesse Haber-Kucharsky
20b7f92b9c auth: Extract permissions_cache
In addition to improving clarity, this makes the cache testable.

There shouldn't be any functional changes.
2017-11-15 23:17:41 -05:00
Jesse Haber-Kucharsky
6f4241574c utils/loading_cache: Include necessary dependency 2017-11-15 23:17:05 -05:00
Jesse Haber-Kucharsky
5c39a2cc15 auth: Fix static constant initialization
Using "Meyer's singletons" eliminate the problem of static constant
initialization order because static variables inside functions are
initialized only the first time control flow passes over their
declaration.

Fixes #2966.
2017-11-15 23:16:52 -05:00
Jesse Haber-Kucharsky
507e1ef8d5 auth: Extract delayed_tasks from auth.cc
This simple task scheduler is used by the auth module to delay metadata
creation until the system is settled.

Extracting it out allows the `auth` module to be refactored into a
sharded service and for other components of `auth` to make use of it.

Fixes #2965.
2017-11-15 23:16:46 -05:00
Botond Dénes
7f60fb19b4 flat_mutation_reader::fast_forward_buffer_to: remove schema parameter
e7a0732f72 added the schema to
flat_mutation_reader::impl so the schema doesn't need to be provided
externally anymore.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <04933512d3485d85629a9945b8ecb211aa2aab50.1510732121.git.bdenes@scylladb.com>
2017-11-15 10:40:02 +01:00
Tomasz Grabiec
a061be688d Merge "Prepare sstables read path for flat_mutation_reader" from Piotr
This patchset prepares sstables read path for flat_mutation_reader.
It cuts some dependencies between classes and replaces
sstables::mutation_reader with ::mutation_reader.  This will make it
possible to gradually convert the code to flat_mutation_reader because
we have converters between flat_mutation_reader and ::mutation_reader.

* seastar-dev.git haaawk/flat_reader_prepare_sstables_rebased
  Reduce dependencies from mp_row_consumer to sstable_streamed_mutation
  Replace sstables::mutation_reader with ::mutation_reader
  Remove range_reader_adaptor
  Remove sstable_range_wrapping_reader
2017-11-15 10:40:02 +01:00
Piotr Jastrzebski
6cd4b6b09c Remove sstable_range_wrapping_reader
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:02 +01:00
Piotr Jastrzebski
6d85e4fb0c Remove range_reader_adaptor
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:02 +01:00
Piotr Jastrzebski
ea449c9cce Replace sstables::mutation_reader with ::mutation_reader
This will make migration to flat_mutation_reader much
easier and sstables::mutation_reader is going away with
this migration anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:01 +01:00
Piotr Jastrzebski
228f0737f4 Reduce dependencies from mp_row_consumer to sstable_streamed_mutation
Before this patch mp_row_consumer was using sstable_streamed_mutation
in two ways:

1. Populate sstable_streamed_mutation's buffer with mutation_fragments
2. Advance sstable_streamed_mutation's sstable_data_source to new position.

We can easily reduce those dependencies only to the first one.
This will reduce the coupling between those classes and simplify
the flow of execution.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:01 +01:00
Takuya ASADA
07c191af41 dist/common/scripts/scylla_dev_mode_setup: include scylla_lib.sh
To use verify_args function we requires scylla_lib.sh, so include it.

Fixes #2945

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1510154173-18017-1-git-send-email-syuu@scylladb.com>
2017-11-15 11:31:14 +02:00
Avi Kivity
7caf3a543e Merge "Respect size-tiered options in strategies that rely on its functionality" from Raphael
"Otherwise, such strategies couldn't behave as expected when it needs to do STCS."

* 'respecting_stcs_options_v2' of github.com:raphaelsc/scylla:
  tests: enable twcs test that relied on size-tiered properties
  twcs: respect stcs options by forwarding them to stcs method
  lcs: forward stcs options to respect them
  stcs: make most_interesting_bucket respect size-tiered options
  stcs: make most_interesting_bucket respect thresholds
  compaction: make size_tiered_most_interesting_bucket static method of stcs class
  stcs: introduce new ctor
  stcs: make header self contained
  stcs: inline function definition so as not to break one definition rule
2017-11-14 17:57:57 +02:00
Raphael S. Carvalho
1f478d5daa tests: enable twcs test that relied on size-tiered properties
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:27:27 -02:00
Raphael S. Carvalho
8165af1d08 twcs: respect stcs options by forwarding them to stcs method
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:27:27 -02:00
Raphael S. Carvalho
9cdc047a4c lcs: forward stcs options to respect them
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:27:27 -02:00
Raphael S. Carvalho
2b7f87474b stcs: make most_interesting_bucket respect size-tiered options
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:27:25 -02:00
Raphael S. Carvalho
d8ec913c34 stcs: make most_interesting_bucket respect thresholds
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:26:04 -02:00
Raphael S. Carvalho
cb6d060d8e compaction: make size_tiered_most_interesting_bucket static method of stcs class
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:24:03 -02:00
Raphael S. Carvalho
b69dbf8b99 stcs: introduce new ctor
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-14 13:21:59 -02:00
Avi Kivity
0dc888f963 Merge 2017-11-14 15:59:52 +02:00
Tomasz Grabiec
7323fe76db gossiper: Replicate endpoint_state::is_alive()
Broken in f570e41d18.

Not replicating this may cause coordinator to treat a node which is
down as alive, or vice verse.

Fixes regression in dtest:

  consistency_test.py:TestAvailability.test_simple_strategy

which was expected to get "unavailable" exception but it was getting a
timeout.

Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
2017-11-14 15:58:00 +02:00
Vlad Zolotarov
c6c41aa877 tests: loading_cache_test: make it more robust
Make sure loading_cache::stop() is always called where appropriate:
regardless whether the test failed or there was an exception during the test.
Otherwise a false-alarm use-after-free error may occur.

Fixes #2955

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1510625736-3109-1-git-send-email-vladz@scylladb.com>
2017-11-14 11:35:49 +00:00
Avi Kivity
09e730f9f2 Merge "Fix bugs in cache related to handling of bad_alloc" from Tomasz
"Fixes #2944."

* tag 'tgrabiec/cache-exception-safety-fixes-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add test for exception safety of multi-partition scans
  tests: row_cache: Add test for exception safety of single-partition reads
  tests: mutation_source_tests: Always print the seed
  tests: Disable alloc failure injection in test assertions
  tests: Avoid needless copies
  row_cache: Fix exception safety of cache_entry::read()
  row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data
  row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh()
  cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation
  mvcc: Lift noexcept off partition_snapshot_row_weakref assignment/constructors
  cache_streamed_mutation: Make advancing to the next range exception-safe
  cache_streamed_mutation: Make add_clustering_row_to_buffer() exception-safe
  cache_streamed_mutation: Make drain_tombstones() exception-safe
  cache_streamed_mutation: Return void from start_reading_from_underlying()
  cache_streamed_mutation: Document invariants related to exception-safety
  streamed_mutation: Add reserve_one()
  lsa: Guarantee invalidated references on allocating section retry
  mvcc: partition_snapshot_row_cursor: Mark allocation points
2017-11-14 11:42:13 +02:00
Raphael S. Carvalho
f6574412a3 stcs: make header self contained
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-13 18:07:31 -02:00
Raphael S. Carvalho
2b45aa3593 stcs: inline function definition so as not to break one definition rule
goal is to allow multiple definitions of header

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-11-13 18:07:30 -02:00
Tomasz Grabiec
638d23025b tests: row_cache: Add test for exception safety of multi-partition scans 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
084e1861c8 tests: row_cache: Add test for exception safety of single-partition reads 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
a968a84ec5 tests: mutation_source_tests: Always print the seed
BOOST_TEST_MESSAGE() is not logged by default, and for some tests we
don't want to enable that because it's too noisy. But we need to know
the seed to reproduce a failure, so we better to always print it.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
e868929faf tests: Disable alloc failure injection in test assertions
Injecting failures to assertions doesn't add much value but slows down
test execution by adding extra iterations.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
5cf7f9d1bb tests: Avoid needless copies 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
1971332195 row_cache: Fix exception safety of cache_entry::read()
When we fail, we need to return streamed_mutation back, so that
the operation can be retried.

Causes SIGSEGV on nullptr otherwise on bad_alloc.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
11a195c403 row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data
If assignment to _lower_bound in the "_secondary_in_progress = true;"
case in do_read_from_primary() throws due to allocation failure, the
update section will be retried and we will take the not_moved path,
skipping the range which was discontinuous and was supposed to be read
from underlying.

Fix by redoing lookup using _lower_bound in case the section is
retried. When we retry, _primary.valid() will be false. We need to
ensure now that _lower_bound is always valid.

Fixes #2944.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
5dc1ee41e4 row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh() 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
09c49b2db3 cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
f60cfa34f4 mvcc: Lift noexcept off partition_snapshot_row_weakref assignment/constructors
Assignment to _pos (position_in_partition) may throw. noexcept is a
remnant from the version which didn't have _pos.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
bd7b68f877 cache_streamed_mutation: Make advancing to the next range exception-safe
Changing _ck_ranges_curr and _lower_bound should be atomic, either
both fail or both succeed.  Currently it could happen that if
position_in_partition::for_range_start() fails, _ck_ranges_curr would
be advanced but _lower_bound not.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
081deec731 cache_streamed_mutation: Make add_clustering_row_to_buffer() exception-safe
We need to maintain the following invariants:
 (1) no fragment with position >= _lower_bound was pushed yet
 (2) If _lower_bound > mf.position(), mf was emitted

Before this patch (1) could be violated if drain_tombstones() failed
in the middle. (2) could be violated if push_mutation_fragment()
failed.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
d1b844737a cache_streamed_mutation: Make drain_tombstones() exception-safe
If push_mutation_fragment() failed, mfo which we got from get_next()
would be lost. Fix by making sure push_mutation_fragment() won't fail.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
875fc93956 cache_streamed_mutation: Return void from start_reading_from_underlying()
The return value is no longer used.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
5fb319bbb9 cache_streamed_mutation: Document invariants related to exception-safety 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
53f4452b47 streamed_mutation: Add reserve_one() 2017-11-13 20:55:13 +01:00
Tomasz Grabiec
8d69d217af lsa: Guarantee invalidated references on allocating section retry
There is existing code (e.g. use of partition_snapshot_row_cursor in
cache_streamed_mutation) which assumes that references will be
invalidated when bad_alloc is thrown from allocating_section. That is
currently the case because on retry we will attempt memory reclamation
which will invalidate references either through compaction or
eviction. Make this guarantee explicit.
2017-11-13 20:55:13 +01:00
Tomasz Grabiec
6bf1c6014f mvcc: partition_snapshot_row_cursor: Mark allocation points
This marks places which may allocate but not always do as allocation
points to increase effectiveness of testing.
2017-11-13 20:55:13 +01:00
Raphael S. Carvalho
cfd2343689 sstables: fix report in integrity check file interposer
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113185842.11018-1-raphaelsc@scylladb.com>
2017-11-13 20:07:09 +01:00
Raphael S. Carvalho
cf8e12c760 checked_file_impl: remove unneeded variant of open_checked_file_dma
like in integrity_checked_file_impl, we don't need a variant of
open for default file open options.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113185412.10880-1-raphaelsc@scylladb.com>
2017-11-13 20:06:58 +01:00
Raphael S. Carvalho
564046a135 thrift: fix compilation error
thrift/server.cc:237:6:   required from here
thrift/server.cc:236:9: error: cannot call member function ‘void thrift_server::maybe_retry_accept(int, bool, std::__exception_ptr::exception_ptr)’ without object
         maybe_retry_accept(which, keepalive, std::move(ex));

gcc version: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113184537.10472-1-raphaelsc@scylladb.com>
2017-11-13 20:05:33 +01:00
Avi Kivity
9cd5bc4eb8 Merge "Convert streaming to flat mutation readers" from Paweł
"The following patches convert streaming and repair code to the new
flat mutation reader interface. In particular this involves changing::
 - fragment_and_freeze() -- a consumer that fragments and freezes mutations
 - checksum computation for repair which until now was using two-level
   mutation_reader/streamed_mutation interface
 - multi_range_reader -- a mutation reader that automatically fast
   forwards to between given partiton ranges"

* tag 'flat_mutation_reader-streaming/v2' of https://github.com/pdziepak/scylla: (24 commits)
  mutation_reader: drop multi_range_reader
  db: convert make_streaming_reader() to flat_mutation_reader
  tests/flat_mutation_reader: add test for multi range reader
  tests/flat_mutation_reader_assertions: add fast_forward_to()
  tests/simple_schema: add to_ring_positions() helper
  flat_mutation_reader: convert flat_multi_range_mutation_reader
  flat_mutation_reader: add partition_range_forwarding
  flat_mutation_reader: make pop_mutation_fragment() public
  flat_mutation_reader: copy multi_range_mutation_reader
  streamed_mutation: drop mutation_hasher
  tests/flat_mutation_reader: add test for partition checksum
  repair: convert partition_checksum::compute_streamed() to flat streams
  repair: make partition_hasher consume flat mutation streams
  mutation_hasher: copy mutation_hasher to repair.cc
  partition_start: make partition_tombstone() const
  partition_checksum: introduce compute() for flat_mutation_reader
  db: drop single-range make_streaming_reader()
  fragment_and_freeze: drop streamed_mutation overload
  stream_transfer_task: switch to flat_mutation_reader
  tests/flat_mutation_reader: add test for fragment_and_freeze
  ...
2017-11-13 18:56:59 +02:00
Paweł Dziepak
97767963a0 mutation_reader: drop multi_range_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
dca93bea23 db: convert make_streaming_reader() to flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
98965add5b tests/flat_mutation_reader: add test for multi range reader
Based on mutation_reader.cc:test_multi_range_reader.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
d23813cd41 tests/flat_mutation_reader_assertions: add fast_forward_to() 2017-11-13 16:49:52 +00:00
Paweł Dziepak
8fc9d250c5 tests/simple_schema: add to_ring_positions() helper
Based on mutation_reader_test.cc:to_ring_position()
2017-11-13 16:49:52 +00:00
Paweł Dziepak
a9ec01d5a5 flat_mutation_reader: convert flat_multi_range_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
11e8866aee flat_mutation_reader: add partition_range_forwarding
flat_mutation_reader::partition_range_forwarding and
mutation_reader::forwarding are aliases of the same type. The change was
necessary in order to make mutation_reader::forwarding available in
flat_mutation_reader.hh even though it is included by mutation_reader.hh
2017-11-13 16:49:52 +00:00
Paweł Dziepak
009785a178 flat_mutation_reader: make pop_mutation_fragment() public
flat_mutation_reader public interface already exposes low leve
is_buffer_empty() and is_buffer_full() adding pop_mutation_fragment()
will make implementation of intermediate readers more straightforward.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
d9a2b00d4a flat_mutation_reader: copy multi_range_mutation_reader
multi_range_mutation_reader for flat mutation readers is going to be
based on the original one.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
7866e5b4a9 streamed_mutation: drop mutation_hasher 2017-11-13 16:49:52 +00:00
Paweł Dziepak
aa64b711d1 tests/flat_mutation_reader: add test for partition checksum
Based on streamed_mutation_test:test_mutation_hash
2017-11-13 16:49:52 +00:00
Paweł Dziepak
f690e2e80b repair: convert partition_checksum::compute_streamed() to flat streams 2017-11-13 16:49:52 +00:00
Paweł Dziepak
d71a14b943 repair: make partition_hasher consume flat mutation streams 2017-11-13 16:49:52 +00:00
Paweł Dziepak
2b774119a1 mutation_hasher: copy mutation_hasher to repair.cc
Repair is the exclusive user of mutation_hasher. Moving it there will
make integration with partition_checksum easier.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
af4fa6152b partition_start: make partition_tombstone() const 2017-11-13 16:49:52 +00:00
Paweł Dziepak
f648f94464 partition_checksum: introduce compute() for flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
37640f223b db: drop single-range make_streaming_reader() 2017-11-13 16:49:52 +00:00
Paweł Dziepak
e2481a89e1 fragment_and_freeze: drop streamed_mutation overload 2017-11-13 16:49:52 +00:00
Paweł Dziepak
6f1e0d3ed8 stream_transfer_task: switch to flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
50a1d76c1f tests/flat_mutation_reader: add test for fragment_and_freeze
Based on streamed_mutation_test:test_fragmenting_and_freezing_streamed_mutations
2017-11-13 16:49:52 +00:00
Paweł Dziepak
f5c40e0861 flat_mutation_reader_from_mutations: take vector by value 2017-11-13 16:49:51 +00:00
Paweł Dziepak
9854b8a450 fragment_and_freeze: work on flat_mutation_readers 2017-11-13 16:49:47 +00:00
Paweł Dziepak
8bb672502d fragment_and_freeze: allow callback to stop iteration
There is a user of fragment_and_freeze() (streaming) that will need
to be able to break the loop Right now, it does that between
streamed_mutation, but that won't be possible after we switch to flat
readers.
2017-11-13 16:44:33 +00:00
Paweł Dziepak
73b8f54cf4 test/mutation_source_test: generate sets of mutations 2017-11-13 16:42:56 +00:00
Tomasz Grabiec
3536d2156c tests: row_cache: Add reproducer for issue #2948
Message-Id: <1510229584-14398-2-git-send-email-tgrabiec@scylladb.com>
2017-11-13 15:20:21 +00:00
Tomasz Grabiec
8402728747 row_cache: Call open_version() under region's allocator
partition_entry::read() calls open_version() under standard allocator,
but it may allocate a new partition version if a snapshot already
exists which was created in an earlier phase. Versions are supposed to
be allocated using region's allocator, they will be freed using
region's allocator. LSA will delegate free() to the standard allocator
correctly in this case, but it will subtract from its
_non_lsa_occupancy, assuming the allocation was done through it. This
will corrupt occupancy() for cache region.

Fixes #2948.
Message-Id: <1510229584-14398-1-git-send-email-tgrabiec@scylladb.com>
2017-11-13 15:20:08 +00:00
Avi Kivity
061f6830fa Merge "thrift/server: Ensure stop() waits for accepts" from Duarte
"Ensure stop() waits for the accept loop to complete to avoid crashes
during shutdown."

* 'thrift-server-stop/v4' of https://github.com/duarten/scylla:
  thrift/server: Restore code format
  thrift/server: Stopping the server waits for connection shutdown
  thrift/server: Abort listeners on stop()
  thrift/server: Avoid manual memory management
  thrift/server: Add move ctor for connection
  thrift/server: Extract retry logic
  thrift/server: Retry with backoff for some error types
  thrift/server: Retry accept in case of error
2017-11-13 12:48:05 +02:00
Duarte Nunes
049fbb58f3 thrift/server: Restore code format
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:21:54 +01:00
Duarte Nunes
7b25e3200a thrift/server: Stopping the server waits for connection shutdown
This patch ensures the future returned from stop() resolves only when
all connections and listeners are no longer in use.

Fixes #2657
Fixes #2942

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:21:53 +01:00
Duarte Nunes
f523a0f845 thrift/server: Abort listeners on stop()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:44 +01:00
Duarte Nunes
8e0e2363e9 thrift/server: Avoid manual memory management
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:44 +01:00
Duarte Nunes
75d04be96f thrift/server: Add move ctor for connection 2017-11-13 11:19:44 +01:00
Duarte Nunes
9d3322ff1a thrift/server: Extract retry logic
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:43 +01:00
Duarte Nunes
b5cf1a152f thrift/server: Retry with backoff for some error types
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:19 +01:00
Duarte Nunes
f367dbe1ed thrift/server: Retry accept in case of error
In case of errors like ECONNABORTED, we want to retry accepting
connections. Right now we immediately retry the accept, but in
subsequent patches we introduce a backoff for other types of errors.

We also consider fatal errors like EBADFD, which should not trigger a
retry.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-13 11:19:03 +01:00
Avi Kivity
d57395dce9 cql: prevent overflow when computing averages
Currently, we use type type of the column as the accumulator when we
average it. This can easily overflow, e.g. (2^31-1)+(3) = overflow.

Fix by using __int128 for the accumulator.  It's not standard, but
it's way more efficient and simpler than the alternatives.

Inspired by CASSANDRA-12417, but much simpler due to the availability
of __int128.
Message-Id: <20171112173529.30764-1-avi@scylladb.com>
2017-11-13 08:53:59 +01:00
Piotr Jastrzebski
acfc6fef55 Simplify flat_mutation_reader wrappers
If a wrapper takes a flat_mutation_reader in a constructor
then it does not have to take schema_ptr because it can obtain
it from the inner flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <88c3672df08d2ac465711e9138d426e43ae9c62b.1510331382.git.piotr@scylladb.com>
2017-11-13 08:53:34 +01:00
Avi Kivity
f8af4f507b Merge "Support for varint and decimal in aggregate functions" from Daniel
"This patch adds support for varint and decimal to aggregate functions.
Some other types (like byte or smallint) weren't supported and they are
supported by C*. So their aggregate functions were added as well.

To allow aggregate functions for big_decimal, following methods were added
to big_decimal type:
  * Division by int64_t that preservers number of decimal digits.
  * Operator += .
  * Comparison operators.

Fixes #2842."

* 'danfiala/scylla-2842-send-002' of https://github.com/hagrid-the-developer/scylla:
  tests: Add tests for aggregate functions.
  tests: Add tests for big_decimal type.
  cql3/functions: Add aggregate functions for big_decimal.
  utils/big_decimal: Added necessary operators and methods for aggregate functions.
  cql3/functions: Add aggregate functions for types for which it is trivial.
2017-11-12 17:11:33 +02:00
Daniel Fiala
bc20484c47 tests: Add tests for aggregate functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 15:53:22 +01:00
Daniel Fiala
ee1d69502b tests: Add tests for big_decimal type.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 15:53:22 +01:00
Daniel Fiala
74c5f70b0a cql3/functions: Add aggregate functions for big_decimal.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 15:53:13 +01:00
Daniel Fiala
ce2f010859 utils/big_decimal: Added necessary operators and methods for aggregate functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 15:51:29 +01:00
Daniel Fiala
115668fe70 cql3/functions: Add aggregate functions for types for which it is trivial.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-11-12 13:56:20 +01:00
Tomasz Grabiec
484dde692f Merge "make sure that cache updates don't overflow dirty memory" from Glauber
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).

However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box'
memory.

This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).

After that is done, we need to make sure that dirty memory is not seen
as freed until the cache update is done. Until a particular partition is
moved to the cache it is not evictable. As a result we can OOM the
system if we have a lot of pending cache updates as the writes will not
be throttled and memory won't be made available.

This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the mean time it
gradually releases memory of the partitions that are being moved to
cache.

I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.

Fixes #1942

* git@github.com:glommer/scylla.git glommer/update-cache-v4:
  row_cache: modernize use of seastar threads
  mutation_partition: estimate size of partition
  memtable: factor out calculation of memtable_entry memory size
  memtable: add a method to export memtable's dirty memory manager
  dirty_memory_manager: block if we hit the real dirty limit
  dirty_memory_manager: add functions to manipulate real dirty
  partition: add method to calculate memory size of a partition
  row cache: pin real dirty during cache updates.
2017-11-10 13:55:12 +01:00
Piotr Jastrzebski
e7a0732f72 Add schema_ptr to flat_mutation_reader
It is usefull to have a schema inside a flat reader
the same way we had schema inside a streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b37e0dbf38810c00bd27fb876b69e1754c16a89f.1510312137.git.piotr@scylladb.com>
2017-11-10 13:54:55 +01:00
Pekka Enberg
0c192c835c cql3: Fix 'DROP INDEX' to also drop index view
This patch fixes 'DROP INDEX' CQL statement to also drop the underlying
index view automatically so that we don't leave unused materialized
views behind.
Message-Id: <1510303421-15945-1-git-send-email-penberg@scylladb.com>
2017-11-10 10:52:08 +01:00
Duarte Nunes
73f6c9a612 Merge seastar upstream
* seastar 8040cab...11ad0b1 (7):
  > alloc_failure_injector: Fix compilation error with gcc 7.1
  > core/gate: Add is_closed() function
  > doc: code formatting and fix function call
  > doc: tutoral code formatting
  > build: adjust -Wno-error=cpp for clang
  > build: don't error out on preprocessor #warning
  > Merge 'Enhancements of allocation failure injector' from Tomasz

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-11-09 14:42:06 +01:00
Takuya ASADA
f607a01cc5 dist/debian: link boost statically
Since we switched scylla-boost163 which isn't provided by distribution repo,
we need to link them statically.

Fixes #2946

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1510229553-29801-1-git-send-email-syuu@scylladb.com>
2017-11-09 14:51:00 +02:00
Glauber Costa
1d7617723d row cache: pin real dirty during cache updates.
Right now, once a region is moved to the cache is no longer visible to
the dirty memory system. Not as real dirty nor virtual dirty.

The problem is that until a particular partition is moved to the cache
it is not evictable. As a result we can OOM the system if we have a lot
of pending cache updates as the writes will not be throttled and memory
won't be made available.

This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the mean time it
gradually releases memory of the partitions that are being moved to
cache.

I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.

Fixes #1942

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 19:46:36 -05:00
Glauber Costa
c2f49da609 partition: add method to calculate memory size of a partition
Once that is added, also add a method to a memtable entry to calculate
the entire size of a memtable entry. Right now we only have one method
to calculate the size minus rows.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
b02ab991b9 dirty_memory_manager: add functions to manipulate real dirty
There are times in which we want to add and remove real dirty memory
without impacting virtual dirty. One such example is the cache update
process, where real dirty is the limiting factor.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
a6b2226562 dirty_memory_manager: block if we hit the real dirty limit
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).

However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box'
memory.

This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).

A next step is to add a controller that will increase the priority of
the tasks involving in releasing real dirty memory if we get dangerously
close to the threshold.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
b98a48657e memtable: add a method to export memtable's dirty memory manager
It will be used by the cache update process to gradually return real
dirty memory to the manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
ec36b9eddc memtable: factor out calculation of memtable_entry memory size
The total size is the sum of two components. Add a method that
does that sum so this code gets easier to reuse.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
d49ecae201 mutation_partition: estimate size of partition
In the memtable flusher, we account for the size of a partition as we
read them. However, there are other points in the architecture where we
would like to calculate the size of a partition in a point in which we
are not reading it. One such example is the cache update process.

This patch enhances the mutation_partition adding a method that returns
the total size for this partition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Glauber Costa
b836005555 row_cache: modernize use of seastar threads
For a while now we have an async() function, that simplifies the code by not
needing to issue an explicit join. This patch converts the row cache to use
async() as well, which most of our code already does. Doing so will make
it easier to make changes to update_cache.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Paweł Dziepak
b69f94fece Merge "Implement flat_mutation_reader::consume" from Piotr
"Implement flat_mutation_reader::consume and add tests for it.
For that implement flat_mutation_reader_from_mutations and
read_mutation_from_flat_mutation_reader."

* 'haaawk/flat_reader_consume_v3' of github.com:scylladb/seastar-dev:
  Add tests for flat_mutation_reader::consume
  Add tests for flat_mutation_reader utils
  Introduce read_mutation_from_flat_mutation_reader
  Make mutation_rebuilder streamed_mutation independent
  flat_mutation_reader_from_mutation: support multiple mutations
  Introduce flat_mutation_reader::consume
  Move FlattenedConsumer concept to flat_mutation_reader.hh
2017-11-08 15:08:47 +00:00
Paweł Dziepak
0373f357a8 Merge "Make memtable::make_reader return flat_mutation_reader" from Piotr
"This patchset introduces memtable::make_flat_reader that returns
flat_mutation_reader and converts internal memtable readers into
flat_mutation_readers.

It also introduces some utility methods like make_forwardable and
make_partition_snapshot_flat_reader."

* 'haaawk/flat_reader_memtable_v4' of github.com:scylladb/seastar-dev:
  Turn scanning_reader into flat_mutation_reader
  Change memtable_entry::read to return flat_mutation_reader
  Make iterator_reader independent from mutation_reader
  Introduce make_partition_snapshot_flat_reader
  Prepare partition_snapshot_flat_reader
  Introduce flat_mutation_reader_from_mutation
  Prepare flat_mutation_reader_from_mutation
  Introduce make_forwardable
  Prepare make_forwardable for flat_mutation_reader
  Introduce empty_flat_reader
  memtable: Introduce make_flat_reader
2017-11-08 14:24:26 +00:00
Piotr Jastrzebski
29d409de2f Add tests for flat_mutation_reader::consume
Make sure that flat_mutation_reader::consume stops
as it's asked by the consumer.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
d42e53982d Add tests for flat_mutation_reader utils
Test flat_mutation_reader_from_mutations and
read_mutation_from_flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
4b58a05053 Introduce read_mutation_from_flat_mutation_reader
This helper method reads a single mutation from
a flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
6718ecab82 Make mutation_rebuilder streamed_mutation independent
mutation_rebuilder will be used not only with streamed_mutations
but also with flat_mutation_readers so it's better for it to be
independent from streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
aa16cd7eef flat_mutation_reader_from_mutation: support multiple mutations
Rename flat_mutation_reader_from_mutation to
flat_mutation_reader_from_mutations.

Make it work with std::vector<mutation> instead of a single
mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:26:10 +01:00
Piotr Jastrzebski
bcd5415413 Introduce flat_mutation_reader::consume
This is equivalent to consume_flattened for old
mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:25:28 +01:00
Piotr Jastrzebski
9233ee7309 Move FlattenedConsumer concept to flat_mutation_reader.hh
This concept will be used both in flat_mutation_reader.hh
and mutation_reader.hh. mutation_reader.hh includes
flat_mutation_reader.hh so we have to move the concept to
make it accessible in both files.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:14:51 +01:00
Piotr Jastrzebski
864d02e795 Turn scanning_reader into flat_mutation_reader
This will make memtable::make_reader more efficient.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 14:08:53 +01:00
Tomasz Grabiec
9e115fb7e2 Merge addition of mutation_source::make_flat_mutation_reader() from Piotr
Make it possible for a mutation_source to be created both for sources
that use old mutation_reader and new flat_mutation_reader.

Add tests for flat_mutation_reader::next_partition to run_mutation_source_tests.

* seastar-dev.git 'dev/haaawk/flat_reader_mutation_source_v3':
  Remove mutation_reader.hh dependency from flat_mutation_reader.hh
  Prepare mutation_source for more than one implementation
  Add flat reader mutation source implementation
  Add mutation_source::make_flat_mutation_reader
  Use mutation_source::make_flat_mutation_reader in tests
  Add flat_mutation_reader_assertions
  Add test for flat_mutation_reader::next_partition
2017-11-08 14:05:25 +01:00
Piotr Jastrzebski
68505a5065 Change memtable_entry::read to return flat_mutation_reader
This is the first step to move scanning_reader to
be flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
7b016527bf Make iterator_reader independent from mutation_reader
iterator_reader will be used also in flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
f499949645 Introduce make_partition_snapshot_flat_reader
This allows creation of flat_mutation_reader from MVCC.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
ced64d7571 Prepare partition_snapshot_flat_reader
This commit creates a copy of partition_snapshot_reader
and names it partition_snapshot_flat_reader.
This new class will be turned into a flat_mutation_reader
in the next commit.

The purpose of this commit is to make it easier to review the next
commit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
ed074a4f56 Introduce flat_mutation_reader_from_mutation
This is a utility method that will be handy in conversion
from mutation_reader to flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
c3a4ce842a Prepare flat_mutation_reader_from_mutation
This commit copies streamed_mutation_from_mutation
from streamed_mutation to flat_mutation_reader
and renames it to streamed_mutation_from_mutation_copy.
This copy will be used as a base for
flat_mutation_reader_from_mutation.

The purpose of this commit is to make it easier to review the next
commit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
decefe6eaa Introduce make_forwardable
It will add the ability to fast_forward_to on position_range
to flat_mutation_reader that does not have this ability.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
6da8caf26f Prepare make_forwardable for flat_mutation_reader
This commit copies make_forwardable from streamed_mutation
to flat_mutation_reader and renames it to make_forwardable_copy.
This copy will be used as a base for make_forwardable implementation
for flat_mutation_reader.

The purpose of this commit is to make it easier to review the next
commit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
647dd7f86a Introduce empty_flat_reader
This is an implementation of flat_mutation_reader
that returns nothing.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
0a9ab7ff80 memtable: Introduce make_flat_reader
This method creates a flat_mutation_reader
instead of mutation_reader. All users will be gradually
converted to the new interface. make_reader is implemented
using make_flat_reader and will be removed once all users
are migrated.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 13:52:09 +01:00
Piotr Jastrzebski
3661aca7ee Add test for flat_mutation_reader::next_partition
This is added to the run_mutation_source_tests suite.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:58:31 +01:00
Piotr Jastrzebski
1c9e4ba04c Add flat_mutation_reader_assertions
This will be usefull in tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:58:31 +01:00
Piotr Jastrzebski
4bca2210bf Use mutation_source::make_flat_mutation_reader in tests
Use the new call in run_conversion_to_mutation_reader_tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:58:31 +01:00
Piotr Jastrzebski
6efda10790 Add mutation_source::make_flat_mutation_reader
This will be used as an intermediate state of migration
from mutation_reader to flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:58:31 +01:00
Piotr Jastrzebski
93e8b43e7b Add flat reader mutation source implementation
This will be used by sources that are migrated to
flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:41:12 +01:00
Piotr Jastrzebski
1a7936561e Prepare mutation_source for more than one implementation
There will be a second implementation that will be used by
sources that are converted to flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:41:12 +01:00
Piotr Jastrzebski
e80007559b Remove mutation_reader.hh dependency from flat_mutation_reader.hh
It's not needed and causes cyclic dependency when we need
flat_mutation_reader in mutation_source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-08 12:41:12 +01:00
Duarte Nunes
f50b7c240f tests/view_schema_test: Wrap view queries in eventually()
...instead of wrapping the base table queries, since those will
immediately succeed. This fixes ocasional failures.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171107170221.4309-1-duarte@scylladb.com>
2017-11-08 09:20:43 +02:00
Duarte Nunes
328b908574 tests/view_schema_test: Avoid non-pk restrictions
We don't support non-PK restrictions correctly as explained in commit
3c90607 ("tests/cql_query_test: Fix view creation in
test_duration_restrictions()") and Apache Cassandra doesn't support them
for MVs either. Change some test cases to not rely on them.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171107165138.3176-1-duarte@scylladb.com>
2017-11-08 09:20:11 +02:00
Duarte Nunes
cb9daec8fd thrift: Preserve query order for some verbs
f44131226a introduced a regression where for some verbs we would
return partitions in their natural sort order, but since thrift
partition ranges can wrap-around, what we need to preserve is query
order.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171103201118.18175-1-duarte@scylladb.com>
2017-11-07 17:00:48 +00:00
Paweł Dziepak
5a4b46f555 Merge "Fix exception safety related to range tombstones in cache" from Tomasz
Fixes #2938.

* 'tgrabiec/fix-range-tombstone-list-exception-safety-v1' of github.com:scylladb/seastar-dev:
  tests: range_tombstone_list: Add test for exception safety of apply()
  tests: Introduce range_tombstone_list assertions
  cache: Make range tombstone merging exception-safe
  range_tombstone_list: Introduce apply_monotonically()
  range_tombstone_list: Make reverter::erase() exception-safe
  range_tombstone_list: Fix memory leaks in case of bad_alloc
  mutation_partition: Fix abort in case range tombstone copying fails
  managed_bytes: Declare copy constructor as allocation point
  Integrate with allocation failure injection framework
2017-11-07 15:30:52 +00:00
Pekka Enberg
b515cca5e2 tests/view_schema_test: Disable non-PK restriction tests
We don't support non-PK restrictions correctly as explained in commit
3c90607 ("tests/cql_query_test: Fix view creation in
test_duration_restrictions()") and Apache Cassandra doesn't support them
for MVs either. Disable the tests, but don't remove them because they
will be resurrected once CASSANDRA-13832 is fixed.

Message-Id: <1510052422-3478-1-git-send-email-penberg@scylladb.com>
2017-11-07 16:42:00 +02:00
Tomasz Grabiec
0123eb876c tests: range_tombstone_list: Add test for exception safety of apply() 2017-11-07 15:33:24 +01:00
Tomasz Grabiec
dedb8a6a15 tests: Introduce range_tombstone_list assertions 2017-11-07 15:33:24 +01:00
Tomasz Grabiec
bbca83d4c0 cache: Make range tombstone merging exception-safe
range_tombstone_list::apply() has no exception safety guarantees about
the logical state. The target mutation_partition in cache should be
assumed to be left in unspecified state. In particular, some of the
preexisting overlapping tombstones may be removed and not reinserted,
so the cache would be missing some of the range tombstone information
in case the whole allocating section fails.

Use apply_monotonically() which provides the needed guarantees.

Fixes #2938.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
9c620e0246 range_tombstone_list: Introduce apply_monotonically() 2017-11-07 15:33:24 +01:00
Tomasz Grabiec
2fe53ac617 range_tombstone_list: Make reverter::erase() exception-safe
erase_undo_op() constructor takes ownership of *it, and destroys it
when it goes out of scope. If emplace_back() fails, *it would be
destroyed before being removed from its container (_dst._tombstones).
Fix by making sure _ops.emplace_back() won't fail.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
6190f9fc63 range_tombstone_list: Fix memory leaks in case of bad_alloc
If insert() fails, the allocated range_tombstone would not be freed.
Use alloc_strategy_unique_ptr.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
ca3e72266f mutation_partition: Fix abort in case range tombstone copying fails
If exception is thrown from _row_tombstones.apply(), _rows will be
left uncleared. This will trigger assertion in bi::set_member_hook
destructor, which assrts that the hook is not linked.

Always clear _rows.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
5348d9f596 managed_bytes: Declare copy constructor as allocation point
Because of the small size optimization, not all copies will call the
allocator, so allocation failure injection may miss this site if the
value is not large enough. Make the testing more effective by marking
this place explicitly as an allocation point.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
34ccf234ea Integrate with allocation failure injection framework 2017-11-07 15:33:24 +01:00
Tomasz Grabiec
2d3f3ab2b8 Update seastar submodule
* seastar d71922c...8040cab (4):
  > util: Introduce support for allocation failure injection
  > Adding dpdk-port-index as a command line option with default value of 0
  > core/sharded: Introduce invoke_on_others()
  > noncopyable_function: improve support for capturing mutable lambdas
2017-11-07 15:29:45 +01:00
Paweł Dziepak
80cfcc357f Merge "Config fixes" from Calle
"Fixes #2933

Fixes regressions introduced by config restructuring.
Allows "base" config to handle errors by warning, while
other uses can opt otherwise."

[pdziepak: resolved merge conflict]

* 'calle/cfgfix' of github.com:scylladb/seastar-dev:
  config_test: Use error handler (ignore errors) + add error test
  config: Resurrect command line aliases that where lost
  main: Use error handler for config parse
  config_file: Add optional "error_handler" to yaml parse functions
2017-11-06 11:40:37 +00:00
Amos Kong
76ab8bf292 scylla_setup: parse --no-ec2-check option
The option was introduced by commit e645b0f ("dist/common/scripts: move
EC2 configuration verification to 'scylla_ec2_check'"), but it doesn't
parsed the option at all.

Fixes #2934

Signed-off-by: Amos Kong <amos@scylladb.com>
Acked-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <df21356e528c6e161f73f4408a201fedef8f52d9.1509744954.git.amos@scylladb.com>
2017-11-06 12:06:54 +02:00
Calle Wilund
21b2c2e310 config_test: Use error handler (ignore errors) + add error test
Fixes #2933

Uses handler on main test, ignoring the invalid option present.
Also adds test to verify error handling works as expected.
2017-11-06 09:58:16 +00:00
Calle Wilund
959d729428 config: Resurrect command line aliases that where lost 2017-11-06 09:54:46 +00:00
Calle Wilund
f1dd698600 main: Use error handler for config parse
Treat all errors as loggable errors/warnings. Preserving previous
behaviour.
2017-11-06 09:54:09 +00:00
Calle Wilund
287b6fd8bd config_file: Add optional "error_handler" to yaml parse functions
Allowing parse errors / unknown options to be ignored.
2017-11-06 09:53:05 +00:00
Amos Kong
d9f16cf23c trivial: fix a typo in warning message
> std::invalid_argument: Option memtable_allocation_typeis not applicable

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <37d293a4eadfb8a58acaf96f80b1d2e943530c6b.1509947604.git.amos@scylladb.com>
2017-11-06 09:41:07 +02:00
Duarte Nunes
d8e0b47e75 Merge 'CQL secondary index queries' from Pekka
"This patch series adds support for secondary index queries using the
backing index view that's created when CREATE INDEX statement is
executed.

Example:

  -- Create keyspace and table:

  CREATE KEYSPACE ks WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};

  CREATE TABLE ks.users (
    userid uuid,
    name text,
    email text,
    country text,
    PRIMARY KEY (userid)
  );

  -- Create secondary indexes:

  CREATE INDEX ON ks.users (email);

  CREATE INDEX ON ks.users (country);

  -- Insert some data:

  INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Bondie Easseby', 'beassebyv@house.gov', 'France');

  INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Demetri Curror', 'dcurrorw@techcrunch.com', 'France');

  INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Langston Paulisch', 'lpaulischm@reverbnation.com', 'United States');

  INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Channa Devote', 'cdevote14@marriott.com', 'Denmark');

  -- Query on the secondary-index backed non-primary keys:

  SELECT * FROM ks.users WHERE email = 'beassebyv@house.gov';

   userid | country | email | name
  --------+---------+-------+------
   022238c8-5213-44b5-959e-4e3e1b032f85 |        France |         beassebyv@house.gov |    Bondie Easseby

  (1 rows)

  SELECT * FROM ks.users WHERE country = 'France';

   userid                               | country | email                   | name
  --------------------------------------+---------+-------------------------+----------------
   2152d85a-61f6-4eab-af4d-e7e7d0872319 |  France |     beassebyv@house.gov | Bondie Easseby
   59fddb6d-bfc9-4636-a9a0-85383fd815ee |  France | dcurrorw@techcrunch.com | Demetri Curror

Known imitations:

- Only regular column indexes return results. Indexing primary key
  components like clustering keys return empty result set because of
  index view query partition key serialization issues that will be fixed
  in subsequent patches.

- Secondary index queries are not paginated, which can cause problems
  for queries that return a large number of rows.

- Multiple restrictions don't work correctly if one of them is backed by
  a secondary-index.

- Only one secondary-indexed restriction per query is supported -- other
  restrictions are ignored.

- Compound partition keys are not supported.

- ALLOW FILTERING on non-primary key columns does not work correctly
  without secondary index (see issue #2200)."

* 'penberg/cql-2i-queries/v2' of github.com:penberg/scylla:
  tests/cql_query_test: Add test case for secondary index queries
  cql3: Secondary-index backed select statements
  index: Fix index view schema when primary key component is indexed
  tests/cql_query_test: Fix view creation in test_duration_restrictions()
  cql3/restrictions: Add statement_restrictions::index_restrictions() helper
  index: Implement index::supports_expression() for EQ operator
  cql3: Make operator_type class non-copyable
  index: Fix index::supports_expression() operator parameter type
  cql3: Implement statement_restriction index validation
2017-11-04 01:51:55 +01:00
Tomasz Grabiec
2e96069f2f tests: perf_cache_eviction: Switch to time-series like workload
Before the patch we appended and queried at the front. Insert at the
front instead, so that writes and reads overlap. Stresses eviction and
population more.
Message-Id: <1506369562-14892-1-git-send-email-tgrabiec@scylladb.com>
2017-11-03 13:45:41 +00:00
Tomasz Grabiec
92e3449d59 mutation_reader: Do not call fast_forward_to() on a reference to a capture
The range reference is supposed to be valid as long as the reader is
used, not just around fast_forward_to().

Introduced in a6b9186cab
Message-Id: <1509710642-12713-1-git-send-email-tgrabiec@scylladb.com>
2017-11-03 12:09:42 +00:00
Amos Kong
4762326f35 dist/redhat: Fix dependence version issue
The spec file requires two different version, it causes conflict.

The problem was introduced in commit
6893ad46b8 ("dist/redhat: Switch to
g++-7/boost-1.63 on CentOS7").

Fixes #2931

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f0d74c4ae0d325d7e2bd827f56a36330b9ef19eb.1509703504.git.amos@scylladb.com>
2017-11-03 13:40:22 +02:00
Paweł Dziepak
0de651d617 Merge "Mark whole query range continuous in cache" from Tomasz
"We currently can't insert row entries at any position_in_partition,
but only at full keys and after all keys. If a query range has bounds
such that we have to insert a dummy entry at non-representable position
then information about range continuity will not be fully populated.
In particular, single-row queries of a row which is not present in sstables
will miss when repeated again.

The series fixes the problem by marking the whole query range as continuous
by inserting dummy entries at boundaries when necessary.

Refs #2579."

* tag 'tgrabiec/cache-range-continuity-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add test for population of single rows
  tests: Add test for population of continuity
  tests: mutation_reader_assertions: Introduce produces_compacted()
  mutation: Introduce apply(mutation_fragment)
  cache: Document invariants of cache_streamed_mutation::_lower_bound
  cache_streamed_mutation: Special-case population for singular ranges
  query: Introduce is_single_row()
  cache_streamed_mutation: Increment mispopulation counter when can't populate due to eviction
  cache_streamed_mutation: Override continuity of older versions when populating
  cache_streamed_mutation: Mark whole query range as continuous
  tests: cache_streamed_mutation: Allow creating expected_row at any position_in_partition
  cache_streamed_mutation: Populate continuity when range adjacent to non-latest version rows
  cache_streamed_mutation: Avoid lookup in maybe_add_to_cache() in more cases
  row_cache: Make read_context::key() valid before reading from underlying starts
  mutation_partition: Allow creating rows_entry at any clustered position_in_partition
  position_in_partition: Do not use -2 and +2 weights
  clustering_ranges_walker: Make contains() drop range tombstones adjacent to query range
  mutation_partition: Remove delegating_compare()
  mvcc: Print iterators in operator<< for partition_snapshot_row_cursor
  mvcc: Introduce partition_snapshot_row_weakref
  mvcc: Make the null state of partition_snapshot::change_mark explicit
  mvcc: Add partition_snapshot::region() getter
  mvcc: Add partition_snapshot::schema() getter
  position_in_partition: Introduce before_key()
  position_in_partition: Introduce min()
  position_in_partition: Introduce for_static_row()
2017-11-03 11:05:49 +00:00
Paweł Dziepak
ab12981491 test.py: make sure that tests/memory_footprint is being run
While not being a real unit tests memory_footprint can be a quite useful
tool and running it among other tests will ensure that we will notice
when it gets broken.
Message-Id: <20171102160233.6756-2-pdziepak@scylladb.com>
2017-11-03 11:46:30 +01:00
Paweł Dziepak
4cda3170d6 tests/memory_footprint: do not create two cache instances
When created cache registers several metrics, since attempts to create
an already existing metrics result in an exception being thrown it is no
longer possible to have two cache instances at the same time. This is
exactly what happens in memory_footprint: one (useless) cache object is
created through a call to do_with_cql_env() and, then, memory_footprint
explicitly creates another one (not a useless one).

The tests itself doesn't really need a full cql environment and the only
reason it was added is so that storage_service is initialised and various
code paths can query for the available cluster features. This can be
done in a much lightweight way using storage_service_for_tests.

Fixes memory_footprint failure (until next time we decide there is
nothing wrong with globals).

Message-Id: <20171102160233.6756-1-pdziepak@scylladb.com>
2017-11-03 11:46:30 +01:00
Amos Kong
f2ff431b75 dist/redhat: Fix baseurl of 3rdparty repo
$basearch isn't parsed as expected, the finaly baseurl is wrong.
We only have x86_64 arch in external 3rdparty repository, and
the conf file is only for x86_64, so it's fine to use hardcode
x86_64.

The problem was introduced by commit
b5e83ebd94 ("dist/redhat: switch 3rdparty
packages to external build service").

Fixes #2930

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <708f46a7c36623e86fee278462c80db1eff3b820.1509700430.git.amos@scylladb.com>
2017-11-03 11:51:34 +02:00
Pekka Enberg
aeea83172b tests/cql_query_test: Add test case for secondary index queries 2017-11-03 10:12:58 +02:00
Pekka Enberg
9048f741ad cql3: Secondary-index backed select statements
This patch adds support for secondary-index backed select statements.
Current select_statement class is split into two separate classes:
primary_key_select_statement that retains regular query behavior and
indexed_table_select_statement that introduces the new secondary-index
backed query logic. One of the two behaviors is selected at query
preparation time to minimize overhead for non-indexed queries.
2017-11-03 10:12:58 +02:00
Pekka Enberg
3150962cb7 index: Fix index view schema when primary key component is indexed
This fixes index view schema to exclude indexed column when a primary
key component like clustering key is indexed. This fixes a server crash
when CREATE INDEX statement is executed on a clustering key column.
2017-11-03 10:12:58 +02:00
Pekka Enberg
3c90607988 tests/cql_query_test: Fix view creation in test_duration_restrictions()
The materialized view created in test_duration_restriction() restricts
on a non-PK column. Since Scylla's ALLOW FILTERING and secondary index
validation path is broken, once we start to do secondary index queries,
query processor thinks there's a secondary index backing that non-PK
column and fails because it's unable to find such column.

Fix up the view to only trigger the duration type validation error we're
interested in here.
2017-11-03 10:12:58 +02:00
Pekka Enberg
c243a0c8fc cql3/restrictions: Add statement_restrictions::index_restrictions() helper 2017-11-03 09:10:43 +02:00
Pekka Enberg
678a6f6e2f index: Implement index::supports_expression() for EQ operator 2017-11-03 09:10:43 +02:00
Pekka Enberg
04b482146c cql3: Make operator_type class non-copyable
The operator_type class is really an enumeration, which is not supposed
to be copied.
2017-11-03 09:10:43 +02:00
Pekka Enberg
1ae9343f68 index: Fix index::supports_expression() operator parameter type
The cql3::operator_type is supposed to be passed around as const
reference, not by value; otherwise equality won't work.
2017-11-03 09:10:43 +02:00
Pekka Enberg
3e3c580f74 cql3: Implement statement_restriction index validation 2017-11-03 09:10:43 +02:00
Botond Dénes
ce03a4d2c7 test.py: print failed test summary if there are failed tests
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3a913b111552276ab94dfb83738244699550f929.1507894597.git.bdenes@scylladb.com>
2017-11-02 11:49:14 +00:00
Tomasz Grabiec
16d6222a96 tests: row_cache: Add test for population of single rows 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
bbf8ccb709 tests: Add test for population of continuity 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
3ad5666098 tests: mutation_reader_assertions: Introduce produces_compacted() 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
749f5770df mutation: Introduce apply(mutation_fragment) 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
a76202df4f cache: Document invariants of cache_streamed_mutation::_lower_bound
(cherry picked from commit b52813279d30782270ac83856233f18787b28b7e)
2017-11-02 12:16:17 +01:00
Tomasz Grabiec
328faf695e cache_streamed_mutation: Special-case population for singular ranges
This is an optimization which avoids creating dummy entries around row
entry when populating a singular range.
2017-11-02 12:16:09 +01:00
Tomasz Grabiec
90796893ee query: Introduce is_single_row() 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
0fd57cdff5 cache_streamed_mutation: Increment mispopulation counter when can't populate due to eviction 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
8c41a3eb43 cache_streamed_mutation: Override continuity of older versions when populating
Fixes the case of continuity not being populated when the row which is
the upper bound of the population range belongs to a non-latest
version. In such case we wouldn't mark the range as continuous,
because we can't modify rows of non-latest versions. To fix this,
create an empty entry in latest version which will just override the
continuity flag of the old entry.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
65ed490e1c cache_streamed_mutation: Mark whole query range as continuous
Before this patch only ranges between returned row fragments were
marked as continuous.  In the extreme case, there could be no such
fragments, in which case next read would miss as well. To avoid this,
mark whole query range as continuous by inserting dummy entries when
necessary.

Refs #2579.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
552d7a683a tests: cache_streamed_mutation: Allow creating expected_row at any position_in_partition 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
d4928eb1b7 cache_streamed_mutation: Populate continuity when range adjacent to non-latest version rows
Current code will not mark the range as continuous if the previous
entry does not come in the latest version. Fix that by switching
to partition_snapshot_row_pointer, which is capable of checking
in older versions as necessary.

Also, we avoid the key comparison if we know that the iterator
is still valid.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
835d17ee37 cache_streamed_mutation: Avoid lookup in maybe_add_to_cache() in more cases 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
af4a9a4a30 row_cache: Make read_context::key() valid before reading from underlying starts
So that we can call cache_streamed_mutation::can_populate() before
we start reading from underlying. Will be needed in upcoming changes
which insert dummy entries when falling back to underlying.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
72028bb048 mutation_partition: Allow creating rows_entry at any clustered position_in_partition
In preparation for supporting setting continuity of arbitrary clustering range.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
9ac1b515e1 position_in_partition: Do not use -2 and +2 weights
::weight() is using those values for excl_end and excl_start in order
to be able to represent non-overlapping ranges. In their model the end
bound is inclusive. We don't need this, since position_range has end
bound exclusive.

This change makes that:

  position_in_partition::after_key(y)
    == position_in_partition::for_range_end(clutering_range::make({x}, {y})
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
4b25fa1130 clustering_ranges_walker: Make contains() drop range tombstones adjacent to query range
position_range is end-exclusive. The reader might have returned a tombstone
which is not really relevant for the range.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
409adc045a mutation_partition: Remove delegating_compare()
It can't work with rows_entry at any position_in_partition,
so we need to drop it.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
b4954f55b9 mvcc: Print iterators in operator<< for partition_snapshot_row_cursor 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
ad156b5986 mvcc: Introduce partition_snapshot_row_weakref 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
967cabcaf2 mvcc: Make the null state of partition_snapshot::change_mark explicit 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
4b7933543d mvcc: Add partition_snapshot::region() getter 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
9cf30f19ae mvcc: Add partition_snapshot::schema() getter 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
34cb13939f position_in_partition: Introduce before_key() 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
cc06c328ef position_in_partition: Introduce min() 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
d05f130b09 position_in_partition: Introduce for_static_row() 2017-11-02 11:05:19 +01:00
Calle Wilund
8c257c40b4 storage_service: Only replicate token metadata iff modified in on_change
Fixes #2869

Message-Id: <20171101105629.22104-1-calle@scylladb.com>
2017-11-01 14:56:55 +02:00
Jesse Haber-Kucharsky
da5c486e49 Add coding-style.md referencing Seastar
While it would be nice if we could reference the file corresponding to
the exact version of Seastar pinned as a Scylla submodule, GitHub does
not support this.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <b7565acd1ccb8bce42b0edf00221922e78e1c9ef.1508274655.git.jhaberku@scylladb.com>
2017-10-30 16:52:29 -07:00
Duarte Nunes
74a4cf8bb1 thrift/handler: multiget_{slice, count} always returns queried keys
This patch changes the way the multiget_{slice, count} verbs return
their results, by ensuring a queried key that produced no results is
still present in the returned map, associated with an empty list.

This is not required by the thrift interface, and it is a performance
step back, but matches the behavior of Apache Cassandra.

Said behavior is relied upon by projects like JanusGraph, whose
integration with Scylla motivated this patch.

Fixes #2900

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171019161104.22797-2-duarte@scylladb.com>
2017-10-30 16:48:58 -07:00
Duarte Nunes
f44131226a thrift/handler: Use map for column_visitor aggregation
Most common operations, like multiget_count and multiget_slice, return
maps. So, instead of keeping a vector internally in column_visitor
that we later transform into a map, keep a map that we transform into
a vector for the uncommon operations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171019161104.22797-1-duarte@scylladb.com>
2017-10-30 16:48:55 -07:00
Takuya ASADA
a56dd79d69 dist/redhat: moving gcc-7.1 to gcc-7.2
We mistakenly merged the patch witch compiles Scylla using gcc-7.1, it need to
fix correct version (gcc-7.2).

Message-Id: <1508875618-31659-1-git-send-email-syuu@scylladb.com>
2017-10-30 14:26:43 -07:00
Tomasz Grabiec
b4e3c0946a cache_streamed_mutation: Avoid copy of decorated_key
Message-Id: <1509060503-17483-1-git-send-email-tgrabiec@scylladb.com>
2017-10-26 16:51:27 -07:00
Pekka Enberg
9ba3920fd7 Merge "cql3/query_processor: Clean-up" from Jesse
"This series cleans-up the query processor header and source file,
 including deleting dead Java code.

 There are no functional or interface changes.

 I've run all unit tests and observed no failures."

* 'jhk/clean_up_qp/v2' of github.com:hakuch/scylla:
  cql3/query_processor: Fix formatting
  cql3/query_processor: Organize headers
  cql3/query_processor.hh: Consolidate `public` and `private` sections
  cql3/query_processor: Remove dead Java code
2017-10-21 21:48:06 +03:00
Jesse Haber-Kucharsky
66c4abe4fb cql3/query_processor: Fix formatting
Lines are now less than 120 columns and formatting conforms to the Seastar coding standards document.
2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky
edb83c0014 cql3/query_processor: Organize headers 2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky
ed6a3179a1 cql3/query_processor.hh: Consolidate public and private sections 2017-10-21 13:53:03 -04:00
Jesse Haber-Kucharsky
50cfa8a7b8 cql3/query_processor: Remove dead Java code 2017-10-21 13:53:03 -04:00
Avi Kivity
ef8587a910 Merge seastar upstream
* seastar 8babd1f...d71922c (11):
  > configure.py: add -Wno-sign-compare to compile Boost.Test with gcc-7
  > log: Print nested exceptions
  > reactor: do not account non idle activity for total idle time calculation
  > execution_stage: defer execution less aggressively
  > Fix -Wreturn-type warnings
  > cpu scheduler: make _reciprocal_shares_times_2_32 wider to avoid overflow problems
  > noncopyable_function add bool operator
  > execution_stage: make make_execution_stage return a named type
  > memory: support overriding the default allocator page size
  > memory: fix crash during startup with large page_size
  > core: io_destroy is missing when destructing reactor, which causes io_context leak
2017-10-21 16:38:32 +03:00
Takuya ASADA
bc76d34e34 dist/debian: handle python scripts correctly on package builder
We are failing to build .deb package on pbuilder due to lack of build time
dependencies so we need add those packages on Build-Depends, also we need to
follow Debian packaging style for the package contains python scripts.

Fixes #2918

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457215-11552-1-git-send-email-syuu@scylladb.com>
2017-10-21 16:38:20 +03:00
Takuya ASADA
6893ad46b8 dist/redhat: Switch to g++-7/boost-1.63 on CentOS7
Switch to g++-7/boost-1.63 on CentOS7, too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457509-13122-2-git-send-email-syuu@scylladb.com>
2017-10-21 12:28:40 +03:00
Takuya ASADA
7f38634080 dist/debian: Switch to g++-7/boost-1.63 on Ubuntu 14.04/16.04
Switch to g++-7/boost-1.63 for Ubuntu 14.04/16.04 that newly provided via
our 3rdparty PPA.

To make Scylla compilable with boost-1.63/g++-7, we need to disable following
warnings:
 - misleading-indentation
 - overflow
 - noexcept-type
Compile error message:
https://gist.github.com/syuu1228/96acc640c56c3316df5ce6911d60beea

Seastar also has similar problem, it needs to disable 'sign-compare', detail
is in a patch for Seastar.

This update also fixes current Ubuntu 14.04/16.04 compilation error problem,
since errors were come from too old g++/boost.

Fixes #2902
Fixes #2903

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457509-13122-1-git-send-email-syuu@scylladb.com>
2017-10-21 12:28:38 +03:00
Avi Kivity
d6cd44a725 Revert "Merge 'Single key sstable reader optimization' from Botond"
This reverts commit 5e9cd128ad, reversing
changes made to 1f4e6759a7. Tomek found
some serious issues.
2017-10-19 12:47:21 +03:00
Botond Dénes
9bd4d7cbb2 Readd x right to configure.py (removed by 05db87e06)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <41d505277fe29b275aba65477cafc9275b393a64.1508395961.git.bdenes@scylladb.com>
2017-10-19 11:30:06 +03:00
Duarte Nunes
5e9cd128ad Merge 'Single key sstable reader optimization' from Botond
"When reading a single row it is possible that the read will be satisfied
by just reading from one of the data source candidates. To exploit this
an optimization is employed which sorts data source candidates by their
timestamp and reads mutations from the most recent to the oldest. When
all needed cells are present and their earliest timestamp is still
later than the latest one of the remaining data source the read can be
terminated early.
However this optimization also has the possibility to backfire as the
data sources are read sequentially, so if all of them has to be read
eventually then we will end up worse then without it.
Thus the optimization can be disabled up-front or enabled to only run
until its efficiency degrades below a certain threshold.
Also counters are added to column-families to make it possible to
observe how well it performs.

Benchmarking

Benchmarking was done with disabled cache and at a constant op rate of
4k (1/3 of the max op rate on my box), against 3 sstables containing the
same 10000 rows.

1) Optimization turned off (all sstables read paralelly)
latency mean              : 1.3 [simple:1.3]
latency median            : 1.0 [simple:1.0]
latency 95th percentile   : 2.4 [simple:2.4]
latency 99th percentile   : 2.9 [simple:2.9]
latency 99.9th percentile : 8.0 [simple:8.0]
latency max               : 13.5 [simple:13.5]

2) Optimization turned on, best case (1 of 3 sstables read)
latency mean              : 0.6 [simple:0.6]
latency median            : 0.6 [simple:0.6]
latency 95th percentile   : 1.0 [simple:1.0]
latency 99th percentile   : 1.2 [simple:1.2]
latency 99.9th percentile : 4.4 [simple:4.4]
latency max               : 13.4 [simple:13.4]

3) Optimization turned on, best case, IN query (1 of 3 sstables read)
latency mean              : 0.7 [simple_in:0.7]
latency median            : 0.6 [simple_in:0.6]
latency 95th percentile   : 1.1 [simple_in:1.1]
latency 99th percentile   : 1.4 [simple_in:1.4]
latency 99.9th percentile : 5.4 [simple_in:5.4]
latency max               : 16.8 [simple_in:16.8]

4) Optimization turned on, worst case (3 of 3 sstables read sequentally)
latency mean              : 2.8 [simple:2.8]
latency median            : 2.3 [simple:2.3]
latency 95th percentile   : 5.4 [simple:5.4]
latency 99th percentile   : 6.5 [simple:6.5]
latency 99.9th percentile : 13.5 [simple:13.5]
latency max               : 19.2 [simple:19.2]

5) Optimization turned on, mid case (2 of 3 sstables read sequentally)
latency mean              : 1.4 [simple:1.4]
latency median            : 1.1 [simple:1.1]
latency 95th percentile   : 2.7 [simple:2.7]
latency 99th percentile   : 3.2 [simple:3.2]
latency 99.9th percentile : 7.7 [simple:7.7]
latency max               : 15.1 [simple:15.1]"

Ref #324

* 'bdenes/optimize_single_row_read_v6' of github.com:denesb/scylla:
  Add unit tests for single_key_sstable_reader
  Add counters for the single-key reader optimization
  Add single_key_parallel_scan_threshold option
  single_key_sstable_reader: optimize single-row queries
  single_key_sstable_reader: move reading code into it's own method
  Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns()
2017-10-18 16:38:53 +01:00
Duarte Nunes
1f4e6759a7 tests: Fix compile errors introduced in c468e5981
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1508337315-8224-1-git-send-email-duarte@scylladb.com>
2017-10-18 16:38:18 +01:00
Botond Dénes
c3bd89ad63 Add unit tests for single_key_sstable_reader 2017-10-18 17:24:03 +03:00
Botond Dénes
dfe312ca3a Add counters for the single-key reader optimization
Add two counters, one to determine how many of the reads fall into the
optimization, and a second one to determine it's effectiveness.

The first one is single_key_reader_optimization_hit_rate. It contains
the rate of reads that the optimization applies to out of all the reads
that go into the single_key_sstable_reader.

The second one, single_key_reader_optimization_extra_read_proportion is
a histogram of the efficiency of the optimization. It contains the
proportion of extra data-sources read. It's a number between 0 and 1,
where 0 is the best case (only one data-source was read) and 1 is the
worst case (all data-sources were read eventually). This is the same
number that is used for the threshold option (see previous patch).
Each of the histogram's buckets cover a chunk of 0.1 from the [0, 1]
range.
Note that single_key_parallel_scan_threshold effectively provides an
upper bound for the proportion as the optimization is turned off as soon
as it goes above that number.

The counters are disabled if single_key_parallel_scan_threshold is set
to 0 disabling the optimization entirely.
2017-10-18 17:24:03 +03:00
Botond Dénes
08502f2d48 Add single_key_parallel_scan_threshold option
This option regulates when exactly the single-key optimization is
considered ineffective and turned off.
The threshold is the proportion of the extra data source candidates that
can be read before the optimization is considered ineffective and
disabled. The proportion is calculated as follows:
    (read_data_sources - 1) / (total_data_sources - 1)

We substract 1 from the read_data_sources and total_data_sources to
effectively measure the rate of *extra* data sources we read. This
makes sure that the proportion is meaningful even if e.g. we have only
have a total of 2 data-sources and we read only 1 (best case).

Whenever this number goes above the threshold the optimization is
disabled. The threshold is number between 0 and 1, 0 forces the
optimization off and 1 forces it on. Increase the treshold to favor
throughput over latency for single-row reads, decrease the treshold to
improve latency at the expense of throughput.

If the threshold is > 0 (it's not force disabled) and the optimization
is disabled due to a read crossing the threshold, we will issue
"probing" reads (every 100th read) to determine if the optimization is
worth re-enabling. Probing reads are allowed to run through the
optimization path and if they go below the threshold the optimization is
re-enabled.
2017-10-18 17:24:03 +03:00
Botond Dénes
3c1fa3ecc1 single_key_sstable_reader: optimize single-row queries
For single-row queries that only query atomic cells one can put a lower
bound on the timestamps which may affect the query results and thus rule
out entire data sources. This allows the query to read only those
sstables that actually contribute to the result.
To do this we incrementally move through the sstables overlapping with
the query range, checking after each read mutation whether we already
have a value for all required cells and whether the lower-bound of their
timestamps is higher than the upper-bound of the timestamps of all the
remaining data-sources. When this condition is met we terminate the
read.
2017-10-18 17:24:03 +03:00
Botond Dénes
5fc44c4307 single_key_sstable_reader: move reading code into it's own method 2017-10-18 17:24:03 +03:00
Botond Dénes
6cdeca1846 Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns() 2017-10-18 17:24:03 +03:00
Botond Dénes
7aceb14395 Fix compile errors in tests/config_test.cc introduced by c468e5981
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <2700ac3987c3a229eb7083ce6f5d390012a3b66c.1508336217.git.bdenes@scylladb.com>
2017-10-18 15:20:45 +01:00
Paweł Dziepak
c28e31eac4 database: fix build (auto shards&) 2017-10-18 13:10:00 +01:00
Duarte Nunes
446e5f53db database: Avoid superfluous shards_for_this_sstable vector copies
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171018112643.40411-1-duarte@scylladb.com>
2017-10-18 15:00:52 +03:00
Duarte Nunes
044b8deae4 Merge 'Solves problems related to gossip which can be observed in a large cluster' from Tomasz
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having up to date view on the
ring, and serve incorrect data.

Fixes #2855."

* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
  gms/gossiper: Remove periodic replication of endpoint state map
  gossiper: Check for features in the change listener
  gms/gossiper: Replicate changes incrementally to other shards
  gms/gossiper: Document validity of endpoint_state properties
  storage_service: Update token_metadata after changing endpoint_state
  gms/gossiper: Process endpoints in parallel
  gms/gossiper: Serialize state changes and notifications for given node
  utils/loading_shared_values: Allow Loader to return non-future result
  gms/gossiper: Encapsulate lookup of endpoint_state
  storage_service: Batch token metadata and endpoint state replication
  utils/serialized_action: Introduce trigger_later()
  gossiper: Add and improve logging
  gms/gossiper: Don't fire change listeners when there is no change
  gms/gossiper: Allow parallel apply_state_locally()
  gms/gossiper: Avoid copies in endpoint_state::add_application_state()
  gms/failure_detector: Ignore short update intervals
2017-10-18 10:13:25 +01:00
Duarte Nunes
c468e59817 Merge 'Extract config file mechanism + allow additional' from Calle
"Extracts the yaml/boost-po aspects of the "self-describing" db::config
into an abstract type.

db::config is then reimplemented in said type, removing some of the
slightly cumbersome entanglement with seastar opts (log).

Adds a main hook for additional configuration files (options + file)"

* 'calle/config' of github.com:scylladb/seastar-dev:
  main/init: Add registerable configuration objects
  db::config: Re-implement on utils/config_file.
  utils::config_file: Abstract out config file to external type
2017-10-18 09:50:53 +01:00
Tomasz Grabiec
f570e41d18 gms/gossiper: Remove periodic replication of endpoint state map
For large clusters the map can be big and cause latency problems.
Since we now actively replicate changes, this is no longer needed.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
84c7b63c51 gossiper: Check for features in the change listener
In preparation for removal of periodic replication
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
2d5fb9d109 gms/gossiper: Replicate changes incrementally to other shards
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
28c9609370 gms/gossiper: Document validity of endpoint_state properties 2017-10-18 08:49:53 +02:00
Tomasz Grabiec
cf113ed295 storage_service: Update token_metadata after changing endpoint_state
There is a requirement that whatever is present in token_metadata,
should also be present in endpoint_state. Because of that, we should update
endpoint_state first (set_gossip_tokens).

Apache Cassandra switched to this order as well in commit
b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
5cc83b9b3c gms/gossiper: Process endpoints in parallel
Makes state application faster due to increased parallelism.

Refs #2855.

Bootrap of 11th node, ignoring apply_state_locally() which complete instantly:

Before:

DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms
DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms
DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms
DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms
DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms

After:

DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
8f01e08690 gms/gossiper: Serialize state changes and notifications for given node
It's possible that a change listener for a later state will run before
change listener for the previous state completes, in which case
node's state can be corruped. For example, the previous change listener
may override sysytem.peers with an old value.

This patch fixes the problem by serializing state changes and
listeners for each node.

The implementation uses loading_shared_values so that the lock remains
alive as long as there is anyone holding it. Using endpoint_state_map
for that doesn't seem appropraite, because entries can be removed from
it while listeners are still running.  There is code in the gossiper
which anticipates that entry may be gone across deferring points in
some places.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
f7a7e97095 utils/loading_shared_values: Allow Loader to return non-future result 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
6fccf7f4d0 gms/gossiper: Encapsulate lookup of endpoint_state 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
6263b0ebb6 storage_service: Batch token metadata and endpoint state replication
Replication needs to be serialized. We can batch replication requests
which are waiting to start. Use serialized_action, which does this.
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
2e2ae4671e utils/serialized_action: Introduce trigger_later()
Can be used instead of trigger() to improve batching.
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
41ffefd194 gossiper: Add and improve logging 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
0ed84710d9 gms/gossiper: Don't fire change listeners when there is no change
apply_new_states() always fires change listeners for received values,
even if we already processed the state earlier. Some change listeners
are heavy-weight, e.g. storage_service::handle_state_normal().  We
should avoid calling them more than necessary.

Make sure that we always run the change listeners by putting them in a
defer() block. Otherwise, if exception is thrown in the middle of state
application, change listeners would not be run. Later we would not
detect the change for states which were already applied, and not run
the change listers.

Fixes #2867
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
c780a74b58 gms/gossiper: Allow parallel apply_state_locally()
It is serialized since e428d06f40. This causes regression in
performance of application state propagation due to reduced
parallelism.

Processing states for each node has high latency due to memtable
flushes triggered by update_tokens() and commitlog syncs done by
system.peers updates, if commitlog sync mode is set to "batch".  We
have high internal concurrency for these, so increasing parallelism
significantly reduces time to process all states.

Fixes #2855.
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
f20a805eca gms/gossiper: Avoid copies in endpoint_state::add_application_state() 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
a71624d58d gms/failure_detector: Ignore short update intervals
Failure detector decides that a node is down if it hasn't received a change of
its heartbeat for longer than ~11 times the average of past intervals between
updates.

If there are multiple incoming ACKs containing information about the
same node, we may detect and report a change for each of them. This
will cause failure_detector to establish that the average report
period is in milliseconds. After the update storm is over, it will
claim the node failure very soon, because report period will now be a
large multiple of the average.

Fix by not counting short updates into the calculation of average
arrival time.

Fixes #2861.
2017-10-18 08:49:52 +02:00
Calle Wilund
12a54805ea main/init: Add registerable configuration objects
Allowing plugging in command line arguments + "parse-points"
for configs outside db/config
2017-10-18 00:52:04 +00:00
Calle Wilund
4bd98f7296 db::config: Re-implement on utils/config_file.
Re-use config abstraction, and de-couple the seastar logging 
parts a little bit more.
2017-10-18 00:51:54 +00:00
Calle Wilund
05db87e068 utils::config_file: Abstract out config file to external type
Handling all the boost::commandline + YAML stuff.
This patch only provides an external version of these functions,
it does not modify the db::config object. That is for a follow-up
patch.
2017-10-18 00:51:41 +00:00
Pekka Enberg
ae92055b52 Merge "Bring histogram closer to what Prometheus expects" from Glauber
"Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/

One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.

The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.

2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that falls between bins.
IOW, it is a cummulative histogram.

About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:

"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."

It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that."

Acked-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Gleb Natapov <gleb@scylladb.com>

* 'prometheus-histograms' of github.com:glommer/scylla:
  storage_proxy: change reporting of estimated histograms
  estimated_histogram: bring histogram closer to what prometheus expects.
2017-10-17 20:23:10 +03:00
Takuya ASADA
3cab5557e5 dist/debian: fix Debian not to use new dependency package names
We moved to new dependency package names like
antlr3-c++-dev to scylla-antlr35-c++-dev when we moved to ppa on Ubuntu, but
Debian still uses old dist/debian/dep packages.
So keep using old style package names.

Fixes #2831

Message-Id: <1508245175-2184-1-git-send-email-syuu@scylladb.com>
2017-10-17 16:39:35 +03:00
Daniel Fiala
f5629b3a23 types: Use std::pair instead of std::tuple to avoid compile-time error with explicit constructor.
Fixes #2895.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171017071316.2836-1-daniel@scylladb.com>
2017-10-17 12:32:43 +01:00
Duarte Nunes
baeec0935f Replace query::full_slice with schema::full_slice()
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.

Refs #2885

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
2017-10-17 11:25:53 +02:00
Duarte Nunes
fbb4c9edda schema: Provide all-selecting partition slice
This patch introduces schema::full_slice(), which returns a
partition_slice selecting the full clustering range, as well as all
static and regular columns. No options aside from the default are
set in that partition_slice.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-1-git-send-email-duarte@scylladb.com>
2017-10-17 11:25:35 +02:00
Tomasz Grabiec
6d5a0f8a98 db: Add debug-level logging related to streaming
Message-Id: <1505896395-30203-1-git-send-email-tgrabiec@scylladb.com>
2017-10-16 18:49:10 +01:00
Paweł Dziepak
d9abb75bfa tests/perf_simple_query: fix counter update query
Message-Id: <20171016125334.4423-1-pdziepak@scylladb.com>
2017-10-16 19:41:31 +02:00
Calle Wilund
cc28cf838c password_auth: Return actual generated salt from gensalt
Fixes: 2898
Typo error in gensalt(). Only returned selected hash method, not
the random salt bytes. Does not prevent the hash function from
operating, but strength is ever so reduced.
Message-Id: <20171016130505.25593-2-calle@scylladb.com>
2017-10-16 14:07:46 +01:00
Calle Wilund
57c5f13166 password_auth: Keep crypt_data as thread local
Fixes: 2887
Speeds up password hashing ever so slightly.
Message-Id: <20171016130505.25593-1-calle@scylladb.com>
2017-10-16 14:07:42 +01:00
Paweł Dziepak
8c3b7fea81 Merge "Introduce new API and converters from/to old mutation_reader" from Piotr
"This changeset is the first step to flatten mutation_reader.
Then it introduces new mutation_fragment types for partition header and end of partition.
Using those a new flat_mutation_reader is defined.
Finally it introduces converters between new flat_mutation_reader and
old mutation_reader."

* 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev:
  Add tests for flat_mutation_reader
  Introduce conversion from flat_mutation_reader to mutation_reader
  Introduce conversion from mutation_reader to flat_mutation_reader
  Introduce flat_mutation_reader
  Extract FlattenedConsumer concept using GCC6_CONCEPT
  Introduce partition_end mutation_fragment
  Introduce a position for end of partition
  Introduce partition_start mutation_fragment
  Introduce FragmentConsumer
  Introduce a position for partition start
  streamed_mutation: Extract concepts using GCC6_CONCEPT macro
2017-10-16 12:14:23 +01:00
Gleb Natapov
bd09ce7cd4 gdb: Add new command task_histogram
The command scans random set of objects in a small pool (or, optionally
only objects of a certain size) for vptrs and builds a histogram, so that
most often used vptrs can be easily found. The command is useful to find
"memory leaks" caused by creating of too many tasks of a certain type
which is usually a result of unlimited parallelism somewhere.

Message-Id: <20171015081634.GB21092@scylladb.com>
2017-10-15 12:12:42 +03:00
Piotr Jastrzebski
5f34559b78 Add tests for flat_mutation_reader
Those tests run mutation source test for all sources
using conversion to and from flat_mutation_reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-13 16:08:59 +02:00
Piotr Jastrzebski
31733a7eeb Introduce conversion from flat_mutation_reader to mutation_reader
This will be used in transition from mutation_reader
to flat_mutation_reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-13 16:08:59 +02:00
Piotr Jastrzebski
6a66bee788 Introduce conversion from mutation_reader to flat_mutation_reader
This will be used in transition from mutation_reader
to flat_mutation_reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-13 16:08:59 +02:00
Piotr Jastrzebski
748205ca75 Introduce flat_mutation_reader
This reader operates on mutation_fragments instead of
streamed_mutations.

Each partition starts with a partition_header fragment
and ends with end_of_partition fragment.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-13 16:08:40 +02:00
Tomasz Grabiec
b74b06808e tests: row_cache: Add test for concurrent population of partition entries
Message-Id: <1507815478-20269-2-git-send-email-tgrabiec@scylladb.com>
2017-10-12 15:55:33 +01:00
Tomasz Grabiec
083b9cddef row_cache: Fix handling of concurrent partition population
This fixes a regression introduced in 27a3b4bca9 (master only).

partition_range_cursor assumes that as long as references are valid,
_end is valid as well. But if new entries were inserted before _end,
it may not, if the new entries fall after the query range. This may
result in reads returning partitions from outside the query range.
Message-Id: <1507815478-20269-1-git-send-email-tgrabiec@scylladb.com>
2017-10-12 15:55:20 +01:00
Tomasz Grabiec
68fe1a5bee utils/loading_cache: Fix compilation on older compilers
Message-Id: <1507728312-10585-1-git-send-email-tgrabiec@scylladb.com>
2017-10-12 14:55:34 +03:00
Raphael S. Carvalho
25a4f152cd sstables: remove dead sstable method
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171012072905.12737-1-raphaelsc@scylladb.com>
2017-10-12 11:58:39 +02:00
Raphael S. Carvalho
16dd0d15fc sstables: make get_shards_for_this_sstable return const ref
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171012072850.12681-1-raphaelsc@scylladb.com>
2017-10-12 11:58:23 +02:00
Pekka Enberg
1701fc2e50 Merge "gms/gossiper: Multiple cleanups" from Duarte
"Based on the functions get_endpoint_state_for_endpoint_ptr(),
get_application_state_ptr() and
endpoint_state::get_application_state_ptr(), this series
cleanups miscelaneous functions related to the gossiper.

It not only removes duplicated code, but also omits many copies.

All pointer usages have been audited for safety."

Acked-by: Asias He <asias@scylladb.com>
Acked-by: Tomasz Grabiec <tgrabiec@scylladb.com>

* 'gossiper-cleanup/v2' of github.com:duarten/scylla: (27 commits)
  gms/endpoint_state: Remove get_application_state()
  service/storage_service: Avoid copies in prepare_replacement_info()
  service/storage_service: Cleanup get_application_state_value()
  service/storage_service: Cleanup handle_state_removing()
  service/storage_service: Cleanup get_rpc_address()
  locator/reconnectable_snitch_helper: Avoid versioned_value copies
  locator/production_snitch_base: Cleanup get_endpoint_info()
  service/migration_manager: Avoid copies in is_ready_for_bootstrap()
  service/migration_manager: Cleanup has_compatible_schema_tables_version()
  service/migration_manager: Fix usages of get_application_state()
  cache_hit_rate: Avoid copies in get_hit_rate()
  gms/endpoint_state: Avoid copies in is_shutdown()
  service/load_broadcaster: Avoid copy in on_join()
  gms/gossiper: Cleanup get_supported_features()
  gms/gossiper: Cleanup get_gossip_status()
  gms/gossiper: Cleanup seen_any_seed()
  gms/gossiper: Cleanup get_host_id()
  gms/gossiper: Removed dead uses_vnodes() function
  gms/gossiper: Cleanup uses_host_id()
  gms/gossiper: Add get_application_state_ptr()
  ...
2017-10-11 13:45:36 +03:00
Duarte Nunes
f67a553b96 gms/endpoint_state: Remove get_application_state()
It is no longer used, as all callsites have moved to
get_application_state_ptr().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
e9358c1c83 service/storage_service: Avoid copies in prepare_replacement_info()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
674f5d8eaf service/storage_service: Cleanup get_application_state_value()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
0ccb9211d7 service/storage_service: Cleanup handle_state_removing()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
bdee795876 service/storage_service: Cleanup get_rpc_address()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2f05d7423a locator/reconnectable_snitch_helper: Avoid versioned_value copies
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
28d63a76df locator/production_snitch_base: Cleanup get_endpoint_info()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
03e6fc95ba service/migration_manager: Avoid copies in is_ready_for_bootstrap()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
72ca6b34ef service/migration_manager: Cleanup has_compatible_schema_tables_version()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
976324bbb8 service/migration_manager: Fix usages of get_application_state()
We were taking a reference to a temporary value in different places.
Fix them by using get_application_state_ptr(), which also avoids a copy.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
bb89b97cbb cache_hit_rate: Avoid copies in get_hit_rate()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
9d5c6e0c72 gms/endpoint_state: Avoid copies in is_shutdown()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
25b0654312 service/load_broadcaster: Avoid copy in on_join()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
92df519b91 gms/gossiper: Cleanup get_supported_features()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
39f71f7d12 gms/gossiper: Cleanup get_gossip_status()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
db660f1e08 gms/gossiper: Cleanup seen_any_seed()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
88dd97fe8e gms/gossiper: Cleanup get_host_id()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
95079795ce gms/gossiper: Removed dead uses_vnodes() function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
7db7704edc gms/gossiper: Cleanup uses_host_id()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2984bdab29 gms/gossiper: Add get_application_state_ptr()
This patch introduces the get_application_state_ptr() function, which
allows access to a versioned_value of a particular endpoint.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
f41748af81 gms/gossiper: Cleanup notify_failure_detector()
Now that we have get_endpoint_state_for_endpoint_ptr(), which does not
return a copy and allows mutating the actual state, we can use it
instead of repeating the lookup code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2210d10552 gms/gossiper: Cleanup is_alive()
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
ceef45a6fe gms/gossiper: Const-qualify functions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
955aee1588 gms/gossiper: Cleanup convict()
Have convict() use get_endpoint_state_for_endpoint_ptr(), simplify
logging, and also protect expensive operations by checking the log
level.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
cf99a41226 gms/gossiper: Add non-const get_endpoint_state_for_endpoint_ptr()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
d0fba1a113 gms/failure_detector: Simplify alive/dead endpoint count
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
dc65cda1a3 gms/failure_detector: Fix if/else style to include braces
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Raphael S. Carvalho
67c5c8dc67 sstables: do not recompute shards for all tables after each compaction
For every finished compaction, we were calculating shards for all
existing tables. With ignore_msb set to 0, it's probably not a big
deal, but if ignore_msb is like 12 and LCS is used (meaning thousands
of tables possibly), the operation may stall the reactor for a
considerable amount of time. That's fixed by caching shards.

Fixes #2875.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171011053424.22308-1-raphaelsc@scylladb.com>
2017-10-11 11:45:01 +03:00
Tomasz Grabiec
66a15ccd18 gms/gossiper: Introduce copy-less endpoint_state::get_application_state_ptr()
Message-Id: <1507642411-28680-3-git-send-email-tgrabiec@scylladb.com>
2017-10-10 18:27:43 +01:00
Gleb Natapov
36d9225e40 scylla-gdb: print number of allocated objects as an integer instead of float
Message-Id: <20171010151835.GT23527@scylladb.com>
2017-10-10 18:19:44 +03:00
Avi Kivity
4ad3900d8d Merge "gossiper: Optimize endpoint_state lookup" from Duarte
"gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive. This
series introduces a function that returns a pointer and avoids
the copy.

Fixes #764"

* 'endpoint-state/v2' of https://github.com/duarten/scylla:
  gossiper: Avoid endpoint_state copies
  endpoint_state: const-qualify functions
  storage_service: Remove duplicate endpoint state check
2017-10-10 17:29:22 +03:00
Piotr Jastrzebski
f325fef362 Extract FlattenedConsumer concept using GCC6_CONCEPT
This concept will be used in flat_mutation_reader::consume

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Piotr Jastrzebski
46727f12e0 Introduce partition_end mutation_fragment
This type of mutation_fragment will be used in new mutation_reader
to signal the end of the current partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Piotr Jastrzebski
adffc80619 Introduce a position for end of partition
This position will be used for mutation fragment
that represents the end of partition.

This position sorts after all other mutation fragments.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Piotr Jastrzebski
2516b42752 Introduce partition_start mutation_fragment
This type of mutation_fragment will be used in new mutation_reader
to signal the beginning of the next partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Piotr Jastrzebski
1f4fb6dd4a Introduce FragmentConsumer
This concept helps define StreamedMutationConsumer.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:05:44 +02:00
Duarte Nunes
ceebbe14cc gossiper: Avoid endpoint_state copies
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.

This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).

Fixes #764

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:48:02 +01:00
Duarte Nunes
bc976b4773 endpoint_state: const-qualify functions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:30:28 +01:00
Duarte Nunes
198b1b76b5 storage_service: Remove duplicate endpoint state check
We already performed the check, so we don't need to do it again.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:25:34 +01:00
Avi Kivity
c0687a9761 sstables: replace naked new with make_lw_shared
Fallout from the sstables dependency reduction patches.
Message-Id: <20171010121134.26342-1-avi@scylladb.com>
2017-10-10 13:21:46 +01:00
Tomasz Grabiec
46c7e06e56 locator: Optimize token_metadata::is_member()
Currently it's linear in the number of tokens in the system in the
worst case. We could use the knowledge which _topology has to make it
O(1).

Fixes #2873.

Message-Id: <1507630182-13410-1-git-send-email-tgrabiec@scylladb.com>
2017-10-10 14:27:54 +03:00
Tomasz Grabiec
44faaafc29 cache_streamed_mutation: Read static row with cache region locked
_snp->static_row() allocates and needs reference stability.
Message-Id: <1507555031-11567-1-git-send-email-tgrabiec@scylladb.com>
2017-10-09 15:55:53 +01:00
Avi Kivity
8d81ec92f6 gdb: adjust 'scylla memory' command for fallback small pools
Seastar small pools can now fall back to smaller spans. Adjust
the 'scylla memory' command accordingly.
Message-Id: <20171005123935.13503-1-avi@scylladb.com>
2017-10-09 11:44:03 +02:00
Botond Dénes
dead2617ce mp_row_consumer: remove unnecessary _reasource_tracker member
Leftovers from a43901f84.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <88237d9cd97feeca47e12ec4af89c90f1a3a6bb5.1507535176.git.bdenes@scylladb.com>
2017-10-09 10:59:40 +03:00
Botond Dénes
af083d6507 Merge mutation_reader related test cases into mutation_reader_test
The following tests were merged:
* combined_mutation_reader_test
* restricted_reader_test

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <db6b5b3c2d30cfaa720fff07c859649a180cff95.1507299293.git.bdenes@scylladb.com>
2017-10-08 17:33:55 +03:00
Avi Kivity
fd1d35d4af Update seastar submodule
* seastar c62bbf9...8babd1f (9):
  > Enhanced support for Travis CI: build with and without DPDK support, use varioius compilers (GCC 5/6/7)
  > backtrace: Allow whitespace after the backtrace addresses
  > test.py: fix typo in noncopyable_function_test
  > utils: introduce noncopyable_function
  > Revert "utils: introduce noncopyable_function"
  > utils: introduce noncopyable_function
  > Add seastar-addr2line helper script to decode backtraces
  > execution_stage: pass scheduling_group to constructor
  > reactor: preempt tasks when a signal is received
2017-10-08 16:36:10 +03:00
Avi Kivity
98e69482bf Merge "Add support for CAST AS functions" from Daniel
"This series implements CAST AS functions in scylla.

It allows to use expressions of the form CAST(x AS type) in select statements.
Primary motivation for this functions came from aggregate functions, because
function avg(.) gives rounded results for interger columns. Now it is possible
to convert such column to float/double and obtain floating point results:

    SELECT ... avg(cast(x as double)), ...

Fixes #2280."

* 'danfiala/2280-patch-series-v2' of https://github.com/hagrid-the-developer/scylla:
  tests: Add test for CAST AS functions.
  cql3: Add support for CAST AS functions to ANTLR grammar.
  cql3/selectable: Add selectable::with_cast for CAST AS functions.
  cql3/functions: Add support for CAST AS functions.
  types:: Add support for CAST AS functions.
  types: Moved code that implements conversion of types' values to string.
2017-10-08 12:55:07 +03:00
Botond Dénes
046a1f9b05 sstables: Get rid of [[deprecated]] index_reader::get_index_entries()
Change test code (the only consumers) to read index by partitions.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6111e92b5e0729bfa2e76fd848215804174067a.1507297154.git.bdenes@scylladb.com>
2017-10-08 12:18:52 +03:00
Daniel Fiala
9e11bfe8fa tests: Add test for CAST AS functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:05:53 +02:00
Daniel Fiala
4dd504b9ac cql3: Add support for CAST AS functions to ANTLR grammar.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Daniel Fiala
7fe653f08c cql3/selectable: Add selectable::with_cast for CAST AS functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Daniel Fiala
ca092a0b7d cql3/functions: Add support for CAST AS functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Daniel Fiala
61570e4a73 types:: Add support for CAST AS functions.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Daniel Fiala
e2c0a57ecf types: Moved code that implements conversion of types' values to string.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2017-10-07 21:04:40 +02:00
Botond Dénes
a43901f842 row_consumer: de-virtualize io_priority() and resource_tracker()
Fixes #2830

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <448a1f739ab8c88a7a5562bce8dce5ae6efdf934.1507302530.git.bdenes@scylladb.com>
2017-10-06 18:50:12 +01:00
Botond Dénes
d2b294dc06 loading_cache: prepend this-> to method calls on captured this
To make gcc 6.3 happy.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <849402e20a1ffa6f603eff4fe295981a94b9ca79.1507282527.git.bdenes@scylladb.com>
2017-10-06 12:09:34 +02:00
Vlad Zolotarov
bc9d17963f test.py: add loading_cache_test
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1507137724-2408-3-git-send-email-vladz@scylladb.com>
2017-10-05 15:30:07 +01:00
Vlad Zolotarov
1394e781be utils + cql3: use a functor class instead of std::function
Define value_extractor_fn as a functor class instead of std::function.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1507137724-2408-2-git-send-email-vladz@scylladb.com>
2017-10-05 15:29:51 +01:00
Duarte Nunes
a011eb72c2 Merge branch 'CQL secondary index backing views' from Pekka
"This patch series adds backing materialized view for secondary indices.
When a new index is created with the 'CREATE INDEX' statement, a backing
materialized view is created automatically.

For example, assuming the following table:

  CREATE TABLE ks1.users (
    userid uuid,
    email text,
    PRIMARY KEY (userid)
  );

When the following index is created:

  CREATE INDEX user_email ON ks1.users (email);

The following materialized view is also created:

  cqlsh> DESCRIBE ks1.users;

  <snip>

  CREATE MATERIALIZED VIEW ks1.user_email_index AS
      SELECT email, userid
      FROM ks1.users
      WHERE email IS NOT NULL
      PRIMARY KEY (email, userid)
      WITH CLUSTERING ORDER BY (userid ASC)
      AND bloom_filter_fp_chance = 0.01
      AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
      AND comment = ''
      AND compaction = {'class': 'SizeTieredCompactionStrategy'}
      AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
      AND crc_check_chance = 1.0
      AND dclocal_read_repair_chance = 0.1
      AND default_time_to_live = 0
      AND gc_grace_seconds = 864000
      AND max_index_interval = 2048
      AND memtable_flush_period_in_ms = 0
      AND min_index_interval = 128
      AND read_repair_chance = 0.0
      AND speculative_retry = '99.0PERCENTILE';

CQL queries will use the backing materialized view as part of queries on
indexed columns to fetch the primary keys."

* 'penberg/cql-2i-backing-view/v3' of github.com:scylladb/seastar-dev:
  schema_tables: Create backing view for indices
  database: Kill obsolete secondary index manager stub
  cql3: Wire up secondary index manager
  cql3/restrictions: Add term_slice::is_supported_by() function
  index: Add secondary_index_manager::create_view_for_index()
  index: Add target_parser::parse() helper
  cql3/statements: Add index_target::from_sstring() helper
  index: Add secondary_index_manager::get_dependent_indices()
  index: Add secondary_index_manager::reload()
  index: Add secondary_index_manager::list_indexes()
  index: Add index class
  index: Pass column_family to secondary_index_manager constructor
  database: Make secondary index manager per-column family
2017-10-05 12:08:14 +01:00
Pekka Enberg
4045e1ec09 schema_tables: Create backing view for indices
This patch wires calls to secondary index manager reload() in
merge_tables_and_views() and changes make_update_indices_mutations() to
also create mutations for the backing materialized view. After this
patch, "CREATE INDEX" CQL statement also creates a materialized view.
2017-10-05 10:07:44 +03:00
Pekka Enberg
5d30ad5e1a database: Kill obsolete secondary index manager stub 2017-10-05 10:07:44 +03:00
Pekka Enberg
3a27f2e812 cql3: Wire up secondary index manager 2017-10-05 10:07:44 +03:00
Pekka Enberg
feae924c8c cql3/restrictions: Add term_slice::is_supported_by() function 2017-10-05 10:07:44 +03:00
Pekka Enberg
ed4c96c025 index: Add secondary_index_manager::create_view_for_index()
This patch adds a create_view_for_index() function, which creates a
view_ptr for index_metadata.
2017-10-05 10:07:44 +03:00
Pekka Enberg
a809ea902e index: Add target_parser::parse() helper 2017-10-05 10:07:44 +03:00
Pekka Enberg
9f07af8224 cql3/statements: Add index_target::from_sstring() helper 2017-10-05 10:07:44 +03:00
Pekka Enberg
50943ce592 index: Add secondary_index_manager::get_dependent_indices() 2017-10-05 10:07:44 +03:00
Glauber Costa
189ef02596 storage_proxy: change reporting of estimated histograms
We are currently collapsing the histograms in 16 points, exponentially
increasing in value, starting from 1.

While reducing the number of points is a worthy goal, the current
configuration caps us at 4ms. Our latencies tend to be higher than this.

Starting from 1 is also a bit of an exhaggeration: rarely are our
latencies in that range. This patch changes reporting so that we
report 20 points starting from 32.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-10-04 20:01:15 -04:00
Glauber Costa
fc4416abcc estimated_histogram: bring histogram closer to what prometheus expects.
Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/

One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.

The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
   that the highest latency we can differentiate is 4ms. After that,
   everything falls into the same bin.

2) The format that prometheus expects is that each bin will contain
   the total number of points seen *up until that bin*, while we
   currently export the total number of points that falls between bins.
   IOW, it is a cummulative histogram.

About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:

 "Note that we divide the sum of both buckets. The reason is that the
  histogram buckets are cumulative. The le="0.3" bucket is also contained
  in the le="1.2" bucket; dividing it by 2 corrects for that."

It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that.

This patch changes the histogram format to be more in line with what
prometheus expect.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-10-04 20:01:13 -04:00
Avi Kivity
ab65b42bb6 size_estimates: remove ambiguity in call to std::ref()
The call to std::ref() is not namespace-qualified, and so can conflict
with seastar::ref().

Fix by naming std::ref() explicitly.
Message-Id: <20171004155250.4960-1-avi@scylladb.com>
2017-10-04 18:31:40 +02:00
Duarte Nunes
953888d0d0 Merge "Auth: pluggable auth + "transitional" auth-objects" from Calle
"Makes authorizer/authenticator actually pluggable (by class name)
and adds a "Transitional" type for both, conforming to the DSE
definition of the types.

The idea is to allow a rolling upgrade of a cluster to
authentication op by first making all clients provide credentials
(ignored by non-auth), then node by node enable auth with
transitional handlers, then ensure user DB is populated and
distributed, and finally rollingly enable strict auth for
each node. Pfew."

Fixes #2836

* auth: Transitional auth wrappers
  auth: Make authenticator/authorizer use actual name based lookup
2017-10-04 12:46:16 +02:00
Calle Wilund
611e00646b auth: Transitional auth wrappers
Similar to DSE objects with similar name. Basically ignores
all authentication/authorization except "superuser" login. All others
sessions are treated as anonymous.
Note: like DSE counterparts, a client session must still _use_
authentication to be able to connect, even though the actual content of
the auth is mostly ignored.
2017-10-04 12:44:44 +02:00
Calle Wilund
b96a7ae656 auth: Make authenticator/authorizer use actual name based lookup
Allowing for pluggable auth objects.

Note: requires "class_registrator: Fix qualified name matching +
provider helpers" patch previously sent.
2017-10-04 12:44:44 +02:00
Calle Wilund
801ee44cb8 class_registrator: Fix qualified name matching + provider helpers
Should not assume namespace "org", nor should we allow "loose"
substring matching.
2017-10-04 12:43:42 +02:00
Calle Wilund
3c509e0333 class_registrator: Allow different return types
Allows registry to give back, for example, shared_ptr etc instead of
solely unique_ptr. If a registry is defined with seastar/std
shared/lw_shared/unique_ptr as "BaseType", the type will assume
this is the intended result type.
2017-10-04 12:43:42 +02:00
Avi Kivity
bdbbfe9390 Merge "Make restricting_mutation_reader more accurate" from Botond
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed.

Fixes #2692."

* 'bdenes/restricting_mutation_reader-v5' of https://github.com/denesb/scylla:
  Update reader restriction related metrics
  Add restricted_reader_test unit test
  restricted_mutation_reader: restrict based-on memory consumption
  mutation_reader.hh: Move restricted_reader related code
2017-10-04 12:43:58 +03:00
Paweł Dziepak
fdfa6703c3 Merge "loading_shared_values and size limited and evicting prepared statements cache" from Vlad
"
The original motivation for the "utils: introduce a loading_shared_values" series was a hinted handoff work where
I needed an on-demand asynchronously loading key-value container (a replica address to a commitlog instance map).

It turned out that we already have the classes that do almost what I needed:
   - utils::loading_cache
   - sstables::shared_index_lists

Therefore it made sense to find a common ground, unify this functionality and reuse the code both in the classes above and in the
new hinted handoff code.

This series introduces the utils::loading_shared_values that generalizes the sstables::shared_index_lists
API on top of bi::unordered_set with the rehashing logic from the utils::loading_cache triggered by an addition
of an entry to the set (PATCH1).

Then it reworks the sstables::shared_index_lists and utils::loading_cache on top of the new class (PATCH2 and PATCH3).

PATCH4 optimizes the loading_cache for the long timer period use case.

But then we have discovered that we have another "customer" for the loading_cache. Apparently our prepared statements cache
had a birth flaw - it was unlimited in size - unless the corresponding keyspace and/or table are modified/dropped the entries
are never evicted. We clearly need to limit its size and it would also make sense to evict the cache entries that haven't been
used long enough.

This seems like a perfect match for a utils::loading_cache except for prepared statements don't need to be reloaded after
they are created.

Patches starting from PATCH5 are dealing with adding the utils::loading_cache the missing functionality (like making the "reloading"
conditional and adding the synchronous methods like find(key)) and then transitioning the CQL and Thrift prepared statements
caches to utils::loading_cache.

This also fixes #2474."

* 'evict_unused_prepared-v5' of https://github.com/vladzcloudius/scylla:
  tests: loading_cache_test: initial commit
  cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
  cql3: prepared statements cache on top of loading_cache
  utils::loading_cache: make the size limitation more strict
  utils::loading_cache: added static_asserts for checking the callbacks signatures
  utils::loading_cache: add a bunch of standard synchronous methods
  utils::loading_cache: add the ability to create a cache that would not reload the values
  utils::loading_cache: add the ability to work with not-copy-constructable values
  utils::loading_cache: add EntrySize template parameter
  utils::loading_cache: rework on top of utils::loading_shared_values
  sstables::shared_index_list: use utils::loading_shared_values
  utils: introduce loading_shared_values
2017-10-04 09:13:32 +01:00
Daniel Fiala
1133838b9f types: Add data_type_for for varint and decimal, data_value constructor for simple_date_type.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171004044040.21631-1-daniel@scylladb.com>
2017-10-04 10:52:57 +03:00
Tomasz Grabiec
f506339582 tests: perf_fast_forward: Auto-create test directory
To avoid exception due to missing directory.

Message-Id: <1506081627-12933-1-git-send-email-tgrabiec@scylladb.com>
2017-10-03 15:36:37 +03:00
Botond Dénes
fea6214a0a Update reader restriction related metrics
Update description of existing reader count metrics, add memory
consumption metrics. Use labels to distinguish between system, user and
streaming reads related metrics.
2017-10-03 12:44:17 +03:00
Botond Dénes
3280fbc4d4 Add restricted_reader_test unit test 2017-10-03 12:44:17 +03:00
Botond Dénes
47e07b787e restricted_mutation_reader: restrict based-on memory consumption
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
2017-10-03 12:44:12 +03:00
Botond Dénes
0a07e9e7c7 mutation_reader.hh: Move restricted_reader related code
In preparation of make_restricted_reader taking a mutation_source as
its argument.
2017-10-03 12:39:22 +03:00
Avi Kivity
78eae8bf48 Revert "Merge "Make restricting_mutation_reader more accurate" from Botond"
This reverts commit c6e5dcc556, reversing
changes made to 19b21a0ab2. Failes to build,
plus author has more changes.
2017-10-03 11:58:59 +03:00
Pekka Enberg
641f28da02 cql3/statements: Clean up select_statement class definition
We have some historical #ifdef'd code that really ought to be removed by now...

Message-Id: <1507015932-8165-1-git-send-email-penberg@scylladb.com>
2017-10-03 11:17:32 +03:00
Avi Kivity
c6e5dcc556 Merge "Make restricting_mutation_reader more accurate" from Botond
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed."

Fixes #2692.

* 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla:
  Update reader restriction related metrics
  Add restricted_reader_test unit test
  restricted_mutation_reader: restrict based-on memory consumption
  mutation_reader.hh: Move restricted_reader related code
2017-10-03 11:15:34 +03:00
Daniel Fiala
19b21a0ab2 types: Allow 'T' as a date-time separator in timestamps.
* Letter 'T' is specified in ISO 8601 and also in Cassandra
  documentation.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20171003073558.19257-1-daniel@scylladb.com>
2017-10-03 11:10:11 +03:00
Avi Kivity
3cc1c2c387 Merge seastar upstream
* seastar 899fc4e...c62bbf9 (6):
  > Merge "CPU Scheduler for seastar" from Avi
  > reactor: set SCHED_FIFO policy for timer thread
  > future: mark future::wait() as noexcept
  > shared_promise: Make get_shared_future() const-qualified
  > Remove pessimizing and redundant std::move()-s reported by Clang-tidy utility
  > Work around GCC 5 bug: scylladb/seastar#338, scylladb/seastar#339
2017-10-02 20:47:32 +03:00
Avi Kivity
dd5ab75d04 range: add missing include
Message-Id: <20171002144608.5032-1-avi@scylladb.com>
2017-10-02 16:49:24 +02:00
Avi Kivity
5ed6d1b176 dist: enable CAP_SYS_NICE
Allow scylla to use SCHED_FIFO for the timer thread for more accurate
scheduling.
Message-Id: <20171001121500.28318-1-avi@scylladb.com>
2017-10-02 16:32:00 +02:00
Avi Kivity
dbce5158a3 Update ami submodule
* dist/ami/files/scylla-ami 5ffa449...be90a3f (1):
  > amazon kernel: enable updates
2017-10-02 17:07:09 +03:00
Piotr Jastrzebski
83fd22face Add test to reproduce #2854
When memtable gets flushed, existing mutation_readers created
for it stop handling fast_forward_to correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f580ac59f3fcec53e7c78ad7a8b6374eb36958c6.1506690042.git.piotr@scylladb.com>
2017-09-29 15:17:53 +02:00
Piotr Jastrzebski
2583207d9d Fix memtable scanning_reader::fast_forward_to
If memtable is flushed then call fast_forward_to on _delegate.
Otherwise call iterator_reader::fast_forward_to.

Fixes #2854

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6bf1c8bafce845ef945698ce4d722c3c8606e632.1506690042.git.piotr@scylladb.com>
2017-09-29 15:17:39 +02:00
Asias He
c0b965ee56 gossip: Better check for gossip stabilization on startup
This is a backport of Apache CASSANDRA-9401
(2b1e6aba405002ce86d5badf4223de9751bf867d)

It is better to check the number of nodes in the endpoint_state_map
is not changing for gossip stabilization.

Fixes #2853
Message-Id: <e9f901ac9cadf5935c9c473433dd93e9d02cb748.1506666004.git.asias@scylladb.com>
2017-09-29 08:57:25 +02:00
Tomasz Grabiec
d75f243a8b Update seastar submodule
Fixes #2770.
Fixes #2819.

* seastar 92fdce2...899fc4e (14):
  > scollectd: increment the metadata iterator with the values
  > Enable Travis CI builds for Seastar.
  > tests: Fix httpd test compilation error caused by unconditionally explicit tuple constructor in GCC5: scylladb/seastar#326
  > core::shared_future: add available() and failed() methods
  > rpc: make sure that _write_buf stream is always properly closed
  > log: Fail on attempt to register logger with the same name twice
  > Merge "Make backtraces useful on ASLR-enabled machines as well" from Botond
  > reactor: add option to bypass fsync
  > future-util: modernize do_until() implementation
  > future-util: fix do_until() API to not have forwarding references
  > input_stream: add rvalue variant of input_stream::consume()
  > logger: remove extra spaces after timestamp
  > tutorial: lifetime management
  > Fix broken link for fsqual failure message
2017-09-28 15:27:34 +02:00
Piotr Jastrzebski
6069bab755 Cache single queries to non-existing partitions
This way we don't need to query sstables again
when the query is repeated.

Fixes #1533

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <8f8559ff19c534dbbb7c9ef6c28271cec607ba20.1506521461.git.piotr@scylladb.com>
2017-09-27 16:15:18 +02:00
Tomasz Grabiec
b704710954 migration_manager: Make sure schema pulls eventually happen when schema_tables_v3 is enabled
We don't pull schema during rolling upgrade, that is until
schema_tables_v3 feature is enabled on all nodes.

Because features are enabled from gossiper timer, there is a race
between feature enablement and processing of endpoint states which may
trigger schema pull.  It can happen that we first try to pull, but
only later enable the feature. In that case the schema pull will not
happen until the next schema change.

The fix is to ensure that pulls abandoned due to feature not being enabled
will be retried when it is enabled.

Fixes sporadic failure in dtest:

  repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>
2017-09-27 12:00:07 +01:00
Tomasz Grabiec
7a58fb5767 gossiper: Allow waiting for feature to be enabled
Message-Id: <1506428715-8182-1-git-send-email-tgrabiec@scylladb.com>
2017-09-27 11:57:06 +01:00
Raphael S. Carvalho
63eb9f61c0 db: use correct dirty memory manager for system column families
Dirty memory manager for non-system column families was being used
when applying mutations to system cfs.
That previously lead to deadlock when updating history. Basically,
write disable waits on compaction, and compaction waits on a write
that would release dirty memory for updating compaction history.

Only using the correct dirty manager wouldn't solve this problem
if write is disabled for system cf, but the problem is completely
solved in addition to previous change which updates history
outside the sstable lock.

Refs #2769.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>
2017-09-26 19:51:31 +02:00
Raphael S. Carvalho
e34c1db642 db: update compaction history outside the sstable write lock
The reason to do that is because compaction can deadlock if refresh
disables write which waits for compaction, and compaction in turn
waits for dirty memory[1] that would be released by memtable write.

Dirty memory manager for non-system cfs was being used for system cfs,
which was useful for exposing this problem.

[1]: when updating compaction history.

Fixes #2769.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>
2017-09-26 19:51:12 +02:00
Asias He
4b1034b9cd storage_service: Remove the stream_hints
Our hinted handoff implementation will not use the
db::system_keyspace::HINTS system table to store hints.
No need to stream them.

Acked-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <3b9190e250b54321ceb87767f4722c7458d41797.1506391500.git.asias@scylladb.com>
2017-09-26 19:05:21 +03:00
Paweł Dziepak
af1976bc30 Merge "Fix cache reader skipping rows in some cases" from Tomasz
"Fixes the problem of concurrent populations of clustering row ranges
leading to some readers skipping over some of the rows.
Spotted during code review.

Fixes #2834."

* tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev:
  tests: mvcc: Add test for partition_snapshot_row_cursor
  tests: row_cache: Add test for concurrent population
  tests: row_cache: Make populate_range() accept partition_range
  tests: Add simple_schema::make_ckey_range()
  cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
  mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
  mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()
  mvcc: Keep track of all iterators in partition_snapshot_row_cursor
  mvcc: Make partition_snapshot_row_cursor printable
2017-09-26 15:09:58 +01:00
Tomasz Grabiec
3eb251e3a4 tests: perf_fast_forward: Fail if ran with more than one shard
The test reads only from local shard, if ran with more shards,
current shard will miss some of the data.

Message-Id: <1506081609-12811-1-git-send-email-tgrabiec@scylladb.com>
2017-09-26 15:23:10 +03:00
Calle Wilund
dd2b8821a4 everywhere_strategy: Make get_natural_endpoints handle non-init state
Make get_natural_endpoints return local address iff token metadata
is not yet setup (since that is the one address we already know of).

If a request has a consistency level requiring more endpoints, it
will still fail, but for calls with, for example, CL=ONE, at startup
we will succeed, and more or less act like local strategy. Yet,
further down the line, have data distributed as desired.

Acked-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20170926113512.15707-1-calle@scylladb.com>
2017-09-26 15:21:30 +03:00
Asias He
98e9049820 gossip: Print SCHEMA_TABLES_VERSION correctly
Found this when debugging gossip with debug print. The application state
SCHEMA_TABLES_VERSION was printed as UNKNOWN.
Message-Id: <d7616920d2e6516b5470a758bcf9c88f3d857381.1506391495.git.asias@scylladb.com>
2017-09-26 08:38:28 +02:00
Tomasz Grabiec
e5e9886014 tests: mvcc: Add test for partition_snapshot_row_cursor 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
e4adc9c600 tests: row_cache: Add test for concurrent population 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
a3fb7ce660 tests: row_cache: Make populate_range() accept partition_range 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
dd7af02251 tests: Add simple_schema::make_ckey_range() 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
e83cd508f6 cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
We were checking if the cursor is up_to_date(), but this is not enough
to guarantee that the cursor is valid, merely that its iterators are
valid. The cursor may be invalidated even if its iterators are valid
if there was an insertion after cursor's position.

Fixes #2834.
2017-09-25 11:21:58 +02:00
Tomasz Grabiec
2f8d91043d mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
The cursor maintains a heap of iterators in all versions. If rows were
inserted before the latest version's iterator, cursor would not see
them. Fix by redoing the lookup for iterators not in the current row
in maybe_refresh().

Refs #2834.
2017-09-25 11:21:58 +02:00
Tomasz Grabiec
09d99b0358 mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid() 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
4ee11641c0 mvcc: Keep track of all iterators in partition_snapshot_row_cursor
Will be needed when updating the iterator for latest version. Before
this change, such iterator could be neither in _current_row nor in
_heap.

Besides that, this will allow user to always access the iterator of
latest version, which enables some optimizations in the future of
avoiding unnecessary lookups. get_iterator_in_latest_version() is now
always valid.
2017-09-25 11:21:58 +02:00
Tomasz Grabiec
a8cbd34dde mvcc: Make partition_snapshot_row_cursor printable 2017-09-25 11:21:58 +02:00
Tomasz Grabiec
8e46d15f91 storage_service: Register features before joining
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registerred, and thus can be enabled, before the
bootstrapping of a non-seed node happens. Otherwise the bootstrap will
hang waiting on schema sync which will not happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>
2017-09-25 09:13:02 +01:00
Tomasz Grabiec
b92dcb0284 storage_service: Extract register_features()
Message-Id: <1505893837-27876-1-git-send-email-tgrabiec@scylladb.com>
2017-09-25 09:12:46 +01:00
Tomasz Grabiec
d11d696072 tests: mutation_source_tests: Fix use-after-scope on partition range
Message-Id: <1506096881-3076-1-git-send-email-tgrabiec@scylladb.com>
2017-09-22 19:13:47 +02:00
Botond Dénes
015ac042a8 combined_mutation_reader_test: remove unneeded includes
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a388efa6fc93049f4d69c049764cc9225a04bce4.1506098363.git.bdenes@scylladb.com>
2017-09-22 18:45:04 +02:00
Botond Dénes
a7984a9908 combined_mutation_reader_test: remove leftover debug logging
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <96e61fcd2543ec84921f1b2188d7248e55e7efe0.1506097635.git.bdenes@scylladb.com>
2017-09-22 18:44:47 +02:00
Tomasz Grabiec
5def901a92 sstables: Don't register logger with the same name twice
There can be one logger with given name. This was causing
--logger-log-level sstable=trace to not work for the majority of log
points.

Message-Id: <1505902259-4561-1-git-send-email-tgrabiec@scylladb.com>
2017-09-20 16:40:06 +03:00
Piotr Jastrzebski
98c359d7de Introduce a position for partition start
This position will be used for mutation fragment
that represents the start of a partition.

This position sorts before static row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-09-20 11:34:03 +02:00
Piotr Jastrzebski
e1f7d1f25d streamed_mutation: Extract concepts using GCC6_CONCEPT macro
It makes it easier to actually use those concepts.

Lambdas passed to mutation_fragment::visit have to declare
return type otherwise compiler fails with:

internal compiler error: Segmentation fault

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-09-20 11:34:03 +02:00
Tomasz Grabiec
02d41864af Merge "Fix miss opportunity to update gossiper features" from Asias
The gossiper checks if features should be enabled from its timer
callback when it detects that endpoint_state_map changed, that is
different than shadow_endpoint_state_map.

shadow_endpoint_state_map is also assigned from endpoint_state_map in
storage_service::replicate_tm_and_ep_map(), called from
storage_service::on_change()

Call gossiper:maybe_enable_features() in replicate_tm_and_ep_map so
that we won't miss gossip feature update.

Fixes #2824

* git@github.com:scylladb/seastar-dev asias/gossip_miss_feature_update_v1:
  gossip: Move the _features_condvar signal code to
    maybe_enable_features
  gossip: Make maybe_enable_features public
  storage_service: Check gossip feature update in
    replicate_tm_and_ep_map
2017-09-20 11:16:37 +02:00
Asias He
ebc3bada12 storage_service: Check gossip feature update in replicate_tm_and_ep_map
This is another place we can update endpoint_state_map in addition to
gossiper::run().

Call the gossiper:maybe_enable_features() so that we won't miss gossip
feature update.
2017-09-20 16:58:33 +08:00
Asias He
6022b7423a gossip: Make maybe_enable_features public
It will be needed by storage_service.
2017-09-20 16:58:33 +08:00
Asias He
68c7a391b5 gossip: Move the _features_condvar signal code to maybe_enable_features
It is easier to call to features update logic outside gossiper.
2017-09-20 16:58:32 +08:00
Asias He
173cba67ba storage_service: Remove rpc client on all shards in on_dead
We should close connections to nodes that are down on all shards instead
of the shard which runs the on_dead gossip callback.

Found by Gleb.
Message-Id: <527a14105a07218066e9f1da943693d9de6993e5.1505894260.git.asias@scylladb.com>
2017-09-20 10:23:31 +02:00
Botond Dénes
43dba8f173 Update reader restriction related metrics
Update description of existing reader count metrics, add memory
consumption metrics.
2017-09-20 11:16:21 +03:00
Botond Dénes
b2db29dc65 Add restricted_reader_test unit test 2017-09-20 11:15:45 +03:00
Botond Dénes
33e97e7457 restricted_mutation_reader: restrict based-on memory consumption
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
2017-09-20 11:14:35 +03:00
Botond Dénes
e4a9e55e0d mutation_reader.hh: Move restricted_reader related code
In preparation of make_restricted_reader taking a mutation_source as
its argument.
2017-09-20 11:12:57 +03:00
Tomasz Grabiec
741ec61269 streaming: Fix streaming not streaming all ranges
It skipped one sub-range in each of the 10 range batch, and
tried to access the range vector using end() iterator.

Fixes sporadic failures of
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test.

Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>
2017-09-20 10:33:59 +03:00
Avi Kivity
5b0cb28af9 Merge "row_cache: Call fast_forward_to() outside allocating section" from Tomasz
"On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.

We shouldn't call fast_forward_to() with the cache region locked.

Fixes #2791."

* 'tgrabiec/dont-ffwd-in-alloc-section' of github.com:scylladb/seastar-dev:
  cache_streamed_mutation: De-futurize cursor movement
  cache_streamed_mutation: Call fast_forward_to() outside allocating section
  cache_streamed_mutation: Switch from flags to explicit state machine
2017-09-19 17:11:22 +03:00
Botond Dénes
96c6d54a5c incremental_reader_selector: Remove unecessary check for duplicated next_token
The next_token will never be the same as the current _selector_position,
unless they are both maximum_token, which is already handled.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9c54ae07a18d201185027c9b533bcb5256bead8a.1505826102.git.bdenes@scylladb.com>
2017-09-19 16:42:02 +03:00
Avi Kivity
12c393dd16 build: default to gold linker
Better, faster.
Message-Id: <20170919115737.12084-1-avi@scylladb.com>
2017-09-19 14:02:31 +02:00
Avi Kivity
a31ade54e0 streamed_mutation: optimize merge_mutations() if only one mutation
If we read a partition from a single sstable (a fairly common case),
we can bypass mutation_merger and just return the input.
Message-Id: <20170918181418.14021-1-avi@scylladb.com>
2017-09-19 11:00:59 +01:00
Botond Dénes
8cb953b58b incremental_reader_selector: don't create readers unconditionally on ff
When fast-forwarding check that the new position is past the selector
before attempting to create new readers. Also don't clear the set of
already created readers and don't overwrite the selector position.

Fixes #2807

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <514f69005eb29c2a3359f098d40abf588900b76f.1505811064.git.bdenes@scylladb.com>
2017-09-19 11:27:47 +02:00
Asias He
8f8273969d gossip: Do not wait for echo message in mark_alive
gossiper::apply_state_locally() calls handle_major_state_change() for
each endpoint, in a seastar thread, which calls mark_alive() for new
nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously
waits for each node to respond before it moves on to the next entry. As
a result it may take a while before whole state is processed.

Apache (tm) Cassandra (tm) sends echos in the background.

In a large cluster, we see at the time the joining node starts
streaming, it hasn't managed to apply all the endpoint_state for peer
nodes, so the joining node does not know some of the nodes yet, which
results in the joining node ingores to stream from some of the existing
nodes.

Fixes #2787
Fixes #2797

Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>
2017-09-19 10:49:00 +03:00
Raphael S. Carvalho
1524426deb sstables: Fix compaction correctness of higher-level tables
When incremental_reader_selector is used for compaction, it will
first call incremental selector of partitioned sstable set with
minimum token that will result in first interval being skipped,
which means not everything being compacted. The interval is
skipped because iterator is incorrectly advanced when token
lies before it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918021446.15920-1-raphaelsc@scylladb.com>
2017-09-19 09:59:30 +03:00
Avi Kivity
55e0b63e65 storage_proxy: scan more nodes exponentially to achieve target result set size
The current sequential scan can take a long time on a small or empty table
with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to
exponential scan reduces the time.

Fixes #1230.
Message-Id: <20170912173803.8277-1-avi@scylladb.com>
2017-09-18 15:15:15 +02:00
Avi Kivity
e44517851e untyped_result_set: reduce dependencies
Forward-declare untyped_result_set and untyped_result_set_row, and remove
the include from query_processor.hh.
Message-Id: <20170916170859.27612-3-avi@scylladb.com>
2017-09-18 15:15:15 +02:00
Avi Kivity
0317746822 untyped_result_set: make untyped_result_set::row a namespace scope class
Makes it possible to forward-declare, with the aim of reducing dependencies.
Message-Id: <20170916170859.27612-2-avi@scylladb.com>
2017-09-18 15:15:15 +02:00
Pekka Enberg
9ebd8be82b index: Add secondary_index_manager::reload()
This patch adds a reload() function, which updates the secondary index
manager index map to match underlying column family indices.
2017-09-18 14:31:35 +03:00
Duarte Nunes
16d2e4e81b Merge 'reduce sstables.hh coupling' from Glauber
"sstables.hh is already too big, and it is soon to become bigger with the
inclusion of the read_monitor, to pair it with the write_monitor.

It's a good opportunity for us to reduce sstables.hh dependencies by
moving the write monitor to its own reader. One obvious caller is
already changed so we don't need to include sstables.hh anymore."

* 'progress-monitor' of https://github.com/glommer/scylla:
  sstables: do not include sstables.hh from memtable glue
  sstables: move write_monitor to its own header
2017-09-18 13:31:32 +02:00
Pekka Enberg
2ae6b141e5 index: Add secondary_index_manager::list_indexes() 2017-09-18 14:27:35 +03:00
Avi Kivity
a2f26f7b29 log_histogram: rename to log_heap
log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
2017-09-18 12:44:05 +02:00
Amnon Heiman
8d668a9dc0 API: storage_service repair_async_status to return proper error code
This patch change the implementation of storage_service
repair_async_status to throw an exception, this way a 400 return code
will be returned.

Fixes #2794

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170917080533.6612-1-amnon@scylladb.com>
2017-09-18 09:08:26 +03:00
Vlad Zolotarov
cea15486c4 tests: loading_cache_test: initial commit
Test utils::loading_shared_values and utils::loading_cache.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 22:19:15 -04:00
Vlad Zolotarov
66568be969 cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
- Transition the prepared statements caches for both CQL and Trhift to the cql3::prepared_statements_cache class.
   - Add the corresponding metrics to the query_processor:
      - Evictions count.
      - Current entries count.
      - Current memory footprint.

Fixes #2474

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 22:19:15 -04:00
Vlad Zolotarov
8f912b46b1 cql3: prepared statements cache on top of loading_cache
This is a template class that implements caching of prepared statements for a given ID type:
   - Each cache instance is given 1/256 of the total shard memory. If the new entry is going to overflow
     this memory limit - the less recently used entries are going to be evicted so that the new entry could
     be added.
   - The memory consumption of a single prepared statement is defined by a cql3::prepared_cache_entry_size
     functor class that returns a number of bytes for a given prepared statement (currently returns 10000
     bytes for any statement).
   - The cache entry is going to be evicted if not used for 60 minutes or more.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 22:19:11 -04:00
Vlad Zolotarov
9a43398d6a utils::loading_cache: make the size limitation more strict
Ensure that the size of the cache is never bigger than the "max_size".

Before this patch the size of the cache could have been indefinitely bigger than
the requested value during the refresh time period which is clearly an undesirable
behaviour.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
4e72a56310 utils::loading_cache: added static_asserts for checking the callbacks signatures
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
a13362e74b utils::loading_cache: add a bunch of standard synchronous methods
Add a few standard synchronous methods to the cache, e.g. find(), remove_if(), etc.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
fa2f8162a5 utils::loading_cache: add the ability to create a cache that would not reload the values
Sometimes we don't want the cached values to be periodically reloaded.
This patch adds the ability to control this using a ReloadEnabled template parameter.

In case the reloading is not needed the "loading" function is not given to the constructor
but rather to the get_ptr(key, loader) method (currently it's the only method that is used, we may add
the corresponding get(key, loader) method in the future when needed).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
a60a77dfc8 utils::loading_cache: add the ability to work with not-copy-constructable values
Current get(...) interface restricts the cache to work only with copy-constructable
values (it returns future<Tp>).
To make it able to work with non-copyable value we need to introduce an interface that would
return something like a reference to the cached value (like regular containers do).

We can't return future<Tp&> since the caller would have to ensure somehow that the underlying
value is still alive. The much more safe and easy-to-use way would be to return a shared_ptr-like
pointer to that value.

"Luckily" to us we value we actually store in a cache is already wrapped into the lw_shared_ptr
and we may simply return an object that impersonates itself as a smart_pointer<Tp> value while
it keeps a "reference" to an object stored in the cache.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
c24d85f632 utils::loading_cache: add EntrySize template parameter
Allow a variable entry size parameter.
Provide an EntrySize functor that would return a size for a
specific entry.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
6024014f92 utils::loading_cache: rework on top of utils::loading_shared_values
Get rid of the "proprietary" solution for asynchronous values on-demand loading.
Use utils::loading_shared_values instead.

We would still need to maintain intrusive set and list for efficient shrink and invalidate
operations but their entry is not going to contain the actual key and value anymore
but rather a loading_shared_values::entry_ptr which is essentially a shared pointer to a key-value
pair value.

In general, we added another level of dereferencing in order to get the key value but since
we use the bi::store_hash<true> in the hook and the bi::compare_hash<true> in the bi::unordered_set
this should not translate into an additional set lookup latency.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
d56684b1a5 sstables::shared_index_list: use utils::loading_shared_values
Since utils::loading_shared_values API is based on the original shared_index_list
this change is mostly a drop-in replacement of the corresponding parts.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:11 -04:00
Vlad Zolotarov
ec3fed5c4d utils: introduce loading_shared_values
This class implements an key-value container that is populated
using the provided asynchronous callback.

The value is loaded when there are active references to the value for the given key.

Container ensures that only one entry is loaded per key at any given time.

The returned value is a lw_shared_ptr to the actual value.

The value for a specific key is immediately evicted when there are no
more references to it.

The container is based on the boost::intrusive::unordered_set and is rehashed (grown) if needed
every time a new value is added (asynchronously loaded).

The container has a rehash() method that would grow or shrink the container as needed
in order to get the load factor into the [0.25, 0.75] range.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-09-15 20:53:06 -04:00
Glauber Costa
2227ae3f19 sstables: do not include sstables.hh from memtable glue
There is no need to include the whole sstables.hh file in
memtable-sstable.hh anymore. All we need is the shared_sstable
definition and the progress monitor.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-09-15 14:16:35 -04:00
Glauber Costa
51829f528d sstables: move write_monitor to its own header
Soon I am about to introduce a read monitor, and pairing infrastructure
to manage it. Having it all living in sstables.hh force to include it
everytime, even in places that don't really need it.

Move to its own header.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-09-15 14:09:07 -04:00
Tomasz Grabiec
804722b6c8 tests: perf_fast_forward: Fix use-after-scope on partition range
Message-Id: <1505489249-16806-1-git-send-email-tgrabiec@scylladb.com>
2017-09-15 16:34:41 +01:00
Tomasz Grabiec
7b5b461067 cache_streamed_mutation: De-futurize cursor movement
start_reading_from_underlying() doesn't return future<>
any more, so we can simplify this.
2017-09-15 15:41:55 +02:00
Tomasz Grabiec
22019577cc cache_streamed_mutation: Call fast_forward_to() outside allocating section
On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.

We shouldn't call fast_forward_to() with the cache region locked.

Fixes #2791.
2017-09-15 15:41:55 +02:00
Tomasz Grabiec
3b790a1e80 cache_streamed_mutation: Switch from flags to explicit state machine
We're in one state at a time, so it's better to express it as a single
variable rather than N independent flags.

In preparation before adding more states.
2017-09-15 15:41:55 +02:00
Glauber Costa
eb93d5f8ad database: pass a monitor as a parameter to memtable writer
Right now we pass a permit to the memtable writer and that permit is
used insite write_memtable_to_sstable to compose a write_monitor.

We would like to extend the write_monitor to include other things, that
right now are not available as parameters to write_memtable_to_sstable -
and which are possibly too specialized to be.

The solution for that is to pass the write_monitor instead of the permit
to the writer. Conceptually, that also makes sense because the
write_monitor is something the sstable writer is aware of. Permits, on
the other hand, are a database concept that is alien to the sstable
writer.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170915032836.21154-1-glauber@scylladb.com>
2017-09-15 12:26:56 +02:00
Duarte Nunes
8378fe190a Merge 'Fix schema version mismatch during rolling upgrade from 1.7' from Tomasz
"When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema
for some reason, reads or writes which involve both 1.7 and 2.0 nodes may
start to fail with the following error logged:

    storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d

The situation should heal after whole cluster is upgraded.

Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes
due to change in the schema tables format. Mismatch is meant to be avoided by
having 2.0 nodes calculate the old digest on schema migration during upgrade,
and use that version until next time the table is altered. It is thus not
allowed to alter tables during the rolling upgrade.

Two 2.0 nodes may exchange schema, if they detect through gossip that their
schema versions don't match. They may not match temporarily during boot, until
the upgraded node completes the bootstrap and propagates its new schema
through gossip. One source of such temporary mismatch is construction of new
tracing tables, which didn't exist on 1.7. Such schema pull will result in a
schema merge, which cause all tables to be altered and their schema version to
be recalculated. The new schema will not match the one used by 1.7 nodes,
causing reads and writes to fail, because schema requesting won't work during
rolling upgrade from 1.7 to 2.0.

The main fix employed here is to hold schema pulls, even among 2.0 nodes,
until rolling upgrade is complete."

* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev:
  tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case
  tests: cql_test_env: Enable all features in tests
  schema_tables: Make make_scylla_tables_mutation() visible
  migration_manager: Disable pulls during rolling upgrade from 1.7
  storage_service: Introduce SCHEMA_TABLES_V3 feature
  schema_tables: Don't alter tables which differ only in version
  schema_mutations: Use mutation_opt instead of stdx::optional<mutation>
2017-09-15 10:27:47 +02:00
Tomasz Grabiec
c657eec4cf tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case 2017-09-14 20:26:31 +02:00
Tomasz Grabiec
f0fdf75e7c tests: cql_test_env: Enable all features in tests 2017-09-14 20:26:31 +02:00
Tomasz Grabiec
571cac95ed schema_tables: Make make_scylla_tables_mutation() visible
For tests.
2017-09-14 20:26:31 +02:00
Tomasz Grabiec
5a92c18e63 migration_manager: Disable pulls during rolling upgrade from 1.7
If there is a schema pull during rolling upgrade among a two 2.0 nodes,
then schema merge will delete the persisted schema version. When the node
loads that table again, e.g. on restart, it will generate a version
which is different than the one which 1.7 nodes use. This will
cause reads and writes to fail.

To avoid this, disable pulls until all nodes are upgraded.

Fixes #2802.
2017-09-14 20:26:31 +02:00
Tomasz Grabiec
713d75fd51 storage_service: Introduce SCHEMA_TABLES_V3 feature 2017-09-14 20:26:31 +02:00
Tomasz Grabiec
f943d2efbf schema_tables: Don't alter tables which differ only in version
We apply deletion of scylla_tables.version to the incoming schema
mutations so that table schema version is recalculated after merge.
The mutations which we read from local schema tables may not have it
deleted in which case all tables would be considered as differing on
the presence of the version field. Avoid this by deleting the field
from old mutations as well.
2017-09-14 20:26:31 +02:00
Tomasz Grabiec
99272087e6 schema_mutations: Use mutation_opt instead of stdx::optional<mutation> 2017-09-14 20:26:31 +02:00
Takuya ASADA
7662271fc9 dist/ami: show correct message when scylla-ami-setup.service failed
After the service started, a state of the service may become
"failed", "active" or "activating".
But our script does not accept scylla-ami-setup.service become "failed" state,
in result the script shows up wrong message.
So we handle these three types of state correctly.

Fixes #2759

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1504589079-1986-1-git-send-email-syuu@scylladb.com>
2017-09-14 12:40:05 +03:00
Tomasz Grabiec
0911fbbdef row_cache: Fix row_cache::update_invalidating()
evict() doesn't guarantee that the whole partition is discontinuous.
In particular, partition tombstone cannot be marked as discontinuous.
The parts which are still continuous must be updated.

Broken after c78047fa5b.

Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>
2017-09-14 10:58:25 +03:00
Asias He
0ec574610d locator: Get rid of assert in token_metadata
In commit 69c81bcc87 (repair: Do not allow repair until node is in
NORMAL status), we saw a coredump due to an assert in
token_metadata::first_token_index.

Throw an exception instead of abort the whole scylla process.
Message-Id: <c110645cee1ee3897e30a3ae1b7ab3f49c97412c.1504752890.git.asias@scylladb.com>
2017-09-14 10:33:02 +03:00
Gleb Natapov
31e803a36c storage_proxy: wire up percentile speculative read properly
Collect coordinator side read statistic per CF and use them in percentile
speculative read executor. Getting percentile from estimated_histogram
object is rather expensive, so cache it and recalculate only once per
second (or if requested percentile changes).

Fixes #2757

Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
2017-09-14 10:31:26 +03:00
Gleb Natapov
0842faecef estimated_histogram: fix overflow error handling
Currently overflow values are stored in incorrect bucket (last one
instead of special "overflow" one) and percentile() function throws
if there is overflow value. The patch fixes the code to store overflow
value in corespondent bucket and makes percentile() to take it into
account instead of throwing.

Message-Id: <20170911131752.27369-2-gleb@scylladb.com>
2017-09-14 10:31:21 +03:00
Asias He
5ff0b113c9 gossip: Fix indentation in apply_state_locally
Message-Id: <2bdefa8d982ad8da7452b41e894f41d865b83b0b.1505356245.git.asias@scylladb.com>
2017-09-14 10:09:50 +03:00
Tomasz Grabiec
65ca8eebb8 mutation_partition: Print rows_entry's position instead of key
For dummy rows, _key doesn't reflect the right position.

Message-Id: <1505317040-6783-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 20:49:28 +03:00
Avi Kivity
ca8e3c4a78 Merge "Evict from partition snapshots in cache" from Tomasz
"This series fixes the problem of active reads causing OOM due to the fact that
partition snapshots they hold are not evictable. In particular, a single scan
of a partition larger than memory will bad_alloc due to itself.

After this, when partition entry is evicted from cache, data in all the snapshots
is also evicted. We still don't have row-level eviction, but this series lays some
grounds for it by making cache readers prepared for the possibility of rows
being evicted.

Fixes #2775.
Fixes #2730."

* tag 'tgrabiec/snapshot-evicition-in-cache-v1' of github.com:scylladb/seastar-dev:
  tests: Add test for partition_entry::evict()
  mutation_partition: Introduce range continuity checking methods
  mutation_partition: Enable rows_entry::compare() on position_in_partition_views
  tests: Extract mvcc tests to separate file
  tests: row_cache: Add evicition tests
  tests: simple_schema: Add new_tombstone() helper
  tests: streamed_mutation_assertions: Introduce produces(mutation&)
  streamed_mutation: Allow setting buffer capacity
  row_cache: Evict partition snapshots
  mvcc: Introduce partition_entry::evict()
  row_cache: Handle eviction in partition reader
  tests: row_cache_test: Don't assume mvcc snapshots are not evictable
  row_cache: Reuse allocation_strategy::invalidate_references()
  row_cache: Don't invalidate references on insertion
  lsa: Move reclaim counter concept to allocation_strategy level
  mvcc: Ensure partition_snapshot always destroys versions using proper allocator
  mvcc: Encapsulate reference stability check in partition_snapshot
  mvcc: Store LSA region reference in partition_snapshot
2017-09-13 20:48:33 +03:00
Tomasz Grabiec
a45b1ef4bc sstables: Make atomic_deletion_manager logger static
So that it's visible to the framework at boot and --logger-log-level can
be used on it.

Message-Id: <1505286578-21904-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 20:35:41 +03:00
Tomasz Grabiec
b8f62e86de tests: Add test for partition_entry::evict() 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
455a1b0d24 mutation_partition: Introduce range continuity checking methods 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
abc489e99d mutation_partition: Enable rows_entry::compare() on position_in_partition_views
For full symmetry with existing overloads.
2017-09-13 17:47:04 +02:00
Tomasz Grabiec
d76b141b34 tests: Extract mvcc tests to separate file 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
2dfb3b95a5 tests: row_cache: Add evicition tests 2017-09-13 17:47:03 +02:00
Tomasz Grabiec
204ec9c673 tests: simple_schema: Add new_tombstone() helper 2017-09-13 17:47:03 +02:00
Tomasz Grabiec
5b1adfa542 tests: streamed_mutation_assertions: Introduce produces(mutation&) 2017-09-13 17:47:03 +02:00
Tomasz Grabiec
cb16b038ef streamed_mutation: Allow setting buffer capacity
Needed in tests to limit amount of prefetching done by readers, so
that it's easier to test interleaving of various events.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
c78047fa5b row_cache: Evict partition snapshots
If snapshots are not evicted, they may pin unbouned amount of memory
for a long time in cache, which may lead to OOM. Evict snapshots
together with the entry.

Fixes #2775.
Fixes #2730.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
b6ae5783cd mvcc: Introduce partition_entry::evict()
The operation frees as much memory as possible, marking affected
mutation elements as discontinuous.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
fa2c26342c row_cache: Handle eviction in partition reader 2017-09-13 17:38:08 +02:00
Tomasz Grabiec
99aa3d1964 tests: row_cache_test: Don't assume mvcc snapshots are not evictable
The test was not updating the underlying mutation source but still
expecting to get the right data after calling invalidate().  If
snapshots are evictable, that's not guaranteed. Apply to underlying as
well, so data is read from underlying if necessary.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
adb159d51b row_cache: Reuse allocation_strategy::invalidate_references()
Modification count in the tracker is redundant, we can rely on
allocator's invalidation counter.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
27a3b4bca9 row_cache: Don't invalidate references on insertion
modification_count is currently only used to detect invalidation of
references, intended to be incremented on erasure.

Insertion into intrusive set doesn't invalidate references, so no
need to increment the counter.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
87be474c19 lsa: Move reclaim counter concept to allocation_strategy level
So that generic code can detect invalidation of references. Also, to
allow reusing the same mechanism for signalling external reference
invalidation.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
4053c801e2 mvcc: Ensure partition_snapshot always destroys versions using proper allocator
partition_snapshot is managed by lw_shared_ptr. Currently it is
assumed that before it dies, maybe_merge_versions() is called on it,
which destroyes it in the right allocator context. It's not very
safe. This patch improves safety by using the right allocator in
snapshot's destructor.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
cda86abdbc mvcc: Encapsulate reference stability check in partition_snapshot 2017-09-13 17:38:08 +02:00
Tomasz Grabiec
2df6f356b1 mvcc: Store LSA region reference in partition_snapshot
Will be useful for improving encapsulation.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
4c920c9891 tests: cql_test_env: Use cancel_prior_atomic_deletions()
This fixes a failure in view_schema_test, which starts many instances
of single_node_cql_env. cancel_atomic_deletions() causes later
deletions to fail, which causes some of the test cases to fail.
Message-Id: <1505311250-3118-2-git-send-email-tgrabiec@scylladb.com>
2017-09-13 17:11:34 +03:00
Tomasz Grabiec
dc0860ac70 sstables: Introduce cancel_prior_atomic_deletions()
Like cancel_atomic_deletions() but doesn't fail later deletions.
Message-Id: <1505311250-3118-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 17:11:33 +03:00
Tomasz Grabiec
8a425cedc6 tests: cql_test_env: Cancel pending sstable deletions on shutdown
Fixes a hang on shutdown with --smp 2 in perf_fast_forward. The hang
is in sstables::await_background_jobs_on_all_shards(), which is
waiting on sstable deletions. Not all shards agree to delete certain
sstables, because e.g. not all shards decide to compact them
yet. Cancel those deletes after database is stopped on all shards,
like we do in main.cc

Fixes #2796.

Message-Id: <1505292239-26032-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 11:56:48 +03:00
Asias He
c84dcabb8f gossip: Use boost::copy_range in apply_state_locally
boost::copy_range is better because the vector is allocated with the
correct size instead of growing when the inserter is called.

[avi: also crashes less]

Message-Id: <b19ca92d56ad070fca1e848daa67c00c024e3a4d.1505291199.git.asias@scylladb.com>
2017-09-13 11:33:15 +03:00
Tomasz Grabiec
b3a8ba5af6 gdb: Introduce "scylla find" command
Finds live objects on seastar heap of current shard which contain
given value. Prints results in 'scylla ptr' format.

Example:

  (gdb) scylla find 0x600005321900
  thread 1, small (size <= 512), live (0x6000000f3800 +48)
  thread 1, small (size <= 56), live (0x6000008a1230 +32)

Message-Id: <1505284614-19577-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 11:22:23 +03:00
Asias He
fa9d47c7f3 gossip: Fix a log message typo in compare_endpoint_startup
Message-Id: <c4958950e1108082b63e08ab81ee2177edc9b232.1505286843.git.asias@scylladb.com>
2017-09-13 09:54:56 +02:00
Glauber Costa
ecad1be161 compaction_strategy: add missing header
compaction_strategy.hh throws an exception, but it doesn't add the
exception header. It is working in-tree because of inclusion order,
but it broke one of my yet-out-of-tree changes.

In any case, it is best to add the headers we will need to the files,
and that is what this patch does.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170912233326.26114-1-glauber@scylladb.com>
2017-09-13 08:40:15 +02:00
Raphael S. Carvalho
ef18b1162b sstables/compaction_manager: rename and better explain reshard function
submit doesn't properly describe the function and also improve explanation
of the relationship between function itself and its job parameter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170912032034.23043-1-raphaelsc@scylladb.com>
2017-09-12 12:25:17 +03:00
Avi Kivity
1bd207a306 sstables: merge filter.cc into sstables.cc
filter.cc has just two smallish functions, which are part of the sstable
class. Move them to sstables.cc where the rest of the class members are defined.
Message-Id: <20170912080541.7836-1-avi@scylladb.com>
2017-09-12 10:06:52 +02:00
Tomasz Grabiec
423142ec81 tests: row_cache_test: Fix abort in debug mode
The test used apply() variant which assumed that it was invoked in a
seastar thread, which is no longer the case after commit d22fdf4. Fix
by copying outisde cache update, and use non-deferring apply() variant
for cache update.

Message-Id: <1505200142-3650-1-git-send-email-tgrabiec@scylladb.com>
2017-09-12 10:57:36 +03:00
Tomasz Grabiec
3f527e028d Merge "Reduce dependencies on sstables.hh" from Avi
This patchset reduces includes of sstables.hh, reducing compile time
by both reducing the amount of code compiled, and the amount of
needless recompiles caused by false dependencies.  It does so by
replacing lw_shared_ptr<sstable>, which requires a complete class,
with a new custom type shared_sstable, which allows an incomplete
sstable class definition.

* https://github.com/avikivity/scylla deps2/v2.1
  database: change truncate() to flush while compaction is disabled
  database: make run_with_compaction_disabled() a non-template
  database: add indirection to compaction_manager instance
  database: remove dependency on compaction.hh and compaction_manager.hh
  size_estimates_virtual_reader.hh: add missing include
  system_keyspace: add missing include
  main: add missing include
  storage_service: add missing include
  repair: add missing include
  compaction.hh: add missig include and forward declaration
  compaction_manager: add missing include
  shared_index_lists.hh: add missing include
  perf_fast_forward: add missing include
  sstable_mutation_test: add missing include
  sstables: extract version and format enum into a separate header file
  database.hh: add missing forward declaration for
    foreign_sstable_open_info
  cql_test_env: add forward declaration
  database: make column_family::disable_sstable_write() out-of-line
  sstables: introduce make_sstable() as a shortcut for
    make_lw_shared<sstable>
  treewide: use shared_sstable, make_sstable in place of
    lw_shared_ptr<sstable>
  sstables: use support for lw_shared_ptr with incomplete type for
    shared_sstable
  sstables: reduce dependencies
  streaming: remove unneeded includes
2017-09-12 09:56:46 +02:00
Tomasz Grabiec
ee1e7732a6 database: Create tables with continuous cache
When table is created, it doesn't contain any data, so we can mark the whole
data range as continuous in cache. This way reads will immediately hit, and
flushes will populate. If sstables are later attached, the attaching process
is supposed to invalidate affected ranges (and it does).

Fixes #2536.

Message-Id: <1505200269-4031-1-git-send-email-tgrabiec@scylladb.com>
2017-09-12 10:53:07 +03:00
Avi Kivity
85a6a2b3cb streaming: remove unneeded includes 2017-09-12 10:43:39 +03:00
Avi Kivity
578bf55371 sstables: reduce dependencies
Use shared_sstable.hh instead of sstables.hh.
2017-09-12 10:43:36 +03:00
Avi Kivity
07feaf9c4c sstables: use support for lw_shared_ptr with incomplete type for shared_sstable
Use the lw_shared_ptr deleter support to define shared_sstable without
pulling the definition of class sstable, reducing compile time and
dependencies if only shared_sstable is needed.
2017-09-12 10:43:05 +03:00
Avi Kivity
f7023501d6 treewide: use shared_sstable, make_sstable in place of lw_shared_ptr<sstable>
Since shared_sstable is going to be its own type soon, we can't use the old alias.
2017-09-12 10:43:05 +03:00
Avi Kivity
1a3cdffbc1 sstables: introduce make_sstable() as a shortcut for make_lw_shared<sstable>
shared_sstable will soon not be an alias for lw_shared_ptr<sstable>, so we need
another factory function.
2017-09-12 10:43:05 +03:00
Avi Kivity
88b91c84a1 database: make column_family::disable_sstable_write() out-of-line
Reduces dependencies.
2017-09-12 10:43:05 +03:00
Avi Kivity
02028df9b1 cql_test_env: add forward declaration
Not worthwhile to add a new #include for this.
2017-09-12 10:43:05 +03:00
Avi Kivity
02e5bf1c20 database.hh: add missing forward declaration for foreign_sstable_open_info
Supplied by an incidental include now, but it will be gone soon.
2017-09-12 10:43:05 +03:00
Avi Kivity
c4bafd912c sstables: extract version and format enum into a separate header file
This allows removing a dependency on sstables.hh later on.
2017-09-12 10:43:05 +03:00
Avi Kivity
5ebb15b9d4 sstable_mutation_test: add missing include 2017-09-12 10:43:05 +03:00
Avi Kivity
fdab47ab32 perf_fast_forward: add missing include 2017-09-12 10:43:05 +03:00
Avi Kivity
ca2d0b4efb shared_index_lists.hh: add missing include 2017-09-12 10:43:05 +03:00
Avi Kivity
eb62b2c00d compaction_manager: add missing include 2017-09-12 10:43:05 +03:00
Avi Kivity
0efa444a56 compaction.hh: add missing includes 2017-09-12 10:42:45 +03:00
Avi Kivity
7ca029c8f1 database_fwd.hh: add column_family forward declaration 2017-09-12 10:41:28 +03:00
Avi Kivity
4751402709 build: disable -fsanitize-address-use-after-scope on CqlParser.o
The parser generator somehow confuses the use-after-scope sanitizer, causing it
to use large amounts of stack space. Disable that sanitizer on that file.
Message-Id: <20170905110628.18047-1-avi@scylladb.com>
2017-09-11 19:42:26 +02:00
Avi Kivity
43a72254ff repair: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
aebab377d9 storage_service: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
a3b8089bd4 main: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
0aaefe665b system_keyspace: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
d3cde2e2be size_estimates_virtual_reader.hh: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
9b540eccb0 database: remove dependency on compaction.hh and compaction_manager.hh 2017-09-11 20:09:45 +03:00
Avi Kivity
f9c8c1ddc2 database: add indirection to compaction_manager instance
Allows making it forward-declared later on, reducing dependencies.
2017-09-11 20:09:45 +03:00
Avi Kivity
9d0aaa941a database: make run_with_compaction_disabled() a non-template
Allows reducing dependencies down the line, and un-templating
non-performance-critical functions is a good thing.
2017-09-11 20:09:45 +03:00
Avi Kivity
6b5514a3df database: change truncate() to flush while compaction is disabled
In preparation to make run_with_compaction_disabled() a non-template,
we want to remove any non-copyable captures (so the function can be
an std::function, which requires copyability). Move the flush within
the compaction disabled region. This changes the behavior, but it shouldn't
matter.
2017-09-11 20:09:45 +03:00
Avi Kivity
14fd4168dc Merge seastar upstream
* seastar 31b925d...92fdce2 (3):
  > shared_ptr: allow incomplete classes in lw_shared_ptr<>
  > Update DPDK to 17.05
  > future: pass func as mutable to lambda arg of handle_exception[_type]
2017-09-11 20:09:04 +03:00
Tomasz Grabiec
95b3eaac97 debug: Allow running scylla_row_cache_report.stp script against a running process
Message-Id: <1504776359-16424-1-git-send-email-tgrabiec@scylladb.com>
2017-09-11 14:17:30 +03:00
Avi Kivity
fe019ad84d Merge "Refuse to load non-Scylla counter sstables" from Paweł
"These patches make Scylla refuse to load counter sstables that may
contain unsupported counter shards. They are recognised by the lack of
the Scylla component.

Fixes #2766."

* tag 'reject-non-scylla-counter-sstables/v1' of https://github.com/pdziepak/scylla:
  db: reject non-Scylla counter sstables in flush_upload_dir
  db: disallow loading non-Scylla counter sstables
  sstable: add has_scylla_component()
2017-09-11 13:28:44 +03:00
Tzach Livyatan
83eab5c8d7 Remove comment about Too high number of concurrent compactions from scylla_compaction_manager_compactions help
It should never happen and its not clear what too high stands for

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170911085645.21222-1-tzach@scylladb.com>
2017-09-11 13:27:35 +03:00
Gleb Natapov
d0d8bdf615 storage_proxy: remove unused parameter from get_restricted_ranges() function
Message-Id: <20170911084653.GH24167@scylladb.com>
2017-09-11 11:58:44 +02:00
Gleb Natapov
f66e9377d4 storage_proxy: do not keep reference to a keyspace during write
A keyspace can be deleted while write is ongoing, so the object cannot
be used after defer point. The keyspace reference is only used to check
how many replies a write operation should wait for and this can be
precalculated during write handler creation.

Fixes #2777

Message-Id: <20170911084436.GG24167@scylladb.com>
2017-09-11 11:57:00 +02:00
Asias He
bb9dbc5ade storage_service: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <d2f15007a54554b58e29fd05331c06ae030d582f.1504832296.git.asias@scylladb.com>
2017-09-10 18:10:24 +03:00
Botond Dénes
9ebeb9d5ce Fix --Wreturn-type warnings in tests: use abort() instead of assert(0)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <95927f933411302e84d57d169ee0147def7bc643.1504890922.git.bdenes@scylladb.com>
2017-09-10 17:09:53 +03:00
Gleb Natapov
9137446109 api: uses correct statistics for storage proxy range histograms.
Message-Id: <20170910073458.GB1870@scylladb.com>
2017-09-10 16:18:36 +03:00
Pekka Enberg
d2632ddf1d Merge "gossip: optimize apply_state_locally for large cluster" from Asias
"This series tries to improve the bootstrap of a node in a large cluster by
improving how gossip applies the gossip node state. In #2404, the joining node
failed to bootstrap, because it did not see the seed node when
storage_service::bootstrap ran. After this series, we apply the whole gossip
state contained in the gossip ack/ack2 message before applying the next one,
and we apply the state of the seed node earlier than non-seed node so we can
have the seed node's state faster. We also add some randomness to the order of
applying gossip node state to prevent some of the nodes' state are always
applied earlier than the others.

This series improves apply_state_locally for large cluster:

 - Tune the order of applying endpoint_state
 - Serialize apply_state_locally
 - Avoid copying of the gossip state map

Fixes #2404"

* tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev:
  gossip: Avoid copying with apply_state_locally
  gossip: Serialize apply_state_locally
  gossip: Tune the order of applying endpoint_state in apply_state_locally
  gossip: Introduce is_seed helper
  gossip: Pass const endpoint_state& in notify_failure_detector
  gossip: Pass reference in notify_failure_detector
2017-09-08 11:41:43 +03:00
Asias He
57dd3cb2c5 gossip: Do not use c_str() in the logger
Use logger.info("{}", msg) instead.

Message-Id: <52c24d7dfe082ee926f065a6268d83fcb31ddc28.1504832289.git.asias@scylladb.com>
2017-09-08 10:59:42 +03:00
Asias He
e98ce7887b gossip: Avoid copying with apply_state_locally
Move the std::map<inet_address, endpoint_state> map from the gossip
ack/ack2 message directly and move it around in apply_state_locally to
avoid copying the map.
2017-09-08 15:19:48 +08:00
Asias He
fd879b4e09 gossip: Serialize apply_state_locally
apply_state_locally will be called when gossip ack/ack2
message is received. It will use the std::map<inet_address,
endpoint_state>& map to update the endpoint state.

However, we can receive multiple such gossip ack/ack2  messages from
multiple peer nodes in parallel. Currently, we process them in parallel.
It is better to apply all the states from one node then move to apply
all the states from another node than interleaving. Because it is more
important to have the state of the whole cluster than to have a bit
newer state from another peer (if it is newer), especially when the node
boots up and runs its first round of gossip exchange.

After this patch, we apply the whole gossip state contained in the
gossip ack/ack2 message before applying the next one.
2017-09-08 15:19:47 +08:00
Asias He
9ccba950ba gossip: Tune the order of applying endpoint_state in apply_state_locally
We currently always apply the endpoint_state in the order of the
endpoint ip address. This is not good because some of the endpoint's
state is always applied earlier than the others.

In large cluster, the number of endpoints can be large, it takes time to
apply all of them. To make it more fair, we apply the endpoint_state
randomly.

Apply the seed node's state earlier because in bootstrap, we will check
if we have seen the seed node in storage_service::bootstrap. In #2404,
the bootstrap failed because, the joining node hasn't apply the seed
node's state when storage_service::bootstrap runs.
2017-09-08 15:19:47 +08:00
Asias He
c5456ed38f gossip: Introduce is_seed helper
To check if a endpoint is a seed node.
2017-09-08 15:19:47 +08:00
Asias He
32edd95241 gossip: Pass const endpoint_state& in notify_failure_detector 2017-09-08 15:19:47 +08:00
Asias He
46e562cbfa gossip: Pass reference in notify_failure_detector
In large cluster, the map can be large. Pass reference to avoid copying.
2017-09-08 15:19:47 +08:00
Glauber Costa
db846326f8 compaction: remove dead code
This code has no more users. Bury it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170908005305.29925-1-glauber@scylladb.com>
2017-09-08 08:17:15 +02:00
Tomasz Grabiec
57dc988475 Update seastar submodule
* seastar 85ca12d...31b925d (19):
  > net/byteorder: fix 64 bit ntohq and htonq on big endian machines
  > core, util: fix compilation on non-x86 processors
  > core/memory: Fix SIGSEGV in small_pool::add_more_objects()
  > log: remove debug leftovers
  > Merge "TLS state machine fixes" from Calle
  > logger: allow adjusting the timestamp style for stdout logs
  > thread: make thread_context::s_main portable
  > core: add seastar::cache_line_size constant
  > Add detach() to input_stream and output_stream
  > Install dependencies for Arch Linux.
  > tls: Guard non-established sockets in sesrefs + more explicit close + states
  > tls: Make vec_push fully exception safe
  > basic_sstring: resize uses sstring
  > Merge "Add and correct unit tests" from Jesse
  > tcp: enforce 1-byte maximum segment invariant with zero window
  > tcp: verify 1-byte maximum segment invariant during send with zero window
  > memory: reduce small_pool vulnerability to fragmentation further
  > Prometheus: avoid merging all metrics family
  > net: Fix possible NULL pointer dereference.
2017-09-07 10:34:27 +02:00
Avi Kivity
d9ee2ad9f0 chunked_vector: avoid boost::small_vector with old boost versions
Apparently older boost versions have a bug resulting in a double-free
in boost::container::small_vector. Use std::vector instead.

Fixes #2748.

Tested-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20170903170207.21635-1-avi@scylladb.com>
2017-09-07 09:32:51 +03:00
Tomasz Grabiec
121cd8cb6c tests: Fix cql_query_test.cc::test_duration_restrictions
validate_request_failure() assumed that the future returned by execute_cql()
is always ready, which doesn't have to be the case, and caused aborts
in debug mode build.

Message-Id: <1504701342-13300-1-git-send-email-tgrabiec@scylladb.com>
2017-09-06 15:49:03 +03:00
Tomasz Grabiec
3986486cb3 tests: cql_test_env: Avoid exceptions to make debugging easier
Message-Id: <1504701375-13491-1-git-send-email-tgrabiec@scylladb.com>
2017-09-06 15:48:59 +03:00
Paweł Dziepak
e401d2d50b db: reject non-Scylla counter sstables in flush_upload_dir
Scylla already refuses to load counter sstables that do not have Scylla
component. However, if this happens because of 'nodetool refresh'
command the existing protection will trigger after sstables have been
moved to the data directory. This is too later, so an additional check
is added when the upload directory is scanned.
2017-09-06 12:04:26 +01:00
Paweł Dziepak
6a5e8bace1 db: disallow loading non-Scylla counter sstables
Scylla does not support local and remote counter shards. This means that
it is unsafe to directly load sstables that may contain them.
2017-09-06 12:03:58 +01:00
Paweł Dziepak
ebc538f4a3 sstable: add has_scylla_component()
has_scylla_component() is going to be used to verify that an sstable has
been generated by a recent version of Scylla. This would make it
possible to reject sstables that may be unsafe to load (e.g. sstables
containing legacy counter shards).
2017-09-06 12:03:45 +01:00
Avi Kivity
a59e375aad Merge "Support termination of repair jobs" from Asias
"This series implements the missing API to terminate all repairs.

For example:

$ curl -X POST  --header "Accept: application/json"
"http://127.0.0.1:10000/storage_service/force_terminate_repair"

With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.

On top of this, we can support termination of single repair job instead all
jobs.

Fixes #2105"

* tag 'asisas/repair_abort_v4' of github.com:scylladb/seastar-dev:
  repair: Support termination of repair jobs
  repair: Track repair_info
  repair: Intorduce repair id to repair_info map
  api: Add force_terminate_repair API
  streaming: Add abort to stream_plan
  streaming: Add abort_all_stream_sessions for stream_coordinator
  streaming: Introduce streaming::abort()
  streaming: Make stream_manager and coordinator message debug level
  streaming: Check if _stream_result is valid
  streaming: Log peer address in on_error
  streaming: Introduce received_failed_complete_message
2017-09-06 12:58:05 +03:00
Avi Kivity
31706ba989 Merge "Fix Scylla upgrades when counters are used" from Paweł
"Scylla 1.7.4 and older use incorrect ordering of counter shards, this
was fixed in 0d87f3dd7d ("utils::UUID:
operator< should behave as comparison of hex strings/bytes"). However,
that patch was not backported to 1.7 branch until very recently. This
means that versions 1.7.4 and older emit counter shards in an incorrect
order and expect them to be so. This is particularly bad when dealing
with imported correct sstables in which case some shards may become
duplicated.

The solution implemented in this patch is to allow any order of counter
shards and automaticly merge all duplicates. The code is written in a
way so that the correct ordering is expected in the fast path in order
not to excessively punish unaffected deployments.

A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless
upgrade from 1.7.4 to later Scylla versions. If that feature is not
available Scylla still writes sstables and sends on-wire counters using
the old ordering so that it can be correctly understood by 1.7.4, once
the flag becomes available Scylla switches to the correct order.

Fixes #2752."

* tag 'fix-upgrade-with-counters/v2' of https://github.com/pdziepak/scylla:
  tests/counter: verify counter_id ordering
  counter: check that utils::UUID uses int64_t
  mutation_partition_serializer: use old counter ordering if necessary
  mutation_partition_view: do not expect counter shards to be sorted
  sstables: write counter shards in the order expected by the cluster
  tests/sstables: add storage_service_for_tests to counter write test
  tests/sstables: add test for reading wrong-order counter cells
  sstables: do not expect counter shards to be sorted
  storage_service: introduce CORRECT_COUNTER_ORDER feature
  tests/counter: test 1.7.4 compatible shard ordering
  counters: add helper for retrieving shards in 1.7.4 order
  tests/counter: add tests for 1.7.4 counter shard order
  counters: add counter id comparator compatible with Scylla 1.7.4
  tests/counter: verify order of counter shards
  tests/counter: add test for sorting and deduplicating shards
  counters: add function for sorting and deduplicating counter cells
  counters: add counter_id::operator>
2017-09-05 14:20:55 +03:00
Paweł Dziepak
ed68a75b75 tests/counter: verify counter_id ordering 2017-09-05 10:52:54 +01:00
Paweł Dziepak
cdf7ba76f1 counter: check that utils::UUID uses int64_t 2017-09-05 10:46:03 +01:00
Paweł Dziepak
4aa72c6454 mutation_partition_serializer: use old counter ordering if necessary
Until the cluster is fully upgraded from a version that uses the
incorrect counter shard ordering it is essential to keep using it lest
the old nodes corrupt the data upon receiving mutations with a counter
shard ordering they do not expect.
2017-09-05 10:32:48 +01:00
Paweł Dziepak
b540516e5e mutation_partition_view: do not expect counter shards to be sorted 2017-09-05 10:32:48 +01:00
Paweł Dziepak
84edb5a1f2 sstables: write counter shards in the order expected by the cluster
If the feature signaling that we have switched to the correct ordering
of counter shards is not enabled it means that the user still can do a
rollback to a version that expects wrong ordering. In order to avoid any
disasters when that happens write sstables using the 1.7.4 order until
we know for sure that it is no longer needed.
2017-09-05 10:32:48 +01:00
Paweł Dziepak
2b614201a7 tests/sstables: add storage_service_for_tests to counter write test
Writing a counters to a sstable is going to require cluster feature
information, which requires accessing some singletons.
2017-09-05 10:32:48 +01:00
Paweł Dziepak
5007c9290a tests/sstables: add test for reading wrong-order counter cells 2017-09-05 10:32:48 +01:00
Paweł Dziepak
3e1d09e71d sstables: do not expect counter shards to be sorted 2017-09-05 10:32:48 +01:00
Paweł Dziepak
ecd2bf128b storage_service: introduce CORRECT_COUNTER_ORDER feature
Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix
this problem a new feature is introduced that will be used to determine
when nodes with that bug fixed can start sending counter shard in the
correct order.
2017-09-05 10:32:48 +01:00
Paweł Dziepak
1e03c4acbe tests/counter: test 1.7.4 compatible shard ordering 2017-09-05 10:32:47 +01:00
Paweł Dziepak
067e429881 counters: add helper for retrieving shards in 1.7.4 order 2017-09-05 10:32:47 +01:00
Paweł Dziepak
fd25a09db2 tests/counter: add tests for 1.7.4 counter shard order 2017-09-05 10:32:47 +01:00
Paweł Dziepak
a93e8ce185 counters: add counter id comparator compatible with Scylla 1.7.4 2017-09-05 10:32:47 +01:00
Paweł Dziepak
b0f67c1680 tests/counter: verify order of counter shards 2017-09-05 10:32:47 +01:00
Paweł Dziepak
27397b5dad tests/counter: add test for sorting and deduplicating shards 2017-09-05 10:32:47 +01:00
Paweł Dziepak
e0c2379f26 counters: add function for sorting and deduplicating counter cells
Due to a bug in an implementation of UUID less compare some Scylla
versions sort counter shards in an incorrect order. Moreover, when
dealing with imported correct data the inconsistencies in ordering
caused some counter shards to become duplicated.
2017-09-05 10:32:39 +01:00
Paweł Dziepak
74af818eaf counters: add counter_id::operator> 2017-09-04 18:25:47 +01:00
Avi Kivity
4b06a2e95d Merge "Fix exception safety in cache update related paths" from Tomasz
* 'tgrabiec/make-row-cache-update-exception-safe' of github.com:scylladb/seastar-dev:
  row_cache: Improve safety of cache updates
  row_cache: Extract invalidate_sync()
  memtable: Mark mark_flushed() as noexcept
  database: Add non-throwing try_trigger_compaction()
  database: Make add_sstable() have strong exception guarantees
  row_cache: Don't require presence checker to be supplied externally
  database: Supply presence checker in sstable snapshots
  mutation_source: Introduce mutation_source::make_partition_presence_checker()
  mutation_reader: Move definitions up in the header
  mutation_reader: Use constructor delegation to reduce code duplication
  row_cache: Make populate() preserve continuity
  row_cache: Allow marking as fully continuous on construction
  database: Add missing serialization of sstable set udpate and cache invalidation
2017-09-04 18:37:42 +03:00
Tomasz Grabiec
d22fdf4261 row_cache: Improve safety of cache updates
Cache imposes requirements on how updates to the on-disk mutation source
are made:
  1) each change to the on-disk muation source must be followed
     by cache synchronization reflecting that change
  2) The two must be serialized with other synchronizations
  3) must have strong failure guarantees (atomicity)

Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.

Normally cache synchronization achieves no-failure thing by wiping the
cache (which is noexcept) in case failure is detect. There are some
setup steps hoever which cannot be skipped, e.g. taking a lock
followed by switching cache to use the new snapshot. That truly cannot
fail.  The lock inside cache synchronizers is redundant, since the
user needs to take it anyway around the combined operation.

In order to make ensuring strong exception guarantees easier, and
making the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modiciation of the underlying mutation
source when invoked.

This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.

The failure semantics of update() and other synchronizers needed to
change due to strong exception guaratnees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.

The database::_cache_update_sem goes away, serialization is done
internally by the cache.

The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.

Fixes #2754.

Also fixes the following test failure:

  tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed

which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
b0f3efa577 row_cache: Extract invalidate_sync() 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
673a22f8e1 memtable: Mark mark_flushed() as noexcept
Callers rely on that.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
bf75b882ae database: Add non-throwing try_trigger_compaction() 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
116d4ae02b database: Make add_sstable() have strong exception guarantees
If insert() fails, we left the database with stats updated, but
sstable not being attached.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
56e3ce05db row_cache: Don't require presence checker to be supplied externally
The API is simpler and safer this way.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
df787afe6a database: Supply presence checker in sstable snapshots 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
8a9f0f86e7 mutation_source: Introduce mutation_source::make_partition_presence_checker()
Every mutation source can have a presence checker. By default all
answer "maybe contains".

Having this on mutation_source level will be useful for simplifying
cache update flow. The cache can ask the right snapshot for a presence
checker rather than relying on database to know when and how to make
the right one which preserves all invariants.

This will be especially useful once all updates of the underlying
mutation source of cache (e.g. sstable list) will have to go through
cache for safety reasons.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
065feb1b7b mutation_reader: Move definitions up in the header 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
4e4839082b mutation_reader: Use constructor delegation to reduce code duplication 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
1a2f17d42c row_cache: Make populate() preserve continuity 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
bc3112a187 row_cache: Allow marking as fully continuous on construction
Will be needed in tests.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
ab8632b225 database: Add missing serialization of sstable set udpate and cache invalidation
Commit e3ad676433 missed a few places.

It is required to serialize sstable list update and cache synchronization
in order to preserve partition update isolation.

Fixes #2746.
2017-09-04 10:04:29 +02:00
Piotr Jastrzebski
dd5dc75605 Stop calling _local_cache.stop in at_exit.
This removes a race condition that was causing #2721

Fixes #2721

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ad060fab43d63c17db9f811c421d7ab26e5e57c8.1503933021.git.piotr@scylladb.com>
2017-09-03 15:55:48 +03:00
Asias He
e14bb7b1d5 repair: Remove #if'ed code in repair_ranges
It is unlikely we will use parallel_for_each version in repair_ranges.
Get rid of the dead code.

Message-Id: <31a9366adfe0262512a326ef9703aa0bba05e1fb.1503996138.git.asias@scylladb.com>
2017-09-03 11:13:02 +03:00
Avi Kivity
0524cbbd72 Merge db/config.cc cleanups from Jesse
* 'jhk/config_hygiene/v1' of https://github.com/hakuch/scylla:
  db/config.cc: Clarify documentation for `typed_value_ex`
  db/config.cc: Fix formatting and warnings
  db/config.cc: Remove unnecessary `mutable` on lambdas
  db/config.cc: Remove unused variables
2017-09-03 11:08:53 +03:00
Botond Dénes
a980ff6463 Use abort() instead of assert + throw in unreachable code
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <393c3730111dfe090c44d8fc2e31602956a7d008.1504022425.git.bdenes@scylladb.com>
2017-09-03 11:07:27 +03:00
Raphael S. Carvalho
22701346de sstables/stcs: avoid needless copy of bucket in get_buckets()
In addition, remove bucket by iterator which is faster.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170903000315.16338-1-raphaelsc@scylladb.com>
2017-09-03 10:46:48 +03:00
Avi Kivity
551eb75eb0 Update AMI submodule
* dist/ami/files/scylla-ami b41e5eb...5ffa449 (3):
  > amzn-main.repo: stick to Amazon Linux 2017.03 kernel (4.9.x)
  > Prevent dependency error on 'yum update'
  > scylla_create_devices: don't raise error when no disks found
2017-08-31 15:13:52 +03:00
Glauber Costa
e642aee3f7 database: wait for asynchronous operations to end before closing CF
This was part of "add gate for generic async operations to column family" but
somehow didn't make it into the final patch.

Add the missing piece.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170830164205.4497-1-glauber@scylladb.com>
2017-08-31 11:16:30 +03:00
Avi Kivity
23d3ca56a1 Merge "optional integrity checker of sstable component writes" from Raphael
"optional interposer that will check integrity of writes to sstable components.
The option name is enable_sstable_data_integrity_check, it's disabled by default
and can be enabled via config file.
It will provide enough details that will help to find the root of the issue.

if disk failed for example, we would've something like the following reported:
ERROR 2017-08-17 09:18:11,577 [shard 0] sstable - integrity check failed for ./data/data/system_schema/aggregates-924c55872e3a345bb10c12f37c1ba895/system_schema-aggregates-ka-111-Scylla.db, stage: read after write verification, write: 4096 bytes to offset 0, reason: data read from underlying storage isn't the same as written, mismatch at byte 0:
 data written sample:   10000001000000010000001a00000001
 date read sample:      00000000000000000000000000000000"

* 'integrity_check_interposer_v3' of github.com:raphaelsc/scylla:
  sstables: optionally check integrity of sstable component writes
  sstables: remove unneeded new_sstable_component_file variant
  db/config: add sstable_data_integrity_check option
  sstables: introduce file interposer for integrity check
2017-08-31 11:08:12 +03:00
Raphael S. Carvalho
a84fbde8c8 sstables: optionally check integrity of sstable component writes
If config file's sstable_data_integrity_check option is enabled,
new integrity check interposer will be used in addition to the
existing one. Performance is expected to drop because of all
the integrity checks for every write. This new interposer will
provide detailed info when integrity check fails.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-31 02:27:50 -03:00
Raphael S. Carvalho
04ea4daa7e sstables: remove unneeded new_sstable_component_file variant
can get rid of it because file_open_options is optional in
reactor::open_file_dma()

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-31 02:17:34 -03:00
Raphael S. Carvalho
0218d6fd8f db/config: add sstable_data_integrity_check option
If enabled, interposer for checking integrity of sstable component
writes will be used.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-30 13:57:08 -03:00
Raphael S. Carvalho
f76b609cf5 sstables: introduce file interposer for integrity check
optional interposer that will check integrity when writing to
sstable components. It will provide enough details that will
help to find the root of the issue, which may come from lower
level layers.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-30 11:55:36 -03:00
Jesse Haber-Kucharsky
eddf34d005 test.py: Add missing tests
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <6fc5e810495801e646ccc41c16b581c8eceeda22.1504030666.git.jhaberku@scylladb.com>
2017-08-30 09:58:12 +01:00
Asias He
471e8b341f repair: Support termination of repair jobs
This patch implements the missing API to terminate all repairs.

For example:

$ curl -X POST  --header "Accept: application/json"
"http://127.0.0.2:10000/storage_service/force_terminate_repair"

With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.

Fixes #2105
2017-08-30 15:19:52 +08:00
Asias He
07d9dc03ec repair: Track repair_info
Make repair_info a shared pointer and store them in _repairs map so we
can find by the repair id and access them later.
2017-08-30 15:19:52 +08:00
Asias He
5c9732c645 repair: Intorduce repair id to repair_info map
The maps are stored in a vector. The vector has smp::count elements, each
element will be accessed by only one shard.

The add_repair_info, remove_repair_info and get_repair_info helpers
are added.
2017-08-30 15:19:51 +08:00
Asias He
6dc62c6215 api: Add force_terminate_repair API
The api /storage_service/force_terminate is supposed to be
/storage_service/force_terminate_repair.

scylla-jmx uses /storage_service/force_terminate api.
So instead of renaming it, it is better to add a new name for it.
2017-08-30 15:19:51 +08:00
Asias He
9c8da2cc56 streaming: Add abort to stream_plan
It can be used by the user of stream_plan to abort the stream sessions.
Repair will be the first user when aborting the repair.
2017-08-30 15:19:51 +08:00
Asias He
475b7a7f1c streaming: Add abort_all_stream_sessions for stream_coordinator
It will abort all the sessions within the stream_coordinator.
It will be used by stream_plan soon.
2017-08-30 15:19:51 +08:00
Asias He
fad34801bf streaming: Introduce streaming::abort()
It will be used soon by stream_plan::abort() to abort a stream session.
2017-08-30 15:19:50 +08:00
Asias He
7fba7cca01 streaming: Make stream_manager and coordinator message debug level
When we abort a session, it is possible that:

node 1 abort the session by user request
node 1 send the complete_message to node 2
node 2 abort the session upon receive of the complete_message
node 1 sends one more stream message to node 2 and the stream_manager
for the session can not be found.

It is fine for node 2 to not able to find the stream_manager, make the
log on node 2 less verbose to confuse user less.
2017-08-30 15:19:50 +08:00
Asias He
be573bcafb streaming: Check if _stream_result is valid
If on_error() was called before init() was executed, the
_stream_result can be invalid.
2017-08-30 15:19:44 +08:00
Asias He
8a3f6acdd2 streaming: Log peer address in on_error 2017-08-30 15:18:27 +08:00
Asias He
eace5fc6e8 streaming: Introduce received_failed_complete_message
It is the handler for the failed complete message. Add a flag to
remember if we received a such message from peer, if so, do not send
back the failed complete message back to the peer when running
close_session with failed status.
2017-08-30 15:18:27 +08:00
Asias He
cc18da5640 Revert "gossip: Make bootstrap more robust"
This reverts commit b56ba02335.

After commit 8fa35d6ddf (messaging_service: Get rid of timeout and retry
logic for streaming verb), streaming verb in rpc does not check if a
node is in gossip memebership since all the retry logic is removed.

Remove the extra wait before removing the joining node from gossip
membership.

Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>
2017-08-30 10:14:11 +03:00
Avi Kivity
48b9e47f7d Revert "row_cache: Add missing handling for failures happening outside the updating thread"
This reverts commit f9feb310ab (requested by author).
2017-08-29 19:26:02 +03:00
Tomasz Grabiec
f9feb310ab row_cache: Add missing handling for failures happening outside the updating thread
Thread stack allocation may fail, in which case we did not do the
necessary invalidation. Fix by hoisting the scope of the cleanup function.

Also fixes the following test failure:

  tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed

which started to trigger after commit 318423d50b.

Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>
2017-08-29 19:17:22 +03:00
Tomasz Grabiec
5d2f2bc90b lsa: Mark region::merge() as noexcept
It seems to satisfy this, and row_cache::do_update() will rely on it to simplify
error handling.

Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>
2017-08-29 19:17:17 +03:00
Asias He
8fa35d6ddf messaging_service: Get rid of timeout and retry logic for streaming verb
With the "Use range_streamer everywhere" (7217b7ab36) seires, all
the user of streaming now do streaming with relative small ranges and
can retry streaming at higher level.

There are problems with timeout and retry at RPC verb level in streaming:
1) Timeout can be false negative.
2) We can not cancel the send operations which are already called. When
user aborts the streaming, the retry logic keeps running for a long
time.

This patch removes all the timeout and retry logic for streaming verbs.
After this, the timeout is the job of TCP, the retry is the job of the
upper layer.

Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>
2017-08-29 17:20:00 +03:00
Botond Dénes
d1209c548a Fix -Wreturn-type warnings
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <99f7a006daaa78eb87720ac51c394093398bc868.1504013915.git.bdenes@scylladb.com>
2017-08-29 16:41:09 +03:00
Tomer Sandler
f1eb6a8de3 node_health_check: Various updates
- Removed text from Report's "PURPOSE" section, which was referring to the "MANUAL CHECK LIST" (not needed anymore).
- Removed curl command (no longer using the api_address), instead using scylla --version
- Added -v flag in iptables command, for more verbosity
- Added support to for OEL (Oracle Enterprise Linux) - minor fix
- Some text changes - minor
- OEL support indentation fix + collecting all files under /etc/scylla
- Added line seperation under cp output message

Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828131429.4212-1-tomer@scylladb.com>
2017-08-29 15:15:10 +03:00
Paweł Dziepak
90c77c89ae test.py: add missing compress_test
Message-Id: <20170829105331.27078-1-pdziepak@scylladb.com>
2017-08-29 13:05:11 +02:00
Paweł Dziepak
d5fa07f6df Merge "sstables: switch from deque<> to a custom container" from Avi
Large deques require contiguous storage, which may not be available (or may
be expensive to obtain).  Switch to new custom container instead, which allocates
less contiguous storage.

Allocation problems were observed with the summary and compression info. While
there is work to reduce compression info contiguous space use, this solves
all std::deque problems (and should not conflict with that work).

Fixes #2708

* tag '2708/v6' of https://github.com/avikivity/scylla:
  sstables: switch std::deque to chunked_vector
  tests: add test for chunked_vector
  utils: add a new container type chunked_vector
2017-08-29 11:11:01 +01:00
Avi Kivity
5224ab9c92 Merge "Fix sstable reader not working for empty set of clustering ranges" from Tomasz
"Fixes #2734."

* 'tgrabiec/make-sstable-reader-work-with-empty-range-set' of github.com:scylladb/seastar-dev:
  tests: Introduce clustering_ranges_walker_test
  tests: simple_schema: Add missing include
  sstables: reader: Make clustering_ranges_walker work with empty range set
  clustering_ranges_walker: Make adjacency more accurate
2017-08-29 10:28:49 +03:00
Asias He
a36141843a gossip: Switch to seastar::lowres_system_clock
The newly added lowres_system_clock is good enough for gossip
resolution. Switch to use it.

Message-Id: <fe0e7a9ef1ea0caffaa8364afe5c78b6988613bf.1503971833.git.asias@scylladb.com>
2017-08-29 10:16:25 +03:00
Asias He
2701bfd1f8 gossip: Use unordered_map for _unreachable_endpoints and _shadow_unreachable_endpoints
The _unreachable_endpoints will be accessed in fast path soon by the
hinted hand off code.

Message-Id: <500d9cbb2117ab7b070fd1bd111c5590f46c3c3a.1503971826.git.asias@scylladb.com>
2017-08-29 10:15:55 +03:00
Tomasz Grabiec
05e0ca6546 tests: Introduce clustering_ranges_walker_test 2017-08-28 21:08:55 +02:00
Tomasz Grabiec
dcbc1282a9 tests: simple_schema: Add missing include 2017-08-28 21:00:06 +02:00
Tomasz Grabiec
48dabc8262 sstables: reader: Make clustering_ranges_walker work with empty range set
Such queries can be issued by counter updates which involve only
static row.

Causes failure in test_query_only_static_row invoked from
sstable_mutation_test. See commit 6572f38, which fixed the problem in
cache reader.

Fixes #2734.
2017-08-28 21:00:06 +02:00
Tomasz Grabiec
071badce3b clustering_ranges_walker: Make adjacency more accurate
Current check considered some adjacent range tombstones as overlapping
with the ranges. Making this more accurate will become more important
after we will rely on putting p_i_p::after_all_clustered_rows() in
_current_start in out-of-range state.
2017-08-28 21:00:06 +02:00
Jesse Haber-Kucharsky
abf4c1688d db/config.cc: Clarify documentation for typed_value_ex 2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky
7374f9d86f db/config.cc: Fix formatting and warnings 2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky
90666e5744 db/config.cc: Remove unnecessary mutable on lambdas 2017-08-28 10:08:29 -04:00
Jesse Haber-Kucharsky
449bd60480 db/config.cc: Remove unused variables 2017-08-28 10:08:29 -04:00
Botond Dénes
eec451bcf8 segmented_offsets: use _current_bucket_segment_index consistently
Previously _current_bucket_segment_index was used differently depending on
whether update_position_trackers() is used in a random or sequential
access. In the former case was used as the absolute index of the segment
(independent of the buckets) and in the latter as the relative index of
the segment within its bucket. This caused problems when there was a
switch between random and sequential access, meaning one could get different
results for an at() call depending on what was the previous at() call.
Fix this by consistently using _current_bucket_segment_index as - like its
name suggest - the bucket relative segment index.

Ref #1946.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7f68ac1d32c80e8dea6dfa11be02acaa961bce2a.1503924927.git.bdenes@scylladb.com>
2017-08-28 16:14:25 +03:00
Avi Kivity
fa8d0fe4d0 Revert "Revert "Revert "Revert "Merge "Compress in-memory compression-info" from Botond""""
This reverts commit 238877a0c6.  A fix was found and will be committed
shortly.
2017-08-28 16:14:13 +03:00
Tomer Sandler
83f249c15d node_health_check: added line seperation under cp output message
Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828124307.2564-1-tomer@scylladb.com>
2017-08-28 15:44:13 +03:00
Tomasz Grabiec
16c1b0fb6b Merge "Reduce dependencies on types.hh" from Avi
* 'deps1/v1' of https://github.com/avikivity/scylla:
  types.hh: extract marshal_exception from types.hh into a new file
  utils: remove dependency on types.hh
  locator: add missing include "log.hh"
  supervisor: remove dependency on init.hh
  tracing: add missing include "log.hh"
  gms: remove unneeded #include "types.hh"
2017-08-28 13:58:46 +02:00
Avi Kivity
4e67bc9573 Merge "Fixes for skipping in sstable reader" from Tomasz
* 'tgrabiec/fix-fast-forwarding' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Add more tests for fast forwarding across partitions
  sstables: Fix abort in mutation reader for certain skip pattern
  sstables: Fix reader returning partition past the query range in some cases
  sstables: Introduce data_consume_context::eof()
2017-08-28 12:48:02 +03:00
Tomasz Grabiec
3241018c79 tests: mutation_source_test: Add more tests for fast forwarding across partitions 2017-08-28 10:30:08 +02:00
Tomasz Grabiec
65e488c150 sstables: Fix abort in mutation reader for certain skip pattern
The problem happens for the following sequence of events:

 1) reader stops in the middle of some partition before it
    skips to another partition range

 2) reader is fast forwarded to a partition range which has no data in
    the sstable. There are some partitions between the previous
    partition range and the one we skip to

 3) the reader is asked for next partition

The problem was that mutation_reader::fast_forward_to() was putting
the reader in _read_enabled == false state in step 2, but
data_consume_context was not fast forwarded to the range. When in step
3 we were asked for the next partition, we attempted to skip using
index (because of 1). The result of the skip was some position which
is outside of the current range of data_consume_context, which causes
it to abort. To fix, add a check for _read_enabled before we try to
skip.
2017-08-28 10:28:15 +02:00
Tomasz Grabiec
dc3c8863f3 sstables: Fix reader returning partition past the query range in some cases
If index was used to skip to the next partition (because the current
partition wasn't consumed in full) and reader's partition range ends
before the data file ends, we did not detect that we're out of range
before returning a streamed_mutation. Fix by checking _context.eof()
before doing that.

Refs #2733.
2017-08-28 10:16:27 +02:00
Tzach Livyatan
12fb975282 Fix typos in metrics description
Fixes #2658

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170803121732.19640-1-tzach@scylladb.com>
2017-08-28 10:48:28 +03:00
Takuya ASADA
437931f499 dist/redhat: fix dependency package name typo
scylla-libstdc++-static53 -> scylla-libstdc++53-static

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1503306027-7316-1-git-send-email-syuu@scylladb.com>
2017-08-28 10:44:40 +03:00
Tomasz Grabiec
6baad2c2e6 sstables: Introduce data_consume_context::eof() 2017-08-28 09:19:43 +02:00
Avi Kivity
171fe67a64 gms: remove unneeded #include "types.hh" 2017-08-27 15:18:57 +03:00
Avi Kivity
a9f19e37b5 tracing: add missing include "log.hh"
It's currently made available via another include, which is going away.
2017-08-27 15:18:41 +03:00
Avi Kivity
471ae5b22b supervisor: remove dependency on init.hh
Replace with a simpler dependency on log.hh
2017-08-27 15:17:55 +03:00
Avi Kivity
27d3ab20a9 locator: add missing include "log.hh"
It's currently made available via another include, which is going away.
2017-08-27 15:17:05 +03:00
Avi Kivity
7234f0f0a0 utils: remove dependency on types.hh
Replace with dependency on much smaller marshal_exception.hh.
2017-08-27 15:16:21 +03:00
Avi Kivity
93317e2f4a types.hh: extract marshal_exception from types.hh into a new file
For better or worse, marshal_exception is used from utils/, and it's not good
to have utils/ depend on types.hh. Extract marshal_exception to make it possible
to remove the dependency.
2017-08-27 15:14:55 +03:00
Avi Kivity
238877a0c6 Revert "Revert "Revert "Merge "Compress in-memory compression-info" from Botond"""
This reverts commit 9d27455744. It's still broken.

To reproduce:

  ./tools/bin/cassandra-stress write -schema compression=LZ4Compressor

(on a clean database)

.0  0x00007ffff32aa69b in raise () from /lib64/libc.so.6
.1  0x00007ffff32ac4a0 in abort () from /lib64/libc.so.6
.2  0x000000000054a0e8 in seastar::memory::abort_on_underflow (size=<optimized out>) at core/memory.cc:1189
.3  seastar::memory::allocate_large (size=<optimized out>) at core/memory.cc:1194
.4  0x000000000054b305 in seastar::memory::allocate (size=size@entry=18446744073702885265) at core/memory.cc:1227
.5  0x000000000054b45e in malloc (n=n@entry=18446744073702885265) at core/memory.cc:1452
.6  0x00000000006013e4 in seastar::temporary_buffer<char>::temporary_buffer (this=0x6010195fc800, size=18446744073702885265) at /home/avi/urchin/seastar/core/temporary_buffer.hh:72
.7  0x0000000000a3908b in seastar::input_stream<char>::read_exactly (this=0x6010053d0248, n=18446744073702885265) at /home/avi/urchin/seastar/core/iostream-impl.hh:189
.8  0x0000000000a9c77f in compressed_file_data_source_impl::get (this=0x6010053d0240) at sstables/compress.cc:499
.9  0x0000000000aa1b01 in seastar::data_source::get (this=<optimized out>) at /home/avi/urchin/seastar/core/iostream.hh:63
.10 seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}::operator()() const (__closure=__closure@entry=0x6010195fcab0) at /home/avi/urchin/seastar/core/iostream-impl.hh:204
.11 0x0000000000aa22f0 in seastar::futurize<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >::apply<seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}&>(sstables::data_consume_rows_context&&) (func=...) at /home/avi/urchin/seastar/core/future.hh:1312
.12 seastar::repeat<seastar::future<> seastar::input_stream<char>::consume<sstables::data_consume_rows_context>(sstables::data_consume_rows_context&)::{lambda()#1}>(sstables::data_consume_rows_context&&) (action=...) at /home/avi/urchin/seastar/core/future-util.hh:203
.13 0x0000000000a9e730 in seastar::input_stream<char>::consume<sstables::data_consume_rows_context> (consumer=..., this=<optimized out>) at /home/avi/urchin/seastar/core/iostream-impl.hh:237
.14 data_consumer::continuous_data_consumer<sstables::data_consume_rows_context>::consume_input<sstables::data_consume_rows_context> (c=..., this=<optimized out>) at sstables/consumer.hh:226
.15 sstables::data_consume_context::impl::read (this=<optimized out>) at sstables/row.cc:411
.16 sstables::data_consume_context::read (this=<optimized out>) at sstables/row.cc:437
.17 0x0000000000aafbae in sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const (__closure=<optimized out>) at sstables/partition.cc:843
.18 seastar::apply_helper<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}, std::tuple<>&&, std::integer_sequence<unsigned long> >::apply({lambda()#2}&&, std::tuple) (args=..., func=...) at ./seastar/core/apply.hh:36
.19 seastar::apply<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&, std::tuple<>&&) (args=..., func=...)
    at ./seastar/core/apply.hh:44
.20 seastar::futurize<seastar::future<> >::apply<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&, std::tuple<>&&) (args=...,
    func=...) at ./seastar/core/future.hh:1302
.21 seastar::future<>::then<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}, seastar::future<> >(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const::{lambda()#1}&&) (
    this=this@entry=0x6010195fcbb0, func=...) at ./seastar/core/future.hh:890
.22 0x0000000000ac273f in sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}::operator()() const (__closure=0x6010195fcc28) at sstables/partition.cc:843
.23 seastar::do_until_continued<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}&&, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}&&, seastar::promise<>) (stop_cond=..., action=..., p=...) at /home/avi/urchin/seastar/core/future-util.hh:155
.24 0x0000000000ac29c3 in seastar::do_until<sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}>(sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#1}&&, sstables::sstable_streamed_mutation::fill_buffer()::{lambda()#2}&&) (action=..., stop_cond=..., this=<optimized out>) at /home/avi/urchin/seastar/core/future-util.hh:330
.25 sstables::sstable_streamed_mutation::fill_buffer (this=<optimized out>) at sstables/partition.cc:844
.26 0x0000000000ad3d2b in streamed_mutation::fill_buffer (this=0x6010195fcd10) at ./streamed_mutation.hh:489
.27 consume_flattened_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer> >, std::function<bool (streamed_mutation const&)> >(mutation_reader&, stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer> >&, std::function<bool (streamed_mutation const&)>&&) (

(gdb) p addr
$1 = {
  chunk_start = 13330037,
  chunk_len = 18446744073702885265,
  offset = 0
}
2017-08-27 13:32:37 +03:00
Avi Kivity
576e33149f Merge seastar upstream
* seastar 0083ee8...85ca12d (1):
  > Merge "Run-time logging configuration" from Jesse

Includes patch from Jesse:

"Switch to Seastar for logging option handling

In addition to updating the abstraction layer for Seastar logging in `log.hh`,
the configuration system (`db/config.{hh,cc}`) has been updated in two ways:

- The string-map type for Boost.program_options is now defined in Seastar.

- A configuration value can be marked as `UsedFromSeastar`. This is like `Used`,
  except the option is expected to be defined in the Boost.Program_options
  description for Seastar. If the option is not defined in Seastar, or it is
  defined with a different type, then a run-time exception is thrown early in
  Scylla's initialization. This is necessary because logging options which are
  now defined in Seastar were previously defined in Scylla and support for these
  options in the YAML file cannot be dropped. In order to be able to verify that
  options marked `UsedFromSeastar` are actually defined in Seastar, the
  interface for adding options to `db::config` has changed from taking a
  `boost::program_options::options_description_easy_init` (which is handle into
  a `boost::program_options::options_description` which only allows adding
  options) to taking a `boost::program_options::options_description`
  directly (which also allows querying existing options).

Scylla also fully defers to Seastar's support for run-time logging
configuration."

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <ef26cffb91bef1ae95d508187a6dd861a6c4fc84.1503344007.git.jhaberku@scylladb.com>
2017-08-27 13:11:33 +03:00
Avi Kivity
4f5b5bc8e6 Merge seastar upstream
* seastar b9f1eb7...0083ee8 (1):
  > http: Add MIME type support for JSON
2017-08-27 13:09:04 +03:00
Jesse Haber-Kucharsky
af95d3baa7 db/config.cc: Remove unused function
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5a4e4e153c2d87e838d1cf6def7a494a92a72f63.1503344007.git.jhaberku@scylladb.com>
2017-08-27 13:08:19 +03:00
Vlad Zolotarov
9b9f19606f scylla_cpuset_setup: add the description near the perftune.yaml removing
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1503600250-25169-1-git-send-email-vladz@scylladb.com>
2017-08-27 12:51:12 +03:00
Asias He
68346f7e53 repair: Use with_semaphore for sp_parallelism_semaphore
Instead of calling semaphore.signal() manually.

Message-Id: <51b7ecdebac91763a2340fe00959742810614845.1503648936.git.asias@scylladb.com>
2017-08-27 12:50:38 +03:00
Avi Kivity
2b3ee4b0a7 Merge "make cf drop more robust" from Glauber
"We have recently found two problems with the drop_column_family code
that needs addressing. The first is that exceptions in truncate() may
lead to stop() being skipped, which can cause Scylla to crash.

The other is that a truncate() issued before drop_column_family may get
the chance to execute only after the column family is already dropped
and also crash (That is issue 2726).

The second problem is the classic problem of asynchronous execution on
an object that may terminate, which we have been traditionally solving
with a gate. We add a gate to the column family that will be closed
during CF stop(), and we will require all asychronous operations to
enter it.

The immediate fix is for truncate(), where we have seen a real, concrete
problem. But it would be good to audit other code paths to make sure
that they are sane.

The most obvious ones, flush, compaction and sstable deletion are
already sane, since they are waited on explicitly during stop()."

Fixes #2726.

* 'issue-2726-v2-master' of github.com:glommer/scylla:
  database: add gate for generic async operations to column family
  database: make sure that column family is always stopped when dropped
2017-08-27 12:42:20 +03:00
Avi Kivity
1f66940134 sstables: switch std::deque to chunked_vector
Reduce susceptibility to memory fragmentation.
2017-08-26 16:44:47 +03:00
Avi Kivity
204659ef40 tests: add test for chunked_vector 2017-08-26 16:44:47 +03:00
Avi Kivity
3ba2c0652d utils: add a new container type chunked_vector
We currently use std::deque<> for when we need large random-access containers,
but deque<> requires nr_items * sizeof(T) / 64 bytes of contiguous memory, which can
exceed our 256k fragmentation unit with large sstables.  The new
container, which is a cross between deque and vector, has much lower
limitations.

Like deque, we allocate chunks of contiguous items, but they are
128k in size instead of 512. The last chunk can be smaller to avoid
allocating 128k for a really small vector.
2017-08-26 16:44:45 +03:00
Tomasz Grabiec
2ca99be27d ring_position_view: Print token instead of token pointer
Broken in e989d65539.
Message-Id: <1503667158-7544-1-git-send-email-tgrabiec@scylladb.com>
2017-08-25 14:25:21 +01:00
Glauber Costa
83323e155e database: add gate for generic async operations to column family
run_with_compaction_disabled(), which is called by truncate, has a
pretty large defer point in remove(). When the code gets to finally
execute, we can't guarantee that the column family will still be alive.

That is true in particular if we issued a drop table command following
truncate: by the time truncate gets to resume, the CF will be gone.
Before the column family is dropped, it will always call its stop()
method, which means we have an opportunity to do some waiting there. We
already wait for flushes and current compactions to end.

Traditionally, we have been solving similar problems by adding a gate
that will catch asynchronous operations and making sure that potentially
asynchronous operations will enter the gate before executing. Let's do
the same thing here. We will close() the gate during stop().

Fixes #2726

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-08-24 13:12:57 -04:00
Glauber Costa
d090e7be35 database: make sure that column family is always stopped when dropped
truncate can throw exceptions. If it does, cf->stop() will never be
called because it is contained in a .then clause instead of finally.

One of the things that truncate does - in a finally block of its own -
is initiate a final compaction. If it returns an exception nobody will
wait for that compaction to finish (since cf->stop() is the one doing
that) and we'll crash.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-08-24 13:01:47 -04:00
Avi Kivity
40aeb00151 Merge "consider the pre-existing cpuset.conf when configuring networking mode" from Vlad
"Preserve the networking configuration mode during the upgrade by generating the /etc/scylla.d/perftune.yaml
file and using it."

Fixes #2725.

* 'dist_respect_cpuset_conf-v3' of https://github.com/vladzcloudius/scylla:
  scylla_prepare: respect the cpuset.conf when configuring the networking
  scylla_cpuset_setup: rm perftune.yaml
  scylla_cpuset_setup: add a missing "include" of scylla_lib.sh
2017-08-24 18:53:22 +03:00
Vlad Zolotarov
c72eb34b89 scylla_prepare: respect the cpuset.conf when configuring the networking
Choose the networking configuration mode according to the current contents of /etc/scylla.d/cpuset.conf.

If it doesn't exist - use the default mode.
If it exists - use the mode that has been used for generation of the CPU set.

Store the configuration into the /etc/scylla.d/perftune.yaml

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-08-24 09:09:40 -04:00
Vlad Zolotarov
89285a13ac scylla_cpuset_setup: rm perftune.yaml
scylla_setup resets our configuration and perftune.yaml is a part of it.
perftune.yaml is generated based on the contents of cpuset.conf therefore we should reset
these together.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-08-24 09:09:40 -04:00
Vlad Zolotarov
d0ccfe34b9 scylla_cpuset_setup: add a missing "include" of scylla_lib.sh
The scylla_cpuset_setup uses a verify_args() function that is defined in the scylla_lib.sh.

Fixes #2716

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-08-24 09:09:40 -04:00
Paweł Dziepak
1006a946e8 mvcc: allow invoking maybe_merge_versions() inside allocating section
Message-Id: <20170823083544.4225-1-pdziepak@scylladb.com>
2017-08-24 14:30:38 +02:00
Pekka Enberg
870de26e35 index: Add index class
Add a simple index class, which represents an instantiated index.
2017-08-24 14:00:02 +03:00
Pekka Enberg
d63a650b3f index: Pass column_family to secondary_index_manager constructor
We need column family for various secondary index manager operations.
2017-08-24 14:00:02 +03:00
Pekka Enberg
981e320d54 database: Make secondary index manager per-column family
Make the secondary index manager per-column family like in Apache
Cassandra to keep CQL front-end similar between the two codebases.
2017-08-24 14:00:02 +03:00
Botond Dénes
839d1db4d3 parse(compression): add missing reinterpret_cast<char*>
std::copy_n was using value as uint64_t*, smashing the stack.
Also remove unused variable.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4e2d71fc74326965dfd98bed2347100fb6ebe43b.1503568210.git.bdenes@scylladb.com>
2017-08-24 13:38:03 +03:00
Avi Kivity
9d27455744 Revert "Revert "Merge "Compress in-memory compression-info" from Botond""
This reverts commit 9656fd79a0. A fix is now
available.
2017-08-24 13:37:35 +03:00
Tomasz Grabiec
9656fd79a0 Revert "Merge "Compress in-memory compression-info" from Botond"
This reverts commit ef85cf1cb3, reversing
changes made to de011ece52.

Vlad reports that this causes SIGSEGV on cluster restarts.

seastar::backtrace_buffer::append_backtrace() at /home/vladz/work/urchin/seastar/core/reactor.cc:274
 (inlined by) print_with_backtrace at /home/vladz/work/urchin/seastar/core/reactor.cc:289
seastar::print_with_backtrace(char const*) at /home/vladz/work/urchin/seastar/core/reactor.cc:296
sigsegv_action at /home/vladz/work/urchin/seastar/core/reactor.cc:3512
 (inlined by) operator() at /home/vladz/work/urchin/seastar/core/reactor.cc:3498
 (inlined by) _FUN at /home/vladz/work/urchin/seastar/core/reactor.cc:3494
?? ??:0
operator()<seastar::temporary_buffer<char> > at /home/vladz/work/urchin/sstables/sstables.cc:870
 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/apply.hh:44
 (inlined by) do_void_futurize_apply_tuple<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/future.hh:1270
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)>, seastar::temporary_buffer<char> > at /home/vladz/work/urchin/seastar/core/future.hh:1290
 (inlined by) then<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>::<lambda(auto:104)> > at /home/vladz/work/urchin/seastar/core/future.hh:890
 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:873
 (inlined by) do_until_continued<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>, sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>&> at /home/vladz/work/urchin/seastar/core/future-util.hh:155
do_until<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>, sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()>::<lambda()>&> at /home/vladz/work/urchin/seastar/core/future-util.hh:330
 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:874
 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/apply.hh:44
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:1302
then<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()>::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:890
 (inlined by) operator() at /home/vladz/work/urchin/sstables/sstables.cc:875
 (inlined by) apply at /home/vladz/work/urchin/seastar/core/apply.hh:36
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()> > at /home/vladz/work/urchin/seastar/core/apply.hh:44
 (inlined by) apply<sstables::parse(sstables::random_access_reader&, sstables::compression&)::<lambda()> > at /home/vladz/work/urchin/seastar/core/future.hh:1302
operator()<seastar::future_state<> > at /home/vladz/work/urchin/seastar/core/future.hh:900
 (inlined by) run at /home/vladz/work/urchin/seastar/core/future.hh:395
seastar::reactor::run_tasks(seastar::circular_buffer<std::unique_ptr<seastar::task, std::default_delete<seastar::task> >, std::allocator<std::unique_ptr<seastar::task, std::default_delete<seastar::task> > > >&) at /home/vladz/work/urchin/seastar/core/reactor.cc:2317
seastar::reactor::run() at /home/vladz/work/urchin/seastar/core/reactor.cc:2775
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/vladz/work/urchin/seastar/core/app-template.cc:142
2017-08-24 11:44:14 +02:00
Alexys Jacob
a133290694 scylla_io_setup: migrate away from deprecated string.atoi
Python 2.0 deprecated string.atoi and we should move away from it
as stated here: https://docs.python.org/2/library/string.html#string.atoi

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817134002.28124-1-ultrabug@gentoo.org>
2017-08-24 12:36:34 +03:00
Avi Kivity
dcac7125fe Merge seastar upstream
* seastar e96881a...b9f1eb7 (9):
  > httpd: indentation patch
  > httpd: handle exception when shutting down
  > stall-detector: Allow backtrace throttling to be configured
  > stall-detector: Fix messages about suppresssion not appearing
  > scripts: posix_net_conf.sh: allow passing a perftune.py configuration file as a parameter
  > scripts: perftune.py: add the possibility to pass the parameters in a configuration file and print the YAML file with the current configuration
  > scripts: perftune.py: actually use the number of Rx queues when comparing to the number of CPU threads
  > core: make current_backtrace() noexcept
  > memory: add large allocation detector stubs for default allocator
2017-08-24 11:35:28 +03:00
Piotr Jastrzebski
477068d2c3 Make streamed_mutation more exception safe
Make sure that push_mutation_fragment leaves
_buffer_size with a correct value if exception
is thrown from emplace_back.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <83398412aa78332d88d91336b79140aecc988602.1503474403.git.piotr@scylladb.com>
2017-08-23 09:37:04 +01:00
Avi Kivity
2f41ed8493 Merge "repair: Do not allow repair until node is in NORMAL status" from Asias
Fixes #2723.

* tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev:
  repair: Do not allow repair until node is in NORMAL status
  gossip: Add is_normal helper
2017-08-23 09:44:45 +03:00
Asias He
69c81bcc87 repair: Do not allow repair until node is in NORMAL status
The following backtrace was reported by user when running repair and keeping restarting the node at the same time.

 #0  0x00007eff077281d7 in raise () from /lib64/libc.so.6
 #1  0x00007eff07729a08 in abort () from /lib64/libc.so.6
 #2  0x00007eff07721146 in __assert_fail_base () from /lib64/libc.so.6
 #3  0x00007eff077211f2 in __assert_fail () from /lib64/libc.so.6
 #4  0x00000000010ef2c2 in locator::token_metadata::first_token_index (this=0x641000214e98, start=...) at locator/token_metadata.cc:133
 #5  0x00000000010ef2d9 in locator::token_metadata::first_token (this=0x641000214e98, start=...) at locator/token_metadata.cc:143
 #6  0x00000000010e329d in locator::abstract_replication_strategy::get_natural_endpoints (this=0x641000494000, search_token=...)
     at locator/abstract_replication_strategy.cc:66
 #7  0x0000000001481186 in get_neighbors (hosts=std::vector of length 0, capacity 0, data_centers=std::vector of length 0, capacity 0,
     range=<error reading variable: access outside bounds of object referenced via synthetic pointer>, ksname=..., db=...) at repair/repair.cc:196
 #8  repair_range<nonwrapping_range<dht::token> > (range=..., ri=...) at repair/repair.cc:781
 #9  <lambda(auto:99&)>::<lambda(auto:100&&)>::<lambda(auto:101&)>::<lambda()>::operator() (__closure=0x7efec07f7460) at repair/repair.cc:1005
 #10 futurize<future<bool_class<stop_iteration_tag> > >::apply<repair_ranges(repair_info)::<lambda(auto:99&)>::

It is reproduced with

1) while true; do curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/ks3"; done

2) start node 127.0.0.1, stop node 127.0.0.1 in a loop

The problem is, during boot up, the token_metadata is not replicated to all shards until
the node goes into NORMAL status.

To fix, check until node is in NORMAL status before allowing repair.

Fixes #2723
2017-08-23 14:40:04 +08:00
Asias He
65912dd1ac gossip: Add is_normal helper
It will be used by repair to check if a node is in NORMAL status.
2017-08-23 14:40:04 +08:00
Amnon Heiman
abbd78367c Add configuration to disable per keyspace and column family metrics
The number of keysapce and column family metrics reported is
proportional to the number of shards times the number of keysapce/column
families.

This can cause a performance issue both on the reporting system and on
the collecting system.

This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.

Fixes #2701

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
2017-08-22 19:19:54 +03:00
Botond Dénes
4f42acc956 abstract_marker::raw::prepare: add missing return statement
The function doesn't return a value in the all-false branch.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3c1976682ffc190d741c066d942b83be4463cae8.1503402721.git.bdenes@scylladb.com>
2017-08-22 15:06:18 +03:00
Paweł Dziepak
9d82a1ebfd abstract_read_executor: make make_requests() exception safe
Message-Id: <20170821162934.25386-5-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Paweł Dziepak
31afc2f242 shared_index_lists: restore indentation
Message-Id: <20170821162934.25386-4-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Paweł Dziepak
93eaa95378 sstables: make shared_index_lists::get_or_load exception safe
Message-Id: <20170821162934.25386-3-pdziepak@scylladb.com>
2017-08-22 12:09:42 +02:00
Avi Kivity
ef85cf1cb3 Merge "Compress in-memory compression-info" from Botond
"Overly large metadata can hog memory which especially hurts in setups
with bad disk/memory ratio. To ease the pain compress the in-memory
compression-info.

The compression is implemented based on Avi's idea which is to group n
offsets together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size).  Also offsets are allocated
only just enough bits to store their maximum value. The offsets are thus
packed in a buffer like so:
    arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.
This of course means that stored offsets will not be aligned, not even
on a byte boundary, but the size reduction pretty convincing.
In addition, segments are stored in buckets, where each bucket has its
own base offset. In addition, segments in a buckets are optimized to
address as large of a chunk of the data as possible for a given chunk
size."

Ref #1946.

* 'bdenes/compress-compression-v3' of https://github.com/denesb/scylla:
  Add unit test for compress::offsets
  Optimise the storage of compression chunk offsets
  Add script to precompute segmented compression parameters
2017-08-22 10:30:58 +03:00
Botond Dénes
62c18da35c Add unit test for compress::offsets 2017-08-21 17:06:20 +03:00
Botond Dénes
028c7a0888 Optimise the storage of compression chunk offsets
To reduce the memory footprint of compression-info, n offsets are
grouped together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size). Also offsets are
allocated only just enough bits to store their maximum value. The
offsets are thus packed in a buffer like so:
     arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.

The optimal value of n can be calculated for a given file_size (f) and
chunk_size (c), by finding the minima of the following function:

f(n) = (f/c)/n * (log2(f) + (n - 1)*log2((n-1)*(c + 64)))

This is done in an empirical way, using a script (see below).

Furthermore segments are stored in buckets, where each bucket has its
own base offset. Each bucket therefore can address an equal chunk of the
file and furthermore each segment in a bucket can address an equal
sub-chunk of this area.
The value of a given offset i is thus:
    bucket_base_offset_for(i) + segment_base_offset_for(i) + offset(i)

To account for the bucketed storage we calculate a local_f, which is
optimized so that a bucketful of segmented offsets can address the
largest possible chunk of f. As value of this local_f only depends on
the bucket_size (b) and c the value of n can be made independent of f
and therefore only depend on one dynamic value, c. This makes life much
simpler as we don't need to know the size of the file up-front, we can
just append buckets to the storage on demand, while the required storage
is still less than a third [1] of the original storage requirements
(std::deque<uint64>).

The table with the minima(f(n)) for different f and c values is
pre-computed by gen_segmented_compress_params.py and
stored in sstables/segmented_compress_params.hh. This script also
creates a table with the best values of local_f for the given
bucket_size. At runtime we only select the best params based on c.

[1] This was calculated for c=4K and b=4K
2017-08-21 17:06:12 +03:00
Avi Kivity
de011ece52 main: deprecate non-murmur3 partitioners more forcefully
Some (most?) users don't read logs or release notes, so they won't notice
that the ByteOrdered and Random partitioners were deprecated in 2.0. Make
them notice by refusing to start with a deprecated partitioner, unless a
switch is explicitly enabled.
Message-Id: <20170820073424.8331-1-avi@scylladb.com>
2017-08-21 14:32:22 +02:00
Avi Kivity
9f415ef870 sstables: accurate summary entry size calculation
Calculate the summary entry size correctly, so we don't end up with oversize
summaries.
Message-Id: <20170819184255.14181-2-avi@scylladb.com>
2017-08-21 14:28:57 +02:00
Avi Kivity
17c372bf0e sstables: get rid of 64kB minimum index advance to generate summary
Limiting summary entry generation to at most one summary entry
per 64k of index data can lead to large index pages, with thousands
of index entries per summary entry. These are slow to parse, and there
is no real gain from the limit, since we already enforce a size
limit on the summary.

Remove the limit and allow summary entry generation based solely on
spanned data size.

Fixes #2711.
Message-Id: <20170819184255.14181-1-avi@scylladb.com>
2017-08-21 14:26:44 +02:00
Avi Kivity
81a33df25d dht: reduce split_range_to_single_shard contiguous memory demand
split_range_to_single_shard() returns a vector of size 4096, with
each element (a partition_range) of size 100. The total of 400k can
cause defragmentation if memory is fragmented.

Fix by using a deque.

Fixes #2707.
Message-Id: <20170819141017.28287-1-avi@scylladb.com>
2017-08-21 14:25:45 +02:00
Piotr Jastrzebski
c602ffd610 Make Scylla ttl expiration behave like in Cassandra
Fixes #2497

[tgrabiec: reworked the title]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2f5a99dce6ef11fe0ef135c9fa0592078fc9a056.1502886874.git.piotr@scylladb.com>
2017-08-21 14:25:45 +02:00
Botond Dénes
eae33a1f19 Add script to precompute segmented compression parameters
The script generates sstables/segmented_compress_params.hh which
contains a list with the optimal number of grouped offsets for
different data and chunk sizes as well as a list with the best
nominal data sizes for different chunk sizes, given a bucket size.
Data sizes are in the range of [2**4,2**50] and chunks in the
range of [2**4, 2**30]. Data sizes that are not used with the current
bucket_size are ommited.
See next commit for details of how the calculated values are used.
2017-08-21 10:44:08 +03:00
Avi Kivity
5a2439e702 main: check for large allocations
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.

Message-Id: <20170819135212.25230-1-avi@scylladb.com>
2017-08-21 10:25:40 +03:00
Pekka Enberg
318423d50b Merge seastar upstream
* seastar 2d16aca...e96881a (4):
  > memory: add detector for large allocations
  > memory: reduce large allocations for small pools
  > net: Fix potential NULL pointer dereference in udp.cc
  > Update dpdk submodule
2017-08-21 10:24:08 +03:00
Tomasz Grabiec
8f2ca52740 tests: Run test_query_only_static_row test case on all mutation sources
The test checks behavior common to all mutation readers, so it's
better to run it against all mutation sources rather than only for
cache reader.

Message-Id: <1503072333-17995-1-git-send-email-tgrabiec@scylladb.com>
2017-08-20 12:23:28 +03:00
Raphael S. Carvalho
10eaa2339e compaction: Make resharding go through compaction manager
Two reasons for this change:
1) every compaction should be multiplexed to manager which in turn
will make decision when to schedule. improvements on it will
immediately benefit every existing compaction type.
2) active tasks metric will now track ongoing reshard jobs.

Fixes #2671.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
2017-08-20 11:35:14 +03:00
Takuya ASADA
38b2ff617f dist/redhat: follow the change on libgcc/libstdc++ package name
Since we moved to external 3rdparty repository, we added '53' suffix on gcc
packages, so follow the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170819092039.1090-2-syuu@scylladb.com>
2017-08-19 16:01:28 +03:00
Takuya ASADA
f1b5401d1f dist/redhat: Change g++ command name on CentOS
We have added '-5.3' suffix on g++ command from scylla-gcc53-c++-5.3.1-2.2,
follow the change on scylla build script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170819092039.1090-1-syuu@scylladb.com>
2017-08-19 16:01:27 +03:00
Avi Kivity
e428805ba5 Merge "Optimize query result partition and row counts" from Duarte
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned.

This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."

* 'calculate-counts/v3' of github.com:duarten/scylla:
  query-result: Send row and partition count over the wire
  query::result: Optimize calculate_counts()
2017-08-17 13:41:21 +03:00
Alexys Jacob
e5ff8efea3 dist: Fix Gentoo Linux scylla-jmx and scylla-tools packages detection
These two admin related packages will be packaged under the "app-admin"
category and not the "dev-db" one.

This fixes the detection path of the packages for scylla_setup.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817094756.21550-1-ultrabug@gentoo.org>
2017-08-17 13:20:43 +03:00
Nadav Har'El
7832d8a883 get rid of unused part in configure.py
Scylla's configure.py contains stuff we copied from Seastar's
configure.py, but is no longer used. Let's get rid of some of it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170813150842.12603-1-nyh@scylladb.com>
2017-08-17 12:05:44 +03:00
Duarte Nunes
1e7f0eab82 memtable: Created readers should be fast forwardable by default
mutation_reader::forwarding defaults to yes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170816180304.2121-1-duarte@scylladb.com>
2017-08-17 10:21:01 +03:00
Botond Dénes
e70cfc8f36 incremental_reader_selector: account for possibly disengaged lower bound
In addition to the constructor (fixed previously) the check for no
sstables on the first call to select() also has to be prepared for the
lower bound of the range being disengaged.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4ab1296c71814fcd492996fa36fd00fd7bbbbc7f.1502949875.git.bdenes@scylladb.com>
2017-08-17 10:07:26 +03:00
Botond Dénes
af83b7f57b incremental_reader_selector: use lazy_deref instead of tertiary operator
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f4b884c6a1f517bd654f3b27608d854b17a66e1.1502948635.git.bdenes@scylladb.com>
2017-08-17 08:45:46 +03:00
Botond Dénes
eb7eee510d combined_mutation_reader_test: use the global const objects directly
Instead of local ones.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3ec1a70e4c0198c0563dff9688bbaa7fcfcace71.1502891190.git.bdenes@scylladb.com>
2017-08-16 16:56:42 +03:00
Paweł Dziepak
784dcbf1ca sstables: initialise index metrics on all shards
Fixes #2702.

Message-Id: <20170816085454.21554-1-pdziepak@scylladb.com>
2017-08-16 15:44:26 +03:00
Avi Kivity
d7e3fbc6fe Merge seastar upstream
* seastar 2a43102...2d16aca (1):
  > fstream: do not ignore unresolved future

Fixes #2697.
2017-08-16 15:09:59 +03:00
Botond Dénes
611774b1d9 Use the incremental reader for compaction
As leveled compaction strategy stands to gain the most from
incrementally opening sstables.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <292648d3fa4ea97376c0b4360754a20132194f63.1502822066.git.bdenes@scylladb.com>
2017-08-15 21:38:04 +03:00
Takuya ASADA
0f9b095867 dist/common/scripts: prevent ignoreing flag that passed after another flag which requires parameter
When user mistakenly forgot to pass parameter for a flag, our scripts misparses
next flag as the parameter.
ex) Correct usage is '--ntp-domain <domain> --setup-nic', but passed
    '--ntp-domain --setup-nic'.
Result of that, next flag will ignore by scripts.
To prevent such behavior, reject any parameter that start with '--'.

Fixes #2609

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170815114751.6223-1-syuu@scylladb.com>
2017-08-15 18:27:32 +03:00
Duarte Nunes
c7aa3ea069 mutation_partition: Remove obsolete short read detection
When compacting a partition for querying we would read an extra row,
to include any tombstones between that one and the previous row.

This is no longer needed since we have a general mechanism to detect
short reads in the storage_proxy.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811103031.22866-1-duarte@scylladb.com>
2017-08-15 12:01:55 +01:00
Avi Kivity
8df6dd1fa0 database: make incremental_reader_selector robust vs. full-range partition_range
incremental_reader_selector assumes the partition_range it receives has a lower
bound, but it was seen in mutation_test that this is not so.

Fix by checking whether the bound exists or not.
Message-Id: <20170815095852.14149-1-avi@scylladb.com>
2017-08-15 11:03:22 +01:00
Avi Kivity
a35bfb3ea9 Merge seastar upstream
* seastar 47b31f6...2a43102 (1):
  > Merge "Fix crash in rpc due to access to already destroyed server socket" from Gleb

Fixes #2690
2017-08-14 16:23:02 +03:00
Avi Kivity
e892a0082a Merge "Drop exhausted mutation_readers when possible" from Duarte
"Exhausted readers belonging to a combined_mutation_reader can be fast
forwarded, so we have to keep them around. However, if the reader is
not fast forwardable, then we can drop the contained readers and their
buffers."

* 'ff-reader/v2' of github.com:duarten/scylla:
  combined_mutation_reader: Drop exhausted readers if not in FF mode
  combined_mutation_reader: Remove superfluous mutation_readers list
  memtable_snapshot_source: Created readers should be fast forwardable
2017-08-14 16:20:38 +03:00
Duarte Nunes
7fb6a74302 combined_mutation_reader: Drop exhausted readers if not in FF mode
Exhausted readers can be fast forwarded, so we have to keep them
around. However, if the current reader is not fast forwardable, then
we can drop those readers and their buffers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Duarte Nunes
0b53f88a42 combined_mutation_reader: Remove superfluous mutation_readers list
The _all_readers variable can do the same job.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Duarte Nunes
77477605c1 memtable_snapshot_source: Created readers should be fast forwardable
As they're used by the cache tests.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Avi Kivity
afff29bdb9 Merge seastar upstream
* seastar edb73ab...47b31f6 (1):
  > tls: Only recurse once in shutdown code

Fixes #2691.
2017-08-14 15:09:42 +03:00
Duarte Nunes
a17cef76b2 query-result-writer: Remove unneeded field
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811102940.22747-1-duarte@scylladb.com>
2017-08-14 12:33:33 +01:00
Duarte Nunes
ec75eac37d ring_position_exponential_vector_sharder: Take ranges by rvalue
Avoids some copies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170814093310.29200-1-duarte@scylladb.com>
2017-08-14 12:55:43 +03:00
Duarte Nunes
3b9a9b7321 query-result: Send row and partition count over the wire
To avoid calculating them on the coordinator side.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 10:29:06 +02:00
Duarte Nunes
d7bab684ea query::result: Optimize calculate_counts()
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned. This patch makes it a bit faster.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 10:28:29 +02:00
Avi Kivity
cb2c5016ea Merge seastar upstream
* seastar 7a49ae5...edb73ab (11):
  > scripts: perftune.py: change the network module mode auto selection heuristic
  > net/tls: explicitly ignore ready future during shutdown
  > Use python2 explicitly as an interpreter for Python v2 scripts
  > peering_sharded_service: prevent over-run the container
  > Add link to documentation to the README.md
  > Add guidelines for contributing to Seastar
  > sharded: fix move constructor for peering_sharded_service services
  > Provide a convenient way to lazy-convert to string the values of pointers
  > tutorial: overhaul semaphores section
  > simple-stream: Make fragmented::write_substream return simple if possible
  > simple-stream: Make simple/fragmented memory output stream top level
2017-08-14 10:29:27 +03:00
Raphael S. Carvalho
050a7019b8 sstables/index_reader: fix index reader for summary entry spanning lots of keys
quantity prevents index_reader from reading all index entries of a summary
entry that span more than min_index_interval entries. That can happen after
introduction of size-based sampling, and consequently, sstable will not be
able to return a key which logical position in summary entry is beyond
min_index_interval. It's ok to not use quantity because index_reader will
read all indexes until either next summary entry or end of file is reached.

Fixes test_sstable_conforms_to_mutation_source

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>
2017-08-12 09:44:16 +03:00
Duarte Nunes
08e284a07e combined_mutation_reader: Don't drop mutation readers
This patch fixes a regression introduced in a6b9186ca.

We should keep the readers around in case a subsequent call to
fast_forward() will require them.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811160444.12795-1-duarte@scylladb.com>
2017-08-11 19:17:29 +03:00
Duarte Nunes
44b6da2e90 test.py: Add combined_mutation_reader_test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811155017.9899-1-duarte@scylladb.com>
2017-08-11 18:54:11 +03:00
Avi Kivity
dbf8625ac9 Merge "size-based sampling for sstable summary" from Raphael
"Fixes #1842."

* 'size_based_sampling_v3' of github.com:raphaelsc/scylla:
  tests: test summary entry spanning more keys than min interval
  db/config: introduce sstable_summary_ratio option
  sstables: introduce size-based sampling for sstable summary
  sstables: make components_writer::offset const qualified and uint64_t
  sstables: make writer::offset const qualified and uint64_t
2017-08-11 18:41:45 +03:00
Duarte Nunes
e7d56884c0 list_reader_selector: Prevent infinite loop
In case the readers are empty.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811153142.8926-1-duarte@scylladb.com>
2017-08-11 18:34:55 +03:00
Vladimir Krivopalov
003e8cf250 Use python2 explicitly as an interpreter for Python v2 scripts
Signed-off-by: Vladimir Krivopalov <vladimir.krivopalov@gmail.com>
Message-Id: <20170811032712.4362-1-vladimir.krivopalov@gmail.com>
2017-08-11 18:08:11 +03:00
Duarte Nunes
20337053ad Don't use literal lambdas
These are only available in C++17. Fixes the build after b5460c2.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-11 13:08:42 +02:00
Duarte Nunes
b5460c2990 Merge "Support duration type" from Jesse
"This patch series adds support for the `duration` type in CQL, which
was added to Cassandra in 3.10.

As part of this work, it was necessary also to add support for the
`vint` and `unsigned vint` types to the native protocol implementation,
which are part of v5 of the specification.

To test interactively, it is necessary to use cqlsh distributed with
Cassandra, as the version we distribute does not yet support the
duration type."

* 'jhk/duration_protocol/v5' of https://github.com/hakuch/scylla:
  Support `duration` CQL native type
  CQL native protocol: Add support for `vint` serialization
  duration_test.cc: Add test for printing zero duration
  duration.cc: Remove nop `const` qualifier on return type
  Change `const` qualifier declaration order for `duration`
  duration.cc: Simplify range checking
  Rename `duration` to `cql_duration`
2017-08-11 10:56:55 +01:00
Duarte Nunes
bcf21aacc2 storage_proxy: Directly call query_nonsingular_mutations_locally
Instead of duplicating the branch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811001559.25788-1-duarte@scylladb.com>
2017-08-11 09:06:01 +03:00
Duarte Nunes
a3ee99554b service/storage_proxy: Remove out of date comment
Now that we don't go directly to reconciliation for range queries, the
result isn't required to have the row and partition counts calculated
(we no longer transform a reconciled_result to a query::result).

Furthermore, this line was causing a lot of dtests to fail on account
of them not expecting an error line in the logs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170810225351.12610-1-duarte@scylladb.com>
2017-08-11 09:04:23 +03:00
Raphael S. Carvalho
5124f94358 tests: test summary entry spanning more keys than min interval
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-11 01:37:06 -03:00
Raphael S. Carvalho
872412d31a db/config: introduce sstable_summary_ratio option
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-11 01:36:21 -03:00
Raphael S. Carvalho
8726ee937d sstables: introduce size-based sampling for sstable summary
Currently, a summary entry is added after min_index_interval index
entries were written. Not taking into account size of index entries
becomes a problem with large partitions which may create big index
entries due to promoted indexes. Read performance is affected as a
consequence because index entries spanned by summary are all read
from disk to serve request.

What we wanna do is to also add a summary entry after index reaches
a boundary. To deal with oversampling, we want to write 1 byte to
summary for every 2000 bytes written to data file (this will be
eventually made into an option in the config file).
Both conditions must be met to avoid under or oversampling.
That way, the amount of data needed from index file to satify the
request is drastically reduced.

Fixes #1842.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-11 00:30:12 -03:00
Raphael S. Carvalho
da7489720b sstables: make components_writer::offset const qualified and uint64_t
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-10 21:48:11 -03:00
Raphael S. Carvalho
881c479be8 sstables: make writer::offset const qualified and uint64_t
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-10 21:46:39 -03:00
Jesse Haber-Kucharsky
509626fe08 Support duration CQL native type
`duration` is a new native type that was introduced in Cassandra 3.10 [1].

Support for parsing and the internal representation of the type was added in
8fa47b74e8.

Important note: The version of cqlsh distributed with Scylla does not have
support for durations included (it was added to Cassandra in [2]). To test this
change, you can use cqlsh distributed with Cassandra.

Duration types are useful when working with time-series tables, because they can
be used to manipulate date-time values in relative terms.

Two interesting applications are:

- Aggregation by time intervals [3]:

`SELECT * FROM my_table GROUP BY floor(time, 3h)`

- Querying on changes in date-times:

`SELECT ... WHERE last_heartbeat_time < now() - 3h`

(Note: neither of these is currently supported, though columns with duration
values are.)

Internally, durations are represented as three signed counters: one for months,
for days, and for nanoseconds. Each of these counters is serialized using a
variable-length encoding which is described in version 5 of the CQL native
protocol specification.

The representation of a duration as three counters means that a semantic
ordering on durations doesn't exist: Is `1mo` greater than `1mo1d`? We cannot
know, because some months have more days than others. Durations can only have a
concrete absolute value when they are "attached" to absolute date-time
references. For example, `2015-04-31 at 12:00:00 + 1mo`.

That duration values are not comparable presents some difficulties for the
implementation, because most CQL types are. Like in Cassandra's implementation
[2], I adopted a similar strategy to the way restrictions on the `counter` type
are checked. A type "references" a duration if it is either a duration or it
contains a duration (like a `tuple<..., duration, ...>`, or a UDT with a
duration member).

The following restrictions apply on durations. Note that some of these contexts
are either experimental features (materialized views), or not currently
supported at run-time (though support exists in the parser and code, so it is
prudent to add the restrictions now):

- Durations cannot appear in any part of a primary key, either for tables or
  materialized views.

- Durations cannot be directly used as the element type of a `set`, nor can they
  be used as the key type of a `map`. Because internal ordering on durations is
  based on a byte-level comparison, this property of Cassandra was intended to
  help avoid user confusion around ordering of collection elements.

- Secondary indexes on durations are not supported.

- "Slice" relations (<=, <, >=, >) are not supported on durations with `WHERE`
   restrictions (like `SELECT ... WHERE span <= 3d`). Multi-column restrictions
   only work with clustering columns, which cannot be `duration` due to the
   first rule.

- "Slice" relations are not supported on durations with query conditions (like
  `UPDATE my_table ... IF span > 5us`).

Backwards incompatibility note:

As described in the documentation [4], duration literals take one of two
forms: either ISO 8601 formats (there are three), or a "standard" format. The ISO
8601 formats start with "P" (like "P5W"). Therefore, identifiers that have this
form are no longer supported.

Fixes #2240.

[1] https://issues.apache.org/jira/browse/CASSANDRA-11873

[2] bfd57d13b7

[3] https://issues.apache.org/jira/browse/CASSANDRA-11871

[4] http://cassandra.apache.org/doc/latest/cql/types.html#working-with-durations
2017-08-10 15:01:10 -04:00
Jesse Haber-Kucharsky
91dab1d998 CQL native protocol: Add support for vint serialization
Version 5 of the native protocol for CQL [1] adds the `vint` and `unsigned vint`
types.

An unsigned integer encoded as a `vint` has a variable size based on the
magnitude of the value. The first byte indicates the total number of bytes.

For signed integers, a "zig-zag" encoding scheme ensures that small negative
values are encoded as short-length `vint`s (0 -> 0, -1 -> 1, 1 -> 2, 2 -> 3, -2
-> 4, etc).

[1] https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec
2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky
77489f843f duration_test.cc: Add test for printing zero duration
It's somewhat counter-intuitive, but Cassandra also formats zero-valued
duration values as an empty string.
2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky
d9c027c2dd duration.cc: Remove nop const qualifier on return type
These have no effect according to the Clang static analyzer.
2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky
54c3cf0201 Change const qualifier declaration order for duration
The vast majority of the code-base is written in left-`const` style, and
consistency is important.
2017-08-10 14:11:30 -04:00
Jesse Haber-Kucharsky
1889b036b1 duration.cc: Simplify range checking 2017-08-10 14:11:23 -04:00
Avi Kivity
301358e440 Merge "Optimize combined_mutation_reader for disjoint sstable ranges" from Botond
"sstables will sometimes have narrow/disjont ranges (e.g. LCS L1+).
This can be exploited when reading from a range of sstables by opening
sstables on-demand thus saving memory, processing and potentially I/O.

To achieve this combined_mutation_reader is refactored such that the
reader selection logic is moved-out into a reader_selector class.
combined_mutation_reader now takes a reader_selector instance in its
constructor and asks it for new readers for the current ring position
on every call to operator()().

At the moment two specializations of reader_selector are provided:
* list_reader_selector which implements the current logic, that is using
    a provided mutation_reader list, and
* incremental_reader_selector which implements the on-demand opening
    logic discussed above.

Fixes #1935"

* 'bdenes/optimize_combined_reader-v6' of https://github.com/denesb/scylla:
  Add combined_mutation_reader_test unit test
  Remove range_sstable_reader
  Add incremental_reader_selector
  Add reader_selector to combined_mutation_reader
  sstable_set::incremental_selector: select() now returns a selection
2017-08-10 15:16:30 +03:00
Botond Dénes
9ee9988097 Add combined_mutation_reader_test unit test 2017-08-10 12:38:10 +03:00
Botond Dénes
3e97a5cd6b Remove range_sstable_reader
range_sstable_reader is replaced with combined_mutation_reader, using
the incremental_reader_selector.
2017-08-10 12:38:10 +03:00
Botond Dénes
bfc74f1312 Add incremental_reader_selector
incremental_reader_selector is a specialization of reader_selector for
the case when sstables have narrow and/or disjoint token ranges. To
exploit this it creates new readers on-demand when their sstable's
token range intersects with the current ring position.
2017-08-10 12:38:02 +03:00
Botond Dénes
a6b9186cab Add reader_selector to combined_mutation_reader
combined_mutation_reader now accepts as a constructor argument a
reader_selector instance whoose task is to create new readers on
each call to operator()() if needed and possible.
This way it is possible to control how readers are created through
different specializations of reader_selector.

The previous logic is refactored into list_reader_selector which
is using a pre-provided mutation_reader list and forwards all of them to
combined_mutation_reader at once.
2017-08-10 12:37:40 +03:00
Takuya ASADA
1cb0fff146 dist/common/scripts/scylla_raid_setup: handle '--disks' parameter correctly when disk list is end with ','
We should handle parameters correctly even it's malformed.
Fixes #2402

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499266239-27551-1-git-send-email-syuu@scylladb.com>
2017-08-10 11:42:33 +03:00
Takuya ASADA
8e115d69a9 dist/debian: append postfix '~DISTRIBUTION' to scylla package version
We are moving to aptly to release .deb packages, that requires debian repository
structure changes.
After the change, we will share 'pool' directory between distributions.
However, our .deb package name on specific release is exactly same between
distributions, so we have file name confliction.
To avoid the problem, we need to append distribution name on package version.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502312935-22348-1-git-send-email-syuu@scylladb.com>
2017-08-10 10:53:56 +03:00
Vlad Zolotarov
1b4594b03a transport::server::process_prepare() don't ignore errors on other shards
If storing of the statement fails on any shard we should fail the whole PREPARE
request.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1502325392-31169-13-git-send-email-vladz@scylladb.com>
2017-08-10 10:32:37 +03:00
Jesse Haber-Kucharsky
352e9f60ba Rename duration to cql_duration
`std::chrono::duration` is a prolific enough name that it's best to
disambiguate.
2017-08-09 15:15:20 -04:00
Botond Dénes
94fc550e68 sstable_set::incremental_selector: select() now returns a selection
A seletion contains - in addition to the list of sstables - a next_token
which is a hint as to what is the next best token to call select() with.
This should be the smallest token such that at the next call to
select() the least number of new sstables will be returned, without
skipping any.
2017-08-09 16:27:33 +03:00
Takuya ASADA
3077416ecc dist/debian: Backport scalability fix of _Unwind_Find_FDE to out gcc for Debian 8
Since we provide custom build gcc only for Debian 8, the fix is not apply to
Ubuntu/Debian 9.

Fixes #2646

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502239191-12649-1-git-send-email-syuu@scylladb.com>
2017-08-09 12:19:52 +03:00
Avi Kivity
7217b7ab36 Merge "Use range_streamer everywhere" from Asias
"With this series, all the following cluster operations:

- bootstrap
- rebuild
- decommission
- removenode

will use the same code to do the streaming.

The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.

The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....

Later, we can introduce api for user to decide when to stop retrying and
the retry interval.

The benefits:

 - All the cluster operation shares the same code to stream
 - We can know the operation progress, e.g., we can know total number of
   ranges need to be streamed and number of ranges finished in
   bootstrap, decommission and etc.
 - All the cluster operation can survive peer node down during the
   operation which usually takes long time to complete, e.g., when adding
   a new node, currently if any of the existing node which streams data to
   the new node had issue sending data to the new node, the whole bootstrap
   process will fail. After this patch, we can fix the problematic node
   and restart it, the joining node will retry streaming from the node
   again.
 - We can fail streaming early and timeout early and retry less because
   all the operations use stream can survive failure of a single
   stream_plan. It is not that important for now to have to make a single
   stream_plan successful. Note, another user of streaming, repair, is now
   using small stream_plan as well and can rerun the repair for the
   failed ranges too.

This is one step closer to supporting the resumable add/remove node
opeartions."

* tag 'asias/use_range_streamer_everywhere_v4' of github.com:cloudius-systems/seastar-dev:
  storage_service: Use the new range_streamer interface for removenode
  storage_service: Use the new range_streamer interface for decommission
  storage_service: Use the new range_streamer interface for rebuild
  storage_service: Use the new range_streamer interface for bootstrap
  dht: Extend range_streamer interface
2017-08-09 10:00:25 +03:00
Takuya ASADA
98fc7b376d dist/redhat: install mdadm/xfsprogs on package install time
We experienced 'Constructing RAID volume...' takes too much time on some AMIs,
this is because setup script stuck at 'yum -y install mdadm xfsprogs'.
We don't have to install these packages on AMI startup time, we should
preinstall them on AMI creating time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502192796-21040-1-git-send-email-syuu@scylladb.com>
2017-08-09 09:10:34 +03:00
Piotr Jastrzebski
4137517cdc Check arguments of table_helper::setup_keyspace
to make sure all table helpers passed as arguments are
for the right keyspace.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <10edacd509880bb18180f13e8c28593d068c5c7b.1501688729.git.piotr@scylladb.com>
2017-08-08 15:55:06 +03:00
Piotr Jastrzebski
2d8a80f211 Make table_helper constructor safer
by taking keyspace name by value and storing it inside the object.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a5dab41647348ae311e023fe5592aec650c6e32a.1501688729.git.piotr@scylladb.com>
2017-08-08 15:55:06 +03:00
Daniel Fiala
06089474c9 Print warning if user uses default cluster_name
* Configuration for cluster_name is commented-out in config file.
* Default value set to empty string and if not rewritten by user then
  warning is printed and value is reset to "ScyllaDB Cluster".

Fixes #2648.

Message-Id: <20170808113322.9313-1-daniel@scylladb.com>
2017-08-08 14:47:17 +03:00
Avi Kivity
a71138fc84 config: mark column_index_size_in_kb as Used
Fixes #2681
Message-Id: <20170808100415.16296-1-avi@scylladb.com>
2017-08-08 11:08:00 +01:00
Ultrabug
2022da2405 Add overall python code QA and guidelines with flake8
ScyllaDB loves python & python loves ScyllaDB.

It would benefit the project to start enforcing some code guidelines
and basic QA with a linter along a PEP8 respect thanks to flake8.

This patch adds a tox config to at least start with an assessment
of the work to be done on all .py files in the code base.

To reduce its noise, tests on long lines (> 80char) are ignored
for now.

Signed-off-by: Ultrabug <ultrabug@gentoo.org>
Message-Id: <20170726134242.8927-1-ultrabug@gentoo.org>
2017-08-08 11:15:45 +03:00
Raphael S. Carvalho
dddbd34b52 sstables: close index file when sstable writer fails
index's file output stream uses write behind but it's not closed
when sstable write fails and that may lead to crash.
It happened before for data file (which is obviously easier to
reproduce for it) and was fixed by 0977f4fdf8.

Fixes #2673.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>
2017-08-08 09:53:14 +03:00
Asias He
49360992d9 storage_service: Use the new range_streamer interface for removenode
So that removenode operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:48 +08:00
Asias He
6b8dc85f12 storage_service: Use the new range_streamer interface for decommission
So that decommission operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:48 +08:00
Asias He
24584b8509 storage_service: Use the new range_streamer interface for rebuild
So that rebuild operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:47 +08:00
Asias He
f239b11a84 storage_service: Use the new range_streamer interface for bootstrap
So that bootstrap operation will now stream small ranges at a time and
restream the failed ranges.
2017-08-07 16:31:47 +08:00
Asias He
6810031ba7 dht: Extend range_streamer interface
After this patch and the following patches to use the new
range_streamder interface, all the following cluster operations:

- bootstrap
- rebuild
- decommission
- removenode

will use the same code to do the streaming.

The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.

The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....

Later, we can introduce api for user to decide when to stop retrying and
the retry interval.

The benefits:

- All the cluster operation shares the same code to stream

- We can know the operation progress, e.g., we can know total number of
  ranges need to be streamed and number of ranges finished in
  bootstrap, decommission and etc.
- All the cluster operation can survive peer node down during the
  operation which usually takes long time to complete, e.g., when adding
  a new node, currently if any of the existing node which streams data to
  the new node had issue sending data to the new node, the whole bootstrap
  process will fail. After this patch, we can fix the problematic node
  and restart it, the joining node will retry streaming from the node
  again.
- We can fail streaming early and timeout early and retry less because
  all the operations use stream can survive failure of a single
  stream_plan. It is not that important for now to have to make a single
  stream_plan successful. Note, another user of streaming, repair, is now
  using small stream_plan as well and can rerun the repair for the
  failed ranges too.

This is one step closer to supporting the resumable add/remove node
opeartions.
2017-08-07 16:31:47 +08:00
Avi Kivity
86de6cc7fb Merge seastat upstream
* seastar f14d2a3...7a49ae5 (8):
  > sharded: improve support for cooperating sharded<> services
  > sharded: support for peer services
  > semaphore: add a version of with_semaphore that takes a duration timeout
  > scripts: perftune.py: fix the CPU mask generation for more than 64 CPUs
  > Revert "future-utils: make when_all() (vector variant) exception safe"
  > Revert "future-utils: fix gross compilation errors in when_all()"
  > future-utils: fix gross compilation errors in when_all()
  > future-utils: make when_all() (vector variant) exception safe

Includes change to batchlog_manager constructor to adapt it to
seastar::sharded::start() change.
2017-08-06 17:47:47 +03:00
Avi Kivity
3edec66903 Revert "repair: Make send_repair_checksum_range timeout"
This reverts commit 98757069a5. We have the
failure detector which will detect an unresponsive node and fail the RPC.
Adding a timeout can just introduce false positives.
2017-08-06 13:09:36 +03:00
Avi Kivity
621926d914 dist: debian: escape "$" character for make 2017-08-05 16:51:03 +03:00
Avi Kivity
a471851bf1 dist: debian: add /opt/scylladb/bin to PATH so antlr can be found 2017-08-05 15:46:58 +03:00
Avi Kivity
8bdc0dd471 dist: debian: search for libaries in /opt/scylladb/lib 2017-08-05 13:18:14 +03:00
Takuya ASADA
2ff3bdba5c dist/debian: switch Ubuntu 3rdparty packages to external build service
Switch Ubuntu to launchpad ppa:
https://launchpad.net/~scylladb/+archive/ubuntu/ppa/+packages

Since switching 3rdparty on Debian is not ready yet, keep them to use scylla
3rdparty repo, also keep --rebuild-dep option and dist/debian/dep.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501866678-4922-1-git-send-email-syuu@scylladb.com>
2017-08-05 11:29:13 +03:00
Glauber Costa
4a911879a3 add active streaming reads metric
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.

This patch adds this metric so we don't fly blind.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
2017-08-05 11:06:37 +03:00
Duarte Nunes
587b6be089 dirty_memory_manager: Add missing include
Allows tests/memory_footprint to build on Ubuntu 14.04.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-04 10:15:23 +02:00
Avi Kivity
4f12068e50 dist: re-add --rebuild-dep to build_rpm.sh
For compatibility with existing scripts; ignored.
2017-08-04 07:10:18 +03:00
Takuya ASADA
b5e83ebd94 dist/redhat: switch 3rdparty packages to external build service
Drop existing 3rdparty build script/3rdparty repo, switch to Fedora Copr
https://copr.fedorainfracloud.org/coprs/scylladb/scylla-3rdparty/packages/

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170803110754.22152-1-syuu@scylladb.com>
2017-08-04 06:40:09 +03:00
Pekka Enberg
90872ffa1f docker: Disable stall detector
Fixes #2162

Message-Id: <1501759957-4380-1-git-send-email-penberg@scylladb.com>
2017-08-03 14:52:49 +03:00
Takuya ASADA
91ade1a660 dist/debian: check scylla user/group existance before adding them
To prevent install failing on the environment which already has scylla
user/group, existance check is needed.

Fixes #2389

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1495023805-14905-1-git-send-email-syuu@scylladb.com>
2017-08-03 13:01:18 +03:00
Takuya ASADA
6ac254fbcb dist: change nomerges=1 on block devices during fstrim execution
We have problem to run fstrim with nomerges=2, so we need to change
the parameter to 1 during fstrim execution.
To do this, this fix changes follow things:
 - revert dropping scylla_fstrim on Ubuntu 16.04/CentOS
 - disable distribution provided fstrim script
 - enable scylla_fstrim on all distributions
 - introduce --set-nomerges on scylla-blocktune
 - scylla_fstrim call scylla-blocktune by following order:
   - 'scylla-blocktune --set-nomerges 1'
   - 'fstrim' for each devices
   - 'scylla-blocktune --set-nomerges 2'

Fixes #2649

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501531393-21109-1-git-send-email-syuu@scylladb.com>
2017-08-03 13:00:34 +03:00
Botond Dénes
0b7ac01f0f Add QtCreator project file and .gdbinit to .gitignore
Message-Id: <ff662910fe1156cdde2bda4aa5bb9cfc45bddda9.1501752340.git.bdenes@scylladb.com>
2017-08-03 12:58:35 +03:00
Avi Kivity
f38e4ff3f9 database: prevent streaming reads from blocking normal reads
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.

Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.

Fixes #2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>
2017-08-03 10:23:01 +01:00
Avi Kivity
911536960a database: remove streaming read queue length limit
If we fail a streaming read due queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.

Fixes #2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>
2017-08-03 10:21:07 +01:00
Avi Kivity
e9519ca8e5 Merge "make range selects more efficient by going through digest matching stage" from Gleb
"Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This series makes
it to try matching digests first like a single partition read."

Fixes #2666.

* 'gleb/digest_scan' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: make range_slice_read_executor go through digest matching state
  storage_proxy: add capability to read data/digest for non singular ranges
  storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
2017-08-03 12:18:11 +03:00
Tzach Livyatan
d3d46a5eac Add comments on cluster_name in scylla.yaml
Fix #2316

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170730082922.21884-1-tzach@scylladb.com>
2017-08-03 12:12:15 +03:00
Gleb Natapov
d2a2a6d471 storage_proxy: make range_slice_read_executor go through digest matching state
Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This patch makes
it to try matching digests first like a single partition read.

The change requires internode protocol changes since currently it is not
possible to ask for multi partition data/digest over RPC. It means that
the capability has to be guarded by new gossip feature flag which the
patch also adds.
2017-08-03 11:37:03 +03:00
Tzach Livyatan
99b2232c5d docs/docker: Add hostname parameter to examples
Using --hostname to give the container a meaningful name is a good
practice, and make the monitoring dashboard easier to understand

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170803081027.6675-1-tzach@scylladb.com>
2017-08-03 11:14:12 +03:00
Gleb Natapov
3b7d8c8767 storage_proxy: add capability to read data/digest for non singular ranges
Currently only mutation_data read supports non singular ranges. This
patch extends data/digest reads to support them too.
2017-08-03 10:35:09 +03:00
Gleb Natapov
c619ef258b storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
never_speculating_read_executor always waits for all targets so
block_for parameter is always equal to targets.size(). No need to
to pass it explicitly.
2017-08-03 10:08:44 +03:00
Duarte Nunes
4c9206ba2f tests/sstable_mutation_test: Don't use moved-from object
Fix a bug introduced in dbbb9e93d and exposed by gcc6 by not using a
moved-from object. Twice.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170802161033.4213-1-duarte@scylladb.com>
2017-08-03 09:45:49 +03:00
Asias He
763fa83232 repair: Fix build in repair_cf_range
The compiler does not like the mutable.
Message-Id: <83c5e8a944b72a095b8e29e9988986e6ca9cefc5.1501690749.git.asias@scylladb.com>
2017-08-02 18:57:32 +02:00
Asias He
5798625d73 repair: Singal parallelism_semaphore in case of error
If we throw after we take the semaphore and beforew the when_all
below runs, no one will increase the semaphore.

Fixes #2661
Message-Id: <49540ede4c8a6d84004e10e0f63690e3c21d72c7.1501686383.git.asias@scylladb.com>
2017-08-02 18:32:32 +03:00
Avi Kivity
ebff739a84 Merge "use paging for compaction history" from Amnon
"This series adds an option to use paging in internal query and use that for the
get compaction history function.

Internal paging will be done explicitly, to use paging, you first create a
state object (that contains the query as well) and use that state to get the
first page, the result will contain both the query result and a new state that
can be used to get the next page.

Fixes #2366"

* 'amnon/paged_compaction_history_v5' of github.com:cloudius-systems/seastar-dev:
  system_keyspace: Use paging for get compaction history
  Add paging for internal queries
  query_options: Allows creating query_options from query_options
2017-08-02 18:15:58 +03:00
Avi Kivity
ac31abf6a4 repair: don't lambda-capture repair_tracker
It is static, so it need not be captured, and some compilers complain.
2017-08-02 18:07:31 +03:00
Avi Kivity
ce60ef59f3 Revert "repair: Singal parallelism_semaphore in case of error"
This reverts commit a548eee28c. It releases
the semaphore too early (noted by Glauber).
2017-08-02 17:13:46 +03:00
Avi Kivity
b2753b0183 Merge "Fix possible repair stuck" from Asias
"This series tries to fix possible repair stuck."

Fixes #2660, #2661, #2662.

* tag 'asias/repair_stuck_v2.1' of github.com:cloudius-systems/seastar-dev:
  repair: Make send_repair_checksum_range timeout
  repair: Singal parallelism_semaphore in case of error
  repair: Fix repair_tracker done
2017-08-02 16:51:51 +03:00
Asias He
98757069a5 repair: Make send_repair_checksum_range timeout
If the verb never returns the repair will hangs forever. Make it use the
timeout version of the send_message.

Fixes #2662
2017-08-02 21:41:50 +08:00
Asias He
a548eee28c repair: Singal parallelism_semaphore in case of error
If we throw after we take the semaphore and beforew the when_all
below runs, one one will increase the semaphore.

Fixes #2661
2017-08-02 21:41:45 +08:00
Asias He
abcff4c78e repair: Fix repair_tracker done
If it throws after repair_tracker.start and before the when_all below,
the repair_tracker.done will never be called for this repair id.

Fixes #2660
2017-08-02 21:40:29 +08:00
Pekka Enberg
78f68613ce dist/docker: Reduce number of layers
One of the best practices for Dockerfiles is to minimize the number of
layers because they increase the overall image size:

https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#minimize-the-number-of-layers

Consolidate our "yum install" commands to reduce the number of lauyers.

Suggested by Dean Hamstead.

Message-Id: <1501670572-8701-1-git-send-email-penberg@scylladb.com>
2017-08-02 15:21:05 +03:00
Takuya ASADA
ffbdacc1fa dist/debian: remove ant from prerequisite packages
This lines are mistakenly copied from scylla-tools, won't need for scylla-server.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498029619-1928-1-git-send-email-syuu@scylladb.com>
2017-08-02 12:12:42 +03:00
Duarte Nunes
cec41f9de6 Merge seastar upstream
* seastar fc937b8...f14d2a3 (4):
  > configure.py: Ensure tmp directory exists when getting dpdk cflags
  > checked_ptr: fix hash() compilation
  > net: fix potential use after free in posix_server_socket::accept()
  > http: removed unneeded lamda captures

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-02 10:05:08 +02:00
Asias He
cf6f4a5185 gossip: Introduce the shadow_round_ms option
It specifies the maximum gossip shadow round time. It can be used to
reduce the gossip feature check time during node boot up.
For instance, when the first node in the cluster, which listed both
itself and other node as seed in the yaml config, boots up, it will try
to talk to other seed nodes which are not started yet. The gossip shadow
round will be used to fetch the feature info of the cluster. Since there
is no other seed node in the cluster, the shadow round will fail. User
can reduce the default shadow_round_ms option to reduce the boot time.

Fixes #2615
Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>
2017-08-02 09:52:35 +03:00
Vlad Zolotarov
4b28ea216d utils::loading_cache: cancel the timer after closing the gate
The timer is armed inside the section guarded by the _timer_reads_gate
therefore it has to be canceled after the gate is closed.

Otherwise we may end up with the armed timer after stop() method has
returned a ready future.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>
2017-08-01 17:21:44 +01:00
Duarte Nunes
569bbf2edd sstables/sstables: Use per-cpu noop_write_monitor
We employ a thread-per-core architecture, so don't go about sharing
seastar::shared_ptrs across cpus.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170801144153.17354-1-duarte@scylladb.com>
2017-08-01 18:10:49 +03:00
Avi Kivity
db7329b1cb Merge "Ensure correct EOC for PI block cell names" from Duarte
"This series ensures the always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.

Fixes #2333"

* 'pi-cell-name/v1' of github.com:duarten/scylla:
  tests/sstable_mutation_test: Test promoted index blocks are monotonic
  sstables: Consider eoc when flushing pi block
  sstables: Extract out converting bound_kind to eoc
2017-08-01 18:09:07 +03:00
Gleb Natapov
1da4d5c5ee cql transport: run accept loop in the foreground
It was meant to be run in the foreground since it is waited upon during
stop(), but as it is now from the stop() perspective it is completed
after first connection is accepted.

Fixes #2652

Message-Id: <20170801125558.GS20001@scylladb.com>
2017-08-01 17:04:14 +03:00
Avi Kivity
1e8bb972b6 compaction: fix iteration in leveled compaction droppable tombstones loop
Since get_level_count() is unsigned, it will never be negative, and
the loop may never terminate.

Message-Id: <20170719133502.13316-1-avi@scylladb.com>
2017-08-01 13:40:36 +03:00
Avi Kivity
ba2e170e4b compaction: fix return in leveled compaction droppable tombstones loop
If the loop ever terminates, we need to return something.

Message-Id: <20170719133508.13374-1-avi@scylladb.com>
2017-08-01 13:33:02 +03:00
Takuya ASADA
a998b7b3eb dist/ami: follow scylla-tools package name change on RedHat variants
Since scylla-tools generates two .rpm packages, we need to copy them to our AMI.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170722090002.9850-1-syuu@scylladb.com>
2017-07-31 18:57:12 +03:00
Avi Kivity
7c8dea088a Merge seastar upstream
* seastar 54e940f...fc937b8 (2):
  > configure.py: Always ensure tmp directory exists
  > coding-style.md: introduce
2017-07-31 18:06:09 +03:00
Duarte Nunes
a85232dd82 Fix compilation errors on GCC 6
GCC 6 inconsistently requires explicitly calling a member function
through "this->" for lambda functions capturing "this".

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170731143755.21970-1-duarte@scylladb.com>
2017-07-31 17:40:44 +03:00
Benoît Canet
b44ba11e4c transport: Count the number of unpaged queries
Queries with query page size equal or smaller than
zero are unpaged queries.

Count these kind of queries and make them a metrics
since they can ruin the performance of the system.

Message-Id: <20170731130004.25807-2-benoit@scylladb.com>
2017-07-31 16:01:45 +03:00
Avi Kivity
3fe6731436 Merge "educe the effect of the latency metrics" from Amnon
"This series reduce that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
   stop the last bucket."

Fixes #2650.

* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
  database: remove latency from the system table
  estimated histogram: return a smaller histogram
2017-07-31 15:58:30 +03:00
Paweł Dziepak
402799fcc0 mutation_reader: drop move_and_clear()
Since the discovery of std::exchange(x, {}) move_and_clear has become
obsolete. Beside, the name was wrong, it did not clear the vector but
recreated it meaning that any allocated memory wasn't reused (not that
it mattered in the existing usages).

Message-Id: <20170731123549.10887-1-pdziepak@scylladb.com>
2017-07-31 15:51:19 +03:00
Gleb Natapov
87bc3f7e7f configure.py: use user provided compiler flags when checking for features
User provided compiler flags my change an outcome of the test.

Message-Id: <20170724111520.GA18230@scylladb.com>
2017-07-31 15:33:06 +03:00
Avi Kivity
f4b2a1ef4e Merge "Optimise combined_mutation_reader" from Paweł
"These patches optimise combined_mutation_reader for cases where the majority
of mutation_readers is disjoint.

perf_fast_forward:
Results are medians of 3 of fragments/s as reported by perf_fast_forward.

Command:
perf_fast_forward -c1 --enable-cache

small: small-partition-skips (read=1, skip=0)
large: large-partition-skips (read=1, skip=0)

          before        after      diff
small     195753      238196       +22%
large    1244325     1359096        +9%

perf_simple_query:

Results are medians of 10 of reads/s as reported by perf_simple_query.

Command:
perf_simple_query -c1

before   98651.40
after   104554.85
diff          +6%"

* tag 'avoid-merge_mutations/v1' of https://github.com/pdziepak/scylla:
  combined_mutation_reader: avoid unnecessary merge_mutations()
  combined_mutation_reader: do not pop mutation with different key
2017-07-31 15:14:42 +03:00
Avi Kivity
178b54e790 Merge "memtable flush: Fixes and improvements" from Duarte
"This series ensure that when we retry a memtable flush, we re-acquire the
flush permit that was previously released. It also ensures we don't hold
the sstable read lock for the duration of the sleep leading to the retry.

To achieve that cleanly we refactor the way the permit lifecycle is managed
by employing a RAII-based approach.

We also improve the latency of writes blocked on virtual dirty by releasing
the flush permit before fsyncing the sstables. There are additional avenues
for performance improvements on top of this one."

* 'memtable-flush-additional-fixes/v4' of github.com:duarten/scylla:
  column_family: Re-acquire flush permit in case of error
  column_family: Don't hold sstable read lock when retrying flush
  sstables: Release the flush permit before fsyncing
  sstables: Introduce write_monitor
  database: Extract out dirty_memory_manager
  dirty_memory_manager: Refactor flush permit lifetime management
  dirty_memory_manager: Invert permit acquisition order
  memtable_list: Register different seal functions for each behaviour
2017-07-31 14:57:19 +03:00
Paweł Dziepak
2b53a560c8 combined_mutation_reader: avoid unnecessary merge_mutations()
Merging mutations is quite an expensive operation. The creation of
streamed mutation merger involves several allocations (mostly coming
from various std::vector) and then all mutation_fragments need to go
through a heap.

All this is completely unnecessary if there is only one mutation, so
let's skip a call to merge_mutations() in such cases. This also means
that we can reuse memory allocated by _current vector if merge is not
required.
2017-07-31 12:35:40 +01:00
Paweł Dziepak
f78f2b3c92 combined_mutation_reader: do not pop mutation with different key
Originally, the loop insidecombined_mutation_reader::next() so that it
was popping mutation from the heap and when it encountered one with a
different decorated key it was pushed back and the ones accumulated so
far merged and emitted. In other words, every time the reader progressed
to the next mutation it did needless pop and push operations on the
heap.

This patch rearranges the code so that the key of the next mutation is
compared before it is popped from the heap.
2017-07-31 12:35:40 +01:00
Duarte Nunes
c81431ad16 column_family: Re-acquire flush permit in case of error
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
9162e016da column_family: Don't hold sstable read lock when retrying flush
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
1a33cc6847 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
784a078e72 sstables: Introduce write_monitor
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
d2b0a5a0a6 database: Extract out dirty_memory_manager
Needed to the flush_permit can be propagated to the sstables layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
a2b732c156 dirty_memory_manager: Refactor flush permit lifetime management
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
f647f5b14a dirty_memory_manager: Invert permit acquisition order
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
e371accac8 memtable_list: Register different seal functions for each behaviour
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.

This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Paweł Dziepak
e970630272 tests/serialized_action: add missing forced defers
serialized_action_tests depends on the fact that first part of the
serialized_action is executed at cetrtain points (in which it reads a
global variable that is later updated by the main thread).
This worked well in the release mode before ready continuations were
inlined and run immediately, but not in the debug mode since inlining
was not happening and the main seastar::thread was missing some yield
points.
Message-Id: <20170731103013.26542-1-pdziepak@scylladb.com>
2017-07-31 11:35:24 +01:00
Duarte Nunes
4e3232fc29 utils/log_histogram: Fix typo when calculating number of buckets
We weren't correctly calculating the number of buckets due to
returning the wrong variable.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170731094733.7746-1-duarte@scylladb.com>
2017-07-31 12:49:11 +03:00
Avi Kivity
e855a28fae Revert "Merge "memtable flush: Fixes and improvements" from Duarte"
This reverts commit 733a64a1df, reversing
changes made to e11e66723a.

Breaks sstable_test and perf_fast_forward.
2017-07-31 12:44:28 +03:00
Avi Kivity
85056f3611 log_histogram: fix constexpr-ness of log_histogram_options
1. assert() is not constexpr.
2. can't use static_assert(), because the contructor may be called in a non-constexpr
   environment; moved to log_histogram
3. pow2_rank() uses count_leading_zeros() which is not constexpr; split
   into constexpr and non-constexpr versions
4. duplicated number_of_buckets() because bucket_of() can't be constexpr due to pow2_rank
Message-Id: <20170726105444.32698-1-avi@scylladb.com>
2017-07-31 09:11:40 +01:00
Avi Kivity
733a64a1df Merge "memtable flush: Fixes and improvements" from Duarte
"This series ensure that when we retry a memtable flush, we re-acquire the
flush permit that was previously released. It also ensures we don't hold
the sstable read lock for the duration of the sleep leading to the retry.

To achieve that cleanly we refactor the way the permit lifecycle is managed
by employing a RAII-based approach.

We also improve the latency of writes blocked on virtual dirty by releasing
the flush permit before fsyncing the sstables. There are additional avenues
for performance improvements on top of this one."

* 'memtable-flush-additional-fixes/v3' of github.com:duarten/scylla:
  column_family: Re-acquire flush permit in case of error
  column_family: Don't hold sstable read lock when retrying flush
  sstables: Release the flush permit before fsyncing
  sstables: Introduce write_monitor
  database: Extract out dirty_memory_manager
  dirty_memory_manager: Refactor flush permit lifetime management
  dirty_memory_manager: Invert permit acquisition order
  memtable_list: Register different seal functions for each behaviour
  main: Don't catch polymorphic exceptions by value
2017-07-31 10:32:26 +03:00
Duarte Nunes
e11e66723a main: Don't catch polymorphic exceptions by value
GCC trunk complains due to exception slicing.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170727163021.8000-1-duarte@scylladb.com>
2017-07-31 10:12:13 +03:00
Avi Kivity
fc683c3f3e Merge seastar upstream
* seastar a14d667...54e940f (8):
  > Merge "Prometheus to use output stream" from Amnon
  > http_test: Fix an http output stream test
  > build: harden try_compile_and_link output temporary file
  > configure: disable exception scalability hack on debug build
  > build: don't perform test compiles to /dev/null
  > Provide workaround for non scaleable c++ exception runtime
  > Merge "Add output stream to http message reply" from Amnon
  > configure.py: use user provided compiler flags when checking for features
2017-07-31 10:09:48 +03:00
Avi Kivity
c1718dd5e3 Update scylla-ami submodule
* dist/ami/files/scylla-ami 2bd1481...b41e5eb (1):
  > Fix incorrect scylla-server sysconfig file edit for i3 memflush controller
2017-07-31 09:41:24 +03:00
Takuya ASADA
714540cd4c dist/debian: refuse upgrade if current scylla < 1.7.3 && commitlog remains
Commitlog replay fails when upgrade from <1.7.3 to 2.0, we need to refuse
updating package if current scylla < 1.7.3 && commitlog remains.

Note: We have the problem on scylla-server package, but to prevent
scylla-conf package upgrade, %pretrans should be define on scylla-conf.

Fixes #2551

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501187555-4629-1-git-send-email-syuu@scylladb.com>
2017-07-31 09:08:40 +03:00
Avi Kivity
e5d2e28df9 Merge "Backport exception scalability fix from gcc-7" from Gleb
"This patch series backports scalability fix for _Unwind_Find_FDE and modifies
out CentOS package to use our libgcc and libstdc++ which are needed to make
use of the fix instead of locally installed ones."

Ref #2646 (fixes on RHEL 7 and related only)

* 'gleb/exception-gcc-fix-v2' of github.com:cloudius-systems/seastar-dev:
  dist/redhat: Make scylla rpm depend on scylla-libgcc and scylla-libstdc++ and use it instead of locally installed one
  dist/redhat: Backport scalability fix of _Unwind_Find_FDE to out gcc
2017-07-30 19:31:03 +03:00
Gleb Natapov
8fe875cc79 dist/redhat: Make scylla rpm depend on scylla-libgcc and scylla-libstdc++ and use it instead of locally installed one 2017-07-30 16:03:25 +03:00
Gleb Natapov
1cf7e72c68 dist/redhat: Backport scalability fix of _Unwind_Find_FDE to out gcc 2017-07-30 16:03:10 +03:00
Paweł Dziepak
e62403190b Merge "Introduce perf_cache_eviction test" from Tomasz
Runs appending writes to a single partition, at full speed, and a reader
which selects the head of the partition, with 100ms delay between reads.
Prints latency percentiles and some stats.

Intended to test performance at the transition from non-evicting to
evicting modes.

Currently we can see that after the transition, whole partition gets
evicted and reads constantly miss.

Sample output:

    rd/s: 10, wr/s: 135947, ev/s: 0, pmerge/s: 1, miss/s: 0, cache: 708/778 [MB], LSA: 820/910 [MB], std free: 82 [MB]

    reads : min: 149   , 50%: 179   , 90%: 1331  , 99%: 1331  , 99.9%: 1331  , max: 6866   [us]
    writes: min: 3     , 50%: 4     , 90%: 4     , 99%: 5     , 99.9%: 258   , max: 51012  [us]

    rd/s: 7, wr/s: 93354, ev/s: 9, pmerge/s: 1, miss/s: 3, cache: 0/0 [MB], LSA: 107/128 [MB], std free: 82 [MB]

    reads : min: 179   , 50%: 179   , 90%: 73457 , 99%: 73457 , 99.9%: 73457 , max: 105778 [us]
    writes: min: 3     , 50%: 4     , 90%: 4     , 99%: 5     , 99.9%: 258   , max: 105778 [us]

* tag 'tgrabiec/row-eviction-perf-test' of github.com:scylladb/seastar-dev:
  tests: Introduce perf_cache_eviction
  tests: simple_schema: Add getter for DDL statement
  estimated_histogram: Implement percentile()
  utils: estimated_histogram: Make printable
2017-07-28 09:49:22 +01:00
Duarte Nunes
0f1bd81523 column_family: Re-acquire flush permit in case of error
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
2f4cffc7f6 column_family: Don't hold sstable read lock when retrying flush
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
5e64839e85 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
a737577881 sstables: Introduce write_monitor
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
121f967b30 database: Extract out dirty_memory_manager
Needed to the flush_permit can be propagated to the sstables layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
ef1275e9dd dirty_memory_manager: Refactor flush permit lifetime management
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
cfc8fae33f dirty_memory_manager: Invert permit acquisition order
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
7e68e4677d memtable_list: Register different seal functions for each behaviour
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.

This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
7502401652 main: Don't catch polymorphic exceptions by value
GCC trunk complains due to exception slicing.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
143f4fd861 Merge "Prevent pull requests from accumulating" from Tomasz
If schema merging completes at lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.

In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:

  concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test

Allowing only one active and one queued pull request per remote
endpoint is enough.

* tag 'tgrabiec/dont-accumulate-schema-pulls-v2' of github.com:scylladb/seastar-dev:
  migration_manager: Log schema pulls
  migration_manager: Prevent pull requests from accumulating
  utils: Introduce serialized_action
2017-07-27 21:01:38 +02:00
Tomasz Grabiec
e09220dbff migration_manager: Log schema pulls 2017-07-27 20:08:25 +02:00
Tomasz Grabiec
350d98d4e1 migration_manager: Prevent pull requests from accumulating
If schema merging completes at lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.

In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:

  concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test

Allowing only one active and one queued pull request per remote
endpoint is enough.
2017-07-27 20:08:25 +02:00
Tomasz Grabiec
6a3703944b utils: Introduce serialized_action 2017-07-27 20:08:21 +02:00
Duarte Nunes
dbbb9e93da tests/sstable_mutation_test: Test promoted index blocks are monotonic
Reproduces #2333

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 18:23:58 +02:00
Duarte Nunes
06728bdfe9 sstables: Consider eoc when flushing pi block
When flushing a promoted index block using a range tombstone cell name
as a bound, use the right eoc value instead of always writing
composite::eoc::none.

Fixes #2333

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 18:23:58 +02:00
Duarte Nunes
718517ed91 sstables: Extract out converting bound_kind to eoc
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 18:23:58 +02:00
Paweł Dziepak
f02bef7917 streamed_mutation: do not call fill_buffer() ahead of time
consume_mutation_fragments_until() allows consuming mutation fragments
until a specified condition happens. This patch reorganises its
implementation so that we avoid situations when fill_buffer() is called
with stop condition being true.
Message-Id: <20170727122218.7703-1-pdziepak@scylladb.com>
2017-07-27 17:47:57 +02:00
Tomasz Grabiec
ac7e6ef1bc tests: Introduce perf_cache_eviction 2017-07-27 17:19:07 +02:00
Tomasz Grabiec
2d2e7ef6fb tests: simple_schema: Add getter for DDL statement 2017-07-27 17:19:07 +02:00
Tomasz Grabiec
5602be72fa estimated_histogram: Implement percentile() 2017-07-27 17:19:07 +02:00
Tomasz Grabiec
1bc305ed7b utils: estimated_histogram: Make printable 2017-07-27 17:19:03 +02:00
Takuya ASADA
91a75f141b dist/redhat: limit metapackage dependencies to specific version of scylla packages
When we install scylla metapackage with version (ex: scylla-1.7.1),
it just always install newest scylla-server/-jmx/-tools on the repo,
instead of installing specified version of packages.

To install same version packages with the metapackage, limited dependencies to
current package version.

Fixes #2642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170726193321.7399-1-syuu@scylladb.com>
2017-07-27 14:21:35 +03:00
Takuya ASADA
11870e47ec dist/redhat: refuse upgrade if current scylla < 1.7.3 && commitlog remains
Commitlog replay fails when upgrade from <1.7.3 to 2.0, we need to refuse
updating package if current scylla < 1.7.3 && commitlog remains.

Note: We have the problem on scylla-server package, but to prevent
scylla-conf package upgrade, %pretrans should be define on scylla-conf.

Fixes #2551

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170727110730.613-1-syuu@scylladb.com>
2017-07-27 14:09:17 +03:00
Tomasz Grabiec
22948238b6 row_cache: Fix potential timeout or deadlock due to sstable read concurrency limit
database::make_sstable_reader() creates a reader which will need to
obtain a semaphore permit when invoked. Therefore, each read may
create at most one such reader in order to be guaranteed to make
progress. If the reader tries to create another reader, that may
deadlock (or for non-system tables, timeout), if enough number of such
readers tries to do the same thing at the same time.

Avoid the problem by dropping previous reader before creating a new
one.

Refs #2644.

Message-Id: <1501152454-4866-1-git-send-email-tgrabiec@scylladb.com>
2017-07-27 13:58:20 +03:00
Vlad Zolotarov
e98adb13d5 service::storage_service: initialize auth and tracing after we joined the ring
Initialize the system_auth and system_traces keyspaces and their tables after
the Node joins the token ring because as a part of system_auth initialization
there are going to be issues SELECT and possible INSERT CQL statements.

This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.

Fixes #2273

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
2017-07-27 10:54:36 +02:00
Amnon Heiman
a71b9e498a database: remove latency from the system table
This patch remove the latency histograms from the system table, it also
extend the already existing exclusion to all system keyspaces.

It also uses the new get_histogram API to set a minimal bucket size to
100 microseconds.
2017-07-27 11:41:15 +03:00
Amnon Heiman
1b05f23d12 estimated histogram: return a smaller histogram
The current histogram contains 91 buckets, this is a very high
resolution with a high upper limit.

To reduce traffic passed, between scylla and the prometheus, this patch
generate a smaller histogram.

It limit the number of buckets (16 by default), set a lower limit to the
lowest bucket, and uses 2 as the bucket coeficient.

Highest empty buckets will not be reported.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

estimated histogram
2017-07-27 11:41:10 +03:00
Tomasz Grabiec
e9fc0b0491 Merge "Some fixes for performance regressions in perf_fast_forward" from Paweł
These patches contain some minor fixes for performance regression reported
by perf_fast_forward after partial cache was merged. The solution is still
far from perfect, there is one case that still has 30% degradation, but
there is some improvement so there is no reason to hold these changes back.

Refs #2582.

Some numbers:
before - before cache changes were merged
(555621b537)

cache - at the commit that introduced the partial cache
(9b21a9bfb6)

after - recent master + this series
(based on e988121dbb)

Differences are shown relative to "before".

Testing effectiveness of caching of large partition, single-key slicing reads:
Large partitions, range [0, 500000], populating cache
  before      cache      after
 1636840    1013688    1234606
              -38%        -25%

Large partitions, range [0, 500000], reading from cache
  before      cache      after
 2012615    3076812    3035423
               +53%       +51%

Testing scanning small partitions with skips.
reading small partitions (skip 0)
 before      cache      after
 227060     165261     200639
              -27%       -11%

skipping small partitions (skip 1)
 before      cache      after
  29813      27312      38210
               -8%       +28%

Testing slicing small partitions:
slicing small partitions (offset 0, read 4096)
 before      cache      after
 195282     149695     180497
              -23%        -8%

* https://github.com/pdziepak/scylla.git perf_fast_forward-regression/v3:
  sstables: make sure that fill_buffer() actually fills buffer
  mutation_merger: improve handling of non-deferring fill_buffer()s
  partition_snapshot_row_cursor: avoid apply() in single-version cases
  sstables: introduce decorated_key_view
  ring_position_comparator: accept sstables::decorated_key_view
  sstable: keep a pre-computed token in summary_entry
  sstables: cache token in index entries
  index_reader: advance_and_check_if_present() use index_comparator
  ring_position_comparator: drop unused overloads
  cache_streamed_mutation: avoid moving clustering_row
  streamed_mutation: introduce consume_mutation_fragments_until()
  cache_streamed_mutation: use consumer based read_context reader
  rows_entry: make position() inlineable
  mutation_fragment: make destructor always_inline
  keys: introduce compound_wrapper::from_exploded_view()
  sstables: avoid copying key components
  compound_compat: explode: reserve some elements in a vector
  cache: short-circut static row logic if there are no static columns
  cache: use equality comparators instead of tri_compare
  sstables: avoid indirect calls to abstract_type::is_multi_cell()
2017-07-27 10:14:35 +02:00
Pekka Enberg
b80504188a docs/docker-hub: Mark '--experimental' as 2.0 feature
The '--experimental' flag appears in 2.0 so mark it as such in the user
documentation on Docker Hub.

Message-Id: <1501137703-29706-1-git-send-email-penberg@scylladb.com>
2017-07-27 10:28:25 +03:00
Duarte Nunes
85e85ec72e Don't catch polymorphic exceptions by value
It makes gcc a very sad compiler.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726172053.5639-2-duarte@scylladb.com>
2017-07-27 09:39:58 +03:00
Duarte Nunes
7536659cb5 CqlParser: Don't catch polymorphic exceptions by value
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726172053.5639-1-duarte@scylladb.com>
2017-07-27 09:39:57 +03:00
Tzach Livyatan
ea97b87205 Adding Scylla restart instructions
Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170725064719.31109-1-tzach@scylladb.com>
2017-07-27 09:38:49 +03:00
Vlad Zolotarov
9adabd1bc4 utils::loading_cache: add stop() method
loading_cache invokes a timer that may issue asynchronous operations
(queries) that would end with writing into the internal fields.

We have to ensure that these operations are over before we can destroy
the loading_cache object.

Fixes #2624

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501096256-10949-1-git-send-email-vladz@scylladb.com>
2017-07-26 21:28:49 +02:00
Duarte Nunes
50ad0003c6 db/schema_tables: Drop dropped columns when dropping tables
Fixes #2633

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726150228.2593-2-duarte@scylladb.com>
2017-07-26 18:41:28 +02:00
Duarte Nunes
3425403126 db/schema_tables: Store column_name in text form
As does Cassandra.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726150228.2593-1-duarte@scylladb.com>
2017-07-26 18:41:12 +02:00
Duarte Nunes
787308a96c cql3/tuples: Don't catch polymorphic exception by value
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726155740.3275-1-duarte@scylladb.com>
2017-07-26 19:28:35 +03:00
Asias He
515a744303 gossip: Fix nr_live_nodes calculation
We need to consider the _live_endpoints size. The nr_live_nodes should
not be larger than _live_endpoints size, otherwise the loop to collect
the live node can run forever.

It is a regression introduced in commit 437899909d
(gossip: Talk to more live nodes in each gossip round).

Fixes #2637

Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>
2017-07-26 16:48:30 +03:00
Paweł Dziepak
7b0f75c0d1 sstables: avoid indirect calls to abstract_type::is_multi_cell() 2017-07-26 14:38:27 +01:00
Paweł Dziepak
b4d1dea4a9 cache: use equality comparators instead of tri_compare
Equality comparator may be much cheaper than the fully fledged
trichotomic comparator, especially if the component types are byte order
equal but not byte order comparable.
2017-07-26 14:38:27 +01:00
Paweł Dziepak
2780555968 cache: short-circut static row logic if there are no static columns 2017-07-26 14:38:27 +01:00
Paweł Dziepak
4a0385e908 compound_compat: explode: reserve some elements in a vector
When we are exploding a compound key we know already that there is more
than one component, but we have no easy way of determining how many of
them are going to be there. Let's reserve space for a few elements so
that we avoid an excessive number of reallocations in case of
medium-sized keys.
2017-07-26 14:38:27 +01:00
Paweł Dziepak
28c105e4a7 sstables: avoid copying key components 2017-07-26 14:38:27 +01:00
Paweł Dziepak
6031b7e587 keys: introduce compound_wrapper::from_exploded_view() 2017-07-26 14:38:27 +01:00
Paweł Dziepak
c9ccd813ab mutation_fragment: make destructor always_inline
mutation_fragment destructor was already made inline-friendly by moving
most of the logic to a separate function. However, the compiler still is
quite reluctant to inline it in certain cases, so let's give it a
stronger hint.
2017-07-26 14:38:27 +01:00
Paweł Dziepak
43cce6c2f4 rows_entry: make position() inlineable 2017-07-26 14:38:27 +01:00
Paweł Dziepak
c2ec43f70b cache_streamed_mutation: use consumer based read_context reader 2017-07-26 14:38:21 +01:00
Paweł Dziepak
2066354de3 streamed_mutation: introduce consume_mutation_fragments_until()
consume_mutation_fragments_until() is a consumer based interface that
avoids indirect calls and continuation overhead present in the naive
streamed_mutation::operator() approach.
2017-07-26 14:37:20 +01:00
Paweł Dziepak
9bc6038ff3 cache_streamed_mutation: avoid moving clustering_row
clustering_row can stores quite a lot of data internally which makes its
move constructor not exactly cheap.
If possible it is better to move mutation_fragment around as it keeps
everything externally. This also avoids some cases when clustering row
would be extracted from mutation_fragment only to be made to create
another mutation_fragment later.
2017-07-26 14:36:37 +01:00
Paweł Dziepak
68e57a742f ring_position_comparator: drop unused overloads 2017-07-26 14:36:37 +01:00
Paweł Dziepak
960a140880 index_reader: advance_and_check_if_present() use index_comparator 2017-07-26 14:36:37 +01:00
Paweł Dziepak
dc7bad9a50 sstables: cache token in index entries
When a sstable reader is fast forwarded some index entries may be read
(and compared) multiple times. This patch makes sure that once a token
is computed we keep it around and reuse if the entry is accessed again.
2017-07-26 14:36:37 +01:00
Paweł Dziepak
bfb7b56c74 sstable: keep a pre-computed token in summary_entry
Each sstable index lookup involves a binary search in the summary and
each time a partition key of summary entry is compared with anything its
token needs to be calculated.
Since we keep summary in the memory all the time it is better to also
keep the tokens around.
2017-07-26 14:36:36 +01:00
Paweł Dziepak
fe7eba7f06 ring_position_comparator: accept sstables::decorated_key_view
ring_position_comparator has overloads for comparing ring_positions as
well as sstables::key_view. In the case of the latter it needs to
compute the token of the key. However, the sstable layer could cache
some tokens so let's allow the comparator callers to provide it
directly.
2017-07-26 14:36:36 +01:00
Paweł Dziepak
31d7cfdefb sstables: introduce decorated_key_view 2017-07-26 14:36:36 +01:00
Paweł Dziepak
722c56f3f2 partition_snapshot_row_cursor: avoid apply() in single-version cases 2017-07-26 14:36:36 +01:00
Paweł Dziepak
e145ee6bb8 mutation_merger: improve handling of non-deferring fill_buffer()s
It is possible that a call to fill_buffer() will return an immediately
ready future. This patch avoids uncontrolled recursion in case when all
merged streamed mutation do not defer ini fill_buffer() and also
optimises for non-deferring case by avoiding some of the logic.
2017-07-26 14:36:36 +01:00
Paweł Dziepak
e0a04cb7fe sstables: make sure that fill_buffer() actually fills buffer
streamed_mutation::impl::fill_buffer() is supposed to either push
mutation fragments to the buffer or set EOS flag. However, it was
possible that mp_row_consumer would return proceed::no if a skip was
needed without satisfying any of these conditions.
2017-07-26 14:36:36 +01:00
Pekka Enberg
e66635a885 Merge "Developer documentation improvements" from Jesse
"This patch series addresses some feedback from the preliminary
 HACKING.md, adds some new content, and updates the README file with
 some quick-start information."

* 'jhk/better_hacking/v3' of github.com:hakuch/scylla:
  README.md: Add quick-start section and defer to `HACKING.md`
  HACKING.md: `CMakeLists.txt` for analysis works for other IDEs too
  HACKING.md: Add details and examples for unit tests
  HACKING.md: Add section for project dependencies
  HACKING.md: Describe releases and tags
  HACKING.md: Re-work "building" section, including memory needs
  HACKING.md: Update ccache recommendations
  HACKING.md: Update "Contributing" URL
2017-07-26 16:25:58 +03:00
Duarte Nunes
e988121dbb schema_builder: Replace type when re-dropping column
Fixes #2634

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725183933.5311-1-duarte@scylladb.com>
2017-07-26 13:26:29 +02:00
Duarte Nunes
64fcf0c642 alter_table_statement: Allow collection columns to replace normal ones
Fixes #2632

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725183811.5155-1-duarte@scylladb.com>
2017-07-26 13:24:03 +02:00
Duarte Nunes
1622847c1d perf/perf_fast_forward: Don't pass non-pod to varargs function
Passing a Non-POD object to variadic functions is unsupported.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726094756.22867-1-duarte@scylladb.com>
2017-07-26 11:48:22 +01:00
Duarte Nunes
9c831b4e97 schema: Remove unnecessary print
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725174000.71061-1-duarte@scylladb.com>
2017-07-26 12:01:51 +02:00
Duarte Nunes
472f32fb06 tests/schema_change_test: Add test case for add+drop notification
Reproduces #2616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-2-duarte@scylladb.com>
2017-07-26 11:59:48 +02:00
Duarte Nunes
33e18a1779 db/schema_tables: Consider differing dropped columns
If a node is notified of a schema change where the schema's dropped
columns have changes, that node will miss the changes to the dropped
columns. A scenario where this can happen is where a column c is
dropped, then added as a different typed, and then dropped again, with
a node n having seen the first drop and being notified of the
subsequent add and drop.

Fixes #2616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-1-duarte@scylladb.com>
2017-07-26 11:59:34 +02:00
Jesse Haber-Kucharsky
d6c0138576 README.md: Add quick-start section and defer to HACKING.md 2017-07-25 17:58:00 -04:00
Jesse Haber-Kucharsky
d06bccf857 HACKING.md: CMakeLists.txt for analysis works for other IDEs too 2017-07-25 17:57:55 -04:00
Jesse Haber-Kucharsky
9c2390e1a4 HACKING.md: Add details and examples for unit tests 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
488839dd15 HACKING.md: Add section for project dependencies 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
14d03d7548 HACKING.md: Describe releases and tags 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
6e8bfdbb3f HACKING.md: Re-work "building" section, including memory needs 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
64acb41305 HACKING.md: Update ccache recommendations 2017-07-25 17:57:43 -04:00
Jesse Haber-Kucharsky
4fe767de31 HACKING.md: Update "Contributing" URL
The old page results in a 404 error.
2017-07-25 17:46:45 -04:00
Paweł Dziepak
295689d16f db: include counter writes on leader in metrics
Counters write path on leader is completely different than on any other
replica (non-leaders share write path between counters and regular
columns). This patch makes sure that counter writes performed on leader
are added to appropriate metrics.
Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>
2017-07-25 18:31:43 +02:00
Tomasz Grabiec
18be42f71a Merge fixes related to row cache from Raphael
* git@github.com:raphaelsc/scylla.git row_cache_fixes:
  db: atomically synchronize cache with changes to the snapshot
  db: refresh row cache's underlying data source after compaction
2017-07-25 15:34:32 +02:00
Paweł Dziepak
79a1ad7a37 tests/row_cache: test queries with no clustering ranges
Reproducer for #2604.
Message-Id: <20170725131220.17467-3-pdziepak@scylladb.com>
2017-07-25 15:29:17 +02:00
Paweł Dziepak
1ea507d6ae tests: do not overload the meaning of empty clustering range
Empty clustering key range is perfectly valid and signifies that the
reader is not interested in anything but the static row. Let's not
make it mean anything else.
Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>
2017-07-25 15:28:12 +02:00
Paweł Dziepak
6572f38450 cache: fix aborts if no clustering range is specified
cache_streamed_mutation assumed that at least one clustering range was
specified. That was wrong since the readers are allowed to query just
for a static row (e.g. counter update that modifies only static
columns).

Fixes #2604.
Message-Id: <20170725131220.17467-1-pdziepak@scylladb.com>
2017-07-25 15:27:48 +02:00
Amnon Heiman
1f5a9ecc40 scylla-housekeeping: support patches releases
To support both version and patch release, the version server now returns
a patchversion parameter that include the latest minor version's patch
release.

The housekeeping should return a separate message if the current
minor version is not with the latest patch release, and a message if the version was
changed.

For example, if a user is using version 1.6.1 it should get a warning
that he need to update if 1.6.2 is available and in addition a warning it
should upgrade if version 1.7 is out.

Examples:
$ scylla-housekeeping version --version 1.6.2
Your current Scylla release is 1.6.2, while the latest patch release is 1.6.4, and the latest minor release is 1.7.2 (recommended)

$ scylla-housekeeping version --version 1.7.1
You current Scylla release is  1.7.1 while the latest patch release is 1.7.2 is available, update for the latest bug fixes

$ scylla-housekeeping version --version 1.7.1
You current Scylla release is 1.7.1 while the latest patch release is 1.7.2, update for the latest bug fixes and improvements

Fixes #1972

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170725095455.6450-1-amnon@scylladb.com>
2017-07-25 13:12:18 +03:00
Raphael S. Carvalho
637f3bfa50 db: refresh row cache's underlying data source after compaction
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remains opened, wasting
disk space.
The fix is to refresh underlying data source in row cache.

Fixes #2570.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:49:11 -03:00
Raphael S. Carvalho
e3ad676433 db: atomically synchronize cache with changes to the snapshot
updates to cache and snapshot (i.e. sstable set) aren't synchronized, so
it may happen that cache update for memtable flush will use wrong snapshot
version, and that violates cache invariant of each partition entry only
reflecting one snapshot.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:45:05 -03:00
Avi Kivity
c21bb5ae05 tests: fix sstable_datafile_test build with boost 1.55
Boost 1.55 accidentally removed support for "range for" on
recursive_directory_iterator (previous and latter versions do
support it). Use old-style iteration instead.

Message-Id: <20170724080128.8824-1-avi@scylladb.com>
2017-07-24 11:20:12 +03:00
Avi Kivity
f75b578607 Update ami submodule
* dist/ami/files/scylla-ami 5dfe42f...2bd1481 (1):
  > Enable support for experimental CPU controller in i3 instances
2017-07-24 10:26:52 +03:00
Tomasz Grabiec
60678f0e8a ring_position: Optimize contruction from r-value referenceces of decorated_key
Message-Id: <1500650171-26291-1-git-send-email-tgrabiec@scylladb.com>
2017-07-24 10:25:14 +03:00
Tomasz Grabiec
136d205855 mutation_partition: Always mark static row as continuous when no static columns
To avoid unnecessary cache misses after static columns are added.

Message-Id: <1500650057-26036-1-git-send-email-tgrabiec@scylladb.com>
2017-07-24 10:23:35 +03:00
Tomasz Grabiec
714d609605 database: Fix reversed order of keyspace and table names in a log message
Message-Id: <1500649623-25377-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 17:10:17 +02:00
Tomasz Grabiec
059779eea6 gdb: Fix 'scylla ptr' reporting large object pages as free
The 'free' attribute is not updated for all pages belonging to a large
object, so we can't use it to determine if the page is allocated or
not. More reliable way is to check if it belongs to any free span.
Message-Id: <1500648094-20039-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:41 +02:00
Tomasz Grabiec
29a82f5554 schema_registry: Keep unused entries around for 1 second
This is in order to avoid frequent misses which have a relatively high
cost. A miss means we need to fetch schema definition from another
node and in case of writes do a schema merge.

If the schema is kept alive only by the incoming request, then it
will be forgotten immediately when the request is done, and the next
request using the same schema version will miss again.

Refs #2608.
Message-Id: <1500632447-10104-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:37 +02:00
Tomasz Grabiec
ecc85988dd legacy_schema_migrator: Don't snapshot empty legacy tables
Otherwise we will create a new (empty) snapshot each time we boot.
Message-Id: <1500573920-31478-2-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:31 +02:00
Tomasz Grabiec
408cea66cd database: Allow disabling auto snapshots during drop/truncate
Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:29 +02:00
Duarte Nunes
937fe80a1a Merge 'Fix possible inconsistency of table schema version' from Tomasz
"Fixes issues uncovered in longevity test (#2608).

Main problem is that due to time drift scylla_tables.version column
may not get deleted on all nodes doing the schema merge, which will
make some nodes come up with different table schema version than others.

The inconsistency will not heal because scylla_tables doesn't
take part in the schema sync. This is fixed by the last patch.

This will cause nodes to constantly try to sync the schema, which under
some conditions triggers #2617."

* tag 'tgrabiec/fix-table-schema-version-inconsistency-v1' of github.com:scylladb/seastar-dev:
  schema_tables: Add scylla_tables to ALL
  schema: Make schema_mutations equality consistent with digest
  schema_tables: Extract compact_for_schema_digest()
  schema_tables: Always drop scylla_tables::version
2017-07-21 16:55:23 +02:00
Tomasz Grabiec
65c64614aa schema_registry: Ensure schema_ptr is always synced on the other core
global_schema_ptr ensures that schema object is replicated to other
cores on access. It was replicating the "synced" state as well, but
only when the shard didn't know about the schema. It could happen that
the other shard has the entry, but it's not yet synced, in which case
we would fail to replicate the "synced" state. This will result in
exception from mutate(), which rejects attempts to mutate using an
unsynced schema.

The fix is to always replicate the "synced" state. If the entry is
syncing, we will preemptively mark it as synced earlier. The syncing
code is already prepared for this.

Refs #2617.
Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:54:47 +02:00
Duarte Nunes
7eecda3a61 schema: Support compaction enabled attribute
Fixes #2547

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170721132206.3037-1-duarte@scylladb.com>
2017-07-21 15:38:45 +02:00
Amnon Heiman
e345d05ebe system_keyspace: Use paging for get compaction history
there could be a lot of compactions when querying for compaction
history.

This patch changes the query to use paging. It would collect all results
when returning to the caller.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-07-20 18:17:49 +03:00
Vlad Zolotarov
9086c643a6 service::storage_proxy: add a trace points pair in the SELECT replica flow
Add two trace points: at the beginning and at the end of the replica flow on the
replica shard.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499961542-16263-1-git-send-email-vladz@scylladb.com>
2017-07-20 16:44:25 +02:00
Amnon Heiman
08c81427b9 Add paging for internal queries
Usually, internal queries are used for short queries. Sometimes though,
like in the case of get compaction history, there could be a large
amount of results. Without paging it will overload the system.

This patch adds the ability to use paging internally.

Using paging will be done explicitely, all the relevant information
would be store in an internal_query_state, that would hold both the
paging state but also the query so consecutive calls can be made.

To use paging use the query method with a function.

The function gets beside a statement and its parameters a function that
will be used for each of the returned rows.

For example if qp is a query_processor:

qp.query("SELECT * from system.compaction_history", [] (const cql3::untyped_result_set::row& row) {
  ....
  // do something with row
  ...
  return stop_iteration::no; // keep on reading
});

Will run the function on each of the compaction history table rows.

To stop the iteration, the function can return stop_iteration::yes.
2017-07-20 17:43:51 +03:00
Tomasz Grabiec
2bc549f426 Merge perf_fast_forward enhancements from Paweł
* https://github.com/pdziepak/scylla.git perf_fast_forward_improvements/v1:
  perf_fast_forward: move global state to global scope
  perf_fast_forward: move tests groups to separate functions
  perf_fast_forward: allow running only selected test groups
  perf_fast_forward: use consumer interface for reading
    streamed_mutation
2017-07-20 16:41:29 +02:00
Tomasz Grabiec
ed2388da2c schema_tables: Add scylla_tables to ALL
So that scylla_tables takes part in the digest and in mutations sent
as part of schema sync. Otherwise inconsistencies in scylla_tables
will not heal.

Refs #2608.
2017-07-20 15:47:10 +02:00
Tomasz Grabiec
78ff728795 schema: Make schema_mutations equality consistent with digest
Digest only looks like live values, ignoring deletion
information. Equality should be consistent with that, so that schemas
considered equal do not trigger the alter path unnecessarily.
2017-07-20 15:47:10 +02:00
Tomasz Grabiec
6adbe61e2f schema_tables: Extract compact_for_schema_digest() 2017-07-20 15:47:10 +02:00
Tomasz Grabiec
1b85c316bf schema_tables: Always drop scylla_tables::version
It can happen that due to time drift between nodes, the incoming
"version" cell will have higher timestamp than api::new_timestamp().
In such case the column would not be dropped and would cause version
mismatch between nodes.

Ensure it's always covered by using max of current time and cell's
timestamp.

Refs #2608.
2017-07-20 15:47:10 +02:00
Takuya ASADA
2bf16c6e8a dist/debian: add --no-clean option to skip building pbuilder .tgz image
By default build_deb.sh destroys all previous build image to make sure we don't
have environment dependent issue, but it's takes time to build distribution root
image (.tgz in pbuilder) from scratch.

--no-clean option is for skipping create .tgz stage, use previously built image,
to make build time shorter.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1500542094-12946-1-git-send-email-syuu@scylladb.com>
2017-07-20 15:37:00 +03:00
Calle Wilund
91f314e54c duration.cc: Fix static assert
static_assert(cond) is C++17 only
Message-Id: <1500373227-12025-1-git-send-email-calle@scylladb.com>
2017-07-20 13:14:51 +02:00
Paweł Dziepak
823fb5e9d8 perf_fast_forward: use consumer interface for reading streamed_mutation
Using streamed_mutation::operator() is undesirable as it introduces an
indirect call and a continuation overhead for each emitted mutation
fragment. Consumer interface is the preferred method of reading streamed
mutations.
2017-07-20 11:02:53 +01:00
Paweł Dziepak
d184508d7b perf_fast_forward: allow running only selected test groups 2017-07-20 11:02:31 +01:00
botond
884928c511 install-dependencies.sh: Fix ubuntu dependencies
Remove dependencies section from README.md, point to the
install-dependencies.sh script instead.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7f51f17a743a82d68b7d4a279b066ffe55fe0379.1500540523.git.bdenes@scylladb.com>
2017-07-20 12:00:20 +03:00
Paweł Dziepak
a18a36c94b perf_fast_forward: move tests groups to separate functions 2017-07-20 09:26:42 +01:00
Paweł Dziepak
3fd4f9c1c7 perf_fast_forward: move global state to global scope
All test perf_fast_forward test cases currently live in the main
function. This patch moves the state they rely on to a global scope
so that it will be easier to extract these tests to individual
functions.
2017-07-20 09:26:42 +01:00
Avi Kivity
c5ee62a6a4 Merge "restrict background writers with scheduling groups" from Glauber
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.

The said maximum is recommended to be set  50 % - it is default
disabled, but can be adjusted through a configuration option until we
are able to auto-tune this.

The second patch in this series provides a preview on how such auto-tune
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.

Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."

* 'memtable-controller-v5' of https://github.com/glommer/scylla:
  simple controller for memtable/streaming writer shares.
  restrict background writers to 50 % of CPU.
2017-07-20 10:58:53 +03:00
Takuya ASADA
c441b1604a dist/redhat: use EPEL's ragel for CentOS
Since ragel added on EPEL, drop self-built package and use EPEL one.

See #2441

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170719170500.18515-1-syuu@scylladb.com>
2017-07-20 10:36:57 +03:00
Calle Wilund
7a583585a2 system_keyspace: Make sure "system" is written to keyspaces (visible)
Fixes #2514

Bug in schema version 3 update: We failed to write "system" to the
schema tables. Only visible on an empty instance of course.

Message-Id: <1500469809-23546-2-git-send-email-calle@scylladb.com>
2017-07-19 16:18:56 +03:00
Calle Wilund
247c36e048 system_schema: Fix remaining places not handing two system keyspaces
Some places remained where code looked directly at
system_keyspace::NAME to determine iff a ks is
considered special/system/protected. Including
schema digest calculation.

Export "is_system_keyspace" and use accordingly.

Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
2017-07-19 16:18:45 +03:00
Duarte Nunes
1daf1bc4bb Merge 'Revert back to 1.7 schema layout in memory' from Tomasz
"Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555)
by reverting back to using the old layout in memory and thus also
in across-node requests. We still use the new v3 layout in schema
tables (needed by drivers and external tools). Translations happen
when converting to/from schema mutations."

* tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev:
  schema: Revert back to the 1.7 layout of static compact tables in memory
  schema: Use v3 column layout when converting to/from schema mutations
  schema: Encapsulate column layout translations in the v3_columns class
2017-07-19 12:52:52 +02:00
Avi Kivity
d5aba779d4 Merge "streaming error handling improvement" from Asias
"This series improves the streaming error handling so that when one side of the
streaming failed, it will propagate the error to the other side and the peer
will close the failed session accordingly. This removes the unnecessary wait and
timeout time for the peer to discover the failed session and fail eventually.

Fix it by:

- Use the complete message to notify peer node local session is failed
- Listen on shutdown gossip callback so that we can detect the peer is shutdown
  can close the session with the peer

Fixes #1743"

* tag 'asias/streaming/error_handling_v2' of github.com:cloudius-systems/seastar-dev:
  streaming: Listen on shutdown gossip callback
  gms: Add is_shutdown helper for endpoint_state class
  streaming: Send complete message with failed flag when session is failed
  streaming: Handle failed flag in complete message
  streaming: Do not fail the session when failed to send complete message
  streaming: Introduce send_failed_complete_message
  streaming: Do not send complete message when session is successful
  streaming: Introduce the failed parameter for complete message
  streaming: Remove unused session_failed function
  streaming: Less verbose in logging
  streaming: Better stats
2017-07-19 11:18:09 +03:00
Amos Kong
2bdcad5bc3 scylla_raid_setup: fix syntax error
/usr/lib/scylla/scylla_raid_setup: line 132: syntax error
near unexpected token `fi'

Fixes #2610

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <af3a5bc77c5ba2b49a8f48a5aaa19afffb787886.1500430021.git.amos@scylladb.com>
2017-07-19 11:10:29 +03:00
Duarte Nunes
ab72132cb1 view_schema_test: Retry failed queries
Due to the asynchronous nature of view update propagation, results
might still be absent from views when we query them. To be able to
deterministically assert on view rows, this patch retries a query a
bounded number of times until it succeeds.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718212646.2958-1-duarte@scylladb.com>
2017-07-19 09:59:44 +02:00
Duarte Nunes
115ff1095e db/view: Use view schema for view pk operations
Instead of base schema.

Fixes #2504

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718190703.12972-1-duarte@scylladb.com>
2017-07-19 09:59:34 +02:00
Tomasz Grabiec
a9237c1666 schema: Revert back to the 1.7 layout of static compact tables in memory
We are using C* 3.x compatible layout in schema tables but want to
keep using the 1.7 layout in memory for compatibility during rolling
upgrade. This patch switches the schema and schema_builder classes
back to the old layout. Translation of layout happens when converting
to/from schema mutations.

Notable changes:

 1) Includes a revert of commit 6260f31e08
    "thrift: Update CQL mapping of static CFs".

 2) Brings back the "default_validation_class" schema attribute. In v3
    it can be dervied from column definitions, but in v2 it can't, so
    we have to store it.

 3) legacy_schema_migrator and schema_builder don't have to do
    conversions to v3, this is now handled by the v3_columns
    class. schema_builder works with the same layout as schema, that
    is v2.

 4) Includes a revert of commit 66991a7ccb
    "v3 schema test fixes"

Fixes #2555.
2017-07-19 09:52:15 +02:00
Tomasz Grabiec
dc2dc056a4 schema: Use v3 column layout when converting to/from schema mutations 2017-07-19 09:52:15 +02:00
Tomasz Grabiec
dc463ef644 schema: Encapsulate column layout translations in the v3_columns class 2017-07-19 09:52:15 +02:00
Avi Kivity
bfae5c7bac Merge "Time window compaction strategy support" from Raphael
"Time window strategy was introduced to address several limitations of
date tiered strategy. In addition, its options are much easier to reason
about, basically just window size and window unit.
TWCS will work to keep only one sstable in each window. So the only real
optimization needed is to align partition key to the window.
Size tiered strategy is used to reduce write amplification when compacting
the incoming window.

For more details: https://issues.apache.org/jira/browse/CASSANDRA-9666

Fixes #1432."

* 'twcs_v2' of github.com:raphaelsc/scylla:
  tests: add tests for time window compaction strategy
  compaction: wire up time window compaction strategy
  compaction/twcs: override default values with options in schema
  sstables: implement time window compaction strategy
  sstables: import TimeWindowCompactionStrategy.java
2017-07-19 10:22:53 +03:00
Duarte Nunes
3bfcf47cc6 types: Implement hash() for collections
This patch provides a rather trivial implementation of hash() for
collection types.

It is needed for view building, where we hold mutations in a map
indexed by partition keys (and frozen collection types can be part of
the key).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718192107.13746-1-duarte@scylladb.com>
2017-07-19 09:52:56 +03:00
Raphael S. Carvalho
c55c63f213 tests: add tests for time window compaction strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:37 -03:00
Raphael S. Carvalho
7ecedac222 compaction: wire up time window compaction strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:37 -03:00
Raphael S. Carvalho
01886c23a8 compaction/twcs: override default values with options in schema
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:37 -03:00
Raphael S. Carvalho
206d30c52a sstables: implement time window compaction strategy
For more details, https://issues.apache.org/jira/browse/CASSANDRA-9666

Fixes #1432.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:35 -03:00
Glauber Costa
c9a529ebee simple controller for memtable/streaming writer shares.
This patch introduces a simple controller that will adjust memtables CPU
shares, trying to keep it around the soft limit: if we start going below
it means we're too fast (unless we are idle) and shares are adjusted
downwards. If we start going above it means we're too fast and shares
are adjusted upwards.

I have tested this extensively in a single-CPU setup with various
CPU-bound workloads while tracking virtual dirty and the results are
good, with virtual dirty fluctuating only slightly, somewhere within the
desired range.

Exceptions to this are:
1) when the load is very light - the idle system goes faster, and that's
   ok
2) when the load is very high - as foreground requests dominate we can't
   flush fast enough and hit the hard limit. However, in such scenarios
   the memtable shares do hit its maximum, and the results are no worse
   than they are right now and this will only be fixed by CPU-limiting the
   actual requests.

This feature can be disabled with a config option - that is scheduled to
go away as we acquire more confidence in this. When the feature is
disabled, all background writers (streaming, compaction, memtables) will
share the same scheduling group, with static quotas.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:47 -04:00
Glauber Costa
4f01ec0910 restrict background writers to 50 % of CPU.
In scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background process, which are less so.

The most important background processes we have during normal write
workload operations are memtable writes and sstable compactions. Those
processes are quite CPU-intensive, and left unchecked will easily
dominate the CPU. Lower values of task-quota usually help, as it will
force those processes to preempt more, but aren't enough to guarantee
good isolation. We have seen boxes with good NVMe storage having their
throughput reduced to less than half of the original baseline in a short
dive down for the duration of a compaction.

In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances as this very
moment. Thankfully, those processes live in a seastar::thread, that
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.

The goal of this patch is to wrap background processes together in a
scheduling group, and assign to such group 50 % of our CPU power; the
remainder being left to foreground processes.

While we pride ourselves in dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands - and let's face it, leaving background processes run wild is not
adaptative either. Every workload would benefit most from a different
value for such shares, but 50 % is as fair as it gets if we really need
static partitining in the mean time.

As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:33 -04:00
Asias He
d6cebd1341 streaming: Listen on shutdown gossip callback
When a node shutdown itself, it will send a shutdown status to peer
nodes. When peer nodes receives the shtudown status update, they are
supposed to close all the sessions with that node becasue the node is
shutdown, no need to wait and timeout, then fail the session.

This change can speed up the closing of sessions.
2017-07-19 10:11:06 +08:00
Asias He
ed7e6974d5 gms: Add is_shutdown helper for endpoint_state class
It will be used by streaming manager to check if a node is in shutdown
status.
2017-07-19 10:11:05 +08:00
Asias He
aa87429e67 streaming: Send complete message with failed flag when session is failed
To notify peer node the session is failed.
2017-07-19 10:11:05 +08:00
Asias He
03b838705c streaming: Handle failed flag in complete message
Fail the current session if the failed flag is on in the complete
message handler.
2017-07-19 10:11:05 +08:00
Asias He
12d18cfab4 streaming: Do not fail the session when failed to send complete message
Since the complete message is not mandatary, no point to fail the session
in case failed to send the complete message.
2017-07-19 10:11:04 +08:00
Asias He
ca5248cd58 streaming: Introduce send_failed_complete_message
Currently, send_complete_message is not used. We will use it shortly in
case the local session is failed. Send a complete message with failed
flag to notify peer node that the session is failed so that peer can
close the session. This can speed up the closing of failed session.

Also rename it to send_failed_complete_message.
2017-07-19 10:11:04 +08:00
Raphael S. Carvalho
2686e84792 sstables: import TimeWindowCompactionStrategy.java
it will be later converted to C++. Imported from latest scylla-
tools-java repository. Checked that it doesn't lack anything.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-18 18:26:17 -03:00
Takuya ASADA
49b01e764a dist/common/scripts/scylla_prepare: stop running hugeadm when it's posix mode
A user reported scylla-server.service does not able to run on their cloud instance, because of hugeadm.
(hugeadm says the kernel does not support huge pages.)
We don't need it for posix mode, so move it in dpdk mode.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1500367219-8728-1-git-send-email-syuu@scylladb.com>
2017-07-18 16:39:16 +03:00
Tomasz Grabiec
63caa58b70 Merge "Drop mutations that raced with truncate" from Duarte
Instead of retrying, just drop mutations that raced with a truncate.

* git@github.com:duarten/scylla.git truncate-reorder/v1:
  database: Rename replay_position_reordered_exception
  database: Drop mutations that raced with truncate
2017-07-18 12:53:36 +02:00
Asias He
f21cb75cdb streaming: Do not send complete message when session is successful
The complete_message is not needed and the handler of this rpc message
does nothing but returns a ready future. The patch to remove it did not
make into the Scylla 1.0 release so it was left there.
2017-07-18 15:29:42 +08:00
Duarte Nunes
d9fa3bf322 thrift: Fail when mixed CFs are detected
Fixes #2588

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170717222612.7429-1-duarte@scylladb.com>
2017-07-18 10:21:33 +03:00
Asias He
0ba4e73068 streaming: Introduce the failed parameter for complete message
Use this flag to notify the peer that the session is failed so that the
peer can close the failed session more quickly.

The flag is used as a rpc::optional so it is compatible use old
version of the verb.
2017-07-18 11:24:31 +08:00
Asias He
7599c1524d streaming: Remove unused session_failed function
It is never used. Get rid of it.
2017-07-18 11:22:09 +08:00
Asias He
caad7ced23 streaming: Less verbose in logging
Now, we will have large number of small streaming. Make the
not very important logging message debug level.
2017-07-18 11:17:09 +08:00
Asias He
d0dffd7346 streaming: Better stats
Log the number of bytes streamed and streaming bandwidth summary in the same line with session
complete message.
2017-07-18 11:17:09 +08:00
Avi Kivity
64ef7aa5e4 Merge seastar upstream
* seastar 867b7c7...a14d667 (1):
  > tls: remove unneeded lambda captures
2017-07-17 19:30:59 +03:00
Duarte Nunes
6b464da67d schema: Get rid of regular_columns_by_name
They are unused.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170717103635.6473-2-duarte@scylladb.com>
2017-07-17 12:52:41 +02:00
Asias He
adc5f0bd21 gossip: Implement the missing fd_max_interval_ms and fd_initial_value_ms option
It is useful for larger cluster with larger gossip message latency. By
default the fd_max_interval_ms is 2 seconds which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in larger cluster, the gossip message udpate
interval can be larger than 2 seconds.

Fixes #2603.

Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
2017-07-17 13:29:16 +03:00
Duarte Nunes
13caccf1cf Merge 'Fixes around migration to v3 schema tables' from Tomasz
branch 'tgrabiec/schema-migration-fixes' of github.com:scylladb/seastar-dev:
  schema: Use proper name comparator
  legacy_schema_migrator: Properly migrate non-UTF8 named columns
  schema_tables: Store column_name in text form
  legacy_schema_migrator: Migrate columns like Cassandra
  schema_builder: Add factory method for default_names
  legacy_schema_migrator: Simplify logic
  thrift: Don't set regular_column_name_type
  schema: Use proper column name type for static columns
  schema: Fix column_name_type() for static compact tables
  schema: Introduce clustering_column_at()
  thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type
  partition_slice_builder: Use proper column's type instead of regular_column_name_type()
2017-07-17 11:16:52 +02:00
Tomasz Grabiec
34dae0588c schema: Use proper name comparator
This replaces column_definition::name_comparator, which incorrectly
assumes that names are always utf8, with name_compare moved from
schema::rebuild() and unifies usages.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
7e54290d38 legacy_schema_migrator: Properly migrate non-UTF8 named columns
Currently migrator assumed all columns are utf8-named, which
doesn't have to be the case for static compact tables.

Refs #2597.

Due to #2573, we can assume that Scylla wasn't used with non-utf8
column names, and that old names are always in textual form.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
60a76efd37 schema_tables: Store column_name in text form
That's how it is stored by Cassandra.

Refs #2597.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
61229a7536 legacy_schema_migrator: Migrate columns like Cassandra
This fixes generation of synthetic columns for static compact tables.
Current code always generates synthetic clustering column with utf8
type and synthetic regular column with bytes type (in schema_builder).
That's fine when creating a new CQL table, but not when migrating
existing tables created via thrift API.

Fixes #2584.

This also migrates empty compact value columns like Cassandra
does. Such columns are present in compact tables without regular
columns, e.g.:

  create table test (k int, ck int, primary key (k, ck)) with compact storage;

They should be migrated to a synthetic regular column with
empty_type type and a non-empty name.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
49e21b3b8e schema_builder: Add factory method for default_names 2017-07-17 09:40:06 +02:00
Tomasz Grabiec
6dc299c27a legacy_schema_migrator: Simplify logic
The expression "is_dense.value_or(true)" is always true inside the if,
so drop it.

This allows us to drop temporary calulated_is_dense.

We can also get rid of one of the if branches by extracting
builder.set_is_dense() outside.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
3987e9be31 thrift: Don't set regular_column_name_type
Regular columns are always utf8 after f5dae826ce.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
b919c50d21 schema: Use proper column name type for static columns
After f5dae826ce, static columns not
always have utf8 column names. For static compact tables it's
determined by the cell name comparator type, which is equal to the
type of the synthetic clustering column.

Caused various errors with static thrift tables with non-utf8
comparator.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
f685f7f8a1 schema: Fix column_name_type() for static compact tables
Introduced in f5dae826ce.
2017-07-17 09:40:06 +02:00
Tomasz Grabiec
84536a4a75 schema: Introduce clustering_column_at() 2017-07-17 09:40:06 +02:00
Tomasz Grabiec
9ed958a1eb thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type 2017-07-17 09:40:06 +02:00
Tomasz Grabiec
9768036d61 partition_slice_builder: Use proper column's type instead of regular_column_name_type() 2017-07-17 09:40:06 +02:00
Avi Kivity
c51001b598 Merge seastar upstream
* seastar b812cee...867b7c7 (1):
  > rpc: start server's send loop only after protocol negotiation

Fixes #2600.
2017-07-16 19:36:31 +03:00
Avi Kivity
a5bd854019 Merg seastar upstream
* seastar 844bcfb...b812cee (1):
  > Update dpdk submodule

Fix #2595 (again).
2017-07-16 17:00:48 +03:00
Avi Kivity
d9c64ef737 tests: move tmpdir to /tmp
Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk
with write-back cache, and forever on a spinning disk.
Message-Id: <20170716081653.10018-1-avi@scylladb.com>
2017-07-16 11:55:08 +02:00
Avi Kivity
9116dd91cb tests: copy the sstable with an unknown component to the data directory
We will be creating links to those sstable's files, and those don't work
if the data directory and the test sstable are on different devices.

Copying the files to the same directory fixes the problem.
Message-Id: <20170716090405.14307-1-avi@scylladb.com>
2017-07-16 11:55:00 +02:00
Duarte Nunes
2c711922cc database: Drop mutations that raced with truncate
Mutations that race with a truncate can just be dropped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-16 00:08:05 +02:00
Duarte Nunes
0825c9c805 database: Rename replay_position_reordered_exception
Rename replay_position_reordered_exception to
mutation_reordered_with_truncate_exception for more precision, since
this is the only situation where this exception can be thrown.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-16 00:08:05 +02:00
Avi Kivity
e87ab54bfc Merge seastar upstream
* seastar ff34c42...844bcfb (1):
  > Update dpdk submodule

Fixes #2595.
2017-07-15 19:17:05 +03:00
Tomasz Grabiec
caa62f7f05 Merge "Fixes for memtable flushing and replay positions" from Duarte
We don't ensure mutations are applied in memory following the order of their
replay positions. A memtable can thus be flushed with replay position rp,
with the new one being at replay position rp', where rp' < rp. This breaks
an intrinsic assumption in the code, which this series addresses.

Fixes #2074

branch memtable-flush/v3 of git@github.com:duarten/scylla.git:
  commitlog: Always flush latest memtable
  column_family: More precise count of switched memtables
  column_family: Fix typo in pending_tasks metric name
  column_family: More precise count of pending flushes
  dirty_memory_manager: Remove unnecessary check from flush_one()
  column_family: Don't rely on flush_queue to guarantee flushes finished
  column_family: Don't bother closing the flush_queue on stop()
  column_family: Stop using flush_queue
  column_family: Remove outdated comment about the flush_queue
  memtable: Stop tracking the highest flushed rp
2017-07-14 11:39:37 +02:00
Avi Kivity
162d9aa85d tests: fix view_schema_test with clang
Clang is happy to create a vector<data_value> from a {}, a {1, 2}, but not a {1}.
No doubt it is correct, but sheesh.

Make the data_value explicit to humor it.
Message-Id: <20170713074315.9857-1-avi@scylladb.com>
2017-07-14 12:24:27 +03:00
Duarte Nunes
b8235f2e88 storage_proxy: Preserve replica order across mutations
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.

There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.

This patch fixes this and enforces the expected order.

Fixes #2531
Fixes #2593

Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
2017-07-14 12:11:22 +03:00
Duarte Nunes
5f24e9a4a5 memtable: Stop tracking the highest flushed rp
Since we no longer enforce that mutations are applied in memory
ordered by their replay_positions, the way the highest_flush_rp is
being tracked is no longer correct.

The invariant it was used to maintain no longer exists, so we can get
rid of it together with the assertion on the highest_flush_rp on
flush().

Fixes #2074

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:06 +02:00
Duarte Nunes
22a53a52a1 column_family: Remove outdated comment about the flush_queue
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:05 +02:00
Duarte Nunes
003941cd95 column_family: Stop using flush_queue
Since commitlog ordering requirements have been relaxed, we now keep
the set of replay_positions seen by a memtable in a set, which we then
use to clean up relevant segments in the commitlog. This means that
the guarantees provided by the flush_queue are no longer necessary.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:00 +02:00
Duarte Nunes
7e6fe5895e column_family: Don't bother closing the flush_queue on stop()
When stopping a column family we issue a flush(), for which we wait.
Since writes are supposed to have stopped coming in, and also new
flush requests, there's no need to call and wait for the flush_queue
to be closed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:58 +02:00
Duarte Nunes
a1f4536ffb column_family: Don't rely on flush_queue to guarantee flushes finished
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. So, use a phased_barrier() to
ensure that calling flush() returns a future that completes when all
flushes up to that point have finished.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:58 +02:00
Duarte Nunes
1b320496e2 dirty_memory_manager: Remove unnecessary check from flush_one()
We don't need to check whether a memtable is empty in flush_one(), as
that must be checked later, during the actual sealing.

The condition itself is rare and is checked already after the potentially
contented semaphore has been acquired.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:57 +02:00
Duarte Nunes
59bdaed02b column_family: More precise count of pending flushes
This patch ensures we update the count of pending flushes in the same
place as we update the stats across column families, which is more
correct since it only accounts for actual flushes and not those of
empty memtables or that have been coalesced together.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
3e27c335a9 column_family: Fix typo in pending_tasks metric name
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
a11724c6e1 column_family: More precise count of switched memtables
The memtable_switch_count metric is supposed to count the number of
times a flush has resulted in the memtable being switched out, but we
were incrementing the count regardless of whether we tried to flush an
empty memtable or two or more flushes were coalesced into one. This
patch fixes this by moving the metric to where the memtable is
actually switched.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
bca1b19ce9 commitlog: Always flush latest memtable
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. When flushing commit log segments,
ensure we flush the latest memtable.

Refs #2074

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Paweł Dziepak
ec689b2fe1 Merge "utils: minor fixes in the loading_cache class" from Vlad
"This series aims to fix the "serving invalid (old) values" issue in the
loading_cache (issue #2590) by arming the timer with a period that equals
min(expire, refresh).

We are still trying to optimize the main case where 'expire' is
significantly longer than 'refresh' period.

We don't want to add any additional logic in the fast path and this
series gives the immediate solution for the issue above while not adding
any additional CPU cycle to the fast path."

* 'loading_cache_short_expired-v2' of https://github.com/vladzcloudius/scylla:
  utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
  utils::loading_cache: make a timer use a loading_cache_clock_type clock as a source
2017-07-13 16:58:53 +01:00
Vlad Zolotarov
45e23d8090 db::config: fix the permissions cache related parameters description
Make the descriptions of permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries more readable and more related to what they really
do.

Mention the none-zero value requirement for the permissions_update_interval_in_ms and
the permissions_cache_max_entries when the permissions cache is enabled.

Adjust the parameters description in the scylla.yaml too.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499957053-31792-1-git-send-email-vladz@scylladb.com>
2017-07-13 16:00:40 +01:00
Vlad Zolotarov
76ea74f3fd utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
Arm the timer with a period that is not greater than either the permissions_validity_in_ms
or the permissions_update_interval_in_ms in order to ensure that we are not stuck with
the values older than permissions_validity_in_ms.

Fixes #2590

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-07-13 10:48:59 -04:00
Vlad Zolotarov
121e3c7b8f utils::loading_cache: make a timer use a loading_cache_clock_type clock as a source
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-07-13 10:42:12 -04:00
Tomasz Grabiec
30ec4af949 legacy_schema_migrator: Fix calculation of is_dense
Current algorithm was marking tables with regular columns not named
"value" as not dense, which doesn't have to be the case. It can be
either way.

It should be enough to look at clustering components. If there is a
clustering key, then table is dense if and only if all comparator
components belong to the clustering key.

If there is no clustering key, then if there are any regular columns
we're sure it's not dense.

Fixes #2587.

Message-Id: <1499877777-7083-1-git-send-email-tgrabiec@scylladb.com>
2017-07-13 17:28:09 +03:00
Jesse Haber-Kucharsky
8fa47b74e8 cql: Add definition of underlying type for durations
Cassandra 3.10 added the `duration` type [1], intended to manipulate date-time
values with offsets (for example, `now() - 2y3h`).

The full implementation of the `duration` type in Scylla requires support
for version 5 of the binary protocol, which is not yet available.

In the meantime, this patch patch adds the implementation of the underlying type
for the eventual `duration` type. Included is also the ported test suite from
the reference implementation and additional tests.

Related to #2240.

[1] https://issues.apache.org/jira/browse/CASSANDRA-11873

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <b1e481da103efee82106bf31f261c5a1f4f8d9ca.1499885803.git.jhaberku@scylladb.com>
2017-07-13 17:26:00 +03:00
Tomasz Grabiec
54953c8d27 gdb: Fix "scylla columnfamilies" command
Broken in 0e4d5bc2f3.

Message-Id: <1499951956-26206-1-git-send-email-tgrabiec@scylladb.com>
2017-07-13 16:33:32 +03:00
Amnon Heiman
45b3e8cd11 query_options: Allows creating query_options from query_options
query_options object cannot be changed after it was created. For
internal uses, like internal query paging, it is needed to create a new
object based on some of the data from an existing one with a new paging
state.

This patch adds a constructor from a unique_ptr and paging state.

using unique_ptr behave similar to move modify constructor.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-07-13 14:02:11 +03:00
Duarte Nunes
3df6777b9b database: Load views after loading tables
Since base tables no longer look for their views, we need to parse
base tables first so that when we add a view we can fetch and connect
it to its base table.

When announcing view table mutations to other nodes we always include
the base table mutations, so there's no need to expect a view being
added before its base table.

Found out while testing view building.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170712172115.2960-1-duarte@scylladb.com>
2017-07-13 11:14:02 +02:00
Avi Kivity
4704a78332 tests: remove bad constexpr in sstable_datafile_test
std::ceil() is not constexpr.

Found by clang.
2017-07-12 17:14:13 +03:00
Avi Kivity
67a5e10218 Merge seastar upstream
* seastar a2be7a4...ff34c42 (3):
  > tls: Wrap all IO in semaphore (Fixes #2575)
  > tests/lowres_clock_test.cc: Declare helper static
  > tests/lowres_clock_test.cc: fix compilation error for older GCC
2017-07-12 10:19:55 +03:00
Avi Kivity
a397889c81 Merge "Preserve table schema digest on schema tables migration" from Tomasz
"Currently new nodes calculate digests based on v3 schema mutations,
which are very different from v2 mutations. As a result they will
use schemas with different table_schema_version that the old nodes.
The old nodes will not recognize the version and will try to request
its definition. That will fail, because old nodes don't understand
v3 schema mutations.

To fix this problem, let's preserve the digests during migration,
so that they're the same on new and old nodes. This will allow
requests to proceed as usual.

This does not solve the problem of schema being changed during
the rolling upgrade. This is not allowed, as it would bring the
same problem back.

Fixes #2549."

* tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev:
  tests: Add test for concurrent column addition
  legacy_schema_migrator: Set digest to one compatible with the old nodes
  schema_tables: Persist table_schema_version
  schema_tables: Introduce system_schema.scylla_tables
  schema_tables: Simplify read_table_mutations()
  schema_tables: Resurrect v2 read_table_mutations()
  system_keyspace: Forward-declare legacy schemas
  legacy_schema_migrator: Take storage_proxy as dependency
2017-07-11 17:22:42 +03:00
Raphael S. Carvalho
7dbfebb7dc lcs: remove conditional limit for partial sort
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170711140241.11023-2-raphaelsc@scylladb.com>
2017-07-11 17:18:32 +03:00
Raphael S. Carvalho
ebb5dafef0 lcs: remove useless filter for demotion procedure
there's no way a sstable from a level higher than N+1 will be in
set of candidates that can be either level N or level N + 1.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170711140241.11023-1-raphaelsc@scylladb.com>
2017-07-11 17:18:31 +03:00
Botond Dénes
33bc62a9cf Fix crash in the out-of order restrictions error msg composition
Use name of the existing preceeding column with restriction
(last_column) instead of assuming that the column right after the
current column already has restrictions.
This will yield an error message that is different from that of
Cassandra, albeit still a correct one.

Fixes #2421

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>
2017-07-11 17:15:33 +03:00
Gleb Natapov
f88723e739 storage_proxy: pass pending_endpoints by reference instead of by value
This makes lifetime of dead_endpoints object more clear and move() also has its price.

Message-Id: <20170710084549.GX2324@scylladb.com>
2017-07-11 16:52:21 +03:00
Gleb Natapov
739dd878e3 consistency_level: report less live endpoints in Unavailable exception if there are pending nodes
DowngradingConsistencyRetryPolicy uses live replicas count from
Unavailable exception to adjust CL for retry, but when there are pending
nodes CL is increased internally by a coordinator and that may prevent
retried query from succeeding. Adjust live replica count in case of
pending node presence so that retried query will be able to proceed.

Fixes #2535

Message-Id: <20170710085238.GY2324@scylladb.com>
2017-07-11 16:51:56 +03:00
Avi Kivity
3b7fde18cf Merge "improvements for leveled strategy manifest" from Raphael
"most of changes are to improve maintainability of the strategy but
the ones that are introduced by the following patches:
lcs: do not check if level 0 can be promoted twice
lcs: remove quadratic behavior from L0 compaction
lcs: partially sort candidates that will be trimmed
lcs: only demote sstable from level higher than target one"

* 'lcs_improvements_2' of github.com:raphaelsc/scylla:
  lcs: only demote sstable from level higher than target one
  lcs: improve indentation for get_overlapping_starved_sstables
  lcs: improve indentation for get_compaction_candidates
  lcs: partially sort candidates that will be trimmed
  lcs: remove quadratic behavior from L0 compaction
  lcs: introduce private interface
  lcs: make some member functions static
  lcs: make some functions const qualified
  lcs: remove add method
  lcs: extract code for higher levels compaction from get_candidates_for
  lcs: simplify code to get candidates for higher levels
  lcs: extract round-robin heuristic for even distribution of keys into function
  lcs: update outdated comments for level 0 compaction
  lcs: improve worth_promoting_L0_candidates interface
  lcs: do not check if level 0 can be promoted twice
  lcs: extract code for level 0 compaction from get_candidates_for
2017-07-11 16:38:50 +03:00
Paweł Dziepak
5aa523aaf9 transport: send correct type id for counter columns
CQL reply may contain metadata that describes columns present in the
response including the information about their type.

However, Scylla incorrectly reports counter types as bigint. The
serialised format of counters and bigint is exactly the same, which
could explain why the problem hasn't been noticed earlier but it is a
bug nevertheless.

Fixes #2569.
Message-Id: <20170711130520.27603-1-pdziepak@scylladb.com>
2017-07-11 16:21:49 +03:00
Tomasz Grabiec
6d53cb7ab5 tests: Add test for concurrent column addition 2017-07-11 14:52:23 +02:00
Tomasz Grabiec
f5909ec515 legacy_schema_migrator: Set digest to one compatible with the old nodes
Calculate and set digest using v2 mutations so that digests are the
same before and after migration. This is neeed so that no schema
definition exchange is required during rolling upgrade.

Fixes #2549.
2017-07-11 14:52:23 +02:00
Tomasz Grabiec
5b69d99bf8 schema_tables: Persist table_schema_version
When migrating schema tables from v2 to v3, mutations underlying
table schema will change, and so will their digest. However, we want
the digest to be the same on new nodes as on the old nodes, because
schema exchange is not possible between the two nodes, so they
must to request schema definitions from each other.

The solution is to make the digest persistable, so that it sticks to
given table schema, surviving both migration and node restarts. On
migration from v2, the digest will be calculated from v2 mutations, so
it will be the same on new and old nodes.
2017-07-11 14:52:23 +02:00
Tomasz Grabiec
cdf5b67522 schema_tables: Introduce system_schema.scylla_tables
It will be used to store Scylla spcific table metadata.  We cannot
store it in the standard "tables" table for compatibility reasons -
Cassandra will fail to read schema if it encounteres columns it is not
expecting.
2017-07-11 14:52:23 +02:00
Tomasz Grabiec
cdcdf4772f schema_tables: Simplify read_table_mutations() 2017-07-11 14:52:23 +02:00
Tomasz Grabiec
6e62bc77f1 schema_tables: Resurrect v2 read_table_mutations() 2017-07-11 14:52:23 +02:00
Tomasz Grabiec
4b5818a404 system_keyspace: Forward-declare legacy schemas 2017-07-11 14:52:23 +02:00
Tomasz Grabiec
8624edc0fa legacy_schema_migrator: Take storage_proxy as dependency
Will be needed to query for mutations.
2017-07-11 14:52:23 +02:00
Raphael S. Carvalho
6aa2e5be17 lcs: only demote sstable from level higher than target one
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:42 -03:00
Raphael S. Carvalho
53b72b473e lcs: improve indentation for get_overlapping_starved_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:40 -03:00
Raphael S. Carvalho
3639b48d7b lcs: improve indentation for get_compaction_candidates
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:38 -03:00
Raphael S. Carvalho
5a8b8a6ccb lcs: partially sort candidates that will be trimmed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:37 -03:00
Raphael S. Carvalho
8334086441 lcs: remove quadratic behavior from L0 compaction
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being relatively
low to max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.

Fixes #2432.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:35 -03:00
Raphael S. Carvalho
80f1dca328 lcs: introduce private interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:33 -03:00
Raphael S. Carvalho
bc71f97116 lcs: make some member functions static
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:32 -03:00
Raphael S. Carvalho
f4b733efe4 lcs: make some functions const qualified
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:28 -03:00
Raphael S. Carvalho
ede0ee16b2 lcs: remove add method
Its code can be inlined because no one besides create() calls it

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:26 -03:00
Raphael S. Carvalho
00ef528e5b lcs: extract code for higher levels compaction from get_candidates_for
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:25 -03:00
Raphael S. Carvalho
a46b73c401 lcs: simplify code to get candidates for higher levels
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:19 -03:00
Raphael S. Carvalho
e954af0f0f lcs: extract round-robin heuristic for even distribution of keys into function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:15 -03:00
Raphael S. Carvalho
3c0028d921 lcs: update outdated comments for level 0 compaction
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:07 -03:00
Raphael S. Carvalho
62607ba36a lcs: improve worth_promoting_L0_candidates interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:00 -03:00
Raphael S. Carvalho
c1e42f6528 lcs: do not check if level 0 can be promoted twice
can_promote flag will be used to carry info about whether or not
level 0 can promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:34:49 -03:00
Raphael S. Carvalho
887aab4ae7 lcs: extract code for level 0 compaction from get_candidates_for
I will split code for higher levels compaction into functions first
before putting it into its own function too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:34:41 -03:00
Pekka Enberg
ed3c62704e transport/server: Kill unused functions
Message-Id: <1499773755-27920-1-git-send-email-penberg@scylladb.com>
2017-07-11 14:57:54 +03:00
Glauber Costa
780a6e4d2e change task quota's default
The default of 2ms is somewhat arbitrary. Now that we have a lot more
mileage deploying Scylla applications in production it does sound not
only arbitrary, but high.

In particular, it is really hard to achieve 1ms latencies in the face of
CPU-heavy workloads with it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1499354495-27173-1-git-send-email-glauber@scylladb.com>
2017-07-11 13:50:39 +03:00
Avi Kivity
7147808797 Merge seastar upstream
* seastar 89cc97c...a2be7a4 (3):
  > configure.py: verifies boost version
  > pkg-config: Eliminate spaces in include path arguments
  > allow applications to override task-quota-ms
2017-07-11 13:50:06 +03:00
Botond Dénes
f18f724f1c Generate an error when CONTAINS is used on a non-collection column
Fixes #2255

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <517bb6268ac213aed9a1def231614c2e88f77c9f.1499764183.git.bdenes@scylladb.com>
2017-07-11 11:30:49 +02:00
Tomasz Grabiec
310d2a54d2 legacy_schema_migrator: Use separate joinpoint instance for each table
Otherwise we may deadlock, as explained in commit 5e8f0efc8:

Table drop starts with creating a snapshot on all shards. All shards
must use the same snapshot timestamp which, among other things, is
part of the snapshot name. The timestamp is generated using supplied
timestamp generating function (joinpoint object). The joinpoint object
will wait for all shards to arrive and then generate and return the
timestamp.

However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
Message-Id: <1499762663-21967-1-git-send-email-tgrabiec@scylladb.com>
2017-07-11 11:21:45 +02:00
Avi Kivity
7b4412c3ce Revert "Merge "improvements for leveled strategy manifest" from Raphael"
This reverts commit 43a3e718e6, reversing
changes made to 3813e94b0a. It contains some
unrelated commits.
2017-07-11 11:12:53 +03:00
Avi Kivity
43a3e718e6 Merge "improvements for leveled strategy manifest" from Raphael
"most of changes are to improve maintainability of the strategy but
the ones that are introduced by the following patches:
lcs: do not check if level 0 can be promoted twice
lcs: remove quadratic behavior from L0 compaction
lcs: partially sort candidates that will be trimmed
lcs: only demote sstable from level higher than target one"

* 'lcs_improvements' of github.com:raphaelsc/scylla: (21 commits)
  lcs: only demote sstable from level higher than target one
  lcs: improve indentation for get_overlapping_starved_sstables
  lcs: improve indentation for get_compaction_candidates
  lcs: partially sort candidates that will be trimmed
  lcs: remove quadratic behavior from L0 compaction
  lcs: introduce private interface
  lcs: make some member functions static
  lcs: make some functions const qualified
  lcs: remove add method
  lcs: extract code for higher levels compaction from get_candidates_for
  lcs: simplify code to get candidates for higher levels
  lcs: extract round-robin heuristic for even distribution of keys into function
  lcs: update outdated comments for level 0 compaction
  lcs: improve worth_promoting_L0_candidates interface
  lcs: do not check if level 0 can be promoted twice
  lcs: extract code for level 0 compaction from get_candidates_for
  dist/offline_installer: add --skip-setup option to offline installer
  dist/offline_installer/debian: install python-minimal package before installing scylla deps
  migration_manager: Give empty response to schema pulls from incompatible nodes
  migration_manager: Don't pull schema from incompatible nodes
  ...
2017-07-11 11:08:12 +03:00
Botond Dénes
3813e94b0a Add Cql.tokens and KDevelop project files to .gitignore
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <ae4935d2ac0c92287022f677c3e66757c0861e13.1499753032.git.bdenes@scylladb.com>
2017-07-11 10:21:00 +03:00
Botond Dénes
61c5c2a175 transport: Fix accept typo in debug log message
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d2f9269f25ace6579a6fbe6b99f4da60a05beac8.1499753306.git.bdenes@scylladb.com>
2017-07-11 09:16:35 +03:00
Raphael S. Carvalho
8b9686e621 lcs: only demote sstable from level higher than target one
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 16:10:42 -03:00
Raphael S. Carvalho
0d0699e06e lcs: improve indentation for get_overlapping_starved_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 16:01:31 -03:00
Raphael S. Carvalho
cda2b18f83 lcs: improve indentation for get_compaction_candidates
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:55:43 -03:00
Raphael S. Carvalho
ca1c6fd9ca lcs: partially sort candidates that will be trimmed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:45:26 -03:00
Raphael S. Carvalho
28ebe1807f lcs: remove quadratic behavior from L0 compaction
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being relatively
low to max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.

Fixes #2432.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:42:28 -03:00
Raphael S. Carvalho
0392dc5d23 lcs: introduce private interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:13:56 -03:00
Raphael S. Carvalho
dd9c9341be lcs: make some member functions static
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:13:55 -03:00
Raphael S. Carvalho
408a7f902a lcs: make some functions const qualified
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:47 -03:00
Raphael S. Carvalho
7cba6548e2 lcs: remove add method
Its code can be inlined because no one besides create() calls it

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:47 -03:00
Raphael S. Carvalho
0a9fcc6202 lcs: extract code for higher levels compaction from get_candidates_for
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:47 -03:00
Raphael S. Carvalho
8709365d84 lcs: simplify code to get candidates for higher levels
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:47 -03:00
Raphael S. Carvalho
1a9fc835a0 lcs: extract round-robin heuristic for even distribution of keys into function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:40 -03:00
Raphael S. Carvalho
258ed0afbd lcs: update outdated comments for level 0 compaction
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:30 -03:00
Raphael S. Carvalho
97b5cf94d8 lcs: improve worth_promoting_L0_candidates interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:30 -03:00
Raphael S. Carvalho
8f418e9864 lcs: do not check if level 0 can be promoted twice
can_promote flag will be used to carry info about whether or not
level 0 can promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:30 -03:00
Raphael S. Carvalho
6785d83c02 lcs: extract code for level 0 compaction from get_candidates_for
I will split code for higher levels compaction into functions first
before putting it into its own function too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-10 15:10:30 -03:00
Takuya ASADA
abd2b6bd6f dist/offline_installer: add --skip-setup option to offline installer
To use offline installer in non-interactive way, add option to skip scylla_setup (which run in interactive mode).

Fixes #2533

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
[ penberg: clean up diff noise ]
Message-Id: <1499387981-11814-1-git-send-email-syuu@scylladb.com>
2017-07-10 15:10:30 -03:00
Takuya ASADA
e1a2be28d2 dist/offline_installer/debian: install python-minimal package before installing scylla deps
To prevent dependency error, we need to install python-minimal manually.

Fixes #2553

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
[ penberg: clean up diff noise ]
Message-Id: <1499387587-9032-1-git-send-email-syuu@scylladb.com>
2017-07-10 15:10:30 -03:00
Tomasz Grabiec
fa17c2a59b migration_manager: Give empty response to schema pulls from incompatible nodes
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
2017-07-10 15:10:30 -03:00
Tomasz Grabiec
8e8a26ef1b migration_manager: Don't pull schema from incompatible nodes
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. It the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
2017-07-10 15:10:30 -03:00
Tomasz Grabiec
b2f52454b9 service: Advertise schema tables format version through gossip
Will be needed to inhibit schema exchange on per-peer basis.
2017-07-10 15:10:30 -03:00
Takuya ASADA
bf49dd8aa1 dist/offline_installer: add --skip-setup option to offline installer
To use offline installer in non-interactive way, add option to skip scylla_setup (which run in interactive mode).

Fixes #2533

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
[ penberg: clean up diff noise ]
Message-Id: <1499387981-11814-1-git-send-email-syuu@scylladb.com>
2017-07-10 16:02:11 +03:00
Takuya ASADA
1f97e5b3f4 dist/offline_installer/debian: install python-minimal package before installing scylla deps
To prevent dependency error, we need to install python-minimal manually.

Fixes #2553

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
[ penberg: clean up diff noise ]
Message-Id: <1499387587-9032-1-git-send-email-syuu@scylladb.com>
2017-07-10 16:01:26 +03:00
Avi Kivity
91221e020b Merge "Silence schema pull errors during upgrade from 1.7 to 2.0" from Tomasz
"Old and new nodes will advertise different schema version because
of different format of schema tables. This will result in attempts
to sync the schema by each of the node.

Currently this will result in scary error messages in logs about
sync failing due to not being able to find schema of given version.
It's benign, but may scare users. It the future incompatibilities
could result in more subtle errors. Better to inhibit it completely."

* 'tgrabiec/fix-schema-pull-errors-during-upgrade' of github.com:cloudius-systems/seastar-dev:
  migration_manager: Give empty response to schema pulls from incompatible nodes
  migration_manager: Don't pull schema from incompatible nodes
  service: Advertise schema tables format version through gossip
2017-07-10 14:04:04 +03:00
Pekka Enberg
8112d7c5c0 idl: Fix frozen_schema version numbers
The IDL changes will appear in 2.0 so fix up the version numbers.

Message-Id: <1499680669-6757-1-git-send-email-penberg@scylladb.com>
2017-07-10 14:02:20 +03:00
Avi Kivity
06b7ec6901 install-dependencies.sh: add snappy 2017-07-10 13:25:57 +03:00
Avi Kivity
7ddd322bce Add install-dependencies.sh
Easier to get started when a script installs all the build dependencies.
Message-Id: <20170710101657.12574-1-avi@scylladb.com>
2017-07-10 12:21:02 +02:00
Botond Dénes
e0d0f9f30c Make the CMakeLists.txt's IDE marker generic
To allow some other IDEs (e.g. KDevelop, QtCreator) to use the cmake
file in a convenient manner. Keep the existing CLIEN_IDE marker to
not break existing workflows.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <5ecf8c0e8a242cc8ebb0d803547bead4dadc38e2.1499667807.git.bdenes@scylladb.com>
2017-07-10 12:21:02 +02:00
Botond Dénes
66cbc45321 Add text(sstring) version of count, max and min functions
Fixes #2459

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6abb97f21c0caea8e36c7590b92a12d148195db.1499666251.git.bdenes@scylladb.com>
2017-07-10 09:06:15 +03:00
Tomasz Grabiec
72e01b7fe8 tests: commitlog: Check there are no segments left on disk after clean shutdown
Reproduces #2550.

Message-Id: <1499358825-17855-2-git-send-email-tgrabiec@scylladb.com>
2017-07-09 19:25:27 +03:00
Tomasz Grabiec
6555a2f50b commitlog: Discard active but unused segments on shutdown
So that they are not left on disk even though we did a clean shutdown.

First part of the fix is to ensure that closed segments are recognized
as not allocating (_closed flag). Not doing this prevents them from
being collected by discard_unused_segments(). Second part is to
actually call discard_unused_segments() on shutdown after all segments
were shut down, so that those whose position are cleared can be
removed.

Fixes #2550.

Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
2017-07-09 19:25:22 +03:00
Tomasz Grabiec
d33d29ad95 legacy_schema_migrator: Drop tables instead of truncate()+remove()
It achieves similar effect, but is safer than non-standard remove()
path. The latter was missing unregistration from compaction manager.

Fixes 2554.

Message-Id: <1499447165-30253-1-git-send-email-tgrabiec@scylladb.com>
2017-07-09 18:36:44 +03:00
Duarte Nunes
136accdbf6 database: Fix typos in metric descriptions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170709145522.19534-1-duarte@scylladb.com>
2017-07-09 18:35:17 +03:00
Raphael S. Carvalho
7f7758fb6f tests/sstable: make sstable_expired_data_ratio more robust
this change will stress histogram ability to return a good estimation
after merging keys such that it doesn't grow beyond size limit.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170708205713.5958-1-raphaelsc@scylladb.com>
2017-07-09 10:33:10 +03:00
Botond Dénes
4f6b2a1ff0 transport: Move "accept failed" message to the debug log
Fixes #2518

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <492ea8a916bb3b2427f6cc16a4f6eadadaa30b10.1499418234.git.bdenes@scylladb.com>
2017-07-08 10:59:03 +03:00
Takuya ASADA
09aeb2aabe dist/debian/pbuilderrc: merge Debian releases
Merge duplicated lines to simplified pbuilderrc.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499454242-3716-1-git-send-email-syuu@scylladb.com>
2017-07-08 10:56:54 +03:00
Tomasz Grabiec
07ed512060 migration_manager: Give empty response to schema pulls from incompatible nodes
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
2017-07-07 19:09:57 +02:00
Tomasz Grabiec
5f613d0527 migration_manager: Don't pull schema from incompatible nodes
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. It the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
2017-07-07 19:08:59 +02:00
Tomasz Grabiec
18a9e1762c service: Advertise schema tables format version through gossip
Will be needed to inhibit schema exchange on per-peer basis.
2017-07-07 19:07:59 +02:00
Piotr Jastrzebski
a4b6cfe8f0 row_cache: use continuity info in single partition queries
If a query requests for a single partition that is inside
a range that has already been queried, use the continuity info
and don't go to disk when it's not needed.

Fixes #2244.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <15bb3b5b03225e7402e3862da53b5e06d3f4fa74.1499345295.git.piotr@scylladb.com>
2017-07-07 10:29:19 +02:00
Piotr Jastrzebski
b950c59bbb row_cache: Fix wrong comment on continuity flag
This comment was stating exactly the opposite to the
truth. This is very misleading

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <79062a061e22ef4c4add24cbdf723cbfb5cda060.1499345295.git.piotr@scylladb.com>
2017-07-07 10:29:19 +02:00
Piotr Jastrzebski
70f4b23876 row_cache_test: Add test to reproduce issue 2544
This tests checks that cache should use continuity information
for single partition queries inside a range that has already been
queried.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2ebd03ff5366e554d520f86da8054e0b9eff4178.1499345295.git.piotr@scylladb.com>
2017-07-07 10:29:19 +02:00
Avi Kivity
ecda97edeb Merge seastar upstream
* seastar c848486...89cc97c (2):
  > future-utils: fix do_for_each exception reporting
  > core/thread: Fix unwind information for seastar threads
2017-07-06 17:28:29 +03:00
Jesse Haber-Kucharsky
4f838a82e2 Add guide for getting started with development ("hacking")
This change adds the start of what will hopefully be a continually evolving and
improving document for helping developers and contributors to get started with
Scylla development.

The first part of the document is general advice and information that is broadly
applicable.

The second part is an opinionated example of a particular work-flow and set of
tools. This is intended to serve as a starting point and inspire contributors to
develop their own work-flow.

The section on branching is marked "TODO" for now, and will be addressed by a
subsequent change.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <470a542a92aff20d6205fb94b3fb26168735ae6f.1499319310.git.jhaberku@scylladb.com>
2017-07-06 15:59:16 +03:00
Duarte Nunes
3dd0397700 wrapping_range: Fix lvalue transform()
Instead of copying and moving the bound, pass it by reference so the
transformer can decide whether it wants to copy or not. The only
caller so far doesn't want a copy and takes the value by reference,
which would be capturing a temporary value. Caught by the
view_schema_test with gcc7.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705210255.29669-1-duarte@scylladb.com>
2017-07-06 15:47:49 +03:00
Raphael S. Carvalho
ff50b57761 dist: fix spelling mistakes in dev-mode.conf
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170705202054.4614-1-raphaelsc@scylladb.com>
2017-07-06 15:08:17 +03:00
Botond Dénes
b1082641f9 Make sure keyspace strategy class is stored in qualified form
Even when it's provided in unqualified (short) form.
Fixes #767

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4379f8864843e64c097d432fd06129ce4025f100.1499322476.git.bdenes@scylladb.com>
2017-07-06 14:50:00 +03:00
Botond Dénes
c4277d6774 cql3: Add K_FROZEN and K_TUPLE to basic_unreserved_keyword
To allow the non-reserved keywords "frozen" and "tuple" to be used as
column names without double-quotes.

Fixes #2507

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <9ae17390662aca90c14ae695c9b4a39531c6cde6.1499329781.git.bdenes@scylladb.com>
2017-07-06 12:25:38 +03:00
Avi Kivity
a6d9cf09a7 build: fix excessive stack usage in CqlParser in debug mode
The state machines generated by antlr allocate many local variables per function.
In release mode, the stack space occupied by the variables is reused, but in debug
build, it is not, due to Address Sanitizer setting -fstack-reuse=none. This causes
a single function to take above 100k of stack space.

Fix by hacking the generated code to use just one variable.

Fixes #2546
Message-Id: <20170704135824.13225-1-avi@scylladb.com>
2017-07-05 23:05:26 +02:00
Duarte Nunes
d583ef6860 thrift/handler: Remove leftover debug artifacts
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705161156.2307-1-duarte@scylladb.com>
2017-07-05 19:57:07 +03:00
Takuya ASADA
6d0bd01e0f dist/offline_installer/redhat: enable EPEL repo before try to install makeself
To prevent yum install error, we need to enable EPEL repo before install
makeself.

Fixes #2508

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499196715-19710-1-git-send-email-syuu@scylladb.com>
2017-07-05 09:50:03 +03:00
Takuya ASADA
71624d7919 dist/common/scripts/scylla_raid_setup: prevent renaming MDRAID device after reboot
On Debian variants, mdadm.conf should placed at /etc/mdadm instead of /etc.
Also it seems we need update-initramfs to fix renaming issue.

Fixes #2502

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1499179912-14125-1-git-send-email-syuu@scylladb.com>
2017-07-04 18:07:20 +03:00
Avi Kivity
b1a0e37fcb Merge "Adjust row cache metrics for row granularity" from Tomasz
* tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev:
  row_cache: Switch _stats.hits/misses to row granularity
  row_cache: Rename num_entries() to partitions() for clarity
  row_cache: Track mispopulations also at row level
  row_cache: Track row insertions
  row_cache: Track row hits and misses
  row_cache: Make mispopulation counter also apply for continuity information
  row_cache: Add partition_ prefix to current counters
  misc_services: Switch to using reads_with[_no]_misses counters
  row_cache: Add metrics for operations on underlying reader
  row_cache: Add reader-related metrics
  row_cache: Remove dead code
2017-07-04 15:20:25 +03:00
Tomasz Grabiec
37d2b6b3c6 row_cache: Switch _stats.hits/misses to row granularity
Those are exported by the RESTful APIs called
"get_row_hits/get_row_misses" and reported by nodetool.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
62c76abf71 row_cache: Rename num_entries() to partitions() for clarity 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
60c2a86192 row_cache: Track mispopulations also at row level 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
94547db620 row_cache: Track row insertions 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
a58f2c8640 row_cache: Track row hits and misses 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
77b2a92ece row_cache: Make mispopulation counter also apply for continuity information 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
a5fdff2ac2 row_cache: Add partition_ prefix to current counters
In preparation for adding per-row counters.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
ae4b24db06 misc_services: Switch to using reads_with[_no]_misses counters
They better approximate the intended meaning than hits/misses, which
according to Gleb is whether a read did any I/O or not.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
6a22cbceaf row_cache: Add metrics for operations on underlying reader 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
5c7b6fc164 row_cache: Add reader-related metrics 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
be2e89d596 row_cache: Remove dead code 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
e720b317c9 row_cache: Restore update of concurrent_misses_same_key
It was lost in action in 6f6575f456.

Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>
2017-07-04 14:51:05 +03:00
Avi Kivity
66e56511d6 Merge "Use selective_token_range_sharder in repair" from Asias
"This series introduces selective_token_range_sharder and uses it in repair to
generate dht::token_range belongs to a specific shard."

* tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev:
  repair: Use selective_token_range_sharder
  tests: Add test_selective_token_range_sharder
  dht: Add selective_token_range_sharder
2017-07-04 14:14:33 +03:00
Asias He
b10e961a64 repair: Use selective_token_range_sharder
With this change, we ask all the shard to handle the ranges provided by
user and we use selective_token_range_sharder to split the ranges and
ignore the ranges do not belong to the current shard.
2017-07-04 18:46:19 +08:00
Asias He
2a794db61b tests: Add test_selective_token_range_sharder 2017-07-04 18:46:19 +08:00
Asias He
d835cf2748 dht: Add selective_token_range_sharder
It is like ring_position_range_sharder but it works with
dht::token_range. This sharder will return the ranges belong to a
selected shard.
2017-07-04 18:46:19 +08:00
Tomasz Grabiec
1d6fec0755 row_cache: Drop not very useful prefixes from metric names
This drops "total_opertaions_" and "objects_" prefixes. There is no
convention of adding them in other parts of the system, and they don't
add much value.

Fixes scylladb/scylla-grafana-monitoring#169.

Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>
2017-07-04 13:37:12 +03:00
Nadav Har'El
d95f908586 Fix test to use non-wrapping range
The test put a wrapping range into a non-wrapping range variable.
This was harmless at the time this test was written, but newer code
may not be as forgiving so better use a non-wrapping range as intended.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170704103128.29689-1-nyh@scylladb.com>
2017-07-04 13:36:29 +03:00
Avi Kivity
07b8adce0e sstables: fix use-after-free in read_simple()
`r` is moved-from, and later captured in a different lambda. The compiler may
choose to move and perform the other capture later, resulting in a use-after-free.

Fix by copying `r` instead of moving it.

Discovered by sstable_test in debug mode.
Message-Id: <20170702082546.20570-1-avi@scylladb.com>
2017-07-04 10:24:07 +02:00
Raphael S. Carvalho
7b777fe2e3 sstables/lcs: choose sstable with highest droppable tombstone ratio
Currently, lcs will choose, for tombstone compaction, sstable with
the lowest ratio from the ones which ratio is at least above threshold
(0.2 by default).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170703185633.6644-1-raphaelsc@scylladb.com>
2017-07-04 10:25:10 +03:00
Avi Kivity
bcf7867ac9 Merge "small fixes and cleanup for leveled strategy (part 2)" from from Raphael
* 'lcs_improvements_part_2' of github.com:raphaelsc/scylla:
  lcs: Match estimated tasks arithmetic to score in LCS
  lcs: prevent leveled_compaction_strategy.hh from being included more than once
  lcs: use vector instead for storing a level of sstables
  compaction: keep only one variant of size_tiered_most_interesting_bucket
  lcs: get rid of unused code in leveled_manifest
2017-07-04 10:10:53 +03:00
Raphael S. Carvalho
7606ffd744 lcs: Match estimated tasks arithmetic to score in LCS
Contains fix for CASSANDRA-8904.

Added TARGET_SCORE to get rid of magic number for target score which
is now used more than once.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:35:02 -03:00
Raphael S. Carvalho
dfb5463478 lcs: prevent leveled_compaction_strategy.hh from being included more than once
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:35:00 -03:00
Raphael S. Carvalho
db98ab6aaf lcs: use vector instead for storing a level of sstables
list is no longer needed because lcs no longer moves a sstable breaking
invariant at its level to level 0. Now lcs incrementally restores invariant
by compacting together first set of overlapping tables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:34:57 -03:00
Raphael S. Carvalho
b350352e6c compaction: keep only one variant of size_tiered_most_interesting_bucket
two variants of size_tiered_most_interesting_bucket existed to avoid copy,
but subsequent work will make lcs use vector for each level of sstables,
so let's only keep one variant.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:34:51 -03:00
Raphael S. Carvalho
5921600b95 lcs: get rid of unused code in leveled_manifest
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-04 03:34:34 -03:00
Nadav Har'El
d177ec05cb repair: further limit parallelism of checksum calculation
Repair today has a semaphore limiting the number of ongoing checksum
comparisons running in parallel (on one shard) to 100. We needed this
number to be fairly high, because a "checksum comparison" can involve
high latency operations - namely, sending an RPC request to another node
in a remote DC and waiting for it to calculate a checksum there, and while
waiting for a response we need to proceed calculating checksums in parallel.

But as a consequence, in the current code, we can end up with as many as
100 fibers all at the same stage of reading partitions to checksum from
sstables. This requires tons of memory, to hold at least 128K of buffer
(even more with read-ahead) for each of these fibers, plus partition data
for each. But doing 100 reads in parallel is pointless - one (or very few)
should be enough.

So this patch adds another semaphore to limit the number of checksum
*calculations* (including the read and checksum calculation) on each shard
to just 2. There may still be 100 ongoing checksum *comparisons*, in
other stages of the comparisons (sending the checksum requests to other
and waiting for them to return), but only 2 will ever be in the stage of
reading from disk and checksumming them.

The limit of 2 checksum calculations (per shard) applies on the repair
slave, not just to the master: The slave may receive many checksum
requests in parallel, but will only actually work on 2 at a time.

Because the parallelism=100 now rate-limits operations which use very little
memory, in the future we can safely increase it even more, to support
situations where the disk is very fast but the link between nodes has
very high latency.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170703151329.25716-1-nyh@scylladb.com>
2017-07-03 18:14:57 +03:00
Piotr Jastrzebski
80f08921c4 Make table_helper independent from trace_keyspace_helper
table_helper is a generic helper than can easily be used in other places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <11e46dbc1c90d0273a41c8144e6f6013e21efcdb.1499077818.git.piotr@scylladb.com>
2017-07-03 15:55:00 +03:00
Raphael S. Carvalho
972a0237ef database: restore indentation for cleanup_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-2-raphaelsc@scylladb.com>
2017-07-03 12:48:54 +03:00
Raphael S. Carvalho
b9d0645199 database: fix potential use-after-free in sstable cleanup
when do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, sstable object will be used after
freed because it was taken by ref and the container it lives in was
destroyed prematurely.

Let's fix it with a do_with, also making code nicer.

Fixes #2537.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
2017-07-03 12:48:53 +03:00
Avi Kivity
5883e85da3 Merge "improve maintainability of compaction strategies" from Raphael
"compaction_strategy.cc keeps the full implementation of size tiered,
major, and null strategies, and partial implementation of leveled
and date tiered strategies. It's a mess. In the future, we will also
need space for time window strategy. The file is hard to read and
maintain.
My goal here is to improve maintainability of the strategies by
putting each of them into its own header.

NOTE: No semantic change is introduced here."

* 'improve_compaction_strategy_maintainability' of github.com:raphaelsc/scylla:
  compaction_strategy: move dtcs to its existing header
  compaction_strategy: move lcs implementation to its own header
  compaction_strategy: move stcs implementation to its own header
  compaction_strategy: move compaction_strategy_impl to its own header
2017-07-03 11:39:30 +03:00
Takuya ASADA
0c81974bc4 dist/common/systemd: move scylla-server.service to be after network-online.target instead of network.target
To make sure start Scylla after network is up, we need to move from
network.target to network-online.target.

Fixes #2337

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493661832-9545-1-git-send-email-syuu@scylladb.com>
2017-07-03 10:01:21 +03:00
Asias He
b2a2fbcf73 repair: Do not store the failed ranges
The number of failed ranges can be large so it can consume a lot of memory.
We already logged the failed ranges in the log. No need to storge them
in memory.

Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com>
2017-07-03 10:00:25 +03:00
Takuya ASADA
1c35549932 dist/common/scripts/scylla_cpuscaling_setup: skip configuration when cpufreq driver doesn't loaded
Configuring cpufreq service on VMs/IaaS causes an error because it doesn't supported cpufreq.
To prevent causing error, skip whole configuration when the driver not loaded.

Fixes #2051

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498809504-27029-1-git-send-email-syuu@scylladb.com>
2017-07-03 09:59:56 +03:00
Takuya ASADA
e645b0fb13 dist/common/scripts: move EC2 configuration verification to 'scylla_ec2_check'
Currently we only have EC2 configuration verification on AMI, so move it to
/usr/lib/scylla and run it from scylla_setup, to make it usable for
non-AMI users.

Fixes #1997

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498811107-29135-1-git-send-email-syuu@scylladb.com>
2017-07-03 09:59:28 +03:00
Avi Kivity
6895f6e603 sstable_datafile_test: fix sstable_expired_data_ratio failure
A comment states that we want the file to be old enough, but sets
a timestamp of max(), which is in the future. This may have passed
because the conversion from numeric_limits<time_t>::max() to
db_clock::time_point is not well defined (their dynamic range is
different), so truncation may have converted the large number to a
low one.
Message-Id: <20170702082903.20879-1-avi@scylladb.com>
2017-07-02 20:22:51 +02:00
Avi Kivity
51b6066212 cql3: operation: correctly format error messages
Error messages incorrectly used the debug representation of the receiver,
rather than the text representation of the operation itself.

Fixes #113.
Message-Id: <20170701101325.3163-1-avi@scylladb.com>
2017-07-02 20:06:50 +02:00
Duarte Nunes
d157e4558a utils/log_histogram: Remove largest() function
It should never have existed in the first place, as there are no
legitimate callers and it can be misused.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170630095939.2429-1-duarte@scylladb.com>
2017-07-02 14:29:17 +03:00
Gleb Natapov
d23111312f main: wait for wait_for_gossip_to_settle() to complete during boot
Boot should not continue until a future returned by
wait_for_gossip_to_settle() is resolved.  Commit 991ec4a16 mistakenly
broke that, so restore it back. Also fix calls for supervisor::notify()
to be in the right places.

Message-Id: <20170702082355.GQ14563@scylladb.com>
2017-07-02 11:32:36 +03:00
Avi Kivity
5bc13e4454 Revert "Make table_helper independent from trace_keyspace_helper"
This reverts commit db5bf363d0. Causes
errors of the sort

    Exiting on unhandled exception: exceptions::invalid_request_exception
    (Keyspace 'system_traces' does not exist)
2017-07-02 11:30:51 +03:00
Avi Kivity
7c809917b6 compaction_manager: fix debug mode build (periodic_compaction_submission_interval)
Turn static constexpr variable into a function.
2017-07-01 19:34:46 +03:00
Avi Kivity
c2c69e003f compaction: fix build on debug mode (DEFAULT_TOMBSTONE_COMPACTION_INTERVAL)
Debug mode wants to allocate storage for a constexpr variable for some
reason. Turn it into a function.
2017-07-01 19:26:22 +03:00
Avi Kivity
59f649e2bc Revert "cql_server::do_accepts: modernize loop"
This reverts commit 37af493f6e. Connections
are not accepted and ^C does not work anymore.
2017-07-01 12:54:23 +03:00
Jesse Haber-Kucharsky
1100bb8a5b cql: Eagerly throw lexing and parsing exceptions
Previously, lexing and parsing errors were aggregated while CQL queries were
evaluated. Afterwards, the first collected error (if present) would be thrown as
an exception.

The problem was that when parsing and lexing errors were aggregated this way,
the parser would continue even in spite of errors like "no viable alternative".
Semantic actions attached to grammar rules would still execute, though with
variables that had not yet been initialized. This would crash Scylla.

This change modifies the error-handling strategy of CQL parsing. Rather than
aggregate errors, we throw an exception on the first error we encounter. This
ensures that grammar actions never execute unless there is a precise match.

One possible issue with this approach is that the generated C++ code from the
ANTLR grammar may not be exception-safe. I compiled Scylla in debug-mode with
ASan support and executed several erroneous CQL queries with `cqlsh`. No memory
leaks were reported.

Fixes #2466.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <db1f650a2bbb615b506d9015486eece45375a440.1498836703.git.jhaberku@scylladb.com>
2017-07-01 12:13:44 +03:00
Raphael S. Carvalho
69a9ad468c compaction_strategy: move dtcs to its existing header
Goal is to improve maintainability.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-30 03:50:09 -03:00
Raphael S. Carvalho
4d387475fe compaction_strategy: move lcs implementation to its own header
Goal is to improve maintainability.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-30 03:50:07 -03:00
Raphael S. Carvalho
4b46d286fd compaction_strategy: move stcs implementation to its own header
Goal is to improve maintainability.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-30 03:50:06 -03:00
Raphael S. Carvalho
0d9bb0da39 compaction_strategy: move compaction_strategy_impl to its own header
compaction_strategy.cc keeps the full implementation of size tiered,
major, and null strategies, and partial implementation of leveled
and date tiered strategies. It's a mess. In the future, we will also
need space for time window strategy. The file is hard to read and
maintain.
My goal here is to eventually improve maintainability of the
strategies by putting each of them into its own header.
This is the first step towards that goal.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-30 03:50:04 -03:00
Raphael S. Carvalho
9fa855e105 compaction_strategy: use duration type for default tombstone compaction interval
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630041838.20604-1-raphaelsc@scylladb.com>
2017-06-30 08:56:22 +03:00
Piotr Jastrzebski
db5bf363d0 Make table_helper independent from trace_keyspace_helper
table_helper is a generic helper than can easily be used in other places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <3e360a963d4a53de6d758ba8bada78fc572f001a.1498745600.git.piotr@scylladb.com>
2017-06-29 17:20:07 +03:00
Tomasz Grabiec
97005825bf row_cache: Fix compilation errors with gcc 5
Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>
2017-06-29 16:34:46 +03:00
Avi Kivity
6da9b6eb81 cql3: error_listener: add virtual destructor
Found by Eclipse.
Message-Id: <20170629063324.31309-1-avi@scylladb.com>
2017-06-29 10:51:20 +02:00
Avi Kivity
9298fea27b Merge seastar upstream
* seastar 0ab7ae5...c848486 (2):
  > build: export full cflags in pkgconfig file (Fixes #2439)
  > configure: Avoid putting tmp file on /tmp
2017-06-29 11:35:24 +03:00
Avi Kivity
fc966c0c4c Merge "tombstone removal compaction" from Raphael
"This feature is intended to make compaction more efficient at getting rid of
droppable tombstone and expired data wasting disk space. So far, people have
been dealing with it manually through major compaction.

With strategies other than date tiered, large sstables will be left untouched
for a long time even though it's all expired. Date tiered suffers from it when
mixing data with different TTL because it only includes for compaction sstable
that is fully expired.

sstables keeps as metadata a histogram which allows us to easily estimate
droppable data ratio from gc_before. sstables which droppable data ratio is
above 20% (default value for tombstone_threshold option) will be considered
candidates for the operation.

Like in C*, we will only do tombstone removal compaction when there's nothing
to compact in standard way. It would be interesting to trigger it too when
disk usage is above a given threshold, but I decided to leave this for later.

Fixes #2306."

* 'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla:
  tests: more testing for tombstone compaction options
  tests: basic tombstone compaction test for date tiered
  compaction/dtcs: add support for tombstone compaction
  tests: basic test of tombstone compaction with lcs
  compaction/lcs: add support for tombstone compaction
  tests: basic tombstone compaction test for size tiered
  compaction/stcs: add support for tombstone compaction
  tests: add test for estimation of droppable tombstone ratio
  sstables: introduce function to estimate droppable tombstone ratio
  compaction_manager: periodically submit cfs for compaction
  streaming_histogram: fix coding style
  tests: add streaming_histogram_test
  streaming_histogram: implement sum
  tests: add test for sstable with bad tombstone histogram
  sstables: discard bad streaming histogram for future use
  tests: add sstable tombstone histogram test
  streaming_histogram: fix update
  streaming_histogram: move it to utils
  streaming_histogram: do not limit it to be used by sstables
  sstables: update tombstone_histogram for cells with expiration time
2017-06-29 10:19:59 +03:00
Avi Kivity
1317c4a03e Update ami submodule
* dist/ami/files/scylla-ami f10db69...5dfe42f (1):
  > don't fetch perf from amazon repo
2017-06-29 09:38:48 +03:00
Raphael S. Carvalho
ab335c8085 tests: more testing for tombstone compaction options
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
ce4dc15a20 tests: basic tombstone compaction test for date tiered
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
f76ece5349 compaction/dtcs: add support for tombstone compaction
Unlike other strategies, dtcs has tombstone compaction disabled by
default due to:
- deletion shouldn't be used with DTCS; rather data is deleted through TTL.
- with time series workloads, it's usually better to wait for whole sstable
to be expired rather than compacting a single sstable when it's more than
20% (default value) expired.
See CASSANDRA-9234 for more details.

For tombstone compaction, unworthy sstables are filtered out and the oldest
one is chosen because it's the one less likely to shadow data and it's also
relatively big.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
c400bf97b9 tests: basic test of tombstone compaction with lcs
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
70e54cfe6e compaction/lcs: add support for tombstone compaction
LCS will choose its candidate by starting from highest level and
getting sstable which has highest droppable tombstone ratio.
Unlike STCS which needs to choose oldest sstable from biggest tier,
LCS can choose the one with highest d__t__r because sstables in
a given level don't overlap.
Sstable picked up for tombstone removal compaction won't be demoted
or promoted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
138fda468f tests: basic tombstone compaction test for size tiered
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
8fd80ac22c compaction/stcs: add support for tombstone compaction
Larger sstables are hard to find sstable peers and therefore are
left uncompacted for a long time. Expired data and tombstones which
can be purged will waste disk space meanwhile.

sstable tracks droppable tombstone from which ratio can be calculated.
If ratio is greater than threshold (0.2 by default), sstable will
be eligible for compaction. Oldest sstables from biggest tiers are
preferrable because droppable data in them are more likely to satisfy
the conditions for purge, like not shadowing data in another sstable.

Subsequent patches will add support in leveled and date tiered
strategies.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
ad24470972 tests: add test for estimation of droppable tombstone ratio
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
eb6d17b748 sstables: introduce function to estimate droppable tombstone ratio
Function used to estimate ratio of droppable tombstone.
A tombstone is considered droppable for cells expired before
gc_before and regular tombstones older than gc_before.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:08 -03:00
Raphael S. Carvalho
0d21129cc7 compaction_manager: periodically submit cfs for compaction
This is useful for a column family which isn't generating new content
and will have lots of expired data later on that can be purged.
Compaction submission is NO-OP if there's nothing to do, so I think
it's reasonable to do it at an interval of 1 hour.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:43:03 -03:00
Raphael S. Carvalho
719dbf547d streaming_histogram: fix coding style
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:12 -03:00
Raphael S. Carvalho
6fb26d9f0c tests: add streaming_histogram_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:12 -03:00
Raphael S. Carvalho
a65b9eb8b4 streaming_histogram: implement sum
This function is used to estimate number of points in interval
[-inf,b]. It will be useful for estimating droppable tombstone
ratio in a given sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:12 -03:00
Raphael S. Carvalho
c01c659594 tests: add test for sstable with bad tombstone histogram
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:12 -03:00
Raphael S. Carvalho
06fabf9810 sstables: discard bad streaming histogram for future use
Find bad histogram which had incorrect elements merged due to use of
unordered map. The keys will be unordered. Histogram which size is
less than max allowed will be correct because no entries needed to be
merged, so we can avoid discarding those.

This is important because histogram for tombstone will be used to
estimate droppable tombstone ratio. If it's incorrectly high for many
of existing sstables, we will needlessly compact lots of them.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 02:08:10 -03:00
Raphael S. Carvalho
7b532867ce tests: add sstable tombstone histogram test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 01:17:28 -03:00
Raphael S. Carvalho
f35bd66da4 streaming_histogram: fix update
This bug was introduced when converting java code. Return value
of map::erase() was used as if it were the value of the removed
entry, but it's actually the number of removed entries.
update() also relies on ordered keys, so map is used instead
by histogram.
In addition, histograms will be written in sorted order (like C*
does) such that we can detect bad histograms, using disk_array.
disk_array is also used from now on to read histograms.
The conversion from array to map is fine because histograms for
sstables are limited to 100 elements.

Coming patch will detect bad histograms (generated only by us)
and discard them, because we can't rely on their information.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-29 01:17:26 -03:00
Amnon Heiman
644868d816 api: remove reply creation
As a preperation for the http stream support, creation of empty reply
should be avoided.

This patch removes a line that cannot be reached but causes the compiler
to complain.

It has no effect aside of removing the reply creation.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170628130202.8132-1-amnon@scylladb.com>
2017-06-28 16:30:58 +03:00
Tomasz Grabiec
786e75dbf7 row_cache: Use continuity information to decide whether to populate
If cache is missing given key, but the range is marked as continuous,
it means sstables don't have that entry and we can insert it without
asking the presence checker (bloom filter based). The latter is more
expensive and gives false positives. So this improves update
performance and hit ratio.

Another positive effect is that we don't have to clear continuity now.

Fixes #1999.

Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
2017-06-28 13:32:48 +03:00
Tomasz Grabiec
3489c68a68 lsa: Fix performance regression in eviction and compact_on_idle
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).

This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slow down as well.

Introduced in 11b5076b3c.

Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
2017-06-28 12:32:43 +03:00
Raphael S. Carvalho
a3a73899bc database: remove outdated FIXME comments
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170621002253.29660-1-raphaelsc@scylladb.com>
2017-06-28 11:06:02 +02:00
Etienne Kruger
37af493f6e cql_server::do_accepts: modernize loop
Replace recursion in cql_server::do_accepts with more modern repeat()
from future-util.hh.

Fixes #2467.

Signed-off-by: Etienne Kruger <el@loadavg.io>
Message-Id: <20170628033130.19824-1-el@loadavg.io>
2017-06-28 10:25:22 +03:00
Raphael S. Carvalho
d90f46000d streaming_histogram: move it to utils
It's not specific to sstables. May be needed somewhere else in
the future.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-28 01:07:13 -03:00
Glauber Costa
f3742d1e38 disable defragment-memory-on-idle-by-default
It's been linked with various performance issues, either by causing
them or making them worse. One example is #1634, and also recently
I have investigated continuous performance degradation that was also
linked to defrag on idle activity.

Until we can figure out how to reduce its impact, we should disable it.

Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
2017-06-28 00:21:11 +03:00
Raphael S. Carvalho
fb9bc609c6 streaming_histogram: do not limit it to be used by sstables
streaming histogram will later be placed in /utils, so we want
it to use std::unordered_map<> instead of disk_hash<>.
That also requires implementing serialization/deserialization
functions for it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-27 16:51:52 -03:00
Raphael S. Carvalho
e224653d70 sstables: update tombstone_histogram for cells with expiration time
That tombstone_histogram is used to determine droppable data ratio
for a sstable, and unlike C*, we were only updating it for
tombstones. We need to update it with expiration time of cells too,
if any. Creation time (expiration - ttl) cannot be used because if
ttl > gc_grace_seconds, the resulting sstable could be considered
worth dropping by tomstone compaction before any data is actually
expired.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-27 16:50:38 -03:00
Avi Kivity
08488a75e0 dist: tolerate sysctl failures
sysctl may fail in a container environment if /proc is not virtualized
properly.

Fixes #1990
Message-Id: <20170625145930.31619-1-avi@scylladb.com>
2017-06-27 16:11:48 +02:00
Avi Kivity
ff7be8241f Merge "Fix compilation issues in older environments" from Tomasz
* 'tgrabiec/fix-compilation-issues' of github.com:cloudius-systems/seastar-dev:
  tests: streamed_mutation_test: Avoid using boost::size() on row ranges
  tests: row_cache: Remove unused method
2017-06-27 16:30:54 +03:00
Tomasz Grabiec
eb844a10e9 tests: streamed_mutation_test: Avoid using boost::size() on row ranges
Fails to compile with libboost 1.55.
2017-06-27 15:27:13 +02:00
Tomasz Grabiec
e68925595c tests: row_cache: Remove unused method 2017-06-27 14:10:37 +02:00
Vlad Zolotarov
6839a50677 db::commitlog: entry_writer add a virtual destructor
Add a virtual destructor for a base class commitlog::entry_writer.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1498511180-18391-1-git-send-email-vladz@scylladb.com>
2017-06-27 10:17:10 +03:00
Takuya ASADA
1e86196ed5 dist/debian: unofficial support of Ubuntu non-LTS versions / Debian non-stable versions
Currently our build script only supports Ubuntu 14.04/16.04 and Debian 8, this
change extends support to Ubuntu non-LTS versions / Debian non-stable versions.
Note that this is unofficial support, users should build the package for these
distributions theirselves.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498491473-28691-1-git-send-email-syuu@scylladb.com>
2017-06-26 18:55:55 +03:00
Asias He
cc02a62756 repair: Prefer nodes in local dc when streaming
When peer nodes have the same partition data, i.e., with the same
checksum, we currently choose to stream from any of them randomly.
To improve streaming performance, select the peer within the same DC.
This patch is supposed to improve repair perforamnce with multiple DC.

Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>
2017-06-26 18:34:21 +03:00
Avi Kivity
1170f56447 Merge "Speed up gossip dissemination in large cluster" from Asias
Fixes #2528.

* tag 'asias/gossip_talk_to_more_nodes/v3' of github.com:cloudius-systems/seastar-dev:
  gossip: Use vector for _live_endpoints
  gossip: Talk to more live nodes in each gossip round
2017-06-26 17:59:43 +03:00
Asias He
e31d4a3940 gossip: Use vector for _live_endpoints
To speed up the random access in get_random_node. Switch to use vector
instead of set.
2017-06-26 22:49:59 +08:00
Asias He
437899909d gossip: Talk to more live nodes in each gossip round
In large clusters with multiple DC deployment, it is observed that it
takes long delay for gossip update to disseminate in the cluster.

To speed up, talk to more live nodes in each gossip round.

Fixes #2528
2017-06-26 22:49:59 +08:00
Nadav Har'El
6cf44f6817 Optimize column_family::make_sstable_reader() for one partition
This patch does the same thing to column_family::make_sstable_reader() as
commit 186f031 did to sstable::as_mutation_source().

Although usually one can fast_forward_to() on the result of a
column_family::make_sstable_reader(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.

With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumes this optimization exists. In particular,
column_family::data_query() does a single partition read but does not
specify forwarding::no explicitly.
So this patch returns this optimization, despite this meaning that we
blatently ignore the fwd_mr flag in that case.

Fixes #2524.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170626141121.30322-1-nyh@scylladb.com>
2017-06-26 17:13:03 +03:00
Avi Kivity
9b21a9bfb6 Merge "Implement partial cache" from Tomasz and Piotr
"This series enables cache to keep partial partitions.
Reads no longer have to read whole partition from sstables
in order to cache the result.

The 10MB threshold for partition size in cache is lifted.

Known issues:

 - There is no partial eviction yet, whole partitions are still evicted,
   and partition snapshots held by active reads are not evictable at all
 - Information about range continuity is not recorded if that
   would require inserting a dummy entry, or if previous entry
   doesn't belong to the latest snapshot
 - Cache update after memtable flush happening concurrently with reads
   may inhibit that reads' ability to populate cache (new issue)
 - Cache update from flushed memtables has partition granularity,
   so may cause latency problems with large partition
 - Schema is still tracked per-partition, so after schema changes
   reads may induce high latency due to whole partition needing
   to be converted atomically
 - Range tombstones are repeated in the stream for every range between
   cache entries they cover (new issue)
 - Populating scans for both small and large partitions (perf_fast_forward)
   experienced a 40% reduction of throughput, CPU bound

How was this tested:

 - test.py --mode release
 - row_cache_stress_test -c1 -m1G
 - perf_fast_forward, passes except for the test case checking range continuity population
   which would require inserting a dummy entry (mentioned above)
 - perf_simple_query (-c1 -m1G --duration 32):
     before: 90k [ops/s] stdev: 4k [ops/s]
     after:  94k [ops/s] stdev: 2k [ops/s]"

* tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits)
  tests: row_cache: Add test_tombstone_merging_in_partial_partition test case
  tests: Introduce row_cache_stress_test
  utils: Add helpers for dealing with nonwrapping_range<int>
  tests: simple_schema: Allow passing the tombstone to make_range_tombstone()
  tests: simple_schema: Accept value by reference
  tests: simple_schema: Make add_row() accept optional timestamp
  tests: simple_schema: Make new_timestamp() public
  tests: simple_schema: Introduce make_ckeys()
  tests: simple_schema: Introduce get_value(const clustered_row&) helper
  tests: simple_schema: Fix comment
  tests: simple_schema: Add missing include
  row_cache: Introduce evict()
  tests: Add cache_streamed_mutation_test
  tests: mutation_assertions: Allow expecting fragments
  mutation_fragment: Implement equality check
  tests: row_cache: Add test for population of random partitions
  tests: row_cache: Add test for partition tombstone population
  tests: row_cache: Test reading randomly populated partition
  tests: row_cache: Add test_single_partition_update()
  tests: row_cache: Add test_scan_with_partial_partitions
  ...
2017-06-26 14:54:37 +03:00
Avi Kivity
555621b537 Disentable memtables from sstables
Remove sstable::write_components(memtable), replacing it with a helper.

Fixes #2354
Message-Id: <20170624142639.16662-1-avi@scylladb.com>
2017-06-26 09:37:11 +02:00
Avi Kivity
236a8370e4 Remove use of std::random_shuffle()
It was removed in C++17. Replace with std::shuffle().
Message-Id: <20170626063809.7563-1-avi@scylladb.com>
2017-06-26 09:36:38 +02:00
Avi Kivity
c4ae2206c7 messaging: respect inter_dc_tcp_nodelay configuration parameter
We respect it partially (client side only) for now.

Fixes #6.
Message-Id: <20170623172048.23103-1-avi@scylladb.com>
2017-06-24 21:49:27 +02:00
Duarte Nunes
2dfd7040eb CMakeLists.txt: Add boost support
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170623172236.15507-1-duarte@scylladb.com>
2017-06-24 21:49:27 +02:00
Avi Kivity
801b5220d6 Merge seastar upstream
* seastar 9e2b7ec...0ab7ae5 (4):
  > Update fmt submodule
  > rpc: add options to control tcp_nodelay
  > core: Fix compilation for older versions of Boost
  > tests/lowres_clock_test: Fix compilation issues
2017-06-24 20:47:52 +03:00
Tomasz Grabiec
b0bcf2be53 tests: row_cache: Add test_tombstone_merging_in_partial_partition test case 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
23c6f517cb tests: Introduce row_cache_stress_test
Runs readers, updates and eviction concurrently and verifies the
following property of reads:

  - reads see all past writes

  - reads see no partial writes within a single partition
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
4b4aef789e utils: Add helpers for dealing with nonwrapping_range<int> 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5c9f87fb27 tests: simple_schema: Allow passing the tombstone to make_range_tombstone() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
edf4a3494c tests: simple_schema: Accept value by reference 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5f70df472f tests: simple_schema: Make add_row() accept optional timestamp 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
53867c4328 tests: simple_schema: Make new_timestamp() public 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
51b5814ec2 tests: simple_schema: Introduce make_ckeys() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
074c67fe4d tests: simple_schema: Introduce get_value(const clustered_row&) helper 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8ffc776e06 tests: simple_schema: Fix comment 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
ecacd2e84a tests: simple_schema: Add missing include 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
b56232b216 row_cache: Introduce evict() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
c4e8effffa tests: Add cache_streamed_mutation_test
[tgrabiec:
  - extracted from a larger commit
  - removed coupling with how cache_streamed_mutation is created (the
    code went out of sync), used more stable make_reader(). it's simpler too.
  - replaced false/true literals with is_continuous/is_dummy where appropraite
  - dropped tests for cache::underlying (class is gone)
  - reused streamed_mutation_assertions, it has better error messages
  - fixed the tests to not create tombstones with missing timestamps
  - relaxed range tombstone assertions to only check information relevant for the query range
  - print cache on failure for improved debuggability
]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
44fdee3f2e tests: mutation_assertions: Allow expecting fragments 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1f23130b07 mutation_fragment: Implement equality check 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
116bcb8b30 tests: row_cache: Add test for population of random partitions 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
930a1415fe tests: row_cache: Add test for partition tombstone population 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
9bfece6f82 tests: row_cache: Test reading randomly populated partition 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
0358334579 tests: row_cache: Add test_single_partition_update()
[tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8bb76e2f12 tests: row_cache: Add test_scan_with_partial_partitions 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
896bf2e5de Remove unused methods from MVCC
Some apply methods where replaced by apply_to_incomplete().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6f6575f456 row_cache: Enable partial partition population 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5a0ae55f6d Introduce schema_upgrader 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1828e28bbb database: Invalidate cache atomically with attaching streaming sstables
Not doing so may cause reads to see partial writes, if another
update+read happens in between.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
896196b841 database: Invalidate cache from seal_active_streaming_memtable_immediate()
Cache must be synchronized atomically with changing the underlying
mutation source, otherwise write atomicity may not hold.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
7ae40d7045 tests: Add test for update_invalidating() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e792220c3a row_cache: Introduce update_invalidating() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c29878f49f row_cache: Extract memtable walking logic from update() into do_update()
So that it can be reused in update_invalidating().
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6ebfb730ee partition_entry: Introduce partition_tombstone() getter 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
fb62dfab02 tests: mvcc: Introduce test_schema_upgrade_preserves_continuity 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
164989a574 tests: mvcc: Add test for partition_entry::apply_to_incomplete() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e433e68610 partition_entry: Make squashed() and upgrade() work with not fully continuous versions
Those methods first create a neutral mutation_partition, and left-fold
it with the versions. The problem is that there is no neutral element
for static row continuity, the flag from the first addend always
wins. We have to copy the flag from the first version to preserve
the logical value.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
b680de930c partition_entry: Introduce apply_to_incomplete()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec:
  - extracted from a larger commit
  - fix heap comparator in apply_incomplete_target to order versions properly
  - extracted partition_version detaching into
    partition_entry::with_detached_versions()
  - dropped unnecessary rows_iterator::_version field
  - dropped unnecessary allocation of rows_entry and key copies
    in rows_iterator
  - dropped row_pointer
  - replaced apply_reversibly() with weaker and faster apply()
  - added handling of dummy entries at any position
  - fixed exception safety issue in apply_to_incomplete() which may
    result in data loss. We cannot move data out of applied versions
    into a new synthetic row and then apply it, because if exception
    happens in the middle, the data which was moved from the source
    will be lost. To fix that, row_iterator::consume_row() is
    introduced which allows in-place consumption of data without
    construction of temporary deletable_row.
  ]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
b6ce963200 partition_version: Introduce partition_entry::with_detached_versions() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
2d8f024e4d partition_version: Document version merging rules on partition_entry 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
0770845a23 mutation_partition: Introduce r-value accepting deletable_row::apply() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
48a5b1d3ab converting_mutation_partition_applier: Expose cell upgrade logic 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
04aebaa2cb streamed_mutation: Introduce transform() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
3755641c4b row_cache: Introduce cache_streamed_mutation
This streamed mutation populates cache with
the rows requested by the read. It takes whatever
it can find in the cache and fetches the remainings
from underlying source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec:

  - fixed maybe_add_to_cache_and_update_continuity() leaking entries if
    the key already exists in the snapshot

  - fixed a problem where population race could result in a read
    missing some rows, because cache_streamed_mutation was advancing
    the cursor, then deferring, and then checking continuity. We
    should check continuity atomically with advancing.

  - fixed rows_handle.maybe_refresh() being accessed outside of update
    section in read_from_underlying() (undefined behavior)

  - fixed a problem in start_reading_from_underlying() where we would
    use incorrect start if lower_bound ended with a range tombstone
    starting before a key.

  - range tombstone trimming in add_to_buffer() could create a
    tombstone which has too low start bound if last_rt.end was a
    prefix and had inclusive end. invert_kind(end_kind) should be used
    instead of unconditional inc_start.

  - range tombstone trimming incorrectly assumed it is fine to trim
    the tombstone from underlying to the previous fragment's end and
    emit such tombstone. That would mean the stream can't emit any
    fragments which start before previous tombstone's end. Solve with
    range_tombstone_stream.

  - split add_to_buffer() into overloads for clustering_row, and
    range_tombstone. Better than wrapping into mutation_fragment
    before the call and having add_to_buffer() rediscover the
    information.

  - changed maybe_add_to_cache_and_update_continuity() to not set
    continuity to false for existing entries, it's not necessary

  - moved range tombstone trimming to range_tombstone class
  - moved range tombstone slicing code to range_tombstone_list and partition_snapshot
  - can_populate::can_use_cache was unused, dropped
  - dropped assumption that dummy entries are only at the end
  - renamed maybe_add_to_cache_and_update_continuity() to maybe_add_to_cache()
  - dropped no longer needed lower_bound class
  - extracted row_handle to a seaparate patch
  - made the copy-from-cache loop preemptable
  - split maybe_add_next_to_buffer_and_update_continuity(bool)
  - dropped cache_populator
  - replaced "underlying" class with use of read_context
  - replaced can_populate class with a function
  - simplified lsa_manager methods to avoid moves
]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
58a8022462 intrusive_set_external_comparator: Introduce insert_check() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
509a0d8a83 row_cache: Allow reading from underlying through read_context
The interaction will be as follows:

  - Before creating cache_streamed_mutation for given partition, cache
    mutation reader sets up read_context for current partition (in one
    of two ways) so that the matching underlying streamed_mutation can
    be accessed at any time by cached_stream_mutation.

  - cache_streamed_mutation assumes that read_context is set up for
    current partition and invokes fast_forward_to() and
    get_next_fragment() to access the underlying
    streamed_mutation.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
69ea29131f row_cache: Allow specifying desired snapshot in autoupdating_underlying_reader
When reading from incomplete partition entry, we may discover we need
to read something from the underlying mutation source. In such case we
will fast forward this reader to that partition. But we must do it
using a specific snapshot, the one we obtained when entering the
partition, not the latest one.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a1d3e0318c row_cache: Store autoupdating_underlying_reader in read_context
Will be reused for reading of incomplete partition entries.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
3f2320c377 row_cache: Store information whether query is a range query in read_context
We will need to use this information later in yet another place, when
creating a reader for incomplete cache entry. This refactors the code
so that there is a single place which determines this fact.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a2207ee9a6 row_cache: Move autoupdating_underlying_reader to read_context.hh 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
ca920bd0ef row_cache: Keep only one streamed_mutation in scanning_and_populating_reader
Currently scanning_and_populating_reader asks
just_cache_scanning_reader for the next partition from cache, together
with information if the range is continuous. If it's not, it saves the
partition it got from it and moves on to reading from the underlying
reader up to that partition. When that's done, it emits the stored
partition.

This approach won't work well with upcoming changes for storing
partial partitions. We won't have whole partitions any more, so
streamed_mutation returned for the entry needs to be prepared for
reading from the underlying mutation source. We want to reuse the same
underlying reader as much as possible, so all streamed_mutations for
given read (read_context) will share the state of the underlying
reader. Construction of a streamed_mutation will depend on the fact
that the shared state is set up for it, so we cannot have two
streamed_mutations prepared at the same time (one for entry from
primary, and one for the earlier entry being populated). This change
defers the creation of a streamed_mutation for the entry present in
cache until the whole reader reaches it to avoid this problem.

This will also have antoher potentially beneficial effect. Since we
defer the decision about which snapshot to use until we reach the
entry, there is a higher chance that the current snapshot of the entry
will match the one used last by the populating read, and that we will
be able to reuse the reader.

It's implemented by utilizing a stable partition cursor which tracks
its current position so that it's possible to revisit the cache entry
(if it's still there) after population ends. The functionality of
just_cache_scanning_reader was inlined into
scanning_and_populating_reader.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1b041298fe range: Introduce trim_front() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
045888d5f3 row_cache: Introduce partition_range_cursor 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e989d65539 dht: Make ring_position_view copyable
dht::token needs to be stored as a pointer now and not a reference so
that validity of old pointers is not impacted by in-place object
construction which would occur in the copy-assignment operator.

[1] says that old pointers can be used to access the new object only
if the type "does not contain any non-static data member whose type is
const-qualified or a reference type".

[1] http://en.cppreference.com/w/cpp/language/lifetime#Storage_reuse
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c3905bf235 row_cache: Print position instead of key of cache_entry 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
f0cc86e5db row_cache: Introduce cache_entry::position() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e3272526a1 row_cache: Allow comparing with ring_position views in row_cache::compare 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5bfecaad99 row_cache: Switch invalidate_unwrapped() to use ring_position_view ranges
It's needed before switching cache_entry ordering to rely solely on
cache_entry::position() so that invalidate_unwrapped() never removes
the dummy entry at the end. Currently if the range has upper bound
like this:

  { ring_position::max(), inclusive=true }

The code which selects entries for removal would include the dummy row
at the end. It uses upper_bound() to get the end iterator, and the
dummy entry has a position which is equal to the position in the
bound.

ring_position_view ranges are end-exclusive, so it's impossible to
create a partition range which would include a dummy entry.

The code is also simpler.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
64626b32b0 row_cache: Make printable 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
639af55a78 partition_version: Add versions() getter
[tgrabiec: Use explicit return type]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1d3fec43eb partition_version: Make return type of versions() explicit 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
22a0e301f1 partition_version: Make is_referenced() const-qualified 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
a17fa5726f Introduce streamed_mutation_from_forwarding_streamed_mutation
This will allow conversion from streamed_mutation that
supports fast forwarding to streamed_mutation that does not.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
11191b7aef streamed_mutation: Introduce make_empty_streamed_mutation() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
7c9569ec95 Introduce partition_snapshot_row_cursor 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
54b3da1910 row_cache: Introduce find_or_create() helper 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
f2d2c221d4 row_cache: Return cache_entry reference from do_find_or_create_entry
Will be useful when additional action needs to be done on the entry
after it was created or constructed.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
0db6bdc916 row_cache: Introduce cache_entry constructor which constructs incomplete entry 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1c642219c9 row_cache: Ensure there is always a dummy entry after all clustered rows
Algorithms will assume that.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
bbfa52822e row_cache: Switch readers to use per-entry snapshots
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.

Consider the following case:

  flush write(ck=1), write(ck=2) -> snapshot_1
  cache reader 1 reads and inserts ck=1 @snapshot_1
  flush write(ck=1), write(ck=2) -> snapshot_2
  cache reader 2 reads and inserts ck=2 @snapshot_2

Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.

To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.

There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.

After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.

Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
81e7b561da dht: Add ring_position min()/max() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8ba6366610 row_cache: Switch to using snapshot_source
Currently every time cache needs to create reader for missing data it
obtains a reader which is most up to date. That reader includes writes
from later populate phases, for which update() was not yet
called. This will be problematic once we allow partitions to be
partially populated, because different parts of the partition could be
partially populated using readers using different sets of writes, and break
write atomicity.

The solution will be to always populate given partition using the same
set of writes, using reader created from the current snapshot. The
snapshot changes only on update(), with update() gradually converting
each partition to the new snapshot.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
446bcdb00d database: Add missing cache invalidation after attaching sstables
This violation of the contract is currently benign, because there are
no reads from those tables before they are populated. If there were,
the cache would mark the whole (empty) range as continuous and the
table would appear empty.

It will cause similar problem once cache starts using snapshots of the
underlying mutation source. Then this lack of invalidate() will also
result in cache thinking that the table is still empty.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e23c7e2f34 row_cache: Rework invalidate() implementation
1) Reduce duplication by delegating to more general overloads

 2) Improve documentation to not mention effects in terms of
    population (detail) but rather write visibiliy

 3) Rename clear() to invalidate() and merge with the range variant,
    it has the same semantics
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c82c6ec6ed database: Allow obtaining snapshot_source for sstables 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
bd023b6161 tests: Introduce memtable_snapshot_source
Snapshottable in-memory mutation source for use in row_cache tests.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
ddfcf64966 mutation_source: Make copying cheaper
Cache readers will need to take snapshots by copying the
mutation_source. That's going to happen quite often, so make copying
cheaper.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
58d5e1393b mutation_reader: Introduce make_combined_mutation_source() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1e2463a382 mutation_reader: Introduce make_empty_*_source() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
289d01c2cc mutation_reader: Introduce concept of snapshot_source 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
2d73c193e7 row_cache: Introduce read_context
This object stores all read relevant context required all
over the place. This leads to a cleaner code.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec:
  - made read_context shareable to allow storing shared
    mutable state later
  - added range and cache getters
]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
a3ff8db323 row_cache: Introduce autoupdating_underlying_reader
This is an abstraction that represents a reader
to the underlying source and auto updates itself
to make sure the reader reflects the latest state
of the underlying source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec: Add range getter to avoid friendships]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
b6d349728f range_tombstone_list: Introduce slice() working with position range 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6ce08f2f9a range_tombstone: Introduce trim_front() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
271dfc2eac position_in_partition: Introduce for_range_start()/for_range_end() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
3b52afa4a3 position_in_partition: Introduce no_clustering_row_between() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
9c2b3e1167 position_in_partition: Introduce as_start_bound_view() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
b47c8f1df7 partition_snapshot: Add const-qualified overload of version()
[tgrabiec: Extracted from a different patch]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
dd9d35c166 partition_snapshot: Add getter for range tombstones 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
60c3c0a471 partition_entry: Add squashed() overload with a single schema 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
98f7671553 partition_snapshot: Introduce squashed() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
87b0f11be3 partition_snapshot: Add getters for static row and partition tombstone
[tgrabiec:
  - Extracted from a different patch
  - Renamed concept names to more familiar Map and Reduce
  - Renamed aggregate() to squashed() to match the existing nomenclature
  - Uncommented the concepts
  ]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
ea59b9475e partition_version: Add const-quialified variant of operator->
[tgrabiec: Extracted from a different patch]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
f6fe0acea4 partition_version: Make operator bool() const-qualified
[tgrabiec: Extracted from a different patch]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
efc75b0bc3 mutation_partition: Add rows_entry constructor which accepts full contents
[tgrabiec: Extracted from different patch]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
7f8620d4a7 tests: mutation_source: Relax expectations about range tombstones
In preparation for having partial cache which trims range tombstones
to the lower bound of the query.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
3a9212e0f2 tests: mutation_assertions: Add ability to limit verification to given clustering_row_ranges
Currently mutation sources are free to return range tombstones
covering range which is larger than the query range. The cache
mutation source will soon become more eager about trimming such
tombstones. To cover up for such differences, allow telling the
restrictions to only care about differences relevant for given
clustering ranges.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
f925b26241 tests: mutation_reader_assertions: Simplify 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1d5d5e26a2 mutation: Introduce sliced() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
92d6456070 range_tombstone_list: Introduce equal() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
1594ace4d3 range_tombstone_stream: Make printable 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
19edb0b535 range_tombstone_list: Make printable 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
2e75595ecf range_tombstone_list: Introduce trim() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
5a29c70f3e mutation_fragment: make mutation_fragment copyable
This will be needed by implementation of cache_streamed_mutation

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
2fdabcaa9b Track population phase in partition_snapshot
This will be used by partial cache in later patches.

[tgrabiec:
  - changed title,
  - documented meaning of the variable,
  - renamed the variable,
  - introduced open_version(),
  - fixed continuity of the static row not being preserved in case
    a new version is created]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
a841c77c54 Introduce maybe_merge_versions
This will be used in the following patches by partial cache.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
9642f806ab partition_version: Introduce version() getter 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
9380dd1ee3 mutation_source: make sure we never ignore fast forwarding
mutation source sometimes ignore fast forwarding parameter so
this change adds assertion to check that this parameter
can be safely ignored.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
ab72241e22 mutation_reader: Accept forwarding flag in make_reader_returning()
By default make_reader_returning creates a reader that does not
support fast forwarding but the second parameter can be used to
make it support fast forwarding.

[tgrabiec: Improve title]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
ac03331490 row_cache_test: improve test_sliced_read_row_presence
Remove unused parameter and add checks to make sure
all expected rows have been received.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
db053ef902 tests: Add test for continuity merging rules 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
2edf08d36a tests: random_mutation_generator: Generate random continuity 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8873a443db tests: mutation: Generate mutations with continuity 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
dce293e11c tests: row_cache: Apply only fully continuous mutations to underlying mutation source
Cache currently assumes that mutations coming from outside are fully
continuous.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
e86f74edd8 tests: row_cache: Add missing apply() to test_mvcc test case
[tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
95dcfa859b tests: row_cache: Improve test_mvcc()
assert_that().is_equal_to() gives better error message.

Also, there is code which can be replaces with
assert_that_stream().has_monotonic_positions()
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
05b56fcfb0 mutation_partition: Add support for specifying continuity
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.

Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.

[tgrabiec:
  - based on the following commits:
     4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
     773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
  - documented that partition tombstone is always complete
  - require specifying the partition tombstone when creating an incomplete entry
  - replaced rows_entry(dummy_tag, ...) constructor with more general
    rows_entry(position_in_partition, ...)
  - documented continuity semantics on mutation_partition
  - fixed _static_row_cached being lost by mutation_partition copy constructors
  - fixed conversion to streamed_mutation to ignore dummy entries
  - fixed mutation_partition serializer to drop dummy entries
  - documented semantics of continuity on mutation_partition level
  - dropped assumptions that dummy entries can be only at the last position
  - changed equality to ignore continuity completely, rather than
    partially (it was not ignoring dummy entries, but ignoring
    continuity flag)
  - added printout of continuity information in mutation_partition
  - fixed handling of empty entries in apply_reversibly() with regards
    to continuity; we no longer can remove empty entries before
    merging, since that may affect continuity of the right-hand
    mutation. Added _erased flag.
  - fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
  - fixed partition_builder to not ignore continuity
  - renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
  - standardized all APIs on is_dummy and is_continuous bool_class:es
  - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
  - dropped unused remove_dummy_entry()
  - simplified and inlined cache_entry::add_dummy_entry()
  - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
  ]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
063b37f352 partition_snapshot_reader: Be prepared for skipping some row entries
If some row entries may have to be skipped by the reader then it could
be that _clustering_rows is not empty, but read_next() will return a
disengaged optional because there are no more rows in the current
range. The code assumed that it's never the case, and if read_next()
returns a disengaged optional then we exhousted all ranges. Before
introducing dummy entries this needs to be refactored.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
2cfe23a35e partition_snapshot_reader: Use rows_entry::position() for comparing rows
key() will not be valid for dummy entries.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
50641b2849 partition_snapshot_reader: Reuse rows_entry comparator 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
d1a1fdfd57 partition_snapshot_reader: Encapsulate row walking to simplify read_next() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a77734952d mutation_partition: Make rows_entry comparable with position_in_partition 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
65b3123516 mutation_partition: Use rows_entry::position() in comparators
key() will not be valid for dummy entries, but position() is always
valid.

[tgrabiec: Extracted from other commits]
[tgrabiec: Added missing change to range_tombstone_stream::get_next]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
660f3127a6 mutation_partition: Introduce rows_entry::position()
In preparation for enabling dummy entries with postion past all
clustering rows.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
9d705bc1c6 position_in_partition_view: Add key component getter 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
206a5f2bf5 position_in_partition_view: Add is_clustering_row() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
874b34ac09 position_in_partition_view: Add converting constructor from a key reference 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
794f9b21ef position_in_partition: Add is_after_all_clustered_rows() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
14fbf4409c position_in_partition: Introduce after_key() in the view 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
1457be4de8 position_in_partition: Introduce for_key()
[tgrabiec: Take the key by reference]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
94c957c2ff Extract position_in_partition to separate header
This will allow it's usage in mutation_partition.hh

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
0fd4dedc6a position_in_partition: Add after_all_clustered_rows() to view
This is a position that's always in the end after any
other position. It will be used for dummy rows_entry.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
60346a2819 row_cache: remove unused read overload
This will simplify the following patches and unused
code should be removed anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
77f944880c cache: Remove support for wide partitions
This will be handled by row cache now.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
fbe8c24ebe tests: row_cache_alloc_stress: Make eviction detection more reliable
It can happen that touch() will trigger eviction on entry to
allocating section, and drop in occupancy around insertion
will not happen. As a result, we may evict a lot without detecting that.

Extend the check to include touch() and use more reliable eviction counters.
2017-06-24 18:06:11 +02:00
Jesse Haber-Kucharsky
0791e90424 Further improve CMakeLists.txt for CLion
We support situations where `seastar.pc` is available, and when it isn't.

When `seastar.pc` is available, we grab the compilation flags from it in
addition to the defaults.

Some DPDK files are statically available in the source repository. We prefer
those to files placed during compilation in case modifications are made during
development (that would be lost during a build).

We always disable GCC6 concepts for the IDE, even if they are enabled while
configuring the project for compilation. CLion's parser doesn't understand them.

One final benefit to this revision is that now only target-specific flags are
modified rather than global flags.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <75457652201c5ed89d05081ec4b56b4340721cf5.1498237756.git.jhaberku@scylladb.com>
2017-06-23 19:21:28 +02:00
Avi Kivity
aebc77507c Merge "Clock efficiency and organization" from Jesse
"This patch series makes two noteworthy changes:

- `gc_clock` now uses `seastar::lowres_system_clock` for better performance.

- Code common to all clocks is properly refactored and shared.

Otherwise, there are some small improvements that should have no functional impact."

* 'jhk/lowres_gc_clock/v1' of https://github.com/hakuch/scylla:
  Seal clock definitions
  `timestamp_clock::now()` is not `noexcept`
  Make `db_clock` `time_t` conversions `constexpr`
  Move common clock implementation helpers
  Simplify clock implementations
  db_clock.hh: Clean preprocessor directives
  Make `gc_clock` a model of `Clock`
  Use `lowres_system_clock` to back `gc_clock`
  Add `time_t` conversions for `gc_clock`
2017-06-23 18:47:56 +03:00
Jesse Haber-Kucharsky
b0ad1ff447 Seal clock definitions 2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky
09954c45f1 timestamp_clock::now() is not noexcept
The problem is that `std::chrono::duration_cast` is not `noexcept`. As a result,
`timestamp_clock` is actually a model of `Clock` and not `TrivialClock`.
2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky
28169fabca Make db_clock time_t conversions constexpr 2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky
e045dddae8 Move common clock implementation helpers
This change fixes the dependencies between the clock implementation headers. All
the clocks share the common clock offset, but are otherwise independent (though
the `db_clock` does depend on `gc_clock` for time point conversions).
2017-06-23 11:35:35 -04:00
Jesse Haber-Kucharsky
2d184f27af Simplify clock implementations 2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky
51c767c1c7 db_clock.hh: Clean preprocessor directives 2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky
050ece6f74 Make gc_clock a model of Clock
It was missing `is_steady`.
2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky
00bcd568a6 Use lowres_system_clock to back gc_clock
`seastar::lowres_system_clock` is more efficient than
`std::chrono::system_clock` and `gc_clock` has very coarse granularity
requirements.

Fixes #1957.
2017-06-23 11:35:34 -04:00
Jesse Haber-Kucharsky
73020685ee Add time_t conversions for gc_clock
`gc_clock` reports system time, and these conversion functions allow for
manipulating time points produced by the clock without making assumptions about
its epoch.
2017-06-23 11:35:34 -04:00
Duarte Nunes
4ef25e8e38 db/schema_tables: Add note to make_update_view_mutations
Document that a new view schema passed to make_update_view_mutations()
might be based on base schema that hasn't yet been loaded.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170618200558.96036-1-duarte@scylladb.com>
2017-06-23 15:24:35 +02:00
Duarte Nunes
bc1f1fa88a Merge branch "Some fixes for clang++ trunk" from Avi
"Fix minor issues found while building with clang trunk."

* 'clang' of https://github.com/avikivity/scylla:
  seastarx: don't make seastar namespace inline
  seastarx: add missing make_shared forward declaration
  tests: fix call to seastar::sleep()
  dht: fix bad to_sstring() call
2017-06-22 17:31:27 +02:00
Avi Kivity
f3366d8ae6 seastarx: don't make seastar namespace inline
It's apparently not legal to re-declare an existing namespace
inline. Use "using" instead.
2017-06-22 18:16:13 +03:00
Avi Kivity
c423330917 seastarx: add missing make_shared forward declaration
Without this, clang (correctly) complains that it can't deduce the type when
it is not explicitly mentioned.
2017-06-22 18:16:13 +03:00
Avi Kivity
672de608bf tests: fix call to seastar::sleep()
It's not in the global namespace.
2017-06-22 18:16:13 +03:00
Avi Kivity
f9f2f18145 dht: fix bad to_sstring() call
to_sstring() is part of seastar, nor the global namespace.
2017-06-22 17:51:27 +03:00
Avi Kivity
6f8dba3fa9 Merge "small fixes and cleanup for leveled strategy" from Raphael
* 'lcs_improvements_v1' of github.com:raphaelsc/scylla:
  lcs: remove useless code for choosing L0 candidates
  lcs: remove some dead code
  lcs: make logger static
  lcs: actually prefer oldest sstables of L0 when it falls behind
  lcs: remove useless expensive check for overlapping L1 sstables
2017-06-22 15:45:34 +03:00
Raphael S. Carvalho
4351e0a996 compaction: introduce new compaction type for reshard
so now user can look at nodetool compactionstats and determine
whether or not resharding is running, for example:
$ ./bin/nodetool compactionstats
pending tasks: 3
       id   compaction type   keyspace                table   completed   total   unit   progress
   <none>           RESHARD     system   compaction_history          11     256   keys      4.30%
   <none>           RESHARD     system   compaction_history           2     256   keys      0.78%
   <none>           RESHARD     system   compaction_history          10     256   keys      3.91%
   <none>           RESHARD     system   compaction_history           8     256   keys      3.12%
   <none>           RESHARD     system   compaction_history           7     256   keys      2.73%

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170620175733.25882-1-raphaelsc@scylladb.com>
2017-06-22 14:48:38 +03:00
Gleb Natapov
9b8499df0e cache_hitrate_calculator: filter cfs based on replication strategy instead of a name
The code filters CFs by name to not include system keyspace, but v3
schema added yet another system namespace. Better filter according to
replication strategy to accommodate for schema v4 adding even more
system keyspaces.

Fixes: #2516

Message-Id: <20170621073816.GB3944@scylladb.com>
2017-06-22 11:26:34 +03:00
Tzach Livyatan
9e6337f330 Add a comment experimental line to scylla.yaml
Making it easier for users to enable experimental features

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170621191720.13575-1-tzach@scylladb.com>
2017-06-22 09:06:19 +03:00
Jesse Haber-Kucharsky
8174a40098 Improve CMakeLists.txt for CLion
This version optionally reads the include paths for Seastar from pkg-config and
uses file globbing to register all source and header files.

In comparison to the previous version, I see all files in the project explorer
view are "active" (rather than just .cc files). I believe there are also fewer
errors reported by the editor.
2017-06-21 16:34:47 -04:00
Avi Kivity
f0b20be14d Revert "system_keyspace: Make sure "system" is written to keyspaces (visible)"
This reverts commit 89ef69c4b3. Prevents nodes
from joining the cluster.
2017-06-21 16:58:04 +03:00
Avi Kivity
8585a356eb Revert "Revert "db: prevent latency spikes during streaming/repair""
This reverts commit 399d219cab. Turns out it
was not the culprit.
2017-06-21 16:58:04 +03:00
Takuya ASADA
aa77ac1138 dist/debian: Debian 9(stretch) support
Add support Debian new stable release.
Also including following changes:
 - update libthrift due to unable to compile on Debian 9
 - drop dist/debian/supported_release since distribution check code moved to pbuilderrc
 - add libssl-dev for build-depends
 - add sudo for pbuilder extra packages (Debian doesn't have it by default install)

Signed-off-by: syuu <syuu@dokukino.com>
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498047515-19972-1-git-send-email-syuu@scylladb.com>
2017-06-21 15:30:22 +03:00
Avi Kivity
399d219cab Revert "db: prevent latency spikes during streaming/repair"
This reverts commit bdfa2ed923245e236837f58925c797e26df32361; prevents nodes
from joining.
2017-06-21 11:28:29 +03:00
Calle Wilund
89ef69c4b3 system_keyspace: Make sure "system" is written to keyspaces (visible)
Fixes #2514
Bug in schema version 3 update: We failed to write "system" to the
schema tables. Only visible on an empty instance of course.
Message-Id: <1497966982-10044-1-git-send-email-calle@scylladb.com>
2017-06-20 20:59:47 +02:00
Avi Kivity
bdfa2ed923 db: prevent latency spikes during streaming/repair
The memtable destructor can take a long time if the memtable is full; use
clear_gently() to clear it without impacting latency.

Fixes #2477.
Message-Id: <20170620093550.16121-1-avi@scylladb.com>
2017-06-20 13:03:43 +02:00
Nadav Har'El
186f031187 Optimize sstable::as_mutation_source() for one partition
Although usually one can fast_forward_to() on the result of a
sstable::as_mutation_source(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.

With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumes this optimization exists... So this patch returns
this optimization, despite this meaning that we blatently ignore
the fwd_mr flag in that case.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170620081107.14335-1-nyh@scylladb.com>
2017-06-20 10:32:10 +02:00
Avi Kivity
ba0ba87bf9 Merge seastar upstream
* seastar 621b7ed...9e2b7ec (8):
  > Merge "Low-resolution clocks" from Jesse
  > build: disable -Wattributes when gcc -fvisibility=hidden bug strikes
  > build: work around ragel 7 generated code bug
  > rpc: make unmarshall_exception() inline
  > future-utils: make functions global
  > fix reactor stall detector rate limiting on an mostly idle system
  > prometheus: Add ability to add /metrics to any http_server
  > prometheus: fix memory leak in http_server_control
2017-06-20 11:01:40 +03:00
Tomasz Grabiec
358bf88cf8 mutation_reader: Fix abort when streaming more than one range
multi_range_mutation_reader uses fast_forward_to() to skip between
ranges, so we always need to create the underlying reader with with
mutation_reader::forwarding::yes if there is more than one range,
irrespective of whether multi_range_mutation_reader itself will be
forwarded or not.

Fixes #2510.

Introduced in commit 3018df1.

Message-Id: <1497943032-18696-1-git-send-email-tgrabiec@scylladb.com>
2017-06-20 10:29:45 +03:00
Amos Kong
92731eff4f common/scripts: fix node_exporter url
Commit ff3d83bc2f updated node_exporter
from 0.12.0 to 0.14.0, and it introduced a bug to download install file.

node_exporter started to add 'v' prefix in release tags[1] from 0.13.0,
so we need to fix the url.

[1] https://github.com/prometheus/node_exporter/tags

Fixes #2509

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <42b0a7612539a34034896d404d63a0a31ce79e10.1497919368.git.amos@scylladb.com>
2017-06-20 09:25:39 +03:00
Raphael S. Carvalho
82048e6f77 lcs: remove useless code for choosing L0 candidates
The code being removed could be used if parallel compaction were
allowed for LCS, but the current code isn't even allowing that.
At the moment, it's only wasting cycles.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 21:19:29 -03:00
Raphael S. Carvalho
81f20068d6 lcs: remove some dead code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 21:02:46 -03:00
Raphael S. Carvalho
b26dc6db1a lcs: make logger static
otherwise, there will be one instance per shard.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 21:01:21 -03:00
Raphael S. Carvalho
4bb27cbd6f lcs: actually prefer oldest sstables of L0 when it falls behind
Strategy prefers promoting oldest sstables in L0. Because sort
procedure is incorrectly sorting elements in descending order,
newest sstables will be promoted first *if and only if* L0 falls
behind (more than 32 sstables). If L0 doesn't fall behind, we'll
have all L0 sstables compacted with overlapping ones in L1.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 20:45:39 -03:00
Raphael S. Carvalho
90db2d7eba lcs: remove useless expensive check for overlapping L1 sstables
there's no way a L1 sstable will be in candidates set which was
previously built from list of L0 sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-19 19:33:52 -03:00
Takuya ASADA
71600eb298 dist/debian: Use pbuilder for Ubuntu/Debian debs
Enable pbuilder for Ubuntu/Debian to prevent build enviroment dependent issues.
Also support cross building by pbuilder.
(cross-building from Fedora 25 and Ubuntu 16.04 are tested)

closes #629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497895661-26376-1-git-send-email-syuu@scylladb.com>
2017-06-19 21:15:06 +03:00
Nadav Har'El
984da1d8d7 Make forwarding_tag local to streamed_mutation
As Avi noticed, the "forwarding_tag" which was meant to be local in
streamed_mutation, became global. If another class copied the same trick,
it would share the same type instead of being distinct types as intended.

The problem is that in:

	using forwarding = bool_class<class forwarding_tag>;

Apparently, the "class forwarding_tag" forward-declares a global type - it
does not create a local-scope type as intended, which the following apparently
does (even though no actual definition is given for that class):

	class forwarding_tag;
	using forwarding = bool_class<forwarding_tag>;

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619153933.13116-1-nyh@scylladb.com>
2017-06-19 20:04:47 +03:00
Nadav Har'El
3018df11b5 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::fowarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
2017-06-19 18:31:32 +03:00
Pekka Enberg
98fb2c0b56 docs: Fix Docker Hub documentation logo
The URL got broken when www.scylladb.com changed. Fix it up.

Message-Id: <1497360648-19210-1-git-send-email-penberg@scylladb.com>
2017-06-19 13:11:59 +03:00
Takuya ASADA
e1459dc9ef dist/debian: provides 3rdparty packages for Debian jessie
Now we provides jessie prebuilt 3rdparty packages, allow running without --rebuild-dep.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497863340-19726-1-git-send-email-syuu@scylladb.com>
2017-06-19 13:11:20 +03:00
Takuya ASADA
3d671baf3b dist/debian: define dh_auto_configure task correctly
We mistakenly placed ./configure.py to dh_auto_build, but it's should place at
dh_auto_configure.
This bug causes issue #2505, since we haven't defined dh_auto_configure task yet(It seems running cmake on top of the dir is one of default behavior of dh_auto_configure).

Fixes #2505

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1497866068-32097-1-git-send-email-syuu@scylladb.com>
2017-06-19 12:58:59 +03:00
Duarte Nunes
7c17eba8e8 cql3/cql3_type: Don't quote tuple types
A regression introduced in 08b2ceb28e
quoted tuple type names, which, being of the form tuple<t1, ..., tn>,
would always be quoted. The quoted name would then not be found in any
internal data structures.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497813816-39956-1-git-send-email-duarte@scylladb.com>
2017-06-18 22:03:57 +02:00
Duarte Nunes
ffcd4c76c2 ide: Add CMakeLists.txt for cmake-based IDEs
This patch add the CMakeLists.txt file for IDEs based on cmake, like
CLion.

This file assumes the existence of a build/release/gen directory,
containing generated files.

Refs #867

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170618151333.94714-1-duarte@scylladb.com>
2017-06-18 18:47:55 +03:00
Avi Kivity
58fd3dd006 Merge "cql3: Quote type name when needed" from Duarte
"This patch set ensures we quote the name of a UDT when it
contains characters that may cause parsing by the CQL parser
to fail.

Fixes #2491"

* 'cql3-quote-type/v1' of https://github.com/duarten/scylla:
  cql3/util: Make maybe_quote() take argument by const reference
  cql3/cql3_type: Quote UDT name if needed
  schema: Lift maybe_quote() into cql3/util
2017-06-18 17:59:47 +03:00
Gleb Natapov
72a4554dd9 storage_proxy: Fix compilation on older (1.55) boost
Boost 1.55 (ubuntu 14) fails to compile because an iterator produce by
boost::adaptors::transformed() when std::ref to lambda is passed to
it do not match iterator concept. It cannot be default constructed
because std::reference_wrapper is not default constructable.
boost::range::min_element() never actually default construct it, but
concept is checked anyway. The patch fixes it by providing an explicit
functor that is default constructable.

Message-Id: <20170618131836.GD3944@scylladb.com>
2017-06-18 16:54:41 +03:00
Duarte Nunes
b2c5aca4cf db/schema_tables: View mutations shouldn't always include base ones
When making the schema mutations for a view update, we should only
include the base table schema mutations (in case the target node
doesn't contain them) when the view is being directly updated. When it
is being updated as a side effect of updating the base table, then
including the base schema mutations will hide the actual changes being
performed on the base.

Fixes #2500

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497782822-2711-1-git-send-email-duarte@scylladb.com>
2017-06-18 16:29:59 +03:00
Avi Kivity
6e2c9ef9fb Revert "Allow reading exactly desired byte ranges and fast_forward_to"
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2).  It causes crashes
during range scans, reported by Gleb:

"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.

Backtrace:
    at /home/gleb/work/seastar/seastar/core/apply.hh:36
    rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
    range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
    at ./seastar/core/future.hh:890
    at /home/gleb/work/seastar/seastar/core/future-util.hh:119
    at /home/gleb/work/seastar/seastar/core/future-util.hh:142
2017-06-18 16:10:21 +03:00
Amnon Heiman
ff3d83bc2f node_exporter_install script update version to 0.14
Fixes #2097

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170612125724.7287-1-amnon@scylladb.com>
2017-06-18 12:25:58 +03:00
Calle Wilund
3464422051 commitlog_test: Fix reader test dropping rp handles
Test wants data in live segments to read from, so should
not just drop the handles returned from allocate.
Message-Id: <1497344532-2616-1-git-send-email-calle@scylladb.com>
2017-06-16 22:45:46 +01:00
Etienne Kruger
be0a947596 tests: perf_simple_query: Add delete perf test
Add a performance test for deletion in addition to the existing update
and query tests. The deletion performance test is executed using the
'--delete' argument to perf_simple_query.

Fixes #2417.

Signed-off-by: Etienne Kruger <el@loadavg.io>
Message-Id: <20170615232500.26987-1-el@loadavg.io>
2017-06-16 14:51:00 +01:00
Duarte Nunes
b993124d94 cql3/util: Make maybe_quote() take argument by const reference
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-15 19:55:52 +00:00
Duarte Nunes
08b2ceb28e cql3/cql3_type: Quote UDT name if needed
This patch ensures we properly quote a UDT name, which may contain
characters like ".", which can lead the name to be interpreted as a
keyspace qualified name when parsed by the CQL parser.

Fixes #2491

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-15 19:55:52 +00:00
Duarte Nunes
4886b7ed5e schema: Lift maybe_quote() into cql3/util
It's a more natural place given its current and future usages.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-15 19:55:52 +00:00
Avi Kivity
2c57ab84b2 mutation_reader: fix typo in forwarding_tag
The typo went unnoticed since the compiler picked up the global scope's
forwarding_tag.  The bug made streamed_mutation::forwarding and
mutation_reader::forwarding the same type, but fortunately there were
no type mixups due to this.
2017-06-15 20:13:01 +03:00
Avi Kivity
9cf6db3de5 Merge 2017-06-15 19:11:07 +03:00
Nadav Har'El
317d7fc253 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::fowarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
2017-06-15 13:22:46 +01:00
Avi Kivity
da24bd7c34 Merge "Balance read requests according to CF's cache hit ratio" from Gleb
"During read query with CL<ALL not all replicas are contacted. It is
possible for some replicas to cache less data for some CF's (for instance
because of node restart), so the replica choice may have a big impact
on request's completion latency and on amount of work it generates in
a cluster.

This patch series keep track of per CF cached hit ratio and uses this
information to choose best replicas for a request. Nodes with lower
hit ratios are still contacted in order to populate their cache, but
less frequently."

* 'gleb/cache-hitrate' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: load balance read requests according to cache hit rates
  choose extra replica for speculation in filter_for_query()
  consistency_level: drop filter_for_query_dc_local function
  database: reset node's hit rate information on connection drop
  messaging_service: connection drop notifier
  Store cluster wide cache hit statistics in CF
  messaging_service: return cache hit ratio as part of data read
  Distribute cache temperature over gossiper.
  periodically calculate avg cache hit rate between all shards
  database: introduce cache_temperature class
  Rename load_broadcaster.cc to misc_services.cc
  storage_proxy: use db::count_local_endpoints function instead open code it
2017-06-15 14:33:08 +03:00
Avi Kivity
7dffe7f933 Merge "parallel repair and more memory usage fix" from Asias
"This series reduces repair memory usage and improves repair speed."

* tag 'asias/fix-repair-2430-branch-master-v4.1' of github.com:cloudius-systems/seastar-dev:
  repair: Repair on all shards
  repair: Allow one stream plan in flight
2017-06-15 14:00:19 +03:00
Duarte Nunes
5736468a71 mutation_partition_serializer: Assume range tombstone support
Range tombstones were introduced in version 1.3 and there exists
no direct upgrade from 1.2 to vnext, so we can retire the code
enforcing backwards compatibility.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170614211654.82501-1-duarte@scylladb.com>
2017-06-15 09:54:05 +03:00
Avi Kivity
e11f1c9cc3 tests: fix partitioner_test build on gcc 5 2017-06-14 17:22:01 +03:00
Gleb Natapov
4fdfa2dbb7 gdb: Fix "scylla heapprof" command
Message-Id: <20170612084241.GF21915@scylladb.com>
2017-06-14 15:41:39 +02:00
Gleb Natapov
c7a59ab7ff do not calculate serialized size of commitlog_entry_writer before final format is knows
Currently commitlog_entry_writer constructor calculates serialized size
before it is knows if a schema should be included into the entry. The
result is never used since it is recalculated when schema information is
supplied. The patch removes needless calculation.

Message-Id: <20170614114607.GA21915@scylladb.com>
2017-06-14 14:53:07 +03:00
Gleb Natapov
a032078410 intern also tuple and user defined types
Currently each time UDT or tuple is parsed new object is created. If
those objects are used to create container type repeatedly it will cause
memory leak since container types are interned, but lookup in the
cache is done using pointer to a contained type (which will be always
different for UDT and tuples). This patches interns also UDT and tuple,
so each type the same object is parsed same pointer is also returned.

Refs #2469
Fixes #2487

Message-Id: <20170612142942.GO21915@scylladb.com>
2017-06-14 14:41:17 +03:00
Asias He
47345078ec repair: Repair on all shards
Currently, shard zero is the coordinator of the repair. All the work of
checksuming of the local node and sending of the repair checksum rpc
verb is done on shard zero only. This causes other shards being
underutilized.

With this patch, we split the ranges need to be repaired into at least
smp::count ranges, so sizeof(ranges) / smp::count will be assigned to
each shard. For exmaple, we have 8 shards and 256 ragnes, each shard
will repair 32 ranges. Each shard will repair the 32 ranges
sequencially.  There will be at most 8 (smp::count) ranges of repair in
parallel.
2017-06-14 17:52:49 +08:00
Asias He
54831a344c repair: Allow one stream plan in flight
In "repair: Use more stream_plan" (commit 2043ffc064), we
switched to do stream while doing checksum instead of do stream only
after checksum pahse is completed. We take a parallelism_semaphore
before we do checksum, if there are more than sub_ranges_to_stream
(1024) ranges, we start a stream_plan and wait for the streaming to
complete (still under the parallelism_semaphore). So at most
parallelism_semaphore (100) stream_plans can be in parallel.

The parallelism_semaphore limits the parallelism of both checksum and the
streaming plan. However, it is not necessary to have the same
parallelism for both checksum and streaming, because 1) a streaming
operation itself runs in parallel (handling ranges on all shards in
prallel, sending mutaitons in parallel) , 2) and with more streaming plan
(in worse case 100) means we can write to 100 memtables at the same time
and flush 100 memtables to disk at the same time which can take a lot of
memory.

With this patch, we only allow one stream plan in flight.
2017-06-14 17:52:36 +08:00
Calle Wilund
525730e135 database: Fix assert in truncate to handle empty memtables+sstables
If we do two truncates in a row, the second will have neither memtable
nor sstable data. Thus we will not write/remove sstables, and thus
get no resulting truncation replay position.
Message-Id: <1497378469-6063-1-git-send-email-calle@scylladb.com>
2017-06-14 11:21:21 +02:00
Gleb Natapov
87094849fa storage_proxy: load balance read requests according to cache hit rates
This patch makes storage proxy to choose replicas to read from base on
their cache hit rates. Replicas with higher cache hit rates will see
more requests while replicas with lower hit rates will see less. Local
node has a special bonus and will get more requests even if another node
has slightly higher cache hit rate (same goes for local vs remote DC),
but after the patch it is no longer guarantied that a coordinator node
will be chosen as a replica for the read (if the feature is enabled).
2017-06-13 09:57:14 +03:00
Gleb Natapov
bc8aa1b4ee choose extra replica for speculation in filter_for_query()
Currently storage proxy has to loop over remaining replicas to search
for suitable extra replica, but doing it in filter_for_query() is
extremely easy, so do it there instead.
2017-06-13 09:57:14 +03:00
Gleb Natapov
8437ea3b99 consistency_level: drop filter_for_query_dc_local function
Merge filter_for_query_dc_local() functionality into filter_for_query().
This is more efficient since filter_for_query_dc_local() partitions
endpoints into 'local' and 'remote' set but filter_for_query() already
does it for CL=LOCAL so for such queries we needlessly do it twice.
2017-06-13 09:57:14 +03:00
Gleb Natapov
ca812a8ea0 database: reset node's hit rate information on connection drop
Node may go down, so after it restarts cache hit rate info will be
incorrect and it can be overwhelmed with traffic until new and
up-to-date cache hit rate arrives. Solve this by dropping node's
information on connection reset, it is more accurate than relying on
gossip which may be slow and miss reboot of a node.
2017-06-13 09:57:14 +03:00
Gleb Natapov
23c51b3e57 messaging_service: connection drop notifier
Allow registering callbacks that will be called when connection is going
down.
2017-06-13 09:57:14 +03:00
Gleb Natapov
0e4d5bc2f3 Store cluster wide cache hit statistics in CF 2017-06-13 09:57:14 +03:00
Gleb Natapov
69c5526301 messaging_service: return cache hit ratio as part of data read 2017-06-13 09:57:14 +03:00
Gleb Natapov
8ca1432b04 Distribute cache temperature over gossiper.
When a node start it does not have any information about cache temperature
of other nodes in the cluster and it is hard (if not impossible) to make
right guess. During cluster startup all nodes have cold caches, so there
is no point to redirect reads to other nodes even though local cache it
cold, but if only that node restarted than other nodes have populated
cache and reads should be redirected.

The node will get up-to-date information about other nodes caches,
but only after receiving first reply, until then it does not have the
information to make right decisions which may cause unwanted spikes
immediately after restart. Having cache temperature in gossiper helps
to solve the problem.
2017-06-13 09:57:14 +03:00
Gleb Natapov
991ec4a16c periodically calculate avg cache hit rate between all shards
This patch adds new class cache_hitrate_calculator whose responsibility
is to periodically calculate average cache hit rates between all shards
for each CF.
2017-06-13 09:57:14 +03:00
Gleb Natapov
fab18c0c5a database: introduce cache_temperature class
The class will represent cache hit rate for a column family and is
serializable for use with RPC.
2017-06-13 09:57:14 +03:00
Gleb Natapov
f59ecc2687 Rename load_broadcaster.cc to misc_services.cc
load_broadcaster is very small class, move it into generic file so that
we can put other small services there to save on compilation time.
2017-06-13 09:57:14 +03:00
Gleb Natapov
7bcf4c690f storage_proxy: use db::count_local_endpoints function instead open code it 2017-06-13 09:57:14 +03:00
Gleb Natapov
21197981a5 Fix use after free in nonwrapping_range::intersection
end_bound() returns temporary object (end_bound_ref), so it cannot be
taken by reference here and used later. Copy instead.

Message-Id: <20170612132328.GJ21915@scylladb.com>
2017-06-12 15:34:36 +01:00
Tomasz Grabiec
20095d7ed6 gdb: Fix "scylla column_families" command
Apparently some GDB versions (7.11.1-86.fc24) don't parse double '>'
in a type name, so this:

 std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family>>

should be this:

 std::pair<utils::UUID const, seastar::lw_shared_ptr<column_family> >

Message-Id: <1497256644-4335-1-git-send-email-tgrabiec@scylladb.com>
2017-06-12 11:39:50 +03:00
Tomasz Grabiec
9e7a040f0c gdb: Fix "scylla keyspaces" command
The problem is that 'key' is a 'bytes' object now, which doesn't have __format__.

Fixes the following error:

  Traceback (most recent call last):
    File "~/src/scylla/scylla-gdb.py", line 184, in invoke
  TypeError: non-empty format string passed to object.__format__
  Error occurred in Python command: non-empty format string passed to object.__format__

Message-Id: <1497253433-374-2-git-send-email-tgrabiec@scylladb.com>
2017-06-12 11:22:59 +03:00
Tomasz Grabiec
230683bdfa gdb: Add missing seastar namespace qualifier
Message-Id: <1497253433-374-1-git-send-email-tgrabiec@scylladb.com>
2017-06-12 11:22:53 +03:00
Asias He
2bcb368a13 repair: Fix range use after free
Capture it by value.

scylla:  [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed)
scylla:  [shard 0] repair - Failed sync of range ==<runtime_exception
(runtime error: Invalid token. Should have size 8, has size 0#012)>: streaming::stream_exception (Stream failed)

Message-Id: <7fda4432e54365f64b556e7e4c26e36d3a9bb1b7.1497238229.git.asias@scylladb.com>
2017-06-12 11:00:57 +03:00
Avi Kivity
419ad9d6cb Merge "repair memory usage fix" from Asias
"This series switches repair to use more stream plans to stream the mismatched
sub ranges and use a range generator to produce sub ranges.

Test shows no huge memory is used for repair with large data set.

In addition, we now have a progress reporter in the log how many ranges are processed.

   Jun 06 14:18:22  [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
   Jun 06 14:19:55  [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]

Fixes #2430."

* tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev:
  repair: Remove unused sub_ranges_max
  repair: Reduce parallelism in repair_ranges
  repair: Tweak the log a bit
  repair: Use more stream_plan
  repair: iterator over subranges instead of list
2017-06-08 14:19:08 +03:00
Tomasz Grabiec
9b7f170121 gdb: Improve error message
Message-Id: <1496849069-21750-1-git-send-email-tgrabiec@scylladb.com>
2017-06-07 18:26:31 +03:00
Tomasz Grabiec
0dfe1ad431 Merge "Relax replay position ordering requirement" from Calle
From seastar-dev.git calle/concorde

Normally, we require that all mutations applied to a column family
have replay positions higher than all previously flushed.
The main reason for this is to be able to determine when to drop a
commit log segment, i.e. determine that all replay positions less
than X are now in sstables.

This patch series, small as it is, relaxes this by instead of just
keeping track of high rp applied, keep a reference count to each
segment per CF in memtables, and on flush, release this very count.

The only case where we need to keep a water mark for RP is then
for table truncation, for which we simply say that the highest RP
applied to the column family is the lowest allowed henceforth,
and use the old reordering logic for this instead. I.e. very rare.

There is of course one (big?) downside to all this, and this is
"normal" commit log replay on startup after crash/shutdown.
Since we relax RP ordering, we cannot use RP:s in sstables as
low marks for replay start, since it is now allowed to exist
non-persisted mutations in commitlog with lower RP:s than
previously flushed. I.e. we more or less always have to replay
the full commit log.
It is worth noting though that due to compaction and the non-
propagation of RP marks to new sstables, we end up often
doing this anyway, so it is hard to say how much of a regression
this is.
2017-06-07 14:51:28 +02:00
Calle Wilund
18806989b6 database: remove hard rp ordering requirement, set low rp mark on
truncate

With commitlog keeping use-count per CF id, we can ease the ordering
restriction on replay positiontion. Previously we required that all
added mutations have a position > previously flushed. However, if
we accept that replay must now be all data, by keeping track instead
per CF of highest RP ever entered, we can instead just set a
low mark on truncation, since this is the only remaining hard
RP divider.
2017-06-07 12:07:01 +00:00
Calle Wilund
d9b8c79eb9 commitlog_replayer: Ignore sstable replay positions
With relaxed position ordering, we cannot use existing sstables as
water mark for replay. We must replay everything above truncation
marks.
2017-06-07 12:07:01 +00:00
Calle Wilund
2913241df1 memtable/commitlog: Change bookkeep to track individul segments
Use per CF-id reference count instead, and use handles as result of 
add operations. These must either be explicitly released or stored
(rp_set), or they will release the corresponding replay_position
upon destruction. 

Note: this does _not_ remove the replay positioning ordering requirement
for mutations. It just removes it as a means to track segment liveness.
2017-06-07 12:07:01 +00:00
Calle Wilund
0c598e5645 commitlog_test: Fix test_commitlog_delete_when_over_disk_limit
Test should
a.) Wait for the flush semaphore
b.) Only compare segement sets between start and end, not start,
    end and inbetwen. I.e. the test sort of assumed we started
    with < 2 (or so) segments. Not always the case (timing)

Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
2017-06-07 12:44:02 +03:00
Avi Kivity
07ff3f68e0 Merge seastar upstream
* seastar b1f69cc...621b7ed (8):
  > net/api: Remove outdated comments
  > Merge "Fixes for Clang 5" from Paweł
  > Merge "Metrics: Safely transfer metadata between shared" from Amnon
  > posix: add missing #include
  > build: add cmake dependency
  > build: add -Wno-maybe-uninitialized
  > rpc: handle messages larger than memory limit
       (Fixes #2453)
  > doxygen: enable macro expansion
2017-06-07 11:04:56 +03:00
Takuya ASADA
7fe63c539a dist/debian: install gdebi when it's not exist
Since we started to use gdebi for install build-dep metapackage that generated by
mk-build-dep, we need to install gdebi on build_deb.sh too.

Fixes #2451

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496819209-30318-1-git-send-email-syuu@scylladb.com>
2017-06-07 10:24:22 +03:00
Asias He
3fdb8a3d3f repair: Remove unused sub_ranges_max
With the sub range iterator, it is not used anymore. Drop it.
2017-06-07 08:52:45 +08:00
Asias He
ca00c10b35 repair: Reduce parallelism in repair_ranges
We currently repair all the ranges in parallel.

1) All the ranges will contend for parallelism_semaphore, instead of
processing multiple ranges in parallel and calculating the sub ranges
(which take memory) for each range in parallel, we can handle the ranges
one bye one.

We could have enough parallelism because the checksum are calucated on
all the shards.

2) If for some reason the repair failed, if we handle ranges 1 by 1, we
can log which range of repair is successful. Next time, we can ignore
them. If we start ranges in parallel, it has a high chance, no single
range is completed because all the ranges are on going.

Refs #1912
2017-06-07 08:50:57 +08:00
Asias He
3852665156 repair: Tweak the log a bit
- Count n out m ranges the repair is running for (kind of progress report)
- Make the 'Found differing range' log debug because it can be millions
  of such entries
- Print the failed ranges
2017-06-07 08:50:57 +08:00
Asias He
2043ffc064 repair: Use more stream_plan
In the very beginning, we use a stream_plan for each checksum range.
Later, we changed to use a single stream_plan for all the checksum
ranges. It pushes memory presure to streaming, e.g., millinons of ranges
in a vector to send over RPC.

To fix, we do checksum and streaming in parallel, limit the number of
checksum ranges stored in memory.

Fixes #2430
2017-06-07 08:50:56 +08:00
Nadav Har'El
b3ff37e67f repair: iterator over subranges instead of list
When starting repair, we divided the large token ranges (vnodes) linto small
subranges of a desired length (around 100 partition), and built a huge list
of those subranges - to iterate over them later and compare checksums of
those chunks.

However, building this list up-front is completely unnecessary, and wastes
a lot of memory: In a test with 1 TB of data, as much as 3 gigabytes was
spent on this list. Instead, what we do in this patch is to find the next
chunk in a DFS-like splitting algorithm, using only the token range
midpoint() function (as before). The amount of memory needed for this is
O(logN), instead of O(N) in the previous implementation.

Refs #2430.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2017-06-07 08:50:56 +08:00
Raphael S. Carvalho
0ca1e5cca3 sstables: fix report of disk space used by bloom filter
After change in boot, read_filter is called by distributed loader,
so its update to _filter_file_size is lost. The load variant
which receives foreign components that must do it. We were also
not updating it for newly created sstables.

Fixes #2449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170606151129.5477-1-raphaelsc@scylladb.com>
2017-06-06 18:20:28 +03:00
Takuya ASADA
a4c392c113 dist/debian: use gdebi instead of mk-build-deps -i
At least on Debian8, mk-build-deps -i silently finishes with return code 0
even it fails to install dependencies.
To prevent this, we should manually install the metapackage generated by
mk-build-deps using gdebi.

Fixes #2445

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496737502-10737-2-git-send-email-syuu@scylladb.com>
2017-06-06 11:37:34 +03:00
Takuya ASADA
5608842e96 dist/debian/dep: install texlive from jessie-backports to prevent gdb build fail on jessie
Installing openjdk-8-jre-headless from jessie-backports breaks texlive on
jessie main repo.
It causes 'Unmet build dependencies' error when building gdb package.
To prevent this, force insatlling texlive from jessie-backports before start
building gdb.

Fixes #2444

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496737502-10737-1-git-send-email-syuu@scylladb.com>
2017-06-06 11:37:33 +03:00
Paweł Dziepak
b2b78158f6 mutation_partition: restore formatting
No functional change.

Message-Id: <20170526104119.22075-2-pdziepak@scylladb.com>
2017-06-06 11:20:57 +03:00
Gleb Natapov
f5679e0416 database: remove remnants of no longer existing db::serializer.
Message-Id: <20170604100552.GD8248@scylladb.com>
2017-06-04 13:07:17 +03:00
Raphael S. Carvalho
dcbeb42f67 sstables: explicitly close file in fsync_directory
or close is called in the reactor thread when destroying the
file object.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170602024346.7803-1-raphaelsc@scylladb.com>
2017-06-02 21:09:58 +03:00
Pekka Enberg
a6dc21615b Merge "Fixes to thrift/server" from Duarte
"This series fixes some issues with the thrift_server, namely
ensuring that streams and sockets are properly closed.

Fixes #499
Fixes #2437"

* 'thrift-server-fixes/v1' of github.com:duarten/scylla:
  thrift/server: Close connections when stopping server
  thrift/server: Move connection class to header
  thrift/server: Shutdown connection
  thrift/server: Close output_stream when connection is done
2017-06-02 08:15:22 +03:00
Duarte Nunes
c525331e60 thrift/server: Close connections when stopping server
Fixes #499

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-02 00:15:20 +02:00
Duarte Nunes
315c69b830 thrift/server: Move connection class to header
No changes in functionality. Required for an upcoming patch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-02 00:15:20 +02:00
Duarte Nunes
22fafd5034 thrift/server: Shutdown connection
This patch adds the shutdown() function to thrif_server::connection,
and calls it after a connection is done.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-02 00:15:20 +02:00
Duarte Nunes
0a5ec97b7f thrift/server: Close output_stream when connection is done
Fixes #2437

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-06-02 00:15:20 +02:00
Jesse Haber-Kucharsky
376c661823 Eliminate duplicate definition of sstable column mask values
The column mask identifies the kind of atom in a row in an sstable. Two
definitions of these values were present: one as a C-style enumeration and one
as a C++11-style enumeration.

The C++11-style definition is used elsewhere in `sstables.cc`. It also offers
additional type-safety.

Therefore, this commit removes the inlined C-style enumeration.

Fixes #2214.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <c525b4ae7fad3b54480e133921aa4ffe0dd5d9ce.1496352711.git.jhaberku@scylladb.com>
2017-06-02 00:06:31 +02:00
Michał Matczuk
04da4dbf83 docker support for api-address
Message-Id: <1b5fb2bbba1b879aae825094a0f1b77c865be139.1496318996.git.michal@scylladb.com>
2017-06-01 15:31:45 +03:00
Takuya ASADA
22339bba44 dist/debian: depends to collectd-core instead of collectd, to reduce dependencies
To reduce unwanted dependencies, we need to replace dependency from collectd to
collectd-core.
However, collectd provides /etc/collectd/collectd.conf, so without this package
we need to install the configuration file by our self.
So install the file on .postinst script.

Fixes #2426

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496231743-7828-1-git-send-email-syuu@scylladb.com>
2017-06-01 13:20:37 +03:00
Takuya ASADA
909a9ebf97 dist/debian: provide prebuilt 3rdparty packages for Ubuntu 16.04
Currently we only offers 14.04 prebuit but we have 16.04 one on s3, so use it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1496301544-15251-1-git-send-email-syuu@scylladb.com>
2017-06-01 10:37:52 +03:00
Duarte Nunes
15a62701f2 test.py: Ensure view_schema_test runs with only one cpu
In the write path we don't wait for view updates, as they happen in
the background.

The view schema tests can fail when running with more than one cpu due
to this inherent race condition: the write to the base table returns
while the view updates are still being processed, after which we issue
a query to the view table. The shard handling the view data is not
guaranteed to finish processing the mutation before handling the query.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170531165726.9212-1-duarte@scylladb.com>
2017-05-31 19:17:51 +01:00
Raphael S. Carvalho
b8091799ca lcs: fix off-by-one comparison
invariant is broken if size of L0 candidates is equal to max
sstable size because the overlapping L1 sstables will not be
added to compacting set, and they will be promoted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170530143708.3775-1-raphaelsc@scylladb.com>
2017-05-30 17:39:51 +03:00
Avi Kivity
15af6acc8b dist: redirect stdout/stderr to the journal on systemd systems
Fixes #2408.

Message-Id: <20170524080729.10085-1-avi@scylladb.com>
2017-05-30 08:47:17 +03:00
Avi Kivity
1c84aae0c1 Merge seastar upstream
* seastar 68dbf60...b1f69cc (10):
  > metrics: fix namespace in documentation
  > add special logger for memory allocation failures
  > xen: remove
  > Merge "sanity checks, fixes and extensions in the perftune.py" from Vlad
  > tutorial: more "seastar" namespace
  > execution_stages: fix build errors in comments
  > tutorial: more "seastar" namespace additions
  > tutorial: more minor changes
  > tutorial: minor changes to the introduction
  > tutorial: start overhauling the examples to use "seastar" namespace
2017-05-29 19:02:02 +03:00
Calle Wilund
3512ed4596 storage_service/config: Add "native_transport_port_ssl" option
Mimic origin behaviour, iff TLS encryption is enabled, and
native_transport_port_ssl is set and different from
native_transport_port, start both tls- and non-tls
listeners.

Message-Id: <1496061600-24454-2-git-send-email-calle@scylladb.com>
2017-05-29 15:53:56 +03:00
Calle Wilund
1b387a1f56 cql server: Allow multiple listeners on different ports
Need to separate "notifiers" to per-port/address and keep
life span as such.

Message-Id: <1496061600-24454-1-git-send-email-calle@scylladb.com>
2017-05-29 15:53:50 +03:00
Avi Kivity
ef98afa748 build: make swagger generated code depend on the code generator
Fixes failures when moving between branches due to the seastar namespace
change.
Message-Id: <20170528100052.29131-1-avi@scylladb.com>
2017-05-29 13:17:42 +02:00
Avi Kivity
8979d7abf0 Deprecate non-murmur3 partitioners
Removing non-murmur3 partitioners will allow us to reduce memory footprint
and speed up some code by utilizing the properties of the murmur3 partitioner
token.
Message-Id: <20170528172536.16079-1-avi@scylladb.com>
2017-05-28 19:35:56 +02:00
Takuya ASADA
36ccbc1539 dist/ami: follow rpm output dir path change
CentOS mock support on build_rpm.sh changed rpm output directory, so follow it.

Fixes #2406

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1495573343-13912-1-git-send-email-syuu@scylladb.com>
2017-05-28 13:02:36 +03:00
Amos Kong
f655639e5a scylla_setup: fix deadloop in inputting invalid option
example: # scylla_setup --invalid-opt

Fixes #2305

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <9a4f631b126d8eaaae479fa99137db7a61a7c869.1493135357.git.amos@scylladb.com>
2017-05-28 13:02:10 +03:00
Takuya ASADA
bdec38d23c dist/common/scripts/scylla_setup: skip SELinux setup when it's already disabled
It doesn't make sence to ask "Do you want to disable SELinux?" when SELinux is
already disabled, so skip whole question.

Fixes #2411

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1495652423-20806-1-git-send-email-syuu@scylladb.com>
2017-05-28 13:00:10 +03:00
Avi Kivity
c4faa1e202 Merge "tracing: tracing spans and time series helper table" from Vlad
"
 - Introduce a parent span IP and span ID paradigm.
 - Introduce time series tables to simplify traces processing.
 - Add the "How to get traces?" chapter to the tracing.md.
"

* 'tracing-span-ids-and-time-series-helpers-v4' of github.com:cloudius-systems/seastar-dev:
  docs: tracing.md: add a "how to get traces" chapter
  tracing::trace_keyspace_helper: introduce a time series helper tables
  tracing: cleanup: use nullptr instead of trace_state_ptr()
  tracing: introduce a span ID and parent span ID
2017-05-28 12:01:35 +03:00
Paweł Dziepak
d9dd798c4f counter_write_query: avoid use-after-free on partition range
Message-Id: <20170526104119.22075-1-pdziepak@scylladb.com>
2017-05-28 11:41:30 +03:00
Raphael S. Carvalho
41137c7fb6 compaction: use sstable::bytes_on_disk for calculating start and end size
Currently, start and end size of compaction are calculated using the
uncompressed size of data component. bytes_on_disk() returns size
used by all components.

NOTE: start and end size are written to compaction history, so users
who monitor it should be aware of this change.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525212129.6758-1-raphaelsc@scylladb.com>
2017-05-28 11:33:24 +03:00
Raphael S. Carvalho
3b5ad23532 db: fix computation of live disk usage stat after compaction
sstable::data_size() is used by rebuild_statistics() which only
returns uncompressed data size, and the function called by it
expects actual disk space used by all components.
Boot uses add_sstable() which correctly updates the stat with
sstable::bytes_on_disk(). That's what needs to be used by
r__s() too.

Fixes #1592

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525210055.6391-1-raphaelsc@scylladb.com>
2017-05-28 10:38:32 +03:00
Vlad Zolotarov
1ae40ee91a utils::timestamped_val: fix the touch() comment
The current comment has been written when the function has not been a
timestamped_val member.

Let's adjust it to the current code.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1495555659-10881-1-git-send-email-vladz@scylladb.com>
2017-05-26 19:26:56 +03:00
Vlad Zolotarov
0619c2cb71 utils::serialization: remove not used deserialization_xxx() functions
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1495556124-16672-1-git-send-email-vladz@scylladb.com>
2017-05-26 19:26:20 +03:00
Tomasz Grabiec
de70d942a9 memtable: Decouple from sstable
We can make the dependency more abstract by using mutation_source
instead of an sstable.

Will be useful in some stress tests which want to avoid the disk, but
is also good for the sake of decoupling.
Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>
2017-05-25 19:30:21 +03:00
Tomasz Grabiec
f3a6d94398 sstables: Introduce sstable::as_mutation_source()
Adaptors extracted from existing testing code.
Message-Id: <1495729508-30081-1-git-send-email-tgrabiec@scylladb.com>
2017-05-25 19:30:20 +03:00
Glauber Costa
3d3afd8f11 node_exporter: add interrupt information
Information about interrupts is invaluable when debugging performance
problems with Scylla in the field. node_exporter doesn't include that in
the list of collectors enabled by default, so we suggest we do it here.

The list that goes into this file is the default list as shown by
node_exporter, with "interrupts" added to it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170525153008.26720-1-glauber@scylladb.com>
2017-05-25 19:11:30 +03:00
Avi Kivity
a8a82433e5 Merge "improve lcs promotion decision with compression enabled" from Raphael
"lcs' current behavior will make it hard to reduce number of
levels by increasing sstable size because it uses uncompressed
length when deciding whether or not a level needs promotion.
Demotion process is slower because of that."

* 'lcs_promotion_improvement_2' of github.com:raphaelsc/scylla:
  lcs: use sstable compressed length when computing level size
  sstables: introduce sstable::ondisk_data_size
  sstables: unconditionally set sstable data file size
2017-05-25 12:37:24 +03:00
Tomasz Grabiec
848ca035a2 gdb: Adjust scylla-gdb.py for the namespace change in seastar
Message-Id: <1495700444-29269-1-git-send-email-tgrabiec@scylladb.com>
2017-05-25 11:52:42 +03:00
Raphael S. Carvalho
b7e1575ad4 db: remove partial sstable created by memtable flush which failed
partial sstable files aren't being removed after each failed attempt
to flush memtable, which happens periodically. If the cause of the
failure is ENOSPC, memtable flush will be attempted forever, and
as a result, column family may be left with a huge amount of partial
files which will overwhelm subsequent boot when removing temporary
TOC. In the past, it led to OOM because removal of temporary TOC
took place in parallel.

Fixes #2407.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170525015455.23776-1-raphaelsc@scylladb.com>
2017-05-25 11:50:02 +03:00
Raphael S. Carvalho
0a105473df lcs: use sstable compressed length when computing level size
lcs uses uncompressed length of sstables when computing size of a
level, and that may result in unnecessary promotion when the goal
is to reduce the number of levels after an increase in sstable
size, for example.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-24 20:10:02 -03:00
Raphael S. Carvalho
b2dc0b2db5 sstables: introduce sstable::ondisk_data_size
this new function is an alternative to data_size(), which will
return size of data component. data_size() returns uncompressed
size of data component if compression is enabled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-24 20:10:00 -03:00
Raphael S. Carvalho
e02bf6da58 sstables: unconditionally set sstable data file size
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-24 20:09:58 -03:00
Piotr Jastrzebski
6528f3a963 Make sure mutation_reader for sstables can be fast-forwarded
Fixes #2145.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec: Extracted from a series, fixed title]
Message-Id: <1495639745-19387-1-git-send-email-tgrabiec@scylladb.com>
2017-05-24 16:36:24 +01:00
Tomasz Grabiec
6cf2841654 mvcc: Extract partition_snapshot_reader to separate header
Right know whole world includes it transitively, which results in
painful recompiles when the code changes.

Relax dependencies.
Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>
2017-05-24 12:13:15 +01:00
Asias He
f792c78c96 streaming: Do not abort session too early in idle detection
Streaming ususally takes long time to complete. Abort it on false
positive idle detection can be very wasteful.

Increase the abort timeout from 10 minutes to a very large timeout, 300
minutes. The real idle session will be aborted eventually if other
mechanisms, e.g., streaming manager has gossip callback for on_remove
and on_restart event to abort, do not abort the session.

Fixes #2197

Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>
2017-05-24 12:29:50 +03:00
Paweł Dziepak
3b9c0a6ae2 Merge "loading_cache: fix the known complexity issue in the shrink() method" from Vlad
Use the boost::intrusive containers in order to achieve a O(1) complexity
for both "LRU list" update and to minimize the memory overhead in the hash
table item to "LRU list" item connection:
   - Make the timestamped_val be both a bi::list and a bi::unordered_set
     item.
   - Make a bi::unordered_set be a cache backend instead of the
     std::unordered_map.

As a result dropping k LRU items becomes an O(k) operation instead of
O(N log N), where N is a total number of all cached items:
   - Every time a value is read - move it to the front of the "LRU list"
     (O(1)).
   - When we need to remove k LRU items:
      - Repeat k times:
         - Take an element from the back of the "LRU list". (O(1)).
         - Remove it from the bi::unordered_set and dispose. (O(1)).
           We use an auto-unlink configuration for bi::list, therefore
           disposing an item is going to auto unlink it from the list.

* 'permissions_cache_move_to_intrusive-v1' of github.com:scylladb/seastar-dev:
  utils::loading_cache: cleanup
  utils/loading_cache.hh: use intrusive list to store the lru entry
  utils::loading_cache: implement automatic rehashing
  utils::loading_cache: make the underlying map to be an intrusive unordered_set
2017-05-23 16:18:16 +01:00
Tomasz Grabiec
cec8d7f38c gdb: Fix error about gdb.Value not being convertible to int by %x format
Message-Id: <1495538843-27777-1-git-send-email-tgrabiec@scylladb.com>
2017-05-23 15:38:58 +03:00
Avi Kivity
fd0e1eb1e2 Merge "Fixes for mutation algebra" from Tomasz
"Enforces commutativity of addition:

 m1 + m2 == m2 + m1

and consistency of difference and addition with equality:

 m1 + (m2 - m1) == m1 + m2"

* tag 'tgrabiec/fix-range-tombstone-commutativity-v2' of github.com:cloudius-systems/seastar-dev:
  mutation: Make compare_*_for_merge() consistent with equals()
  tests: mutation: Improve assertion failure message
  tests: Use default equality in test_mutation_diff_with_random_generator
  mutation: Make counter cell difference consistent with apply
  tests: range_tombstone_list_test: Improve error message
  tests: range_tombstone_list: Check adjacent range merging
  range_tombstone_list: Merge adjacent range tombstones in apply()
  tests: mutation: Check commutativity of mutation addition
  range_tombstone_list: Avoid violating set invariant
  range_tombstone_list: Make tombstone merging commutative
  range_tombstone_list: Add erase() operation to the reverter
  range_tombstone_list: Make all undo operations ordered relative to each other
  utils: Extract to_boost_visitor() to a separate header
  allocating_strategy: Introduce alloc_strategy_unique_ptr<>
2017-05-23 15:20:38 +03:00
Tomasz Grabiec
804f46f684 mutation: Make compare_*_for_merge() consistent with equals()
equals() considers expiring cells to be different form non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
2017-05-23 13:35:03 +02:00
Tomasz Grabiec
c1475a8eb2 tests: mutation: Improve assertion failure message 2017-05-23 13:16:03 +02:00
Tomasz Grabiec
d15880b3b7 tests: Use default equality in test_mutation_diff_with_random_generator 2017-05-23 13:16:03 +02:00
Tomasz Grabiec
9dbae279ad mutation: Make counter cell difference consistent with apply
The case when both cells are dead was not handled properly, the diff
was always empty, whereas the cell with higher timestamp should win.

Caused test_mutation_diff_with_random_generator to fail.
2017-05-23 13:16:03 +02:00
Tomasz Grabiec
951da421db tests: range_tombstone_list_test: Improve error message 2017-05-23 13:16:03 +02:00
Tomasz Grabiec
bee40b4628 tests: range_tombstone_list: Check adjacent range merging 2017-05-23 13:16:03 +02:00
Tomasz Grabiec
3c509308ab range_tombstone_list: Merge adjacent range tombstones in apply()
Needed for equivalence to work correctly with difference and addition:

  m1 + (m2 - m1) = m1 + m2

Fixes #2158.
2017-05-23 13:16:03 +02:00
Tomasz Grabiec
ef4c7c458c tests: mutation: Check commutativity of mutation addition 2017-05-23 12:11:12 +02:00
Tomasz Grabiec
1dea251ca2 range_tombstone_list: Avoid violating set invariant
The code was inserting an entry with the same key as its successor,
and only later adjusting the key of the old entry. This is violating
set's invariant of unique keys, and insertion may cause rebalancing. I
don't know if this violation actually causes problems currently, but
it's safer not to.

Fix by first updating the existing entry and then inserting the new
one.
2017-05-23 12:11:12 +02:00
Tomasz Grabiec
a2a22e5f00 range_tombstone_list: Make tombstone merging commutative
Example of non-commutative case:

 a = [1, 5]@t2
 b = {[2, 3]@t1, [4, 5]@t1}

 a + b = [1, 5]@t2
 b + a = [1, 4)@t2, [4, 5]@t2

After this patch, both will yield [1, 5]@t2.

The patch also changes the logic to handle overlaps of tombstones with
equal timestamps to be handled symmetrically. They are now merged
instead of split on either of the boundary.

Refs #2158.
2017-05-23 12:11:12 +02:00
Tomasz Grabiec
c4dac7c80f range_tombstone_list: Add erase() operation to the reverter 2017-05-23 12:11:12 +02:00
Tomasz Grabiec
935709cddc range_tombstone_list: Make all undo operations ordered relative to each other
Later operation may depend on the result of previous operation. Same dependency
is present when reverting the operations.

Fixes assertion failure in update reverter.
2017-05-23 12:11:12 +02:00
Vlad Zolotarov
2d4d198fb9 utils::loading_cache: cleanup
- Remove "_" at the beginning of the type names.
   - s/Pred/EqualPred/

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-22 23:02:18 -04:00
Vlad Zolotarov
fd59a548c0 utils/loading_cache.hh: use intrusive list to store the lru entry
Fix the shrink() O(n log n) complexity issue by constantly pushing the corresponding intrusive
list entry to the head of the list every time the values are read.

This will keep the list ordered by the last read time from the most recently read
to the least recently read entry.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-22 23:00:18 -04:00
Vlad Zolotarov
0c4e9efce7 utils::loading_cache: implement automatic rehashing
- Start the cache with 256 buckets - the minimum number of buckets.
   - Limit the maximal number of buckets by 1M buckets.
   - Keep the load factor between 0.25 and 1.0 as long as the number of buckets is
     between the minimum and the maximum values mentioned above.
   - Grow and shrink the hash every "refresh" period if needed.
   - Enable bi::power_2_buckets and bi::compare_hash bi::unordered_set options.
   - Enable bi::unordered_set_base_hook's bi::store_hash option.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-22 22:57:44 -04:00
Vlad Zolotarov
2be3596a4f utils::loading_cache: make the underlying map to be an intrusive unordered_set
Make the underlying map to be a boost::intrusive::unordered_set<timestamped_val>
instead of std::unordered_set<Key, timestamped_val>.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-22 18:45:13 -04:00
Tomasz Grabiec
5aeb9eb70c utils: Extract to_boost_visitor() to a separate header 2017-05-22 19:30:02 +02:00
Tomasz Grabiec
69e2eccf68 allocating_strategy: Introduce alloc_strategy_unique_ptr<> 2017-05-22 19:30:02 +02:00
Raphael S. Carvalho
4b4a1883aa refresh: do not use default priority for loading new sstables
Metadata is read using default priority class, which can significantly
slow down the process under high load. Compaction class can be used,
and if it turns out to be a problem, we can switch to a special class
for it.

Fixes #1859.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170517184546.17497-1-raphaelsc@scylladb.com>
2017-05-22 19:03:17 +03:00
Avi Kivity
ef428d008c Merge "reduce memory requirement for loading sstables" from Rapahel
"fixes a problem in which memory requirement for loading in-memory
components of sstables is very high due to unlimited parallelism."

* 'mem_requirement_sstable_load_v2_2' of github.com:raphaelsc/scylla:
  database: fix indentation of distributed_loader::open_sstable
  database: reduce memory requirement to load sstables
  sstables: loads components for a sstable in parallel
  sstables: enable read ahead for read of in-memory components
  sstables: make random_access_reader work with read ahead
2017-05-22 18:23:03 +03:00
Raphael S. Carvalho
28206993a4 database: fix indentation of distributed_loader::open_sstable
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-22 11:52:52 -03:00
Raphael S. Carvalho
a4e414cb3b database: reduce memory requirement to load sstables
SSTable load temporarily uses more space than needed to store metadata,
due to:
1) All components are read using read_simple() which uses 128k buffer.
file::dma_read_bulk() will allocate 128k, and may potentially allocate
another big buffer (128k - read) for file::read_maybe_eof().
2) read_filter() may use double the space it needs to.

Due to the fact that sstable loading parallelism is unlimited, Scylla
may require much more memory to load all sstables, and that may lead to
OOM. Higher the number of sstables higher the memory overhead.

To confirm this problem, I wrote a test[1] which loads 30k sstables in
parallel and reports the memory usage peak in the end.
When loading 30k sstables, each of which metadata is ~300kb, memory
usage peak was ~18G. When loading completed, only ~9GB were needed to
store all the metadata.
[1]: https://gist.github.com/raphaelsc/2db37b4fb34301833ab9eeed3b1a524d

To fix this problem, we need to set a limit on load parallelism (let's
start with a small number like 3 and adjust later if needed) and rely
on readahead so that the requirement drops considerably without
increasing boot time. Actually, boot time is improved by it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-05-22 11:52:51 -03:00
Raphael S. Carvalho
043fae2ef5 sstables: loads components for a sstable in parallel
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-05-22 11:52:49 -03:00
Raphael S. Carvalho
0ac729fd57 sstables: enable read ahead for read of in-memory components
Read ahead 4 is used. Let's adjust it later if needed. File size is
used to prevent file_input_stream from issuing useless reads beyond
file size with read ahead enabled. We can switch to variant without
length once file_input_stream handles it properly.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-22 11:52:37 -03:00
Raphael S. Carvalho
77b8870cf3 sstables: make random_access_reader work with read ahead
Scylla crashes if read ahead is enabled by file_random_access_reader
because a call to seek() destroys the existing input stream without
closing it first.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-05-22 11:52:33 -03:00
Duarte Nunes
6ac73b57fb cql3/statements/select_statement: Remove dead code
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170522100230.17393-1-duarte@scylladb.com>
2017-05-22 14:32:12 +03:00
Avi Kivity
5828ddcca4 Merge seastar upstream
* seastar 4af898c...68dbf60 (4):
  > dpdk: follow namespace changes to fix compile error
  > perftune.py: fix regression introduced in df5f74ac
  > doc: typo in README.md
  > posix_net: load-balance connections
2017-05-22 12:39:48 +03:00
Asias He
b56ba02335 gossip: Make bootstrap more robust
The bootstrapping node will be a gossip only member, until the streaming
finishes and the node becomes NORMAL state. If during this time, the
bootstrapping node is overwhelmed with streaming, it is possible the node will
delay the update the gossip heartbeat. Be forgiving for the bootstrapping node
and do not remove it from gossip too fast. Otherwise, streaming rpc verbs will
not be resent becasue the node is not in gossip membership anymore.

Fixes #2150

Message-Id: <286d7035d854f2a48abf4e1e2e3bfcb8b22b9ca2.1494553580.git.asias@scylladb.com>
2017-05-21 19:25:40 +03:00
Takuya ASADA
7777b558c4 dist/redhat: Use mock for CentOS/RHEL rpms
Enable mock for CentOS/RHEL, also support cross building by mock.

Fixes #630

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170513171200.14926-1-syuu@scylladb.com>
2017-05-21 19:22:54 +03:00
Avi Kivity
2f23648b9e Revert "dist: add conflict with Cassandra"
This reverts commit da55aecca3.  Instead of an
install-time conflict, we'll add a run-time conflict.
2017-05-21 18:37:59 +03:00
Alexys Jacob
c8116b4252 scylla_raid_setup: fix typo on print_usage
Simple typo fix on the usage message output, the script name was not correct.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170519145851.6205-1-ultrabug@gentoo.org>
2017-05-21 18:01:28 +03:00
Avi Kivity
5b182537db Merge seastar upstream
* seastar 8aef5f5...4af898c (4):
  > memory: fix debug build
  > tests: fix slab_test build
  > xen: fix fallouts from seastar namespace change
  > build: make swagger generated files depend on the code generator
2017-05-21 13:48:24 +03:00
Alexys Jacob
8dbad4f34a scylla_sysconfig_setup: fix typo on print_usage
Simple typo fix on the usage message output, the script name was not correct.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170519143227.2741-1-ultrabug@gentoo.org>
2017-05-21 13:41:43 +03:00
Alexys Jacob
c0756d97b8 scylla_setup: fix typos on cpu scaling messages
This fixes typos on CPU scaling related messages.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170519143703.3574-1-ultrabug@gentoo.org>
2017-05-21 13:41:42 +03:00
Glauber Costa
5f99158889 api: return correct values for bloom filter statistics
We are currently suspecting that the bloom filter false positive ratio
is not being respected. While trying to debug that, I found out that we
have a more basic problem:

The numbers are all meaningless, because the stats are wrong.  We are
accumulating by summing the ratios together. It's easy to see how this
doesn't work, if we look at an example where the ratio for some CFs is
zero:

SST1: false = 1, total = 2. ratio = 0.5
SST2: false = 0, total = 98 . ratio = 0.

The real ratio in this example is 1 / (98 + 2) = 1 %, but the displayed
ratio will be 0.5 + 0 = 0.5.

This patch will map reduce all the sstables together keeping both
numerator and denominator, yielding the right value at the end. To do
that, we'll reuse the existing ratio_holder class, which already does
exactly what we want.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170518222333.16307-1-glauber@scylladb.com>
2017-05-21 13:11:22 +03:00
Avi Kivity
ebaeefa02b Merge seatar upstream (seastar namespace)
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Avi Kivity
dab2783b58 Merge seastar upstream
* seastar 45b718b...f726938 (2):
  > memory: add --mbind option to supress warning message when running Seastar apps on container
  > Add support for Gentoo Linux irqbalance configuration detection.
2017-05-20 21:15:46 +03:00
Avi Kivity
c8cb3d6ff5 Merge "Materialized views: bug fixes and unit tests" from Duarte
"This series fixes bugs related to materialized views, most pertaining
to column filtering in the where clause."

* 'materialized-views/bug-fixes/v1' of https://github.com/duarten/scylla:
  tests/view_schema_test: Add more test cases
  tests/cql_assertions: Add assertion for row set equality
  single_column_relation: Correctly print IN relation
  statement_restrictions: Allow filtering regular columns for views
  statement_restrictions: Relax clustering restrictions for views
  statement_restrictions: Relax partition restrictions for views
  cql3/statements: Prevent setting default ttl on view
  cql3/restrictions: Complete implementation of is_satisfied_by()
  db/view: Re-implement clustering_prefix_matches()
  db/view: Re-implement partition_key_matches()
  db/view: Generate regular tombstone for base deletions
  db/view: Consider cell liveness when generating updates
  db/view: Don't generate view updates for static rows
2017-05-20 13:52:56 +03:00
Tomasz Grabiec
cd4d15672b utils: estimated_histogram: Fix clear()
It was a no-op. It doesn't seem currently used, but I will have a use
for it soon.
Message-Id: <1495198172-1969-1-git-send-email-tgrabiec@scylladb.com>
2017-05-19 14:34:34 +01:00
Paweł Dziepak
c560cf9d9d Merge "fixes and improvements in the permissions cache implementation" from Vlad
"There are numerous issues in the current implementation of permissions
cache starting from the logical errors and bugs and ending with the
suboptimal implementation described in the issue #2262."

* 'permissions_cache_fixes-v4' of github.com:scylladb/seastar-dev:
  utils::loading_cache: avoid the reads storm when the key is not in the cache
  utils::loading_cache: cleanup
  utils::loading_cache: align the constrains in the constructor with the parameters description
  utils::loading_cache: refresh in the background
  auth::auth: add operator<<() for a permission_cache key
  auth::auth::permissions_cache: use the values from the configuration - don't try to be smart
  db::config: define a saner default value for permissions_validity_in_ms
2017-05-18 13:33:05 +01:00
Vlad Zolotarov
6a63c87a9f utils::loading_cache: avoid the reads storm when the key is not in the cache
Use a mutex to serialize producers when the key is not present in the cache.

Fixes #2262

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-18 07:55:48 -04:00
Tomasz Grabiec
3fc1703ccf range: Fix SFINAE rule for picking the best do_lower_bound()/do_upper_bound() overload
mutation_partition has a slicing constructor which is supposed to copy
only the rows from the query range. The rows are located using
nonwrapping_range::lower_bound() and
nonwrapping_range::lower_bound(). Those two have two different
implementations chosen with SFINAE. One is using std::lower_bound(),
and one is using container's built in lower_bound() should it
exist. We're using intrusive tree in mutation_partition, so
container's lower_bound() is preferred. It's O(log N) whereas
std::lower_bound() is O(N), because tree's iterator is not random
access.

However, the current rule for picking container's lower_bound() never
triggers, because lower_bound() has two overloads in the container:

  ./range.hh:618:14: error: decltype cannot resolve address of overloaded function
              typename = decltype(&std::remove_reference<Range>::type::upper_bound)>
              ^~~~~~~~

As a result, the overload which uses std::lower_bound() is used.

Spotted when running perf_fast_forward with wide partition limit in
cache lifted off. It's so slow that I timeouted waiting for the result
(> 16 min).

Fixes #2395.

Message-Id: <1495048614-9913-1-git-send-email-tgrabiec@scylladb.com>
2017-05-18 13:28:10 +03:00
Avi Kivity
ba31619594 tests: fix partitioner_test for g++ 5
It can't make the leap from dht::ring_position to
stdx::optional<range_bound<dht::ring_position>> for some reason.
2017-05-18 13:09:41 +03:00
Pekka Enberg
30b5933db2 Merge "Add Gentoo Linux support to utility and setup scripts" from Alexys
"These patches add support to setup and operate ScyllaDB on Gentoo Linux.

 * scylla_setup and related scripts
 * node_health_check

 I have kept them as simple as possible and tested them to setup and operate
 succesfully a three nodes cluster running on Gentoo Linux."

* 'gentoo_linux_support' of github.com:ultrabug/scylla:
  scylla_setup: add gentoo linux installation detection
  prometheus node_exporter install: add support for gentoo linux
  raid setup: add support for gentoo linux
  ntp setup: add support for gentoo linux
  kernel check: add support for gentoo linux
  cpuscaling setup: add support for gentoo linux
  coredump setup: add support for gentoo linux
  detect gentoo linux on selinux setup
  add gentoo_variant detection and SYSCONFIG setup
2017-05-18 09:41:13 +03:00
Vlad Zolotarov
1ef22f84c1 utils::loading_cache: cleanup
- Fix a callback signature: receive a const ref.
   - White spaces.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 15:03:14 -04:00
Vlad Zolotarov
87ce0b2d47 utils::loading_cache: align the constrains in the constructor with the parameters description
According to description of permissions_validity_in_ms the permissions_cache is enabled if this
value is set to a non-zero value. Otherwise the permissions_cache is disabled.

According to the permissions_update_interval_in_ms description it must have a non-zero value if permissions_cache
is enabled.

permissions_cache_max_entries description doesn't explicitly state it but it makes no sense to allow it to be zero
if permissions_cache is enabled.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 15:03:14 -04:00
Vlad Zolotarov
e286828472 utils::loading_cache: refresh in the background
This patch changes the way a loading_cache works.

Before this patch:
   1) If a permissions key is not in the cache it's loaded in the foreground and the original
      query is blocked till the permissions are loaded.
   2) Every _period the timer does the following:
      1) If a value was loaded more than _expiry time ago it is removed from the cache.
      2) If the cache is too big - the less recently loaded values are removed till the cache
         fits the requested size.

After this patch:
   1) If a permissions key is not in the cache it's loaded in the foreground and the original
      query is blocked till the permissions are loaded.
   2) Every _period the timer does the following:
      1) If a value in the cache was loaded or read for the last time more than _expiry time ago - it's removed from the cache.
      2) If the cache is too big - the less recently read values are removed till the cache fits the requested size.
      3) The values that were loaded more than _refresh time ago are re-read in the background.

The new implementation allows to minimize the amount of the foreground reads for a frequently used value to a single
event (when the value is loaded for the first time).

It also ensures we do not reload values we no longer need.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 15:03:06 -04:00
Alexys Jacob
fa0944ac19 scylla_setup: add gentoo linux installation detection 2017-05-17 18:06:54 +02:00
Alexys Jacob
9bb1bda466 prometheus node_exporter install: add support for gentoo linux 2017-05-17 18:06:34 +02:00
Alexys Jacob
1d235e5012 raid setup: add support for gentoo linux 2017-05-17 18:06:14 +02:00
Alexys Jacob
fdd5944ab2 ntp setup: add support for gentoo linux 2017-05-17 18:05:59 +02:00
Alexys Jacob
412f96a1bf kernel check: add support for gentoo linux 2017-05-17 18:05:45 +02:00
Alexys Jacob
a198f2b1af cpuscaling setup: add support for gentoo linux 2017-05-17 18:05:24 +02:00
Alexys Jacob
6a1807a7d8 coredump setup: add support for gentoo linux 2017-05-17 18:05:08 +02:00
Alexys Jacob
bc63e501db detect gentoo linux on selinux setup 2017-05-17 18:04:20 +02:00
Vlad Zolotarov
4edb336ac5 auth::auth: add operator<<() for a permission_cache key
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 12:03:56 -04:00
Vlad Zolotarov
d780818cac auth::auth::permissions_cache: use the values from the configuration - don't try to be smart
Our configuration already has the default values for for permission cache parameters.
Therefore if user decides to give some bad parameters we'd rather fail the load and inform him/her
about the bad parameters instead of trying to silently "fix" them.

In addition the original code wasn't passing the parameters correctly: it switched the "expiry" and "refresh" parameters in
the utils::loaded_cache constructor.

Add to this that the original code was doing really strange things in the permission_cache::expiry(cfg) method.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 12:03:56 -04:00
Vlad Zolotarov
ea1cfabe28 db::config: define a saner default value for permissions_validity_in_ms
It makes little sense to have the same value for permissions_update_interval_in_ms and permissions_validity_in_ms.
This may cause the values to be invalidated only because some minor delays in the timer scheduling.

It makes a lot more sense to make the permissions_update_interval_in_ms value smaller than permissions_validity_in_ms.
This way we would minimize the chances of "false invalidation" due to some small delays in the timer scheduling.

In addition, 2s seems to be a too small value for permissions_validity_in_ms since our default read_request_timeout_in_ms is 5s.
This means that a single system_auth read failure would guarantee that the following queries are going to read system_auth data
in the foreground.

Setting it to 10s would allow a second read attempt before we enforce the foreground read.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-17 12:03:56 -04:00
Alexys Jacob
2ca0380d06 add gentoo_variant detection and SYSCONFIG setup 2017-05-17 18:03:53 +02:00
Avi Kivity
2aa5b3e20c Merge "Improve perf_fast_forward test" from Tomasz
"Notably:
  - add validation of the results (e.g. fragment count, expectations about disk activity)
  - add cache-specific tests"

* 'tgrabiec/add-cache-tests-to-perf-fast-forward' of github.com:cloudius-systems/seastar-dev:
  tests: perf_fast_forward: Report cache stats
  row_cache: Keep counters in a struct
  tests: perf_fast_forward: Add cache-specific tests
  tests: perf_fast_forward: Extract test_reading_all()
  tests: perf_fast_forward: Add validation of the results
  tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments
  tests: perf_fast_forward: Allow the test to be interrupted
  tests: perf_fast_forward: Allow testing with cache enabled
  row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
2017-05-17 18:06:02 +03:00
Calle Wilund
29b20d410a schema_tables: Remove "class" attribute from strategy options
Not 100% proper, but in line with how we still store the info.
Ensures (helps at least) to keep schema loaded from tables
and schema from builder comparable.

Fixes schema_changes_test error.

Message-Id: <1495030581-2138-2-git-send-email-calle@scylladb.com>
2017-05-17 17:56:11 +03:00
Calle Wilund
6ca07f16c1 scylla: fix compilation errors on gcc 5
Message-Id: <1495030581-2138-1-git-send-email-calle@scylladb.com>
2017-05-17 17:56:06 +03:00
Paweł Dziepak
3ecceaee48 Merge "Fix fast_forward_to() on sstable reader being ignored in some cases" from Tomasz
"When mutation reader enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request."

* 'tgrabiec/fix-initial-fast-forward-to-for-single-key-sstable-readers' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test forwarding in single-key readers
  sstables: Remove unused code
  sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
  sstables: Fix verify_end_state() to tolerate ATOM_START_2 state
2017-05-17 15:35:30 +01:00
Avi Kivity
eb69fe78a4 Merge "Adding private repository to housekeeping" from Amnon
"This series adds private repository support to scylla-housekeeping"

* 'amnon/housekeeping_private_repo_v3' of github.com:cloudius-systems/seastar-dev:
  scylla-housekeeping service: Support private repositories
  scylla-housekeeping-upstart: Use repository id, when checking for version
  scylla-housekeeping: support private repositories
2017-05-17 15:56:46 +03:00
Tomasz Grabiec
777ffa3a27 tests: perf_fast_forward: Report cache stats 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
d1bde3036e row_cache: Keep counters in a struct
So that taking a snapshot of all stats is easy.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
7a81f5e980 tests: perf_fast_forward: Add cache-specific tests 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
1a7b03004a tests: perf_fast_forward: Extract test_reading_all() 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
a38fd16f89 tests: perf_fast_forward: Add validation of the results 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
3c3ea51657 tests: perf_fast_forward: Fix partition scans to read the expected amount of fragments
make_pkeys() needs to be invoked with n equal to the number of keys
which the table was populated with. Otherwise the extra keys, which
are missing in the table, may be placed anywhere in the vector due to
ring order sorting, and break the assumption that the table contains
all keys from the array up to index n. This resulted in the test
reading slighlty less fragments than it would follow from the desired
count.

Another problem is that we should not skip the fast_forward_to() call
for the inital range (workaround for a bug in sstable mutation
reader), otherwise we will read slightly less than expected as well.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
49a0bc3847 tests: perf_fast_forward: Allow the test to be interrupted 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
5c7f5643a6 tests: perf_fast_forward: Allow testing with cache enabled 2017-05-17 14:15:14 +02:00
Tomasz Grabiec
35c9dfecc2 row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
Needed to make perf_fast_forward work with cache enabled.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
84648f73ef Merge "Fix performance problems with high shard counts tag" from Avi
From http://github.com/avikivity/scylla exponential-sharder/v3.

The sharder, which takes a range of tokens and splits it among shards, is
slow with large shard count and the default
murmur3_partitioner_ignore_msb_bits.

This patchset fixes excessive iteration in sstable sharding metadata writer and
nonsignular range scans.

Without this patchset, sealing a memtable takes > 60 ms on a 48-shard
system.  With the patchset, it drops below the latency tracker threshold I
used (5 ms).
2017-05-17 14:03:33 +02:00
Avi Kivity
68034604e1 dht: murmur3_partitioner: simplify moving to and from the zero-based token range 2017-05-17 13:50:30 +03:00
Avi Kivity
1a99ebaa65 storage_proxy: switch to the exponential sharder for nonsingular queries
Nonsingular queries used exponential expansion of the token space to
avoid spending too much cpu time on near-empty tables, but the generation
of the search space was itself exponential.  Switch to the exponential sharder
which has linear cost.
2017-05-17 13:50:30 +03:00
Avi Kivity
00f48f96cb sstables: select just the shard we want when writing sharding metadata
On a system with many shards, this saves many useless iterations where
we just skip the unwanted shard.
2017-05-17 13:50:30 +03:00
Avi Kivity
44a1a51987 tests: add tests for dht::split_range_to_single_shard() 2017-05-17 13:50:30 +03:00
Avi Kivity
76f12a8842 dht: add split_range_to_single_shard()
Intersects a shard's owning range with a ring position range, and return
the sorted result.
2017-05-17 13:50:27 +03:00
Tomasz Grabiec
1da3daa4f4 range: Use more standard notation for singular range
Reuse notation for a single-element set.

Message-Id: <1494923827-10097-1-git-send-email-tgrabiec@scylladb.com>
2017-05-17 13:28:42 +03:00
Avi Kivity
a65e8bd215 dht: add a ring-position-range-vector variant of the exponential sharder
The "exponentiality" is not carried over from one range to another, because
we expect one or two ranges (two ranges result from a wrapped around thrift
token range).
2017-05-17 13:18:52 +03:00
Avi Kivity
6eb6f12909 tests: add test for ring_position_exponential_sharder 2017-05-17 13:18:52 +03:00
Avi Kivity
f671ac13b4 dht: add an exponential ring_position range sharder
Like the regular sharder, the exponential sharder divides a range into
subranges owned by individual ranges.  Unlike the regular sharder, it
generates ever-increasing subranges, spanning more and more shards, and
eventually returns several subranges per shard.  To avoid using
exponential cpu and memory, subranges belonging to a single shard are merged,
and a flag is set to indicate the subranges are not ordered wrt. each other.
2017-05-17 13:18:49 +03:00
Avi Kivity
025c6b45b2 dht: extend i_partitioner::next_token_for_shard()
Right now, next_token_for_shard() only allows iterating linearly in shard
order.  Add the ability to select a specific shard to skip to (in case we're
only interested in a single shard), and to select larger ranges (so that
exponential increases are not implemented by iteration).
2017-05-17 12:30:03 +03:00
Avi Kivity
7156ea8804 dht: make ring_position_range_sharder more independent of global_partitioner
Useful for testing.
2017-05-17 12:30:03 +03:00
Avi Kivity
302fec8293 dht: make i_partitioner::name() const 2017-05-17 12:30:03 +03:00
Avi Kivity
f462c4327e dht: make i_partitioner keep track of the number of shards it was configured with
Useful for testing classes layered on top of the partitioner (the sharders).
2017-05-17 12:30:03 +03:00
Avi Kivity
04b16ae8ec dht: fix partitioner initialization for tests
The partitioners now depend on smp::count to be initialized correctly,
but smp::count isn't available at static initialization time.

The scylla executable isn't affected because it calls set_global_partitioner()
after smp::count has been initialized.

Fix by deferring initialization to the first global_partitioner() call.
2017-05-17 12:30:03 +03:00
Avi Kivity
1c6cecd9d0 utils: introduce div_ceil()
Divides integrals but rounds up rather than down.
2017-05-17 12:30:03 +03:00
Avi Kivity
f1dbb951da Merge "Materialized views: implement read before write" from Duarte
"This patch ensures we read the base table rows that an update
is modifying, in order to correctly calculate the set of
materialized view updates.

The read-before-write is performed on the shard applying the
update and attempts to do a precise read of the rows being modified,
which can be more than one in case of ranged deletions or a
batch update."

* 'materialized-views/read-existing/v2' of https://github.com/duarten/scylla:
  database: Read existing base mutations
  db/view: Calculate clustering ranges for MV read-before-write query
  db/view: Replace entry if cells don't match
  view_info: Store base regular col in the view's PK as column_id
  compound_view_wrapper: Add tri_compare
  bound_view: Build range bound from bound_view
  clustering_bounds_comparator: Enable Range concept
  range: Add lvalue version of transform()
  tests: Add test case for nonwrapping_range::intersection()
  nonwrapping_range: Add intersection() function
2017-05-17 12:26:26 +03:00
Duarte Nunes
ef252036ba tests/view_schema_test: Add more test cases
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 11:21:58 +02:00
Duarte Nunes
983af595e9 database: Read existing base mutations
When generating updates for a materialized view we need to read the
existing base row, to be able to determine the primary key of the view
row the new base update will supplant, in case the view includes a
base non-primary key column in its own primary key. That old view row
will be tombstoned or updated, if it exists, depending on the difference
between the new base row and the existing one, if any.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
0861a66853 tests/cql_assertions: Add assertion for row set equality
For row set equality, the order of the actual rows doesn't need to
match the order of the expected rows.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
8a77bfe35b db/view: Calculate clustering ranges for MV read-before-write query
Introduce the calculate_affected_clustering_ranges() function to
calculate the smallest subject of affected clustering ranges that we
need to query for.

The update_requires_read_before_write() function checks whether
a view is potentially affected by the base update.

The patch also cleans up the may_be_affected_by() function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
9115862419 single_column_relation: Correctly print IN relation
So that the output of a set of relations can be fed back into the CQL
parser; useful for materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
ec681060a8 db/view: Replace entry if cells don't match
If a base table regular columns is part of the view's pk, and if that
column changes, we should replace the entry, by deleting the row(s)
with the old value and inserting a new one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
3ef1a825c9 statement_restrictions: Allow filtering regular columns for views
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
0170c743d3 statement_restrictions: Relax clustering restrictions for views
In process_clustering_columns_restrictions(), don't require all
clustering columns to be restricted if we're dealing with a
materialized view's where clause.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
4f90b19cc2 statement_restrictions: Relax partition restrictions for views
In process_partition_key_restrictions(), don't require all partition
key columns to be restricted if we're dealing with a materialized
view's where clause.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
99b234d717 cql3/statements: Prevent setting default ttl on view
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
d480daffca cql3/restrictions: Complete implementation of is_satisfied_by()
This patch implements the is_satisfied_by() function for the remaining
types of restrictions, lifting the function declaration to
abstract_restrictions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
bad0edb23b db/view: Re-implement clustering_prefix_matches()
This patch implements clustering_prefix_matches() in terms of
abstract_restriction::is_satisfied_by() instead of ranges, which
supports filtering just a subset of the clustering columns.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
b0d1ea76a2 db/view: Re-implement partition_key_matches()
This patch implements partition_key_matches() in terms of
abstract_restriction::is_satisfied_by() instead of ranges, which
supports filtering just a component of a compound partition key.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
38be85a21d db/view: Generate regular tombstone for base deletions
Instead of shadowable tombstones, which only apply to updates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
1fd8b8e723 db/view: Consider cell liveness when generating updates
This patch ensures we take into account the liveness of the base's
regular column in the view's pk when generating view updates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
c421da6825 db/view: Don't generate view updates for static rows
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:19 +02:00
Duarte Nunes
f41a5e554d view_info: Store base regular col in the view's PK as column_id
This patch stores the base_non_pk_column_in_view column as column_id,
which is more convenient, and it also stores a two-level optional to
encode both lazy initialization and the absence of such a column.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
257eaa0d05 compound_view_wrapper: Add tri_compare
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
06a6679826 bound_view: Build range bound from bound_view
We introduce the bound_view::to_range_bound() function, which builds a
wrapping_range or nonwrapping_range bound from a bound_view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
8288e504fb clustering_bounds_comparator: Enable Range concept
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
fb1e966137 range: Add lvalue version of transform()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
f365b7f1f7 tests: Add test case for nonwrapping_range::intersection()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Duarte Nunes
1f9359efba nonwrapping_range: Add intersection() function
intersection() returns an optional range with the intersection of the
this range and the other, specified range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-17 10:33:18 +02:00
Avi Kivity
f5dae826ce Merge "Migrate schema tables to v3 format" from Calle
"Defines origin v3-format for system/schema tables, and use them for
schema storage/retrival.

Includes a legacy_schema_migrator implementation/port from origin. Note
that since we don't support features like triggers, functions and
aggregates, it will bail if encountering such a feature used.

Note also that this patch set does not convert the "hints" and
"backlog" tables, even though these have changed in v3 as well.
That will be a separate patch set.

Tested against dtests. Note that patches for dtest + ccm
will follow."

* 'calle/systemtables' of github.com:cloudius-systems/seastar-dev: (36 commits)
  legacy_schema_migrator: Actually truncate legacy schema tables on finish
  database: Extract "remove" from "drop_columnfamily"
  v3 schema test fixes
  thrift: Update CQL mapping of static CFs
  schema_tables: Use v3 schema tables and formats
  type_parser: Origin expects empty string -> bytes_type
  cf_prop_defs: Add crc_check_chance as recognized (even if we don't use)
  types_test: v3 style schemas enforce explicit "frozen" in tupes/ut:s
  cql3_type: v3 to_string
  cql_types: Introduce cql3_type::empty and associate with empty data_type
  schema: rename column accessors to be in line with origin
  schema: Add "is_static_compact_table"
  schema_builder: Add helper to generate unique column names akin origin
  schema: Add utility functions for static columns
  schema: Use heterogeneous comparator for columns bounds
  cql3_type_parser: Resolve from cql3 names/expressions
  cql3_type: Add "prepare_interal" and "references_user_type"
  cql3::cql3_type: Add prepare_internal path using only "local" holders
  cql3_type: Add virtual destructor.
  database/main: encapsulate system CF dir touching
  ...
2017-05-17 11:25:52 +03:00
Asias He
0abfe39d8f database: Log compaction strategy setting on shard 0 only
The compaction strategy is per node not per shard. Do not duplicate the
same log on all shards.

Message-Id: <1494835519.git.asias@scylladb.com>
2017-05-17 11:17:41 +03:00
Avi Kivity
f09f056515 Merge seastar upstream
* seastar 4a3118c...45b718b (7):
  > tests: make connect_test use a random port
  > log: Introduce log.info0
  > configure.py: link to DPDK PMD drivers which are already built on build/dpdk and enabled by default on DPDK config
  > Update fmt submodule
  > perftune: fix perftune.py IndexError when NIC uses less IRQs than requested.
  > build: Add more required build dependencies to the Dockerfile
  > Prometheus: Reserve in protobuf object before iterating
2017-05-17 11:16:58 +03:00
Raphael S. Carvalho
a58699cc92 sstables: kill sstable::mark_for_deletion_on_disk
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170515233021.21223-1-raphaelsc@scylladb.com>
2017-05-17 11:15:59 +03:00
Raphael S. Carvalho
deabf06d49 lcs: log invariant restoration
It will be useful for understanding the strategy behavior after
invariant is possibly broken by resharding.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170515234925.22793-1-raphaelsc@scylladb.com>
2017-05-17 11:15:41 +03:00
Avi Kivity
2eef7cd395 Merge "compress the tracing session ID when compression is requested" from Vlad
"Tested with:
   - test.py --mode relase
   - debug/test-serialization
   - c-s with both debug and relase compiled scylla with authentication enabled:
     cassandra-stress write  n=10000 no-warmup -rate threads=10 -mode native unprepared cql3 user='cassandra' password='cassandra'

Tested with:
   - test.py --mode relase
   - debug/test-serialization
   - c-s with both debug and relase compiled scylla with authentication enabled:
     cassandra-stress write  n=10000 no-warmup -rate threads=10 -mode native unprepared cql3 user='cassandra' password='cassandra'"

* 'compress_tracing_session_id-v6' of github.com:cloudius-systems/seastar-dev:
  cql_server::response: rework the tracing session ID insertion
  utils::UUID: align the UUID serialization API with the similar API of other classes in the project
  utils: serialization: unify the variety of serialize_XXX(...)
  cql_server::response: rework the compress(...) method
  cql_server::response: store the frame flags inside the class
2017-05-17 09:48:49 +03:00
Pekka Enberg
374c3d66ab Merge "Fixes for CQL regressions" from Duarte
"This series fixes a set of regressions introduced by
 f7bc88734a, resulting in two failed
 tests:

   testDenseNonCompositeTable(org.apache.cassandra.cql3.validation.operations.CreateTest)

 and

   testStaticColumnsWithDistinct(org.apache.cassandra.cql3.validation.entities.StaticColumnsTest)"

* 'cql-fixes/v1' of github.com:duarten/scylla:
  update_statement: Reject empty values for dense clustering key
  modification_statement: Fix detection of clustering keys
  cql3/restrictions/statement_restrictions: Consider statement type
  cql3/statements/modification_statement: Extract statement_type
2017-05-17 09:29:24 +03:00
Vlad Zolotarov
a0737abdc5 cql_server::response: rework the tracing session ID insertion
Insert the tracing session ID into the response body in the cql_server::response constructor.

Fixes #2356

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:57:28 -04:00
Vlad Zolotarov
494ea82a88 utils::UUID: align the UUID serialization API with the similar API of other classes in the project
The standard serialization API (e.g. in data_value) includes the following methods:

size_t serialized_size() const;
void serialize(bytes::iterator& it) const;
bytes serialize() const;

Align the utils::UUID API with the pattern above.

The only addition is that we are going to make an output iterator parameter of a second method above
a template so that we may serialize into different output sources.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:56:03 -04:00
Vlad Zolotarov
7706775a63 utils: serialization: unify the variety of serialize_XXX(...)
Use the same templated implementation for all different serialize_XXX(...).

The chosen implementation is based on the std::copy_n(char*, size, OutputIterator),
which is heavily optimized and will be using memcpy/memmove where possible.

This patch also removes the not needed specializations that accept signed integer
values since we were casting them to unsigned value anyway.

The std::ostream based specifications are also removed since they are not used
anywhere except for a test-serialization.cc and adjusting the ostream to the iterator
is a single-liner.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:56:03 -04:00
Vlad Zolotarov
a33fe5b775 cql_server::response: rework the compress(...) method
Cleanup the compress(...) method interface:
   - Encapsulate the technical details inside the method:
      - Re-write the _body inside the method instead of returning it.
      - Set the response::_flags inside the method.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 15:53:35 -04:00
Vlad Zolotarov
c00814383d cql_server::response: store the frame flags inside the class
It makes a lot more sense to keep the flags mask inside the response and update it each time
the corresponding feature is set instead of holding the separate components like tracing state
pointer.

This patch adds this ability to set the flags.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-05-16 14:31:54 -04:00
Takuya ASADA
da55aecca3 dist: add conflict with Cassandra
Cassandra and Scylla are not able to install single instance, so add
cassandra to 'Conflicts'.

Fixes #2157

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1494856314-9322-1-git-send-email-syuu@scylladb.com>
2017-05-16 19:18:27 +03:00
Gleb Natapov
c7ad3b9959 database: remove temporary sstables sequentially
The code that removes each sstable runs in a thread. Parallel
removing of a lot of sstables may start a lot of threads each of which
is taking 128k for its stack. There is no much benefit in running
deletion in parallel anyway, so fix it by deleting sstables sequentially.

Fixes #2384

Message-Id: <20170516103018.GQ3874@scylladb.com>
2017-05-16 15:06:10 +03:00
Tomasz Grabiec
bdf3c536aa tests: mutation_source_test: Test forwarding in single-key readers 2017-05-16 13:36:10 +02:00
Tomasz Grabiec
e07cc44af2 sstables: Remove unused code 2017-05-16 13:31:01 +02:00
Tomasz Grabiec
0e23f8aa9b sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
When mutation reads enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request.
2017-05-16 13:31:01 +02:00
Tomasz Grabiec
a1dea3c4fc sstables: Fix verify_end_state() to tolerate ATOM_START_2 state
We would be in that state if consume_row_start() returns porceed::yes
and the stream ends after that. This can happen if slicing using
promoted index determined that there are no cells in the partition in
the range.
2017-05-16 13:31:01 +02:00
Raphael S. Carvalho
706ce5a27b sstables: do not swallow system error exception in read_simple
If error code is different than ENOENT, exception is swallowed.
That can lead to a variety of problems down the road.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170515225309.19185-1-raphaelsc@scylladb.com>
2017-05-16 08:47:34 +02:00
Alexys Jacob
9ddc05899d Fix scylla-housekeeping version detection to work with newer setuptools
Newer setuptools parse_version() don't like dashed version strings,
so we should trim it to avoid false negative version_compare() checks.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170511162646.22129-1-ultrabug@gentoo.org>
2017-05-15 12:41:49 +03:00
Gleb Natapov
385645e8df storage_proxy: Fix mutation logging
Log mutation type only if mutation set is not empty.

Message-Id: <20170510142406.GA30426@scylladb.com>
2017-05-11 15:49:52 +01:00
Tomasz Grabiec
7b6be7e188 row_cache: Add missing propagation of the forwarding flag in handle_large_partition()
Message-Id: <1494503145-25622-1-git-send-email-tgrabiec@scylladb.com>
2017-05-11 15:47:19 +01:00
Vlad Zolotarov
a855e82eff service::client_state: don't allow dropping the system_auth and system_traces objects
Prevent the accidental dropping of system_auth and system_traces objects (keyspaces and tables)
but allow their modification (including tables).

We need to be able to modify keyspases in order to set/modify the replication strategy and its parameters.
We need to be able to ALTER the tables in order to allow rolling upgrades when some of the tables has changed.

Fixes #2346
Fixes #2338

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1494363335-20424-1-git-send-email-vladz@scylladb.com>
2017-05-11 13:03:30 +01:00
Tomasz Grabiec
0351ab8bc6 row_cache: Fix undefined behavior in read_wide()
_underlying is created with _range, which is captured by
reference. But range_and_underlyig_reader is moved after being
constructed by do_with(), so _range reference is invalidated.

Fixes #2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>
2017-05-11 09:43:43 +01:00
Duarte Nunes
a69039df03 tests/batchlog_manager_test: Fix failure
Since a9f6e5f8da, metrics can't be
duplicated. This patch works around that by avoiding to create a new
batchlog_manager (one is already created by the cql_test_env).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170510191047.6154-1-duarte@scylladb.com>
2017-05-11 08:28:08 +02:00
Duarte Nunes
ec35cc33f1 update_statement: Reject empty values for dense clustering key
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 19:54:42 +02:00
Duarte Nunes
03f765c468 modification_statement: Fix detection of clustering keys
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 19:54:42 +02:00
Duarte Nunes
d7701087af cql3/restrictions/statement_restrictions: Consider statement type
Now that update_statement uses statement_restrictions, we need our
validation logic to take the statement type into account, in
particular to deal with insertion statements which only set static
columns but specify clustering values.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 19:54:42 +02:00
Duarte Nunes
c2041753c9 cql3/statements/modification_statement: Extract statement_type
This patch extracts the statement_type into its own file. The type
will be later passed to statement_restrictions for validation
purposes.

Further along, we could add methods to it that currently live in other
statements so we can move more validation into statement_restrictions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 19:54:42 +02:00
Calle Wilund
c8f92536c1 legacy_schema_migrator: Actually truncate legacy schema tables on finish 2017-05-10 16:44:48 +00:00
Calle Wilund
3514123677 database: Extract "remove" from "drop_columnfamily" 2017-05-10 16:44:48 +00:00
Calle Wilund
66991a7ccb v3 schema test fixes 2017-05-10 16:44:48 +00:00
Duarte Nunes
6260f31e08 thrift: Update CQL mapping of static CFs
This patch updates the mapping of static CFs so that their CQL
representation is a non-compound, non-dense schema with static
columns, instead of regular ones. This matches the representation os
static CFs in Cassandra 3.x.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 16:44:48 +00:00
Calle Wilund
6c8b5fc09d schema_tables: Use v3 schema tables and formats
Switches system/schema_* for system_schema/*, updates schema/schema
builder and uses to hold/expect v3 style info (i.e. types & dropped).
2017-05-10 16:44:48 +00:00
Calle Wilund
f9b83e299e type_parser: Origin expects empty string -> bytes_type 2017-05-10 16:44:48 +00:00
Calle Wilund
97c54d254b cf_prop_defs: Add crc_check_chance as recognized (even if we don't use) 2017-05-10 16:44:48 +00:00
Calle Wilund
3d90152dc5 types_test: v3 style schemas enforce explicit "frozen" in tupes/ut:s 2017-05-10 16:44:48 +00:00
Calle Wilund
7969a156d5 cql3_type: v3 to_string 2017-05-10 16:44:48 +00:00
Calle Wilund
c572a8c83c cql_types: Introduce cql3_type::empty and associate with empty data_type 2017-05-10 16:44:48 +00:00
Calle Wilund
0e6ae8dec2 schema: rename column accessors to be in line with origin
More pointedly: Expose columns as is (currently
all_columns_in_select_order), expose name->column mapping more
appropriately named. 

Renaming like this is not strictly neccesary, but there is a point to
trying to keep nomenclature similar-ish with origin, esp. when select
order column need to become filtered (spoiler alert).
2017-05-10 16:44:48 +00:00
Calle Wilund
d2dc7898aa schema: Add "is_static_compact_table" 2017-05-10 16:44:48 +00:00
Calle Wilund
1c328a4166 schema_builder: Add helper to generate unique column names akin origin 2017-05-10 16:44:48 +00:00
Duarte Nunes
5387ac98f8 schema: Add utility functions for static columns
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 16:44:48 +00:00
Duarte Nunes
0439d83d1e schema: Use heterogeneous comparator for columns bounds
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-10 16:44:47 +00:00
Calle Wilund
b1c5447ab5 cql3_type_parser: Resolve from cql3 names/expressions
Cassandra 3 uses cql names for column/field types, thus
we need to parse these out-of-line, and resolve more akin
to the cql parser. 

Also wrap building user types similarly to origin, using
a "builder" wrapper, and usage graph resolving.
2017-05-10 16:44:47 +00:00
Calle Wilund
fcfea4c121 cql3_type: Add "prepare_interal" and "references_user_type"
Allows localized use of cql type + parsing + resolving
2017-05-10 16:44:47 +00:00
Calle Wilund
8b3e7bbe05 cql3::cql3_type: Add prepare_internal path using only "local" holders 2017-05-10 16:44:47 +00:00
Calle Wilund
2f791f5c3d cql3_type: Add virtual destructor.
It should be there.
2017-05-10 16:44:47 +00:00
Calle Wilund
48ddcbb77b database/main: encapsulate system CF dir touching 2017-05-10 16:44:47 +00:00
Calle Wilund
9eb91bc30b main: Add legacy schema migration to startup 2017-05-10 16:44:47 +00:00
Calle Wilund
3964055d98 legacy_schema_migrator: Add schema table converter
Initial. Does not actually write anything.
2017-05-10 16:44:47 +00:00
Paweł Dziepak
ba6b74e305 storage_service: counters are no longer experimental
Message-Id: <20170510124552.23558-1-pdziepak@scylladb.com>
2017-05-10 17:18:32 +03:00
Gleb Natapov
ab92406585 storage_proxy: optimize reconcile logic for CL=ONE
Regular single key query will never reconcile with CL=ONE since there
will be no digest mismatch, but range queries do not have digest stage,
so always goes through reconcile code. For CL=ONE there will be only one
result though, so no need to run complicated reconciliation logic and the
only result can be returned directly.

Message-Id: <20170509100334.GQ28272@scylladb.com>
2017-05-10 17:09:34 +03:00
Pekka Enberg
b63d33526d cql3: Fix variable_specifications class get_partition_key_bind_indexes()
The "_specs" array contains column specifications that have the bind
marker name if there is one. That results in
get_partition_key_bind_indices() not being able to look up a column
definition for such columns. Fix the issue by keeping track of the
actual column specifications passed to add() like Cassandra does.

Fixes #2369
Message-Id: <1494397358-24795-1-git-send-email-penberg@scylladb.com>
2017-05-10 12:38:18 +03:00
Calle Wilund
8066efb710 system_keyspace: Add getter/setter for built index status
Even though we have none.
2017-05-09 13:48:55 +00:00
Calle Wilund
061ef16562 system_tables/schema_tables: Remove special format case of "execute_cql"
Having a varadic parameter being used in implicit sprint is not
very readable + makes it less intuitive when suddenly system keyspace
becomes more than one -> multiple sprints in the chain -> more confusion
or more execution paths.
Its not that horrible with some spread out sprint:s
2017-05-09 13:48:55 +00:00
Calle Wilund
f5fcadf0b1 schema: Add "as_cql_string" for column_def + quote-wrapper 2017-05-09 13:48:55 +00:00
Calle Wilund
cb7ee98217 json: Add convinience ability to generate unordered_maps 2017-05-09 13:48:55 +00:00
Calle Wilund
e960724724 caching_options: Add from/to map methods 2017-05-09 13:48:55 +00:00
Calle Wilund
2e1c23f2f2 database: Relax rp ordering check to allow non-commitlog mutations
Allow replay to come post certain operations. Such as schema migration
2017-05-09 13:48:55 +00:00
Calle Wilund
27fdc5cfef schema_tables/system_tables: Add v3 tables to "ALL" and handle in init
I.e. deal with more than one keyspace in system_keyspace::make
2017-05-09 13:48:55 +00:00
Calle Wilund
afcf0372df cql3::untyped_result_set: Add more getter methods 2017-05-09 13:48:55 +00:00
Calle Wilund
539b65fc90 client_state: Make "has_access" auth check schema ks name independent 2017-05-09 13:48:55 +00:00
Calle Wilund
815aa8ba9f schema_tables: Add schema definitions for v3 tables 2017-05-09 13:48:55 +00:00
Calle Wilund
4378dca6e1 schema_tables: Hide/abstract schema keyspace name 2017-05-09 13:48:55 +00:00
Calle Wilund
2fb36e3bf8 system_keyspace: Add query overloads with named keyspace 2017-05-09 13:48:55 +00:00
Calle Wilund
32909d4c84 system_keyspace: Add v3+legacy schema definitions 2017-05-09 13:48:55 +00:00
Calle Wilund
b522b2bf22 Merge branch 'master' of https://github.com/scylladb/scylla 2017-05-09 13:48:47 +00:00
Avi Kivity
8af2b7c418 transport: honor the skip_metadata flag
Reduces processing overhead and network traffic.

We can't use the NO_METADATA flag in the metadata object, because this
is a request attribute; different executions of the same prepared statement
can have different settings for skip_metadata.
Message-Id: <20170419175145.19766-1-avi@scylladb.com>
2017-05-09 14:52:03 +03:00
Tomasz Grabiec
e56711a54d sstables: mutation_reader: Avoid reading index when restrictions cover whole partition
The check for is_static_row() used to be enough, but it no longer is
after optimization made in commit 3e06065, which avoids reading the
static row.
Message-Id: <1494241164-25810-1-git-send-email-tgrabiec@scylladb.com>
2017-05-09 11:03:18 +01:00
Pekka Enberg
5b931268d4 cql3: Move variable_specifications implementation to source file
Move the class implementation to source file to reduce the need to
recompile everything when the implementation changes...

Message-Id: <1494312003-8428-1-git-send-email-penberg@scylladb.com>
2017-05-09 12:44:18 +03:00
Calle Wilund
c30e515c70 Merge branch 'master' of https://github.com/scylladb/scylla 2017-05-09 09:29:27 +00:00
Duarte Nunes
65d96421da tests/sstable_datafile_test: Fix regression
This patch fixes a regression introduced in 9e88b60, where the wrong
clustering key was being specified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170509091621.2682-1-duarte@scylladb.com>
2017-05-09 12:18:47 +03:00
Gleb Natapov
2d5a7c8058 storage_proxy: make read repair stats accessible through Prometheus
Currently they can be read only through JMX.

Message-Id: <20170509075546.GN28272@scylladb.com>
2017-05-09 11:23:38 +03:00
Calle Wilund
780a7c8641 Merge branch 'master' of https://github.com/scylladb/scylla 2017-05-08 15:39:46 +00:00
Avi Kivity
8c5c5d3004 Merge "CQL front-end for secondary indices" from Pekka
"This patch series adds CQL front-end support for secondary indices. You
can now execute CREATE INDEX and DROP INDEX statements, which will
update the newly added "Indexes" system table. However, the indexes are
not actually backed up by anything nor are they available for CQL
queries. The feature is hidden behind a new cluster feature flag and
enabled only with the "--experimental" flag."

* 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits)
  schema: Kill index_type enum
  schema: Kill index_info class
  cql3/statements/create_index_statement: Use database::existing_index_names() in validation
  cql3/statements: Use secondary index manager in alter_table_statement class
  index: Add secondary_index_manager
  thrift/handler: Use index_metadata
  db/schema_tables: Index persistence
  schema: Add all_indices() to schema class
  schema: Remove add_default_index_names() from schema_builder class
  db/schema_tables: Add system table for indices
  cql3/Cgl.g: DROP INDEX
  cql3/statements: Add drop_index_statement class
  database: Add find_indexed_table() to database class
  cql3: Return change event from announce_migration()
  cql3/statements: Multiple index targets for CREATE INDEX
  cql3/statements: Use index_metadata in create_index_statement class
  cql3/statements: Use feature flag in create_index_statement class
  service/storage_service: Add feature flag for secondary indices
  database: Add get_available_index_name() to database class
  schema: Add get_default_index_name() to index_metadata class
  ...
2017-05-08 17:04:40 +03:00
Calle Wilund
2049303399 query_pagers: bugfix: must count pk only/pk + static rows as 1
Previously only counted clustered/regular

Message-Id: <1494249013-4069-1-git-send-email-calle@scylladb.com>
2017-05-08 16:35:27 +03:00
Pekka Enberg
dfee4d2bb0 cql3: Fix partition key bind indices for prepared statements
Fix the CQL front-end to populate the partition key bind index array in
result message prepared metadata, which is needed for CQL binary
protocol v4 to function correctly.

Fixes #2355.

Message-Id: <1494247871-3148-1-git-send-email-penberg@scylladb.com>
2017-05-08 16:33:17 +03:00
Calle Wilund
a03d54d9f8 Merge branch 'master' of https://github.com/scylladb/scylla 2017-05-08 11:28:26 +00:00
Pekka Enberg
35bb6dedd8 schema: Kill index_type enum 2017-05-08 10:19:34 +03:00
Pekka Enberg
06564afedb schema: Kill index_info class
It's no longer used. Indices are managed by the index_metadata class.
2017-05-08 10:19:34 +03:00
Pekka Enberg
3f27d12e99 cql3/statements/create_index_statement: Use database::existing_index_names() in validation 2017-05-08 10:19:34 +03:00
Pekka Enberg
b87679821c cql3/statements: Use secondary index manager in alter_table_statement class 2017-05-08 10:03:28 +03:00
Pekka Enberg
4b4e4e6878 index: Add secondary_index_manager 2017-05-08 10:03:28 +03:00
Pekka Enberg
94bc031ca7 thrift/handler: Use index_metadata 2017-05-08 10:03:28 +03:00
Pekka Enberg
11474ed4c6 db/schema_tables: Index persistence 2017-05-08 10:03:28 +03:00
Avi Kivity
9e67bd5aac Merge " Add partial range deletion support" from Duarte
"This series introduces partial support for range deletions. This allows
deletion operations such as

delete from cf where p=1 and c > 0 and c <= 3.

This series only adds support for single-column range restrictions.

We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom of top bound.

We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.

This limitation should stay in place until we can properly represent range
tombstones in the storage format."

* 'range-deletions/v2' of https://github.com/duarten/scylla:
  mutation: Set cell using clustering_key_prefix
  mutation_partition: Harmonize apply_delete overloads
  prefix_compound_view_wrapper: Add is_full and is_empty functions
  tests/cql_query_test: Add range deletion tests
  cql3: Partially support ranged deletions
  single_column_primary_key_restrictions: Implement has_bound()
  modification_statement: Use statement_restrictions for where clause
  statement_restrictions: Expose primary key restrictions
  to_string: Add missing include
2017-05-07 19:27:09 +03:00
Takuya ASADA
b574100075 dist/common/scripts/scylla_selinux_setup: keep symlink on /etc/sysconfig/selinux
Current script has a bug that overwrites symlink on
/etc/sysconfig/selinux by real file, the script not able to disable
SELinux because of it.

So keep symlink after modifying the file.

Fixes #2279

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493663263-10573-1-git-send-email-syuu@scylladb.com>
2017-05-07 17:30:05 +03:00
Tomer Sandler
9a1aa6c1d3 node_health_check: Major rework
This is a folded version of the following rework on the node health
check script:

  - Added support for non-default cql + nodetool ports

  - Script will not exit if either Scylla-server / Scylla-jmx / Both
    services are not up and running. It will alert the user about it and
    which output cannot be collected, but continue collecting everything
    else.

  - Removed lshw installation and non-needed use in commands

  - Script supports RHEL/CentOS/Ubuntu14/Ubuntu16/Debian (tested on all
    beside Debian, should behave the same as Ubuntu14/16)

  - All Indentation issues fixed -> using only tab (no spaces)
    consistently.

  - >> vs. >  was fixed as well in the needed places.

  - Changes the ${VAR_NAME} instances to $VAR_NAME, and kept the {} only
    where needed.

  - Check Scylla service as Vlad recommended using 'ps -C'

  - Fixed the CQL not listening error message.

  - Added Sanity check if script is attempted to run on non-Fedora and
    non-Debian OS -> alert the user and exit.

  - Removed the MANUAL CHECK LIST section (moved to Google Forms)

  - Added date in head of the report.

  - Removed text from Report's "PURPOSE" section, which was referring to
    the "MANUAL CHECK LIST" (not needed anymore).

[ penberg: Fold into a single commit and add proper license. ]
Acked-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1493900076-29170-1-git-send-email-penberg@scylladb.com>
2017-05-06 08:38:12 +03:00
Paweł Dziepak
798cfbc68f Merge "Fixes for gcc 7" from Avi
"gcc 7 doesn't like some of our code, so adjust to make it happy."

* 'gcc7' of http://github.com/avikivity/scylla:
  Remove exception specifications
  commitlog: handle noexcept conflict between unlink and function object
  thrift: change generated code namespace
2017-05-05 15:42:56 +01:00
Avi Kivity
a592573491 Remove exception specifications
C++17 removed exception specifications from the language, and gcc 7 warns
about them even in C++14 mode.  Remove them from the code base.
2017-05-05 17:02:31 +03:00
Avi Kivity
5278e1a14d commitlog: handle noexcept conflict between unlink and function object
::unlink is declared as noexcept, but the function object it is passed into
is not.  gcc 7 warns, so wrap ::unlink in a lambda to make it happy.
2017-05-05 17:02:30 +03:00
Avi Kivity
d542cdddf6 thrift: change generated code namespace
org::apache::cassandra (the generated namespace name) gets confused with
apache::cassandra (the thrift runtime library namespace), either due to
changes in gcc 7 or in thrift 0.10.  Either way, the problem is fixed
by changing the generated namespace to plain cassandra.
2017-05-05 05:26:20 +03:00
Paweł Dziepak
c9470b5c94 Merge "Fix abort in advance_and_check_if_present()" form Tomasz
"Fixes abort which happens when making a single-key query for a key which
is after all keys present in the sstable."

* 'tgrabiec/fix-abort-in-index-reader' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Add test cases for single-key out of range reads
  sstables: index_reader: Remove redundant function
  sstables: index_reader: Fix abort in advance_and_check_if_present()
2017-05-04 17:19:47 +01:00
Duarte Nunes
9e88b60ef5 mutation: Set cell using clustering_key_prefix
Change the clustering key argument in mutation::set_cell from
exploded_clustering_prefix to clustering_key_prefix, which allows for
some overall code simplification and fewer copies. This mostly affects
the cql3 layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
db63ffdbb4 mutation_partition: Harmonize apply_delete overloads
This patch ensures the different mutation_partition::apply_delete()
overloads behave similarly, so that, for example, an empty clustering
key is treated the same way as an empty
exploded_clustering_key_prefix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
07e648251b prefix_compound_view_wrapper: Add is_full and is_empty functions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
ef138bdd2c tests/cql_query_test: Add range deletion tests
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
42873189d4 cql3: Partially support ranged deletions
This patch introduces partial support for range deletions. This allows
deletion operations such as

delete from cf where p=1 and c > 0 and c <= 3.

We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom of top bound.

We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.

This limitation should stay in place until we can properly represent range
tombstones in the storage format.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
169cc41251 single_column_primary_key_restrictions: Implement has_bound()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:49 +02:00
Duarte Nunes
f7bc88734a modification_statement: Use statement_restrictions for where clause
This patch replaces the custom where clause processing by adding and
using a statement_restrictions field to modification_statement.

This improves code reuse and also moves some checks to prepare-time.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:49 +02:00
Duarte Nunes
aff23f93b4 statement_restrictions: Expose primary key restrictions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:49 +02:00
Duarte Nunes
8b7d7c4e6d to_string: Add missing include
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:49 +02:00
Tomasz Grabiec
e71771d019 tests: mutation_source_test: Add test cases for single-key out of range reads 2017-05-04 14:59:08 +02:00
Tomasz Grabiec
297b4b0cf5 sstables: index_reader: Remove redundant function 2017-05-04 14:59:08 +02:00
Tomasz Grabiec
ec45f1e51d sstables: index_reader: Fix abort in advance_and_check_if_present()
Happens when the key is missing and after all keys in the sstables.

Fixes #2345.
2017-05-04 14:59:08 +02:00
Pekka Enberg
25e2777344 schema: Add all_indices() to schema class 2017-05-04 14:59:12 +03:00
Pekka Enberg
830591b092 schema: Remove add_default_index_names() from schema_builder class
The add_default_index_names() is part of the old and incomplete
secondary index implementation in Scylla. Drop it as it's no longer
used.
2017-05-04 14:59:12 +03:00
Pekka Enberg
8b943c0ceb db/schema_tables: Add system table for indices 2017-05-04 14:59:12 +03:00
Pekka Enberg
af9015d0d0 cql3/Cgl.g: DROP INDEX 2017-05-04 14:59:12 +03:00
Pekka Enberg
a4ee9f9fa1 cql3/statements: Add drop_index_statement class 2017-05-04 14:59:12 +03:00
Pekka Enberg
f26b8d7afb database: Add find_indexed_table() to database class 2017-05-04 14:59:12 +03:00
Pekka Enberg
14391a8ec8 cql3: Return change event from announce_migration()
This changes announce_migration() to return a change event directory in
schema_altering_statement base class. It's needed for drop index
statement, which does not know the keyspace or column family until it
looks up them based on the index. Two stage approach of announcing a
migration and then creating the change event won't work because in the
latter stage, the lookup will fail. The same change in
announce_migration() has been applied to Apache Cassandra.
2017-05-04 14:59:12 +03:00
Pekka Enberg
82394debe6 cql3/statements: Multiple index targets for CREATE INDEX 2017-05-04 14:59:12 +03:00
Pekka Enberg
fe315bd31a cql3/statements: Use index_metadata in create_index_statement class 2017-05-04 14:59:12 +03:00
Pekka Enberg
651af0f45a cql3/statements: Use feature flag in create_index_statement class 2017-05-04 14:59:12 +03:00
Pekka Enberg
815c91a1b8 service/storage_service: Add feature flag for secondary indices 2017-05-04 14:59:11 +03:00
Pekka Enberg
930fa79aff database: Add get_available_index_name() to database class 2017-05-04 14:59:11 +03:00
Pekka Enberg
ef29520c8e schema: Add get_default_index_name() to index_metadata class 2017-05-04 14:59:11 +03:00
Pekka Enberg
c6e7d4484a database: Make existing_index_names() per-keyspace operation 2017-05-04 14:59:11 +03:00
Pekka Enberg
8c729f0f5f database: Rewrite existing_index_names() to use new index metadata 2017-05-04 14:59:11 +03:00
Pekka Enberg
4391faaf45 cql3/statements: Add constants to index_target 2017-05-04 14:59:11 +03:00
Pekka Enberg
546d1e47dd cql3/statements: Add as_cql_string() to index_target class 2017-05-04 14:59:11 +03:00
Pekka Enberg
56cca3b0d6 cql3: Add to_cql_string() to column_identifier class 2017-05-04 14:59:11 +03:00
Pekka Enberg
58b90655d2 cql3/statements: Add to_string(target_type) 2017-05-04 14:59:11 +03:00
Pekka Enberg
1f5a52d03f cql3/statements: Use namespaces in index_target.cc file 2017-05-04 14:59:11 +03:00
Pekka Enberg
5e8f2f49c3 schema: Add indices() to schema class 2017-05-04 14:59:11 +03:00
Pekka Enberg
5abd4b8041 schema: Add has_index() to schema class 2017-05-04 14:59:11 +03:00
Pekka Enberg
1fb1828aa2 schema: Add index_names() to schema class 2017-05-04 14:59:11 +03:00
Pekka Enberg
1c1a767408 schema: Add find_index_noname() to schema class
This adds a find_index_nomame() helper to the schema class, which
searches for index that is otherwise equal but ignores the name of the
index in comparison. This is needed to for CREATE INDEX to reject
duplicate index creation.
2017-05-04 14:59:11 +03:00
Pekka Enberg
62fba73a05 schema_builder: Add index_metadata support 2017-05-04 14:59:11 +03:00
Pekka Enberg
05e12a1d2b schema: Add index_metadata maps to raw_schema class 2017-05-04 13:22:12 +03:00
Raphael S. Carvalho
ddc1d80c28 compaction: remove dead function declaration
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170504013046.23522-2-raphaelsc@scylladb.com>
2017-05-04 11:48:51 +03:00
Raphael S. Carvalho
61229ab88c compaction: fix type for cleanup
After compaction revamp, compaction type set by cleanup at its ctor
is being overwritten at compaction::setup(). Consequently, cleanup
would not be stopped by 'nodetool stop cleanup' and cleanup would
be listed as regular compaction in 'nodetool compactionstats'.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170504013046.23522-1-raphaelsc@scylladb.com>
2017-05-04 11:48:50 +03:00
Avi Kivity
211a337883 Merge seastar upstream
* seastar 194d80f...4a3118c (4):
  > execution_stage: fix wrong exception thrown for non-unique stages
  > metrics: add missing move assignment operators for metric_group, metric_groups
  > Remove unused lambda captures
  > core: lw_shared_ptr::get() should return nullptr for null pointer
2017-05-04 11:47:05 +03:00
Asias He
66e3b73b9c repair: Fix partition estimation
We estimate number of partitions for a given range of a column familiy
and split the range into sub ranges contains fewer partitions as a
checksum unit.

The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.

Fixes #2299

Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
2017-05-03 16:25:45 +03:00
Pekka Enberg
1e04731fa0 Merge "gossip mark alive fixes" from Asias
"This series fixes the user after free issue in gossip and elimates the
duplicated / unnecessary mark alive operations.

Fixes #2341"

* tag 'asias/gossip_fix_mark_alive/v1' of github.com:cloudius-systems/seastar-dev:
  gossip: Ignore callbacks and mark alive operation in shadow round
  gossip: Ingore the duplicated mark alive operation
  gossip: Fix user after free in mark_alive
2017-05-03 12:19:16 +03:00
Jacob Johansen
9616956c16 dist/docker: Add support for experimental flag
Fixes #2188

Message-Id: <20170502180047.24071-1-jacob.johansen@virginpulse.com>
2017-05-03 10:29:55 +03:00
Asias He
3bd9840c01 gossip: Ignore callbacks and mark alive operation in shadow round
In shadow round, we only interested in the peer's endpoint_state, e.g., gossip
features, host_id, tokens. No need to call the on_restart or on_join callbacks
or to go through the mark alive procedure with EchoMessage gossip message. We
will do them during normal gossip runs anyway.
2017-05-03 07:24:21 +08:00
Asias He
1441ae5cac gossip: Ingore the duplicated mark alive operation
If a node is being marked as alive with EchoMessage, ignore the future
duplicated mark alive opeariton.
2017-05-03 07:24:21 +08:00
Asias He
d682fbfa28 gossip: Fix user after free in mark_alive
After sending echo message, the Node might not be in the
endpoint_state_map anymore, use the reference of local_state
might cause user after free.

Fixes #2341
2017-05-03 07:24:20 +08:00
Raphael S. Carvalho
8b0e358d73 tests/sstable_test: fix release-mode compaction_manager_test
in release mode, compaction task is active after submitting request
because ready future may be scheduled immediately.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170502171925.9893-1-raphaelsc@scylladb.com>
2017-05-02 20:48:30 +03:00
Calle Wilund
a37d03cd1d transport::server: ignore socket shutdown future results
These will as of-late always be ready. Removing the future usage
in preparation for changing api signature to void(*)() (i.e. prevent
breakage on seastar update)
2017-05-02 15:08:47 +00:00
Avi Kivity
7e29dd7066 managed_bytes: improve alignment hygene
While blob_storage is marked as an unaligned type, the back references also
point to an unaligned type (a pointer to blob_storage), since a back
reference can live in a blob_storage.  This triggers errors from zapcc/clang 4.

Fix by creating a type for the reference, which is marked as unaligned.
Message-Id: <20170502071404.507-1-avi@scylladb.com>
2017-05-02 10:04:13 +01:00
Pekka Enberg
2f83232a02 schema: Add index_metadata class 2017-05-02 10:29:18 +03:00
Avi Kivity
b46f6a4124 build: ignore unused lambda capture warnings from clang
Worthwhile to revisit later.
2017-05-02 10:09:58 +03:00
Raphael S. Carvalho
8dfb5f9c33 tests/sstable_test: fix compaction_manager_test
after 'compaction: make major compaction go through compaction manager',
the test fails because task is preempted in debug mode before it reaches
intruction to increase stat.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170501183255.6191-1-raphaelsc@scylladb.com>
2017-05-02 09:06:41 +03:00
Avi Kivity
1d12d69881 logalloc: define segment_zone::maximum_size
Yield build errors with some compilers, if missing.
2017-05-01 16:31:29 +03:00
Amnon Heiman
b59c95359d scylla_setup: Fix conditional when checking for newer version
During the changes in the way the housekeeping check for newer version
and warn about it in the installation the UUID part was removed but kept
in the sarounding if.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170426075724.7132-1-amnon@scylladb.com>
2017-05-01 12:13:35 +03:00
Raphael S. Carvalho
3071b9052a compaction: make cleanup_compaction inherit from regular_compaction
Some fields that belong to regular and cleanup aren't needed for
resharding_compaction, such as incremental selector (which is used
for determining max purgeable timestamp for a given decorated key)
Better move those fields to regular and make cleanup inherit from
regular compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428195611.9196-1-raphaelsc@scylladb.com>
2017-04-30 19:37:09 +03:00
Raphael S. Carvalho
687a4bb0c2 dtcs: do not compact fully expired sstable which ancestor is not deleted yet
Currently, fully expired sstable[1] is unconditionally chosen for compaction
by DTCS, but that may lead to a compaction loop under certain conditions.

Let's consider that an almost expired sstable is compacted, and it's not
deleted yet, and that the new sstable becomes expired before its ancestor is
deleted.
Because this new sstable is expired, it will be chosen by DTCS, but it will
not be purged because 'compacted undeleted' sstables are taken into account
by calculation of max purgeable timestamp and prevents expired data from
being purged. The problem is that this sequence of events can keep happening
forever as reported by issue #2260.
NOTE: This problem was easier to reproduce before improvement on compaction
of expired cells, because fully expired sstable was being converted into a
sstable full of tombstones, which is also considered fully expired.

Fixes #2260.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>
2017-04-30 19:35:46 +03:00
Paweł Dziepak
24f4dcf9e4 db: make virtual dirty soft limit configurable
Message-Id: <20170428150005.28454-1-pdziepak@scylladb.com>
2017-04-30 19:17:22 +03:00
Avi Kivity
248aa4fc23 Merge "Fix update of counter in static rows" from Paweł
"The logic responsible for converting counter updates to counter shards was
not covered by unit tests and didn't transform counter cells inside static
rows.

This series fixes the problem and makes sure that the tests cover both
static rows and transformation logic."

* tag 'pdziepak/static-counter-updates/v1' of github.com:cloudius-systems/seastar-dev:
  tests/counter: test transform_counter_updates_to_shards
  tests/counter: test static columns
  counters: transform static rows from updates to shards
2017-04-30 19:13:44 +03:00
Avi Kivity
339322517e Merge "sstables: index_reader: Fix advance_to() to include relevant range tombstones" from Tomasz
"Fixes #2326."

* 'tgrabiec/fix-range-tombstones-missing-when-slicing' of github.com:cloudius-systems/seastar-dev:
  tests: mutation_source_test: Cover single-ranged queries in test_streamed_mutation_slicing_returns_only_relevant_tombstones()
  tests: mutation_source_test: Add test for slicing of clustered rows
  tests: mutation_reader_assertions: Log expectations
  tests: mutation_reader_assertions: Add produces_eos_or_empty_mutation()
  tests: sstables: Use read_row() for single-key reads
  tests: sstables: Test more configutaions of sstable writer in test_sstable_conforms_to_mutation_source()
  sstables: Improve logging
  sstables: index_reader: Fix advance_to() to include relevant range tombstones
2017-04-30 14:40:41 +03:00
Avi Kivity
831ee80c3c tests: workaround older boost::apply_visitor requiring a result_type member
Older versions of boost::apply_visitor require a result_type member for the
visitor; supply it to make them happy.

Fixes #2312.
2017-04-30 13:56:44 +03:00
Takuya ASADA
a19c1b7f86 dist/redhat: add missing dependencies for Fedora
We only have "%{?rhel:Requires}" for scylla-server, need fedora one.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493314367-419-1-git-send-email-syuu@scylladb.com>
2017-04-30 11:06:27 +03:00
Takuya ASADA
fe9f72d2c0 dist/debian: add python3-pyudev to dependencies
pyudev is required for seastar/scripts/perftune.py.

Fixes #2315

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493309116-18074-1-git-send-email-syuu@scylladb.com>
2017-04-30 11:05:15 +03:00
Paweł Dziepak
f5cf86484e lsa: introduce upper bound on zone size
Attempting to create huge zones may introduce significant latency. This
patch introduces the maximum allowed zone size so that the time spent
trying to allocate and initialising zone is bounded.

Fixes #2335.

Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>
2017-04-30 10:58:11 +03:00
Paweł Dziepak
5c302cf67b tests/counter: test transform_counter_updates_to_shards 2017-04-28 16:29:34 +01:00
Paweł Dziepak
0473750056 tests/counter: test static columns 2017-04-28 16:29:34 +01:00
Paweł Dziepak
0ffdd8d3d0 counters: transform static rows from updates to shards 2017-04-28 16:29:34 +01:00
Tomasz Grabiec
d4df6e214e tests: mutation_source_test: Cover single-ranged queries in test_streamed_mutation_slicing_returns_only_relevant_tombstones() 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
22cce52dff tests: mutation_source_test: Add test for slicing of clustered rows 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
86b693f562 tests: mutation_reader_assertions: Log expectations 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
ece6e107cc tests: mutation_reader_assertions: Add produces_eos_or_empty_mutation() 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
6354acc1a2 tests: sstables: Use read_row() for single-key reads
So that as_mutation_reader() will create the same kind of reader which
database::make_sstable_reader() does.

Before this change, all readers were range readers.
2017-04-27 18:43:49 +02:00
Tomasz Grabiec
fd5dbe04b5 tests: sstables: Test more configutaions of sstable writer in test_sstable_conforms_to_mutation_source()
Test different versions of the format, and different promoted index
block sizes.  The size of 1 is especially important, it will put each
fragment in a separate block, exposing various issues with promoted
index handling.
2017-04-27 18:43:49 +02:00
Tomasz Grabiec
c5baeed6d2 sstables: Improve logging 2017-04-27 18:43:49 +02:00
Tomasz Grabiec
b523815ac1 sstables: index_reader: Fix advance_to() to include relevant range tombstones
Fixes #2326.
2017-04-27 18:43:49 +02:00
Glauber Costa
14b9aa2285 reduce kernel scheduler wakeup granularity
We set the scheduler wakeup granularity to 500usec, because that is the
difference in runtime we want to see from a waking task before it
preempts the running task (which will usually be Scylla). Scheduling
other processes less often is usually good for Scylla, but in this case,
one of the "other processes" is also a Scylla thread, the one we have
been using for marking ticks after we have abandoned signals.

However, there is an artifact from the Linux scheduler that causes those
preemption to be missed if the wakeup granularity is exactly twice as
small as the sched_latency. Our sched_latency is set to 1ms, which
represents the maximum time period in which we will run all runnable
tasks.

We want to keep the sched_latency at 1ms, so we will reduce the wakeup
granularity so to something slightly lower than 500usec, to make sure
that such artifact won't affect the scheduler calculations. 499.99usec
will do - according to my tests, but we will reduce it to a round
number.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170427135039.8350-1-glauber@scylladb.com>
2017-04-27 18:11:35 +03:00
Pekka Enberg
9cfb94510f Merge "Fix issues found by PVS-Studio static analyzer" from Vlad
Fix issues found by PVS-Studio as reported by Phillip Khandeliants.

Merge branch 'pvs_analyzer_errors-v1' of github.com:cloudius-systems/seastar-dev

* 'pvs_analyzer_errors-v1' of github.com:cloudius-systems/seastar-dev:
  type_parser: catch exceptions by reference and not by value
  token_metadata::get_host_id(ep): add a missing 'throw'
2017-04-27 11:39:49 +03:00
Vlad Zolotarov
d5b76d5198 type_parser: catch exceptions by reference and not by value
Found by PVS-Studio static analyzer:

Type slicing. An exception should be caught by reference rather than by value.

Fixes #2288

Reported-by: Phillip Khandeliants
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-26 15:12:15 -04:00
Vlad Zolotarov
181c68e97d token_metadata::get_host_id(ep): add a missing 'throw'
Caught by PVS-Studio static analyzer:

The object was created but it is not being used. The 'throw' keyword could be missing: throw runtime_error(FOO);

Reported-by: Phillip Khandeliants
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-26 14:54:34 -04:00
Takuya ASADA
7a59336b8a main.cc: drop FS type check
Since we add support ext4, we don't need to limit filesystem to XFS anymore.

Fixes #1933

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493212525-26264-1-git-send-email-syuu@scylladb.com>
2017-04-26 17:35:55 +03:00
Raphael S. Carvalho
8bae413bcf database: fix format msg for sprint
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170425224920.16607-1-raphaelsc@scylladb.com>
2017-04-26 17:18:58 +03:00
Raphael S. Carvalho
f49bdb6839 compaction_manager: dont go on with major compaction if task was stopped
A column family which was truncated will remove itself from compaction
manager. Any task running a compaction should be interrupted and a
task waiting to run should bail out when it wakes up.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170425224350.15965-3-raphaelsc@scylladb.com>
2017-04-26 17:18:37 +03:00
Takuya ASADA
abf65cb485 dist/debian: skip tunables when kernel = 3.13.0-*-generic, to prevent kernel panic bug
There is kernel panic bug on kernel = 3.13.0-*-generic(Ubuntu 14.04), we have to skip tunables.

Fixes #1724

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1493196636-25645-1-git-send-email-syuu@scylladb.com>
2017-04-26 11:54:11 +03:00
Vlad Zolotarov
a9ad762f47 docs: tracing.md: add a "how to get traces" chapter
This chapter describes how to get tracing information.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-25 21:52:29 -04:00
Vlad Zolotarov
f993e85b5f tracing::trace_keyspace_helper: introduce a time series helper tables
Introduce two time series helper tables that will simplify the querying
of traces.

One for querying regular traces:

CREATE TABLE system_traces.sessions_time_idx (
minute timestamp,
started_at timestamp,
session_id uuid,
PRIMARY KEY (minute, started_at, session_id))

and one for querying slow query records:

CREATE TABLE system_traces.node_slow_log_time_idx (
minute timestamp,
started_at timestamp,
session_id uuid,
start_time timeuuid,
node_ip inet,
shard int,
PRIMARY KEY (minute, started_at, session_id))

With these tables one may get the relevant traces like in an example below:

SELECT * from system_traces.sessions_time_idx where minutes in ('2016-09-07 16:56:00-0700') and started_at > '2016-09-07 16:56:30-0700'

Fixes #2243

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-25 21:52:28 -04:00
Vlad Zolotarov
81bcc36b16 tracing: cleanup: use nullptr instead of trace_state_ptr()
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-25 21:52:28 -04:00
Vlad Zolotarov
b0f660331a tracing: introduce a span ID and parent span ID
This patch makes the tracing framework follow the general idea of Google's
Dapper paper: traces generated in a context of the same query are forming
a single-rooted acyclic tree where in a ScyllaDB case vertexes are spans running
on each involved replica Node and edges are RPCs sent from one Node to another.

   - Each vertex in the tree above has an ID - "span ID".
   - In order to be able to build the tree from the sessions traces we need
     to know the parent "span ID" - the ID of a span that sent an RPC that created
     the current span.
   - Each span of a tracing session is given a 64-bit random span ID.
   - The root span has a span_id::illegal_id value.

This patch adds:
   - The described above parent span ID and a span ID to the one_session_records
     object.
   - The current span ID is passed in the trace_info struct to the remote replica.
   - Add parent_id and span_id columns to system_traces.events table for the parent
     ID and span ID.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-25 21:52:23 -04:00
Tomasz Grabiec
92dba05f0d sstables: Fix malformed_sstable_exception from single-key reads
After 4742008b70, _read_partial_row is
never set, and we will fail here in case the consumer will exhoust the
range. That would be the case if the end bound of the slice aligns
with the end of the index page.

Fix by assuming that if we're out of range in the middle of partition,
we sliced.

Message-Id: <1493121249-18847-1-git-send-email-tgrabiec@scylladb.com>
2017-04-25 14:59:08 +03:00
Avi Kivity
628b3092e4 Merge "Reify shadowable tombstones" from Duarte
"This series introduces the row_tombstone class, which represents a
tombstone applied to a clustering row. It distinguishes itself from a
normal tombstone by the fact that it contains a regular tombstone and
a shadowable one, which can be erased by a row marker.

The intent of the series is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be, leading to incorrect results."

* 'materialized-views/shadowable/v5' of https://github.com/duarten/scylla:
  sstables: Read and write shadowable tombstones
  mutation_partion: Use row_tombstone
  mutation_partion: Introduce row_tombstone
  mutation_partition: Introduce shadowable tombstones
  idl-compiler: Support optional fields in views
  tombstone: Extract out relational operators
  row_marker: Mark constructors explicit
2017-04-25 13:05:27 +03:00
Duarte Nunes
d45596ae8e sstables: Read and write shadowable tombstones
This patch serializes shadowable tombstones to sstables by adding a
new, incompatible atom's mask.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Duarte Nunes
4e693383f7 mutation_partion: Use row_tombstone
This patch replaces the current row tombstone representation by a
row_tombstone.

The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.

We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialzied view:

1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3

These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though its dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.

This patch only addresses the memory representation of such
row_tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Duarte Nunes
6a2bccd4ae mutation_partion: Introduce row_tombstone
This patch introduces the row_tombstone class, which represents a
tombstone made up of a regular tombstone and a shadowable one.

The rules for row_tombstones are as follows:

 - The shadowable tombstone is always >= than the regular one;
 - The regular tombstone works as expected;
 - The shadowable tombstone doesn't erase or compact away the regular
   row tombstone, nor dead cells;
 - The shadowable tombstone can erase live cells, but only provided
   they can be recovered (e.g., by including all cells in a MV update,
   both updated cells and pre-existing ones);
 - The shadowable tombstone can be erased or compacted away by a newer
   row marker.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:28 +02:00
Duarte Nunes
3d49c1da01 mutation_partition: Introduce shadowable tombstones
A shadowable tombstone is a tombstone that can be replaced by a
smaller one if provided a row_marker with a bigger timestamp than the
shadowable tombstone.

In the context of a row, it is only valid as long as no newer insert
is done (thus setting a live row marker; note that if the row
timestamp set is lower than the tombstone's, then the tombstone
remains in effect as usual). If a row has a shadowable tombstone with
timestamp Ti and that row is updated with a timestamp Tj, such that
Tj > Ti (and that update sets the row marker), then the shadowable
tombstone is shadowed by that update. A concrete consequence is that
if the update has cells with timestamp lower than Ti, then those cells
are preserved (since the deletion is removed), and this is contrary to
a regular, non-shadowable row tombstone where the tombstone is
preserved and such cells are removed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:22 +02:00
Duarte Nunes
8cc29f84fb idl-compiler: Support optional fields in views
When generating view code, the compiler was ignoring optional fields.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Duarte Nunes
d216c3dbd2 tombstone: Extract out relational operators
This patch extracts out the relational operators in struct tombstone
to a class capable of generating them from a tri-compare function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Duarte Nunes
392403b5b3 row_marker: Mark constructors explicit
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Tomasz Grabiec
f3609fc813 tests: log_historgram_test: Fix compiation on Ubuntu
Some gcc versions incorrectly complain:

  tests/log_histogram_test.cc:87:22: error: ‘opts1’ is not a valid template argument for type ‘const log_histogram_options&’ because object ‘opts1’ has not external linkage
 size_t hist_key<node<opts1>>(const node<opts1>& n) { return n.v; }

Apparently this is a bug in gcc:

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52036

Fixes #2307.

Message-Id: <1493108791-11247-1-git-send-email-tgrabiec@scylladb.com>
2017-04-25 12:15:28 +03:00
Pekka Enberg
940c3f4330 Merge "Clang fixes (part 2)" from Avi
"This series fixes some more errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler.  A single issue remains, but it's
probably in std::experimental::optional::swap(); not in our code."

* tag 'clang/2/v1' of https://github.com/avikivity/scylla:
  sstable_test: avoid passing negative non-type template arguments to unsigned parameters
  UUID: add more comparison operators
  sstable_datafile_test: avoid string_view user-defined literal conversion operator
  mutation_source_test: avoid template function without template keyword
  cql_query_test: define static variable
  cql_query_test: add braces for single-item collection initializers
  storage_service: don't use typeid(temporary)
  logalloc: remove unused max_occupancy_for_compaction
  storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
  storage_proxy: drop unused member access from return value
  storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
  read_repair_decision: fix operator<<(std::ostream&, ...)
2017-04-24 20:32:16 +03:00
Tomasz Grabiec
dfbb9fd8f1 gdb: Workaround for gdb.Value being not accepted by %x
Fixes the following error in "scylla segment-descs" and a similar one in "scylla lsa-segment":

Traceback (most recent call last):
  File "scylla-gdb.py", line 530, in invoke
    gdb.write('0x%x: lsa free=%d region=0x%x zone=0x%x\n' % (addr, desc['_free_space'], desc['_region'], desc['_zone']))
TypeError: %x format: an integer is required, not gdb.Value

Message-Id: <1493029465-6482-1-git-send-email-tgrabiec@scylladb.com>
2017-04-24 13:27:25 +03:00
Avi Kivity
6d9e18fd61 logalloc: reduce descriptor overhead
Every lsa-allocated object is prefixed by a header that contains information
needed to free or migrate it.  This includes its size (for freeing) and
an 8-byte migrator (for migrating).  Together with some flags, the overhead
is 14 bytes (16 bytes if the default alignment is used).

This patch reduces the header size to 1 byte (8 bytes if the default alignment
is used).  It uses the following techniques:

 - ULEB128-like encoding (actually more like ULEB64) so a live object's header
   can typically be stored using 1 byte
 - indirection, so that migrators can be encoded in a small index pointing
   to a migrator table, rather than using an 8-byte pointer; this exploits
   the fact that only a small number of types are stored in LSA
 - moving the responsibility for determining an object's size to its
   migrator, rather than storing it in the header; this exploits the fact
   that the migrator stores type information, and object size is in fact
   information about the type

The patch improves the results of memory_footprint_test as following:

Before:

 - in cache:     976
 - in memtable:  947

After:

mutation footprint:
 - in cache:     880
 - in memtable:  858

A reduction of about 10%.  Further reductions are possible by reducing the
alignment of lsa objects.

logalloc_test was adjusted to free more objects, since with the lower
footprint, rounding errors (to full segments) are different and caused
false errors to be detected.

Missing: adjustments to scylla-gdb.py; will be done after we agree on the
new descriptor's format.
2017-04-24 12:23:12 +02:00
Avi Kivity
b4e897a66d cql3::metadata: fix undefined evaluation order in constructor
We both move names_ to its destination, and call names_.size() in the same
expression; this has undefined evaluation order, and fails with clang.

With this patch as well as the clang build fixes, Scylla starts and is
able to serve requests (light cassandra-stress load).

Message-Id: <20170423121727.1948-1-avi@scylladb.com>
2017-04-24 10:40:12 +03:00
Duarte Nunes
cddf2f4d74 tests: Fix failure virtual_reader_test
This patch fixes a failure of virtual_reader_test, where both the test
itself and the cql_test_env initialize the messaging_service to listen
on the same address and port, triggering an assert in
posix_ap_server_socket_impl::accept().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170423104240.21275-1-duarte@scylladb.com>
2017-04-23 14:06:35 +03:00
Avi Kivity
566c094764 sstable_test: avoid passing negative non-type template arguments to unsigned parameters
Clang complains.  The test looks somewhat bogus, but that's for another patch.
2017-04-22 22:13:55 +03:00
Avi Kivity
dc6ea51ffa UUID: add more comparison operators
Clang wanted them for some unit test; not sure how gcc was able to
synthesize them, but they're clearly needed.
2017-04-22 22:12:33 +03:00
Avi Kivity
5424aca745 sstable_datafile_test: avoid string_view user-defined literal conversion operator
Clang doesn't like it, perhaps because it isn't in the std namespace (it's
still in std::experimental).
2017-04-22 22:11:30 +03:00
Avi Kivity
705ac957a2 mutation_source_test: avoid template function without template keyword
This isn't (yet?) standard C++, and clang rejects it.
2017-04-22 22:10:21 +03:00
Avi Kivity
551fb03476 cql_query_test: define static variable
single_node_cql_env is declared but not defined; define it to make clang
happy.
2017-04-22 22:01:44 +03:00
Avi Kivity
eb700752d8 cql_query_test: add braces for single-item collection initializers
Clang complains that braces are missing; I didn't verify it but I'm sure
it's right.  Add braces to make it happy.
2017-04-22 22:00:49 +03:00
Avi Kivity
6bb8ae7788 storage_service: don't use typeid(temporary)
Clang warns that the expression will be evaluated (doh).  While the warning
seems dubious, keep it and change the code to call the function outside
typeid(), in case it does help someone one day.
2017-04-22 21:09:41 +03:00
Avi Kivity
9303b09a64 logalloc: remove unused max_occupancy_for_compaction
Noticed by clang.
2017-04-22 21:09:41 +03:00
Avi Kivity
6d0811711f storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
Clang's std::abs() doesn't support __int128_t, so use __int64_t instead.  With
this change, it's possible that a read repair 252,700 years after a write
will be interpreted as a recent write and the read repair will incorrectly
be skipped; hopefully by that time __int128_t will be standardized.
2017-04-22 21:09:41 +03:00
Avi Kivity
5ec1742b9a storage_proxy: drop unused member access from return value
Noticed by clang.
2017-04-22 21:09:41 +03:00
Avi Kivity
e4bae0df51 storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
Noticed by clang.
2017-04-22 21:09:41 +03:00
Avi Kivity
944047f039 read_repair_decision: fix operator<<(std::ostream&, ...)
Argument-dependent lookup requires that the operator be declared in the
same namespace as the class; move it there.

While at it, de-static it, it only causes bloat.
2017-04-22 21:09:41 +03:00
Raphael S. Carvalho
4a86dd473d tests: add tests/sstable_resharding_test.cc
Forgot to add file after resolving conflict.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170422172053.3734-1-raphaelsc@scylladb.com>
2017-04-22 21:09:29 +03:00
Benoît Canet
f68049ef5d tests: Fix clang auto universal reference type deduction
Replace it by regular template type deduction.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170421204150.4626-2-benoit@scylladb.com>
2017-04-22 20:04:00 +03:00
Benoit Canet
b902f3b81b tests: Remove parenthesis in variable declaration
Prevent clang compilation of this tests.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170421204150.4626-1-benoit@scylladb.com>
2017-04-22 20:04:00 +03:00
Avi Kivity
54ab13eb8e Merge "sstable resharding revamp" from Raphael
"Currently, a shared sstable is rewritten at all shards it belongs to, and only
after that, it's deleted.

This new algorithm adds the ability to reshard a set of sstables together at a
single shard and produce unshared sstable for all shards involved.

That's important for the leveled compaction strategy issue, in which the number
of sstables growing considerably after resharding. What happened is that every
sstable was being split into N ones, so we could end up with tons of small
sstables. Now, we will reshard together a set of adjacent sstables."

* 'sstable_resharding_revamp_v9' of github.com:raphaelsc/scylla:
  tests: add test for new sstable resharding
  database: kill column_family::start_rewrite
  database: wire up new resharding algorithm
  database: implement new sstable resharding algorithm
  database: introduce function to replace new sstables by their ancestors
  prevent regular compaction from choosing shared sstables
  compaction_strategy: implement resharding strategy for compaction strategies
  sstables: store more info in foreign_sstable_open_info
  sstables: make it possible to get open info from loaded sstable
  database: export column family dir
  database: inform if column family has shared tables
  sstables: add method to export ancestors
  lcs: implement get_level_count
  compaction_manager: introduce method to check if manager stopped
  lcs: restore invariant instead of sending overlapping sst to L0
  sstables: extend compaction for new resharding
  sstables: allow shard A to correctly create sstable for shard B
  compaction: rework compacting_sstable_writer to work with multiple writers
  compaction: prepare compacting_sstable_writer to work with writers
  sstables: rework compaction to make it easy to extend
2017-04-22 13:31:54 +03:00
Raphael S. Carvalho
8a37b279ed tests: add test for new sstable resharding
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:34 -03:00
Raphael S. Carvalho
662fe77c11 database: kill column_family::start_rewrite
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:33 -03:00
Raphael S. Carvalho
43ac19eb52 database: wire up new resharding algorithm
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:31 -03:00
Raphael S. Carvalho
cf45333588 database: implement new sstable resharding algorithm
NOTE: it's not wired yet.

Currently, a shared sstable is rewritten at all shards it belongs
to and only after that, it's deleted. With this new algorithm, a
shared sstable will be read only once and N unshared sstables
will be created, each of them with 1/N of the data. After it's
done, each owner shard will receive its new unshared sstable
replacing its ancestors.

Another benefit is that we'll no longer have resharding resulting
in number of sstables growing considerably after resharding.
A full-sized leveled sstable is usually 160MB, so after resharding,
we could have N files of 160MB/N. Now, leveled strategy will help
resharding. N adjacent sstables of same level will be resharded
together, so we'll end up with N files of N*160MB/N.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:30 -03:00
Raphael S. Carvalho
6513252e91 database: introduce function to replace new sstables by their ancestors
When resharding, we're working with sstables from all shards. So let's say
we're done with resharding of sstable A that belongs to shard 0 and 1 and
sstable B that belongs to shard 1 and 2. SStables were generated for
shards 0, 1, and 2. So shards 0, 1, and 2 need to load the new sstables
and remove the ancestors. Shard 1 for example will remove sstables A and
B (ancestors) and add the new one. Then it comes this new function.
We'll forward new sstables to their target shards using foreign sstable
open info.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:27 -03:00
Raphael S. Carvalho
c44a2319e6 prevent regular compaction from choosing shared sstables
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That's doesn't
affect current resharding because it marks the sstables as
compacting. That won't work with new resharding which will work
with sstables from multiple shards.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:26 -03:00
Raphael S. Carvalho
13477075e2 compaction_strategy: implement resharding strategy for compaction strategies
Strategies other than leveled will reshard one shared sstable at
a time, and the target shard, shard at which job will run, for each
job will be chosen in a round-robin fashion.

For leveled strategy, we will reshard together smp::count adjacent
sstables that belong to same level.
The reason for that is because resharding one sstable at a time
may result in creation of file for each shard, meaning after
resharding we could end up with NO_SSTABLES*NO_SHARDS.

These resharding strategies will be used for our new resharding
algorithm.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:24 -03:00
Raphael S. Carvalho
bf930476b3 sstables: store more info in foreign_sstable_open_info
We need that info for opening a sstable at different shard, unlike
sstable loader which has everything in entry_descriptor, obtained
from components in sstable filename.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:22 -03:00
Raphael S. Carvalho
e5e7037aa4 sstables: make it possible to get open info from loaded sstable
It will be useful for resharding which will need to move a
sstable across shards, and to do that without reloading the
sstable at target shard, we need to be able to get the open
info and move it to the target shard instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:21 -03:00
Raphael S. Carvalho
405e41e9a8 database: export column family dir
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:19 -03:00
Raphael S. Carvalho
2b774c5bc3 database: inform if column family has shared tables
That's gonna be useful to quickly determine if it's worth resharding
a column family.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:17 -03:00
Raphael S. Carvalho
2d119287b7 sstables: add method to export ancestors
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:16 -03:00
Raphael S. Carvalho
f2f8a2f5c7 lcs: implement get_level_count
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:14 -03:00
Raphael S. Carvalho
585596cede compaction_manager: introduce method to check if manager stopped
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:12 -03:00
Raphael S. Carvalho
d82a8dfae0 lcs: restore invariant instead of sending overlapping sst to L0
A large token span sstable may find its way into high level due to resharding,
which means the strategy invariant is broken. The invariant is restored by
compacting first set of overlapping sstables, meaning that the restoration
is done incrementally for multiple overlapping sets.

Invariant is restored by regular compaction after resharding puts new unshared
sstables into their original level, where level > 0.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:09 -03:00
Raphael S. Carvalho
0127309820 sstables: extend compaction for new resharding
Extends compaction for new resharding algorithm. Not wired yet.
New resharding will compact shared sstable(s) and create one
sstable for each owner. It's up to the caller to open these
new unshared sstables at their respective column families.

This new approach will save a lot of bandwidth because we'll
no longer read the entire shared sstable #smp::count times.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:08 -03:00
Raphael S. Carvalho
758bc38e7a sstables: allow shard A to correctly create sstable for shard B
That's possible by shard A explicitly saying that sstable is created
for shard B. If we don't do that, sharding metadata isn't correct,
and consequently sstable will report wrong owners.

We'll need this for resharding which will create sstables for all
shards that own the shared sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:06 -03:00
Raphael S. Carvalho
2a437ab427 compaction: rework compacting_sstable_writer to work with multiple writers
compacting_sstable_writer only allowed one writer so far, but we will need
multiple ones for resharding.
It's done by moving writer management to compaction.
finish_sstable_writer() is added for compaction impl to stop all writers,
whereas stop_sstable_writer() will only stop current writer (needed when
current sstable reaches max limit size for example).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:05 -03:00
Raphael S. Carvalho
a35a3a9647 compaction: prepare compacting_sstable_writer to work with writers
No need for compacting_sstable_writer to store items that are available
in compaction class. Also, that's a step towards supporting multiple
writers for compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:03 -03:00
Raphael S. Carvalho
38ed83e2f7 sstables: rework compaction to make it easy to extend
compact_sstables() supported both regular and cleanup compaction,
but with lots of conditions that made it ugly and hard to extend.

In the future, we want to introduce a new type of compaction for
resharding that will create one sstable for every shard owning
the sstable(s) given as input. That will be easier now.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-04-21 17:11:02 -03:00
Avi Kivity
fdcf64520d Merge seastar upstream
* seastar 2eec212...194d80f (4):
  > removing the collectd tests
  > fix fstream metrics reporting.
  > do_for_each: Make it check for need preempt
  > core/sharded: introduce copy method to foreign_ptr
2017-04-21 22:14:01 +03:00
Avi Kivity
fccbf2c51f Merge "Reduce memory reclamation latency" from Tomasz
"Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

Statistics for reclamation pauses for a read workload over
larger-than-memory data set:

Before:

 avg = 865.796362
 stdev = 10253.498038
 min = 93.891000
 max = 264078.000000
 sum = 574022.988000
 samples = 663

After:

 avg = 513.685650
 stdev = 275.270157
 min = 212.286000
 max = 1089.670000
 sum = 340573.586000
 samples = 663

Refs #1634."

* tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev:
  lsa: Reduce reclamation latency
  tests: Add test for log_histogram
  log_histogram: Allow non-power-of-two minimum values
  lsa: Use regular compaction threshold in on-idle compaction
  tests: row_cache_test: Induce update failure more reliably
  lsa: Add getter for region's eviction function
2017-04-21 17:47:06 +03:00
Tomasz Grabiec
20f4c9bf23 lsa: Reduce reclamation latency
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

Statistics for reclamation pauses for a read workload over
larger-than-memory data set:

Before:

 avg = 865.796362
 stdev = 10253.498038
 min = 93.891000
 max = 264078.000000
 sum = 574022.988000
 samples = 663

After:

 avg = 513.685650
 stdev = 275.270157
 min = 212.286000
 max = 1089.670000
 sum = 340573.586000
 samples = 663

Refs #1634.

Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
2017-04-21 12:52:31 +02:00
Tomasz Grabiec
4313641c03 tests: Add test for log_histogram 2017-04-21 12:52:31 +02:00
Tomasz Grabiec
c83768d6bb log_histogram: Allow non-power-of-two minimum values
We will want to reuse the min_size mechanism for the whole compaction
threshold, including the occupancy threshold. That threshold is close
to the segment size and we cannot pick a power of two which would be
close enough to what we need.

Therefore, change log_histogram to support arbitrary minimum base.

bucket_of() was moved into log_histogram_options so that it can be used
in number_of_buckets(), which makes for a simple and much less
error-prone implementation.
2017-04-21 10:54:50 +02:00
Tomasz Grabiec
7a800c54bf lsa: Use regular compaction threshold in on-idle compaction
Idle-time compaction should not produce not-compactible segments
becuase that means we would have to evict a lot when we finally need
to reclaim some memory, so that occupancy falls below the regular
compaction threshold. This may cause latency spikes.

Refs #1634.
2017-04-20 15:00:15 +02:00
Tomasz Grabiec
e054ccc037 tests: row_cache_test: Induce update failure more reliably
After changing region evicitability condition to be less strict, cache
update stopped failing because reclamation was able to compact dense
region. Induce failure by installing evictor which refuses to evict
from cache beyond few elements.
2017-04-20 14:51:47 +02:00
Tomasz Grabiec
7aa286439f lsa: Add getter for region's eviction function 2017-04-20 14:51:42 +02:00
Vlad Zolotarov
9c1d803157 fix_system_distributed_tables.py: add --node and --port parameters
Allow giving a non-default IP address and a port to connect to the cluster.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1491316458-18420-1-git-send-email-vladz@scylladb.com>
2017-04-20 14:49:26 +03:00
Avi Kivity
68f0df12ee Merge "Optimize reads with clustering restrictions" from Tomasz
"This series makes several optimizations to sstable mutation reader relevant
for large partitions.

Some highlights:

One optimization is to use the index for skipping across clustering restrictions.
Currently we read whole partition in such cases. That includes the case when
we need to read a static row and then jump to some clustering row in the
middle of the partition. Another case is having more than one clustering
restriction, e.g. selecting multiple single rows from the same partition.

Another optimization is using information from the index for creation of
streamed_mutation. That can save us the cost of reading the partition header
form the data file in case we would not continue reading, but skip to the
middle of that partition. Or we may not even attempt to read anything from
that partition, if after we determine the key that reader will be put behind
other readers, which will exhaust the query limit first.

Another optimization is switching single-partition queries to use the
index_reader infrastructure. Index lookups via index_reader are faster than
find_disk_ranges(). This is also a cleanup, a step towards converting all code
to use the index_reader."

* tag 'tgrabiec/optimize-sstable-reads-with-restrictions-v2' of github.com:cloudius-systems/seastar-dev: (44 commits)
  sstables: Remove unused code
  sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
  sstables: mutation_reader: Use index_reader for single-partition reads
  sstables: mutation_reader: Add trace-level logging
  sstables: mutation_reader: Move partition reading code to sstable_data_source
  sstables: mutation_reader: Move definitions out of the class body
  sstables: Move binary_search() to a header
  database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating
  sstables: index_reader: Introduce advance_to_next_partition()
  sstables: index_reader: Introduce advance_and_check_if_present()
  sstables: index_reader: Introduce advance_past()
  sstables: index_reader: Make copyable
  sstables: index_reader: Optimize advancing to extreme positions
  sstables: index_reader: Keep two last pages alive
  dht: ring_position_view: Add key getter
  dht: ring_position_view: Add constructor and factory from ring_position_view
  sstables: mutation_reader: Advance to next partition using index in some cases
  sstables: index_reader: Expose access to partition key and tombstone
  sstables: index_reader: Introduce promoted_index_view
  sstables: mutation_reader: Move _index_in_current to sstable_data_source
  ...
2017-04-20 13:58:37 +03:00
Tomasz Grabiec
3472a74de4 sstables: Remove unused code 2017-04-20 11:23:05 +02:00
Tomasz Grabiec
c1059ca8e4 sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
It's cheaper than a key-based lookup, so use it when we can.
2017-04-20 11:23:05 +02:00
Tomasz Grabiec
4742008b70 sstables: mutation_reader: Use index_reader for single-partition reads
This switches single-partition query to use the index_reader
infrastructure. Index lookups via index_reader are faster than
find_disk_ranges().

perf_fast_forward, rows: 1000000, value size: 100

Before:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.002182         2        916      3        152       2       0        0        1        1  88.1%

After:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.000758         2       2639      3        152       2       0        0        1        1  48.6%

This is also a cleanup, a step towards converting all code to use the
index_reader.
2017-04-20 11:23:05 +02:00
Tomasz Grabiec
9d8795089d sstables: mutation_reader: Add trace-level logging 2017-04-20 11:18:55 +02:00
Tomasz Grabiec
b198c31c46 sstables: mutation_reader: Move partition reading code to sstable_data_source
It will be reused for read_row(), which does't create mutation_reader
instance, only sstable_data_source.
2017-04-20 11:18:26 +02:00
Tomasz Grabiec
6e4bca0be6 sstables: mutation_reader: Move definitions out of the class body
To make further refactoring easier to review. No functional changes here.
2017-04-20 10:54:38 +02:00
Tomasz Grabiec
4ed7e529db sstables: Move binary_search() to a header
There are instantiations of binary_search() used in sstables.cc, but
defined in partition.cc. The instantiations are explicitly declared in
partition.cc, but the types changed and they became obsolete. The
thing worked because partition.cc also instantiated it with the right
type. But after that code will be removed, it no longer would, and we
would get a linker error. To avoid such problems, define
binary_search() in a header.
2017-04-20 10:54:38 +02:00
Tomasz Grabiec
bedd0ab6f9 database: Pass partition_range to single_key_sstable_reader to avoid copies and decorating 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
0b5ba13230 sstables: index_reader: Introduce advance_to_next_partition() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
4b81844d2e sstables: index_reader: Introduce advance_and_check_if_present() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
b92f095bf0 sstables: index_reader: Introduce advance_past() 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
6780756258 sstables: index_reader: Make copyable 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
7db83fa3fe sstables: index_reader: Optimize advancing to extreme positions 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
f66443c01c sstables: index_reader: Keep two last pages alive
The idea behind caching is that when we have two index readers where
one is catching up with the other, each page will be read only
once. Currently that's not always the case. There is a case when
advance_to() may need to read two pages. That's when the target
position is not found in the first page as determined by the summary
index. The second reader which catches up would have to read the first
page as well, but it would not be in cache any more. To avoid this
extra I/O let's keep a reference to the two last pages touched by the
index.
2017-04-20 10:54:38 +02:00
Tomasz Grabiec
c7b9c5dfd3 dht: ring_position_view: Add key getter 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
5b71e0b9ab dht: ring_position_view: Add constructor and factory from ring_position_view 2017-04-20 10:54:38 +02:00
Tomasz Grabiec
3e8795494e sstables: mutation_reader: Advance to next partition using index in some cases
To produce a streamed_mutation for the next partition, we need to read
its key and the tombstone. Currently we always do that by consuming
the partition header from the data file. In some cases that may cause
unnecessary IO.

It's better to obtain partition information from the index if we
already have it. We can save on IO if the user will skip past the
front of partition immediately after.

It is also better to pay the cost of reading the index if we know that
we will need to use the index anyway soon. This patch predicts that by
checking if there are any clustering restrictions. If there are any,
we will almost surely need_skip() and use the index anyway.

This change also lays the ground for unification of multi and single
partiton queries without loss of performance.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
e35fe7492c sstables: index_reader: Expose access to partition key and tombstone 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
ae72c159b1 sstables: index_reader: Introduce promoted_index_view
So that we have a nice way of extracting tombstone out of it. We not
always need fully parsed index.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
0ef33b7f29 sstables: mutation_reader: Move _index_in_current to sstable_data_source
sstable_data_source holds a shared state between mutation_reader and
streamed_mutation for sstables. The information whether index is in
current partition will have to be accessed by both in the following
patches.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
885f53d905 sstables: mutation_reader: Avoid resetting the walker
Before the change, the following scenario was happening:

  1) we try to skip based on clustering restrictions
  2) we find the page and fast forward to it, recording walker's
     lower bound counter
  3) we read the first fragment, it's not a tombstone, so we reset the walker,
     and its lower bound counter too
  4) the fragment is not in range (the range starts in the middle of the page)
  5) needs_skip() is true, we redo the index lookup, which wastes some CPU

This change fixes the problem by avoiding resetting the walker. We can
do that because leading tombstones are checked with a non-mutable
contains_tombstone()
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
bf21aa3a1f clustering_ranges_walker: Introduce contains_tombstone() 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
b030ce693d sstables: mutation_reader: Don't try to read index to skip to static row
Static row is always at the beginning, there's no point in doing
index lookups.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
3e060659f1 sstables: mutation_reader: Don't try to read static row if table doesn't have any 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
b1860a8a24 clustering_ranges_walker: Allow excluding the static row 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
77d3e30239 sstables: mutation_reader: Use index to skip across clustering restrictions
Improves scans with clustering restrictions. Before the change such
scans would scan whole partition.

Below are results of a test case from perf_fast_forward which selects
few rows from a large partition using query restrictions (not fast forwarding).

Before:

  stride  rows      time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  1000000 1         0.000609         1       1642      3        152       2       1        0        1        1  38.0%
  500000  2         0.242255         2          8    511      64152     398       4        0        1        1  98.6%
  250000  4         0.281592         4         14    749      95832     564       4        0        1        1  98.4%
  125000  8         0.328056         8         24    873     111704     657       4        0        1        1  98.4%
  62500   16        0.306700        16         52    935     119640     751       4        0        1        1  99.4%

After:

  stride  rows      time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  1000000 1         0.000711         1       1406      3        152       2       1        0        1        1  42.1%
  500000  2         0.000910         2       2197      5        216       3       2        0        1        1  39.2%
  250000  4         0.001384         4       2891      9        344       5       4        0        1        1  35.3%
  125000  8         0.003197         8       2502     21        728      13       8        0        1        1  53.1%
  62500   16        0.006664        16       2401     41       1368      25      16        0        1        1  58.2%
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
05a1f92cbc clustering_ranges_walker: Introduce lower_bound_change_counter()
Allows detecting changes of lower_bound().

Result of advance_to() is not enough. When we get false from
advance_to() twice in a row, lower bound may or may not have changed.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
461f2af0a1 sstables: mutation_reader: Avoid index lookups when out of range 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
10c92d37d1 sstables: mutation_reader: Simplify fast_forward_to() 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
bfb6858e55 sstables: mutation_reader: Let clustering_ranges_walker handle the _fwd_range start
Simplifies the code a bit, but also will make it easier to calculate
the next position we should skip to after forwarding, taking into
consideration both the position forwarded to as well as clustering
ranges of the query. That will be just calling
_ck_ranges_walker->lower_bound() after it is trimmed.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
d056d9c31b sstables: mutation_reader: Let mp_row_consumer decide about position passed to the index
In general mp_row_consumer has better information about the next
position to read. It could be after the position we forward to if
there are clustering restrictions. This will be exploited in the
following patches.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
a37712e9ae sstables: mutation_reader: Move mp_row_consumer::fast_forward_to() out of line 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
bb3e683783 clustering_ranges_walker: Support trimming
Makes implementing fast_forward_to() easier. mp_row_consumer emulates
this currently. This change will allow simplifying this.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
652d04e78a clustering_ranges_walker: Generalize to work on position ranges
It will include the static row by default. This will allow simplifying
users, which work with position ranges already.
2017-04-20 10:54:36 +02:00
Tomasz Grabiec
c85fe3183c position_range: Allow stealing of bounds 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
503c68de44 position_in_partition: Add more factory methods 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
6c1dc642ee sstables: mutation_reader: Create index on-demand 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
434fda3577 sstables: mutation_reader: Keep priority_class by reference
To indicate that it is not optional.
2017-04-20 10:54:36 +02:00
Tomasz Grabiec
a8c126c82a sstables: Expose get_index_reader() 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
e1af5a406d sstables: Make sstable::get_index_reader() return unique_ptr<>
Makes callers a bit simpler
2017-04-20 10:54:36 +02:00
Tomasz Grabiec
7dc3fe7d3f tests: perf_fast_forward: Add test case for forwarding with clustering restrictions in a large partition 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
eed864690b tests: perf_fast_forward: Add test case for slicing of large partition using a single-partition reader 2017-04-20 10:54:36 +02:00
Tomasz Grabiec
81fc7977a4 tests: perf_fast_forward: Add test for selecting few rows from large partition 2017-04-20 10:54:36 +02:00
Raphael S. Carvalho
3286f7aaa6 compaction: make major compaction go through compaction manager
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor and major compaction
at a given time. The only issue is number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.

Fixes #1156.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417233125.14092-1-raphaelsc@scylladb.com>
2017-04-19 15:44:21 +03:00
Duarte Nunes
e06bafdc6c alter_type_statement: Fix signed to unsigned conversion
This could allow us to alter a non-existing field of an UDT.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170419114254.5582-1-duarte@scylladb.com>
2017-04-19 14:48:12 +03:00
Tomasz Grabiec
02da3ba316 tests: perf_fast_forward: Fix use-after-free in scan_with_stride_partitions()
partition_range must live as long as the reader is used.
2017-04-19 08:37:56 +02:00
Raphael S. Carvalho
e78db43b79 compaction_manager: fix crash when dropping a resharding column family
Problem is that column family field of task wasn't being set for resharding,
so column family wasn't being properly removed from compaction manager.
In addition to fixing this issue, we'll also interrupt ongoing compactions
when dropping a column family, exactly like we do with shutdown.

Fixes #2291.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170418125807.7712-1-raphaelsc@scylladb.com>
2017-04-18 17:39:27 +03:00
Duarte Nunes
af37a3fdbf logalloc: Fix compilation error
This patch moves a function using the region_impl type after the type
has been defined.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170418124551.25369-1-duarte@scylladb.com>
2017-04-18 15:56:26 +03:00
Raphael S. Carvalho
11b74050a1 partitioned_sstable_set: fix quadratic space complexity
streaming generates lots of small sstables with large token range,
which triggers O(N^2) in space in interval map.
level 0 sstables will now be stored in a structure that has O(N)
in space complexity and which will be included for every read.

Fixes #2287.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
2017-04-18 13:04:38 +03:00
Takuya ASADA
86e464ab26 dist/offline_installer: support Ubuntu/Debian
moved existing script to dist/offline_installer/redhat, added .deb version into
dist/offline_installer/debian.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1492474821-9907-1-git-send-email-syuu@scylladb.com>
2017-04-18 10:56:50 +03:00
Pekka Enberg
b31c45d8af Merge "clang fixes (part 1)" from Avi
"This series fixes some errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler.  A few more fixes are needed to produce
a binary."

* tag 'clang/1/v1' of https://github.com/avikivity/scylla:
  logalloc: avoid auto in function argument declaration
  thrift: avoid auto in function argument declaration
  streamed_mutation: fix non-POD argument to C-style variadic function
  mutation_partition_serializer: avoid auto in function argument declaration
  date: use correct casts for years
  streaming: avoid auto in function argument declaration
  repair: avoid auto in function argument declaration
  gms: expose gms::inet_address streaming operator
  murmur3_partitioner: fix build on clang
  i_partitioner: remove unused function
  byte_ordered_partitioner: fix bad operator precedence
  result_set: pass comparator by reference to std::sort()
  to_string: move standard container overloads of to_string to std:: namespace
  cql_type: fix bad enum syntax on clang
  build: disable more warnings for clang
  build: fix detection of unsupported warnings on clang
2017-04-18 08:49:25 +03:00
Avi Kivity
844529fe33 logalloc: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.

Since the right type is private, add some friendship.
2017-04-17 23:18:44 +03:00
Avi Kivity
54add19ca2 thrift: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:18:44 +03:00
Avi Kivity
f0c25fc20f streamed_mutation: fix non-POD argument to C-style variadic function
Clang warns that passing a non-POD to a C-style variadic function will
result in an abort().  That happens to be exactly what we want, but to
silence the warning, use a template instead.  Since templates aren't
allowed in local classes, move the containing class to namespace scope.
2017-04-17 23:18:44 +03:00
Avi Kivity
635c32eb32 mutation_partition_serializer: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:18:44 +03:00
Avi Kivity
a0858dda3e date: use correct casts for years
Our date implementation uses int64_t for years, but some of the code was
not changed; clang complains, so use the correct casts to make it happy.
2017-04-17 23:03:15 +03:00
Avi Kivity
ca69a04969 streaming: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:03:15 +03:00
Avi Kivity
ae7d7ae20f repair: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:03:15 +03:00
Avi Kivity
c885c468a9 gms: expose gms::inet_address streaming operator
The standard says, and clang enforces, that declaring a function via
a friend declaration is not sufficient for ADL to kick in.  Add a namespace
level declaration so ADL works.
2017-04-17 23:03:15 +03:00
Avi Kivity
af118ab52b murmur3_partitioner: fix build on clang
Don't know what the root cause it, but the fix is harmless.
2017-04-17 23:03:15 +03:00
Avi Kivity
c05f60387b i_partitioner: remove unused function
Found by clang.
2017-04-17 23:03:15 +03:00
Avi Kivity
a496ec7f5b byte_ordered_partitioner: fix bad operator precedence
Found by clang.
2017-04-17 23:03:15 +03:00
Avi Kivity
d9aaa95b29 result_set: pass comparator by reference to std::sort()
Clang complains about some error without it, I could not understand it, but
I'm not going to argue with it.

Since std::sort() will copy the comparator, it's better to pass using an
std::ref(), and everyone is happy.
2017-04-17 23:03:15 +03:00
Avi Kivity
a83a24268d to_string: move standard container overloads of to_string to std:: namespace
Argument-dependent lookup will not find to_string() overloads in the global
namespace if the argument and the caller are in other namespaces.

Move these to_string() overloads to std:: so ADL will find them.

Found by clang.
2017-04-17 23:03:15 +03:00
Avi Kivity
a7fe7aedbf cql_type: fix bad enum syntax on clang
cql3::type used some gcc extension that is not recognized on clang; use
the standard syntax instead.
2017-04-17 22:35:41 +03:00
Avi Kivity
1faef017e3 build: disable more warnings for clang
We should fix the source and re-enable the warnings, but this will do for
now.
2017-04-17 22:34:59 +03:00
Avi Kivity
78e9b0265b build: fix detection of unsupported warnings on clang
The diagnostic that clang spits out when it sees an unrecognized warning
is itself a warning, so the test compilation succeeds and we don't notice
the warning is not supported.

Adding -Werror turns the warning about the unrecognized warning into an
error, allowing the detection machinery to work.
2017-04-17 22:33:01 +03:00
Takuya ASADA
b8f40a2dff dist/ami/files/.bash_profile: warn user when enhanced networking is not enabled
Show warnings on following conditions:
 - VPC is not used
 - Driver is not enhanced networking one

Fixes #1984

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488844756-14935-1-git-send-email-syuu@scylladb.com>
2017-04-15 15:16:55 +03:00
Benoît Canet
8f793905a3 perf_sstable: Change busy loop to futurized loop
The blocked task detector introduced in
113ed9e963 was seeing
the initialization phase of perf_ssttable as a blocked
task.

Tranform this part of the code in a futurized loop
to make to blocked task detector happy.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170413132506.17806-1-benoit@scylladb.com>
2017-04-13 18:17:28 +03:00
Amnon Heiman
1dfd32f070 scylla-housekeeping service: Support private repositories
This patch add support for private repositories for scylla-housekeeping.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-04-13 18:13:57 +03:00
Amnon Heiman
5839dc1f20 scylla-housekeeping-upstart: Use repository id, when checking for
version

This patch allows the check version to use private repository when
checking for version.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-04-13 18:12:52 +03:00
Amnon Heiman
622502de7a scylla-housekeeping: support private repositories
This patch allows the check version to support private repositories.

If a repository file is passed as a parameter, the repository id will be
passed passed as a parameter when checking version.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-04-13 18:11:15 +03:00
Avi Kivity
7d16cfa5f0 Merge branch 'penberg/create-index-stmt-cleanup/v1' of github.com:cloudius-systems/seastar-dev
"The version of create_index_statement class that was translated to C++
is pretty old by now. This series of cleanups brings it closer to Apache
Cassandra trunk to make it easier to bring over more secondary index
code to Scylla."

* 'penberg/create-index-stmt-cleanup/v1' of github.com:cloudius-systems/seastar-dev:
  cql3/statements/create_index_statement: Move target validation
  cql3/statements/create_index_statement: Remove static column validation
  cql3/statements/create_index_statement: Extract validations
  cql3/statements/create_index_statement: Kill bogus custom validation
  cql3/statements/create_index_statement: Add materialized view to validate()
  cql3/statements/create_index_statement: Remove validation
2017-04-13 13:27:53 +03:00
Asias He
d27b47595b gossip: Fix possible use-after-free of entry in endpoint_state_map
We take a reference of endpoint_state entry in endpoint_state_map. We
access it again after code which defers, the reference can be invalid
after the defer if someone deletes the entry during the defer.

Fix this by checking take the reference again after the defering code.

I also audited the code to remove unsafe reference to endpoint_state_map entry
as much as possible.

Fixes the following SIGSEGV:

Core was generated by `/usr/bin/scylla --log-to-syslog 1 --log-to-stdout
0 --default-log-level info --'.
Program terminated with signal SIGSEGV, Segmentation fault.
(this=<optimized out>) at /usr/include/c++/5/bits/stl_pair.h:127
127     in /usr/include/c++/5/bits/stl_pair.h
[Current thread is 1 (Thread 0x7f1448f39bc0 (LWP 107308))]

Fixes #2271

Message-Id: <529ec8ede6da884e844bc81d408b93044610afd2.1491960061.git.asias@scylladb.com>
2017-04-13 13:18:17 +03:00
Takuya ASADA
81c1b07bac dist: add offline installer
This introduce offline installer generator.
It will generate self-extractable archive witch contains Scylla packages
and dependency packages.
Package installation automatically starts when the archive executed.

Limitation: Only supported CentOS at this point.

Fixes #2268

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491997091-15323-1-git-send-email-syuu@scylladb.com>
2017-04-13 13:16:09 +03:00
Avi Kivity
ac48767146 Merge "tracing and cql3 patches" from Vlad
"This series was initially meant to only transition the keyspace based backend to work
on top of prepared statements but there were a few potential issues found on the way.

In addition the original Tracing series has been expanded with a few patches in the cql3 layer
that are improving the generic clq3 layer but are not obvious without the context of the
following Tracing patches.

The "main" patch contains a heavy rework of trace_keyspace_helper:
   - Use prepared statements for updating tables instead of manually constructing mutations:
      - We intentionally decrease the level of code robustness from "paranoid" to "normal".
      - The code gets a lot more simple, e.g. we don't need to cache columns definitions any more.
      - We are loosing some performance here but:
          - Tracing write is not in the fast path.
          - Tracing write events should be rare.
          - Currently the performance loss (for the actual write time of all trace records) for a "SELECT" query with a specific key is
            about 45%: 144us vs 99us."

* 'tracing_rework_using_prepared-v6' of github.com:cloudius-systems/seastar-dev:
  tracing: use prepared statment for updating tables
  tracing::trace_keyspace_helper: add a bad_column_family constructor that accepts an std::exception parameter
  tracing::trace_keyspace_helper: introduce a table_helper class
  tracing::trace_keyspace_helper: add static qualifier to  make_monotonic_UUID_tp() and elapsed_to_micros() methods
  tracing::tracing: allow slow query TTL only in the signed 32-bit integer range
  cql3::query_processor::prepare(): futurize the error case
  cql3::query_options: add a factory method for creation of options for a BATCH statement
  cql3::statements::batch_statement: add a constructor that doesn't receive the "bound_terms" value
  cql3::query_processor: use weak_ptr for passing the prepared statements around
2017-04-13 11:07:49 +03:00
Raphael S. Carvalho
a6f8f4fe24 compaction: do not write expired cell as dead cell if it can be purged right away
When compacting a fully expired sstable, we're not allowing that sstable
to be purged because expired cell is *unconditionally* converted into a
dead cell. Why not check if the expired cell can be purged instead using
gc before and max purgeable timestamp?

Currently, we need two compactions to get rid of a fully expired sstable
which cells could have always been purged.

look at this sstable with expired cell:
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 120,
        "liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z",
"ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true },
        "cells" : [
          { "name" : "country", "value" : "1" },
        ]

now this sstable data after first compaction:
[shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79
(~65% of original) in 229ms = 0.000328997MB/s.

  {
    ...
    "rows" : [
      {
        "type" : "row",
        "position" : 79,
        "cells" : [
          { "name" : "country", "deletion_info" :
{ "local_delete_time" : "2017-04-09T17:07:12Z" },
            "tstamp" : "2017-04-09T17:07:12.702597Z"
          },
        ]

now another compaction will actually get rid of data:
compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original)
in 1ms = 0MB/s. ~2 total partitions merged to 0

NOTE:
It's a waste of time to wait for second compaction because the expired
cell could have been purged at first compaction because it satisfied
gc_before and max purgeable timestamp.

Fixes #2249, #2253

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>
2017-04-13 10:59:19 +03:00
Vlad Zolotarov
f8956ba01a tracing: use prepared statment for updating tables
In addition to actually moving to using the prepared statements the changes also include:
   - Kill the cache_xxx() methods - the schema is going to be checked during the
     prepared statement creation and during its execution.
   - Move the caching of table ID and the prepared statement to the get_schema_ptr_or_create().
   - Rename: get_schema_ptr_or_create() -> cache_table_info().

After these changes we are less strict in our demands to system_traces tables schemas, e.g.
if some column's type is not exactly as we expect but rather only "compatible" in the CQL sense
we will tolerate this and will continue to write into that table.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 17:12:42 -04:00
Vlad Zolotarov
baf0289951 tracing::trace_keyspace_helper: add a bad_column_family constructor that accepts an std::exception parameter
An object built with this constructor will use the what() message from the
given exception in the final error message.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 17:04:54 -04:00
Vlad Zolotarov
98864c6c30 tracing::trace_keyspace_helper: introduce a table_helper class
This class contains a general table info and implements standard operations on this table:
   - Creation.
   - Info caching.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 17:04:48 -04:00
Pekka Enberg
8c45038729 cql3/statements/create_index_statement: Move target validation
Move index target validation to preserve the same code structure as
Apache Cassandra and simplify support for multiple index targets.
2017-04-12 20:50:09 +03:00
Pekka Enberg
528c33a05b cql3/statements/create_index_statement: Remove static column validation
Apache Cassandra supports secondary indices on static columns since
commit 9e74891 ("Add support for secondary indexes on static columns").
2017-04-12 20:46:29 +03:00
Pekka Enberg
975e1c8fc6 cql3/statements/create_index_statement: Extract validations
Extract specific validations to separate functions to preserve the same
structure as Apache Cassandra code and make it easier to add support for
multiple index targets.
2017-04-12 20:44:15 +03:00
Pekka Enberg
940d6de1b8 cql3/statements/create_index_statement: Kill bogus custom validation
Rejecting custom indices is bogus because it's just a configuration
mechanism like replication strategy, for example. Furthermore, it's
needed for SASI indices, which we likely need to be compatible with.
2017-04-12 20:17:09 +03:00
Pekka Enberg
cfadb70565 cql3/statements/create_index_statement: Add materialized view to validate()
Apache Cassandra does not support secondary indices on materialized
views so neither should we.
2017-04-12 20:15:20 +03:00
Pekka Enberg
1706f346f2 cql3/statements/create_index_statement: Remove validation
The validation was removed in Apacha Cassandra commit 0626be8 ("New 2i
API and implementations for built in indexes"). Let's also remove it
from our code so that we remove one dependency to the obsolete db/index/
code.
2017-04-12 20:14:52 +03:00
Vlad Zolotarov
b4bf0735b8 tracing::trace_keyspace_helper: add static qualifier to make_monotonic_UUID_tp() and elapsed_to_micros() methods
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
aa1f8ccea4 tracing::tracing: allow slow query TTL only in the signed 32-bit integer range
Any TTL is eventually converted into the gc_clock::duration value, which
is based on int32_t type. Limit the node_slow_log TTL user configurable value
to the same values range for consistency.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
1685cefa57 cql3::query_processor::prepare(): futurize the error case
Make sure that errors are reported in a form of an exceptional future and
not by a direct exception throwing.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
fcef9d3b05 cql3::query_options: add a factory method for creation of options for a BATCH statement
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
75fbc7c558 cql3::statements::batch_statement: add a constructor that doesn't receive the "bound_terms" value
This constructor should be used when we know that there are no bound terms in the current
batch statement.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:08 -04:00
Vlad Zolotarov
ff55b76562 cql3::query_processor: use weak_ptr for passing the prepared statements around
Use seastar::checked_ptr<weak_ptr<pepared_statement>> instead of shared_ptr for passing prepared statements around.
This allows an easy tracking and handling of statements invalidation.

This implementation will throw an exception every time an invalidated
statement reference is dereferenced.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-04-12 12:24:03 -04:00
Pekka Enberg
ecb8ee4efd cql3/statements: Cleanup create_index_statement.cc
Use namespaces and fix formatting issues to make the source file easier
on the eyes.

Message-Id: <1491977964-26629-1-git-send-email-penberg@scylladb.com>
2017-04-12 16:48:45 +03:00
Avi Kivity
db73cc045f Merge seastar upstream
* seastar e899c0b...2eec212 (4):
  > Merge "add seastar::checked_ptr class" from Vlad
  > resource: reduce default_reserve_memory size to fit low memory environment
  > scripts: posix_net_conf.sh: add --tune net parameter to perftune.py invocation
  > core: weak_ptr: add a weak_ptr(std::nullptr_t) constructor
2017-04-12 13:49:21 +03:00
Paweł Dziepak
0318dccafd lsa: avoid unnecessary segment migrations during reclaim
segment_zone::migrate_all_segments() was trying to migrate all segments
inside a zone to the other one hoping that the original one could be
completely freed. This was an attempt to optimise for throughput.

However, this may unnecesairly hurt latency if the zone is large, but
only few segments are required to satisfy reclaimer's demands.
Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>
2017-04-11 08:55:29 +02:00
Gleb Natapov
b4c368a6bc storage_proxy: update correct statistics on range reads
Fixes #2167

Message-Id: <20170405094119.GM8197@scylladb.com>
2017-04-09 18:16:06 +03:00
Glauber Costa
a808c32676 scylla_util: fix issues with cpuset handling
While cpuset.conf is supposed to be set before this is used, not having
a cpuset.conf at all is a valid configuration. The current code will
raise an exception in this case, but it shouldn't.

Also, as noted by Amos, atoi() is not available as a global symbol. Most
invocations were safe, calling string.atoi(), but one of them wasn't.

This patch replaces all usages of atoi() with int(), which is more
portable anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170407172937.17562-1-glauber@scylladb.com>
2017-04-09 16:04:02 +03:00
Glauber Costa
f842aeb07a mark i3 as a supported instance during login
We have recently fixed the ami init scripts to mark i3 as a supported
instance. However, the code to detect whether or not the instance is
supported is duplicated, and called from multiple locations. That means
that when the user logs in, it will see the instance as not supported -
as the test is coming from a different source.

This patch moves it to the scylla_lib.sh utilities script, so we can
share it, and make sure it is right for all locations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406201137.8921-1-glauber@scylladb.com>
2017-04-09 12:35:35 +03:00
Glauber Costa
7ef8c6aaec centos: add runtime dependency for perftune script
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406172949.13264-1-glauber@scylladb.com>
2017-04-09 11:36:26 +03:00
Glauber Costa
ca8ca3b823 scylla_io_setup: change permissions
The script ended up with the default permissions, which lack the
executable bit. Change it, so it can be executed directly.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170407173109.17870-1-glauber@scylladb.com>
2017-04-09 11:36:13 +03:00
Pekka Enberg
a9ad5cc560 dist/redhat: Add node_health_check to package 2017-04-08 08:12:57 +03:00
Tomer Sandler
c49f944483 dist: Add node_health_check script
This patch adds a script for running an automated node health check.

[ penberg: Fold into single commit ]
2017-04-08 08:06:30 +03:00
Avi Kivity
3b19ea8796 Update scylla-ami submodule
* dist/ami/files/scylla-ami 5d73a71...f10db69 (2):
  > use latest repo for amzn kernel
  > do not use duplicated code to test for instance support status
2017-04-07 12:28:45 +03:00
Avi Kivity
f579e4cc46 Merge seastar upstream
* seastar c5dd395...e899c0b (5):
  > perftune: remove psutil dependency
  > metrics: change push_back({...}) to emplace_back(...)
  > prometheus: return the metric prefix
  > Merge "scripts/perftune.py: add disks tuning" from Vlad
  > Use safe name for metric family
2017-04-06 17:10:19 +03:00
Glauber Costa
f4502bfb79 ami: update packer to 1.0.0 so ENA gets enabled
Older versions of packer do not support ENA, and when faced
with the option

      "enhanced_networking": true,

will only actually enable it for the older 82599 VF instances.
Fortunately, packer 1.0 already supports it, and all we have to do is
update it.

While we are at it, let's check if the file is legit before using a
random file we have downloaded from the internet, to avoid breaching our
building process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406123952.14708-1-glauber@scylladb.com>
2017-04-06 16:39:19 +03:00
Glauber Costa
039f8d5994 dist/redhat: install the new scylla_util.py file
For the debian package, the files don't have to be listed individually,
but for RPMs they do.

The rpm builds are currently failing with:

error: Installed (but unpackaged) file(s) found:
  /usr/lib/scylla/scylla_util.py

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170406012336.21081-1-glauber@scylladb.com>
2017-04-06 10:20:32 +03:00
Avi Kivity
ecf5781597 Update scylla-ami submodule
* dist/ami/files/scylla-ami 9e8c36d...5d73a71 (1):
  > support i3 instances
2017-04-05 18:52:46 +03:00
Amnon Heiman
6c1858b275 API:storage_service should support metrics load
Following C* API there are two APIs for getting the load from
storage_service:
/storage_service/metrics/load
/storage_service/load

This patch adds the implementation for
/storage_service/metrics/load

The alternative, is to drop on of the API and modify the JMX
implementation to use the same API.

Fixes #2245

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170401181520.19506-1-amnon@scylladb.com>
2017-04-05 18:14:19 +03:00
Tomasz Grabiec
d523c60629 sstables: Push fragments from mp_row_consumer so that parser is interrupted less
Currently we return proceed::no after every mutation_fragment which is
to be consumed. This froces parser to save and reload its state
often. This can be avoided if we pushed the fragments directly from
mp_row_consumer, then we would return proceed::no only when the buffer
fills up.

tests/perf/perf_fast_forward shows 15% increase in throughput of a large partition scan,
from 1.34M frag/s to 1.55M frag/s.

Message-Id: <1490882700-22684-1-git-send-email-tgrabiec@scylladb.com>
2017-04-05 18:10:54 +03:00
Takuya ASADA
da75ce694a dist: migrate python2 scripts to python3
Now we can package python3 .py script on CentOS, no need to keep using
python2 so migrate them to python3.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491402479-7703-1-git-send-email-syuu@scylladb.com>
2017-04-05 17:35:35 +03:00
Takuya ASADA
610bc31b04 dist/redhat/scylla.spec.in: stop compiling .py on rpm
rpmbuild tries to compile any *.py by default, but it causes compilation
error on python3 code when python2 is system default (CentOS, RHEL).

So skip compiling it, drop .pyc / .pyo from the package.

Fixes #2235

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1490859768-18900-1-git-send-email-syuu@scylladb.com>
2017-04-05 12:15:22 +03:00
Avi Kivity
58318db4bb Merge "Rewrite scylla_io_setup in python" from Glauber
"Also, this new version supports i3.
Number of requests for i3 is obtained similarly as to i2: I have run tests
for a single disk, and then we'll take the amount of disks into account. Other
possible limits are also taken into account, like the max per-shard seastar limit
of 128 in-flight request, and the per-disk limit obtained by sysfs."

* 'python-io-setup' of https://github.com/glommer/scylla:
  rewrite scylla_io_setup in python
  scripts: add python module with common utilities
2017-04-05 10:39:54 +03:00
Avi Kivity
88637c86c2 Update ami submodule
* dist/ami/files/scylla-ami 407e8f3...9e8c36d (1):
  > Switch to Amazon Linux's kernel
2017-04-04 21:55:36 +03:00
Glauber Costa
2fa698ee95 rewrite scylla_io_setup in python
We do it using the new scylla_util.py library. As we do it, we also
enable i3 support.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-04-04 14:44:52 -04:00
Glauber Costa
ba7010b7a5 scripts: add python module with common utilities
As we convert more stuff to python, we'll have more opportunities for
sharing code between them. We already do that for the bash scripts with
a file "scylla_lib.sh". We'll do the same for python.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-04-04 14:42:19 -04:00
Vlad Zolotarov
c26799c9b0 config: enforce the 'stop' value for commit_failure_policy/disk_failure_policy
Fixes #2246

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1491246164-26612-1-git-send-email-vladz@scylladb.com>
2017-04-04 16:46:36 +03:00
Takuya ASADA
4262edd843 dist: use distribution standard fstrim script instead of our custom one
We recently introduced fstrim cronjob / systemd timer unit, but some of
distributions already has their own fstrim cronjob / systemd timer unit.
So let's use them when it's posssible.

Fixes #2233

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491226727-10507-1-git-send-email-syuu@scylladb.com>
2017-04-04 16:44:45 +03:00
Takuya ASADA
72696eff22 dist/common/scripts/scylla_setup: add 'cancel' feature on RAID disk selector prompt
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491217320-10211-2-git-send-email-syuu@scylladb.com>
2017-04-04 15:35:23 +03:00
Takuya ASADA
4eee3dd778 dist/common/scripts/scylla_setup: skip list DVD drive on block device list for RAID
Fixes #2230

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491217320-10211-1-git-send-email-syuu@scylladb.com>
2017-04-04 15:35:22 +03:00
Takuya ASADA
b087616a6c dist/debian/debian/scylla-server.upstart: export SCYLLA_CONF, SCYLLA_HOME
We are sourcing sysconfig file on upstart, but forgot to load them as
environment variables.
So export them.

Fixes #2236

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491209505-32293-1-git-send-email-syuu@scylladb.com>
2017-04-03 16:34:20 +03:00
Pekka Enberg
57c4bed420 Revert "dist: add --options-file /etc/scylla/scylla.yaml on sysconfig"
This reverts commit 58e628eb3d. Takuya says it
does not fix issue #2236.
2017-04-03 16:33:58 +03:00
Takuya ASADA
58e628eb3d dist: add --options-file /etc/scylla/scylla.yaml on sysconfig
After RAID devices mounted to /var/lib/scylla, scylla-server doesn't able to
find ./conf/ directory since we haven't created symlinks on the volume.

Instead of creating symlink on scylla_raid_setup, let's specify scylla.yaml
path on program argument.

Fixes #2236

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1491056110-1078-1-git-send-email-syuu@scylladb.com>
2017-04-02 11:28:27 +03:00
Vlad Zolotarov
2d8fcde695 init: add a proper message when there is a bad 'seeds' configuration
Fixes #2193

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490912678-32004-1-git-send-email-vladz@scylladb.com>
2017-04-02 10:41:52 +03:00
Avi Kivity
d03207e939 Merge seastar upstream
* seastar 2ebe842...c5dd395 (3):
  > configure.py: Fix unrecognized option error
  > Merge fixes for fstream slow start from Paweł
  > Merge "A cleaner safer metrics layer" from Amnon
2017-04-02 10:17:40 +03:00
Tzach Livyatan
4efee3432b dist/ami: run nodetool status on each login
Running nodetool status on each login to Scylla AMI helps in three ways:
- give the user a quick view of the node and cluster status beyond the current "scylla is active"
- hint to the user about the nodetool and how to use it
- move the first, slow, run of nodetool to the login phase, making the second interactive run much faster

on the down side, it does slow the login in a few sec

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
[ penberg: fix formatting ]
Message-Id: <20170330081711.22038-1-tzach@scylladb.com>
2017-03-30 13:45:48 +03:00
Vlad Zolotarov
761a5eb72f scripts: add fix_system_distributed_tables.py
This script validates schemas of distributed system keyspaces:
system_traces and system_auth.

It tries to add the missing columns and checks that the existing columns
have expected types.

In case of any problem the corresponding info message is printed and non-zero exit
status is returned.

Extra columns are ignored.

The validation function may also be called from another Python program.
It returns True in case of success and False otherwise.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490195645-29905-1-git-send-email-vladz@scylladb.com>
2017-03-29 19:15:05 +03:00
Avi Kivity
5b530aa464 Merge "Use promoted index for skipping in sstable mutation readers" from Tomasz
"sstable_streamed_mutation::fast_forward_to() is changed to use promoted index
(via index_reader) to optimize skipping in large partitions.

In addition to that, sstable mutation_reader is changed to use the index
to skip to the next partition.

Performance impact was evaluated using newly added tests/perf/perf_fast_forward

What's beyond this series:

  - Using index_reader for single-partition reads as well

  - Using index_reader for skipping across ranges in clustering restrictions"

* tag 'tgrabiec/skip-within-partition-using-index-v2' of github.com:cloudius-systems/seastar-dev: (47 commits)
  tests: Add performance test for fast forwarding of sstable readers
  tests: Allow starting cql_test_env on pre-existing data
  config: Allow specifying source when setting value
  tests: sstable: Add test for fast forwarding within partition using index
  sstables: sstable_streamed_mutation: use index in fast_forward_to()
  sstables: Store parsed promoted index in index_entry
  sstables: Add trace-level logging for sstable consumption
  sstables: Define deletion_time earlier
  sstables: Make parsing throw exception on malformed promoted index
  tests: Add tests for ordering of position_in_partition relative to composites
  position_range: Introduce all_clustered_rows() factory method
  position_in_partition: Introduce for_key()/after_key() factory methods
  position_in_partition: Add factory methods for positions around all rows
  position_in_partition: Introduce for_range_start()/for_range_end()
  position_in_partition: Fix friendship declaration
  keys: Introduce is_empty() for prefixes
  position_in_partition: Make comparable with composites
  types: Enhance lexicographical comparators
  compound_compat: Accept marker value in serialize_value()
  compound_compat: Add trichotomic comparator
  ...
2017-03-29 19:01:12 +03:00
Raphael S. Carvalho
023031b0c8 compaction: lcs: fix functionality to feed starved levels
quick introduction to level starvation:
high levels may be left uncompacted (thus starved) for a long time if user
makes something that make they contain little data, such as cleanup or change
of max sstable size (default 160M). Leveled strategy handles this problem as
follow: consider we're compacting L1 to L2. If L3 is starved, we look for one
of its sstable that is fully contained in token range of candidates L1->L2,
so that we won't end up with an overlapping in L2.

now the problem:
the functionality isn't working properly now because range of candidates is
being incorrectly calculated due to an accident when converting the code to
C++. It won't cause an overlap because it's actually being more restrictive
about which sstable from starved level can be used.

A test case was added to confirm the problem.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170328223753.15398-1-raphaelsc@scylladb.com>
2017-03-29 18:59:46 +03:00
Takuya ASADA
5e0cb39db6 dist: follow DPDK tools directory name change
Fixes #2234

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1490776630-6626-1-git-send-email-syuu@scylladb.com>
2017-03-29 11:38:39 +03:00
Tomasz Grabiec
97742fd4c2 test.py: Enable stack trace on UBSAN errors
Message-Id: <1490769716-10217-1-git-send-email-tgrabiec@scylladb.com>
2017-03-29 11:08:05 +03:00
Tomasz Grabiec
7fd724821b tests: Add performance test for fast forwarding of sstable readers 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
543a484d78 tests: Allow starting cql_test_env on pre-existing data 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
2c775bbb6e config: Allow specifying source when setting value
So that is_set() will be true for that option. Needed in tests which
set some config options in higher layer and then lower layers detects
if option was set or not before applying its default.
2017-03-28 18:34:55 +02:00
Tomasz Grabiec
f1aca6d116 tests: sstable: Add test for fast forwarding within partition using index 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
3fbc0bed6e sstables: sstable_streamed_mutation: use index in fast_forward_to() 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5b36976bf0 sstables: Store parsed promoted index in index_entry 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
a2a8312c78 sstables: Add trace-level logging for sstable consumption 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5af815bf20 sstables: Define deletion_time earlier 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5e34743882 sstables: Make parsing throw exception on malformed promoted index
Will be easier to propagate failure to upper layers once parsing is
reused in the index_reader.

The old behavior of ignoring parsing failures is preserved, but the
error is logged now.
2017-03-28 18:34:55 +02:00
Tomasz Grabiec
b40b20387a tests: Add tests for ordering of position_in_partition relative to composites 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5b813898bc position_range: Introduce all_clustered_rows() factory method 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
622713be60 position_in_partition: Introduce for_key()/after_key() factory methods 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e27fa712f5 position_in_partition: Add factory methods for positions around all rows 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
b90275f8e3 position_in_partition: Introduce for_range_start()/for_range_end() 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5c29f4dd04 position_in_partition: Fix friendship declaration 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
212a021fc6 keys: Introduce is_empty() for prefixes 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
2ff6c1705b position_in_partition: Make comparable with composites 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
4e7fe40a70 types: Enhance lexicographical comparators
They now accept optional lexicographical_relation which can be used
to alter position of the element relative to elements prefixed by it.

Example. Let's consider lexicographical ordering on strings. The
position of "bc" in a sample sequence is affected by
lexicographical_realtion as follows:

   aa
   aaa
   b
   ba
 --> before_all_prefixed
   bc
 --> before_all_strictly_prefixed
   bca
   bcd
 --> after_all_prefixed
   bd
   bda
   c
   ca
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
6c6be0f7e4 compound_compat: Accept marker value in serialize_value() 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
1331d10811 compound_compat: Add trichotomic comparator 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
2121c2ee8a compound_compat: Make composites printable 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
38e14ab3c8 compound_compat: Introduce composite::serialize_static()
Generelized from static_prefix().
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
bef677b57d compound_compat: Introduce composite_view::last_eoc() 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
18a057aa81 compound_compat: Return composite from serialize_value()
To make the code more type-safe. Also, mark constructor from bytes
explicit.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
680ffd20f5 compound: Use const bytes_view as iterator's value type
The iterator doesn't really allow modifying unserlying component.

This change enables using the iterator with boost::make_iterator_range() and
boost::range::join(), which get utterly confused otherwise.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
123b102dd6 sstables: Skip to next partition using index
Slicing front of a very large partition:

Before:

 offset  read      time [s]     frags     frag/s    aio      [KiB] blocked dropped    cpu
 0       1         0.110960         1          9    992     126956     924       0  92.4%

After:

 offset  read      time [s]     frags     frag/s    aio      [KiB] blocked dropped    cpu
 0       1         0.000784         1       1276      3        344       2       1  37.3%
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
a9252dfc58 sstables: Use separate index readers for lower and upper bounds
So that lower bound can be advanced within the range.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
27d86dfe18 sstables: Enable skipping to cells at data_consume_context level 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
aad943523a sstables: index_reader: Add trace-level logging 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
388315c1ff sstables: Expose index metrics 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
1dbd2e239e sstables: index_reader: Share index lists among other index readers
Direct motivation for this is to be able to use two index readers from
a single mutation reader, one for lower bound of the range and one for
the upper bound of the range, without sacrificing optimization of
avoiding index reads when forwarding to partition ranges which are
close by. After the change, all index readers of given sstable will
share index buffers, so lower bound reader can reuse the page read by
the upper bound reader.

The reason for using two readers will be so that we are able to skip
inside the partition range, not only outside of it. This is not
possible if we use the same index reader to locate the upper bound of
the range, because we may only advance the cursor.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
0635d74e17 sstables: Make index_entry copyable
Needed to make the index_list copyable, which is going to be needed to
implement legacy get_index_entries() which returns by value, after
index sharing is implemented.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e36979da47 sstables: index_reader: Use sstable's schema
Makes for a simpler interface.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e3e2f037bb sstables: index_reader: Refactor around the concept of a cursor
Index reader already can be queried only with monotonic positions, so
the concept of a cursor is ingrained. Making it explicit will make it easier
to define behavior for forwarding withing the partition.

After the change:

 - lower_bound() is renamed to advance_to() and doesn't return
   the position, only advances the cursor

 - data file position for partition under cursor can be obtained
   at any time with data_file_position()
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
27862fa8f6 sstables: index_reader: Narrow down summary range during lookup
Positions passed to lower_bound() must be non-decreasing, so summary
indexes as well.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
02ace99798 sstables: index_reader: Change lookup to work on ring_position_view
In preparation for changing the interface to work not only with ranges.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
cd295e9926 sstables: Avoid moving an sstable
In preparation for adding non-movable members.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5edb427873 sstables: Remove private constructor
To reduce duplication.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
705bd6da1a sstables: Remove unused method 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
a7301a702f tests: Add missing blocking on fast_forward_to() 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5fe14735e8 tests: dht: Test ring_position_comparator 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
ff6cca6e9e tests: Add utility for checking total orders 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
d4b6e430ed dht: Introduce ring_position_view 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
55a7cceef5 dht: Move comparison logic from ring_position::tri_compare() to ring_position_comparator
It will soon define common ordering for many objects, not just
ring_position.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
d5e704ca1e sstables: Make key_view constructor from bytes_view explicit 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
65a8920b25 dht: Make min/max tokens capturable by reference
So that they can be later used in views.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
f6d2a07422 config: Warn on use of [[deprecated]] instead of failing 2017-03-28 18:10:39 +02:00
Calle Wilund
b12b65db92 commitlog/replayer: Bugfix: minimum rp broken, and cl reader offset too
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as sstable at all, we must tag it as zero-rp, and
make whole shard for it start at same zero.

This is bad in itself, because it can cause data loss. It does not cause
crashing however. But it did uncover another, old old lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream from the file offset, then tried
to read the file header and magic number from there -> boom, error.

Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.

I.e. three fixes:
* Reinstate min position guarding for unencoutered CF:s
* Fix stream creating in CL reader
* Fix segment map iterator use.

v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>
2017-03-28 10:32:28 +02:00
Duarte Nunes
53014bd762 mutation_source_test: Ensure unique collection elements
Duplicate elements are illegal in collections, so we ensure they only
contain unique ones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170327161149.8938-4-duarte@scylladb.com>
2017-03-27 18:44:11 +02:00
Duarte Nunes
94d568924d mutation_source_test: Sort collection elements
Ref #1607

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170327161149.8938-3-duarte@scylladb.com>
2017-03-27 18:43:58 +02:00
Duarte Nunes
4963902922 mutation_source_test: Remove extra randomness source
This patch ensures we generate UUIDs using the same randomness source
as all the other values we randomly generator, so that we can get a
deterministic run from the seeds we print.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170327161149.8938-2-duarte@scylladb.com>
2017-03-27 18:43:44 +02:00
Takuya ASADA
b84828b487 dist/common/scripts/scylla_fstrim: don't abort the program when a disk doesn't support TRIM
Do not abort the program until run fstrim on all directories.

Fixes #2220

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1490595397-19130-1-git-send-email-syuu@scylladb.com>
2017-03-27 11:43:19 +03:00
Avi Kivity
27c42359bc Merge seastar upstream
* seastar 6b21197...2ebe842 (6):
  > Merge "Various improvements to execution stages" from Paweł
  > app-template: allow apps to specify a name for help message
  > bool_class: avoid initializing object of incomplete type
  > app-template: make sure we can still get help with required options
  > prometheus: Http handler that returns prometheus 0.4 protobuf or text format
  > Update DPDK to 17.02

Includes patch from Pawel to adjust to updated execution_stage interface.
2017-03-26 10:50:21 +03:00
Takuya ASADA
e7697c37b2 dist/common/scripts/scylla_setup: warn twice before constructing RAID volume
Since RAID construction has possibility to destroy user data, warn twice before
executing scylla_raid_setup.

Fixes #1346

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487891797-13109-1-git-send-email-syuu@scylladb.com>
2017-03-26 10:36:23 +03:00
Glauber Costa
f7f187a7f3 docker: do not touch yaml during startup
Users sometimes need to run their own yaml configuration files, and it
is currently annoying to deploy modified files on docker.

One possible solution is to bind mount the file into the docker
container using the -v switch, just like we already do for for the data
volume.

The problem with the aforementioned approach is that we have to change
the yaml file to insert the addresses, and that will change the file in
the host (or fail to happen, if we bind mount it read-only).

The solution I am proposing is to avoid touching the yaml file inside
the container altogether. Instead, we can deploy the address-related
arguments that we currently write to the yaml file as Scylla options.

Fixes #2113

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1490195141-19940-1-git-send-email-glauber@scylladb.com>
2017-03-23 16:52:40 +02:00
Pekka Enberg
c5b1908e03 dist/docker: Use stdout as logging output
If early startup fails in docker-entrypoint.py, the container does not
start. It's therefore not very helpful to log to a file _within_ the
container...

Message-Id: <1490275943-23590-1-git-send-email-penberg@scylladb.com>
2017-03-23 16:48:34 +02:00
Calle Wilund
c3a510a08d commitlog_replayer: Do proper const-loopup of min positions for shards
Fixes #2173

Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.

Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.

Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
2017-03-22 17:57:09 +02:00
Amnon Heiman
064f5e1b63 row_cache: switch to the metrics layer registration
This patch moves the row_cache metrics registration from collectd to the
metric layer.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170321143812.785-3-amnon@scylladb.com>
2017-03-21 16:42:58 +02:00
Amnon Heiman
a6a13865bf API: remove unneeded refrences to collectd
This patch removes left over references to the collectd from the API.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170321143812.785-2-amnon@scylladb.com>
2017-03-21 16:42:57 +02:00
Vlad Zolotarov
79978c156e transport::server: don't report a Tracing session ID unless requested
Don't report a Tracing session ID unless the current query had a Tracing bit in its
flags.

Although the current master's behaviour is legal it's suboptimal and some Clients are sensitive to that.
Let's fix that.

Fixes #2179

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490063752-8915-1-git-send-email-vladz@scylladb.com>
2017-03-21 13:57:08 +00:00
Vlad Zolotarov
9dd5b5762d dist: install seastar/scripts/perftune.py together with posix_net_conf.sh
posix_net_conf.sh is currently a wrapper for perftune.py script and
perftune.py has to be at the same directory as posix_net_conf.sh.


Fixes #2176.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1490020882-32361-1-git-send-email-vladz@scylladb.com>
2017-03-21 12:31:50 +02:00
Amnon Heiman
7572addfbf column_family: metrics should be register once
column_family constructor uses delegation, as such, only the actual
constructor implementation should contain a call to register the
metrics.
Current implementation ends up with re registration of the metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170320140817.22214-1-amnon@scylladb.com>
2017-03-21 11:20:29 +02:00
Pekka Enberg
85a127bc78 dist/docker: Expose Prometheus port by default
This patch exposes Scylla's Prometheus port by default. You can now use
the Scylla Monitoring project with the Docker image:

  https://github.com/scylladb/scylla-grafana-monitoring

To configure the IP addresses, use the 'docker inspect' command to
determine Scylla's IP address (assuming your running container is called
'some-scylla'):

  docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla

and then use that IP address in the prometheus/scylla_servers.yml
configuration file.

Fixes #1827

Message-Id: <1490008357-19627-1-git-send-email-penberg@scylladb.com>
2017-03-20 15:29:52 +02:00
Amos Kong
468df7dd5f scylla_setup: match '-p' option of lsblk with strict pattern
On Ubuntu 14.04, the lsblk doesn't have '-p' option, but
`scylla_setup` try to get block list by `lsblk -pnr` and
trigger error.

Current simple pattern will match all help content, it might
match wrong options.
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e -p
   -m, --perms          output info about permissions
   -P, --pairs          use key="value" output format

Let's use strict pattern to only match option at the head. Example:
  scylla-test@amos-ubuntu-1404:~$ lsblk --help | grep -e '^\s*-D'
   -D, --discard        print discard capabilities

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <4f0f318353a43664e27da8a66855f5831457f061.1489712867.git.amos@scylladb.com>
2017-03-20 08:10:35 +02:00
Raphael S. Carvalho
7deeffc953 database: serialize sstable cleanup
We're cleaning up sstables in parallel. That means cleanup may need
almost twice the disk space used by all sstables being cleaned up,
if almost all sstables need cleanup and every one will discard an
insignificant portion of its whole data.
Given that cleanup is frequently issued when node is running out of
disk space, we should serialize cleanups in every shard to decrease
the disk space requirement.

Fixes #192.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170317022911.10306-1-raphaelsc@scylladb.com>
2017-03-19 12:33:03 +02:00
Tomasz Grabiec
bb0ce5d8fe Merge "Ensure base and view schema versions match" from Duarte
The mapping between a base table update and a view update is schema
dependent, so we need to ensure the view schema versions match the
base schema version. For example, we match base columns to view
columns by name, so we need to ensure the base and view schemas we're
using for writting are isolated with respect to a previous alter
table statement.

We thus need to match base schema versions with view schema versions,
and we need to so atomically to ensure that when one fiber sees a
schema, it also sees the complete set of corresponding view schemas.
This series ensures the schemas modified as a result of an alter
table statement are published atomically, under the schema lock. This
way, all the schemas referenced by the database are consistent with
each other when they are observed by other fibers.

Finally, we upgrade the mutation schema before generating the view
updates, to ensure it matches the most recent view schemas the base
replica knows about, registered in the database.

The db::view::view class was replaced by a set of non-member
functions, with its state, which used to reflect only the most recent
schema version, being moved to a new view_info class.
2017-03-17 12:40:00 +01:00
Duarte Nunes
b27da688f9 mutation: Remove dead get_cell() function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170316234843.23130-1-duarte@scylladb.com>
2017-03-17 11:18:23 +02:00
Pekka Enberg
91b9e0d914 Update scylla-ami submodule
* dist/ami/files/scylla-ami eedd12f...407e8f3 (1):
  > scylla_create_devices: check block device is exists

Fixes #2171
2017-03-17 11:13:07 +02:00
Tomasz Grabiec
3609665b19 lsa: Fix debug-mode compilation error
By moving definitions of setters out of #ifdef
2017-03-16 18:23:05 +01:00
Tomasz Grabiec
88e7b3ff79 lsa: Ensure can_allocate_more_memory() always leaves a gap above seastar's min_free_memory()
One of the goals of can_allocate_more_memory() is to prevent depleting
seastar's free memory close to its minimum, leaving a head room above
that minimum so that standard allocations will not cause reclamation
immediately. Currently the function doesn't take into accoutn actual
threshold used by the seastar allocator, so there could be no gap or
even could go below the minimum.

Fix that by ensuring there's always a gap above min_free_memory().

min_gap was reduced to 1 MiB so that low memory setups are not
impacted significantly by the change.
Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>
2017-03-16 12:42:50 +00:00
Tomasz Grabiec
17ede24a77 Update seastar submodule
* seastar 4d25b85...6b21197 (3):
  > core: memory: Expose control of the free memory low water mark
  > scripts: add perftune.py
  > tutorial: make network examples work on multi-core
2017-03-16 13:32:45 +01:00
Pekka Enberg
3afd7f39b5 cql3: Wire up functions for floating-point types
Fixes #2168
Message-Id: <1489661748-13924-1-git-send-email-penberg@scylladb.com>
2017-03-16 11:04:59 +00:00
Avi Kivity
434a4fee28 Merge "tests: Use allocating_section in lsa_async_eviction_test" from Tomasz
"The test allocates objects in batches (allocation is always under a reclaim
lock) of ~3MiB and assumes that it will always succeed because if we cross the
low water mark for free memory (20MiB) in seastar, reclamation will be
performed between the batches, asynchronously.

Unfortunately that's prevented by can_allocate_more_memory(), which fails
segment allocation when we're below the low water mark. LSA currently doesn't
allow allocating below the low water mark.

The solution which is employed across the code base is to use allocating_section,
so use it here as well.

Exposed by recent consistent failures on branch-1.7."

* 'tgrabiec/fix-lsa-async-eviction-test' of github.com:cloudius-systems/seastar-dev:
  tests: lsa_async_eviction_test: Allocate objects under allocating section
  lsa: Allow adjusting reserves in allocating_section
2017-03-16 12:44:14 +02:00
Tomasz Grabiec
cefb6b604a tests: lsa_async_eviction_test: Allocate objects under allocating section 2017-03-16 10:21:10 +01:00
Tomasz Grabiec
4ab8b255da lsa: Allow adjusting reserves in allocating_section 2017-03-16 10:21:10 +01:00
Raphael S. Carvalho
6b6bb38f38 compaction_manager: stop manager after storage io error
Manager will stop itself if a compaction fails due to storage io
error, which unconditionally results in stop of transportation
services.

Fixes #2147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170316054538.23423-1-raphaelsc@scylladb.com>
2017-03-16 10:37:47 +02:00
Duarte Nunes
876a514743 database: Upgrade mutation to current schema to push view updates
This patch ensures we upgrade the mutation to the current schema when
generating and pushing view updates, so that the it matches the most
up to date views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 18:15:27 +01:00
Duarte Nunes
be12a2bf0a db/schema_tables: Atomically publish base and view changes
This patch ensures that the schema merging atomically publishes
schema changes. In particular, it ensures that when a base schema
and a subset of its views are modified together (i.e., upon an alter
table or alter type statement), then they are published together as
well, without any deferring in-between.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 16:35:07 +01:00
Duarte Nunes
e215f25b11 migration_manager: Atomically migrate table and views
This patch changes the migration path for table updates such that the
base table mutations are sent and applied atomically with the view
schema mutations.

This ensures that after schema merging, we have a consistent mapping
of base table versions to view table versions, which will be used in
later patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 16:03:56 +01:00
Duarte Nunes
bfb8a3c172 materialized views: Replace db::view::view class
The write path uses a base schema at a particular version, and we
want it to use the materialized views at the corresponding version.

To achieve this, we need to map the state currently in db::view::view
to a particular schema version, which this patch does by introducing
the view_info class to hold the state previously in db::view::view,
and by having a view schema directly point to it.

The changes in the patch are thus:

1) Introduce view_info to hold the extra view state;
2) Point to the view_info from the schema;
3) Make the functions in the now stateless db::view::view non-member;
4) Remove the db::view::view class.

All changes are structural and don't affect current behavior.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:50:05 +01:00
Duarte Nunes
a64c47f315 schema: Move raw_view_info outside of raw_schema
In preparation of an upcoming patch, where the schema
won't directly store the raw_view_info.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:38:31 +01:00
Duarte Nunes
4b209be8b8 view_info: Rename to raw_view_info
In preparation for upcoming patches, which will deal with
moving the state in db::view::view to view_info.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 15:38:31 +01:00
Paweł Dziepak
dc99197318 Merge "Correctly handle tombstoned collections" from Duarte
"The current implementations of collection_type_impl::is_empty() and
collection_type_impl::difference() don't handle tombstoned collection
mutations correctly. In particular:

- is_empty() considers a collection mutation with a tombstone (and no
  entries) as empty;
- difference() doesn't do set difference between the cells tombstones,
  and always returns the highests.

Fixes #2152"

* 'collection-diff/v4' of github.com:duarten/scylla:
  mutation_test: Add more test cases for difference()
  mutation_source_test: Randomly generate collection cells
  collection_type_impl: Use set difference for tombstones
  collection_type_impl: A mutation with a tombstone is not empty
2017-03-15 13:39:55 +00:00
Duarte Nunes
143136647a mutation_test: Add more test cases for difference()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 14:34:01 +01:00
Duarte Nunes
005e4741e3 mutation_source_test: Randomly generate collection cells
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 14:34:01 +01:00
Duarte Nunes
61741a69b6 collection_type_impl: Use set difference for tombstones
This patch fixes collection_type_impl::difference() so it does set
difference for tombstones instead of just returning the larger
one, as difference() is supposed to return only the information in
mutation A that supersedes that in B, given difference(A, B).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 14:34:01 +01:00
Duarte Nunes
19fcd2d140 collection_type_impl: A mutation with a tombstone is not empty
This patch changes the collection_type_impl::is_empty() function so
that it doesn't consider empty a collection_mutation which has a
tombstone.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-15 14:34:01 +01:00
Takuya ASADA
b65d58e90e dist/common/scripts/scylla_raid_setup: don't discard blocks at mkfs time
Discarding blocks on large RAID volume takes too much time, user may suspects
the script doesn't works correctly, so it's better to skip, do discard directly on each volume instead.

Fixes #1896

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1489533460-30127-1-git-send-email-syuu@scylladb.com>
2017-03-15 13:13:57 +02:00
Calle Wilund
078589c508 commitlog_replayer: Make replay parallel per shard
Fixes #2098

Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequential per shard.

v2:
* Fixed whitespace errors

Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
2017-03-15 13:07:17 +02:00
Amnon Heiman
0a2eba1b94 database: requests_blocked_memory metric should be unique
Metrics name should be unique per type.

requests_blocked_memory was registered twice, one as a gauge and one as
derived.

This is not allowed.

Fixes #2165

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314162826.25521-1-amnon@scylladb.com>
2017-03-14 19:36:45 +02:00
Avi Kivity
ed4b5f5a18 Merge seastar upstream
* seastar fd29fd0...4d25b85 (2):
  > core/file: fix EOF detection for file with custom impl
  > tutorial: fix echo server example

Includes patch from Raphael updating checked_file_impl:

"Now file_impl requires dma_read_bulk to be implemented, and for
checked_file_impl, it only's about calling dma_read_bulk from
the posix file it wraps."
2017-03-14 13:38:38 +02:00
Takuya ASADA
d016dd4b74 dist: schedule daily fstrim for data directory and commitlog directory
Schedule daily fstrim for data directory and commitlog directory, witch is
recommended by Scylla doc:
http://www.scylladb.com/doc/admin/#schedule-fstrim

Fixes #1347

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1489447472-2981-1-git-send-email-syuu@scylladb.com>
2017-03-14 11:51:53 +02:00
Amnon Heiman
295a981c61 storage_proxy: metrics should have unique name
Metrics should have their unique name. This patch changes
throttled_writes of the queu lenght to current_throttled_writes.

Without it, metrics will be reported twice under the same name, which
may cause errors in the prometheus server.

This could be related to scylladb/seastar#250

Fixes #2163.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170314081456.6392-1-amnon@scylladb.com>
2017-03-14 11:19:39 +02:00
Tomasz Grabiec
ed530dfb3a tests: sstables: Add test for skipping within a compressed stream
Refs #2143.
2017-03-13 13:08:24 +01:00
Tomasz Grabiec
1e0af2efc3 Update seastar submodule
* seastar 84a0b70...fd29fd0 (4):
  > Fix smp::submit_to() with function reference
  > execution_stage: add concept restraint for operator()
  > core/temporary_buffer: Add operator==()
  > map_reduce: allow reducer to take accumulated value by rref
2017-03-13 10:13:03 +01:00
Paweł Dziepak
60c6b9a240 Merge "Implement sstable_streamed_mutation::fast_forward_to()" from Tomasz
"This replaces use of a generic forwarding wrapper in sstable reader with
specialized implentation. Forwarding doesn't yet utilize indexes in this
series, only integrates it with mp_row_consumer, which is a prerequisite.

It's still an optimization, since mp_row_consumer will not try to consume
past the range as it used to.

Sending early for easier consumption."

* tag 'tgrabiec/forwarding-of-mp-row-consumer-v2' of github.com:scylladb/seastar-dev:
  sstables: Remove use of forwarding wrapper
  sstables: Implement sstable_streamed_mutation::fast_forward_to()
  sstables: Extract and use clustering_ranges_walker
  tests: sstables: Add test for handling of repeated tombstones
  sstables: Extract writer parameters into config objects
  tests: Move as_mutation_source() helper to header
  tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions
  streamed_mutation: Add streamed_mutation_returning() helper
  tests: mutation_source_test: Add test case for forwarding to a full range
  tests: simple_schema: Add fragment factories
  tests: Extract simple_schema
  sstables: Move workaround for out-of-order range tombstones to mp_row_consumer
  sstables: Drop default mp_row_consumer constructor
  sstables: Swap order of values in "proceed" so that "no" is assigned 0
  util/optimized_optional: Make printable
  position_in_partition: Add is_static_row() in the view
  range_tombstone_stream: Add reset()
  range_tombstone_stream: Add get_next(position_in_partition_view)
  sstables: streamed_mutation: Stop reading when end of slice reached
  sstables: Switch is_in_range() to position_in_partition
2017-03-10 13:55:46 +00:00
Tomasz Grabiec
1f1b516b31 sstables: Remove use of forwarding wrapper 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
d7afab21e7 sstables: Implement sstable_streamed_mutation::fast_forward_to()
Handling of forwarding is done inside mp_row_consumer, because it
allows us to filter out irrelevant data sooner and thus more
efficiently.

Becuase static row can be now skipped as well, _skip_clustering_row
was renamed to more generic _skip_in_progress.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
4750216387 sstables: Extract and use clustering_ranges_walker
Extracted from mp_row_consumer.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
88ccc99017 tests: sstables: Add test for handling of repeated tombstones 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
124dde30db sstables: Extract writer parameters into config objects
Also enables users to change the default promoted index block size.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
ad1e69c4c5 tests: Move as_mutation_source() helper to header 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
6f409d367b tests: Extract ensure_monotonic_positions() to streamed_mutation_assertions 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
dc7b93a326 streamed_mutation: Add streamed_mutation_returning() helper 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
06a964b3a0 tests: mutation_source_test: Add test case for forwarding to a full range 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
929842ad3f tests: simple_schema: Add fragment factories 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
d98f013b07 tests: Extract simple_schema 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
01374c41f2 sstables: Move workaround for out-of-order range tombstones to mp_row_consumer
This is a preliminary step before adding support for fast-forwarding
to mp_row_consumer, so that range handling can be solely in
mp_row_consumer rather than split between it and
sstable_streamed_mutation.

This also alleviates #2080 by reading all tombstones only up to the
first row, after that range tombstones are treated like other
fragments.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
d41a7c5eb4 sstables: Drop default mp_row_consumer constructor 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
56f1ad7841 sstables: Swap order of values in "proceed" so that "no" is assigned 0 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
58c29be45c util/optimized_optional: Make printable 2017-03-10 14:42:21 +01:00
Tomasz Grabiec
a32cf6c4cc position_in_partition: Add is_static_row() in the view 2017-03-10 14:42:21 +01:00
Tomasz Grabiec
e4db643730 range_tombstone_stream: Add reset() 2017-03-10 14:42:21 +01:00
Tomasz Grabiec
48ad2e2d64 range_tombstone_stream: Add get_next(position_in_partition_view) 2017-03-10 14:42:21 +01:00
Tomasz Grabiec
084747b1ee sstables: streamed_mutation: Stop reading when end of slice reached
As part of this change, skip detection detection is refactored. This
simplifies reasoning about mp_row_consumer's state a bit because now
is_mutation() is not reset externally and only depends on current
position of the reader.

It will prove useful when we extend mutation reader to decide if it
should skip to the next partition up front before calling
_context.read(), so that we can for instance skip using index instead.

Fixes #2088.
2017-03-10 14:42:19 +01:00
Duarte Nunes
16bcf8d085 db/schema_tables: Avoid copying keyspace name
This patch changes a lambda argument type so the keyspace name is
passed by reference instead of copying it, in
read_schema_for_keyspaces().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213134.10331-1-duarte@scylladb.com>
2017-03-10 11:03:56 +02:00
Duarte Nunes
d32c848d73 utils/logalloc: Change linkage of hist_options to external
Change linkage of segment_descriptor_hist_options to external to keep
good old GCC5 happy, despite C++11 allowing static linkage of non-type
template arguments.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213206.10383-1-duarte@scylladb.com>
2017-03-10 11:02:51 +02:00
Tomasz Grabiec
55358cacc5 sstables: Switch is_in_range() to position_in_partition
Makes it immune to #1446 and is a prerequisite for implementing
forwarding in mp_row_consumer.
2017-03-09 21:15:11 +01:00
Paweł Dziepak
aaae8db033 loggers should not have external linkage
Message-Id: <20170309111034.20929-1-pdziepak@scylladb.com>
2017-03-09 12:27:20 +01:00
Gleb Natapov
d34f3a0440 batchlog: introduce batch_size_fail_threshold_in_kb option
Add batch_size_fail_threshold_in_kb to prevent huge batch from been
applied and causing troubles. Also do not warn or fail if only one
partition is affected.

Fixes: #2128

Message-Id: <20170309111247.GE8197@scylladb.com>
2017-03-09 12:20:17 +01:00
Amnon Heiman
7b04841dda main: Name the http servers
In main there are two http servers that start, the API and prometheus.
This patch name them accordingly so their metrics will have more
meaning.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1489055282-10887-1-git-send-email-amnon@scylladb.com>
2017-03-09 12:30:49 +02:00
Glauber Costa
a7b0a899a3 dist: don't execute dpdk scripts if not in dpdk mode
The scripts are not liking very much being executed inside docker.
Since we don't really need those variables set outside DPDK scenarios,
just don't set them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488823691-9014-1-git-send-email-glauber@scylladb.com>
2017-03-09 11:40:08 +02:00
Avi Kivity
efd96a448c Merge "Add execution stages" from Paweł
"These patches introduce execution stages to Scylla in order to improve
icache friendliness. The places were stages are added are not chosen
very carefully and rather introduced at between different subsystems:
cql, storage proxy and database. This already results in a rather
significant improvement and can be tuned later if necessary.

Performance results:
perf_simple_query -c4 --duration 60
(medians)
          before       after      diff
write   83017.75   242876.04   +192.6%
read    61709.16   168258.26   +172.7%

The real life improvements aren't as good because it is much harder
to collect sufficiently high number of operations in a batch."

Additional benchmarking from Paweł:

"I did some tests on my local setup.

* Latency at light loads

Scylla running on 16 logical CPUs (8 cores) with 64 GB of RAM.
cassandra-stress -rate threads=32

write latency
        master      seda
median   1.2         0.6
95th     1.6         0.8
99th     1.7         0.9
99.9th   2.5         1.3
max     26.4        24.2

Flags '--poll-mode' and '--defragment-memory-on-idle false' didn't improve situation for master.

See also attached graph write_99.svg and write_999.svg.

read latency
        master      seda
median   0.8         0.6
95th     1.0         0.9
99th     1.1         1.0
99.9th   1.4         1.2
max     18.5        18.0

See also attached graph read_99.svg and read_999.svg.

* Server 100% loaded, dataset fitting in memory (throughput)

Scylla running on 2 cores with 64 GB of RAM.
4x scylla-bench with the uniform workload
(concurrency of each s-b: 512  for writes, 256 for reads).

There were no cache misses during reads.

          master        seda      diff
writes  107722.4   168482.26    +56.4%
reads   51049.48    76158.19    +49.2%

* Server 100% loaded, writes being flushed and compacted (throughput)

Scylla running on 2 cores with 4 GB of RAM.
4x scylla-bench with the uniform workload, concurrency 256 each.

          master        seda      diff
writes  79575.77   114206.11    +43.5%

See attached graph: writes_with_flushes_and_compaction.png (first run: master, second: seda)."

* tag 'pdziepak/scylla-execution-stages/v1-rebased' of github.com:cloudius-systems/seastar-dev:
  transport: make process_request_one() an execution stage
  mutation_query: add an execution stage
  db: make database::query() an execution stage
  db: make apply an execution stage
  storage_proxy: make mutate() an execution stage
  cql3: make batch statement an execution stage
  cql3: make modification statement an execution stage
  cql3: make select statement an execution stage
  mutation_reader: make mutation_source nothrow movable
2017-03-09 11:29:43 +02:00
Paweł Dziepak
74f35864ef transport: make process_request_one() an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
a78501c206 mutation_query: add an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
b5f0e590be db: make database::query() an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
38c1501f4d db: make apply an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
cfde2ad5b4 storage_proxy: make mutate() an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
827357cb08 cql3: make batch statement an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
dce785089a cql3: make modification statement an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
d005b20071 cql3: make select statement an execution stage 2017-03-09 09:27:43 +00:00
Paweł Dziepak
12135dbe21 mutation_reader: make mutation_source nothrow movable 2017-03-09 09:27:43 +00:00
Amnon Heiman
4e8d73098f main: Prometheus should start as early as possible
There is no need to wait when starting the prometheus server. As it is
up to each of the modules to register its metrics when it is ready.

This is especially important when debuging boot issues.

This patch moves the prometheus initilization to be done at an early
stage of the boot sequencec.

Fixes #2144

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1489041986-28974-1-git-send-email-amnon@scylladb.com>
2017-03-09 11:26:51 +02:00
Asias He
39d2e59e7e repair: Fix midpoint is not contained in the split range assertion in split_and_add
We have:

  auto halves = range.split(midpoint, dht::token_comparator());

We saw a case where midpoint == range.start, as a result, range.split
will assert becasue the range.start is marked non-inclusive, so the
midpoint doesn't appear to be contain()ed in the range - hence the
assertion failure.

Fixes #2148

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>
2017-03-09 09:09:17 +01:00
Avi Kivity
b8e4113dba Merge seastar upstream
* seastar 5861f99...84a0b70 (13):
  > build: don't error out on [[deprecated]] APIs
  > Merge "Introduce execution stages" from Paweł
  > Remove unused include statement
  > http: catch and count errors in read and respond
  > Merge "Adding metrics configuration" from Amnon
  > future: add concepts for map_reduce(), when_all_succeed()
  > doxygen: exclude c-ares directory
  > scripts/posix_net_conf.sh: add --use-cpu-mask option
  > file: take flush into account when calculating size for truncate in optimize_queue()
  > Fixing the prometheus cleanup patch
  > Merge "posix_net_conf.sh: better distribute ingress processing" from Vlad
  > prometheus: code clean up
  > future: relax finally() constraints even more
2017-03-08 20:02:05 +02:00
Tomasz Grabiec
abf8e83c8d gdb: Cast gdb.Values to int
Fails with newer GDB with:

  TypeError: %x format: an integer is required, not gdb.Value

Message-Id: <1488981412-22279-1-git-send-email-tgrabiec@scylladb.com>
2017-03-08 19:43:48 +02:00
Paweł Dziepak
6db6d25f66 Merge "Avoid loosing changes to keyspace parameters of system_auth and tracing keyspaces" form Tomek
"If a node is bootstrapped with auto_boostrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.

To prevent that, let's use minimum timestamp so that default schema
always looses with manual modifications. This is what Cassandra does.

Fixes #2129."

* tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev:
  db: Create default auth and tracing keyspaces using lowest timestamp
  migration_manager: Append actual keyspace mutations with schema notifications
2017-03-08 10:59:47 +00:00
Nadav Har'El
506e074ba4 sstable decompression: fix skip() to end of file
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.

Fixes #2143

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
2017-03-08 12:35:05 +02:00
Tomasz Grabiec
d6425e7646 db: Create default auth and tracing keyspaces using lowest timestamp
If the node is bootstrapped with auto_boostrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.

To prevent that, let's use minimum timestamp so that default schema
always looses with manual modifications. This is what Cassandra does.

Fixes #2129.
2017-03-07 19:19:15 +01:00
Tomasz Grabiec
06d4ad1bdd migration_manager: Append actual keyspace mutations with schema notifications
There is a workaround for notification race, which attaches keyspace
mutations to other schema changes in case the target node missed the
keyspace creation. Currently that generated keyspace mutations on the
spot instead of using the ones stored in schema tables. Those
mutations would have current timestamp, as if the keyspace has been
just modified. This is problematic because this may generate an
overwrite of keyspace parameters with newer timestamp but with stale
values, if the node is not up to date with keyspace metadata.

That's especially the case when booting up a node without enabling
auto_bootstrap. In such case the node will not wait for schema sync
before creating auth tables. Such table creation will attach
potentially out of date mutations for keyspace metadata, which may
overwrite changes made to keyspace paramteters made earlier in the
cluster.

Refs #2129.
2017-03-07 19:19:15 +01:00
Avi Kivity
1b5ba63676 sstable: fix unhandled exception in atomic_deletion_manager::delete_atomically()
The current code is assymetric: the first N-1 shards to delete a set receive
a synthetic future to wait on, while the last deletion receives the result
of the delete operation (which also broadcasts completion to the first N-1
operations.  This results, in case of an error, with the Nth future being
reported as an unhandled error.

Fix by making everything symmetric: all N callers receive a synthetic
future.  Nobody waits for the deletion operation (which still broadcasts its
completion to all waiters, so errors are not lost).

Message-Id: <20170305151607.14264-1-avi@scylladb.com>
2017-03-07 12:41:12 +02:00
Avi Kivity
439b38f5ab Merge "Improvements to counter implementation" from Paweł
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.

Performance was tested using:
perf-simple-query -c4 --counters --duration 60

The following results are medians.
          before       after      diff
write   18640.41    33156.81    +77.9%
read    58002.32    62733.93     +8.2%"

* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
  cell_locker: add metrics for lock acquisition
  storage_proxy: count counter updates for which the node was a leader
  storage_proxy: use counter-specific timeout for writes
  storage_proxy: transform counter timeouts to mutation_write_timeout_exception
  db: avoid allocations in do_apply_counter_update()
  tests/counters: add test for apply reversability
  counters: attempt to apply in place
  atomic_cell: add COUNTER_IN_PLACE_REVERT flag
  counters: add equality operators
  counters: implement decrement operators for shard_iterator
  counters: allow using both views and mutable_views
  atomic_cell: introduce atomic_cell_mutable_view
  managed_bytes: add cast to mutable_view
  bytes: add bytes_mutable_view
  utils: introduce mutable_view
  db: add more tracing events for counter writes
  db: propagate tracing state for counter writes
  tests/cell_locker: add test for timing out lock acquisition
  counter_cell_locker: allow setting timeouts
  db: propagate timeout for counter writes
  ...
2017-03-07 11:48:13 +02:00
Tomasz Grabiec
ecfa9e40de Merge 'duarte/lsa/hist-cleanup/v2' from github.com:duarten/scylla
histogram cleanups from Duarte.
2017-03-07 10:33:50 +01:00
Gleb Natapov
5c4158daac memtable: do not yield while holding reclaim_lock
Holding reclaim_lock while yielding may cause memory allocations to
fail.

Fixes #2139

Message-Id: <20170306153151.GA5902@scylladb.com>
2017-03-06 17:24:22 +01:00
Gleb Natapov
d7bdf16a16 memtable: do not open code logalloc::reclaim_lock use
logalloc::reclaim_lock prevents reclaim from running which may cause
regular allocation to fail although there is enough of free memory.
To solve that there is an allocation_section which acquire reclaim_lock
and if allocation fails it run reclaimer outside of a lock and retries
the allocation. The patch make use of allocation_section instead of
direct use of reclaim_lock in memtable code.

Fixes #2138.

Message-Id: <20170306160050.GC5902@scylladb.com>
2017-03-06 17:24:22 +01:00
Avi Kivity
1af9e3a5cb Merge "database: fix the 'nodetool clearsnapshot'" from Vlad
"Work on this series started with fixing the  'nodetool clearsnapshot'.
The current master code  ignores the snapshots in deleted keyspaces (issue #2045).

I noticed that in many places our code has to build the path to some directory/file
it simply had the sstring(<path1>) + "/" + sstring(<path2>) constructs which may cause us issues
if somebody decides to complile/run scylla on not-Unix-based OS, like Microsoft Windows.

I understand that this is a long shot but if we can make it right now - why not to.
The answer is boost::filesystem::path class - its synchronous parts, of course.

I decided to take an initiative and fix the issues above and then use the fixed code for
fixing the issue #2045:
   - Fix some minor issues in the existing code.
   - Extend the lister class and move it into the separate files outside database.cc.

On the way I've found an issue in the existing code (issue #2071).
This series fixes this one too (PATCH2)."
2017-03-06 16:45:31 +02:00
Glauber Costa
2d620a25fb raid script: improve test for mounted filesystem
The current test for whether or not the filesystem is mounted is weak
and will fail if multiple pieces of the hierarchy are mounted.

util-linux ships with a mountpoint command that does exactly that,
so we'll use that instead.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488742801-4907-1-git-send-email-glauber@scylladb.com>
2017-03-06 15:59:29 +02:00
Gleb Natapov
7f5923f510 storage_service: handle empty token list correctly
boost::split() return one empty string if called on an empty input.
Trying to cast an empty string to a token value results in a bad_lexical_cast
exception. Fix it by handling empty token list explicitly.

Message-Id: <20170302125405.GU11471@scylladb.com>
2017-03-06 15:31:33 +02:00
Takuya ASADA
6602221442 dist/redhat: enables discard on CentOS/RHEL RAID0
Since CentOS/RHEL raid module disables discard by default, we need enable it
again to use.

Fixes #2033

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488407037-4795-1-git-send-email-syuu@scylladb.com>
2017-03-06 12:21:42 +02:00
Duarte Nunes
ca4f5cabd4 lsa: Extract log_histogram class
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-04 14:47:19 +01:00
Avi Kivity
24d6560fbc Update scylla-ami submodule
* dist/ami/files/scylla-ami d5a4397...eedd12f (3):
  > Rewrite disk discovery to handle EBS and NVMEs.
  > add --developer-mode option
  > trivial cleanup: replace tab in indent
2017-03-04 13:29:32 +02:00
Duarte Nunes
5c73978b68 thrift/handler: Enable Aggregator concept with GCC6_CONCEPT
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170303172509.16844-1-duarte@scylladb.com>
2017-03-04 13:27:16 +02:00
Duarte Nunes
2b6abd5a91 lsa: Make log_histogram more generic
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-03 17:59:17 +01:00
Duarte Nunes
3819e6d55f lsa: log_histogram cleanups
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-03 17:09:07 +01:00
Tomasz Grabiec
22199abf50 gc_clock: Remove orphaned comment
Message-Id: <1488381379-8618-1-git-send-email-tgrabiec@scylladb.com>
2017-03-02 12:56:09 +02:00
Tomasz Grabiec
6a83fe5534 Merge 'pdziepak/optimise-commitlog-entry-writer/v1' from seastar-dev.git
From Paweł:

These patches optimise commitlog_entry_writer so that it avoids copying
column mapping, which is a particularly expensive operation.

perf_simple_query -c4 --write --duration 60
(medians)
          before       after      diff
write   79434.35    89247.54    +12.3%

Tested with:

  commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup
  commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table
  commitlog_test.py:TestCommitLog.test_commitlog_replay_with_counters
2017-03-02 11:37:42 +01:00
Paweł Dziepak
04b80272f2 cell_locker: add metrics for lock acquisition 2017-03-02 09:05:12 +00:00
Paweł Dziepak
00b42c477f storage_proxy: count counter updates for which the node was a leader 2017-03-02 09:05:12 +00:00
Paweł Dziepak
cf193f4b41 storage_proxy: use counter-specific timeout for writes 2017-03-02 09:05:12 +00:00
Paweł Dziepak
d177160f90 storage_proxy: transform counter timeouts to mutation_write_timeout_exception 2017-03-02 09:05:12 +00:00
Paweł Dziepak
f93a766db4 db: avoid allocations in do_apply_counter_update() 2017-03-02 09:05:12 +00:00
Paweł Dziepak
8457f407ef tests/counters: add test for apply reversability 2017-03-02 09:05:11 +00:00
Paweł Dziepak
3bccca67b9 counters: attempt to apply in place
It is expected that most counter updates just modify the values of
existing shards and can be done in place.
2017-03-02 09:05:11 +00:00
Paweł Dziepak
7604b926e1 atomic_cell: add COUNTER_IN_PLACE_REVERT flag
The general algorithm for merging counter cells involves allocating a
new buffer for the shards. However, it is expected that most of the
applies are just updating the values of existing shards and not adding
new ones, therefore can be done in place.
However, reverting the general and in-place applies requires different
logic, hence the need for an additional flag to differentiate between
them.
2017-03-02 09:05:11 +00:00
Paweł Dziepak
1e3fbddb3a counters: add equality operators 2017-03-02 09:05:11 +00:00
Paweł Dziepak
772c9078d0 counters: implement decrement operators for shard_iterator 2017-03-02 09:05:11 +00:00
Paweł Dziepak
edad5202f3 counters: allow using both views and mutable_views 2017-03-02 09:05:11 +00:00
Paweł Dziepak
2db92e92b2 atomic_cell: introduce atomic_cell_mutable_view 2017-03-02 09:05:11 +00:00
Paweł Dziepak
1293073019 managed_bytes: add cast to mutable_view 2017-03-02 09:05:11 +00:00
Paweł Dziepak
29430ba970 bytes: add bytes_mutable_view 2017-03-02 09:05:11 +00:00
Paweł Dziepak
0ed2352ade utils: introduce mutable_view
std::basic_string_view does not allow modifying the underlying buffer.
This patch introduces a mutable_view which permits that.
2017-03-02 09:05:10 +00:00
Paweł Dziepak
774241648d db: add more tracing events for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
277501f42f db: propagate tracing state for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
2b5c4386b5 tests/cell_locker: add test for timing out lock acquisition 2017-03-02 09:05:10 +00:00
Paweł Dziepak
5af780360f counter_cell_locker: allow setting timeouts 2017-03-02 09:05:10 +00:00
Paweł Dziepak
25173f8095 db: propagate timeout for counter writes 2017-03-02 09:05:10 +00:00
Paweł Dziepak
c122f3b2f8 cell_locker: use internal storage for hashtable 2017-03-02 09:05:10 +00:00
Paweł Dziepak
4702ebf80f counters: use c_c_builder::from_single_shard() when possible 2017-03-02 09:05:10 +00:00
Paweł Dziepak
13ec22ad9a counters: drop tombstone handling in transform update to shards
Encountering tombstones while transforming counter update from deltas to
shards is expected to be rare due to the fact that counter cells cannot
be recreated once removed.
This assumption makes it unnecessary to care much about removed cells
during delta->shard transformation as it adds complexity to the code and
is not required to produce correct results.
2017-03-02 09:05:10 +00:00
Paweł Dziepak
37485e5b29 counters: optimise counter_cell_builder
This patch attempts to avoid excessive allocations and copies when
constructing counter cells using counter_cell_builder. That involves
adding serializer interface to atomic_cell so that the counter cell can
be directly serialized to the buffer allocated for atomic cell.

counter_cell_builder::from_single_shard() is added as well to avoid
std::vector<> overhead when creating a counter cell from a single shard.
2017-03-02 09:05:10 +00:00
Glauber Costa
9e61a73654 setup: support mount points in raid script
By default behavior is kept the same. There are deployments in which we
would like to mount data and commitlog to different places - as much as
we have avoided this up until this moment.

One example is EC2, where users may want to have the commitlog mounted
in the SSD drives for faster writes but keep the data in larger, less
expensive and durable EBS volumes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1488258215-2592-1-git-send-email-glauber@scylladb.com>
2017-03-01 19:23:59 +02:00
Tomasz Grabiec
4b6e77e97e db: Fix overflow of gc_clock time point
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtraction of
gc_grace_seconds() from query_time will overflow.

I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in case of
to_data_query_result() uses, tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.

Fixes the following UBSAN report:

  /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'

Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
2017-03-01 18:49:56 +02:00
Paweł Dziepak
bdac487b5a do not use long_type for counter update 2017-03-01 16:33:37 +00:00
Paweł Dziepak
f25fa6566f db: avoid deserialization when applying counter mutation
In the later stages of counter write path a mutation is produced that
already has all cells transformed to counter shards and can be applied
to the memtable and written to the commitlog.
The current interface expectes a frozen mutation, which is suboptimal
for counters. The freeze itself is unaviodable -- it is required by
commitlog, but we can avoid later deserialization of frozen_mutation
when it is applied to the memtable if we pass the unfrozen mutation
along.
2017-03-01 16:33:37 +00:00
Paweł Dziepak
582d397c41 introduce counter_write_query()
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.

Standard mutation queries can are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of new, counter-specific kind of query.
2017-03-01 16:33:36 +00:00
Paweł Dziepak
426345e1d4 storage_proxy: avoid excessive mutation freezes 2017-03-01 16:33:36 +00:00
Paweł Dziepak
f10eb952d0 coordinator: do not apply counter write twice on leader 2017-03-01 16:33:36 +00:00
Paweł Dziepak
910bff297a to_string: add operator<< overload for std::array<> 2017-03-01 16:33:36 +00:00
Takuya ASADA
ba323e2074 dist/debian/dep: fix broken link of gcc-5, update it to 5.4.1-5
Since gcc-5/stretch=5.4.1-2 removed from apt repository, we nolonger able to
build gcc-5.

To avoid dead link, use launchpad.net archives instead of using apt-get source.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1488189378-5607-1-git-send-email-syuu@scylladb.com>
2017-03-01 17:13:14 +02:00
Tomasz Grabiec
0c84f00b16 query: Fix invalid initialization of _memory_tracker by moving-from-self
Fixes the following UBSAN warning:

  core/semaphore.hh:293:74: runtime error: reference binding to misaligned address 0x0000006c55d7 for type 'struct basic_semaphore', which requires 8 byte alignment

Since the field was not initialied properly, probably also fixes some
user-visible bug.
Message-Id: <1488368222-32009-1-git-send-email-tgrabiec@scylladb.com>
2017-03-01 11:38:28 +00:00
Duarte Nunes
c0e5964462 database: Explicitly use discard_result()
Values returned from the lambda passed to finally() are immediately
destroyed, so make that explicit by using discard_result().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235541.28330-1-duarte@scylladb.com>
2017-02-28 18:41:19 +02:00
Duarte Nunes
11b5076b3c lsa: Use log histogram for closed segments
This patch replaces the current heap with a logarithmic histogram
to hold the closed segment descriptors.

This histogram stores elements in different buckets according to
their size. Values are mapped to a sequence of power-of-two ranges
that are split in N sub-buckets. Values less than a minimum value
are placed in bucket 0, whereas values bigger than a maximum value
are not admitted.

There is some loss of precision as segments are now not totally
ordered, and precision decreases the more sparse a segment is. This
allows to reduce the cost of the computations needed when freeing
from a closed segment.

Performance results for perf_simple_query -c4 --duration 60
           before       after       diff
read     43954.27    45246.10      +2.9%
write    48911.54    52807.76      +7.9%

Fixes #1442

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235328.27937-1-duarte@scylladb.com>
2017-02-28 18:40:38 +02:00
Avi Kivity
359fc68283 Merge seastar upstream
* seastar 4d4a58d...5861f99 (9):
  > future: adjust finally constraint to allow any future to be returned from the continuation
  > build: allow specifying the C compiler
  > socket: Change signature (and impls) of socket shutdown to void
  > reactor: give names to OS threads
  > Concepts support
  > core/file: Fix short-read in read_maybe_eof()
  > core/fstream: Avoid issuing read requests beyond _remain
  > tests: Improve assertion failure message
  > reactor: Expose IO stats in a public API
2017-02-28 13:13:35 +02:00
Avi Kivity
c1aac6fa87 build: accept and pass seastar's --c-compiler option 2017-02-28 13:13:02 +02:00
Duarte Nunes
a3873423d6 configure.py: Enable concepts support
This patch enables conditional concept support by propagating
seastar's --enable-gcc6-concepts flag.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235028.27490-1-duarte@scylladb.com>
2017-02-28 11:56:22 +02:00
Paweł Dziepak
5d66031b7a sstable: make input_stream_history initializers in-class
sstable has two constructors but only one of them was creating input
stream history objects.
Message-Id: <20170227151734.16928-1-pdziepak@scylladb.com>
2017-02-28 09:22:11 +01:00
Paweł Dziepak
374c8a56ac commitlog: avoid copying column_mapping
It is safe to copy column_mapping accros shards. Such guarantee comes at
the cost of performance.

This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.

Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
          before       after      diff
write   79434.35    89247.54    +12.3%
2017-02-27 17:05:58 +00:00
Paweł Dziepak
4df4994b71 idl: fix generated writers when member functions are used
When using member name in an idetifer of generated class or method
idl compiler should strip the trailing '()'.
2017-02-27 17:05:58 +00:00
Paweł Dziepak
018d16d315 idl: add start_frame() overload for seastar::simple_output_stream 2017-02-27 17:05:58 +00:00
Paweł Dziepak
0198d8e470 Merge "Introduce streamed_mutation::fast_forward_to()" from Tomasz
"This introduces an API which allows forward navigation in a stream of mutation
fragments. It allows one to consume only a subset of the stream by iteratively
specifying sub-ranges from which fragments should be returned.

API outline:

  When in forwarding mode, the stream does not return all fragments right away,
  but only those belonging to the current range. Initially current range only
  covers the static row. The stream can be forwarded, even before reaching end-
  of-stream for current range, to a later range with fast_forward_to().
  Forwarding doesn't change initial restrictions of the stream, it can only be
  used to skip over data.

  Monotonicity of positions is preserved by forwarding. That is fragments
  emitted after forwarding will have greater positions than any fragments
  emitted before forwarding.

  For any range, all range tombstones relevant for that range which are present
  in the original stream will be emitted. Range tombstones emitted before
  forwarding which overlap with the new range are not necessarily re-emitted.

  When not in forwarding mode, the stream acts as if the current range was equal
  to the full range. This implies that fast_forward_to() cannot be
  used.

  Whether stream is in forwarding mode or not is specified when the stream
  is created, typically via mutation_source interface.

What's left for later series:

  Optimization by providing specialized implementations. This series implements
  forwarding support in all mutation sources via generic wrapper which simply
  drops fragments."

* tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev:
  tests: mutation_source_tests: Verify monotonicty of positions
  tests: random_mutation_generator: Spread the keys more
  tests: mutation_source_test: Make blobs more easily distinguishable
  tests: streamed_mutation: Test that merged stream passes mutation source tests
  tests: mutation_source_test: Add tests for forwarding of streamed_mutation
  tests: streamed_mutation_assertions: Add methods for navigating the stream
  tests: Add range generators to random_mutation_generator
  partition_slice_builder: Add with_ranges()
  query: Introduce full_clustering_range
  streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation()
  db: Enable creating forwardable readers via mutation_source
  mutation_source: Document liveness requirements
  mutation_source: Cleanup
  db: Replace virtual_reader_type with mutation_source_opt
  partition_version: Refactor make_partition_snapshot_reader() overloads
  database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
  memtable: Accept all mutation_source parameters
  streamed_mutation: Implement fast_forward_to() in stream merger
  streamed_mutation: Add generic implementation of forwardable streamed_mutation
  streamed_mutation: Add fast_forward_to() API
  position_in_partition: Introduce position_range
  position_in_partition: Introduce position constructor for right after the static row
  streamed_mutation: Make cast to view non-explicit
  streamed_mutation: Make schema() getter non-copying
2017-02-24 10:37:51 +00:00
Tomasz Grabiec
0798ea22c8 tests: mutation_source_tests: Verify monotonicty of positions 2017-02-23 18:50:54 +01:00
Tomasz Grabiec
d0421ba545 tests: random_mutation_generator: Spread the keys more
The deviation was very low so most ranges were very close. Spread them
to test more cases.
2017-02-23 18:50:54 +01:00
Tomasz Grabiec
27ff169b6b tests: mutation_source_test: Make blobs more easily distinguishable
It's easier to compare them if they differ only by a few most
significant bits, than by all bits.
2017-02-23 18:50:53 +01:00
Tomasz Grabiec
182e3f981b tests: streamed_mutation: Test that merged stream passes mutation source tests 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
122562c1cc tests: mutation_source_test: Add tests for forwarding of streamed_mutation 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
1d7e84f770 tests: streamed_mutation_assertions: Add methods for navigating the stream 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
f2feb54fb0 tests: Add range generators to random_mutation_generator 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
f56308597c partition_slice_builder: Add with_ranges() 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
0073df30aa query: Introduce full_clustering_range 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
cbf4601e31 streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation() 2017-02-23 18:50:53 +01:00
Tomasz Grabiec
892d4a2165 db: Enable creating forwardable readers via mutation_source
Right now all mutation source implementations will use
make_forwardable() wrapper.
2017-02-23 18:50:44 +01:00
Tomasz Grabiec
b1d1091906 mutation_source: Document liveness requirements 2017-02-23 18:23:52 +01:00
Tomasz Grabiec
15db80188b mutation_source: Cleanup
- combines telescopic overloads into one method with default paramters.
 - Introduce func_type for a full handler to avoid some duplication.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
586dbaa8d3 db: Replace virtual_reader_type with mutation_source_opt
Virtual reader is a mutation_source.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
acfad565f0 partition_version: Refactor make_partition_snapshot_reader() overloads
So that streamed_mutation is created in only one of the overloads and
others delegate to that one. Later there will be common logic added to
the construction and doing this will help avoid a duplication.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
f46ae8128d database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
It was using the state passed via as_mutation_source() instead. Let's
respect mutation_source contract instead, and use the state passed via
mutation_source invocation.

Technically just a cleanup. Alse prerequisite for more cleanup.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
2cc27f72ca memtable: Accept all mutation_source parameters 2017-02-23 18:23:52 +01:00
Tomasz Grabiec
53b1a257cc streamed_mutation: Implement fast_forward_to() in stream merger 2017-02-23 18:23:52 +01:00
Tomasz Grabiec
e0a7ed48b0 streamed_mutation: Add generic implementation of forwardable streamed_mutation
Generic but not very efficient wrapper which simply drops
fragments from the original stream.
2017-02-23 18:23:51 +01:00
Tomasz Grabiec
301cd4912b streamed_mutation: Add fast_forward_to() API 2017-02-23 18:23:28 +01:00
Gleb Natapov
2dc56013f8 commitlog: handle cycle() error
Do not ignore a future<> retuned by cycle() since it will produce a
warning in case of an error. Log it instead.

Message-Id: <20170219151811.GN11471@scylladb.com>
2017-02-22 19:15:14 +01:00
Calle Wilund
d5f57bd047 messaging_service: Move log printout to actual listen start
Fixes  #1845
Log printout was before we actually had evaluated endpoint
to create, thus never included SSL info.
Message-Id: <1487766738-27797-1-git-send-email-calle@scylladb.com>
2017-02-22 17:08:21 +01:00
Avi Kivity
9b113ffd3e config: enable new sharding algorithm for new deployments
Set murmur3_partitioner_ignore_msb_bits to 12 (enabling the new sharding
algorithm), but do this in scylla.yaml rather than the built-in defaults.
This avoids changing the configuration for existing clusters, as their
scylla.yaml file will not be updated during the upgrade.
Message-Id: <20170214123253.3933-1-avi@scylladb.com>
2017-02-22 11:23:12 +01:00
Calle Wilund
0a4edca756 counters/cql: allow wormholing actual counter values (with shards) via cql
Adds yet another magic function "SCYLLA_COUNTER_SHARD_LIST", indicating that
argument value, which must be a list of tuples <int, UUID, long, long>,
should be inserted as an actual counter value, not update.

This of course to allow counters to be read from sstable loader.

Note that we also need to allow timestamps for counter mutations,
as well as convince the counter code itself to treat the data as
already baked. So ugly wormhole galore.

v2:
* Changed flag names
* More explicit wormholing, bypassing normal counter path, to
  avoid read-before-write etc
* throw exceptions on unhandled shard types in marshalling
v3:
* Added counter id ordering check
* Added batch statement check for mixing normal and raw counter updates
Message-Id: <1487683665-23426-2-git-send-email-calle@scylladb.com>
2017-02-22 09:19:46 +00:00
Calle Wilund
0d87f3dd7d utils::UUID: operator< should behave as comparison of hex strings/bytes
I.e. need to be unsigned comparison.
Message-Id: <1487683665-23426-1-git-send-email-calle@scylladb.com>
2017-02-22 09:19:22 +00:00
Tomasz Grabiec
2b2d5c4c7a Update seastar submodule
* seastar 5088065...4d4a58d (3):
  > reactor utilization should return the utilization in 0-1 range
  > collectd should ignore type label in name creation
  > fix append_challenged_posix_file_impl::process_queue() to handle recursion
2017-02-22 09:40:25 +01:00
Calle Wilund
e20b804a65 commitlog/database: Add "release" method to ensure we free segments
On database stop, we do flush memtables and clean up commit log segment usage.
However, since we never actually destroy the distributed<database>, we
don't actually free the commitlog either, and thus never clear out
the remaining (clean) segments. Thus we leave perfectly clean segments
on disk.

This just adds a "release" method to commitlog, and calls it from
database::stop, after flushing CF:s.
Message-Id: <1485784950-17387-1-git-send-email-calle@scylladb.com>
2017-02-21 18:17:47 +01:00
Gleb Natapov
0977f4fdf8 sstable: close sstable_writer's file if writing of sstable fails.
Failing to close a file properly before destroying file's object causes
crashes.

[tgrabiec: fixed typo]

Message-Id: <20170221144858.GG11471@scylladb.com>
2017-02-21 18:17:47 +01:00
Tomasz Grabiec
8fd19a71ff position_in_partition: Introduce position_range 2017-02-21 16:49:36 +01:00
Tomasz Grabiec
78c563ea6a position_in_partition: Introduce position constructor for right after the static row 2017-02-21 16:43:09 +01:00
Tomasz Grabiec
ce58706b50 streamed_mutation: Make cast to view non-explicit 2017-02-21 16:43:09 +01:00
Paweł Dziepak
274bcd415a tests/cql_test_env: wait for storage service initialization
Message-Id: <20170221121130.14064-1-pdziepak@scylladb.com>
2017-02-21 17:05:45 +02:00
Paweł Dziepak
359c617821 db: restore call to check_valid_rp()
5a0955e89d "db: add operations for
applying counter updates" merged two column_family::apply() overloads
into do_apply() in order to reduce code duplication. Unfortunately,
a call to check_valid_rp() didn't survive that change.
Message-Id: <20170221133800.30411-1-pdziepak@scylladb.com>
2017-02-21 15:26:04 +01:00
Tomasz Grabiec
b4fd3a08e6 streamed_mutation: Make schema() getter non-copying 2017-02-21 14:18:57 +01:00
Duarte Nunes
65b21e3a99 schema_registry: Don't leak schemas
When loading a schema asynchronously, we're leaving a strong
reference to the loaded schema in the entry's shared future. This
patch fixed this by storing a shared_promised, which is reset when the
schema is loaded.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170220193654.17439-1-duarte@scylladb.com>
2017-02-21 09:56:21 +01:00
Tomasz Grabiec
33457cc9a9 sstables: Fix detection of repeated tombstones
The check was not catching range tombstone repeated immediately after
itself.
Message-Id: <1487596098-17409-1-git-send-email-tgrabiec@scylladb.com>
2017-02-20 15:35:15 +00:00
Tomasz Grabiec
cc439df542 Revert "sstables: Simplify sstable_streamed_mutation::read_next()"
This reverts commit 1e2c01ff49.

We do not detect repeated tombstone if it follows an in-range
tombstone following a skipped clustering row, because _in_progress
will be disengaged after such tombstone is emitted.
Message-Id: <1487596080-21480-1-git-send-email-tgrabiec@scylladb.com>
2017-02-20 15:34:58 +00:00
Vlad Zolotarov
978241d473 database: move lister class into separate files
Move lister class away from database.cc.
This is a preparation for moving it to the seastar library.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:40 -05:00
Vlad Zolotarov
34cafa71c3 database: make 'clearsnapshot' to delete the snapshots of deleted keyspaces if requested
The current implementation of 'nodetool clearsnapshot' command only deletes the snapshots
of the keyspaces that are alive at the time the command is issued (issue #2045).

This, besides not implementing the spec, prevents users from being able to clear
the disk space occupied by snapshots of deleted keyspaces that are no longer needed
(e.g. snapshots created when KS is deleted).

This patch fixes this issue by making the database::clear_snapshot() scan the data directories
looking for the snapshots to be deleted instead of relying on in-memory data structures.

This patch makes column_family::clear_snapshot() method not needed any more.

Fixes #2045

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:40 -05:00
Vlad Zolotarov
e1ee669aff database: lister: add the rmdir() static method
Removes the directory with all its contents (like 'rm -rf <dir name>' shell command).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:40 -05:00
Vlad Zolotarov
53532ba5ff database: lister: pass the parent path object to callbacks
Pass a parent directory boost::filesystem::path object
to the walker and filter callbacks.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:50:37 -05:00
Vlad Zolotarov
b4c970dfc6 database: lister: make the "filter" callback receive directory_entry instead of sstring
Filter should get all information that the caller has in hand that may be used for filtering.

directory_entry has the following information:
   - Type of the entry
   - Its name

For the code that used lister filters so far this would be enough, however it's not
hard to imagine a filter that may need the parent directory as well.

We will add the parent directory path in the follow up patches to make the interface
complete.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:46:59 -05:00
Vlad Zolotarov
6f9f0e1b3f database: lister: add "show_hidden" parameter
If show_hidden parameter is set to show_hidden::yes - list hidden entries, otherwise skip them.
By default set to show_hidden::no.

This patch also completely removes default parameters in lister::scan_dir() and replaces them
with a few lister::scan_dir() overloads that ensure that lambdas are always going to
be the last parameter in the parameters list.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:46:58 -05:00
Vlad Zolotarov
9aedb191f6 database: lister: if entries' types set is empty - list everything
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:46:58 -05:00
Vlad Zolotarov
cb614f9be4 database: lister::guarantee_type: handle the case when entry type may not be read
There is a possibility that the type of the given entry may not be available
that would manifest in the ENOENT or ENOTDIR value set in the errno
by the fstat() call for this entry. In this case engine().file_type() will return
a not engaged optional<directory_entry_type> value.

Return the future with the std::runtime_error exception in this case.
This will prevent any further usage of the not engaged optional value by the code
in the normal flow.

The exception is going to be propagated to the caller and it's the caller's responsibility
to handle it.

Fixes #2071

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-17 17:46:55 -05:00
Tomasz Grabiec
9f63e172fb tests: compaction_manager_test: Fix abort on exception
Message-Id: <1487343901-12745-1-git-send-email-tgrabiec@scylladb.com>
2017-02-17 15:53:55 +00:00
Avi Kivity
113ed9e963 Merge seastar upstream
* seastar 28a143a...5088065 (8):
  > configure.py: switch cmake to build c-ares to do out-of-source-tree build
  > iotune: make sure help is working
  > collectd: send double correctly for gauge
  > tls: make shutdown/close do "clean" handshake shutdown in background
  > tls: Make sink/source (i.e. streams) first class channel owners
  > native-stack: Make sink/source (i.e. streams) first class channel owners
  > posix-stack: Make sink/source (i.e. streams) first class channel owners
  > Merge "Detector for tasks blocked" from Glauber

Fixes #2085.

Packaging updated to require cmake, drop libtool and automake.
2017-02-16 19:34:28 +02:00
Vlad Zolotarov
25502149cf database: lister::scan_dir(): std::move() all that needs to be moved
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-02-16 11:56:44 -05:00
Raphael S. Carvalho
53d9008052 sstables/deletion_manager: kill dead code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2b24d9e622238030a737fbbe12b8439853d5d075.1487095059.git.raphaelsc@scylladb.com>
2017-02-16 18:38:54 +02:00
Vlad Zolotarov
f2e4629254 main.cc: expose scylla version as a gauge metrics
Add a new metric that exposes the current ScyllaDB version as a
gauge metrics.

The version is exposed as a label with the "version" key.

Fixes #1979

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1487083703-27929-1-git-send-email-vladz@scylladb.com>
2017-02-16 16:57:55 +02:00
Piotr Jastrzebski
2b8e340761 Replace deprecated BOOST_MESSAGE with BOOST_TEST_MESSAGE
BOOST Unit test deprecated BOOST_MESSAGE as early as 1.34 and had it
been perminently removed.  This patch replaces all uses of BOOST_MESSAGE
with BOOST_TEST_MESSAGE.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f1732018912a864cea229b0f7cd48170fd927dc2.1487238426.git.piotr@scylladb.com>
2017-02-16 10:49:03 +01:00
Avi Kivity
b8c4b35b57 Merge "Fixes for counter cell locking" from Paweł
"This series contains some fixes and a unit test for the logic responsible
for locking counter cells."

* 'pdziepak/cell-locking-fixes/v1' of github.com:cloudius-systems/seastar-dev:
  tests: add test for counter cell locker
  cell_locking: fix schema upgrades
  cell_locker: make locker non-movable
  cell_locking: allow to be included by anyone
2017-02-15 17:36:48 +02:00
Paweł Dziepak
f7f89df782 tests: add test for counter cell locker 2017-02-15 15:09:40 +00:00
Paweł Dziepak
2eb3e35815 cell_locking: fix schema upgrades
* cell_entry destructor may be called when the former is unlinked
 * update pointer to schema in partition_entry on schema upgrade
 * use correct bucket count when creating a new hash table
2017-02-15 15:09:40 +00:00
Paweł Dziepak
fa9c712263 cell_locker: make locker non-movable
locker keep iteratos to potentially internally stored data and moving
the object would invalidate them.
2017-02-15 13:48:47 +00:00
Paweł Dziepak
fc9145671d cell_locking: allow to be included by anyone 2017-02-15 13:48:47 +00:00
Tomasz Grabiec
9da078a18a tests: logalloc_test: Print debugging info and abort on failure
The test started to fail sporadically on jenkins after
7a00dd6985 due to quiesce() timing out. It's not
clear though if this is a regression because before the series such timeouts
would not cause test failure if the future resulves eventually, timeout was
only logged.

I was not able to reproduce it on my setup nor on jenkins, so let's add more
debugging output and trigger a coredump next time the test fails.

Message-Id: <1487089576-27147-1-git-send-email-tgrabiec@scylladb.com>
2017-02-15 14:41:49 +02:00
Takuya ASADA
9c8515eeed dist/redhat: stop backporting ninja-build from Fedora, install it from EPEL instead
ninja-build-1.6.0-2.fc23.src.rpm on fedora web site deleted for some
reason, but there is ninja-build-1.7.2-2 on EPEL, so we don't need to
backport from Fedora anymore.

Fixes #2087

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1487155729-13257-1-git-send-email-syuu@scylladb.com>
2017-02-15 12:58:00 +02:00
Paweł Dziepak
a5476b4e7d Merge "Emit only range tombstones relevant for query restrictions" from Tomasz
"Immediate reason to do this is to ensure that forwarding of streamed_mutation
will give the same mutations as slicing would, and have unit tests which
verify that those two access methods are consistent with each other.

Secondary reason is performance, to avoid processing unnecessary data.

Note that this should not cause digest mismatch of data queries during rolling
upgrade, because data queries are checksumming only tombstones affecting rows
in the results, so only relevant tombstones.

Fixes #1254."

* tag 'tgrabiec/only-relevant-range-tombstones-v2' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test that slicing returns only relevant range tombstones
  tests: Pass all mutation source parameters
  tests: mutation_source_tests: Ensure timestamps are strictly monotonic
  tests: streamed_mutation_assertions: Add more expectation methods
  tests: streamed_mutation_assertions: Make produces_end_of_stream() give better error messages
  sstables: Simplify sstable_streamed_mutation::read_next()
  sstables: Emit only relevant range tombstones
  range_tombstone: Introduce end_position()
  position_in_partition: Print position when printing fragment
  position_in_partition: Make printable
  position_in_partition: Add cast to view
  position_in_partition: Generalize from-bound_view constructor
  bound_view: Extract converters for range start and end bounds
  mutation_partition: Drop unneeded range tombstones
  mutation_partition: Simplify row removal
  range_tombstone_list: Introduce erase()
  partition_snapshot_reader: Emit only relevant tombstones
  range_tombstone_stream: Add slicing apply() overload
  range_tombstone_list: Introduce slice()
2017-02-14 11:18:51 +00:00
Tomasz Grabiec
7ec8c4cf54 tests: mutation_source_test: Test that slicing returns only relevant range tombstones 2017-02-13 20:52:50 +01:00
Tomasz Grabiec
2b8bd10dca tests: Pass all mutation source parameters 2017-02-13 20:52:49 +01:00
Tomasz Grabiec
25dffef6ae tests: mutation_source_tests: Ensure timestamps are strictly monotonic 2017-02-13 16:19:32 +01:00
Tomasz Grabiec
e6a95fd8cc tests: streamed_mutation_assertions: Add more expectation methods 2017-02-13 16:19:32 +01:00
Tomasz Grabiec
62843175ea tests: streamed_mutation_assertions: Make produces_end_of_stream() give better error messages 2017-02-13 16:19:32 +01:00
Tomasz Grabiec
1e2c01ff49 sstables: Simplify sstable_streamed_mutation::read_next()
mp_row_consumer doesn't split row fragments on repeated range
tombstones any more.
2017-02-13 16:12:16 +01:00
Tomasz Grabiec
6324876f24 sstables: Emit only relevant range tombstones 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
72d74b7b40 range_tombstone: Introduce end_position() 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
79d982cd86 position_in_partition: Print position when printing fragment 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
f2e1f2938b position_in_partition: Make printable 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
fb67dab548 position_in_partition: Add cast to view 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
9ddb5a5173 position_in_partition: Generalize from-bound_view constructor
We will need to create positions corresponding to general range
bounds, not only corresponding to range tombstones.
2017-02-13 16:12:16 +01:00
Tomasz Grabiec
69911a87d3 bound_view: Extract converters for range start and end bounds 2017-02-13 16:12:16 +01:00
Tomasz Grabiec
2489a0f82e mutation_partition: Drop unneeded range tombstones
Fixes #1254.
2017-02-13 16:12:16 +01:00
Tomasz Grabiec
884858078a mutation_partition: Simplify row removal 2017-02-13 16:12:15 +01:00
Tomasz Grabiec
fb42366552 range_tombstone_list: Introduce erase() 2017-02-13 16:12:15 +01:00
Tomasz Grabiec
fcf3391785 partition_snapshot_reader: Emit only relevant tombstones
Refs #1254.
2017-02-13 16:12:15 +01:00
Tomasz Grabiec
440e50b76a range_tombstone_stream: Add slicing apply() overload 2017-02-13 16:12:15 +01:00
Tomasz Grabiec
8b7f93175c range_tombstone_list: Introduce slice() 2017-02-13 16:12:15 +01:00
Paweł Dziepak
8b1d34f39d mutation_partition_serializer: avoid creating atomic_cell object
write_{live, counter, expiring, dead}_cell() take a const reference to
an atomic cell as argument. However, their caller (which is
write_row_cells) passes to them an atomic_cell_view.
There is an appropriate implicit constructor so instead of compiler
complaints we get atomic_cell objects being constructed from views which
involves an allocation and a copy.
Message-Id: <20170213100106.9071-1-pdziepak@scylladb.com>
2017-02-13 11:23:23 +01:00
Avi Kivity
bcf34b9a58 Merge seastar upstream
* seastar 83a41c8...28a143a (5):
  > prometheus: send one MetricFamily per unique metric name
  > tests: Add test for circular_buffer::erase()
  > circular_buffer: Introduce erase()
  > protect against infinite do_until loop
  > metrics: alternative metrics creation with labels
2017-02-12 21:56:11 +02:00
Gleb Natapov
bb72425b61 storage_proxy: fix send_to_endpoint() to use correct create_write_response_handler() overload
There are several problems with storage_proxy::send_to_endpoint right
now. It uses create_write_response_handler() overload that is specific
to read repair which is suboptimal and creates incorrect logs, it does
not process errors and it does not hold storage_proxy object until write
is complete. The patch fixes all of the problems.

Message-Id: <20170208101949.GA19474@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2017-02-12 10:46:13 +02:00
Tomasz Grabiec
c70ebc7ca5 lsa: Make reclaim_timer enclose segment_pool::reclaim_segments()
LSA timing did not include segment migration. It does after this
change.
Message-Id: <1486657046-9378-1-git-send-email-tgrabiec@scylladb.com>
2017-02-09 17:07:59 +00:00
Avi Kivity
5f15388e7a Merge "Size-based buffering of mutation_fragments" from Paweł
"This series changes buffering of mutation fragments in streamed mutations
so that the size of the fragments is taken into account.
The original implementation buffered up to 16 fragments which was pretty
much meaningless since it could be far too much if the fargments were
large or not nearly enough in case they were small

Fixes #2036.."

* 'pdziepak/buffer-mfs-by-size/v1' of github.com:cloudius-systems/seastar-dev:
  streamed_mutation: size-based mutation_fragment buffer limit
  mutation_fragment: cache size in memory
  mutation_fragment: make write access more explicit
2017-02-09 16:42:42 +02:00
Paweł Dziepak
3079b1661e streamed_mutation: size-based mutation_fragment buffer limit
Currently, streamed mutations buffer up to 16 mutation fragments. This
may be too much, not enough or a perfect choice depending on the
mutation fragment size.

This patch makes streamed mutation choose how much mutation fragments to
keep in the buffer depending on their size, so that we avoid using too
much memory in case of large mutation fragments and are able to buffer a
lot of fragments if they are small.
2017-02-09 10:51:11 +00:00
Paweł Dziepak
cd0dc7734a mutation_fragment: cache size in memory 2017-02-09 10:50:51 +00:00
Paweł Dziepak
354ce0b2c7 mutation_fragment: make write access more explicit
mutation_fragments are going to be caching their size in memory. In
order to be able to invalidate that correctly, they need to know when
that size may change (but avoid invalidation when it is not necessary).
2017-02-09 10:49:46 +00:00
Avi Kivity
9530bac2d6 Merge "Adding metrics using histogram and labels" from Amnon
"This series uses the newly added histogram and label support to add metrics to
the storage_proxy and to the column_family.

This would add latency and histogram and the missing metrics from column family."

* 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev:
  database: add metrics registration for the coloumn family
  storage_proxy: add read and write latency histogram
  estimated_histogram: returns a metrics histogram
2017-02-09 11:40:57 +02:00
Avi Kivity
9e4ae0763d Merge "Disallow mixed schemas" fro Paweł
"This series makes sure that schemas containing both counter and non-counter
regular or static columns are not allowed."

* 'pdziepak/disallow-mixed-schemas/v1' of github.com:cloudius-systems/seastar-dev:
  schema: verify that there are no both counter and non-counter columns
  test/mutation_source: specify whether to generate counter mutations
  tests/canonical_mutation: don't try to upgrade incompatible schemas
2017-02-07 18:03:28 +02:00
Paweł Dziepak
4cbbbc67f0 schema: verify that there are no both counter and non-counter columns 2017-02-07 15:17:14 +00:00
Paweł Dziepak
4ffe0401ee test/mutation_source: specify whether to generate counter mutations
Tests using random mutation generator should be provided with bot
counter and non-counter mutations to ensure that both cases are
sufficiently covered. However, mixed schemas (with both counter and
non-counter columns) are not allowed so the RMG has to be explicitly
told whether to use counter or non-counter schema.
2017-02-07 15:17:14 +00:00
Paweł Dziepak
294bf0bb7a tests/canonical_mutation: don't try to upgrade incompatible schemas
Test case test_reading_with_different_schemas uses randomly generated
pairs of mutations and tries to upgrade one to the schema of the other.

However, there are cases when one schema cannot be upgraded to another,
for example, counter and non-counter schemas.
2017-02-07 15:17:14 +00:00
Amnon Heiman
292c08f598 database: add metrics registration for the coloumn family
This patch adds a metrics registration to the column_family.

Using label each column metrics is label with its keyspace and column
family name.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-02-06 18:27:01 +02:00
Amnon Heiman
2cf13c26e2 storage_proxy: add read and write latency histogram
Register the read and write latency histogram on the metrics layer.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-02-06 17:54:47 +02:00
Amnon Heiman
1e3cfe7396 estimated_histogram: returns a metrics histogram
The metrics histogram is a struct that describe a histogram.
This patch adds a getter method that lets the estimated_histogram return
a metrics::histogram, this will allow to register it as a histogram
metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2017-02-06 17:34:43 +02:00
Avi Kivity
afe98f8572 Merge "Materialized views: support new entries" from Duarte
"This patchset adds bits of the MV write-path, enough to support
new entries to be added. Note that this is still limited, as only
adding new rows to a base table will work correctly."

* 'materialized-views/insert-path/v4' of https://github.com/duarten/scylla: (30 commits)
  database: Apply mutation to views
  column_family: Push view replica update
  materialized views: partial mutate_MV
  materialized views: function to send a mutation to endpoint
  materialized views: add VIEW write type
  database: Ensure new write_type is correctly printed
  materialized views: match base and view replicas
  column_family: Generate view updates
  column_family: Adds affected_views() function
  view: Add view_update_builder class
  range_tombstone_accumulator: Expose current tombstone
  range_tombstone_accumulator: apply() takes value
  view_updates: Generate updates
  view_updates: Adds function to replace row
  view_updates: Update view entry
  view_updates: Delete old view entry
  mutation_partition: Introduce shadowable tombstone
  view_updates: Create view entry
  view_updates: Compute row marker
  view: Introduce view_updates class
  ...
2017-02-06 15:10:38 +02:00
Duarte Nunes
0eca6301d3 database: Apply mutation to views
This patch changes the database apply path so that it also
generates the mutations for the column family's views and
sends them to the paired view replicas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:37:33 +01:00
Duarte Nunes
4777172348 column_family: Push view replica update
This patch adds a function to push updates to the view replicas of a
particular base table.
2017-02-06 13:36:45 +01:00
Nadav Har'El
3ae73164a4 materialized views: partial mutate_MV
This adds a function mutate_MV() which takes view mutations and sends
them to the appropriate nodes (this may be the current node, or a
remote node).

This is only a partial implementation - we still don't do the local
batch log (to survive reboots and failures) and some other stuff which
is left commented out.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Nadav Har'El
f2fd81ece0 materialized views: function to send a mutation to endpoint
Add a function for sending one mutation to one remote replica owning
this mutation. This is needed for materialized views, where each
base replica sends each view mutation to one particular view replica.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2017-02-06 13:36:45 +01:00
Nadav Har'El
92fc7386f6 materialized views: add VIEW write type
This adds to the "write_type" enum also the "VIEW" write type.
To be honest, I don't understand why the "write_type" distinction
is important.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
11bd3bd29f database: Ensure new write_type is correctly printed
By removing the default case in the switch statement over a write_type
variable, we ensure the compiler warns us about lack of exhaustiveness
in case we add a value to the enum but forget to change the
corresponding operator<<().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Nadav Har'El
365df8f900 materialized views: match base and view replicas
A function to find the appropriate replica to send a view update to.

This patch creates a new source file db/view/view.cc. We should
eventually move a lot more of the materialized views code there.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
16206e9f15 column_family: Generate view updates
This patch adds the generate_view_updates() function to the
column_family class, which will use the view_update_builder to
generate updates to the column_family's materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
90cb35db04 column_family: Adds affected_views() function
This patch the affected_views() to determine the column family's
views a given update affects.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
d5a61a8c48 view: Add view_update_builder class
This patch adds the view_update_builder class, which is responsible
for calculating the mutations to apply to a column family's
materialized views, given a streamed_mutation representing an update
to the base table and a streamed_mutation representing the
pre-existing rows which the update covers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
2ab9ba995a range_tombstone_accumulator: Expose current tombstone
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
f3c5ea392a range_tombstone_accumulator: apply() takes value
range_tombstone_accumulator::apply() now takes a value so the caller
can decide whether to move or copy the argument.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
3991a58f08 view_updates: Generate updates
This patch adds the view_updates::generate_update() function to
generate view updates given a base row update and the corresponding,
pre-existing row. This function will decide which of the previously
introduced functions to call based on whether there is a pre-existing
row and whether there exists a regular base column that's part of the
view's PK.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
861d2dfb61 view_updates: Adds function to replace row
This patch adds a function to replace a view row given a base
table update and the pre-existing row, which simply deletes the old
view entry and adds a new one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
7901ce7de4 view_updates: Update view entry
This patch introduces the view_updates::update_entry function,
which creates the updates to apply to the existing view entry given
the base table row before and after the update.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
b34ae6d6da view_updates: Delete old view entry
This patch introduces the view_updates::delete_old_entry function,
which creates a view row mutation to delete an entry given an updated
base table row.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
7e150a18eb mutation_partition: Introduce shadowable tombstone
This patch introduces shadowable row tombstones. A shadowable row
tombstone is valid only if the row has no live marker. In other words,
the row tombstone is only valid as long as no newer insert is done
(thus setting a live row marker; note that if the row timestamp set
is lower than the tombstone's, then the tombstone remains in effect
as usual).

If a row has a shadowable tombstone with timestamp Ti and that row
is updated with a timestamp Tj, such that Tj > Ti (and that update
sets the row marker), then the shadowable tombstone is shadowed by
that update. A concrete consequence is that if the update has cells
with timestamp lower than Ti, then those cells are preserved (since
the deletion is removed), and this is contrary to a regular,
non-shadowable row tombstone where the tombstone is preserved and
such cells are removed.

Currently, only Materialized Views require shadowable row tombstones,
which solve a problem with view row deletions. Consider a base row with
columns p, v1, v2, PRIMARY KEY (p) denormalized into a view row consisting
of columns p, v1, v2 PRIMARY KEY (p, v1), and the following operations:

1) INSERT INTO base (p, v1, v2) VALUES (0, 0, 1) USING TIMESTAMP 0;
2) UPDATE base SET v1 = 1 USING TIMESTAMP 1 WHERE p = 0;
3) UPDATE base SET v1 = 0 USING TIMESTAMP 2 WHERE p = 0;

Without shadowable tombstones, the view contains:

At 1), pk = (0, 0), row_marker@T0, v2=1@T0
At 2), pk = (0, 0), row_marker@T0, row_tombstone@T1, v2=1@T0
       pk = (0, 1), row_marker@T1, v2=1@T0
At 3), pk = (0, 0), row_marker@T2, row_tombstone@T1, v2=1@T0
       pk = (0, 1), row_marker@T1, row_tombstone@T2, v2=1@T0

Notice how, if we read row (0, 0), the value of v2 will be shadowed by
the row tombstone we previously inserted.

With a view's row tombstone becoming shadowable, at 3) the row (0, 0)
will look like pk = (0, 0), row_marker@T2, shadowable_tombstone@T1, v2=1@T0,
which is equivalent to pk = (0, 0), row_marker@T2, v2=1@T0.

Since the shadowable tombstone is shadowed by the new row marker (T0 <
T2), now v2 would be taken into account.

Finally, note that this patch doesn't generalize the idea of
shadowable tombstone, instead taking advantage of the fact that they
are only needed by Materialized Views. This saves changing the
tombstone representation to account for an extra flag, the bits such
representation would require, and also avoids changes to the storage
format.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Duarte Nunes
e0f642180f view_updates: Create view entry
This patch introduces the view_updates::create_entry function, which
creates a view row mutation given a new base table row.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:31 +01:00
Duarte Nunes
b8b8a8099c view_updates: Compute row marker
This patch adds a function to compute the row marker of a view row
given the base row. There are two cases to consider when building the
row marker: 1) there is a column C that is a regular base column but
is in the view PK; and 2) the columns for the base and the view PKs
are the same.

For 1), the view row marker timestamp will be the biggest between the
base's row marker and C. The TTL will be that of C. This means that if
C expires, the view row maker will expire as well (and the row, if no
other column is keeping it alive). Note that if the base row marker
expires but not C, then the base row will still be live due to C and
we shouldn't expire the view row.

For 2), the view row timestamp will be the same as the base row
timestamp. The TTL should be set in such a way that both base and view
rows live for the same time. We thus set the view row TTL to be the
max of any other TTL in the base row. This is particularly important
in the case where the base row marker has a TTL, but a column *absent*
from the view holds a greater one.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:31 +01:00
Duarte Nunes
7321938bcf view: Introduce view_updates class
This patch introduces the view_updates class, which is responsible
for generating and storing updates to a particular materialized view.

The updates will be generated from an updated base row and the
pre-existing one (if any), in later patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:31 +01:00
Duarte Nunes
0f8dbc9243 collection_type_impl: Iterate over collection cells
This patch introduces the collection_type_impl::for_each_cell()
function, which allows the caller to iterate over the cells of a
particular collection_mutation_view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:31 +01:00
Duarte Nunes
082ef56df1 view: Store pk view column that's non-pk in the base
To help calculate the view mutations from a base update, we store in
the view class the column that's part of the view's primary key but
not part of the base's, if such column exists.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
734ad80390 view: Add matches_view_filter() function
This patch adds the matches_view_filter() function which specifies
whether a given base row matches the view filter. Unlike
may_be_affected_by(), this function has no false positives.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
7be0f319d4 single_column_restriction: Filter clustering rows
This patch adds the is_satisfied_by() function to
single_column_restriction, which given a clustering row returns
whether the restrictions applies or not.

This is useful for secondary indexing such as materialized views,
where filters on regular columns precisely select which base table
rows to denormalize.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
3b52440ff3 statement_restrictions: Expose non-pk restrictions
This patch exposes the non-primary key column restrictions in a given
select statement, exposing them as single_column_restrictions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
e987d87ab1 collection_type_impl: Identify concrete types
This patch adds the is_set() and is_list() functions to
collection_type_impl, which identify the concrete collection
type.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
71faa4a4eb abstract_restriction: Rename uses_function()
This patch renames abstract_restriction::uses_function() to
term_uses_function(), as it was previously hiding a function with the
same name in the restriction base class.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
21d1bbb527 view: Add may_be_affected_by function
This patch adds the may_be_affected_by() function to the view class,
which is responsible to determine whether an update to a base class
affects one of its views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
c35d14e285 column_family: Store a pointer to view
Instead of storing the view in the column_family's map of materialized
views, store a lw_shared_ptr so that the view can be removed while it
is being updated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Duarte Nunes
69171c28f0 cql3/util: Fix use-after-free
This patch fixes a use-after-free error in
rename_column_in_where_clause(), where we were creating a boost
adaptor on an rvalue.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:35:30 +01:00
Avi Kivity
3896c27e5f Merge "DNS use in scylla" from Calle
"Fixes #1531

Adds lookup to gms::inet_address and uses it in (hopefully all) the
salient places where configured symbolic names are interpreted.

Removes the dummy dns modula in scylla in favour of the seastar one."

* 'calle/use-dns' of github.com:cloudius-systems/seastar-dev:
  remove scylla dns code
  service::storage_service: Remove depedency on scylla dns
  main.cc: remove scylla dns dependency
  main/init: Lookup inet addresses from config by dns lookup
  db::system_keyspace: Find rpc_address by lookup
  gms::inet_address: Add lookup functionality.
  scylla tls: Add option support for client auth and tls opts
2017-02-06 13:50:42 +02:00
Avi Kivity
da8d00199e Merge 2017-02-06 13:43:07 +02:00
Avi Kivity
fdfabbf8bb Merge seastar upstream
* seastar f07f8ed...83a41c8 (8):
  > Cleaning the metrics API
  > tutorial: pick the name "asynchronous function".
  > tutorial: explain the difference between exception and exception future
  > tutorial: abstract
  > ninja: don't bother building c-ares shared libraries
  > ninja: unbreak build ordering
  > ninja: unbreak "ninja -t clean"
  > Add libtool to dependencies
2017-02-06 13:42:38 +02:00
Calle Wilund
44503f8253 remove scylla dns code
Use seastar facilities instead.
2017-02-06 11:36:57 +00:00
Calle Wilund
ab800c225a service::storage_service: Remove depedency on scylla dns
Use seastar facilities instead
2017-02-06 11:36:57 +00:00
Calle Wilund
c4c4eb06c4 main.cc: remove scylla dns dependency
Use seastar facilities instead.
2017-02-06 11:36:57 +00:00
Avi Kivity
b18e54307f tests: add --operations-per-shard option to perf_simple_query
This helps achieve more repeatable runs that can then be compared via the
Linux perf tool.  The option overrides duration-based testing and runs the
test for a specific number of iterations.
Message-Id: <20170204172937.8462-1-avi@scylladb.com>
2017-02-06 12:08:04 +01:00
Gleb Natapov
3c372525ed storage_proxy: use storage_proxy clock instead of explicit lowres_clock
Merge commit 45b6070832 used butchered version of storage_proxy
patch to adjust to rpc timer change instead the one I've sent. This
patch fixes the differences.

Message-Id: <20170206095237.GA7691@scylladb.com>
2017-02-06 12:51:36 +02:00
Calle Wilund
feffc2bbe1 main/init: Lookup inet addresses from config by dns lookup
I.e. allow symbolic names in addition to ip addresses.
2017-02-06 09:45:37 +00:00
Calle Wilund
ef26ab0e1b db::system_keyspace: Find rpc_address by lookup 2017-02-06 09:45:37 +00:00
Calle Wilund
0a740b5ccf gms::inet_address: Add lookup functionality.
To find addresses by name.
2017-02-06 09:45:37 +00:00
Calle Wilund
ff8f82f21c scylla tls: Add option support for client auth and tls opts
Refs #1813 (fixes scylla part)

Added require_client_auth and priority_string options to
server_encryption_options/client_encryption_options an process them.

Allows TLS method/algo specification. Also enabled enforcing known cert
authentication for both node-to-node and client communication.
2017-02-06 09:45:09 +00:00
Avi Kivity
6e9e28d5a3 cell_locking: work around for missing boost::container::small_vector
small_vector doesn't exist on Ubuntu 14.04's boost, use std::vector
instead.
2017-02-05 20:48:36 +02:00
Avi Kivity
2510b756fc dist: add build dependency on automake
Needed by seastar's c-ares.
2017-02-05 20:16:27 +02:00
Takuya ASADA
e82932b774 dist/common/systemd: introduce scylla-housekeeping restart mode
scylla-housekeeping requires to run 'restart mode' for check the version during
scylla-server restart, which wasn't called on systemd timer so added it.

Existing scylla-housekeeping.timer renamed to scylla-housekeeping-daily.timer,
since it is running 'daily mode'.

Fixes #1953

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1486180031-18093-1-git-send-email-syuu@scylladb.com>
2017-02-05 10:46:04 +02:00
Avi Kivity
4175f40da1 dist: add libtool build dependency for seastar/c-ares 2017-02-05 10:42:53 +02:00
Takuya ASADA
12b5e7288d dist/common/scripts/scylla_setup: show restart message when SELinux was disabled on the script
Disabling SELinux requires server restart, so warn user to restart before
running Scylla.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485817393-25919-2-git-send-email-syuu@scylladb.com>
2017-02-05 10:10:18 +02:00
Takuya ASADA
c28a574b9e dist/common/scripts: stop setting hugepages boot parameter
Stop setting hugepages boot parameter since we don't use it on default
configuration (posix mode), but keep scylla_bootparam_setup to setup clocksource
on AMI.

Fixes #1758

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485817393-25919-1-git-send-email-syuu@scylladb.com>
2017-02-05 10:10:18 +02:00
Paweł Dziepak
37b0c71f1d cell_locking: fix parititon_entry::equal_compare
The comparator constructor took schema by value instead of const l-ref
and, consequently, later tried to access object that has been destroyed
long time ago.
Message-Id: <20170202135853.8190-1-pdziepak@scylladb.com>
2017-02-03 19:49:18 +01:00
Avi Kivity
7a00dd6985 Merge "Avoid avalanche of tasks after memtable flush" from Tomasz
"Before, the logic for releasing writes blocked on dirty worked like this:

  1) When region group size changes and it is not under pressure and there
     are some requests blocked, then schedule request releasing task

  2) request releasing task, if no pressure, runs one request and if there are
     still blocked requests, schedules next request releasing task

If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.

However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.

The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.

Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.

I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.

Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."

* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
  tests: lsa: Add test for reclaimer starting and stopping
  tests: lsa: Add request releasing stress test
  lsa: Avoid avalanche releasing of requests
  lsa: Move definitions to .cc
  lsa: Simplify hard pressure notification management
  lsa: Do not start or stop reclaiming on hard pressure
  tests: lsa: Adjust to take into account that reclaimers are run synchronously
  lsa: Document and annotate reclaimer notification callbacks
  tests: lsa: Use with_timeout() in quiesce()
2017-02-02 17:49:31 +02:00
Paweł Dziepak
788892e931 counters: fix build failure on gcc5
Message-Id: <20170202132049.4497-1-pdziepak@scylladb.com>
2017-02-02 14:23:49 +01:00
Piotr Jastrzebski
36b2c4df19 row_cache_test: extend test_mvcc
Make the test execute with and without an active
reader to memtable that's flushed to cache.

This improves the code covarage of MVCC with tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <007b6cd1ba7a84ea5675ea82e454bf1adf3b3330.1485954941.git.piotr@scylladb.com>
2017-02-02 13:51:32 +01:00
Tomasz Grabiec
5458a32f13 gdb: Introduce commands for inspecting pending task queue
Message-Id: <1485426236-6627-1-git-send-email-tgrabiec@scylladb.com>
2017-02-02 13:15:17 +02:00
Avi Kivity
000edc36c4 Merge "Counters" from Paweł
"This series introduces support for counters. The implementation of
counters more or less follows the design described on our wiki page [1].
Counter cells contain many shards with replicas being able to modify
and announce new versions only of the shards that they own. Historically,
there were three types of shards: local, remote and global. In these
patches only support for the global ones is added.

[1] https://github.com/scylladb/scylla/wiki/Counters

Currently, counters are only enabled as experimental features as there
still several things that need to be done before they become production
ready. Namely, the performance is expected to be quite poor (especially
for writes), there is no proper tracing support and timed out counter
requests may not be recognized and dropped early. There are also no
counter-related metrics.

However, apart from these problems there are no other missing parts of
counter implementation and they are expected to work correctly.

Fixes #577."

* 'pdziepak/counters/v3-rebased' of github.com:cloudius-systems/seastar-dev: (38 commits)
  perf_simple_query: add counter tables tests
  thrift: add support for counter operations
  cql3: allow counters in CREATE TABLE statements
  cql3: selection: do not panic when seeing counters
  storage_proxy: support counter updates
  storage_proxy: add get_live_endpoints()
  cql3: add counter increment and decrement operations
  db: add operations for applying counter updates
  counters: implement transforming counter deltas to shards
  add infrastructure for locking counter cells
  add fnv1a hasher
  position_in_partition: add feed_hash()
  position_in_partition: add functions for querying object type
  types: make counter_type_impl report its cql3_type
  transport: encode counters as long_type
  mutation_partition: make for_each_cell() accessible outside source file
  messaging_service: add COUNTER_MUTATION verb
  storage_service: add COUNTERS feature
  idl: add idl description of consistency level
  schema: make is_counter() return correct value
  ...
2017-02-02 12:40:09 +02:00
Paweł Dziepak
8671d8329d perf_simple_query: add counter tables tests 2017-02-02 10:35:14 +00:00
Paweł Dziepak
4ca7f0a491 thrift: add support for counter operations 2017-02-02 10:35:14 +00:00
Paweł Dziepak
fa29ef3cc0 cql3: allow counters in CREATE TABLE statements 2017-02-02 10:35:14 +00:00
Paweł Dziepak
fce6e0987f cql3: selection: do not panic when seeing counters
At this stage counters cells are already long_type values, so no special
handling is necessary.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
1e8814f5ce storage_proxy: support counter updates 2017-02-02 10:35:14 +00:00
Paweł Dziepak
c14c6b753b storage_proxy: add get_live_endpoints() 2017-02-02 10:35:14 +00:00
Paweł Dziepak
d6ebf84edf cql3: add counter increment and decrement operations 2017-02-02 10:35:14 +00:00
Paweł Dziepak
5a0955e89d db: add operations for applying counter updates 2017-02-02 10:35:14 +00:00
Paweł Dziepak
8d889082bf counters: implement transforming counter deltas to shards
The leader receives counter updates as deltas which have to be
transformed to counter shards. In order to do that, current local shard
of the modified counter cell needs to be read, logical clock incremented
and the value modified by the specified delta.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
55277b3182 add infrastructure for locking counter cells
The leader receives counter update in a form of deltas which need to be
transformed to counter shards. In order to do that the node needs to
read its current state of the modified counter cells. Since this is
essentially a read-modify-write opertation an appropriate locking
mechanism is needed.

Counter cell locker introduced in this patch uses a hashtable of
partition entry each containing a hashtable of cell entries. Inside a
cell entry there is a semaphore used for synchronization. Once no longer
needed cell entries and partition entries are removed.

In order to avoid deadlocks cell entries are always locked in the same
order which is the lexicographical order of (clustering key, column id)
pairs. Note that schema changes are not a difficulty since they do not
make it possible to change ordering of such pairs.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
22fbb11f90 add fnv1a hasher 2017-02-02 10:35:14 +00:00
Paweł Dziepak
a16761dcb4 position_in_partition: add feed_hash() 2017-02-02 10:35:14 +00:00
Paweł Dziepak
f4fce93807 position_in_partition: add functions for querying object type 2017-02-02 10:35:14 +00:00
Paweł Dziepak
53d9a6f220 types: make counter_type_impl report its cql3_type 2017-02-02 10:35:14 +00:00
Paweł Dziepak
a805bea97a transport: encode counters as long_type
For the purposes of CQL counters are long values (either a delta in case
of writes or the final value for reads).
2017-02-02 10:35:14 +00:00
Paweł Dziepak
b6564651e4 mutation_partition: make for_each_cell() accessible outside source file
for_each_cell() const already can be used from any place in the code,
allow the same with non-const version.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
bf60b7844b messaging_service: add COUNTER_MUTATION verb
This verb is going to be used for coordinator<->leader communication
during counter updates.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
67ca6959bd storage_service: add COUNTERS feature 2017-02-02 10:35:14 +00:00
Paweł Dziepak
9989239c97 idl: add idl description of consistency level 2017-02-02 10:35:14 +00:00
Paweł Dziepak
4b3c0db5cc schema: make is_counter() return correct value 2017-02-02 10:35:14 +00:00
Paweł Dziepak
99b21fbb86 tests: random_mutation_generator: generate counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
de2acd47c9 tests/sstables: test reading and writing counters 2017-02-02 10:35:14 +00:00
Paweł Dziepak
83c6fc1114 sstables: write counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
5905729c4a sstables: read counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
de698105e4 tests/counter: test apply, difference and freeze 2017-02-02 10:35:14 +00:00
Paweł Dziepak
0c93d01232 atomic_cell: make sure upper level tombstones cover counters
Support for deletion of counters is limited in a way that once deleted
they cannot be used again (i.e. tombstone always wins, regardless of the
timestamp). Logic responsible for merging two counter cells already
makes sure that tombstones are handled properly, but it is also
necessary to ensure that higher level tombstones always cover counters.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
9f1ebd4f7c idl/mutation: add counter serialisation logic 2017-02-02 10:35:14 +00:00
Paweł Dziepak
47d14906e6 mutation_partition: support querying counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
63f25eb12c mutation_hasher: handle counter cells properly 2017-02-02 10:35:14 +00:00
Paweł Dziepak
25c8ed1c71 feed_hash: allow additional arguments 2017-02-02 10:35:14 +00:00
Paweł Dziepak
a57e86cc37 mutation_partition: compute counter difference 2017-02-02 10:35:13 +00:00
Paweł Dziepak
2725a4945d mutation_partition: apply counter cells properly 2017-02-02 10:35:13 +00:00
Paweł Dziepak
496b42fcc7 tests: add test for counters 2017-02-02 10:35:13 +00:00
Paweł Dziepak
7bb5b49799 add in memory representation of counters
Live counter cells are collections of shards, each one representing the
sum of all operations performed by a particular replica. This commits
introduces an in-memory representation of counters as well as basic
operations such as merge, difference and hashing.
2017-02-02 10:35:13 +00:00
Paweł Dziepak
c66db213d3 storage_service: allow getting local host id without futures<> 2017-02-02 10:35:13 +00:00
Paweł Dziepak
0a8f00c159 atomic_cell: add flag for recognizing counter updates
A counter cell may be either a collection of shards or just a delta. The
former can only appear in certain places on coordinator and leader.
2017-02-02 10:35:13 +00:00
Paweł Dziepak
ab344c5aa3 mutation_partition_view: extract atomic_cell variant 2017-02-02 10:35:13 +00:00
Paweł Dziepak
83f6018ea2 schema: keep counter information in column definition 2017-02-02 10:35:13 +00:00
Avi Kivity
aec419da13 Merge seastar upstream
* seastar c1dbd89...f07f8ed (3):
  > Merge "Introduce when_all_succeed()" from Paweł
  > tests: adjust collectd test for metric API change
  > Merge "DNS query support" from Calle
2017-02-02 12:30:10 +02:00
Piotr Jastrzebski
15cc8460bd mutation_partition: make rows_entry constructors explicit
All converting constructors should be explicit otherwise they
can create a confusion. I got myself in such a situation when
clustering key got implicitly converted into rows_entry when
I was not expecting it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c3f19719760f6dc7cf5e858b9c452506faedf521.1485950529.git.piotr@scylladb.com>
2017-02-01 17:57:50 +01:00
Tomasz Grabiec
2fd339787b tests: lsa: Add test for reclaimer starting and stopping 2017-02-01 17:41:56 +01:00
Tomasz Grabiec
f943296da0 tests: lsa: Add request releasing stress test 2017-02-01 17:41:55 +01:00
Tomasz Grabiec
e40fb438f5 lsa: Avoid avalanche releasing of requests
Before, the logic for releasing writes blocked on dirty worked like
this:

  1) When region group size changes and it is not under pressure and
     there are some requests blocked, then schedule request releasing
     task

  2) request releasing task, if no pressure, runs one request and if
     there are still blocked requests, schedules next request
     releasing task

If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The amount of scheduled tasks is at most 1, there is a
single thread of excution.

However, if requests themselves would change the size of the group,
then each such change would schedule yet another request releasing
thread, growing the task queue size by one.

The group size can also change when memory is reclaimed from the
groups (e.g.  when contains sparse segments). Compaction may start
many request releasing threads due to group size updates.

Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency becuase timed out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in
the system. When the amount of scheduled tasks reaches 1000, polling
stops and server becomes unresponsive until all of the released
requests are done, which is either when they start to block on dirty
memory again or run out of blocked requests. It may take a while to
reach pressure condition after memtable flush if it brings virtual
dirty much below the threshold, which is currently the case for
workloads with overwrites producing sparse regions.

Refs #2021.

Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.

The logic for notification across hierachy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
2017-02-01 17:41:55 +01:00
Tomasz Grabiec
d55baa0cd1 lsa: Move definitions to .cc 2017-02-01 17:41:55 +01:00
Tomasz Grabiec
8f8b111b33 lsa: Simplify hard pressure notification management
The hard pressure was only signalled on region group when
run_when_memory_available() was called after the pressure condition
was met.

So the following loop is always an infinite loop rather than stopping
when engouh is allocated to cause pressure:

   while (!gr.under_pressure()) {
       region.allocate(...);
   }

It's cleaner if pressure notification works not only if
run_when_memory_available() is used but whenever conditino changes,
like we do for the soft pressure.

There is comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:

 - we already notify on soft pressure conditions from update(), and if
   that is safe, notifying about hard pressure should also be safe. I
   checked and it looks safe to me.

 - avoiding notification in the rare case when we stopped writing
   right after crossing the threshold doesn't seem benefitial. It's
   unlikely in the first place, and one could argue it's better to
   actually flush now so that when writes resume they will not block.
2017-02-01 17:41:55 +01:00
Tomasz Grabiec
9aa1be5d08 lsa: Do not start or stop reclaiming on hard pressure
We already call these when crossing the soft threshold. We shouldn't
stop reclaiming when hard pressure is gone because soft pressure may
still be present. Calling start_reclaiming() on hard pressure is
unnecessary because soft pressure also starts it, and when there is
hard pressure there is also soft pressure.
2017-02-01 17:40:15 +01:00
Amnon Heiman
45b6070832 Merge seastar upstream
* seastar 397685c...c1dbd89 (13):
  > lowres_clock: drop cache-line alignment for _timer
  > net/packet: add missing include
  > Merge "Adding histogram and description support" from Amnon
  > reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
  > Set the option '--server' of tests/tcp_sctp_client to be required
  > core/memory: Remove superfluous assignment
  > core/memory: Remove dead code
  > core/reactor: Use logger instead of cerr
  > fix inverted logic in overprovision parameter
  > rpc: fix timeout checking condition
  > rpc: use lowres_clock instead of high resolution one
  > semaphore: make semaphore's clock configurable
  > rpc: detect timedout outgoing packets earlier

Includes treewide change to accomodate rpc changing its timeout clock
to lowres_clock.

Includes fixup from Amnon:

collectd api should use the metrics getters

As part of a preperation of the change in the metrics layer, this change
the way the collectd api uses the metrics value to use the getters
instead of calling the member directly.

This will be important when the internal implementation will changed
from union to variant.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
2017-02-01 14:39:08 +02:00
Glauber Costa
facb0aa6d9 row_cache: rewrite loop so that debug mode doesn't become a noop
need_preempt() is always true in debug mode. Because of that, this loop
will never be executed. Rewrite it as a do-while loop so we are sure
that it is executed at least once - or exactly once in debug mode.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1485913079-1283-1-git-send-email-glauber@scylladb.com>
2017-02-01 10:02:13 +02:00
Tomasz Grabiec
634761dbba commitlog: Fix default limit for size on disk
The per-node limit will be total memory divided by number of shards
instead of just total memory. For example, when Scylla is started with
-c16 -m16G, the commit log will induce flushes on given shard when
unflushed data exceeds on that shard 62MB instead of 1GB.

Fixes #2046.

Message-Id: <1485874534-10939-1-git-send-email-tgrabiec@scylladb.com>
2017-01-31 17:12:59 +02:00
Piotr Jastrzebski
c7e95af0b0 row_cache_test: fix test_mvcc
Currently the test does not wait for cache update
to finish before carrying on with the checks.

This makes the test nondeterministic and purely wrong
because checks expect update to be finished.

This patch changes the test to wait for update to finish.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2a99bba24b1628466d3495332b48ef3ccdb43c26.1485862389.git.piotr@scylladb.com>
2017-01-31 11:37:29 +00:00
Avi Kivity
aedb5e5cfa mutation_fragment: add std::ostream support
Helps poor debuggers.
Message-Id: <20170130163605.4858-1-avi@scylladb.com>
2017-01-31 10:37:42 +01:00
Tomasz Grabiec
0d40b86546 Merge "bail sooner from cache update if need_preempt()" from Glauber
An earlier patch of mine was using should_yield to do the same.  That
is a better direction, but should_yield() was demonstrably more
expensive so for now we'll go with need_preempt() - since this is
hurting pretty much every latency-dependent workload.

I am also including the scripts that I have used to measure and
compare the various versions of this patch.
2017-01-31 09:51:34 +01:00
Tomasz Grabiec
f053b48f7c tests: lsa: Adjust to take into account that reclaimers are run synchronously 2017-01-30 19:18:07 +01:00
Tomasz Grabiec
ed9ff19467 lsa: Document and annotate reclaimer notification callbacks
They are called from region_group::update(), so must be alloc-free and
noexcept.
2017-01-30 19:18:07 +01:00
Tomasz Grabiec
2ec6fe415e tests: lsa: Use with_timeout() in quiesce()
Current consutrct doesn't interrupt the test, the timeout failure will
only be logged.
2017-01-30 19:18:07 +01:00
Pekka Enberg
a625aae489 cql3/values.hh: Fix to_bytes_opt(raw_value)
The data() method already returns a bytes_opt so there's no need to call to_bytes_opt() again.

Fixes compliation failure on CentOS:

  In file included from ./cql3/query_options.hh:51:0,
                   from ./cql3/cql_statement.hh:47,
                   from ./cql3/statements/raw/select_statement.hh:45,
                   from build/release/gen/cql3/CqlParser.hpp:65,
                   from build/release/gen/cql3/CqlParser.cpp:44:
  ./cql3/values.hh: In function 'bytes_opt to_bytes_opt(const cql3::raw_value&)':
  ./cql3/values.hh:184:37: error: no matching function for call to 'to_bytes_opt(bytes_opt)'
       return to_bytes_opt(value.data());

Message-Id: <1485761863-28236-1-git-send-email-penberg@scylladb.com>
2017-01-30 10:49:31 +02:00
Gleb Natapov
6e4817137e storage_proxy: report foreground reads instead of reads
The reason is the same as why foreground writes are reported instead of
total writes (049ae37d08): It is much easier to see what is going on
this way.

Also fixes a typo in a counter's description.

Fixes #1217

Message-Id: <20170129093412.GS11469@scylladb.com>
2017-01-29 12:40:56 +02:00
Avi Kivity
9fb2f31616 Merge "CQL binary protocol unset value support" from Pekka
This patch series adds support for "unset values" that were introduced
in CQL binary protocol v4. They allow bound statements to skip updates
to some or all of the bound variables.

Unset values are specified using the BoundStatement.unset() method in
the Java driver:

  http://docs.datastax.com/en/drivers/java/3.1/com/datastax/driver/core/BoundStatement.html#unset-int-

and using the UNSET_VALUE constant in the Python driver:

  https://datastax.github.io/python-driver/api/cassandra/query.html#cassandra.query.UNSET_VALUE

Fixes #2039.

* 'penberg/cql-unset-values/v2' of github.com:cloudius-systems/seastar-dev:
  transport/server: CQL unset value support
  cql3/statements/select_statement: Unset value support
  cql3/user_types: Unset value support
  cql3/tuples: Unset value support
  cql3/maps: Unset value support
  cql3/sets: Unset value support
  cql3/lists: Unset value support
  cql3/constants: UNSET_VALUE constant
  cql3/constants: Unset value support
  cql3/attributes: Unset value support
  types.hh: Add field_name_as_string() to user_type_impl type
  cql3: Introduce raw_value and raw_value_view types
2017-01-29 10:59:01 +02:00
Pekka Enberg
533c8d3949 transport/server: CQL unset value support
This patch implements support for CQL unset values at the protocol level.

Fixes #2039
2017-01-27 09:24:36 +02:00
Pekka Enberg
2bd560118e cql3/statements/select_statement: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
baaf1779c5 cql3/user_types: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
99c7dabd2a cql3/tuples: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
a0e6f6f371 cql3/maps: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
f883e64d70 cql3/sets: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
50ec81ee67 cql3/lists: Unset value support 2017-01-27 09:24:36 +02:00
Pekka Enberg
c4cd0a6541 cql3/constants: UNSET_VALUE constant 2017-01-27 09:24:36 +02:00
Pekka Enberg
063be3ed44 cql3/constants: Unset value support 2017-01-27 09:24:36 +02:00
Glauber Costa
b4ac2c1d60 debug: add systemtap script to measure interesting latencies during cache updates.
Example output:

Measuring Scylla row cache update times ^C
Total update time, (usec)
value |-------------------------------------------------- count
    2 |                                                   0
    4 |                                                   0
    8 |@@                                                 2
   16 |@@@                                                3
   32 |                                                   0
   64 |                                                   0
  128 |@@@@                                               4
  256 |@@                                                 2
  512 |                                                   0
 1024 |                                                   0

Time spent per partition batch (nsec)
 value |-------------------------------------------------- count
   128 |                                                       0
   256 |                                                       0
   512 |                                                      43
  1024 |                                                       2
  2048 |                                                       2
  4096 |                                                      45
  8192 |                                                     349
 16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  61494
 32768 |@@@@@@@@@@@@@@@@@                                  21497
 65536 |                                                       0
131072 |                                                       0

Partitions updated per batch:
value |-------------------------------------------------- count
    0 |                                                      57
    1 |                                                      46
    2 |                                                      76
    4 |                                                     134
    8 |                                                     324
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  82795
   32 |                                                       0
   64 |                                                       0

Total partitions updated: 2485000
Average time spent per partition batch (nsec): 28816
Average time per partition per partition (nsec): 967

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-01-26 22:15:16 -05:00
Glauber Costa
69dbb3e108 row_cache: yield if need_preempt(), even if there is quota left.
The quota check is quite old at the moment, and dates back to a time in
which the infrastructure in seastar threads was lacking a lot. It is a
bad check since it will not take into consideration the size of the
partition or the time it takes to merge them.

A better check would at least take need_preempt() into account, so that
we would respect the task quota. That check is now embedded into
should_yield(), so there would no need to check anything else.

Although should_yield() does the job, it is still currently quite
expensive. And because we are in a seastar thread with a computationally
intensive loop, it can hurt latency a lot.

So as a temporary measure, let's at least check for need_preempt() - as
it is hurting real users at the moment - and soon work on making
should_yield() cheaper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-01-26 22:10:54 -05:00
Glauber Costa
0e1f64b163 row_cache: add systemtap markers for the update process
update is one of our biggest sources of performance issues as far as the
cache is concerned. systemtap can be useful in helping tracking some of
them down.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-01-26 21:56:32 -05:00
Duarte Nunes
937ed1bacb bound_view: Simplify copy ctor
By using default generation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1485355007-1913-1-git-send-email-duarte@scylladb.com>
2017-01-26 19:29:29 +02:00
Avi Kivity
b91b9b351a Revert "Merge seastar upstream"
This reverts commit f301c678bfe5eb5df71f71fd20e08b422b1023bb; the rpc changes
don't compile due to rpc timeout type change.
2017-01-26 18:30:56 +02:00
Avi Kivity
f301c678bf Merge seastar upstream
* seastar 397685c...f5fa2e3 (3):
  > rpc: use lowres_clock instead of high resolution one
  > semaphore: make semaphore's clock configurable
  > rpc: detect timedout outgoing packets earlier
2017-01-26 18:16:14 +02:00
Pekka Enberg
3385144860 cql3/attributes: Unset value support 2017-01-26 13:50:04 +02:00
Pekka Enberg
630aba32ff types.hh: Add field_name_as_string() to user_type_impl type
This is needed to construct validation error messages when user types
encounter unset values.
2017-01-26 13:50:04 +02:00
Pekka Enberg
be0351b49c cql3: Introduce raw_value and raw_value_view types
Currently, the code is using bytes_opt and bytes_view_opt to represent
CQL values, which can hold a value or null. In preparation for
supporting a third state, unset value introduced in CQL v4, introduce
new raw_value and raw_value_view types and use them instead.

The new types are based on boost::variant<> and are capable of holding
null, unset values, and blobs that represent a value.
2017-01-26 13:50:04 +02:00
Gleb Natapov
64660397fc storage_proxy: move operation type information from counter's name to a label
Makes it much more flexible to view the data in various ways in Graphana.

Message-Id: <20170126102746.GL11469@scylladb.com>
2017-01-26 12:38:29 +02:00
Tomasz Grabiec
2c7902fb2b Revert "lsa: Reduce reclamation latency"
This reverts commit d61002cc33.

Introduced a regression in row_cache_alloc_stress.

The problem is that reclaim_from_evictable() evicts way too much after
the refactor due to the stop condition not taking into account how
much data was evicted so far and only looking at occupancy of the
minimal segment. This may lead to eviction of the whole region.
2017-01-26 10:43:18 +01:00
Paweł Dziepak
8cdffd7c57 time_type_impl: value initialize result
parse_time() adds hourse, minutes, etc to a final value 'result'.
However, it is of type std::chrono::nanoseconds which means it is not
zeroed at initialization unless it is explicitly asked to do so.

Fixed debug mode failures in types_tyes and cql_query_test.

Message-Id: <20170125155239.1253-1-pdziepak@scylladb.com>
2017-01-25 17:56:31 +02:00
Paweł Dziepak
034d028329 Merge "range_tombstone_list: Properly implement difference()" from Duarte
"This patchset properly implements range_tombstone_list::difference(),
which was very broken. We add unit tests for the function and ensure
we always randomly generate range_tombstones in other unit tests so
other problems aren't hidden."
2017-01-25 12:08:19 +00:00
Duarte Nunes
8c65b98ea7 mutation_merger: Emit deferred tombstones
This patch ensures the mutation_merger emits any deferred tombstones
that it still may be holding before closing the stream.

Together with the range_tombstone_list: Properly implement
difference() patch set, this fixes breakage of streamed_mutation_test
and row_cache_test.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170123195643.9876-1-duarte@scylladb.com>
2017-01-25 12:02:03 +00:00
Takuya ASADA
bce0fb3fa2 dist: add lspci on dependencies, since it used by dpdk-devbind.py
On minimum setup environment scylla_sysconfig_setup will fail because lspci command is not installed. So install it on package installation time.

Fixes #2035

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485327435-20543-1-git-send-email-syuu@scylladb.com>
2017-01-25 10:22:57 +02:00
Avi Kivity
d2fc98270e Merge seastar upstream
* seastar 6d80c6a...397685c (4):
  > Merge "add label to the io_queue" from Amnon
  > rpc: Modify the shutdown code to wait and handle exceptions
  > tls.cc: Fix shutdown_input/output to conform with expected socket behaviour
  > core: Add counter for polls
2017-01-24 18:36:25 +02:00
Gleb Natapov
ccee01f352 storage_proxy: put datacenter name into a label instead of counter's name
Having datacenter name as a label makes it possible to create Prometheus board for the counters.

Message-Id: <20170124132051.GX11469@scylladb.com>
2017-01-24 15:27:34 +02:00
Duarte Nunes
54a464ae27 random_mutation_generator: Always generate range tombstones
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 19:02:23 +01:00
Duarte Nunes
a01aa91c82 range_tombstone_list: Add unit tests for difference()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 18:14:33 +01:00
Duarte Nunes
85315d1760 range_tombstone_list: Correctly implement difference()
The difference method wasn't properly implemented. The version in this
patch correctly computes the difference and returns a range tombstone
list contains those range tombstones in "this" but absent from the
other, specified range tombstone list.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 18:14:33 +01:00
Duarte Nunes
e7d20ea900 range_tombstone_list: Add apply() convenience overload
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 18:14:33 +01:00
Duarte Nunes
0847954d92 bound_view: Add copy ctor and assignment operator
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-01-23 18:14:33 +01:00
Avi Kivity
1758361640 Merge seastar upstream
* seastar 38aaa4a...6d80c6a (2):
  > DPDK: Change the metrics registration with label support
  > metric: Fix the error: could not convert {...} from <brace-enclosed initializer list> to struct metric_definition_impl
2017-01-23 11:55:21 +02:00
Takuya ASADA
f6d7a76223 dist: rename dist/ubuntu to dist/debian
Now we supported both Ubuntu and Debian on dist/ubuntu, and Ubuntu is one of
Debian variant, so dist/debian is better naming.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1485161896-21851-1-git-send-email-syuu@scylladb.com>
2017-01-23 10:59:52 +02:00
Avi Kivity
31c8e6885b build: improve support for custom builds
Add a counter field to RELEASE, just before the date, and fix it at zero.
This allows custom package builds to override it in a way that sorts before
the official packages.

Example:

  Official release:   1.6.0-0.20160120.<githash>
  Custom release 1:   1.6.0-1.avi.20160121.<githash>
  Custom release 2:   1.6.0-2.avi.20160122.<githash>

The counter (0/1/2) ensures that the build number dominates over the date
when sorting.

Message-Id: <20170122102814.19649-1-avi@scylladb.com>
2017-01-22 14:56:52 +02:00
Avi Kivity
1be9c232b6 Merge seastar upstream
* seastar ff098c8...38aaa4a (1):
  > metrics: equal operator should use ==
2017-01-22 14:41:59 +02:00
Tomasz Grabiec
834df74df0 Merge batch statement optimization from github.com/avikivity/scylla/1689/v2
From Avi:

In many cases, batch statements are used to mutate a single partition, or
a number of partitions that is smaller than the number of statements within
the batch.  We can detect this case and reduce the numbers of mutations
applied, and in some cases, convert a logged batch into an unlogged batch.

Ref #1689.
2017-01-20 13:44:05 +01:00
Tomasz Grabiec
6c75614d19 sstables: Fix input_stream not being closed by index_reader
Fixes #2022
Message-Id: <1484912679-5729-1-git-send-email-tgrabiec@scylladb.com>
2017-01-20 11:58:33 +00:00
Paweł Dziepak
19ad35610b sstables: do not discard future returned by fast_forward_to()
continuous_data_consumer::fast_forward_to() returns a future which was
later ignored by data_consume_context::fast_forward_to().

With the current implementation, the future in question is always ready
and that's why the problem didn't manifest itself in the form of crashes
or invalid results.
Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>
2017-01-20 12:22:17 +01:00
Avi Kivity
a9403877e4 cql3: add more metrics for batch statements
- how many statements are in a batch
 - different types of batches
 - whether we were able to convert a logged batch to an unlogged batch
2017-01-20 13:19:00 +02:00
Avi Kivity
e3c003544d cql3: optimize batch_statement when the same partition is mutated multiple times
Batch statements are often used to insert multiple rows into the same
partition.  Recognize this case and merge mutations to the same partition.

If the result is a single mutation, there is an additional win (already
present in the code), where a logged batch can be converted into an unlogged
batch.

Ref #1689.
2017-01-20 13:18:56 +02:00
Benoît Canet
bcc826cc34 mutation_reader: Short circuit the read path on empty range
Add a boolean to short circuit the read path on empty range
hoping for some speedup.

tested in read write with cs using:

cl=QUORUM duration=1m -mode native cql3 -rate threads=700 -node localhost

Will do some additional benchmark.

Fixes #1056

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170118194451.16836-1-benoit@scylladb.com>
2017-01-20 10:05:40 +00:00
Avi Kivity
54b8acdd9f dht: add hashing and comparison helpers to dht::decorarted_key
An std::hash specialization, and an equality comparator.
2017-01-20 11:24:14 +02:00
Avi Kivity
141048e0e5 dht: improve token hash function
For a small token, we can just return it, since it already is a hash.
We hash large tokens using murmur3, which is supposedly a good hash.
2017-01-20 11:24:14 +02:00
Raphael S. Carvalho
1857ba0abc db: fix bad resource usage distribution when resharding due to refresh
That's because a single shard is used to calculate generation for new
sstables in upload directory, and that will result in that single shard
sharing all the resources with other shards.
For refresh without upload dir, it currently works fine because we
reshuffle column family dir instead.

flush_upload_dir() is now a free function, takes a distributed database
object, and uses calculate_shard_from_sstable_generation() to decide
which shard will move sstable using its own generation namespace.

Fixes #2008.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>
2017-01-19 18:55:21 +02:00
Duarte Nunes
d53f96e0da column_family: Only update stats once for a shared sstables
This patch ensures that when adding a shared sstable, we select only
one cpu to update that column family's stats. This is important so we
don't overestimated the on-disk size of sstables when resharding

This fixes only a temporary miscount of the current load, since shared
sstables are eventually re-written, but a fixes a permanent miscount
of the total load.

Refs #1592

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170119144823.31041-1-duarte@scylladb.com>
2017-01-19 17:40:35 +02:00
Tomasz Grabiec
d61002cc33 lsa: Reduce reclamation latency
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

The change improves worst case latency. Reclamation time statistics
over 30 second period after cache fills up, in microseconds:

Before:

  avg = 1524.283148
  stdev = 11021.021118
  min = 12.934000
  max = 144356.000000
  sum = 257603.852000
  samples = 169

After:

  avg = 1317.362414
  stdev = 1913.542802
  min = 263.935000
  max = 19244.600000
  sum = 175209.201000
  samples = 133

Refs #1634.

Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
2017-01-19 17:35:36 +02:00
Amos Kong
b880bdccef dist/redhat: fix path of housekeeping.cfg
scylla-housekeeping[3857]: Config file /etc/scylla.d/housekeeping.cfg is missing, terminating

Housekeeping failed to execute for missing the config file,
the config file should be in /etc/scylla.d/.

Fixes #2020

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <e63f2f8cb94410a6dca4e6193932f0079755ad47.1484724328.git.amos@scylladb.com>
2017-01-19 11:08:46 +02:00
Avi Kivity
3c05a81ef9 Merge seastar upstream
* seastar 240b0bf...ff098c8 (15):
  > metrics::impl::shard(): check if reactor is initialized before using it
  > reactor: introduce engine_is_ready()
  > fix metric name
  > Merge "Add label support to the metric layer" from Amnon
  > core: Avoid memory leak when submission to syscall_work_queue fails
  > core: Avoid memory leak when submission to smp_message_queue fails
  > core: append_challenged_posix_file_impl: Make exception-safe
  > Merge "Log backtrace in report_failed_future" from Tomasz
  > install-dependencies.sh: add systemtap-sdt-dev to Ubuntu/Debian dependencies
  > core: add fsqual.cc/.hh to core
  > dpdk: Fix compile error with rte_pci.h
  > fstream_test: fix spurious failures due to BOOST_REQUIRE_EQUAL thread-unsafety
  > reactor: unregister metrics of queue on shard 0
  > build: track system header changes too
  > Prometheus: do not rely on collectd for the hostname
2017-01-19 11:00:12 +02:00
Tomasz Grabiec
dd0fb48564 sstables: Close _file even if random_access_reader::close() reports errors
close() operation is like a destructor, it cannot fail. It just
reports errors, but close itself succeeds. So we should proceed with
the closing even if it fails.
Message-Id: <1484245886-7269-1-git-send-email-tgrabiec@scylladb.com>
2017-01-18 12:41:55 +00:00
Tomasz Grabiec
d048eec254 row_cache: Fix stats handling for uncached wide partitions
Report hitting wide partition dummy as a cache miss instead of a hit.

Refs #2011
Message-Id: <1484302266-3828-1-git-send-email-tgrabiec@scylladb.com>
2017-01-18 09:58:04 +00:00
Tomasz Grabiec
87f15624f4 row_cache: Add counter for wide partition mispopulations
Message-Id: <1484733250-14470-1-git-send-email-tgrabiec@scylladb.com>
2017-01-18 09:57:51 +00:00
Calle Wilund
5da92db432 cell_comparator: Better fix (i.e. potentially correct) for compound/clustered desc.
As Tomek pointed out, previous code, regardless of version mismatch, of generating
comparator description string was not correct (as in: in sync with origin).
This modifies it to look at
1.) Actual clustring size
2.) Compound-ness
3.) Dense-ness

to determine whether we should generate a compound desc, and whether it
should contain a trailing utf8-desc type.

v2: Simplify non-dense base column addition and ensure it handles
    thrift non-utf8 (as per comments from tomek)
Message-Id: <1484670171-18362-1-git-send-email-calle@scylladb.com>
2017-01-17 18:03:11 +01:00
Amnon Heiman
e19fa02a17 remove scollectd from headers
As the metrics migration progressed, some include to scollectd.hh left
behind.

Because of the nature of the scollecd implementation those include
brings alot of code with them to the header files and eventually to many
source file.

This patch remove those include and add a missing include to
storage_proxy.cc.

The reason the compiler didn't complain is an indication to the
problematic nature of those include in the first place.

Before this patch, change in metrics.hh would cause 169 files to
compile, after this change 17.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>
2017-01-17 17:39:47 +02:00
Calle Wilund
7d2a4defcf schema: Fix version check for comparator desc string formatting
Fixes #2019

According to the Java driver and cassandra, all versions < 3
include the PK in the comparator descriptor string.

This broke for us when bumping the cassandra version 2.1 -> 2.2

Message-Id: <1484657580-14411-1-git-send-email-calle@scylladb.com>
2017-01-17 14:59:47 +02:00
Tomasz Grabiec
ddfee57c97 Replace iostream include with iosfwd in headers
Message-Id: <1484656119-8386-4-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:44 +02:00
Tomasz Grabiec
50e3e3af08 db: Add missing include
Message-Id: <1484656119-8386-3-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:44 +02:00
Tomasz Grabiec
ea9ab36ad5 db: Move operator<<() definition to .cc
Message-Id: <1484656119-8386-2-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:43 +02:00
Duarte Nunes
c8cbfb7919 storage_service: Make MV feature experimental
This patch ensures that the host only announces and registers the
MATERIALIZED_VIEWS feature if it was started with the experimental
flag.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170116123412.21365-1-duarte@scylladb.com>
2017-01-16 15:45:25 +02:00
Tomasz Grabiec
a559a7ae19 streamed_mutation: Fix memory corruption when reader constructor throws
After we call unlink_leftmost_without_rebalance(), we must unlink all
elements before mutatation is destroyed. We did this properly from
~reader, but it would not be called if reader construction failed,
which it may.
Message-Id: <1484572581-6537-1-git-send-email-tgrabiec@scylladb.com>
2017-01-16 13:26:30 +00:00
Paweł Dziepak
e03868c226 tests: run with all features enabled
Since ce083308a1
"random_mutation_generator: Generate RTs by default" random mutation
generator produces range tombstones. However, so far the tests were run
with all features disabled (because of incomplete initialization of all
services) which meant that RANGE_TOMBSTONE feature was not enabled and
the code couldn't handle range tombstones that weren't just prefixes.

This patch solves the problem by forcing all features to be enabled when
tests are run.
Message-Id: <20170116103324.22956-1-pdziepak@scylladb.com>
2017-01-16 11:38:45 +01:00
Tomasz Grabiec
3c3a4358ae storage_proxy: Fix capturing of on-stack variable by reference
partition_range_count was accepted by do_with callback by value and
then captured by reference by async code, thus invoking use after
destroy.

Message-Id: <1484317846-14485-1-git-send-email-tgrabiec@scylladb.com>
2017-01-16 11:49:11 +02:00
Avi Kivity
c314047b6c config: disable new sharding algorithm
It still has problems:
 - while resharding a very large leveled compaction strategy table, a huge
   amount of tiny sstables are generated, overwhelming the file descriptor
   limits
 - there is a large impact on read latency while resharding is going on

(cherry picked from commit cf27d44412)

(forward-ported from branch-1.6)
2017-01-15 10:48:53 +02:00
Tomasz Grabiec
66547e7d7c storage_proxy: Add missing initialization of _short_read_allowed
Dropped by a1cafed370 ("storage_proxy:
handle range scans of sparsely populated tables").

Fixes the failure in update_cluster_layout_tests.TestUpdateClusterLayout test.

Message-Id: <1484317450-13525-1-git-send-email-tgrabiec@scylladb.com>
2017-01-13 16:47:54 +02:00
Takuya ASADA
bee7f549a9 scylla-housekeeping: move uuid file to /var/lib/scylla-housekeeping
Since scylla-housekeeping running as scylla user, it doesn't have a permission
to create a file on /etc/scylla.d.
So introduce /var/lib/scylla-housekeeping which owns by scylla user, place uuid
file on the directory.

Fixes #2009

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484235946-12463-1-git-send-email-syuu@scylladb.com>
2017-01-13 16:27:53 +02:00
Avi Kivity
c227e3e706 Merge "move a few files in the ScyllaDB project to use the new metrics registration API" from Vlad
* 'rearrange-scylla-collectd-stats-registration-v3' of github.com:cloudius-systems/seastar-dev:
  thrift::server: move collectd counters registration to the metrics registration layer
  gms::gossiper: move collectd counters registration to the metrics registration layer
  utils::logalloc: move collectd counters registration to metrics registration layer
  streaming::stream_manager: move a collectd counters registration to the metrics registration layer
  db::commitlog::commitlog: move collectd counters registration to the metrics registration layer
  sstables::compaction_manager: move collectd metrics registration to the metrics registration layer
  db::batchlog_manager: move collectd registration to the metrics registration layer
  transport::server: move collectd metrics registration to the metrics registration layer
  cql3::query_processor: move collectd metrics registration to the metrics registration layer
  database: move collectd registrations to metrics registration layer
  tracing::trace_keyspace_helper: move collectd metrics registration to a metric registration layer
  tracing::trace_keyspace_helper: fix alignment
  tracing::tracing: move collectd metrics registration to metrics registration layer
2017-01-12 17:13:08 +02:00
Tomasz Grabiec
1e8151b4f2 storage_proxy: Fix use-after-free on one_or_two_partition_ranges
query_mutations_locally() takes one_or_two_partition_ranges by
reference and requires, indirectly, that it is kept alive until
operation resolves. However, we were passing expiring value to it, the
result of unwrap().

Fixes dtest failure in consistent_bootstrap_test.py:TestBootstrapConsistency.consistent_reads_after_bootstrap_test

Another potential problem was that we were dereferencing "s" in the same
expression which move-constructs an argument out of it.

Message-Id: <1484222759-4967-1-git-send-email-tgrabiec@scylladb.com>
2017-01-12 15:10:51 +02:00
Takuya ASADA
c07d703d0d dist/redhat/scylla.spec.in: fix typo of scylla_cpuscaling_setup
Fix packaging error

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484191955-28006-2-git-send-email-syuu@scylladb.com>
2017-01-12 12:13:33 +02:00
Takuya ASADA
0e6df2a82e dist: follow DPDK script renaming
On DPDK 16.11 dpdk_nic_bind.py is renamed to dpdk-devbind.py, so we are
getting "file not found" both on packaging and scripts, fixed that.

Also fixed inconsistent packaging.
Since Seastar copied dpdk_nic_bind.py to its scripts/ directory, there're two
different versions of the script, .rpm/.deb packaging different one:
 dist/redhat: seastar/dpdk/tools/dpdk_nic_bind.py
 dist/ubuntu: seastar/scripts/dpdk_nic_bind.py

That's won't work because we sharing setup scripts between two
distributions, so I changed dist/ubuntu package to use DPDK one.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1484191955-28006-1-git-send-email-syuu@scylladb.com>
2017-01-12 12:13:33 +02:00
Gleb Natapov
76aed548e3 storage_proxy: add replica side counters for data read
Message-Id: <20170112085907.GN11469@scylladb.com>
2017-01-12 11:41:04 +02:00
Vlad Zolotarov
ca0a0f1458 tracing::trace_keyspace_helper: use generate_legacy_id() for CF IDs generation
Explicitly generate tables' IDs of tables from the system_traces KS  using
generate_legacy_id() in order to ensure all Nodes create these tables with
the same IDs.

This is going to prevent hitting issue #420.

Fixes #1976

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1484153725-31030-1-git-send-email-vladz@scylladb.com>
2017-01-12 11:36:35 +02:00
Tomasz Grabiec
33e1f9af6b sstables: Close input_stream from random_access_reader
Spotted by destroy-without-close detector.
Message-Id: <1484072527-13058-1-git-send-email-tgrabiec@scylladb.com>
2017-01-11 09:40:00 +00:00
Duarte Nunes
ce083308a1 random_mutation_generator: Generate RTs by default
This patch changes the random_mutation_generator so it generates range
tombstones by default. This fixes a bug where reversibly applying
range tombstones wasn't being tested.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170110164822.28747-1-duarte@scylladb.com>
2017-01-11 09:24:37 +00:00
Vlad Zolotarov
7fb0bab7d7 thrift::server: move collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:55 -05:00
Vlad Zolotarov
eb4fbb3949 gms::gossiper: move collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:55 -05:00
Vlad Zolotarov
022bca16bf utils::logalloc: move collectd counters registration to metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:55 -05:00
Vlad Zolotarov
a850bea820 streaming::stream_manager: move a collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
dcdd98ccc1 db::commitlog::commitlog: move collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
00e37c389b sstables::compaction_manager: move collectd metrics registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
a9f6e5f8da db::batchlog_manager: move collectd registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
3b41d589f8 transport::server: move collectd metrics registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
8d0a2e3883 cql3::query_processor: move collectd metrics registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
cda382e8d6 database: move collectd registrations to metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
af29c3506b tracing::trace_keyspace_helper: move collectd metrics registration to a metric registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
0df37c04f6 tracing::trace_keyspace_helper: fix alignment
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Vlad Zolotarov
6267bb63f4 tracing::tracing: move collectd metrics registration to metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Avi Kivity
1ff0eef0a8 intrusive_set_external_comparator: avoid using boost::intrusive::value_traits_pointers
boost::intrusive::value_traits_pointers was introduced in boost 1.56, while
we also support boost 1.55.  Replace with an equivalent expression.

(with additions by Asias)

Message-Id: <20170110084700.19994-1-avi@scylladb.com>
2017-01-10 18:16:56 +02:00
Pekka Enberg
3d0217ec43 db/schema_tables: Fix system keyspace table list
Commit f0c28e1 ("db/schema_tables: Add schema_functions and
schema_aggregates tables") forgot to add the newly added tables to the
db::schema_tables::ALL list, which is used for authorization checks, for
example.

Fixes the following auth_test.py dtest failures:

  ('Unable to connect to any servers', {'127.0.0.1': Unauthorized('Error from server: code=2100 [Unauthorized] message="User cathy has no SELECT permission on <table system.schema_functions> or any of its parents"',)})
Message-Id: <1484045277-4997-1-git-send-email-penberg@scylladb.com>
2017-01-10 13:55:04 +01:00
Avi Kivity
0591303b72 Merge "avoid excessive memory usage during resharding" from Rapahel
"Intended to reduce memory usage when resharding by sharing sstable
components among shards. File descriptors are also shared from now
on, meaning that a much smaller number of file descriptors will be
used during resharding.

Fixes #1951."

branch 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla

* 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla:
  db: avoid excessive memory usage during resharding
  checked_file_impl: add support to dup
  sstables: group sstable components that can be shared among shards
  sstables: rename sstable member
2017-01-09 20:43:50 +02:00
Raphael S. Carvalho
68dfcf5256 db: avoid excessive memory usage during resharding
After resharding, sstables may be owned by all shards, which
means that file descriptors and memory usage for metadata will
increase by a factor equal to number of shards. That can easily
lead to OOM.

SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.

For this approach to work, we now only populate keyspaces from
shard 0. It's now the sole responsible for iterating through
column family dirs. In addition, most of population functions
are now free and take distributed database object as parameter.

Fixes #1951.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-09 15:24:36 -02:00
Raphael S. Carvalho
9200e389c2 checked_file_impl: add support to dup
That's needed for sstable fd sharing to work.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-09 13:33:30 -02:00
Avi Kivity
77cb2b452f Merge "CQL 3.3.1 support" from Pekka
"This patch series adds support for CQL 3.3.1. The changes to CQL are listed
here:

  https://github.com/apache/cassandra/blob/cassandra-2.2/doc/cql3/CQL.textile#changes

The following CQL features are already supported by Scylla:

  - TRUNCATE TABLE alias
  - Double-dollar string literals
  - Aggregate functions: MIN, MAX, SUM, and AVG

This series adds the following CQL features:

  - New data types: tinyint, smallint, date, and time
  - CQL binary protocol v4 (required by the new data types)
  - Advertise Cassandra 2.2.8 version from Scylla so that drivers correctly
    detect the presence of CQL 3.3.1

The following CQL features are not supported by Scylla:

  - Role-based access control (issue #1941)
  - JSON data type
  - User-defined functions (UDFs)
  - User-defined aggregates (UDAs)

The following CQL binary protocol v4 changes are not implemented by this
series:

  - Read_failure and Write_failure error codes are not implemented.
    They error codes not used by the smart drivers but as they are
    propagated to application code, we eventually need to wire them up
    to our storage proxy implementation.
  - Function_failure error code is only used by user-defined functions
    and the fromJson function, which are not implemented by Scylla.

Fixes #1284."

* 'penberg/cql-3.3.1/v5' of github.com:cloudius-systems/seastar-dev:
  version: Bump Cassandra version to 2.2.8
  db/schema_tables: Add schema_functions and schema_aggregates tables
  tests/type_tests: TIME type test cases
  tests/cql_query_test: TIME type test cases
  cql3: TIME data type support
  tests/type_tests: DATE type test cases
  tests/cql_query_test: DATE type test cases
  cql3: DATE type support
  date.h: 64-bit year and days representation
  licenses: Add utils/date.h license
  utils/date.h: Import date and time library sources
  tests/type_tests: TINYINT and SMALLINT type test cases
  tests/cql_query_test: TINYINT and SMALLINT type test cases
  cql3: TINYINT and SMALLINT data type support
  types: Fix integer_type_impl::parse_int() for bytes
2017-01-09 11:54:45 +02:00
Avi Kivity
8f36dca6f1 storage_proxy: prevent short read due to buffer size limit from being swallowed during range scan
mutation_result_merger::get() assumes that the merged result may be a
short read if at least one of the partial results is a short read (in
other words, if none of the partial results is a short read, then the
merged result is also not a short read). However this is not true;
because we update the memory accounter incrementally, we may stop
scanning early. All the partial results are full; but we did not scan
the entire range.

Fix by changing the short_read variable initialization from `no`
(which assumes we'll encounter a short read indication when processing
one of the batches) to `this->short_read()`, which also takes into
account the memory accounter.

Fixes #2001.
Message-Id: <20170108111315.17877-1-avi@scylladb.com>
2017-01-09 09:21:43 +00:00
Pekka Enberg
856d0e40fb version: Bump Cassandra version to 2.2.8
Advertise Cassandra 2.2.8 version to the drivers: CQL 3.3.1 language
version and CQL binary protocol version 4 support.
2017-01-09 10:42:21 +02:00
Pekka Enberg
f0c28e1b2d db/schema_tables: Add schema_functions and schema_aggregates tables
The 3.0.3 Java driver, for example, search for the tables and fails when
we advertise Cassandra 2.2 version from Scylla.
2017-01-09 10:42:21 +02:00
Pekka Enberg
10facd7db8 tests/type_tests: TIME type test cases 2017-01-09 10:42:21 +02:00
Pekka Enberg
a49ee9387e tests/cql_query_test: TIME type test cases 2017-01-09 10:42:20 +02:00
Pekka Enberg
93e6592296 cql3: TIME data type support
This adds support for the TIME data type introduced in CQL 3.3.1.

Refs #1284
2017-01-09 10:42:20 +02:00
Pekka Enberg
9ceea7bbc4 tests/type_tests: DATE type test cases 2017-01-09 10:42:20 +02:00
Pekka Enberg
f0cbfb9e4f tests/cql_query_test: DATE type test cases 2017-01-09 10:42:20 +02:00
Pekka Enberg
9def7db381 cql3: DATE type support
This adds support for the DATE type introduced in CQL 3.3.1.

Refs #1284
2017-01-09 10:42:20 +02:00
Pekka Enberg
f83503c09e date.h: 64-bit year and days representation
We need 64-bit year and days representation to support the boundary
values of the CQL data type, which is implemented using Joda Time
library's DateTime type.
2017-01-09 10:42:20 +02:00
Pekka Enberg
41df14f62d licenses: Add utils/date.h license 2017-01-09 10:42:20 +02:00
Pekka Enberg
7f2fc6470c utils/date.h: Import date and time library sources
This patch imports the "date.h" date and time library based on the C++11
<chrono> header, which is proposed for standadization:

  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0355r1.html

We need it to implement support for the CQL date type.

Import repository

  https://github.com/HowardHinnant/date

Import commit:

  commit 2935f80109b8cfc15eb1243afe35f7ec3530f971
  Author: Howard Hinnant <howard.hinnant@gmail.com>
  Date:   Sun Jan 1 15:02:08 2017 -0500

      Have get_version check for the file named version first
2017-01-09 10:39:54 +02:00
Takuya ASADA
42c1e1e0e8 dist/common/systemd: run node-exporter.service as scylla user
For security reason, we should run node-exporter.service as scylla user,
instead of root.

Fixes #1968

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483543419-16541-1-git-send-email-syuu@scylladb.com>
2017-01-09 09:51:47 +02:00
Paweł Dziepak
3339cced05 sstables: file_writer: make write() non-virtual
Noone overrides file_writer::write() so there is no reason to inhibit
optimisations and cause compiler to emit indirect calls.

Message-Id: <20170104163618.26251-1-pdziepak@scylladb.com>
2017-01-09 09:47:37 +02:00
Takuya ASADA
5422a8e046 dist/ubuntu: generate Ubuntu/Debian revision correctly
Ubuntu Packaging Guide says if there's no upstream package (means it's not
ported from Debian), revision should be "0ubuntu1", not "ubuntu1" which is we
currently using.

On Debian, Debian Policy Manual says it's conventional to restart revision from 1 when upstream version increased, so we should specify it to "1".

To do it in single script, we will generate the revision on building time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483498658-27491-1-git-send-email-syuu@scylladb.com>
2017-01-09 09:45:46 +02:00
Takuya ASADA
920683a882 dist/common/scripts: add scylla_cpuscaling_setup
To setup cpu scaling governor to 'performance', add new script to do it on
scylla_setup.

Fixes #1895

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483542216-12195-1-git-send-email-syuu@scylladb.com>
2017-01-09 09:44:41 +02:00
Avi Kivity
97ab0d9feb build: track system header changes too
Changes to boost headers should trigger a rebuild if they change.
2017-01-08 20:49:19 +02:00
Avi Kivity
85f4e16336 main: fix incorrect low memory warning
A spurious division by smp::count warns that memory is low even when plenty
is available.  Fix by removing the division.

Fix #2002.

Message-Id: <20170108122216.27233-1-avi@scylladb.com>
Tested-by: Benoît Canet <benoit@scylladb.com>
2017-01-08 15:14:36 +02:00
Amnon Heiman
8cd3d7445c scylla_setup: remove the uuid file creation
Scylla housekeeping can crete a uuid file if it is missing. There is no
longer need to create one for it.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-3-git-send-email-amnon@scylladb.com>
2017-01-08 14:11:04 +02:00
Amnon Heiman
32888fc0aa scylla-housekeeping: Create a uuid file if one is missing
This patch gets housekeeping to create a uuid file if a path to a uuid
file is upplied but the file is missing.

Because it import the uuid lib, uuid parameters where renamed.

Fixes #1987

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1483866553-13855-2-git-send-email-amnon@scylladb.com>
2017-01-08 14:11:03 +02:00
Gleb Natapov
9ed3346f98 main: fix error reporting about low memory
Message-Id: <20170108112144.GT1829@scylladb.com>
2017-01-08 13:46:48 +02:00
Raphael S. Carvalho
eed2a7d065 sstables: group sstable components that can be shared among shards
We intend to share immutable sstable components among shards to
reduce excessive memory usage when resharding shared sstables.

This change is about grouping those components into a structure,
and using foreign ptr to make sure that the structure will be
deleted by whichever shard created it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-06 15:16:19 -02:00
Raphael S. Carvalho
a492f8dfaf sstables: rename sstable member
Rename _components to _recognized_components because _components
will be used to name a field with shareable components.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-01-06 15:16:17 -02:00
Avi Kivity
38b2fa27ad Merge seastar upstream
* seastar 1c8e389...240b0bf (15):
  > file/dup: don't decrease refcnt twice when file is explicitly closed
  > reactor: Add missing CentOS 7.2 dependency systemtap-sdt-devel
  > reactor: Cleaning the smp queue metrics when shuting down
  > metrics: metrics keep the value map while unregistering
  > change the reactor load metrics to utilization
  > Merge "ASan fiber switches" from Paweł
  > tls: Add missing credentials_builder::set_client_auth method
  > collectd: create metrics with the right format
  > io_queue: remove owner number from metric name
  > reactor: change the load metric name to load
  > Merge "reactor: stop using signals for task_quota timer"
  > metrics: Allow initializing the metric_group in its constructor
  > Update DPDK to 16.11
  > Revert "rpc: Avoid using zero-copy interface of output_stream"
  > core::metrics_groups: add a clear() method
2017-01-06 16:34:51 +02:00
Vlad Zolotarov
492295eb7f init: move supervisor_notify() out of main.cc
Transform the supervisor_notify() and related functions into
the "supervisor" class and place this class implementation in
a separate .cc file.

This is going to fix the compilation breakage of tests introduced
by a

commit 8014adc2a1

    init: serialize the creation of system_traces KS objects

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1483663955-20096-1-git-send-email-vladz@scylladb.com>
2017-01-06 10:10:55 +00:00
Avi Kivity
be11b054e1 Merge "Reduce the size of mutation_partition" from Piotr
"Reduce the size of mutation_partition by implementing intrusive set using
bi::rbtree_algorithms directly and using tree nodes optimized for size.

This will reduce the size of mutation_partition by:
24 bytes + <number of cql rows> * 8 bytes

This should have a positive impact on performance because mutation_partitions
are stored both in memtable and cache.

Fixes #742."

* 'haaawk/742' of github.com:cloudius-systems/seastar-dev:
  intrusive_set: rename size() to calculate_size()
  Make intrusive_set_external_comparator::_value_traits static
  Implement intrusive set using rbtree_algorithms
  mutation_partition: make apply_reversibly_intrusive_set nongeneric
  mutation_partition: take schema in find_row and clustered_row
  mutation_partition: Extract intrusive set logic to a class.
  mutation_partition: Replace value_comp with key_comp calls
2017-01-05 17:34:10 +02:00
Tomasz Grabiec
cd630fece6 db: Make system tables use the commitlog
Before this patch system table writes were not writing to commit log
because database::add_column_family() disables writes to commit log
for the table which is added if _commitlog is not set at that
time. Fix by initializing commit log before system tables are created.

Fixes #1986.

Fixes recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.

Could cause system keyspace writes to be delayed for more than before
under heavy write workload. Refs #1926.

Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
2017-01-05 14:53:51 +02:00
Avi Kivity
eb520e7352 storage_proxy: fix result ordering for parallel partition range scans
During a range scan, we try to avoid sorting according to partition range
when we can do so.  This is when we scan fewer than smp::count shards --
each shard's range is strictly ordered with respect to the others.

However, we use the wrong key for the sort -- we use the shard number.  But
if we started at shard s > 0 and wrapped around to shard 0, then shard 0's
range will be after the range belonging to shard s, but will sort before it.

Fix by storing the iteration order as the sort key.  We use that when we
know that shards do not overlap (shards < smp::count) and the index within
the source partition range vector when they do.

Fixes #1998.
Message-Id: <20170105114253.17492-1-avi@scylladb.com>
2017-01-05 12:51:37 +01:00
Vlad Zolotarov
8014adc2a1 init: serialize the creation of system_traces KS objects
Serialize the creation of a system_traces KS objects when
they do not exist - the initial cluster boot.
Avoid creating them in parallel by different cluster Nodes
in order to avoid issue #420.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1483552503-12873-3-git-send-email-vladz@scylladb.com>
2017-01-05 12:41:38 +01:00
Vlad Zolotarov
d3b8b67e66 service::storage_service: serialize the system_auth KS initialization
Move the system_auth KS initialization to be before Node moves to the NORMAL
state. This way we will serialize this code running on different Nodes and
avoid hitting issue #420.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1483552503-12873-2-git-send-email-vladz@scylladb.com>
2017-01-05 12:36:06 +01:00
Piotr Jastrzebski
b159e08764 intrusive_set: rename size() to calculate_size()
This hopefully will make it more apparent that
the time complexity of this method is O(N) not O(1).

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 12:21:43 +01:00
Piotr Jastrzebski
b47a296053 Make intrusive_set_external_comparator::_value_traits static
_value_traits can be shared among all instances
and there's no need to store it in every single one.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 12:21:10 +01:00
Avi Kivity
4667641f5f result_memory_tracker: fix too-short short reads
1.6 truncates paged queries early to avoid overrunning server memory
with too-large query results, but in the case of partition range queries,
this terminates too early due to an uninitialized variable holding the
maximum result size.  This results in slow performance due to additional
round trips.

Fix by initializing the maximum result size from the result_memory_tracker
running on the coordinating shard.

Fixes #1995.
Message-Id: <20170105103915.10633-1-avi@scylladb.com>
2017-01-05 10:51:55 +00:00
Piotr Jastrzebski
041b0a65ac Implement intrusive set using rbtree_algorithms
This new implementation takes less memory because it
does not store comparator.

It also uses tree nodes optimized for size. This means
that instead of storing an enum field |color| they embed
this information inside pointer to parent.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:46:58 +01:00
Piotr Jastrzebski
a0c20f5c49 mutation_partition: make apply_reversibly_intrusive_set nongeneric
apply_reversibly_intrusive_set is used only in one place
and always with rows_type. There's no need for it to be generic.
This will allow changing intrusive set implementation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
4bbe05dd47 mutation_partition: take schema in find_row and clustered_row
This will allow intrusive set implementation that does not
store schema.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
fe3c91db90 mutation_partition: Extract intrusive set logic to a class.
It will make it easier to change the implementation
of the intrusive set.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
da67ac7ae4 mutation_partition: Replace value_comp with key_comp calls
This will reduce the size of bi::set API being used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Pekka Enberg
0ea5652354 tests/type_tests: TINYINT and SMALLINT type test cases 2017-01-05 10:57:35 +02:00
Pekka Enberg
41e3327ebc tests/cql_query_test: TINYINT and SMALLINT type test cases 2017-01-05 10:57:35 +02:00
Pekka Enberg
fcaa743e3d cql3: TINYINT and SMALLINT data type support
This adds support for the TINYINT and SMALLINT data types introduced in
CQL 3.3.1.

Refs #1284
2017-01-05 10:57:35 +02:00
Pekka Enberg
257fa541f1 types: Fix integer_type_impl::parse_int() for bytes
The integer_type_impl::parse_int() function uses boost::lexical_cast()
under the hood, which parses 8-bit numbers as characters. Fix the
function to lexical cast to 64-bit integer and convert the result to
integer_type_impl template type.
2017-01-05 10:57:35 +02:00
Nadav Har'El
45f19f2633 main: better error message on failing to start Prometheus
Previously, if the Prometheus port (by default, 0.0.0.0:9180) could not
be opened, the following message appeared in the log about 10 seconds into
the run, and Scylla crashed.

ERROR 2017-01-01 19:31:04,066 [shard 0] seastar - Exiting on unhandled exception: std::system_error (error system:98, Address already in use)

The puzzled user would have no idea *which* address was already in use, why,
or why Scylla stopped.

In this patch, before the above message we get the much more informative
message:

ERROR 2017-01-01 19:58:19,080 [shard 0] init - Could not start Prometheus API server on 0.0.0.0:9180: std::system_error (error system:98, Address already in use)

We continue to print the original message - and exit - in this case,
under the assumption that it's better not to run the database while
improperly configured.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170102121304.2060-1-nyh@scylladb.com>
2017-01-04 14:58:26 +02:00
Tzach Livyatan
0c746b22e0 Fix a typo in scylla_setup housekeeping prompt
Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <1483362474-22113-1-git-send-email-tzach@scylladb.com>
2017-01-04 14:54:22 +02:00
Takuya ASADA
43655512e1 dist/redhat: add python-setuptools on dependency since it requires for scylla-housekeeping
scylla-housekeeping breaks when python-setuptools doesn't installed, so
add it on dependency.

Fixes #1884

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483525828-7507-1-git-send-email-syuu@scylladb.com>
2017-01-04 14:32:10 +02:00
Pekka Enberg
060841b756 tests/types_test: Fix int32 type string conversion boundary case
The test case is interested in the upper boundary of 32-bit integer
because we already test the lower boundary in assertions below. The old
test passed, of course, but it wasn't very interesting.
Message-Id: <1483522773-6008-1-git-send-email-penberg@scylladb.com>
2017-01-04 11:57:02 +01:00
Avi Kivity
3232d47d4f dist: remove another bc dependency
No longer used.
2017-01-01 11:13:34 +02:00
Tzach Livyatan
2bfa7cc086 dist/common/scripts: improve scylla_setup wording
Fix a few minor typos and improve the user prompt text

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <1482918340-19375-1-git-send-email-tzach@scylladb.com>
2016-12-30 13:18:08 +02:00
Tzach Livyatan
436ce7ae49 conf/scylla.yaml: Move broadcast_rpc_address to the supported section
Fixes #1779

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <1483021417-8415-1-git-send-email-tzach@scylladb.com>
2016-12-29 16:24:56 +02:00
Takuya ASADA
e48cc9cf01 dist/ubuntu: check lsb_release existance since it's not included minimal Debian installation
Ubuntu has it in minimal installation but Debian doesn't, so add it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1483003565-2753-1-git-send-email-syuu@scylladb.com>
2016-12-29 11:33:21 +02:00
Pekka Enberg
a443dfa95e tracing: Add seastar/core/scollectd.hh include
Fix the following build breakage:

FAILED: build/release/gen/cql3/CqlParser.o
g++ -MMD -MT build/release/gen/cql3/CqlParser.o -MF build/release/gen/cql3/CqlParser.o.d -std=gnu++1y -g  -Wall -Werror -fvisibility=hidden -pthread -I/home/penberg/scylla/seastar -I/home/penberg/scylla/seastar/fmt -I/home/penberg/scylla/seastar/build/release/gen  -march=nehalem -Ifmt -DBOOST_TEST_DYN_LINK -Wno-overloaded-virtual -DFMT_HEADER_ONLY -DHAVE_HWLOC -DHAVE_NUMA -DHAVE_LZ4_COMPRESS_DEFAULT  -O2 -DBOOST_TEST_DYN_LINK  -Wno-maybe-uninitialized -DHAVE_LIBSYSTEMD=1 -I. -I build/release/gen -I seastar -I seastar/build/release/gen -c -o build/release/gen/cql3/CqlParser.o build/release/gen/cql3/CqlParser.cpp
In file included from ./query-request.hh:31:0,
                 from ./locator/token_metadata.hh:51,
                 from ./locator/abstract_replication_strategy.hh:29,
                 from ./database.hh:26,
                 from ./service/storage_proxy.hh:44,
                 from ./db/schema_tables.hh:43,
                 from ./db/system_keyspace.hh:46,
                 from ./cql3/functions/function_name.hh:45,
                 from ./cql3/selection/selectable.hh:48,
                 from ./cql3/selection/writetime_or_ttl.hh:45,
                 from build/release/gen/cql3/CqlParser.hpp:63,
                 from build/release/gen/cql3/CqlParser.cpp:44:
./tracing/tracing.hh:357:5: error: ‘scollectd’ does not name a type
     scollectd::registrations _registrations;
     ^~~~~~~~~

Message-Id: <1482939751-8756-1-git-send-email-penberg@scylladb.com>
2016-12-28 18:40:18 +02:00
Nadav Har'El
d49aa7abd2 storage_service: make is_joined() an immediate function
Commit d41cd48a made the is_joined() method a future<bool> because
only cpu 0 knows its real value. This makes this function inconvenient
to use. So this patch reverts commit d41cd48a, and instead sets this
flag's value on all shards, so each shard can read its value locally
(and immediately).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161228160450.5831-1-nyh@scylladb.com>
2016-12-28 18:37:22 +02:00
Pekka Enberg
2aee7f6334 Merge seastar upstream
* seastar f32e4c2...1c8e389 (2):
  > Merge "migrate network related seastar collectd metrics to the new metrics registration API" from Vlad
  > file: add dup() support
2016-12-28 17:04:11 +02:00
Duarte Nunes
1444a52fae position_in_partition: Add tri_comparator
Will be needed to order view updates with the existing mutations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
[pdziepak: corrected component name in commit message]
Message-Id: <1482880989-3086-2-git-send-email-duarte@scylladb.com>
2016-12-28 13:04:16 +01:00
Duarte Nunes
c6b0387f31 clustering_bounds_comparator: Add tri_comparator
This patch adds a tri_comparator for bound_view, which will be used by
to add a tri comparator to position_in_partition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482880989-3086-1-git-send-email-duarte@scylladb.com>
2016-12-28 13:02:57 +01:00
Duarte Nunes
adb727f7dc clustering_row: Add apply() overload
This patch adds an overload to the apply() function,
which takes a clustering_row by reference, to copy. This will be
needed by future patches, when merging base table updates with the
existing data.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482881106-3202-1-git-send-email-duarte@scylladb.com>
2016-12-28 12:45:12 +01:00
Pekka Enberg
302035577e cql3/statements: Make batch_statement::_type private
The _type member variable is never accessed outside of the
batch_statement class so make it private.
Message-Id: <1482921073-28485-1-git-send-email-penberg@scylladb.com>
2016-12-28 12:08:05 +01:00
Pekka Enberg
20daf43403 cql3/statements: Move batch_statement implementation to source file
Clean up batch_statement class by moving implementation to the
batch_statement.cc source file to make it easier to modify the class.

Message-Id: <1482920872-28303-1-git-send-email-penberg@scylladb.com>
2016-12-28 12:30:03 +02:00
Duarte Nunes
86a109915d streamed_mutations: Update comments
This patch removes references to the old begin_range_tombstone and
end_range_tombstone mutation_fragments, which have been replaced by a
single range_tombstone fragment.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482880820-2831-1-git-send-email-duarte@scylladb.com>
2016-12-28 09:06:49 +01:00
Gleb Natapov
4ca58959ad storage_proxy: do not deref unengaged stdx:optional
Fixes intentional short reads.

Message-Id: <20161227142133.GE1829@scylladb.com>
2016-12-27 16:30:03 +02:00
Vlad Zolotarov
9606db2f08 api::set_tracing_probability: prevent a server from returning 500 for a bad probability value
- Change an exception type thrown by a tracing::tracing::set_trace_probability()
     to make it different from the one thrown by an std::stod() when it fails to
     parse a given string.
   - Catch the std::out_of_range exception thrown by a tracing::tracing::set_trace_probability() and
     wrap the exception string into the httpd::bad_param_exception() object.
   - Throw a httpd::bad_param_exception() with a
     "Bad format in a probability value: <a user given probability string value>"
     message if std::invalid_argument is caught.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465300738-1557-1-git-send-email-vladz@cloudius-systems.com>
2016-12-27 12:07:09 +02:00
Avi Kivity
339cc0c2fa main: verify sufficient memory per shard
Refuse to boot if we don't have at least 1 GiB per shard, unless in developer
mode.

The primary violator here is docker, but since it starts in developer mode,
it won't get fixed.  We need some extra logic for this case.
Message-Id: <20161221090222.28677-1-avi@scylladb.com>
2016-12-27 12:05:52 +02:00
Avi Kivity
868b4d110c Merge "Fixes for intentional short reads" from Paweł
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.

I order to minimise chances of digest mismatch during data queries replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.

It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."

* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
  query: introduce result_memory_accounter::foreign_state
  storage_proxy: fix short reads in parallel range queries
  storage_proxy: pass maximum result size to replicas
  mutation_partition: use result limiter for digest reads
  query: make result_memory_limiter constants available for linker
  result_memory_limiter: add accounter for digest reads
  idl: allow writers to use any output stream
  result_memory_limiter: split new_read() to new_{data, mutation}_read()
  idl: is_short_read() was added in 1.6
  mutation_partition: honour allowed_short_read for static rows
  storage_proxy: fix _is_short_read computation
  storage_proxy: disallow short reads if got no live rows
  storage_proxy: don't stop after result with no live rows
2016-12-26 10:42:49 +02:00
Avi Kivity
1d9ee358f1 Revert "Merge "Reduce the size of mutation_partition" from Piotr"
This reverts commit aa392810ff, reversing
changes made to a24ff47c637e6a5fd158099b8a65f1191fc2d023; it uses
boost::intrusive::detail directly, which it must not, and doesn't compile on
all boost versions as a consequence.
2016-12-25 16:07:48 +02:00
Avi Kivity
59d389bd46 Merge seastar upstream
* seastar 0b98024...f32e4c2 (11):
  > Merge "Moving the reactor counters to the metric layer" from Amnon
  > metrics: Metrics function should take variable as a refernce
  > Revert "Merge ""Moving the reactor counters to the metric layer from Amnon"
  > Merge ""Moving the reactor counters to the metric layer from Amnon
  > Revert "fstream: Auto-close data_sink and data_source"
  > rpc: Avoid resource unit leaks on failure
  > fstream: Auto-close data_sink and data_source
  > http: Move metrics registration to the metrics layer
  > output_stream: add batching to zero copy interface
  > Revert "slab: Move the metrics registration to the metrics layer"
  > slab: Move the metrics registration to the metrics layer
2016-12-25 15:50:09 +02:00
Amnon Heiman
70b2a1bfd4 Set the prometheus prefix to scylla
This patch make the prometheus prefix configurable and set the default
value to scylla.

Fixes #1964

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1482671970-21487-1-git-send-email-amnon@scylladb.com>
2016-12-25 15:21:53 +02:00
Avi Kivity
b99a0fc076 licenses: clarify that licenses in this directory do not cover entire work 2016-12-25 12:59:38 +02:00
Avi Kivity
aa392810ff Merge "Reduce the size of mutation_partition" from Piotr
"Reduce the size of mutation_partition by implementing intrusive set using
bi::rbtree_algorithms directly and using tree nodes optimized for size.

This will reduce the size of mutation_partition by:
24 bytes + <number of cql rows> * 8 bytes

This should have a positive impact on performance because mutation_partitions
are stored both in memtable and cache.

Fixes #742."

* 'haaawk/742' of github.com:cloudius-systems/seastar-dev:
  intrusive_set: rename size() to calculate_size()
  Make intrusive_set_external_comparator::_value_traits static
  Implement intrusive set using rbtree_algorithms
  mutation_partition: make apply_reversibly_intrusive_set nongeneric
  mutation_partition: take schema in find_row and clustered_row
  mutation_partition: Extract intrusive set logic to a class.
  mutation_partition: Replace value_comp with key_comp calls
2016-12-25 12:56:10 +02:00
Benoît Canet
a24ff47c63 scylla_setup: Use blkid or ls to list potentials block devices
blkid does not list root raw device.

Revert to lsblk while taking care of having a fallback
path in case the -p option is not supported.

Fixes #1963.

Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20161225100204.13297-1-benoit@scylladb.com>
2016-12-25 12:03:40 +02:00
Takuya ASADA
f3e45bc9ef dist/redhat: don't try to adduser when user is already exists
Currently we get "failed adding user 'scylla'" on .rpm installation when user is already exists, we can skip it to prevent error.

Fixes #1958

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1482550075-27939-1-git-send-email-syuu@scylladb.com>
2016-12-25 11:37:25 +02:00
Piotr Jastrzebski
345ed5b6ff intrusive_set: rename size() to calculate_size()
This hopefully will make it more apparent that
the time complexity of this method is O(N) not O(1).

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:32:13 +01:00
Piotr Jastrzebski
151fa3aaf0 Make intrusive_set_external_comparator::_value_traits static
_value_traits can be shared among all instances
and there's no need to store it in every single one.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:32:13 +01:00
Piotr Jastrzebski
671affc36c Implement intrusive set using rbtree_algorithms
This new implementation takes less memory because it
does not store comparator.

It also uses tree nodes optimized for size. This means
that instead of storing an enum field |color| they embed
this information inside pointer to parent.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:32:13 +01:00
Piotr Jastrzebski
b0f712a4e8 mutation_partition: make apply_reversibly_intrusive_set nongeneric
apply_reversibly_intrusive_set is used only in one place
and always with rows_type. There's no need for it to be generic.
This will allow changing intrusive set implementation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Piotr Jastrzebski
2af6ff68d9 mutation_partition: take schema in find_row and clustered_row
This will allow intrusive set implementation that does not
store schema.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Piotr Jastrzebski
b3b924dec9 mutation_partition: Extract intrusive set logic to a class.
It will make it easier to change the implementation
of the intrusive set.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Piotr Jastrzebski
ac7481f4b2 mutation_partition: Replace value_comp with key_comp calls
This will reduce the size of bi::set API being used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Tomasz Grabiec
f2a63270d1 sstables: Fix double close on index and data files when writing fails
file output streams take the responsibility of closing the file, they
will close the file as part of closing the stream.

During sstable writing we create sstable object and keep file
references there as well. Sstable object also has responsibility for
closing the files, and does so from sstable::~sstable().

Double close was supposed to be avoided by a construct like this:

  writer.close().get();
  _file = {};

However if close() failed, which can happen when write-ahead failed,
_file would not be cleared, and both the writer and sstable would
close the file. This will result in a crash in
append_challenged_posix_file_impl::close(), which is not prepared to
be closed twice.

Another problem is that if exception happened before we reached that
construct, we still should close the writer. Currently we don't, so
there's no double close on the file, but that's a bug which needs to
be fixed and once that's fixed double close on _file will be even more
likely.

The fix employed here is to not keep files inside sstable object when
writing. As soon as the writer is constructed, it's the only owner of
the file.

Fixes #1764.

Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>
2016-12-23 11:44:43 +02:00
Raphael S. Carvalho
fd80499b3d database: make column_family::add_sstable() private again
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <38226308bee2970a91b0e35370d6a646b85ecfe9.1482459877.git.raphaelsc@scylladb.com>
2016-12-23 11:42:16 +02:00
Paweł Dziepak
e6d27ac529 query: introduce result_memory_accounter::foreign_state
Range queries used to be performed sequentially and the shard performing
part of the read was reading state of the merger's memory accounter
directly. Now, they may be performed in parallel so it is safer to just
pass relevant data by value to the intersted shards so that they are not
reading something that another shard is modyfing at the same time.

Since query is done in parallel there is a chance of overread. However,
the parallelism is high only in sparsely populated tables and that's
when the overread is less serious problem.
2016-12-22 17:16:24 +01:00
Paweł Dziepak
49d675223e storage_proxy: fix short reads in parallel range queries
Since a1cafed370 "storage_proxy: handle
range scans of sparsely populated tables" nonsingular range queries may
be performed in parallel on multiple shards. The consequence of this
that result may be added to the merger out of order. This requires more
complex logic for handling short reads.

As soon as mutation_result_merger gets a short read it starts to discard
all subsequently received results that are known to contain partitions
with larger keys.
Then when the final result is being prepared the merger may need to
combine and sorts results which ordering is not known. If at least one
of these results is a short one all partitions with larger keys are
removed.

Due to request being performed in parallel it is possible that even
though there was a short read the merger has got enough live data to
satisfy specified limits. If this has happened the short read flag is
not set on the final result.
2016-12-22 17:16:24 +01:00
Paweł Dziepak
1a52569f7d storage_proxy: pass maximum result size to replicas
We may want to change the default individual result size limit in the
future. If it is provided by the coordinator and not hardcoded in the
replicas this can be done without causing data query digest mismatches
or wasteful mutation query results.
2016-12-22 17:16:23 +01:00
Paweł Dziepak
40176ca2f8 mutation_partition: use result limiter for digest reads
Even if we are performing a digest query we should do proper result
memory accounting so that the result ends exactly in the same place that
it would if it was a data query. This is to avoid digest mismatches
between replicas.
2016-12-22 17:16:23 +01:00
Avi Kivity
8686a59ea5 dht: use nonwrapping_ranges in ring_position_range_sharder
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges.  Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>
2016-12-22 14:40:30 +01:00
Takuya ASADA
7c3b98806d dist/common/scripts/scylla_setup: improve the message of disk selection prompt
Not to confuse users, describe we only list up unmounted disks.

Fixes #1841

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479720708-6021-1-git-send-email-syuu@scylladb.com>
2016-12-22 15:36:46 +02:00
Paweł Dziepak
a7d694654a query: make result_memory_limiter constants available for linker 2016-12-22 13:35:04 +01:00
Paweł Dziepak
a0523df8d6 result_memory_limiter: add accounter for digest reads
Digest reads differ from data reads in a way that they do not really
consume any memory. We still want them to stop in the same place that
data reads would, but the per-shard semaphore shouldn't be updated by
them.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
38ee69dee0 idl: allow writers to use any output stream
Original IDL generated code was hardcoded to always use bytes_ostream.
This patch makes the output stream a template parameter so that any
valid output stream can be used.
Unfortunately, making IDL writers generic requires updates in the code
that uses them, this is fixed in C++17 which would be able to deduce the
parameter in most cases.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
aa083d3d85 result_memory_limiter: split new_read() to new_{data, mutation}_read()
For data queries it is very important that all replicas get limited in
the same place (this includes replicas returning only digest). That's
why they shouldn't be affected by per-shard result memory limit.
Moreover, we should make sure that individual memory limits are the
same, making the coordinator provide it for replicas which allow to
safely change it in the future.

Mutation queries are not as sensitive but it is still beneficial to make
sure that all replicas use the same individual limit.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
b8e29cc99c idl: is_short_read() was added in 1.6 2016-12-22 13:35:04 +01:00
Paweł Dziepak
1c7cade559 mutation_partition: honour allowed_short_read for static rows 2016-12-22 13:35:04 +01:00
Paweł Dziepak
a7a454c388 storage_proxy: fix _is_short_read computation 2016-12-22 13:35:04 +01:00
Paweł Dziepak
8c1e4a707c storage_proxy: disallow short reads if got no live rows
If after reconciliation the coordinator ends up with no live rows and
short reads are allowed a retry may not make any progress if replicas
end their reads in the same place. The solution is to disallow short
reads on retries which are caused by final result having no live rows.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
6db262446f storage_proxy: don't stop after result with no live rows
mutation_result_merger merges results from different shards and stops as
soon as a shard returned a short read or memory usage on the merging
shard is too high. However, it should never stop unless at least one
live rows is in the merged result.
2016-12-22 13:35:04 +01:00
Avi Kivity
74ecd7072a Merge "Reduce overhead of get_max_purgeable_timestamp() during compaction" from Tomasz
* 'tgrabiec/calculate-hash-once-compaction' of github.com:cloudius-systems/seastar-dev:
  sstables: Calculate key hash only once during compaction
  tests: sstables: Add more test cases to tombstone_purge_test
  db: Expose column_family::add_sstable
  tests: sstables: Ensure timestamps are increasing
  tests: sstables: Simplify tombstone_purge_test
2016-12-22 14:33:30 +02:00
Tomasz Grabiec
045b9fd7c1 sstables: Calculate key hash only once during compaction
Improves compaction performance.
2016-12-22 13:24:46 +01:00
Tomasz Grabiec
fb8765bef9 tests: sstables: Add more test cases to tombstone_purge_test 2016-12-22 13:24:46 +01:00
Tomasz Grabiec
c7ff2a2bb0 db: Expose column_family::add_sstable
Needed by compaction tests.
2016-12-22 13:24:46 +01:00
Tomasz Grabiec
d841cab02c tests: sstables: Ensure timestamps are increasing 2016-12-22 13:24:45 +01:00
Tomasz Grabiec
21ade8e4a4 tests: sstables: Simplify tombstone_purge_test
- moved to seastar thread

  - extracted sstable creation and validation logic

  - reduced code duplication

  - switched to mutation_reader assertions

  - used result of compact_sstable() to locate the new sstable

  - rather than setting gc timestamp in the past, bump the clock
    before compacting
2016-12-22 13:24:41 +01:00
Tomasz Grabiec
bc6486b304 Use gc_clock instead of db_clock where possible
Some code paths were obtaining db_clock timestamp to only convert it
to gc_clock later. Avoid this. In the future we could make gc_clock
cheaper cause it has low precision.

Message-Id: <1482401190-2035-1-git-send-email-tgrabiec@scylladb.com>
2016-12-22 13:27:55 +02:00
Raphael S. Carvalho
c26090a6b2 sstables/compress: fix error message for snappy uncompression
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <898ad07db705355bdbf780afdb3aa982b8ca3823.1482364125.git.raphaelsc@scylladb.com>
2016-12-22 09:08:34 +01:00
Raphael S. Carvalho
27fb8ec512 db: avoid excessive disk usage during sstable resharding
Shared sstables will now be resharded in the same order to guarantee
that all shards owning a sstable will agree on its deletion nearly
the same time, therefore, reducing disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.

Fixes #1952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
2016-12-21 23:18:06 +02:00
Tomasz Grabiec
d87d50dc64 db: Use microsecond precision for server-side timestamps
Currently server-side timestamps use a clock with millisecond
precision. Timestamps have microsecond resolution, with lower bits
used to serialize mutations originating from given client.

Timestamps for column drops always use just the millisecond base. A
column drop which is executed after an insert may thus be given lower
timestamp than the insert, even when the two are serialized on the
client side over same connection.

Use microsecond precision to reduce chances of that event.

This is supposed to fix sporadic failures of
schema_test.py:TestSchema.drop_column_queries_test dtest.
Message-Id: <1482343119-27698-1-git-send-email-tgrabiec@scylladb.com>
2016-12-21 18:03:22 +00:00
Avi Kivity
875635554d Merge "educe overhead of partition presence checker during cache update" from Tomasz
Refs #1943.

* 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev:
  db: Compute key hash once in partition_presence_checker
  bloom_filter: Allow checking presence using pre-hashed key
  db: Use incremental selector in partition_presence_checker
2016-12-21 14:24:54 +02:00
Takuya ASADA
d356c21512 configure.py: don't allow to run multiple 'ninja -C seastar' on same time
Scylla's build.ninja allows to run multiple 'ninja -C seastar' on same time,
it breaks DPDK build after upgraded to DPDK-16.10:
https://gist.github.com/syuu1228/4bd1170630b7e5f15653281b4728e521

To prevent it, we need to limit number of seastar build only one in same time.

Note: it doesn't mean disabling parallel build on Seastar.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1482250560-20289-1-git-send-email-syuu@scylladb.com>
2016-12-21 12:42:52 +02:00
Vlad Zolotarov
62cad0f5f5 tracing: don't start tracing until a Tracing service is fully initialized
RPC messaging service is initialized before the Tracing service, so
we should prevent creation of tracing spans before the service is
fully initialized.

We will use an already existing "_down" state and extend it in a way
that !_down equals "started", where "started" is TRUE when the local
service is fully initialized.

We will also split the Tracing service initialization into two parts:
   1) Initialize the sharded object.
   2) Start the tracing service:
      - Create the I/O backend service.
      - Enable tracing.

Fixes issue #1939

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1481836429-28478-1-git-send-email-vladz@scylladb.com>
2016-12-21 12:40:14 +02:00
Gleb Natapov
0a2dd39c75 messaging_service: move MUTATION_DONE messages to separate connection
If a node gets more MUTATION request that it can handle via RPC it will
stop reading from this RPC connection, but this will prevent it from
getting MUTATION_DONE responses for requests it coordinates because
currently MUTATION and MUTATION_DONE messages shares same connection.

To solve this problem this patches moves MUTATION_DONE messages to
separate connection.

Fixes: #1843

Message-Id: <20161201155942.GC11581@scylladb.com>
2016-12-21 11:10:15 +02:00
Piotr Jastrzebski
3e502de153 mutation_partition: don't use unique_ptr to manage LSA objects
Unique_ptr won't destruct them correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>
2016-12-21 09:40:15 +01:00
Raphael S. Carvalho
e28537b56f sstables: fix calculation of memory footprint for summary
size of keys weren't taken into account, so value reported
via collectd is much smaller than actual footprint.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3ca24612e4e84d1cbdea4f2d79e431a4f4479291.1482255327.git.raphaelsc@scylladb.com>
2016-12-20 18:28:47 +00:00
Paweł Dziepak
d0e61fd092 test.py: remove '.cc' from view_schema_test 2016-12-20 18:26:52 +00:00
Avi Kivity
3989e4ed15 Revert "config, dht: reduce default msb ignore bits to 4"
This reverts commit b81a57e8eb.

With exponential range scanning, we should now be able to survive
msb ignore bits of 12, which allows better sharding on large clusters.
2016-12-20 19:41:05 +02:00
Duarte Nunes
a9e5b7f124 view_info: Fix comparison
Two view_info object are equal if their fields are equal, not
different.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482253839-2736-1-git-send-email-duarte@scylladb.com>
2016-12-20 18:36:39 +01:00
Avi Kivity
a1cafed370 storage_proxy: handle range scans of sparsely populated tables
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard.  With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.

Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).

If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatentation.  This
optimizes for the dense(r) case, where few partial results are required.

If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys.  This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.

We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.

[tgrabiec: trivial conflicts]

Message-Id: <20161220170532.25173-1-avi@scylladb.com>
2016-12-20 18:32:29 +01:00
Tomasz Grabiec
dc94bd0642 Merge branch 'materialized-views/cql/v4' from git@github.com:duarten/scylla.git
This patchset implements the multiple CQL3 statements relating to
materialized views, as well as ensuring other statements now take
materialized views into account. It also adds the necessary internal
data structures to hold materialized view metadata.
2016-12-20 14:21:18 +01:00
Duarte Nunes
8ac4d7b2e8 tests: Add view_schema_test
This patch adds a set of tests for materialized view schema
handling, complementing the dtests for the same feature.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
eb25a8f3cd cql_test_env: Add do_with_cql_env_thread function
This patch introduces the do_with_cql_env_thread() function, which
behaves like do_with_cql_env() except that it executes the
user-specified function in the context of a Seastar thread.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
124802e196 cql3: Add function to build view's select statement
This patch adds an utility function that creates a raw select
statement from a set of columns and a where clause. It is intended to
be used to create the prepared select statement used by the view
class.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
088dfdb108 select_statement: Consider materialized views
This patch considers materialized views in
select_statement::check_access().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5511dab914 cql3: Add drop view statement
This patch adds the drop_view_statement, which enables users to drop a
given materialized view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5c51a24217 cql3: Parse drop view statement
This patch adds the necessary grammar to Cql.g to parse drop view
statements.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
3025ea63fc cql3: Add alter view statement
This patch adds the alter_view_statement, which enables users to
change the properties of a materialized view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
71b1e7c056 cql3: Parse alter view statement
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
8792fed651 create_view_statement: Complete implementation
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
02bc0d2ab3 create_view_statement: Require MV feature
This patch adds the MATERIALIZED_VIEWS_FEATURE to the set of cluster
features and requires its presence to allow creating a view. This
ensures view schemas can be safely propagated across nodes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
59682c95a1 create_view_statement: Require experimental switch
Creating a materialized view requires running Scylla with the
experimental switch.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
c626c983f4 create_view_statement: Reuse validation code
This replace some validation logic with a call to
validation::validate_column_family.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5bd74abee8 create_view_statement: Implement check_access
This patch implements check_access according to Cassandra's
implementation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
a9c17b0a52 select_statement: Propagate for_view argument
This patch propagates the for_view argument, used by
statement_restrictions to ensure IS NOT NULL can be used when creating
a materialized view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
65535b3444 modification_statement: Check access for tables with views
This patch checks for additional permissions when modifying a table
with views, since that update will require reading from the table and
writing into its views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
5187fdbb3a modification_statement: Views aren't updated directly
This patch ensures that views cannot be modified directly through an
insert or update statement.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
21e34c5054 alter_type_statement: Consider materialized views
This patch ensures we also update materialized views where the type
being updated occurs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
a5b7b0464b migration_manager: Only drop table without views
This patch forbids dropping a column family if there are still views
associated with it, and also forbids dropping a view through the drop
table statement.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
76276f1a53 alter_table_statement: Update materialized view
This patch ensures that changes to a base table's schema
are reflected in that table's materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
44a1f2d836 query_processor: Use cql3::util::do_with_parser()
To minimize code duplication, have query_processor use
do_with_parser() instead of manually creating the CqlParser.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
bd1e66f411 cql3: Allow renaming a column in a where clause
This patch adds an utility function to rename a column occurring a
textual where clause. It is intended to change a view's where clause
when users alter the underlying base table.

To do this, we rely on functions that transform a textual where clause
into a set of relations, which allows to reliably rename the  column.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
ced4b6e4ff cql3: Allow renaming an identifier in a relation
This patch adds an utility function to rename an identifier
occurring in a cql3 relation. This function will be used when renaming
an identifier in a view's where clause.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
282c023524 migration_manager: Announce view drop
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
99aa8eb4b8 migration_manager: Announce view update
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
6ef3358321 migration_manager: Announce new view creation
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
8ce21a9c01 schema_tables: Make drop view mutations
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
61a5a74ea2 schema_tables: Make update view mutations
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
2098c336d9 schema_tables: Make create view mutations
This patch builds the mutations to announce a new view. Aside from
including the view schema, we include the base table mutations so
that a node is resilient against receiving create view mutations
before the base table create mutations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
19a76a82e8 frozen_schema: Support view schemas
This patch allows a view schema to be frozen. To unfreeze such a
schema, we add an is_view attribute to the schema idl.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
c11eb30225 schema_tables: Replace add_table_to_schema_mutation
This patch replaces the add_table_to_schema_mutation() function with
add_table_or_view_to_schema_mutation().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
04b93ba803 schema_tables: Make view mutations
This patch adds functions that translate a view schema to the
corresponding mutations.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
fe632e8ba5 schema_tables: Factor out duplicate code
This patch factors out duplicate code between
merge_tables() and merge_views().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
3fd79bb6d6 schema_tables: Merge views for schema merging
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
06ab61a570 schema_tables: Extract update_column_family
This patch extracts update_column_family from schema_tables into
database so it can be used when adding materialized views, in future
patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
ecc4290bc6 database: Remove view from base table upon drop
This patch changes the drop_column_family() function to remove
a view schema from the list of views of its base table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
4f166cfa6a database: Parse views schema table upon init
This patch adds code for parsing the views schema table upon init and
also ensures that when adding a view column family, that we add it to
its base table list of views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
40c684b5f5 database: Extract common create cf code
This patch moves some duplicate code into the
add_column_family_and_create_directory() function. It also saves some
superfluous keyspace lookups and readies the code to be used by
materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
42242273f6 schema_tables: Create views from mutations
This patch enables views to be created from their low-level,
mutation-based representation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
888a8923c7 read_table_mutations: Support other schemas
This patch changes read_table_mutations() so that it can now
read schemas from other tables besides the column families
schema table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
93458f314c migration_manager: Notify of view schema changes
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
22d8aa9bb6 migration_listener: Listen for view schema changes
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
b9cf25c4dd schema_tables: Add views schema table
This patch adds the views schema table, containing the definition of
views in a keyspace.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
e41494996f thrift: Skip materialized views
This patch ensures we don't provide access to materialized views over
thrift. This includes preventing updates but also omitting them when
describing a keyspace.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
2b231f22b8 keyspace_metadata: Add tables() and views() functions
This patch adds utility functions to keyspace_metadata to select only
the tables or only the views out of all the schemas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
7818339791 materialized views: Add view class
This patch adds the view class, which will contains functions related
to populating a view, either from the base table's write path or from
the view building mechanism which copies over already existing data in
the base table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
d0ed8fa29b schema: Add view_ptr class
The view_ptr class contains a schema_ptr known to represent a
materialized view. It is intended to be used by functions that require
such a schema, and thus obviate the need for the function to check for
schema::is_view().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
82ce8eedbd schema: Add view_info field
This patch adds a view_info optional field to the schema. It's
presence indicates the schema represents a materialized view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
4b3ac42914 materialized views: Add view_info class
The view_info class is meant to augment a schema with
fields relevant for materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-20 13:06:11 +00:00
Duarte Nunes
d7e607ff51 query_pagers: Fix over-counting of rows
This patch fixes a regression introduced in 0518895, where we counted
one extra row per partition when it contained live, non static rows.

We also simplify the visitor logic further, since now we don't need to
count rows one by one. Also remove a bunch of unused fields.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1482234083-2447-1-git-send-email-duarte@scylladb.com>
2016-12-20 11:58:37 +00:00
Tomasz Grabiec
0e487b3499 db: Compute key hash once in partition_presence_checker
I measured reduction of cache update time by 20% for 6 sstables and by
40% for 16.

Refs #1943.
2016-12-19 14:20:58 +01:00
Tomasz Grabiec
ab5c77fcf1 bloom_filter: Allow checking presence using pre-hashed key
Will allow us to calculate the hash once and use it on many filters
instead of calculating the hash for each filter separately.

Another change made is to avoid precomputing all indexes during filter
operations, and have for_each_index() template instead which invokes a
functor.
2016-12-19 14:20:58 +01:00
Tomasz Grabiec
78844fa2e5 db: Use incremental selector in partition_presence_checker
This reduces the number of sstables we need to check to only those
whose token range overlaps with the key. Reduces cache update
time. Especially effective with leveled compaction strategy.

Refs #1943.

Incremental selector works with an immutable sstable set, so cache
updates need to be serialized. Otherwise we could mispopulate due to
stale presence information.

Presence checker interface was changed to accept decorated key in
order to gain easy access to the token, which is required by
the incremental selector.
2016-12-19 14:20:58 +01:00
Avi Kivity
b740aff777 tests: adjust mutation_query_test for partition and row limits
Won't build otherwise.
2016-12-19 11:37:25 +02:00
Avi Kivity
f3c8cbbac5 Merge "Introduce dht::token_range an dht::partition_range" from Asias
"nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.

Also there is a query::partition_range in query-request.hh which is the alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to query at all. Let's unify the usage project wide."

* tag 'asias/repair_dht_token_range/v2' of github.com:cloudius-systems/seastar-dev:
  Convert to use dht::partition_range_vector and dht::token_range_vector
  dht: Introduce dht::partition_range_vector and dht::token_range_vector
  Get rid of query::partition_range
  Convert to use dht::partition_range
  Convert to use dht::token_range
  dht: Rename token_range to token_range_endpoints
  dht: Introduce dht::token_range an dht::partition_range
2016-12-19 10:59:52 +02:00
Asias He
937f28d2f1 Convert to use dht::partition_range_vector and dht::token_range_vector 2016-12-19 14:08:50 +08:00
Asias He
7a446986fa dht: Introduce dht::partition_range_vector and dht::token_range_vector
std::vector<dht::partition_range> and std::vector<dht::token_range> are
used in a lot of places, introduce dht::partition_range_vector and
dht::token_range_vector as the alias.
2016-12-19 08:09:28 +08:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Asias He
85034c1b57 Convert to use dht::partition_range 2016-12-19 08:04:30 +08:00
Asias He
d1178fa299 Convert to use dht::token_range 2016-12-19 08:04:29 +08:00
Asias He
1f06eedb58 dht: Rename token_range to token_range_endpoints
It is a helper class used in storage_service only. Rename it so we can
use it for the real dht::token_range.
2016-12-19 08:04:29 +08:00
Asias He
264b6ee69e dht: Introduce dht::token_range an dht::partition_range
nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.

Also there is a query::partition_range in query-request.hh which is the alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to query at all. Let's unify the usage project wide.
2016-12-19 08:04:29 +08:00
Avi Kivity
32fb4c3661 Merge "repair: Reduce unnecessary streaming traffic even more" from Asias
"In 7c873f0d (repair: Reduce unnecessary streaming traffic), we optimize
in cases when 1) all the remote nodes has the same checksum and 2) local node
has zero checksum.

In this series, we make the optimization more generec and cover more cases."

* tag 'asias/repair/node_reducer/v3' of github.com:cloudius-systems/seastar-dev:
  repair: Reduce unnecessary streaming traffic even more
  repair: Add hash specialization for partition_checksum
2016-12-18 16:53:39 +02:00
Avi Kivity
3421ebe8be Merge "storage_proxy: Enforce row limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. Uppers layers like the pager may
already be depending on this behavior."

* 'enforce-row-limit/v3' of https://github.com/duarten/scylla:
  query_pagers: Don't trim returned rows
  select_statement: Don't always trim result set
  query_result_merger: Limit rows
  mutation_query: to_data_query_result enforces row limit
2016-12-18 08:15:51 +02:00
Avi Kivity
6bb875bdb7 Merge "storage_proxy: Enforce partition limit" from Duarte
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."

* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
  storage_proxy: Decrease limits when retrying command
  storage_proxy: Don't fetch superfluous partitions
  query::result: Add partition count
  column_family: Use counters in query::result::builder
  query_result_builder: Use the underlying counters
  mutation_partition: Count partitions in query_compacted
  mutation_partition: Remove tabs in query_compacted
  query::result::builder: Add partition count
  query_result_merger: Limit partitions
2016-12-16 13:57:37 +02:00
Glauber Costa
7133583797 track streaming and system virtual dirty memory
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.

After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
2016-12-16 10:59:40 +02:00
Avi Kivity
293876c72f Merge "Limit number of readers streaming uses" from Paweł
"Original, naive db::make_streaming_reader() implementation created a set
of memtable and sstable readers for every partition range. This caused
bad interaction with the code limiting sstable readers concurrency and
was suboptimal.

This series introduces multi range mutation reader that takes mutation
source and a sorted, disjoint vector of ranges. It creates only a single
set of memtable and sstable readers and fast forwards it to the next
range once the current one is completed."

* 'pdziepak/multi-range-reader/v1' of github.com:cloudius-systems/seastar-dev:
  db: use multi range reader for streaming readers
  dht: describe split_range[s]_to_shards() guarantees
  repair: remove outdated fixme
  test/mutation_reader_test: add multi_range_reader test
  tests/mutation_reader: extract key creation code
  mutation_reader: add multi_range_reader
2016-12-15 17:48:31 +02:00
Paweł Dziepak
cf679a413c db: use multi range reader for streaming readers
A naive approach was to create a set of readers for each range and pass
them all to combining reader. This however performed badly if the number
of ranges was high.

The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
2016-12-15 13:54:43 +00:00
Paweł Dziepak
b86a826baf dht: describe split_range[s]_to_shards() guarantees
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
2016-12-15 13:07:32 +00:00
Paweł Dziepak
5287417136 repair: remove outdated fixme 2016-12-15 13:07:32 +00:00
Paweł Dziepak
5b0cf20f75 test/mutation_reader_test: add multi_range_reader test 2016-12-15 13:07:32 +00:00
Paweł Dziepak
787a976c2b tests/mutation_reader: extract key creation code 2016-12-15 13:07:32 +00:00
Paweł Dziepak
52a4e79210 mutation_reader: add multi_range_reader
So far, the only way to combine outputs of multiple readers was to use
combining reader. It is very general and, in particular, supports case
when the readers emit mutations from overlapping ranges.

However, we have cases (e.g. streaming) when we need to read from
several disjoint ranges. Combining reader is a suboptimal solution as it
requires to creating a reader for each range and ignores the fact that
they do not overlap.

This patch introduces multi_range_mutation_reader which takes a
mutation_source and a sorted set of disjoint ranges. Internally, it uses
mutation_reader::fast_forward_to() to move to the next range once the
current one is completed.
2016-12-15 13:07:31 +00:00
Duarte Nunes
0518895f5b query_pagers: Don't trim returned rows
Since storage_proxy::query() now respects the read_command limits, we
can remove the trimming logic from query_pagers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 11:00:46 +00:00
Duarte Nunes
7ce859799b select_statement: Don't always trim result set
Trimming the result set is only needed when the query contains an "IN"
relation, an ORDER BY clause, and defines a limit, which is the case
where we query different ranges concurrently. We don't use the
result_merger to trim since we first need to reorder the rows.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 11:00:46 +00:00
Duarte Nunes
fee0b7fa48 query_result_merger: Limit rows
This patch makes the row limit enforced by the storage_proxy layer.
It adds a row limit to the query_result_merger, useful when merging
results for concurrent queries.

More importantly, it provides guarantees that upper layers may be
relying on implicitly (e.g., the paging code).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 11:00:36 +00:00
Duarte Nunes
efc986d548 mutation_query: to_data_query_result enforces row limit
This patch changes mutation_query::to_data_query_result() so that it
enforces the row limit alongside the partition limit and the
per-partition limit.

In the following patch, we'll enforce the row limit in an upper layer,
but this lets us optimize the case where only when replica replies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:56:40 +00:00
Duarte Nunes
c2072c7dc9 storage_proxy: Decrease limits when retrying command
This patch changes a read_command's limits when retrying it, so that
we don't ask for more rows than necessary.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:41:06 +00:00
Duarte Nunes
9572c19dc6 storage_proxy: Don't fetch superfluous partitions
This patch ensures we keep track of how many partitions we've queried
so we don't ask for more than the number we need.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
93be8d7cef query::result: Add partition count
This patch adds a partition count to query::result, filled by the
query::result::builder. The partition count is present whenever the
result carries data, being absent only for the case where the result
contains only a digest.

We also ensure that counts are present for an empty query::result.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
781cd82cb8 column_family: Use counters in query::result::builder
This patch changes column_family::query() to use the counters in the
builder to determine how many partitions and rows to ask for and also
to implement the stop condition. This saves a continuation to do the
bookkeeping, and allows us to remove data_query_result.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
05b2ef4fa2 query_result_builder: Use the underlying counters
This patch changes the query_result_builder to use the counters
provided by the query::result::builder. It also ensures they are kept
current.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
f5cf7f7921 mutation_partition: Count partitions in query_compacted
This patch changes mutation_partition::query_compacted() to count the
number of partitions written to the underlying writer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
f21dfb8217 mutation_partition: Remove tabs in query_compacted
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
2409b6b250 query::result::builder: Add partition count
This patch adds a partition count to the query::result::builder. It is
intended to be incremented by users, and later used to build a
query::result.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
108011a839 query_result_merger: Limit partitions
This patch adds a partition limit to the query_result_merger, useful
when merging results for concurrent queries. This change also makes
the partition limit enforced by the storage_proxy layer, no changes
being needed by the upper layers, namely the Thrift interface.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:41 +00:00
Pekka Enberg
06c5216c9d Merge "Improve gossip feature logging" from Asias 2016-12-15 10:36:54 +02:00
Asias He
e578e65103 gossip: Log feature enabled message on shard zero only
Feature is per node. No need to log them number of shards times.
2016-12-15 16:33:11 +08:00
Asias He
4137fab91b gossip: Make log in check_features debug level
We saw the message twice for the same feature check. This is a bit
confusing.

INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
INFO  2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}

This is because

   ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE);
   ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE);

The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE)
is constructed. The second message is printed when the
ss._range_tombstones_feature is copy-constructed.
2016-12-15 16:33:10 +08:00
Asias He
2b1ebc4719 gossip: Introduce gms:features::enable helper
Add the helper function to enable the a feature and log the feature is
enabled.

When a feature is enabled, we see

INFO  2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled
INFO  2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled

in the log.
2016-12-15 16:33:10 +08:00
Paweł Dziepak
b70e5d2089 Merge seastar upstream
Submodule seastar 6fbd792..0b98024:
  > fstream: fix read ahead byte metric types
  > fstream: add read-ahead metrics
  > future-util: make stop_iteration use bool_class<>
  > util: introduce bool_class<Tag>
2016-12-14 15:01:13 +00:00
Avi Kivity
57f4910832 Merge "Query result size limiting" from Paweł
"This series makes Scylla limit size of query results it produces in case they
grow unreasonably large. This is possible because CQL paging queries do not
guarantee that the returned page is going to have page_size rows and pages
smaller than tha *do not* indicate end of stream. Non-paged queries and Thrift
requests do not have such flexibility and they also get all the requested data
(though their memory usage is still accounted for and may limit paged queries).

There is a maximum result size (1 MB) and all results builders will stop after
reaching it. Moreover, there is a per-shard limitation on the amount of memory
used by all results combined (10%). To avoid tiny results a query has to
reserve (wait if necessary) 4 kB before starting executing, after that it can
consume more memory without any additional waiting provided it is below
individual and shard-local limits.

Enabling the cluster to return less rows than requested also means some changes
for the coordinator. Firstly, if it receives such short result from a replica
retrying it with a larger limit obviously makes no sense whatsoever. Instead,
in such cases the coordinator removes the clustering rows it has incomplate
information about and sends short result back to the client. Moreover, even
if no replica returned short response reconciliation may have made it so. In
this case, the coordinator do not necessairly need to retry the query as well.
Unfortunately, with the current implementation short responses ruin data
queries since they will cause a digest mismatch.

Three new metrics were added:
 * database_bytes_total_result_memory -- total memory used by query results
 * database_total_operations_short_data_queries -- data queries that were
   limited by size, particulary bad as it basically forces coordinator to
   retry them as mutation queries
 * database_total_operations_short_mutation_queries -- mutation queries limited
   by size"

* 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: clean up after primary_key introduction
  cql3: allow short reads with paged queries
  storage_proxy: handle intentional short reads
  storage_proxy: make sure coordinator has complete data
  storage_proxy: honour partition limit
  storage_proxy: use cmd limits to determine that replica reached end
  db: add metrics for short reads and memory used for results
  data_query: limit result size
  mutation_query: limit result size
  db: create result_memory_accounters when starting query
  query_builder: add partition_slice getter
  reconcilable_result: keep result_memory_tracker object
  mutation_compactor: honour stop_iteration from consumers
  db: add result_memory_limiter
  query: add result size limiter
  reconcilable_result: properly propagate short_read flag
  query_pagers: handle short reads properly
  query: allow short reads
  serializer_impl: add serializer for bool_class<Tag>
2016-12-14 16:53:07 +02:00
Paweł Dziepak
4c69d7e2fe storage_proxy: clean up after primary_key introduction
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify patch introducing its fields were named 'first' and
'second'. This patch changes the names to something less useless,
removes old row_address alias and removes is_missing_rows() in favour of
primary_key::less_compare_clustering comparator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dde4bd5051 cql3: allow short reads with paged queries
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
3c173d87b5 storage_proxy: handle intentional short reads
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:37 +00:00
Paweł Dziepak
dd67de7218 storage_proxy: make sure coordinator has complete data
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).

However, this function is called only if reconciled result has at least
as much rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
2ff5308d8e storage_proxy: honour partition limit
At the moment the coordinator does not care much for the partition
limit. In particular it doesn't check whether after reconciliation the
result still contains enough partitions.

This patch makes it honour the partition limit and increase it in the
retried queries if necessary.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
7bed7aa7de storage_proxy: use cmd limits to determine that replica reached end
Coordinator may retry a query with larger limits. However, code
determining whether replica has no more data always used the original
limits. This may cause a livelock.

For example, consider cluster having the following partitions (deletions
cover live cells):

node1:
pk=0, v=0
pk=1, v=1

node2
delete pk=0
delete pk=1
pk=2, v=2
pk=3, v=3

Now, if there is a query SELECT * FROM cf LIMIT 2 the first node is
going to send partitions 0 and 1 while second node is going to send 2
and 3 + tombstones for 0 and 1. The coordinator will decide that it
needs to retry the request with larger row limit since node1 may have
some information about partitions 2 and 3 that are newer than what node2
has sent.

However, when the second response arrives node1 will still sent only two
rows since it has no more data. Because the coordinator uses original
row limit it will not notice that this node reached the end and we are
going to get another retry without making any progress.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
cfd4d0f680 db: add metrics for short reads and memory used for results
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:28:36 +00:00
Paweł Dziepak
ba51e7e8db data_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
f1b9f49f2b mutation_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
6c33a4f177 db: create result_memory_accounters when starting query
This pach ensures than when we start executing a query a minimum result
size is reserved from result_memory_limiter.

Moreover, range queries need a way of merging memory usage information
from different shards.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
0bce4047bd query_builder: add partition_slice getter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
15de8de9e5 reconcilable_result: keep result_memory_tracker object
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
34f9eb4cbd mutation_compactor: honour stop_iteration from consumers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
5d7185fd39 db: add result_memory_limiter
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
ee89d80d5c query: add result size limiter
This patch introduces an infrastrucutre for limiting result size.

There is a shard-local limit which makes sure that all results combined
do not use more than 10% of the shard memory.
There is also an invidual limit which restricts a result to 4 MB.
In order

In order to avoid sending tiny results there is minimum guaranteed size
(4 kB), which the query needs to reserve before it starts producing the
result.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
43fe3439ca reconcilable_result: properly propagate short_read flag
reconcilable_result can be merged with another or transformed into
query::result. Make sure that short_read information is never lost.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
837d24f1b2 query_pagers: handle short reads properly
Currently, the paging implementation assumes that the server retunrs
either as many rows as it was asked for all reached the end. Soon,
that's not going to be true so instead of making any assumptions about
the number of the rows returned use the new "short read" flag to
determine whether there is going to be more data.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
da7ca85040 query: allow short reads
When paging is used the cluster is allowed to return less rows than the
client asked for. However, if such possibility is used we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to sent and short reads caused by any other means.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Paweł Dziepak
7a15c89b1d serializer_impl: add serializer for bool_class<Tag>
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:01 +00:00
Takuya ASADA
8918a4be57 dist/common/scripts/scylla_setup: don't abort scylla_setup when each setup script failed
Instead of abort scylla_setup, print warning message then continue to next setup.

Fixes #1357

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481713664-18429-1-git-send-email-syuu@scylladb.com>
2016-12-14 13:31:50 +02:00
Tomasz Grabiec
c9344826e9 tests: Remove unintentional enablement of trace-level logging
Sneaked in by mistake.
2016-12-14 10:58:07 +01:00
Tomasz Grabiec
fe6a70dba1 tests: commitlog: Fix assumption about write visibility
The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and new segment may be
active while the previous one is still being written or not yet
synced.

Fix the test so that it expectes that the number of mutations read
this way is <= the number of mutations read, and that after all
segments are synced, the number of mutations read is equal.

Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
2016-12-14 11:29:33 +02:00
Avi Kivity
a61ff53150 Merge "rework flush criteria" from Glauber
"The current criteria for memtable flush is not being respected.  The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.

More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.

To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.

Fixes #1918"

* 'flush-criteria-v6' of github.com:glommer/scylla:
  config: get rid of memtable_total_space
  database: rework dirty memory hierarchy
  system keyspace: write batchlog mutation in user memory
  database: remove flush_token
  database: abstract pressure condition notification
  database: encapsulate semaphore_units into a flush_permit
  database: remove friendship declaration
  database: simplify flush_one
  database: make memtable_list aware in cases it can't flush
2016-12-14 11:24:10 +02:00
Takuya ASADA
c18a95cddf dist/redhat: add scylla_lib.sh to scylla.spec
Fix .rpm build error.

Fixes #1932

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481703992-9596-1-git-send-email-syuu@scylladb.com>
2016-12-14 10:27:37 +02:00
Glauber Costa
56df53f51e compaction_manager: fix shutdown sequence
By the time we are able to acquire this semaphore, we may be stopped
already. So we need to test it before we go ahead. I can see shutdown
hangs before this patch that are fixed with it applied.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <e5b378893128d086d584ffbb2acd3fb687648e5c.1481655433.git.glauber@scylladb.com>
2016-12-14 09:26:24 +01:00
Asias He
84fa2c91c7 repair: Reduce unnecessary streaming traffic even more
In 7c873f0d (repair: Reduce unnecessary streaming traffic), we optimize
in cases when 1) all the remote nodes has the same checksum and 2) local node
has zero checksum.

In this patch, we make the optimization more generec and cover more cases.

1) With RF = 3, 3 nodes cluster, rm data on node3 then run repair on node2

Before:
INFO  2016-12-09 16:24:31,961 [shard 0] repair - Found differing range (-4091524285777924069, -4086237930244473115]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO  2016-12-09 16:24:31,963 [shard 0] repair - Found differing range (-609511120964672970, -605253169726090861]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}
INFO  2016-12-09 16:24:31,964 [shard 0] repair - Found differing range (-7655412157560911259, -7652234653747163387]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO  2016-12-09 16:24:31,965 [shard 0] repair - Found differing range (-4133815130045531703, -4128528774512080749]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO  2016-12-09 16:24:31,967 [shard 0] repair - Found differing range (-605253169726090861, -600995218487508751]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}
INFO  2016-12-09 16:24:31,968 [shard 0] repair - Found differing range (438510347741343837, 441475345714861354]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}

After:
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-660606535827658284, -656348584589076175]
on nodes {127.0.0.1, 127.0.0.3}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4234255885181099833, -4228969529647648879]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4228969529647648879, -4223683174114197925]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4223683174114197925, -4218396818580746971]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-7728494745277112315, -7725317241463364443]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO  2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-720217853167807818, -715959901929225709]
on nodes {127.0.0.1, 127.0.0.3}, in = {}, out = {127.0.0.3}

Before, we need to fetch data from both node 1 and node 3 and send data back to node 1 and node 3, i.e., 2 IN, 2 OUT

After, we only need to fetch data from node 3, i.e. 0 IN, 1 OUT

We saved 3X traffic, with higher RF, we can save even more.

2) With RF = 3, 3 nodes cluster, rm data on node3 then run repair on node3

Before:
INFO  2016-12-09 16:20:11,448 [shard 0] repair - Found differing range (-8533861887892628919, -8052600134279395253]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO  2016-12-09 16:20:11,465 [shard 0] repair - Found differing range (7190719703944308372, 7692358524564683543]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO  2016-12-09 16:20:11,486 [shard 0] repair - Found differing range (-3305328316052774469, -2671876682129336880]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO  2016-12-09 16:20:11,494 [shard 0] repair - Found differing range (-2190610927722759275, -1305178847032904465]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:20:11,518 [shard 0] repair - Found differing range (-4747032371925842389, -4070378863644120252]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:20:11,519 [shard 0] repair - Found differing range (-1137497074548854552, -592479316010344531]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}

After:
INFO  2016-12-09 16:29:22,433 [shard 0] repair - Found differing range (67885601051654285, 447405341661896387]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:29:22,454 [shard 0] repair - Found differing range (-2190610927722759275, -1305178847032904465]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:29:22,473 [shard 0] repair - Found differing range (2523396860109747637, 3083778975065200884]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:29:22,474 [shard 0] repair - Found differing range (-3305328316052774469, -2671876682129336880]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO  2016-12-09 16:29:22,487 [shard 0] repair - Found differing range (-4747032371925842389, -4070378863644120252]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO  2016-12-09 16:29:22,493 [shard 0] repair - Found differing range (-1137497074548854552, -592479316010344531]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}

This shows the new more generic methods covers the optimization we had before as well.
2016-12-14 09:37:35 +08:00
Asias He
bd1cd53b2a repair: Add hash specialization for partition_checksum
So we can store partition_checksum in std::map as key.
2016-12-14 09:33:16 +08:00
Glauber Costa
2aa6514667 config: get rid of memtable_total_space
Those values are now statically set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 17:05:12 -05:00
Glauber Costa
80440c0d79 database: rework dirty memory hierarchy
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.

That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.

This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors

To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.

Fixes #1918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 14:07:53 -05:00
Glauber Costa
db7cc3cba8 system keyspace: write batchlog mutation in user memory
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:35 -05:00
Glauber Costa
be9e4c71ad database: remove flush_token
We had a flush_token structure in addition to the flush_permit because
we needed to keep a pointer to the dirty_memory_manager and apply
changes to the region group upon the region destruction. Since Tomek's
latest series, this is no longer needed and now this structure doesn't
have a place in the world anymore. Simplify the code by removing it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
98030ad66c database: abstract pressure condition notification
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
c9a8b03311 database: encapsulate semaphore_units into a flush_permit
We will soon need to hold more than a semaphore_units<> object per
flush, potentially.

Preparation patch for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
2e8c7d2c62 database: remove friendship declaration
Not needed anymore since memtable started having a direct pointer to the
memtable list.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
bb1509c21e database: simplify flush_one
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different than the one the flush request originated.

It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Glauber Costa
8ab7c04caa database: make memtable_list aware in cases it can't flush
Some of our CFs can't be flushed. Those are the ones who are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.

We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.

It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 13:59:34 -05:00
Takuya ASADA
0a6312d254 dist/common/scripts/scylla_ntp_setup: fix incorrect usage of is_debian_variant
Use it as "if is_debian_variant; then".
Fixes #1931

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481644262-29383-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:42 +02:00
Takuya ASADA
ed4cd1908f dist/common/scripts/scylla_selinux_setup: correct CentOS/RHEL detection
CentOS/RHEL is using SELinux, and it's NOT Debian variant, so fixed from
"is_debian_variant" to "! is_debian_variant".

Fixes #1930

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643873-28984-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:29 +02:00
Takuya ASADA
6c0dc55495 dist/common/scripts/scylla_selinux_setup: to use is_debian_variant(), need to source /usr/lib/scylla/scylla_lib.sh
This fixes following command not found error:
```
/usr/sbin/scylla_selinux_setup: line 7: is_debian_variant: command not found
```

Fixes #1929

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481643308-28637-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:29:13 +02:00
Takuya ASADA
3b74c50546 dist/ubuntu: add uuidgen to package dependency
We haven't added uuidgen to Ubuntu/Debian package dependency, so scylla_setup
script may abort because of command not found.

Fixes #1928

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1481642385-27941-1-git-send-email-syuu@scylladb.com>
2016-12-13 18:28:48 +02:00
Duarte Nunes
1e75a4950e database: Complete query when hitting partition limit
Currently, we weren't completing a query as early as possible if it
reached the partition limit, we instead had to wait until reaching the
end of the specified partition ranges. This patches fixes that by
including a check to the partition limit in the termination condition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>

Message-Id: <20161213114559.26438-1-duarte@scylladb.com>
2016-12-13 14:53:46 +02:00
Tomasz Grabiec
f451014785 schema: Implement operator<< for column_mapping
Message-Id: <1481310679-14074-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:20:46 +02:00
Tomasz Grabiec
059a1a4f22 db: Fix commitlog replay to not drop cell mutations with older schema
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestation of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.

During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.

Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.

Fixes #1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
2016-12-13 12:19:32 +02:00
Avi Kivity
32d55bbb4c Merge seastar upstream
* seastar 0773e98...6fbd792 (2):
  > tls: Only run our "verify" function in client session
  > Merge "Clean the metric definition" from Amnon

Includes patch from Amnon adjusting the metrics registration due to seastar
API changes.
2016-12-13 12:17:14 +02:00
Avi Kivity
6f9c317b91 Merge "Use uuid file in housekeeping" from Amnon
"This patch adds the use of uuid file to the housekeeping daily version check.
uuid file are optional, if a file is missing no uuid will be used."
2016-12-13 10:52:44 +02:00
Avi Kivity
c67782f169 Merge seastar upstream
* seastar 0a74317...0773e98 (6):
  > tls: Add support for client cetrificate verification & priority strings
  > semaphore: add consume_units
  > semaphore: add available_units()
  > thread: check need_preempt for threads in a scheduling group as well
  > tutorial: fix semaphore example, and text
  > stop_iteration: add && and || operators
2016-12-12 18:06:19 +02:00
Avi Kivity
c801cc4bd1 Merge "streaming and repair updates" from Asias
"This series:
- We can make reader with ranges
- Fix possible use after free of 'si'
- Streaming ranges now are sorted and merged
- Fix shard_begin shard_end end loop in both streaming and repair"
2016-12-12 11:32:42 +02:00
Asias He
ba54654af3 streaming: Use interval_set to sort and merge ranges
So that the ranges are sorted and have no overlaps. We can have less
ranges to deal with and it can help the mutation readers to optimize.

Here is an exmaple of ranges generated by repair:

Before:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    before ranges = {(-3383928698815274642, -3376937163195039606],
    (-7260764223708720005, -7251657821052234309], (-4767213984179237293,
    -4747032371925842389], (-7645879646119667643, -7589962743703481776],
    (-2340199306656526861, -2320523117224780931], (-576028861239229331,
    -560973674020019962], (-4070378863644120252, -3987599893827407860],
    (-2551584407739673151, -2498779102482524711], (-5416061903556353312,
    -5354212455975869358], (37594980457713898, 67885601051654285],
    (3083778975065200884, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (5765437476661317163, 5778671293583720541],
    (5960610072466058818, 5972289771228014343], (7749618183851698485,
    7758080813117351135], (-3987599893827407860, -3899198931034439776],
    (-7251657821052234309, -7131649010279865221], (-3576581915808403133,
    -3383928698815274642], (-417850207760366422, -327959672080599465],
    (-2671876682129336880, -2551584407739673151], (-1305178847032904465,
    -1137497074548854552], (8540448858050275827, 8610171849752115483],
    (-560973674020019962, -417850207760366422], (-2498779102482524711,
    -2340199306656526861], (2394447940525988167, 2523396860109747637],
    (-6703329224557608009, -6517757811218772762], (-3675103288021821677,
    -3576581915808403133], (-5622185785296846551, -5416061903556353312],
    (8610171849752115483, 8742605005068551458], (8068079250973315241,
    8185655671734937642], (560264964510741191, 790641981923757238],
    (5581202487214475094, 5765437476661317163], (8742605005068551458,
    8923908282731801645], (-6038176423022601107, -5622185785296846551],
    (5778671293583720541, 5960610072466058818], (-3899198931034439776,
    -3675103288021821677], (8356739976149429222, 8540448858050275827],
    (-6517757811218772762, -6038176423022601107], (-8052600134279395253,
    -7645879646119667643], (-327959672080599465, 37594980457713898],
    (7758080813117351135, 8019254284118543066], (4781565016737645510,
    5067070718000527886], (2523396860109747637, 3083778975065200884],
    (-5354212455975869358, -4767213984179237293], (6784138025918878582,
    7190719703944308372], (67885601051654285, 447405341661896387],
    (-2190610927722759275, -1305178847032904465], (-4747032371925842389,
    -4070378863644120252]}, size=48

After:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    after  ranges = {(-8052600134279395253, -7589962743703481776],
    (-7260764223708720005, -7131649010279865221], (-6703329224557608009,
    -3376937163195039606], (-2671876682129336880, -2320523117224780931],
    (-2190610927722759275, -1137497074548854552], (-576028861239229331,
    447405341661896387], (560264964510741191, 790641981923757238],
    (2394447940525988167, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (4781565016737645510, 5067070718000527886],
    (5581202487214475094, 5972289771228014343], (6784138025918878582,
    7190719703944308372], (7749618183851698485, 8019254284118543066],
    (8068079250973315241, 8185655671734937642], (8356739976149429222,
    8923908282731801645]}, size=15
2016-12-12 11:09:26 +08:00
Asias He
e523803a5d token_metadata: Introduce interval_to_range helper
It is used to convert a boost::icl::interval<token> interval back to a
range<token>.
2016-12-12 11:09:26 +08:00
Asias He
af3d76e6ac repair: Fix a typo in the log
sucessfully -> successfully
2016-12-12 11:09:26 +08:00
Asias He
374324e6fb repair: Fix shard_begin and shard_end
A range now alternates between different shards: the first part of the
range goes to shard X, the next to shard X+1, but after a while we go
back to shard X. So we can't do a simple loop between shard_begin and
shard_end.

Fix by using the newly introduced dht::split_range_to_shards

Use the cf.make_streaming_reader with ranges to simplify the code a bit.
2016-12-12 11:09:26 +08:00
Asias He
1987264beb streaming: Make streaming reader with ranges
Now that we have the new interface to make readers with ranges, we can
simplify the code a lot.

1) Less readers are needed
before: number of ranges of readers
after: smp::count readers at most

2) No foreign_ptr is needed
There is no need to forward to a shard to make the foreign_ptr for
send_info in the first phase and forward to that shard to execute the
send_info in the second phase.

3) No do_with is needed in send_mutations since si now is a
lw_shared_ptr

4) Fix possible user after free of 'si' in do_send_mutations
We need to take a reference of 'si' when sending the mutation with
send_stream_mutation rpc call, otherwise:
   msg1 got exception
   si->mutations_done.broken()
   si is freed
   msg2 got exception
   si is used again
The issue is introduced in dc50ce0ce5 (streaming: Make the mutation
readers when streaming starts) which is master only, branch 1.5 is not
affected.
2016-12-12 09:04:21 +08:00
Asias He
463cc4fbde dht: Introduce split_ranges_to_shards
Split a ranges into shard ranges map with ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
044c4ff44c dht: Introduce split_range_to_shards
Split a range into shard ranges map with ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
cd2105b8bd database: make_streaming_reader for ranges
Allow to make a streaming reader with a vector of ranges in addition to
a single range. This will be used soon in following streaming patch.

We can make the reader more efficient later.
2016-12-12 09:04:21 +08:00
Duarte Nunes
ada2f1092e dht: Make i_partitioner::tri_compare pure virtual
This patch makes the i_partitioner::tri_compare() function pure
virtual as it is overridden by all partitioners.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211172037.16496-1-duarte@scylladb.com>
2016-12-11 19:29:37 +02:00
Duarte Nunes
bb66b051ed dht: Make i_partitioner::tri_compare memory safe
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min, thus avoiding accessing random
memory and getting random results.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
2016-12-11 18:58:10 +02:00
Amnon Heiman
08dcd8cb4a scylla housekeeping ubuntu service: use uuid file
This patch adds uuid file support for ubuntu system. It also split the
behaviour between restart and daily checks. The first run in r mode and
the second in d mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:35:07 +02:00
Amnon Heiman
6fef24aaf0 housekeeping systemd service: use uuid file
This set the housekeeping systemd service to use a uuid file and use
daily mode.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:02:16 +02:00
Amnon Heiman
17b8306bc4 scylla-housekeeping support uuid file
Allows scylla-housekeeping getting the uuid from a file instead of the
command line.

If the file is missing no uuid will be used.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-12-11 16:00:34 +02:00
Avi Kivity
299d1fad0b Merge "reduce bloom filter overhead in compaction" from Raphael
"Function to calculate maximum purgeable timestamp is made 10 times faster when
compacting sstables overlap with 10% of all sstables.
That's possible with an incremental selector that will incrementally select
sstables based on key being compacted.
Currently, we iterate through all non-compacting sstables and consult their
bloom filter to determine max purgeable timestamp, and that will be very
expensive for compactions that are frequently deciding whether or not to purge
tombstones."

* 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla:
  compaction: reduce bloom filter overhead with incremental selector
  tests: add test for sstable set's incremental selector
  sstable_set: introduce incremental selector
  compatible_ring_position: add function to return token
2016-12-11 09:46:58 +02:00
Glauber Costa
5803957ab5 compaction: fix build
Commit 732ee275 moved tracking of one statistics value inside a lambda
without capturing this in that lambda. Compilation fails as a result.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <68860640f4533dd43e43f341f1620e25464b700b.1481313455.git.glauber@scylladb.com>
2016-12-10 09:00:20 +02:00
Raphael S. Carvalho
fcfc84e836 compaction: reduce bloom filter overhead with incremental selector
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.

Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.

Fixes #1322.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
548f6066c5 tests: add test for sstable set's incremental selector
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
02541e15c1 sstable_set: introduce incremental selector
Incrementally select sstables from sstable set using token
in ascending order.
For leveled strategy, it returns all sstables that belong
to current interval. For other strategies, it just return
all sstables from the set.
Useful for compaction which needs all sstables that overlap
with key being currently compacted to calculate maximum
purgeable timestamp.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:16 -02:00
Glauber Costa
9b5e6d6bd8 commitlog: correctly report requests blocked
The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.

That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.

This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
2016-12-09 15:02:26 +02:00
Raphael S. Carvalho
732ee275f8 compaction: fix running compaction counter when splitting sstables
The counter was being increased before taking the semaphore, so
every pending split would count as a running compaction which
misleads the user as a result.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <f2050cc3599cee7af29d4579368a154708b37731.1481248048.git.raphaelsc@scylladb.com>
2016-12-09 15:01:43 +02:00
Raphael S. Carvalho
453620a316 compatible_ring_position: add function to return token
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-08 14:25:29 -02:00
Avi Kivity
872b5ef5f0 sstables: fix probe with Unknown component
Commit 53b7b7def3 ("sstables: handle unrecognized sstable component")
ignores unrecognized components, but misses one code path during probe_file().

Ignore unrecognized components there too.

Fixes #1922.
Message-Id: <20161208131027.28939-1-avi@scylladb.com>
2016-12-08 15:24:25 +01:00
Glauber Costa
733d87fcc6 database: try to acquire semaphore before we start flush
As Tomek pointed out, as we are starting the flush before we acquire the
semaphore, we are not really limiting parallelism, but only delaying the
end of the flush instead.

Fixes #1919

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <6cbf9ec2f3a341c76becf94f794cfa16539c5192.1481120410.git.glauber@scylladb.com>
2016-12-08 12:18:32 +01:00
Tomasz Grabiec
3511bf4a81 Merge branch 'tgrabiec/memtable-gentle-clearing' from seastar-dev.git
When row cache is disabled, update_cache() will do nothing to the
memtable. Active readers may keep the memtable alive for unbounded
amount of time, preventing it from going away. This doesn't play well
with virtual dirty accounting. Soon before calling update_cache(), the
memory which was subtracted during flush is added back to the amount
of virtual dirty memory. If there was write pressure all along, we
will be at the dirty memory limit. When we give back subtracted memory
this will put virtual dirty way above the limit. This will stall all
writes until another memtable flush drags virtual dirty down or
readers finally release the memtable. We want to prevent upward
jumps of virtual dirty.

First part of the fix is to ensure that as long as the memtable's
region is in the dirty group, we will not revert flushed memory. This
must happen synchronously from region's memory being removed from the
group in order to prevent upward virtual dirty jumps. To make this
easier, tracking of flushed memory was moved to the memtable object.

Another part of the fix is to gradually clear the memtable when cache
is disabled in a similar fashion as when it's moved to cache. This
ensures that the actual memory held by memtable's region is released
sooner than it dies.

Refs #1879
2016-12-08 12:18:32 +01:00
Gleb Natapov
a05516f14c storage_proxy: wire up range_slice_timeouts, range_slice_unavailables and read_unavailables counters
Message-Id: <20161206105154.GL1866@scylladb.com>
2016-12-08 11:42:52 +02:00
Avi Kivity
5530a61975 stables: fix build with older boost (boost::variant::get<T&>)
Older boost doesn't support boost::variant::get<T&> (where the type
parameter is reference qualified); remove (unneeded anyway).
2016-12-08 10:56:05 +02:00
Pekka Enberg
0bc3ce7e09 Merge "sstables: remove sharding metadata from Statistics component" from Avi
"Due to my misreading of Cassandra code, I thought it would ignore new
components in the Statistics component; however, it doesn't, and the change
(introduced in bdd11648ac ("sstables: add
intra-node sharding metadata") breaks sstable2json and likely any
Cassandra code that touches sstables.

To fix, move the sharding data into a new component ("Scylla.db"), which
Cassandra does ignore.  The new component is designed to be extensible so
we don't experience the same issue later on."
2016-12-08 10:14:07 +02:00
Avi Kivity
7f26f9c0f9 Merge "repair refactor and fix" from Asias
* tag 'asias/repair/subranges/refactor_fix/v1' of github.com:cloudius-systems/seastar-dev:
  repair: Limit the number of sub ranges
  repair: Use estimated_keys_for_range in repair_cf_range
  repair: Extract the target_partitions into repair_info class
  repair: Put request_transfer_ranges into repair_info class
  repair: Introduce check_failed_ranges helper
  repair: Introduce do_streaming helper
  repair: Make the neighbors const reference
  repair: Introduce repair_info
  repair: Attach the repair id in the stream plan name
2016-12-08 10:06:39 +02:00
Tomasz Grabiec
f7197dabf8 commitlog: Fix replay to not delete dirty segments
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.

The fix is to record preexisting segents before any new segments will
have a chance to be created and use that as the replay list.

Introduced in abe7358767.

dtest failure:

 commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup

Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
2016-12-07 15:54:47 +02:00
Avi Kivity
4fedbf8430 Merge "service::storage_proxy: rework collectd counters registration" from Vlad
- Add "coordinator" and "replica" categories
   - Use a new seastar/metrics_registration framework

* 'rearrange-storage-proxy-stats-v4' of github.com:cloudius-systems/seastar-dev:
  service::storage_proxy: rework the collectd counters registration
  service/storage_proxy: regroup collectd statistics
2016-12-07 15:38:40 +02:00
Avi Kivity
3c3a18f222 sstables: move sharding metadata from Statistics component to a new Scylla component
The Cassandra derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
2016-12-07 15:20:13 +02:00
Avi Kivity
24140ec8c6 sstables: add support for sets of discriminated union types
Allow declaring discriminated unions (with an enum type as the
discriminant and any sstable serializable type as a value) and sets
of these unions, with the disciminant as the key.  Parsers and writers
are auto-generated.
2016-12-07 13:27:52 +02:00
Avi Kivity
e0cce9d299 Merge "streaming: Improve logging" from Asias
"This seires adds streaming bandwidth and streaming plan name to the log when
streaming is finished."
2016-12-07 12:21:47 +02:00
Amos Kong
f32f7993cc systemd: reset housekeeping timer at each start
Currently housekeeping timer won't be reset when we restart scylla-server.
We expect the service to be run at each start, it will be consistent with
upstart script in Ubuntu 14.04

When we restart scylla-server, housekeepting timer will also be restarted,
so let's replace "OnBootSec" with "OnActiveSec".

Fixes: #1601

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <a22943cc11a3de23db266c52fd476c08014098c4.1480607401.git.amos@scylladb.com>
2016-12-06 18:33:37 +02:00
Takuya ASADA
5a5ab51254 dist/ubuntu/dep: fix incorrect file path to detect previously built .deb existance check
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-4-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
6dd6b868a6 scripts/scylla_install_pkg: support Debian
Supported Debian on installation script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-3-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
7f2df8f86e dist/common/scripts: introduce scylla_lib.sh
To reduce duplicated code and simplified scripts introduce scylla_lib.sh
for shellscripts which provides functions to classify distributions,
and load all sysconfig files.

This also fixes script bugs to misdetect Debian and RHEL.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480667672-9453-2-git-send-email-syuu@scylladb.com>
2016-12-06 12:06:30 +02:00
Takuya ASADA
8464903021 dist/common/systemd/scylla-housekeeping.timer: workaround to avoid crash of systemd on RHEL 7.3
RHEL 7.3's systemd contains known bug on timer.c:
https://github.com/systemd/systemd/issues/2632

This is workaround to avoid hitting bug.

Fixes #1846

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480452194-11683-1-git-send-email-syuu@scylladb.com>
2016-12-06 10:48:28 +02:00
Takuya ASADA
b2c0059da3 dist/common/scripts/scylla_coredump_setup: use systemd-coredump on Ubuntu 16.04
Ubuntu 16.04 has systemd-coredump, better to use it.

Fixes #1916

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480679267-30844-1-git-send-email-syuu@scylladb.com>
2016-12-05 17:09:38 +02:00
Takuya ASADA
2976799ef2 main: fix startup failing on Ubuntu 15.10/16.04
Since Ubuntu 15.10/16.04 still uses Upstart to manage GUI session (not as init), when we directly launch Scylla on Ubuntu's GUI Terminal(not using systemctl or initctl), raise(SIGSTOP) mistakenly calls (Because GUI session has "UPSTART_JOB" environment variable, won't happen when running Scylla as systemd service).

To avoid this, we need to verify UPSTART_JOB == "scylla-server".
If it's part of GUI session UPSTART_JOB has to be "unity7", we need to avoid raise(SIGSTOP) in that case.

Fixes #1199

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480620421-28967-1-git-send-email-syuu@scylladb.com>
2016-12-05 16:28:25 +02:00
Tomasz Grabiec
527ff6aa40 db: Clear memtable after flush when cache is disabled
So that memory is released gradually (impacting latency less) and
sooner than when memtable is destroyed. Active readers may keep the
memtable alive for unbounded amount of time.

Refs #1879
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1bba51319e memtable: Maintain virtual dirty on clear()
When memtable is flushing, it subtracts _flushed_memory from groups's
size to gradually allow more writes. Ideally _flushed_memory would be
equal to region's size when flush ends, so the group's size would
reach zero. When the memtable and its region are gone the group size
should remain the same as after the flush. This is ensured by adding
back _flushed_memory to group's size right before the region is
removed from the group.

Calling clear() before region is removed from the group breaks the
accounting because it will shrink the region, but will not affect the
amount of memory subtracted due to _flushed_memory. So group's size
would decrease more than we want (twice the region's size). The fix is
to change clear() so that it reverts _flushed_memory by the amount by
which the region size is reduced. This will keep the groups's size
constant as long as _flushed_memory > 0.
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1b5f338c17 memtable: Track flushed memory in memtable object 2016-12-05 12:59:09 +01:00
Tomasz Grabiec
c3768fe4de memtable: Pass dirty_memory_manager& to memtable constructor
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:

  boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));

This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
2016-12-05 12:59:09 +01:00
Asias He
00d7a35949 utils: Put crc32 under utils namespace
It conflicts with crc in zlib
Message-Id: <1480918984-4117-2-git-send-email-asias@scylladb.com>
2016-12-05 11:48:29 +02:00
Takuya ASADA
54ea0055fc dist/common/scripts/node_exporter_install: use curl instead of wget
CentOS/Ubuntu contains curl on minimal instllation but wget doesn't,
and we already has dependency for curl, so we should switch to curl.

Fixes #1902

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480929047-2347-1-git-send-email-syuu@scylladb.com>
2016-12-05 11:26:36 +02:00
Asias He
86c2620b7a gossip: Skip stopping if it is not started
If exception is triggered early in boot when doing an I/O operation,
scylla will fail because io checker calls storage service to stop
transport services, and not all of them were initialized yet.

Scylla was failing as follow:
scylla: ./seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local()
[with Service = gms::gossiper]: Assertion `local_is_initialized()' failed.
Aborting on shard 0.
Backtrace:
  0x000000000048a2ca
  0x000000000048a3d3
  0x00007fc279e739ff
  0x00007fc279ad6a27
  0x00007fc279ad8629
  0x00007fc279acf226
  0x00007fc279acf2d1
  0x0000000000c145f8
  0x000000000110d1bc
  0x000000000041bacd
  0x00000000005520f1
  0x00007fc279aeaf1f
Aborted (core dumped)

Refs #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <963f7b0f5a7a8a1405728b414a7d7a6dccd70581.1479172124.git.asias@scylladb.com>
2016-12-05 09:42:37 +02:00
Asias He
49229964d0 streaming: Add streaming plan name when session is failed
Before:
[shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000]
Stream failed, peers={127.0.0.1, 127.0.0.2}

After:
[shard 0] stream_session - [Stream #fc1b66e0-b75b-11e6-b295-000000000000]
Stream failed for streaming plan repair-in-29, peers={127.0.0.1, 127.0.0.2}
2016-12-05 08:20:18 +08:00
Asias He
1c47e26913 streaming: Add streaming plan name when all sessions are completed
Before:
[shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000]
All sessions completed, peers={127.0.0.2}

After:
[shard 0] stream_session - [Stream #e050b710-b758-11e6-9321-000000000000]
All sessions completed for streaming plan repair-in-32, peers={127.0.0.2}
2016-12-05 08:20:18 +08:00
Asias He
984f427cb5 streaming: Log streaming bandwidth
It looks like:

[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.1 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.2 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] Session with 127.0.0.3 is complete, state=COMPLETE
[Stream #f3907fd0-a557-11e6-a583-000000000000] bytes_sent = 393284364, bytes_received = 0, tx_bandwidth = 17.048 MiB/s, rx_bandwidth = 0.000 MiB/s
[Stream #f3907fd0-a557-11e6-a583-000000000000] All sessions completed, peers={127.0.0.1, 127.0.0.2, 127.0.0.3}

Fixes #1826
2016-12-05 08:20:18 +08:00
Asias He
4ae5781e40 repair: Limit the number of sub ranges
A range is diveded into N sub ranges so that each sub range contains 100
partitions. So N depends on the number of partitions in that range.  N
can grow unbounded and the memory usage of vector to hold these sub
ranges can go unbouded.

Limit the max number of sub ranges a range can divided into.

The downside is that the limited sub range will make we include more
partitions in the checksum.

Fixes #1917
2016-12-05 08:12:48 +08:00
Asias He
d850b86145 repair: Use estimated_keys_for_range in repair_cf_range
Use the newly introduced interface to estimate number of partitions in
the range.
2016-12-05 08:05:07 +08:00
Asias He
7b63cbbe0d repair: Extract the target_partitions into repair_info class
We can tune the number on a per repair basis.
2016-12-05 08:05:07 +08:00
Asias He
d9b689321e repair: Put request_transfer_ranges into repair_info class 2016-12-05 08:05:07 +08:00
Asias He
7741393059 repair: Introduce check_failed_ranges helper
To check if there is any failed ranges and log it.
2016-12-05 08:05:07 +08:00
Asias He
f8d7aa597b repair: Introduce do_streaming helper
To execute the stream_plans to sync data between nodes.
2016-12-05 08:05:07 +08:00
Asias He
d0a6290d4f repair: Make the neighbors const reference
We do not modify it. Make it const reference.
2016-12-05 08:05:07 +08:00
Asias He
6d0f6c1a99 repair: Introduce repair_info
To reduce the number of parameters we pass around. Simplify the code a
little bit.
2016-12-05 08:05:06 +08:00
Asias He
9be5170c07 repair: Attach the repair id in the stream plan name
So that we know which repair id this stream plan belongs to.
2016-12-05 08:05:06 +08:00
Tomasz Grabiec
d496dfeced Update seastar submodule
* seastar 7790e68...0a74317 (2):
  > core/reactor: Move definitions out of #ifndef
  > Add systemtap-sdt-devel to fedora dependencies

Fixes #1915.
2016-12-02 10:49:17 +01:00
Vlad Zolotarov
e5e7ac1bd4 service::storage_proxy: rework the collectd counters registration
Use the new seastar's metrics_registration framework:
   - Change the registration syntax.
   - Add a long description for each counter.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-12-01 22:38:09 -05:00
Vlad Zolotarov
3bf12e4ffc service/storage_proxy: regroup collectd statistics
Instead of putting all statistics under the same "storage_proxy" category
separate them into 2 groups according to where the corresponding counters
are updated:
   - "storage_proxy_replica"
   - "storage_proxy_coordinator"

Fixes #1763

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-12-01 22:27:47 -05:00
Glauber Costa
99a5a77234 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will ilustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>
2016-12-01 13:20:46 +01:00
Tomasz Grabiec
570fc0008b scylla-gdb: Fix lookup of symbols in 'scylla ptr'
Message-Id: <1480529617-26564-1-git-send-email-tgrabiec@scylladb.com>
2016-12-01 12:33:29 +02:00
Raphael S. Carvalho
b30a2cb21a lcs: generate info that preserves token distribution in higher levels
The information (last compacted keys) is lost after node is restarted
or schema is updated, which causes strategy to be rebuilt.
We need it for strategy to guarantee uniform distribution of token
range across sstables, or we could end up with 1 sstable of level L
overlapping with lots of sstables of level L+1, and that results in
a compaction of undesired length.
That information can be generated from scratch by getting last key
of newest sstable in each level > 0.

Fixes #1906.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ebd15977d5a8418239febb160c796cdc0e98fa.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:58 +02:00
Raphael S. Carvalho
38743c1948 sstables: provide write time of data component
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <59686148149f2159990329775e0cd8780bc54254.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:57 +02:00
Glauber Costa
d7256e7b21 database: do not call seal directly from the streaming timer
Streaming memtable have a delayed mode where many flushes are coalesced
together into one, with the actual flush happening later and propagated
to all the previous waiters.

However, the timer that triggers the actual flush was not using the
newly introduced flush infrastructure. This was a minor problem because
those flushes wouldn't try to take the semaphore, and so we could have
many flushes going on at the same time.

What was a potential performance issue became a correctness issue when
we moved the reversal of the dirty memory accounting out of
revert_potentially_cleaned_up_memory() into remove_from_flush_manager().

Since the latter is only called through the flush infrastructure, it
simply wasn't called. So the deferral of the reversal exposed this bug.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>
2016-11-30 18:00:55 +01:00
Tomasz Grabiec
c35e18ba12 tests: Fix use-after-free on commitlog
Only shutdown() ensures all internal processes are complete. Call it before calling clear().

Message-Id: <1480495534-2253-1-git-send-email-tgrabiec@scylladb.com>
2016-11-30 11:03:26 +02:00
Avi Kivity
281b4c64ea Update ami submodule
* dist/ami/files/scylla-ami 25e101f...d5a4397 (1):
  > scylla_install_ami: allow specify different repository for Scylla installation and receive update
2016-11-29 19:26:49 +02:00
Takuya ASADA
17ef5e638e dist/ami: allow specify different repository for Scylla installation and receive update
This fix splits build_ami.sh --repo to three different options:
 --repo-for-install is for Scylla package installation, only valid
 during AMI construction.

 --repo-for-update will be stored at /etc/yum.repos.d/scylla.repo, to
 receive update package on AMI.

 --repo is both, for installation and update.

Fixes #1872

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480438858-6007-1-git-send-email-syuu@scylladb.com>
2016-11-29 19:26:07 +02:00
Avi Kivity
5ea235e3e8 Merge "Prevent overloading memory with timed out writes" from Tomasz
"The goal of this series is to prevent unbounded memory use
in cases when requests are timing out. Write requests which timed
out may still occupy memory for a while because of local mutation
application. This memory is not accounted for and can build up.

First part of the fix changes local mutation application so that it times out
at about the same time as the request handler. Then the life
time of the request handler is extended to cover any background activity
of that request which hasn't timed out yet. This has two main effects:

  (1) by timing out local writes we prevent build up of background activity
      for timed out requests

  (2) we ensure that memory used by background activity is not left
      behind unaccounted for. This will prevent CQL server from admitting
      more requests than memory usage limit allows.

Fixes #1756."

* tag 'tgrabiec/prevent-oom-on-timeouts-v5' of github.com:cloudius-systems/seastar-dev:
  storage_proxy: Do not flood logs with timeout errors
  database: Add counter for timed out writes
  storage_proxy: Delay timeout response until background work ceases
  storage_proxy: Propagate timeout to local writes
  storage_proxy: Use shared ownership for abstract_write_response_handler
  storage_proxy: Add counter for all alive write handlers
  db: Allow writes to be timed out
  db: Introduce counters for failed reads and writes
  commitlog: Allow allocations to be timed out
  utils/logalloc: Add ability to timeout run_when_memory_available() task
  utils/flush_queue: Add ability to wait with a timeout
2016-11-29 18:55:52 +02:00
Avi Kivity
28a5ff51cb dist: add build dependency on systemtap-sdt
Needed to newer seastar.
2016-11-29 18:49:51 +02:00
Tomasz Grabiec
48bbd6733c storage_proxy: Do not flood logs with timeout errors
Timeout errors are flooding the log after local mutate can time
out. We don't log remote mutate timeouts, so for consistency we won't
log local ones as well.

There is a database counter for timed out writes which can be
consulted in order to check if they're occuring.

Perhaps this would be better solved by a generic log message
throttling/coalescing mechanism, but that's not ready yet.
2016-11-29 16:40:59 +01:00
Tomasz Grabiec
b5d5612f98 database: Add counter for timed out writes 2016-11-29 16:40:59 +01:00
Tomasz Grabiec
14cb31f69a storage_proxy: Delay timeout response until background work ceases
Write requests which timed out may still occupy memory for a while due
to local write. It should time out soon as well but there is a time
window in which it has not yet. If we don't delay timeout response,
the request would be seen as not consuming any memory too early. This
in turn would cause the CQL server to allow more requests than we
want. In some cases causing OOM or exceeding memory limits and causing
excessive cache eviciton.

Fixes #1756.
2016-11-29 16:40:59 +01:00
Tomasz Grabiec
ba3779802f storage_proxy: Propagate timeout to local writes 2016-11-29 16:40:59 +01:00
Tomasz Grabiec
6d195a1538 storage_proxy: Use shared ownership for abstract_write_response_handler 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
5805330d98 storage_proxy: Add counter for all alive write handlers
Currently the counter uses _response_handlers.size(), but after later
patches we may have an active (timed out) write with no response
handler, so count live instances instead.
2016-11-29 16:40:58 +01:00
Tomasz Grabiec
2c561ecaed db: Allow writes to be timed out 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
b1ae6ad2ad db: Introduce counters for failed reads and writes 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
31645e2c4a commitlog: Allow allocations to be timed out 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
e14caaef60 utils/logalloc: Add ability to timeout run_when_memory_available() task 2016-11-29 16:40:58 +01:00
Tomasz Grabiec
61d81617e1 utils/flush_queue: Add ability to wait with a timeout 2016-11-29 16:40:58 +01:00
Raphael S. Carvalho
a16425833c size_tiered: do not recreate bucket when it goes beyond max threshold
Problem will cause size tiered to return small jobs when there are
more than max_threshold sstables of similar size. For example, if
max_threshold is 32, and there are 36 sstables of similar size,
strategy will only return 4 sstables to be compacted. That's because
we incorrectly create a new bucket when it meets the max threshold.
What we should do is to allow buckets to grow beyond max threshold
and trim them when selecting the most suitable one for compaction.

Important to mention that estimation for size tiered will now
work better when there are more than max_threshold sstables of
similar size.

Fixes #1901.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <080bad70d6cb86eaf52ac1bdd6765ac47aab5b03.1478316140.git.raphaelsc@scylladb.com>
2016-11-29 16:56:02 +02:00
Glauber Costa
353a4cd2d4 commitlog: sync segments before acquiring semaphore on shutdown.
Sync all segments before acquiring the semaphore, otherwise waiting may
have to wait for the timer to kick in and push them down.
Note that we can't guarantee that no other requests were executed in the
mean time, so we have to sync again.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>
2016-11-29 11:07:28 +02:00
Tomasz Grabiec
96c7764458 Revert "prevent commitlog replay position reordering during reserve refill"
This reverts commit 0e9b75d406.

commitlog_test fails with this:

Running 14 test cases...
ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed
ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN  2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required)

*** 1 failure detected in test suite "tests/commitlog_test.cc"
WARN  2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
2016-11-28 20:52:13 +01:00
Raphael S. Carvalho
f141b0cdae database: atomically add new sstables to cf when refreshing
New sstables are loaded and added in parallel, meaning that scylla can
potentially return stale data if a new sstable containing a tombstone
wasn't loaded yet. Compaction should also not run until all new sstables
are added for similar reasons.

Fix is about separating blocking and non-blocking steps to allow
atomic add of multiple new sstables.

Fixes #1368.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>
2016-11-28 20:30:48 +02:00
Glauber Costa
0e9b75d406 prevent commitlog replay position reordering during reserve refill
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will ilustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>
2016-11-28 19:26:26 +01:00
Takuya ASADA
1042e40188 dist/common/scripts/scylla_kernel_check: fix incorrect document URL
Fixes #1871

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480327243-18177-1-git-send-email-syuu@scylladb.com>
2016-11-28 13:51:19 +02:00
Avi Kivity
18df2d9e9e partition_version: fix const correctness in rows_entry_compare
Using a non-const-correct comparator results in build failures with
boost 1.55.

Fixes #1892.
Message-Id: <20161128104335.28789-1-avi@scylladb.com>
2016-11-28 10:55:12 +00:00
Avi Kivity
5358984982 Merge seastar upstream
* seastar 93c3b12...7790e68 (7):
  > core/reactor: Introduce reactor-*/dervie-busy_ns metric
  > Collectd: Hold a reference to the metrics implementation in registration
  > future: Improve comments
  > fstream: actually use dynamically adjusted buffer
  > debug: add latency detector script
  > reactor: add static probes for latency detector
  > semaphore: Fix with_semaphore() in case wait() throws
2016-11-28 11:05:59 +02:00
Avi Kivity
28857e42e7 Merge " Virtualize size_estimates system table" from Duarte
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.

This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.

Fixes #1616"

* 'virtual-estimates/v4' of github.com:duarten/scylla:
  size_estimates_virtual_reader: Add unit test
  db: Delete size_estimates_recorder
  size_estimates: Add virtual reader
  column_family: Add support for virtual readers
  storage_service: get_local_tokens() returns a future
  nonwrapping_range: Add slice() function
  range: Find a sequence's lower and upper bounds
  system_keyspace: Build mutations for size estimates
  size_estimates: Store the token range as bytes
  range_estimates: Add schema
  murmur3_partitioner: Convert maximum_token to sstring
2016-11-28 10:12:59 +02:00
Avi Kivity
176fca5775 logalloc: use correct header for unique_ptr
<bits/unique_ptr.hh> is a libstdc++ internal header.  USe <memory> instead.
2016-11-27 23:08:04 +02:00
Glauber Costa
c32803f2f0 database: move reversion of virtual dirty state closer to update_cache.
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.

During that time, we may not accept new requests. Sealing can take a
long time, specially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.

This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
2016-11-24 18:18:15 +01:00
Raphael S. Carvalho
4781b6eb71 sstables: use nonwrapping_range::make to avoid compilation issues
GCC 5.3.1 was unable to convert bound to optional<bound>.

sstables/sstables.cc:2494:123: error: no matching function for call to
‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position,
dht::ring_position)’
(dtr.right.exclusive ? dht::ring_position::starting_at :
dht::ring_position::ending_at)(std::move(t2)));

In file included from ./dht/i_partitioner.hh:52:0,
                 from ./query-request.hh:28,
                 from ./clustering_key_filter.hh:27,
                 from sstables/sstables.hh:35,
                 from sstables/sstables.cc:38:
./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range(
const wrapping_range<U>&) [with T = dht::ring_position]
     explicit nonwrapping_range(const wrapping_range<T>& r)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>
2016-11-24 11:26:16 +02:00
Duarte Nunes
cc3f26c993 lz4: Conditionally use LZ4_compress_default()
Since not all distributions have a version of LZ4 with
LZ4_compress_default(), we use it conditionally.

This is specially important beginning with version 1.7.3 of LZ4,
which deprecates the LZ4_compress() function in favour of
LZ4_compress_default() and thus prevents Scylla from compiling
due to the deprecated warning.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161124092339.23017-1-duarte@scylladb.com>
2016-11-24 11:25:03 +02:00
Avi Kivity
1be95b1227 Merge seastar upstream
* seastar d6f26d8...93c3b12 (3):
  > rpc: Conditionally use LZ4_compress_default()
  > queue: allow queue to change its maximum size
  > util/defer: add missing return to move assignment
2016-11-24 11:00:53 +02:00
Duarte Nunes
a527ba285f thrift: Don't apply cell limit across rows
In Thrift, SliceRange defines a count that limits the number of cells
to return from that row (in CQL3 terms, it limits the number of rows
in that partition). While this limit is honored in the engine, the
Thrift layer also applies the same limit, which, while redundant in
most cases, is used to support the get_paged_slice verb.

Currently, the limit is not being reset per Thrift row (CQL3
partition), so in practice, instead of limiting the cells in a row,
we're limiting the rows we return as well. This patch fixes that by
ensuring the limit applies only within a row/partition.

Fixes #1882

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161123220001.15496-1-duarte@scylladb.com>
2016-11-24 10:38:31 +02:00
Takuya ASADA
ce80fb3a39 dist/ubuntu: increase number of open files on Ubuntu 14.04(upstart)
Follow the change of NOFILE for non-systemd environment.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479975050-14907-1-git-send-email-syuu@scylladb.com>
2016-11-24 10:13:41 +02:00
Avi Kivity
d58c8aaa32 db: remove unused belongs_to_{current,other}_shard(s) functions
Obsoleted by new sharding mechanism, but break the build for some.
2016-11-23 21:39:29 +02:00
Avi Kivity
b81a57e8eb config, dht: reduce default msb ignore bits to 4
With the default value of 12, a node's range is partitioned into
4096 * smp::count sub-ranges which are queried sequentually for a range
scan.  If the number of rows in the table is smaller than the required
result size, we will query all of them.  This can take so long that we
time out.

A better fix is to query multiple sub-ranges in parallel and merge them,
but for that we need to resurrect the non-sequential merger.
2016-11-23 21:25:37 +02:00
Pekka Enberg
c526a9f0be Update seastar submodule
* seastar 7473945...d6f26d8 (2):
  > semaphore_units: add missing return statement
  > metrics: Do not detroy the metrics layer if it is been used
2016-11-23 20:27:09 +02:00
Paweł Dziepak
919825a2c7 Merge "Improve sharding in large clusters" from Avi
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively.  Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.

This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes.  Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.

In order to prevent re-sharding sstables on each boot, token ranges for
the range are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.

Fixes #1277."
2016-11-23 11:25:53 +00:00
Glauber Costa
18b9fa3d43 dist: increase number of open files
This limit was found to be too low for production environments. It would
be hit at boot, when we're touching a lot of files from multiple shards
before deciding that we don't need them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <87bbf43da1a67f5fa6174017205c6ef8bdb0dc3d.1479829232.git.glauber@scylladb.com>
2016-11-23 13:10:25 +02:00
Avi Kivity
07d5a20bae Wire up sharding ignore msb parameter to configuration
We might have used a fancy map<sstring, any> to pass the parameters, but
that's overkill for now.
2016-11-22 22:40:47 +02:00
Avi Kivity
8b1d689de8 partitioner: add ignore_msb parameters to byte ordered and random partitioners
Ignored; doesn't make sense on byte ordered, and random is deprecated.
2016-11-22 21:56:42 +02:00
Avi Kivity
af16c0fac4 murmur3_partitioner: shard on the middle token bits, not most significant bits
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.

This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's hard, so we use token bits which still have
some entropy.  In effect, with changes the token range layout from

   shard 0
   shard 1
   ...
   shard S-1

to

   shard 0
   shard 1
   ...
   shard S-1

   shard 0
   shard 1
   ...
   shard S-1

   ...

   shard 0
   shard 1
   ...
   shard S-1

Where the number of repetitions of the block is 2^(ignored msb bits).

For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
2016-11-22 21:56:42 +02:00
Avi Kivity
024c8ef8a1 db: adjust sstable load to use sstable self-reporting of shard ownership
Instead of calculating the owning shard from the sstable's partition
key range, delegate to the new sstable method for getting owning shard
infomation.  This insulates us from changes in the sharding algorithm.
2016-11-22 21:56:40 +02:00
Avi Kivity
98a4544e1c sstables: add method to get sstable owning shards from an unloaded sstable
When we load an sstable, we don't know beforehand which shards it belongs
to; we don't want to open it until we do.  Add a method that allows us
to read just the sharding data, without opening anything else.
2016-11-22 21:52:23 +02:00
Avi Kivity
bdd11648ac sstables: add intra-node sharding metadata
Add a metadata component that describes token ranges that are spanned by
this sstable.  With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describing sharding information, but for multi-range algorithms, this
is not sufficient.
2016-11-22 21:44:25 +02:00
Avi Kivity
316ef1d70a sstables: automate writing statistics components
Add a virtual funnction to metadata_base so we can loop over statistics
components when writing them.
2016-11-22 21:05:06 +02:00
Glauber Costa
13973e7f3b keep background work semaphore alive during sstable flush
We have a semaphore controlling the amount of background work generated
by the memtable flush process. However, because we are not moving it
inside the memtable post-flush continuation, the units are being
released when we star the flush and not when we finish it.

That's not the intended behavior and that can cause flushes to
accumulate.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>
2016-11-22 19:54:08 +01:00
Avi Kivity
d05b22e502 sstables: automatically calculate offsets in statistics
Instead of calculating the offset for each statistic component manually,
use a loop to iterate over all components, accumulating the offset as we
go along.
2016-11-22 20:35:24 +02:00
Avi Kivity
7c5e6525ef sstables: switch statistics components to generic serialized_size() implementation 2016-11-22 20:20:38 +02:00
Avi Kivity
096ae59a5b sstables: introduce generic serialized_size()
Introduce a new function that reuses the file_writer code to compute
the serialized size of an sstable object, by serializing it into memory
and discarding the result.
2016-11-22 20:06:23 +02:00
Avi Kivity
3c06ffac9d sstables: const correctness for the write(file_writer&, T&) functions
write() doesn't need to change its input; so change it to const.

The only snag is that describe_type() isn't and can't be made const-correct,
so cheat when it is called and const_cast the input.

This helps in writing a generic serialized_size() that is const correct,
in the next patch.
2016-11-22 20:04:27 +02:00
Tomasz Grabiec
eefc538225 Update seastar submodule
* seastar 7504026...7473945 (1):
  > Merge "Improve support for timeouts in primitives"
2016-11-22 17:51:29 +01:00
Glauber Costa
0b8b5abf16 commitlog: acquire semaphore earlier
Recently we have changed our shutdown strategy to wait for the
_request_controller semaphore to make sure no other allocations are
in-flight. That was done to fix an actual issue.

The problem is that this wasn't done early enough. We acquire the
semaphore after we have already marked ourselves as _shutdown and
released the timer.

That means that if there is an allocation in flight that needs to use a
new segment, it will never finish - and we'll therefore neve acquire
the semaphore.

Fix it by acquiring it first. At this point the allocations will all be
done and gone, and then we can shutdown everything else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>
2016-11-21 22:19:32 +00:00
Avi Kivity
6bdb8ba31d storage_proxy: don't query concurrently needlessly during range queries
storage_proxy has an optimization where it tries to query multiple token
ranges concurrently to satisfy very large requests (an optimization which is
likely meaningless when paging is enabled, as it always should be).  However,
the rows-per-range code severely underestimates the number of rows per range,
resulting in a large number of "read-ahead" internal queries being performed,
the results of most of which are discarded.

Fix by disabling this code. We should likely remove it completely, but let's
start with a band-aid that can be backported.

Fixes #1863.

Message-Id: <20161120165741.2488-1-avi@scylladb.com>
2016-11-21 18:19:46 +02:00
Glauber Costa
0ca8c3f162 database: keep a pointer to the memtable list in a memtable
We current pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.

Doing that also has a couple of advantages. Mainly, during flush we must
get to a memtable to a memtable_list. Currently we do that by going to
the memtable to a column family through the schema, and from there to
the memtable_list.

That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep
pointer, and when needed get the memtable_list directly.

Not only that gets rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this transversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.

At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.

The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.

Fixes #1870

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
2016-11-21 18:18:27 +02:00
Duarte Nunes
def2bc72b0 size_estimates_virtual_reader: Add unit test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
6a37d87c76 db: Delete size_estimates_recorder
Now that access to the size_estimates system is virtualized, we no
longer need the recorder.

Fixes #1616

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
225648780d size_estimates: Add virtual reader
This patch add a virtual mutation_reader so that queries
to the size_estimates system table are handled by the engine
without needing to perform any IO.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
cd7e2fd602 column_family: Add support for virtual readers
Virtual readers allow queries to selected tables, usually system
tables, to be answered by the engine. This is useful for tables which
aren't written by users and whose contents can be calculated on
demand.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:05 +00:00
Duarte Nunes
c0d450c57d storage_service: get_local_tokens() returns a future
This patch changes the get_local_tokens() function in storage_service
to return a future instead of requiring running under a seastar::thread.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
9b384d375f nonwrapping_range: Add slice() function
This patch add the slice() function to nonwrapping range, which uses
its bounds to slice an input sequence.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
bdba8d99c3 range: Find a sequence's lower and upper bounds
This patch extracts a pair of functions from mutation_partition to
calculate the lower and upper bounds of a sequence from a
nonwrapping_range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
636287fdf2 system_keyspace: Build mutations for size estimates
This patch adds a function to system_keyspace responsible for creating
a mutation to a partition of the size_estimates system table from a
set of range_estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:15:04 +00:00
Duarte Nunes
18ddec245e size_estimates: Store the token range as bytes
This patch changes the range_estimates struct so that the tokens are
represented as utf8 encoded bytes. This will make future patches
require less conversions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 11:14:21 +00:00
Duarte Nunes
e7a5162c1d range_estimates: Add schema
This will be used in future patches, when virtualizing the
size_estimates system table.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Duarte Nunes
01815ecd24 murmur3_partitioner: Convert maximum_token to sstring
This patch ensures we can convert the maximum_token to an sstring.
For Cassandra, the minimum and maximum tokens have the same
representation. So, we use the string representation of the
maximum_token for the maximum_token.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Takuya ASADA
eee63027e5 dist/ami/build_ami.sh: update base AMI to CentOS7-Base5
To drop unnecessary .ssh/authozied_keys, we need to update base AMI.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479496938-29724-1-git-send-email-syuu@scylladb.com>
2016-11-21 10:12:47 +02:00
Avi Kivity
783729c540 Merge "Clean up T::memory_usage() function" from Paweł
"This series is just a cleanup which intention is to deal with all
confusion related to the way T::memory_usage() functions work.

* T::memory_usage() which returned external memory usage are renamed
  to T::external_memory_usage()
* T::memory_usage() is introduced where needed to avoid repeating
  sizeof(T) + T::external_memory_usage()"

Paweł Dziepak (6):
  rename memory_usage() to external_memory_usage() where applicable
  streamed_mutation: add memory_usage() to mutation fragment types
  keys: add memory_usage()
  partition_snapshot_accounter: use range_tombstone::memory_usage()
  mutation_rebuilder: use memory_usage()
  frozen_mutation: use memory_usage()
2016-11-21 10:11:39 +02:00
Avi Kivity
498887ca0d Merge seastar upstream
* seastar 31c5fd7...7504026 (2):
  > circular_buffer: add move assignment operator
  > scollectd: Fix serialization of GAUGE-typed values
2016-11-20 20:16:56 +02:00
Gleb Natapov
9222a47fed sstable test: add test for generated summary data
Message-Id: <20161117155051.GV6765@scylladb.com>
2016-11-20 19:50:45 +02:00
Glauber Costa
21c1e2b48c commitlog: wait for pending allocations to finish before closing gate.
allocations may enter the gate, so it would be wise for us to wait for them.

Fixes #1860

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <53cd6996c1cbd8b38bab3b03604bd11e5c20beda.1479650012.git.glauber@scylladb.com>
2016-11-20 19:45:33 +02:00
Avi Kivity
a39b92a40a build: fix tests-with-symbols generation
Bad indentation caused the libs variable for tests-with-symbols to be
overwritten, resulting in link failure.
2016-11-20 17:23:26 +02:00
Glauber Costa
504b5ac30f database: don't check for waiters in the condition variable predicate.
In the last iterations of this patchset, we have moved explicit flushes
to acquire the semaphore directly and the coalescing inside the
memtable_list. As a result, we are no longer keeping any kind of action
for them inside the condition variable. Checking for them has no longer
a purpose.

This is a cleanup patch that remove does checks.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>
2016-11-18 21:34:48 +01:00
Glauber Costa
1933349654 database: fix direct flushes of non-durable column families.
If a Column Family is non-durable, then its flushes will never create a
memtable flush reader. Our current flush logic depends on that being
created and destroyed to release the semaphore permits on the flush.

We will remove the permits ourselves it there is an exception, but not
under normal circumnstances. Given this issue, however, it would be more
adequate to always try to remove the permits after we flush. If the
permits were already removed by the flush reader, then this test will
just see that the permit is not in the map and return. But if it is
still there, then it is removed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>
2016-11-18 21:32:29 +01:00
Avi Kivity
6eecbc80dc CONTRIBUTING.md: add sections for help and issues
Don't scare away users reporting an issue with the CLA.
2016-11-18 22:21:10 +02:00
Glauber Costa
60b7d35f15 commitlog: close file after read, and not at stop
There are other code paths that may interrupt the read in the middle
and bypass stop. It's safer this way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8c32ca2777ce2f44462d141fd582848ac7cf832d.1479477360.git.glauber@scylladb.com>
2016-11-18 14:09:33 +00:00
Paweł Dziepak
249e0ab087 frozen_mutation: use memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
948c062e64 mutation_rebuilder: use memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
e04664e851 partition_snapshot_accounter: use range_tombstone::memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
711bd19f16 keys: add memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
6b8bf030c0 streamed_mutation: add memory_usage() to mutation fragment types
This patch introduces memory_usage() to static_row, clustering_row and
range_tombstone so that we can avoid repeating sizeof(T) +
x.external_memory_usage().

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Paweł Dziepak
ef57b9a26f rename memory_usage() to external_memory_usage() where applicable
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Avi Kivity
fec4ef3390 Merge "Make sure commitlog replay is able to make progress" from Glauber
"Fixes #1856

Commitlog replay reads are being issued without a priority. That means
they will lose to compaction every time."

* 'issue-1856-v2' of github.com:glommer/scylla:
  commitlog: use read ahead for replay requests
  commitlog: use commitlog priority for replay
  commitlog: close replay file
2016-11-18 12:04:18 +02:00
Takuya ASADA
55e5123313 dist/redhat: Support RHEL7
We supported install CentOS7 .rpm on RHEL7, but we haven't supported
building on RHEL7, since there is little difference between CentOS,
and that causes build error.

This patch fixes the error, now we can produce .rpm for RHEL7 wihout
using CentOS.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479431134-8032-1-git-send-email-syuu@scylladb.com>
2016-11-18 11:56:05 +02:00
Glauber Costa
461778918b fix shutdown and exception conditions for flush logic
This patch addresses post-merge follow up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
  the explicit flushes no longer wait on the condition variable. So we
  don't.
- We now wait on the stop() flushes (regardless of their return status)
  so we can make sure that the _flush_queue will indeed be done with.
- we acquire the semaphore before shutting down the dirty_memory_manager
  to make sure that there are no pending flushes
- the flush manager that holds the semaphore has to match in the exception
  handler

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
2016-11-17 21:16:44 +01:00
Glauber Costa
59a41cf7f1 commitlog: use read ahead for replay requests
Aside from putting the requests in the commitlog class, read ahead
will help us going through the file faster.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:54 -05:00
Glauber Costa
aa375cd33d commitlog: use commitlog priority for replay
Right now replay is being issued with the standard seastar priority.
The rationale for that at the time is that it is an early event that
doesn't really share the disk with anybody.

That is largely untrue now that we start compactions on boot.
Compactions may fight for bandwidth with the commitlog, and with such
low priority the commitlog is guaranteed to lose.

Fixes #1856

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 14:09:02 -05:00
Glauber Costa
4d3d774757 commitlog: close replay file
Replay file is opened, so it should be closed. We're not seeing any
problems arising from this, but they may happen. Enabling read ahead in
this stream makes them happen immediately. Fix it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-17 12:35:24 -05:00
Avi Kivity
eaf83ab59c Merge seastar upstream
* seastar 3001c08...31c5fd7 (2):
  > Safe use of collectd during shutdown
  > udp: abort reader and writer when udp channel close
2016-11-17 18:44:28 +02:00
Piotr Jastrzebski
9d33948487 mutation_rebuilder: fix fragment size calculation
It wasn't calculating the size of data correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c03dfff7bf1ca3199991e5864189f98bfa2942ea.1479397736.git.piotr@scylladb.com>
2016-11-17 16:23:42 +00:00
Raphael S. Carvalho
3dc9294023 db: do not leak deleted sstable when deletion triggers an exception
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). A sstable only has its files closed when its
object is destructed.

The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.

We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.

Fixes #1840.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
2016-11-17 17:46:36 +02:00
Gleb Natapov
c052a1bc4f sstable: use schema's min_index_interval config when generating missing summary
Message-Id: <20161116181937.GA25303@scylladb.com>
2016-11-17 15:24:03 +02:00
Avi Kivity
5d067eebf2 Merge "get rid of memtable size parameter and rework flush logic" from Glauber
"This patchset allows Scylla to determine the size of a memtable instead
of relying in the user-provided memtable_cleanup_threshold. It does that
by allowing the region_group to specify a soft limit which will trigger
the allocation as early as it is reached.

Given that, we'll keep the memtables in memory for as long as it takes
to reach that limit, regardless of the individual size of any single one
of them. That limit is set to 1/4 of dirty memory. That's the same as
last submission, except this time I have run some experiments to gauge
behavior of that versus 1/2 of dirty memory, which was a preferred
theoretical value.

After that is done, the flush logic is reworked to guarantee that
flushes are not initiated if we already have one memtable under flush.
That allow us to better take advantage of coalescing opportunities with
new requests and prevents the pending memtable explosion that is
ultimately responsible for Issue 1817.

I have run mainly two workloads with this. The first one a local RF=1
workload with large partitions, sized 128kB and 100 threads. The results
are:

Before:

op rate                   : 632 [WRITE:632]
partition rate            : 632 [WRITE:632]
row rate                  : 632 [WRITE:632]
latency mean              : 157.8 [WRITE:157.8]
latency median            : 115.5 [WRITE:115.5]
latency 95th percentile   : 486.7 [WRITE:486.7]
latency 99th percentile   : 534.8 [WRITE:534.8]
latency 99.9th percentile : 599.0 [WRITE:599.0]
latency max               : 722.6 [WRITE:722.6]
Total partitions          : 189667 [WRITE:189667]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

After:

op rate                   : 951 [WRITE:951]
partition rate            : 951 [WRITE:951]
row rate                  : 951 [WRITE:951]
latency mean              : 104.8 [WRITE:104.8]
latency median            : 102.5 [WRITE:102.5]
latency 95th percentile   : 155.8 [WRITE:155.8]
latency 99th percentile   : 177.8 [WRITE:177.8]
latency 99.9th percentile : 686.4 [WRITE:686.4]
latency max               : 1081.4 [WRITE:1081.4]
Total partitions          : 285324 [WRITE:285324]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

The other workload was the workload described in #1817. And the result
is that we now have a load that is very stable around 100k ops/s and
hardly any timeouts, instead of the 1.4 baseline of wild variations
around 100k ops/s and lots of timeouts, or the deep reduction of
1.5-rc1."

* 'issue-1817-v4' of github.com:glommer/scylla:
  database: rework memtable flush logic
  get rid of max_memtable_size
  pass a region to dirty_memory_manager accounting API
  memtable: add a method to expose the region_group
  logalloc: allow region group reclaimer to specify a soft limit
  database: remove outdated comment
  database: uphold virtual dirty for system tables.
2016-11-17 14:36:43 +02:00
Avi Kivity
18078bea9b storage_proxy: avoid calculating digest when only one replica is contacted
If we're talking to just one replica, the digest is not going to be used,
so better not to calculate it at all.  The optimization helps with
LOCAL_ONE queries where the result is large, but does not contain large
blobs (many small rows).

This patch adds a digest_algorithm parameter to the READ_DATA verb that
can take on two values: none and MD5 (default), and sets it to none when
we're reading from one replica.

In the future we may add other values for more hardware-friendly digest
algorithms.
Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>
2016-11-17 13:04:30 +02:00
Asias He
dc50ce0ce5 streaming: Make the mutation readers when streaming starts
Currenlty we make the mutation readers for streaming at different
time point, i.e.,

do_for_each(_ranges.begin(), _ranges.end(), [] (auto range) {
     make a mutation reader for this range
     read mutations from the reader and send
})

If there are write workload in the background, we will stream extra
data, since the later the reader is made the more data we need to send.

Fix it by making all the readers before starting to stream.

Fixes #1815
Message-Id: <1479341474-1364-2-git-send-email-asias@scylladb.com>
2016-11-17 12:41:53 +02:00
Gleb Natapov
ae0a2935b4 sstables: fix ad-hoc summary creation
If sstable Summary is not present Scylla does not refuses to boot but
instead creates summary information on the fly. There is a bug in this
code though. Summary files is a map between keys and offsets into Index
file, but the code creates map between keys and Data file offsets
instead. Fix it by keeping offset of an index entry in index_entry
structure and use it during Summary file creation.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
2016-11-17 11:05:23 +02:00
Glauber Costa
f08162e181 database: rework memtable flush logic
The way we currently flush memtables, we seal the current one but wait
on a semaphore for the actual flush to proceed.

This is pointless, because if the flush is not proceeding we'll use up
memory for the new entries anyway, be them in a newly opened memtable or
not. As a matter of fact, by opening a new memtable we are foregoing
coalescing opportunities.

After recent changes to the flush paths, we are now in a position to do
differently. We move the semaphore earlier, and if we can't acquire it
we keep appending to the current memtable.

For explicit flushes, we'll queue and prioritize them over memory-based
flushes. This has the nice property of potentially coalescing various
flushes for the same CF into one.

Coalescing flushes for the same CF is particularly helpful for
commitlog-initiated flushes that can't complete within the flush period.
What we see currently, is that under heavy load the commitlog will keep
sealing memtables adding to the existing load.

Another interesting property of this approach is that we can keep the
disk utilization higher, by allowing a new flush to start before the
memtable is fully sealed. By design, every time a memtable is finished
flushing it will call revert_potentially_cleaned_up_memory() to revert
the virtual memory charges.  That is the perfect moment for us to act.
It indicates that all the data flushing part is done.

The way we'll do it is by keeping the semaphore_units alive for this
memtable. When the flush ends, we destroy that object. This will
effectively trigger the next flush if there is a next flush that can be
initiated.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:58 -05:00
Glauber Costa
895e838ac0 get rid of max_memtable_size
After recent changes to the memtable code, there is no reason for us to
uphold a maximum memtable size. Now that we only flush one memtable at a
time anyway, and also have soft limit notifications from the
region_group_reclaimer, we can just set the soft limit to the target
size and let all of that be handled by the dirty_memory_manager.

It does have the added property that we'll be flushing when we globally
reach the soft limit threshold. In conditions in which we have multiple
CF writes fighting for memory, that guarantees that we will start
flushing much earlier than the hard limit.

The threshold is set to 1/4 of dirty memory. While in theory we would
prefer the memtables to go as big as 1/2 of dirty memory, in my
experiments I have found 1/4 to be a better fit, at least for the
moment.

The reason for such behavior is that in situations where we have slow
disks, setting the soft limit to 1/2 of dirty will put us in a situation
in which we may not have finished writing down the memtable when we hit
the limit, and then throttle. When set the threshold to 1/4 of dirty, we
don't throttle at all.

This behavior could potentially be fixed by not doing the full
memtable-based throttling after we do the commitlog throttling, but that
is not something realistic for the moment.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
2ed3f342c1 pass a region to dirty_memory_manager accounting API
We would like to know from which region is a particular flush coming
from, and account accordingly. The reasoning behind that, is that soon
we'll be driving the flushes internally from the dirty_memory_manager
without explcitly triggering them.

We need to start a flush before the current one finishes, otherwise
we'll have a period without significant disk activity when the current
SSTable is being sealed, the caches are being updated, etc.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
0b337dab14 memtable: add a method to expose the region_group
That is technically not needed because a memtable inherits from group. So
whenever we have a memtable, we can use it's group() method to obtain a
group for it, and then from there go to the region_group.

However, region() is a const method in the memtable, so we have to play
trick with the const_cast, or remove the constness from the region. An
alternative to that, which I prefer, is to expose a method for the
region_group directly from the memtable object that does the right thing
and bypasses all that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Glauber Costa
f86c9e36f4 logalloc: allow region group reclaimer to specify a soft limit
The region_group_reclaimer will let us know every time we are over the
limit we have specified for memory usage.

However, For some applications, we would be interested in knowing about
memory build up earlier, so we can start doing something about it before
we reach that condition.

This patch introduce soft limit notifications for the
region_group_reclaimer. After this patch is applied, start_reclaim() is
called earlier, and stop_reclaim() later, after the soft condition is
abated.

There are methods that allow one to easily test if the pressure
condition is a soft limit condition or a hard, threshold condition and
act accordingly. Whether to act on both conditions or just one of them
is up to the application.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
da738a6cd1 database: remove outdated comment
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Glauber Costa
919de98aa5 database: uphold virtual dirty for system tables.
Currently the virtual dirty mechanism is not properly set for system
tables. We haven't divided the system table allowance by two, which
means it won't start thottling earlier as it was supposed to.

In practice, this has little effect because system table requests are
very well behaved, their sizes well known, and they tend to be
force-flushed. But we should be consistent.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:23 -05:00
Avi Kivity
f26c6569d2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 61ff5c6...25e101f (1):
  > scylla_install_ami: delete unneeded authorized_keys from AMI image
2016-11-16 22:36:31 +02:00
Takuya ASADA
3802f289f8 dist: remove bc from dependency
Since we replaced shellscript based cpuset generator with python based one,
we no longer depends to bc command.
d571123afd

So drop it from .rpm/.deb dependency.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479152876-11020-1-git-send-email-syuu@scylladb.com>
2016-11-16 15:02:55 +02:00
Amnon Heiman
a4be7afbb0 API: cache_capacity should use uint for summing
Using integer as a type for the map_reduce causes number over overflow.

Fixes #1801

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1479299425-782-1-git-send-email-amnon@scylladb.com>
2016-11-16 13:55:46 +01:00
Avi Kivity
31d0e31de2 Merge seastar upstream
* seastar 47e1821...3001c08 (5):
  > core: Introduce weak_ptr<>
  > timer: Add missing include
  > tutorial: fix TeX template
  > Merge "Adding the metrics layer" from Amnon
  > core/memory: let malloc(0) return a valid pointer
2016-11-16 14:20:49 +02:00
Pekka Enberg
8a4bd6ecd5 README: Guidelines for contributing
Message-Id: <1479288359-14168-1-git-send-email-penberg@scylladb.com>
2016-11-16 12:50:02 +02:00
Paweł Dziepak
f877be50b0 Merge "Keep wide partition cache entry longer than others" from Piotr
"Cache entries for wide partitions are usually smaller than other
entries and the cost of recreating them is higher so it makes sense
to keep them longer than ordinary entries."
2016-11-15 20:44:52 +00:00
Paweł Dziepak
b8d737ff0a tests/row_cache_test: verify that eviction follows lru
Refs #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:57:54 +01:00
Paweł Dziepak
999dafbe57 row_cache: touch entries read during range queries
Fixes #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479230809-27547-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:54:11 +01:00
Tomasz Grabiec
11c5f4ab50 storage_proxy: Add counters for throttled writes 2016-11-15 17:18:25 +01:00
Piotr Jastrzebski
5ec668c9c6 Add separate LRU for wide partitions.
Evict wide partitions only every 1000 normal partition
evictions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 16:19:13 +01:00
Piotr Jastrzebski
9a41bfbf69 Add collectd metric for wide partition evictions.
This will allow us to see how big is an amount
of evictions of cached info about wide partitions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 15:53:14 +01:00
Paweł Dziepak
055d78ee4c query_pagers: distinct queries do not have clustering keys
Query pager needs to handle results that contain partitions with
possibly multiple clustering rows quite differently than results with
just one row per partition (for example a page may end in a middle of
partition). However, the logic dealing with partitions with clustering
rows doesn't work correctly for SELECT DISTINCT queries, which are
much more similar to the ones for schemas without clustering key.

The solution is to set _has_clustering_keys to false in case of SELECT
DISTINCT queries regardless of the schema which will make pager
correctly expect each partition to return at most one rows.

Fixes #1822.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478612486-13421-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 11:06:01 +01:00
Glauber Costa
93386bcec7 histograms: do not use latency_in_nano
Now that the histogram has its own unit expressed in its template
parameter, there is no reason to convert it to nano just so we may need
to convert it back if the histogram needs another unit.

This patch will keep everything as a duration until last moment, and
then we'll convert when needed.

This was suggested by Amnon.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <218efa83e1c4ddc6806c51913d4e5f82dc6d231e.1479139020.git.glauber@scylladb.com>
2016-11-14 18:01:43 +02:00
Nadav Har'El
c5254b6502 repair: fix undefined variable
If the "trace" parameter of the repair was not given, we will use the
"trace" variable without setting it. We need to set a default value.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1479136239-14204-1-git-send-email-nyh@scylladb.com>
2016-11-14 17:16:19 +02:00
Raphael S. Carvalho
e86de40b49 compaction_manager: inform about compaction cancelled by shutdown
After some changes in compaction manager, user no longer is informed
that compaction was cancelled in event of shutdown. That's because
we only ignore ready future when compaction manager was asked to
stop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <02ca29b5a93fe3a558896598f325b0dce069e82c.1478277317.git.raphaelsc@scylladb.com>
2016-11-14 16:37:33 +02:00
Piotr Jastrzebski
4fe989d58e Cleanup sstables::mutation_reader::impl
Pointer to sstable seems unnecessary.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a45e8853af2b5f896ec44144fbc26d3325a5ec0c.1479123740.git.piotr@scylladb.com>
2016-11-14 11:52:52 +00:00
Avi Kivity
14c1b17105 storage_service: fix construct_range_to_endpoint_map with semi-infinite range
After the conversion to nonwrapping ranges, construct_range_to_endpoint_map()
may be called with semi-infinite token ranges, but it does not expect this,
calling nonwrapping_range::end()->value() unconditionally.

Fix by checking whether this is a semi-infinite range on the right, and
replace ->value() by maximum_token() instead.

Fixes `nodetool describering` (once more).
Message-Id: <1478983010-29630-1-git-send-email-avi@scylladb.com>
2016-11-14 11:39:48 +01:00
Raphael S. Carvalho
9a9f0d3a0f main: fix exception handling when initializing data or commitlog dirs
Exception handling was broken because after io checker, storage_io_error
exception is wrapped around system error exceptions. Also the message
when handling exception wasn't precise enough for all cases. For example,
lack of permission to write to existing data directory.

Fixes #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>
2016-11-14 12:34:10 +02:00
Takuya ASADA
d571123afd dist/common/scripts/scylla_sysconfig_setup: stop using 'bc' command to generate cpuset parameter, use python script instead
We get error from bc command when we run the script on >34 ncpus,
to prevent the issue add a python script to generate cpuset parameter.

Fixes #1824

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478887624-12737-1-git-send-email-syuu@scylladb.com>
2016-11-14 11:45:23 +02:00
Duarte Nunes
66f6a367a4 ring_position_range_sharder: Avoid copying eagerly
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161104115632.15974-1-duarte@scylladb.com>
2016-11-13 11:42:23 +02:00
Avi Kivity
bf20aa722b Merge "Fixes for histogram and moving average calculations" from Glauber
"JMX metrics were found to be either not showing, or showing absurd
values.  Turns out there were multiple things wrong with them. The
patches were sent separately but conflict with one another. This series
is a collection of the patches needed to fix the issues we saw.

Fixes #1832, #1836, #1837"
2016-11-13 11:16:32 +02:00
Avi Kivity
2670e46f3e storage_service: deinline most methods
Most inline methods in storage_service are too large to be inlined, and
just increase compile time.  De-inline them.
2016-11-12 21:12:28 +02:00
Glauber Costa
608d825790 histogram: fix reporting units
We are tracking latencies in microseconds, but almost everywhere else
they are reported in microseconds. Instead of just converting, this
patch tries to be a bit more future proof and embed the unit into the
type - and we then default to microseconds.

I have verified that the JMX measures now report sane values for both
the storage proxy and the column family. nodetool cfhistograms still
works fine. That one is reported in nanoseconds, but through the
estimated_histogram, not ihistogram.

Fixes #1836

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-11 11:36:56 -05:00
Glauber Costa
1342d044eb moving averages: change metrics calculation
We have recently fixed a bug due to which the constructor parameters for
moving average were inverted, leading to the numbers being just plain
wrong. However, the calculation of alpha was already inverted, meaning
it was right by accident and now that's wrong.

With the wrong alpha, the values we see are still correct, but they move
very quickly. The intention of this code is obviously to smooth things
out.

This was found out by Nadav. I have tested and confirmed that the smoothing
factor now works as expected.

Fixes  #1837

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-10 22:33:34 -05:00
Amnon Heiman
a977ea85e1 histogram: moving_average and total rate should be calculate in seconds
The moving average and the total average should be calculated in seconds
and not nanoseconds.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-11-10 22:32:53 -05:00
Glauber Costa
d3f11fbabf histogram: moving averages: fix inverted parameters
moving_averages constructor is defined like this:

    moving_average(latency_counter::duration interval, latency_counter::duration tick_interval)

But when it is time to initialize them, we do this:

	... {tick_interval(), std::chrono::minutes(1)} ...

As it can be seen, the interval and tick interval are inverted. This
leads to the metrics being assigned bogus values.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>
2016-11-10 11:28:51 -08:00
Paweł Dziepak
f16d6f9c40 partition_version: make sure that snapshot is destroyed under LSA
Snapshot destructor may free some objects managed by the LSA. That's why
partition_snapshot_reader destructor explicitly destroys the snapshot it
uses. However, it was possible that exception thrown by _read_section
prevented that from happenning making snapshot destoryed implicitly
without current allocator set to LSA.

Refs #1831.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478778570-2795-1-git-send-email-pdziepak@scylladb.com>
2016-11-10 13:13:10 +01:00
Gleb Natapov
27e041606b fix LOCAL_ONE printout
Message-Id: <20161109125307.GH7766@scylladb.com>
2016-11-09 12:53:55 +00:00
Duarte Nunes
e680587b8a sstable_test: Be explicit about uncompressed tables
After 7c28ed, the schemas defined in the test became compressed by
default. This patch changes the test so that it is explicit about
which schemas shouldn't define a compressor.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478646530-5558-1-git-send-email-duarte@scylladb.com>
2016-11-09 11:21:59 +02:00
Pekka Enberg
b3dea313dd Merge "API changes for Cassandra 3.x migration" from Calle
"Mostly small changes/additions to the API calls to match Cv3
 requirements/semantics, i.e. updated scylla-jmx can implement required
 nodetool etc calls in a working fashion."
2016-11-09 10:30:32 +02:00
Duarte Nunes
e33c02aa60 cql3: Disable compression on empty properties
The CQL 3.1 documentation specifies that for disabling compression,
users should use an empty string:

ALTER TABLE mytable WITH COMPRESSION = {'sstable_compression': ''};

However, Cassandra also accepts the absence of the sstable_compression
option to disable compression. The patch 7c28ed prevented this behavior
in Scylla, which this patch aims to fix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478639499-4183-1-git-send-email-duarte@scylladb.com>
2016-11-09 10:03:59 +02:00
Gleb Natapov
93f068bd44 storage_proxy: fix speculation target selection logic
Current speculation target selection logic has several bugs in multi-dc
setup. It may select a non local target for CL=LOCAL and it may select
more than one target to speculate, one of which is non local.

Examples:

1. Two dataceneters: DC1 RF 2, DC2 RF 2  and read with LOCAL_QUORUM.

In this scenario db::filter_for_query() will return both replicas from
local DC and speculation target selection logic will peek one one which
will be in different DC.

2. Two dataceneters: DC1 RF 2, DC2 RF 2  and read with LOCAL_ONE + RRD.DC_LOCAL

In this scenario db::filter_for_query() will return all nodes in local DC and
there already be enough nodes to speculate, but current logic will add
one node from non local dc as a speculation target.

The patch below fixed both of those scenarios.

Message-Id: <20161103154637.GS7766@scylladb.com>
2016-11-08 18:32:47 +01:00
Paweł Dziepak
a8308e2a8d row_cache: dummy entry does not count as partition
Since continuity flag introduction row cache contains a single dummy
entry. cache_tracker knows nothing about it so that it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>
2016-11-08 13:54:44 +01:00
Piotr Jastrzebski
50b41f7d1d Fix row_cache_test
partition_range passed to row_cache::make_reader
has to be kept alive as long as the resulting reader
is used.

Otherwise weird things start to happen.

This used to work just because of a pure luck.
When I started changing the row_cache implementation
I run into very weird behaviors for this tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>
2016-11-08 13:28:53 +01:00
Calle Wilund
473326d49a api/column_family: Make mean row size return integral
As (at least) per C3, these metrics are integral in origin. Adapt.
(Other option would be to translate in jmx).
2016-11-08 12:22:04 +00:00
Calle Wilund
bd646a6755 repair (api): Add option handling (sort of) for nodetool default options 2016-11-08 12:22:04 +00:00
Calle Wilund
0181fc8159 api::cache_service: Add (dummy) calls for key&counter metrics 2016-11-08 12:22:04 +00:00
Calle Wilund
5eb54f9bc4 api::storage_service: c3 compat - make query keyspaces a trinary choice
all, user or non-local strategy ones.
2016-11-08 12:22:04 +00:00
Calle Wilund
3b7a7dd383 api::failure_detector: c3 compat - add endpoint phi value query 2016-11-08 12:22:04 +00:00
Calle Wilund
218df55349 failure_detector: add accessor and api shortcut for arrival samples 2016-11-08 12:22:04 +00:00
Calle Wilund
f9836cd23b api::endpoint_snitch: c3 compat - allow dc/rack query for broadcast 2016-11-08 12:22:04 +00:00
Calle Wilund
54ba06a8bf api::column_family: Add calls/parameters for c3 compatibility 2016-11-08 12:22:04 +00:00
Amnon Heiman
c8082ccadb API: fix a type in storage_proxy
This patch fixes a typo in the URL definition, causing the metric in the
jmx not to find it.

Fixes #1821

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1478563869-20504-1-git-send-email-amnon@scylladb.com>
2016-11-08 11:09:21 +02:00
Amos Kong
95fe88c1d3 scripts/scylla_current_repo: use HTTP to access downloads.scylladb.com
Https isn't available for downloads.scylladb.com, or we can access
it by https://s3.amazonaws.com/downloads.scylladb.com/...

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <d4b65e1724bbeb76c928790d5d3e95b91ee9db79.1478153034.git.amos@scylladb.com>
2016-11-08 11:03:50 +02:00
Avi Kivity
767cfb4fe9 storage_service: fix range wrapping in describe_ring even more
Commit 8fca1887c2 ("storage_service: fix range wrapping in
describe_ring") fixed incorrect range wrapping code for describe_ring,
but fails when the number of endpoints for a token is greater than one,
because the endpoints are stored in an unordered vector.

Fix by comparing the endpoints in a way that ignores their order.
Message-Id: <1478460826-15923-1-git-send-email-avi@scylladb.com>
2016-11-07 16:18:20 +01:00
Calle Wilund
11baf37ab5 commitlog: Prevent exceptions in stream::produce from being set twice
Fixes #1775
stream lacks a check "is_open", which is a bummer. We have to both
prevent exception propagation and add a flag of our own to make sure
exceptions in producer code reaches consumer, and does not simply
get lost in the reactor.
Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>
2016-11-07 11:41:33 +01:00
Tomasz Grabiec
e6cc0a2e10 Merge branch '1766/v1' from duarten/scylla.git
This patchset adds missing properties to the create_view_statement,
such as whether the view is compact or the order of its clustering
columns.

Fixes #1766
2016-11-07 10:44:24 +01:00
Takuya ASADA
0f1ba1a3bb dist/redhat: remove unused dependencies
Seems like we mistakenly added unneeded packages for BuildRequires when
we created .spec file, so remove them.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478504761-15067-1-git-send-email-syuu@scylladb.com>
2016-11-07 09:48:50 +02:00
Paweł Dziepak
985d2f6d4a Merge "Remove quadratic behavior from atomic sstable deletion" from Avi
"The atomic sstable deletion provides exception safety at the cost of
quadratic behavior in the number of sstables awaiting deletion.  This
causes high cpu utilization during startup.

Change the code to avoid quadratic complexity, and add some unit tests.

See #1812."
2016-11-04 15:48:04 +00:00
Avi Kivity
f75aceabc5 sstables: add unit tests for atomic deletion
We simulate shards deleting sstables, but this is all happening on a single
core, and no sstables are harmed during test execution.
2016-11-04 15:48:43 +02:00
Avi Kivity
f10b9906d8 sstables: move atomic deletion code to its own files
This will simplify unit testing.  We move generic code that
depends only on seastar, so compile time should not increase too much.
2016-11-04 15:47:35 +02:00
Avi Kivity
9e85653c33 sstables: make atomic_deletion_manager more abstract
Make the shard count and method of deleting sstables abstract, in order
not to require all that machinery for unit tests.
2016-11-04 15:44:09 +02:00
Avi Kivity
e527da1e3c sstables: wrap atomic deletion code in a class
This makes it easier to abstract and unit-test.
2016-11-04 15:44:07 +02:00
Avi Kivity
a05837936a sstables: remove quadratic behavior from atomic sstable deletions
In order to ensure exception safety, the atomic sstable deletion code
creates a copy of the list of sstables pending deletion, modifies that
copy, and then replaces the original data with the copy.  This guarantees
that any exception does not change the data, since the assignment does
not require allocation.

However, it does result in quadratic behavior.  During startup, all
sstables are loaded on each shard, and each shard deletes sstables that
are do not have any partitions served by that shard; this results in
almost all sstables being deleted from all shards, with all that work
going to shard 0; the list grows to O(nr sstables), and there are
O((nr sstables) * (nr shards)) operations to perform.

Fix by replacing the copy-modify-assign method with an in-place update,
but one that is designed to only commit changes after all allocations
have been made; in addition, instead of using a list, use a hash table,
removing another source of quadratic behavior.

Fixes #1812 (the quadratic beahvior part).
2016-11-04 15:42:44 +02:00
Avi Kivity
8fca1887c2 storage_service: fix range wrapping in describe_ring
describe_ring() tries to re-wrap the ranges, but fails because the ranges
are not sorted.  Adjust the code not to rely on sorting.
Message-Id: <1478198630-27483-1-git-send-email-avi@scylladb.com>
2016-11-04 10:48:14 +00:00
Paweł Dziepak
8afd9e52c7 Merge "Process range queries sequentially on shards" from Avi
"Currently, partition range queries are processed in parallel on all
shards. This is inefficient because we are likely to drop the results
from all but one shard, assuming a well-populated column family.  We
are multiplying our work by a factor of smp::count.

While this is worthwhile in its own right, it is really an excuse to
sneak in the range/shard generator (patch 5), which is preliminary for
a new sharding algorithm, dividing tokens among shards based on the
middle-significant bits rather than the most-siginificant bits (which
alias with vnodes)

Fixes #1573."
2016-11-04 09:58:04 +00:00
Tomasz Grabiec
c1a7e2090e Revert "database: change find_column_families signature so it returns a lw_shared_ptr"
This reverts commit f3528ede65.
2016-11-04 10:48:21 +01:00
Tomasz Grabiec
3b5ccda70e Revert "database: refactor code so apply_in_memory() is called only once"
This reverts commit 3f825f593d.
2016-11-04 10:48:18 +01:00
Tomasz Grabiec
6366eb5cf8 Revert "correctly calculate latencies for writes"
This reverts commit a382f10fc4.
2016-11-04 10:48:02 +01:00
Tomasz Grabiec
a5ee87611a Revert "database: when querying, move latency counter instead of copying"
This reverts commit 8840a5a593.
2016-11-04 10:47:58 +01:00
Tomasz Grabiec
f3c1ff78e6 Merge branch 'cql_read_write_counters-v4' from seastar-dev.git
New CQL counters from Vlad.
2016-11-04 09:19:07 +01:00
Avi Kivity
b3299d5bc3 storage_proxy: simplify range queries
Instead of asking a shard for cmd->partition_limit and cmd->row_limit,
just ask it for the number of partitions and rows still needed to
satisfy the query.  This removes the need to trim the shard's result.
2016-11-03 19:10:20 +02:00
Avi Kivity
a668e575f6 storage_proxy: execute multi-partition query sequentially over shards
Since every shard might cause the row_limit quota to be satisfied, every
shard might be the last one we need.  Hence it is better to process shards
sequentially, stopping if the quota is reached or the range is exhausted.

The original code tried to yield to reduce latency, but this is now
unnecessary, as we're doing a lot less work per iteration (if it becomes
necessary, we should do it on the replica shard, not the coordinating shard).
2016-11-03 19:10:20 +02:00
Avi Kivity
1d77e3a03a partitioner: add unit tests for token_for_next_shard()
i_partitioner::token_for_next_shard() is an inverse for
i_partitioner::shard_of(), test that this is so.
2016-11-03 19:10:20 +02:00
Avi Kivity
7202b94183 dht: introduce a sharder for vectors of partition ranges
Building on the single-range sharder, add a sharder for vectors of
partition ranges.  This helps with wrapped ranges, which are translated
into a vector containing two shards.
2016-11-03 19:10:20 +02:00
Avi Kivity
43a2380899 dht: add a generator for shard/range pairs
Divides a ring_position range into a sequence of shard/range pairs.  This
allows sequential iteration over shards in ring order.

The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page.  With the generator, we can switch to sequential
execution.
2016-11-03 19:10:17 +02:00
Avi Kivity
1f88d103a8 partitioner: add i_partitioner::token_for_next_shard()
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.

To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.

It is a sort-of inverse to shard_of(), in that

  shard_of(token_for_next_range(t)) == shard_of(t) + 1
2016-11-03 19:09:23 +02:00
Vlad Zolotarov
6c15dd967a cql3::query_processor: make the collectd metrics registration nicer
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:20 -04:00
Vlad Zolotarov
36cc351ae1 cql3::query_processor: add a counter for BATCH CQL statements
- Add a "batches" member to cql_stats.
   - Update it where appropriate.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:20 -04:00
Vlad Zolotarov
6e1d27bed1 cql3::query_processor: add a counter for a number of CQL modification requests ("writes")
- Add a inserts, updates, deletes members to cql_stats.
   - Store cql_stats& in a modification_statement and increment the corresponding counter according to the value of a "type" field.
   - Store cql_stats& in a batch_statement and increment the statistics for each BATCH member.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:49:15 -04:00
Vlad Zolotarov
fa4e1db0cb cql: add a counter for CQL read (SELECT) requests
- Add a "reads" counter to a cql3::cql_stats struct.
   - Store a reference for a query_processor::_cql_stats in the select_statement object.
   - Increment a "reads" counter where needed.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:48:57 -04:00
Vlad Zolotarov
7606588267 cql3::query_processor: add cql_stats
- Add cql_stats member.
   - Pass it to cql3::raw::parsed_statement::prepare() virtual method.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-11-03 11:48:57 -04:00
Glauber Costa
8840a5a593 database: when querying, move latency counter instead of copying
It is comprised of two time points. Let's move it instead of copying it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c7c155c77780e188bfbe05881c81ce86456016d5.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
a382f10fc4 correctly calculate latencies for writes
Right now we are calculating latencies only when we are about to add an
item to the memtable.

That's incorrect and misleading, for two reasons. First, it leaves the
commitlog latencies out. But second, it is done after the memtable wall
effect is applied, which means we are not counting throttle time neither
in the memtables or in the commitlog.

To do that, we'll start the latency_counter object as soon as possible
and move it all the way to apply_in_memory(). That should span the
entire write operation.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
3f825f593d database: refactor code so apply_in_memory() is called only once
There are two variants of apply_in_memory() being called in do_apply():
with and without the commitlog. The main differences are that when the
commitlog is involved, we need to wait for its future to complete before
moving to apply_in_memory. That can easily be factored out by providing
an always-ready future if we don't have the commitlog enabled, and
waiting on that.

The second, is that the commitlog version can cause apply_in_memory to
generate an exception if there is replay position reordering. However,
there is no harm in appending the exception handler to both versions. In
one of them it's an impossible exception, but that's fine.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8cee0cad9b1930a057a24e095f0a655069ae8be2.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Glauber Costa
f3528ede65 database: change find_column_families signature so it returns a lw_shared_ptr
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed in the deferring point, we need to go and find it
again.

If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
2016-11-03 13:27:31 +01:00
Avi Kivity
6c45b0bae8 partitioner: make comparators public
The public comparison operators depend on global_partitioner(), and are
therefore less useful for tests.
2016-11-03 11:27:40 +02:00
Avi Kivity
6320181b97 partitioner: const correctness for comparators 2016-11-03 11:27:40 +02:00
Avi Kivity
470826d127 partitioner: change partitioners to have shard counts independent from smp::count
Useful for testing.
2016-11-03 11:27:40 +02:00
Avi Kivity
75706c0a26 size_estimates_recorder: sort token range before rewrapping it
Since size estimates are stored as wrapped ranges, we call compat::wrap()
to convert from the now-standard unwrapped ranges back to wrapped ranges.
However, compat::wrap() relies on the ranges being in sorted order,
but our input is not.  This leads to a crash as we find an unexpected
empty token in the middle of the vector.

Sort it so compat::wrap() works as expected.

Fixes #1804.
Message-Id: <1478161908-25051-1-git-send-email-avi@scylladb.com>
2016-11-03 09:43:41 +01:00
Avi Kivity
a35136533d Convert ring_position and token ranges to be nonwrapping
Wrapping ranges are a pain, so we are moving wrap handling to the edges.

Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.

Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
2016-11-02 21:04:11 +02:00
Takuya ASADA
8c55c99353 dist/common/scripts/scylla_io_setup: pass --smp option to iotune command
We were ignored --smp option taken from io.conf since iotune didn't supported
it, but now it supported we can pass it.
(We need to pass it because we need to measure io performance on same condition
with scylla)

Fixes #1768

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1478082591-27205-1-git-send-email-syuu@scylladb.com>
2016-11-02 12:49:50 +02:00
Raphael S. Carvalho
53b7b7def3 sstables: handle unrecognized sstable component
As in C*, unrecognized sstable components should be ignored when
loading a sstable. At the moment, Scylla fails to do so and will
not boot as a result. In addition, unknown components should be
remembered when moving a sstable or changing its generation.

Fixes #1780.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b7af0c28e5b574fd577a7a1d28fb006ac197aa0a.1478025930.git.raphaelsc@scylladb.com>
2016-11-02 12:44:53 +02:00
Avi Kivity
72c2982260 dist: require scylla-boost-static for EL RPM build 2016-11-01 18:55:55 +02:00
Pekka Enberg
e1e8ca2788 cql3: Fix selecting same column multiple times
Under the hood, the selectable::add_and_get_index() function
deliberately filters out duplicate columns. This causes
simple_selector::get_output_row() to return a row with all duplicate
columns filtered out, which triggers and assertion because of row
mismatch with metadata (which contains the duplicate columns).

The fix is rather simple: just make selection::from_selectors() use
selection_with_processing if the number of selectors and column
definitions doesn't match -- like Apache Cassandra does.

Fixes #1367
Message-Id: <1477989740-6485-1-git-send-email-penberg@scylladb.com>
2016-11-01 09:09:01 +00:00
Pekka Enberg
d46ed53e9e scripts: add update-version
This patch adds an `update-version` script for updating the Scylla
version number in `SCYLLA-VERSION-GEN` file and committing the change to
git.

Example use:

  $ ./scripts/update-version 1.4.0

which results into the following git commit:

  commit 4599c16d9292d8d9299b40a3e44ef7ee80e3c3cf
  Author: Pekka Enberg <penberg@scylladb.com>
  Date:   Fri Oct 28 10:24:52 2016 +0300

      release: prepare for 1.4.0

  diff --git a/SCYLLA-VERSION-GEN b/SCYLLA-VERSION-GEN
  index 753c982..eba2da4 100755
  --- a/SCYLLA-VERSION-GEN
  +++ b/SCYLLA-VERSION-GEN
  @@ -1,6 +1,6 @@
   #!/bin/sh

  -VERSION=666.development
  +VERSION=1.4.0

   if test -f version
   then

Message-Id: <1477639560-10896-1-git-send-email-penberg@scylladb.com>
2016-10-30 12:43:41 +02:00
Avi Kivity
feb8faf70b Merge "make refresh resilient to permission denied error" from Raphael
Fixes #1709.

* 'refresh-resilient-v3' of github.com:raphaelsc/scylla:
  db: make refresh resilient to permission denied error
  db: make it possible to use custom error handler with io checker
  sstables: remove duplicated declaration of remove_by_toc_name
2016-10-30 10:28:09 +02:00
Takuya ASADA
68d9f5212c dist/ubuntu/dep/thrift.diff: add missing build time dependency
We need libcrypto header to build thrift, so add it.

Fixes #1798

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477676716-5726-1-git-send-email-syuu@scylladb.com>
2016-10-29 17:49:30 +03:00
Avi Kivity
71532d8cd5 Merge seastar upstream
* seastar 05f6c5c...47e1821 (1):
  > rpc: Avoid using zero-copy interface of output_stream (Fixes #1786)
2016-10-28 14:09:16 +03:00
Avi Kivity
e03ca06431 dist: fix rpm build
--static-boost is supposed to be an input to ./configure.py, not ninja.  Move
it there.
2016-10-28 08:42:26 +03:00
Pekka Enberg
b54870764f auth: Fix resource level handling
We use `data_resource` class in the CQL parser, which let's users refer
to a table resource without specifying a keyspace. This asserts out in
get_level() for no good reason as we already know the intented level
based on the constructor. Therefore, change `data_resource` to track the
level like upstream Cassandra does and use that.

Fixes #1790

Message-Id: <1477599169-2945-1-git-send-email-penberg@scylladb.com>
2016-10-27 23:37:26 +03:00
Glauber Costa
ef3c7ab38e auth: always convert string to upper case before comparing
We store all auth perm strings in upper case, but the user might very
well pass this in upper case.

We could use a standard key comparator / hash here, but since the
strings tend to be small, the new sstring will likely be allocated in
the stack here and this approach yields significantly less code.

Fixes #1791.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <51df92451e6e0a6325a005c19c95eaa55270da61.1477594199.git.glauber@scylladb.com>
2016-10-27 22:08:57 +03:00
Raphael S. Carvalho
d11e839520 db: make refresh resilient to permission denied error
User may forget to set permission of new sstables in upload dir
before refreshing them, and that will result in shutdown.
io_checker is now able to work with a custom handler, so all we
have to do is to whitelist EACCES.

Fixes #1709.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 16:50:40 -02:00
Raphael S. Carvalho
a3e065da9b db: make it possible to use custom error handler with io checker
By default, io checker will cause Scylla to shutdown if it finds
specific system errors. Right now, io checker isn't flexible
enough to allow a specialized handler. For example, we don't want
to Scylla to shutdown if there's an permission problem when
uploading new files from upload dir. This desired flexibility is
made possible here by allowing a handler parameter to io check
functions and also changing existing code to take advantage of it.
That's a step towards fixing #1709.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 15:54:21 -02:00
Takuya ASADA
a1b7e76d43 dist/ubuntu: support 16.10
Add 16.10 to 'supported_release'

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477585454-2115-1-git-send-email-syuu@scylladb.com>
2016-10-27 19:26:14 +03:00
Takuya ASADA
36e831a106 dist/common/scripts/scylla_bootparam_setup: support EC2 paravirtual instances
EC2 paravirtual instances uses pv-grub, which refers /boot/grub/menu.lst (grub0.9x config file) instead of grub2 config file.
So add boot parameters on /boot/grub/menu.lst when the file exists, and the instance is on EC2.

Fixes #1598

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472056875-17512-1-git-send-email-syuu@scylladb.com>
2016-10-27 18:55:05 +03:00
Avi Kivity
402a3f1c9f Merge seastar upstream
* seastar 9bed76a...05f6c5c (5):
  > reactor: improve task quota timer resolution
  > Update dpdk submodule to local-patches-20161027 tag
  > tests: wire up json_formatter_test
  > json_formatter_test: Add rudimentary json formatter test
  > scripts/posix_net_conf.sh: detect IRQs of virtio-net and xen_netfront correctly
2016-10-27 18:19:40 +03:00
Avi Kivity
e995f5a3a7 dist: statically link with boost on RHEL
Reduces runtime dependencies on Scylla-provided third-party boost packages.

Message-Id: <1477552490-28961-1-git-send-email-avi@scylladb.com>
2016-10-27 12:35:12 +03:00
Avi Kivity
76628a7b0b dist: make wget quieter
wget is often used from scripts recording to logs; as it emits a log
line every second, the logs are huge and unreadable.  Make it quieter.

Message-Id: <1477558534-32718-1-git-send-email-avi@scylladb.com>
2016-10-27 12:11:26 +03:00
Avi Kivity
72d78ffa7e Merge "Cache fixes" from Paweł
"5ff699e09fcbd62611e78b9de601f6c8636ab2f0 ("row_cache: rework cache to
use fast forwarding reader") brought some significant changes to the
row cache implementation. Unfortunately, "significant changes" often
translates to "more bugs" and this time was no different.

This series contains fixes for the problems introduced in that rework
and makes failing dtest
bootstrap_test.py:TestBootstrap.local_quorum_bootstrap_test
pass again."

* 'pdziepak/cache-fixes/v1' of github.com:cloudius-systems/seastar-dev:
  row_cache: avoid dereferencing invalid iterator
  row_cache: set _first_element flag correctly
  row_cache: fix clearing continuity flag at eviction
2016-10-27 11:44:15 +03:00
Takuya ASADA
5cb7dc5dc3 dist/ubuntu/dep: update thrift to 0.9.3
To make thrift compilable on gcc-6.2, we need to upgrade latest version of
thrift.
This is required to support Ubuntu 16.10.

Fixes #1784

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477517671-18067-1-git-send-email-syuu@scylladb.com>
2016-10-27 10:22:06 +03:00
Paweł Dziepak
a7224ae46e row_cache: avoid dereferencing invalid iterator
Conditions in row_cache::do_find_or_create_entry() make it possible that
std::prev(it) is going to be dereferenced even if it is a begin
iterator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:24:23 +01:00
Paweł Dziepak
654f651e0c row_cache: set _first_element flag correctly
If the continuity flag was set for the first element _first_element flag
would not be cleared. This shouldn't cause any correctness problems but
properly setting the flag allows to avoid some unnecessary key
comparisons.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:07:24 +01:00
Paweł Dziepak
567ff96f2a row_cache: fix clearing continuity flag at eviction
In original implementation the continuity flag indicated that cache has
full information about the range the between current partition and the
one following it, hence when evicting an entry the one preceeding it
had to have its continuity flag cleared.

This was changed, however, and now the continuiy flag tells whether the
cache is continuous between the current element and the one before it.
This means that eviction code needs to clear the flag for the entry
directly following the evicted one.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 14:58:20 +01:00
Raphael S. Carvalho
bc2d351c25 sstables: remove duplicated declaration of remove_by_toc_name
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-26 11:21:27 -02:00
Takuya ASADA
7617adadf4 dist/ami/files/.bash_profile: fix confusing message when running AMI on unsupported instance type
To describe witch instance type is supported, show document URL instead of
confusing message.

Fixes #1646

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1477473336-25373-1-git-send-email-syuu@scylladb.com>
2016-10-26 12:48:51 +03:00
Avi Kivity
7faf2eed2f build: support for linking statically with boost
Remove assumptions in the build system about dynamically linked boost unit
tests.  Includes seastar update which would have otherwise broken the
build.
2016-10-26 08:51:21 +03:00
Piotr Jastrzebski
27726cecff Clean up position_in_partition.
Introduce position_in_partition_view and use it in
position() method in mutation_fragment, range_tombstone,
static_row and clustering_row.
Clean up comparators in position_in_partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c65293c71a6aa23cf930ed317fb63df1fdc34fd1.1477399763.git.piotr@scylladb.com>
2016-10-25 15:13:20 +01:00
Tomasz Grabiec
cbaae2bf7f Merge seastar upstream
* seastar e18205b...3777135 (1):
  > rpc: Do not close client connection on error response for a timed out request

Fixes #1778
2016-10-25 13:59:41 +02:00
Raphael S. Carvalho
975ce62dbc sstables: do not swallow exception when reading TOC
That caused problem when refreshing a sstable with bad permissions.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <48e5322c53234209e55da05c64c99b8ec4e190a3.1477372974.git.raphaelsc@scylladb.com>
2016-10-25 12:21:32 +03:00
Avi Kivity
ddd4dbf928 Update scylla-ami submodule
* dist/ami/files/scylla-ami e1e3919...61ff5c6 (1):
  > scylla_ami_setup: run posix_net_conf.sh when NCPUS < 8
2016-10-25 11:18:58 +03:00
Avi Kivity
4b55a687b6 Merge seastar upstream
* seastar 98b5a2d...e18205b (1):
  > json::formatter: Add formatters for maps + rudimentary test
2016-10-25 11:17:29 +03:00
Avi Kivity
e8edaaf6a4 Merge seastar upstream
* seastar 69acec1...98b5a2d (9):
  > rpc: Silence warning about ignored failed future
  > future: prioritise continuations that can run immediately
  > iotune: relax aio restrictions
  > build: support for static linking with boost
  > rpc: Fix crash during connection teardown
  > rpc: Move _connected flag to protocol::connection
  > rpc test: fail test if exception is thrown during test execution
  > rpc: do not assume underling semaphore type
  > rpc: fix default resource limit
2016-10-25 11:09:40 +03:00
Avi Kivity
fc8210a875 tests: fix tests with boost 1.60
In boost 1.60, the executable's command-line arguments are expected to
be separated from the boost command-line arguments by '--'.  Detect
this requirement and comply with it.
Message-Id: <1477212424-3831-1-git-send-email-avi@scylladb.com>
2016-10-24 09:36:56 +02:00
Avi Kivity
37f112b610 dist: add python3-yaml to ununtu dependencies for blocktune 2016-10-23 16:42:13 +03:00
Avi Kivity
7d50d6df9b blocktune: fix syntax error in exception handling 2016-10-23 16:40:00 +03:00
Avi Kivity
e261a380a9 dist: add PyYAML dependency to rpm (for blocktune) 2016-10-23 10:36:29 +03:00
Raphael S. Carvalho
fa308c079c database: fix collectd metrics for clustering key filter
Same instance name was used for exported metrics, which is
definitely wrong. Checked it works properly now via collectd
exporter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <471a36706113af60aeba86fb56a365feb4dab31a.1477086706.git.raphaelsc@scylladb.com>
2016-10-22 09:51:18 +03:00
Glauber Costa
a13c410749 commitlog: cycle based on total size, not on mutation size
We calculate two sizes during the allocation: "size", which is the
in-segment size of this mutation, and "s", which is that plus the
overhead. cycle() must be called with the latter, not the former, as
doing otherwise may lead to buffer overflows.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ccf346d8d0ebb44a1ba9fd069653bab0d7be0a61.1477063157.git.glauber@scylladb.com>
2016-10-21 18:57:41 +03:00
Glauber Costa
d9875784a1 commitlog: do not wait on pending operations for batch mode
This was explicitly mentioned in my set as gone in one of the versions.
Somehow it came back in the final version - sorry about that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2a0eba28cd74267d1a1fdcf1aef2901cc74ffc9f.1477059963.git.glauber@scylladb.com>
2016-10-21 17:27:16 +03:00
Vlad Zolotarov
f75a350a8f service::storage_proxy: use global_trace_state_ptr when using invoke_on
When trace_state may migrate to a different shard a global_trace_state_ptr
has to be used.

This patch completes the patch below:

commit 7e180c7bd3
Author: Vlad Zolotarov <vladz@cloudius-systems.com>
Date:   Tue Sep 20 19:09:27 2016 +0300

    tracing: introduce the tracing::global_trace_state_ptr class

Fixes #1770

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1476993537-27388-1-git-send-email-vladz@cloudius-systems.com>
2016-10-21 11:34:13 +03:00
Avi Kivity
e3ae54f0fe Merge "Rework commitlog to avoid timeouts" from Glauber
"This patchset reworks the commitlog logic to better handle conditions in
which we are getting requests faster than the disk can handle. It does
this by building a wall around the commitlog and only allowing
allocations to proceed when we are under the desired memory threshold.

The main advantage of that is that we can now easily set the commitlog
to work at disk speed, more or less allowing an "one byte in for each
byte out" approach instead of depending on the current cycle to finish.
As a result, max latencies are greatly reduced.

Testing Results
===============

To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

I have run this with larger sizes as well, and it generally performs
much better than the baseline version. For sizes up to 5MB, I have seen
no timeouts in my setup. After that, I see some timeouts. Buffer
splitting is expected to make this better.

Aside from performance testing, this was also tested with batch and
periodic mode for various requests sizes."
2016-10-20 16:44:39 +03:00
Glauber Costa
d5618c6ace commitlog: add total_operations type for requests_blocked_memory
Current tracker for pending allocations is a queue_size GAUGE.  Add a
total_operations version so we have more insight on what's going on.

It will be called requests_blocked_memory for consistency with other
subsystems that track similar things.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-20 09:25:38 -04:00
Avi Kivity
db2f5e6be1 blocktune: wire up blocktune on startup
Message-Id: <1476357027-15014-3-git-send-email-avi@scylladb.com>
2016-10-20 13:24:05 +03:00
Avi Kivity
098d02ad1a scylla-blocktune: introduce
scylla-blocktune is a script that parses scylla.yaml and tunes the data file
and commitlog directories it references.

Tuning includes:
 - set the I/O scheduler to noop
 - disable merging
 - tune dependent devices (like RAID members)

Message-Id: <1476357027-15014-2-git-send-email-avi@scylladb.com>
2016-10-20 13:24:05 +03:00
Avi Kivity
fad34eef6c scylla_raid_setup: don't mess with read-ahead
It doesn't affect O_DIRECT reads, and it's not persistent.

Message-Id: <1476269082-2473-2-git-send-email-avi@scylladb.com>
2016-10-20 13:23:38 +03:00
Avi Kivity
a837da06ef scylla_raid_setup: increase chunk size
The current chunk size of 256 gives a 50% probability of a 128k read or
write getting split into two accesses.  This reduces efficiency and
increases latency.

Change the chunk size to 1MB, with a 12% probability of cross-member
access.

Message-Id: <1476269082-2473-1-git-send-email-avi@scylladb.com>
2016-10-20 13:23:38 +03:00
Takuya ASADA
80e3d8286c dist/ami: fix incorrect /etc/fstab entry on CentOS7 base image
There was incorrect rootfs entry on /etc/fstab:
 /dev/sda1 / xfs defaults,noatime 1 1
This causes boot error when updated to new kernel.
(see:
https://github.com/scylladb/scylla/issues/1597#issuecomment-250243187)

So replaced the entry to
 UUID=<uuid>  / xfs defaults,noatime 1 1
Also all recent security updates applied.

Fixes #1597
Fixes #1707

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475094957-9464-1-git-send-email-syuu@scylladb.com>
2016-10-20 11:48:24 +03:00
Takuya ASADA
5f602752a5 dist/ubuntu: backport g++-5 from Debian 9(stretch) to Debian 8(jessie)
Since Debian 8(jessie) does not provides g++-5, we frequently got compile error
because we are using older compiler.
To fix the problem, backport g++-5 from Debian 9(stretch).

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-3-git-send-email-syuu@scylladb.com>
2016-10-20 11:41:02 +03:00
Takuya ASADA
7d67504b56 dist/ubuntu: use VERSION_ID from /etc/os-release instead of 'lsb_release -r'
On Debian, lsb_release -r returns the version number something like '8.6'.
However, on this script we want to check major version only.
Therefore we can use VERSION_ID from /etc/os-release which only contains
major version number.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476694318-10640-2-git-send-email-syuu@scylladb.com>
2016-10-20 11:41:02 +03:00
Avi Kivity
0da2f64cfb Merge seastsar upstream
* seastar ccd8649...69acec1 (2):
  > app/iotune: add --smp option
  > rpc: Add missing adjustment of snd_buf::size

Fixes #1767.
Fixes #1768.
2016-10-20 11:16:40 +03:00
Paweł Dziepak
210a390892 tests: add missing sstable for partition skipping test
Commit 7dcd70124a "tests/sstables: add
test for fast forwarding reader" added a test for skipping parts of
sstable. Unfortunately, it did not include the sstables it was trying to
read.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 23:23:49 +01:00
Glauber Costa
1578d7363a commitlog: rework blocking logic
The current incarnation of commitlog establishes a maximum amount of
writes that can be in-flight, and blocks new requests after that limit
is reached.

That is obviously something we must do, but the current approach to it
is problematic for two main reasons:

1) It forces the requests that trigger a write to wait on the current
   write to finish. That is excessive; ideally we would wait for one
   particular write to finish, not necessarily the current one. That
   is made worse by the fact that when a write is followed by a flush
   (happens when we move to a new segment), then we must wait for
   *all* writes in that segment to finish.

1) it casts concurrency in terms of writes instead of memory, which
   makes the aforementioned problem a lot worse: if we have very big
   buffers in flight and we must wait for them to finish, that can
   take a long time, often in the order of seconds, causing timeouts.

The approach taken by this patch is to replace the _write_semaphore
with a request_controller. This data structure will account the amount
of memory used by the buffers and set a limit on it. New allocations
will be held until we go below that limit, and will be released
as soon as this happens.

This guarantees that the latencies introduced by this mechanism are
spread out a lot better among requests and will keep higher percentile
latencies in check.

To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.

After 10 minutes running this load, we are left with the following
percentiles:

latency mean              : 51.9 [WRITE:51.9]
latency median            : 9.8 [WRITE:9.8]
latency 95th percentile   : 125.6 [WRITE:125.6]
latency 99th percentile   : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max               : 2338.2 [WRITE:2338.2]

After this patch:

latency mean              : 54.9 [WRITE:54.9]
latency median            : 43.5 [WRITE:43.5]
latency 95th percentile   : 126.9 [WRITE:126.9]
latency 99th percentile   : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max               : 471.4 [WRITE:471.4]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:56:36 -04:00
Glauber Costa
aec724bbda commitlog: factor out code for checking mutation size
In a subsequent patch, I'll use this code in a different place. To
prepare for that, we move it out as a method. It also fits a lot better
inside the segment manager, so move it there.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
a50996f376 commitlog: calculate segment-independent size of mutations
Goal is to calculate a size that is lesser or equal than the
segment-dependent size.

This was originally written by Tomasz, and featured in his submission
"commitlog: Handle overload more gracefully"

Extracted here so it sits clearly in a different patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
0b7c9fa17f commitlog: remove _needed_size
It is mostly an optimization, and while it makes sense in this context,
it won't soon as we'll stop waiting for the current cycle specifically
to finish.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
6214bdeb66 commitlog: move segment_manager constructor outside the class definition
We'll do that so we can, in following patches, use static members from
the segment. Those are not defined at this point.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Glauber Costa
299877f432 commitlog: add a counter for pending allocations
We track the amount of pending allocations but we don't really export
it. It will be crucial when we stop tracking pending writes.

This patch exports it through a method instead of the totals structure,
so we can easily change it. Current code probing pending_allocations
(the api code) is also converted to use the public method instead of the
totals struct.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-19 13:49:47 -04:00
Avi Kivity
07c995ab3d Merge "Fast forward mutation readers" from Paweł
"This patchset enables mutation readers to be fast forwarded to a different
partition range. The main reason for introducing such feature are range
queries served from cache. If the cache is partially populated in the
requested range the reader will end up with multiple subranges that have
to be read from the sstables. Originally, each of these subranges would
require a new reader to be created, but with fast forwarding we can have
just one sstable reader. This is better since there is a chance that buffers
kept by the reader may be still useful after fast forwarding it.

In this series there are also patches that clean up cache readers in order
to make integration with fast forwarding easier. Namely, continuity flag is
changed to store information about range before the entry which significantly
simplifies the logic.

Fixes #1299."

* 'pdziepak/fast-forward-mutation-readers/v5' of github.com:cloudius-systems/seastar-dev: (24 commits)
  sstables: keep separate stream history for single and range reads
  sstables: drop sstable::{lower, upper}_bound()
  row_cache: rework cache to use fast forwarding reader
  row_cache: put cache entry flags in a struct
  row_cache: add do_find_or_create_entry() to reduce code duplication
  mutation_reader: forward fast_forward_to() calls
  tests/row_cache: add fast_forward_to() to throttled reader
  tests/row_cache: count mutations read from _underlying
  memtable: add support for fast_forward_to()
  drop key readers
  tests/mutation_reader: test fast forwarding combined reader
  database: enable fast forwarding of range_sstable_reader
  combined_mutation_reader: implement fast_forward_to()
  mutation_reader: make combinded_reader public
  tests/sstables: add test for fast forwarding reader
  tests: add more helpers to mutation reader assertions
  sstables: enable fast forwarding for range readers
  mutation_reader: introduce fast_forward_to()
  sstables: implement mutation_reader::impl::fast_forward_to()
  sstables: introduce index_reader
  ...
2016-10-19 18:10:44 +03:00
Paweł Dziepak
ab0eeae82d sstables: keep separate stream history for single and range reads
Single partition and partition range reads are expected to behave
considerably different so it is worth to have them use separate file
stream history. This also makes reads use different history for each
sstable which is also a good thing.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
20bfa1fa52 sstables: drop sstable::{lower, upper}_bound()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5ff699e09f row_cache: rework cache to use fast forwarding reader
This uncomfortably large patch overhauls cache range reader so that it
can take advantage of fast forwarding mutation readers.

A significant change in the cache itself is that the continuity flag now
is used to determine whether cache is contiguous between the previous
entry and the current one. This allows for a significant simplification
of the cache code and easier integration with reader fast forwarding.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
18acb0c0e6 row_cache: put cache entry flags in a struct
Flags are easier to manage if they are in a single structure.
Especially, default initialization and move contstructors are simpler
and less error prone.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f248e23db5 row_cache: add do_find_or_create_entry() to reduce code duplication
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
bcd374c05d mutation_reader: forward fast_forward_to() calls
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0c24bbe639 tests/row_cache: add fast_forward_to() to throttled reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
69645455f3 tests/row_cache: count mutations read from _underlying
Originally, cache tests checked how many times a mutation reader was
created from the underlying mutation source to determine whether
continuity flag is working correctly.

This is not going to work with fast forwarding mutation readers so the
test is switched to count number of mutations (+ end of stream markers)
returned from underlying mutaiton readers which is much less fragile.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
e14f8027d5 memtable: add support for fast_forward_to()
Fast forwarding of memtable readers is needed only for unit tests which
often use memtables as underlying data source for cache and the cache
readers.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
6755a679f6 drop key readers
key_readers weren't used since introduction of continuity flag to cache
entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5ac9babe97 tests/mutation_reader: test fast forwarding combined reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7bebfb851f database: enable fast forwarding of range_sstable_reader
When fast forwarding a reader that combines sstable reader we must also
remember that the set of sstables for the new range may be different
than for the previous one. The reader introduced in this patch makes
sure that we read from correct sstables.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
b7b7b2bd63 combined_mutation_reader: implement fast_forward_to()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2c0cdd55fc mutation_reader: make combinded_reader public
We want to be able to fast forward sstable readers. However, just
implementing fast_forward_to() for combined_reader is not enough as the
sstables we are reading from may need to change.

Following patches are going to introduce a combined sstable reader that
derives from combined_reader. To make that possible we first need to
make combined_reader public.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7dcd70124a tests/sstables: add test for fast forwarding reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
5534dc2817 tests: add more helpers to mutation reader assertions
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
cf024975fe sstables: enable fast forwarding for range readers
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
62c9492d33 mutation_reader: introduce fast_forward_to()
This patch introduces the interface for fast forwarding mutation
readers. The main user of this feature is going to be cache which, while
serving range query, may need to read multiple small ranges from the
sstables to populate itself with the missing entries.

Fast forwarding is an alternative to recreating a reader with different
range. Its main advantage is fact that it avoids dropping data that has
already been read.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
c63e88d556 sstables: implement mutation_reader::impl::fast_forward_to()
This patch allows sstable readers to be fast forwarded without making it
necessary to recreate the reader (and dropping all buffers in the
process). It is built on top of index_reader and ability of
data_consume_context to be fast forwarded.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
a530762277 sstables: introduce index_reader
index_reader is a helper that implements index lookups. Its goal is to
avoid dropping read buffers if they still may be needed (for example to
get end bound of the range or after fast forwarding the reader).

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f49a9e0d64 sstables: drop unused read_range_rows() overload
That overload was used only by unit test and violated guarantee that
partition range lives until mutation reader is done.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0bc873ace5 sstables: add fast_forward_to() to continuous_data_consumer
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
25b91c51e2 ssables: add data_consume_rows_context::reset()
reset() is going to be used to restore valid state after fast forwarding
the reader.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2124d08b88 sstables: add skip() to compressed_file_data_source
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
54069162f5 Merge "Add test for partition version list consistency after compaction" from Tomek 2016-10-18 11:03:25 +01:00
Tomasz Grabiec
308434f891 tests: memtable: Add test for partition version list consistency after compaction 2016-10-18 11:57:14 +02:00
Tomasz Grabiec
6548132423 lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
2016-10-18 11:16:08 +02:00
Tomasz Grabiec
ecf85cbffb mutation: Define + operation
It's more convenient to write m1 + m2 in tests than to do more
elaborate constructs with copy constructors and apply().
2016-10-18 11:16:08 +02:00
Tomasz Grabiec
fe387f8ba0 partition_version: Fix corruption of partition_version list
The move constructor of partition_version was not invoking move
constructor of anchorless_list_base_hook. As a result, when
partition_version objects were moved, e.g. during LSA compaction, they
were unlinked from their lists.

This can make readers return invalid data, because not all versions
will be reachable.

It also casues leaks of the versions which are not directly attached
to memtable entry. This will trigger assertion failure in LSA region
destructor. This assetion triggers with row cache disabled. With cache
enabled (default) all segments are merged into the cache region, which
currently is not destroyed on shutdown, so this problem would go
unnoticed. With cache disabled, memtable region is destroyed after
memtable is flushed and after all readers stop using that memtable.

Fixes #1753.
Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>
2016-10-18 09:25:38 +01:00
Duarte Nunes
1d45f19c78 create_view_statement: Use cf_properties
This patch uses cf_properties instead to add the missing attributes to
the create_view_statement class.

Fixes #1766

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
7c58b7e764 unimplemented: Add materialized views
This patch adds the VIEWS element to the cause enum so we can
mark failures due to incomplete support of materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
7c28ed3dfc schema: Extract default compressor
This patch extracts the definition of the default compressor into the
compression_parameters class, so that the table and view creation
statements don't have to explicitly deal with it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:52 +00:00
Duarte Nunes
dc470e6a36 cql3: Extract cf_properties
This patch extracts the cf_properties class, which contains common
attributes of tables and materialized views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-18 01:18:51 +00:00
Takuya ASADA
587d375e19 main: exit with 1 when verify_seastar_io_scheduler() failed
Since we are exiting Scylla process in engine().at_exit() using
::_exit(0), even verify_seastar_io_scheduler() throwing an exception,
scylla always exit with 0.

Systemd misunderstands scylla-server.service was shutdown successfully
because of this, so we need to pass correct exit code to ::_exit() here.

Fixes #1674

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>
2016-10-17 13:57:00 +03:00
Avi Kivity
163088c6af Merge seastar upstream
* seastar 207bf3d...ccd8649 (3):
  > Merge "Augment semaphore with non-blocking operations" from Glauber
  > Merge "More dynamic fstream patches" from Paweł
  > Merge "fstream: add dynamic adjustments based on stream history" from Paweł
2016-10-17 12:49:17 +03:00
Avi Kivity
65c27ccf21 bytes_ostream: make max_chunk_size() an inline function
Fixes debug build looking for a variable definition and not finding it.
2016-10-17 11:49:33 +03:00
Avi Kivity
c0a1ad0b77 bytes_ostream: use larger allocations
A 1MB response will require 2000 allocations with the current 512-byte
chunk size.  Increase it exponentially to reduce allocation count for
larger responses (still respecting the upper limit).
Message-Id: <1476369152-1245-1-git-send-email-avi@scylladb.com>
2016-10-16 10:05:48 +01:00
Tomasz Grabiec
d836e8f64b tests: memtable: Add tests for flushing reader
Message-Id: <1476454187-11462-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:11:06 +01:00
Tomasz Grabiec
63784fd921 db: Fix corruption of partition_entry
Memory accounting code was attaching partition_snapshot to
partition_entry in order to calculate the size of partition_version
object. However, it is only allowed if partition_entry doesn't have
any snapshot attached already. In this case it always has one, created
by the flushing reader.

Change the accounting code to reuse existing partition_snapshot reference.

Fixes #1746
Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:10:48 +01:00
Paweł Dziepak
d08cffd3c7 lsa: avoid exceptions during segment_zone creation
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.

This patch changes the LSA code so that it doesn't throw in case zone
couldn't be created but just returns a null pointer which should be
more performant if the LSA memory cannot grow any more.

Fixes #1394.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>
2016-10-14 11:08:24 +02:00
Amnon Heiman
7829da13b4 scylla_setup: Reorder questions and actions
The expected behaviour in the scylla_setup script is that a question
will be followed by the answer.

For example, after asking if the scylla should be run as a service the
relevant actions will be taken before the following question.

This patch address two such mis-orders:
1. the scylla-housekeeping depends on the scylla-server, but the
setup should first setup the scylla-server service and only then ask
(and install if needed) the scylla-housekeeping.
2. The node_exporter should be placed after the io_setup is done.

Fixes #1739

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1476370098-25617-1-git-send-email-amnon@scylladb.com>
2016-10-13 18:29:36 +03:00
Pekka Enberg
3b4e6cdc5e abstract_replication_strategy: Fix exception type if class not found
Change abstract_replication_strategy::create_replication_strategy() to
throw exceptions::configuration_error if replication strategy class
lookup to make sure the error is converted to the correct CQL response.

Fixes #1755

Message-Id: <1476361262-28723-1-git-send-email-penberg@scylladb.com>
2016-10-13 17:39:28 +03:00
Tomasz Grabiec
e617bcd8a7 logalloc: disable abort on allocation failure in places in which it is benign
Some places start big expecting allocation failure, then reduce the
requested size. Let's not abort in such cases.

Message-Id: <1476295120-32047-1-git-send-email-tgrabiec@scylladb.com>
2016-10-13 10:53:32 +03:00
Avi Kivity
13e9d4c8e3 Merge seastar upstream
* seastar f937fb0...207bf3d (11):
  > Merge "iotune: gracefully exit on predictable exceptions" (Fixes #1623)
  > core/semaphore: Add semaphore_units::release()
  > Merge "rometheus API with grafana uses labels" from Amnon
  > core/thread: Fix stack alloc-dealloc mismatch
  > core/thread: Make jmp_buf_link::yield_at use the same time point as thread_scheduling_group
  > file: support for XFS on older kernels
  > reactor: fix bug when handling EBADF in flush_pending_aio()
  > prometheus CPU should start in 0
  > Collectd: bytes ordering depends on the type
  > tests: Check that backtrace() doesn't corrupt signal mask
  > core/thread: Add stack guards to seastar thread stacks
2016-10-12 23:47:12 +03:00
Avi Kivity
63f053e9b7 storage_proxy: fix mutation reordering with wrapping ranges
If we have a range query involving a wrapping range (i.e., from thrift),
and mutations from both halves of the result are involved, then
we will return the results in the wrong order (and potentially the wrong
partitions) since we order by token, so the results from the second half
of the wrapping range end up before the first.

Fix by splitting the two queries, and merging the second half with lower
priority compared to the first half.

Note: this will be fixed in a better way once we have the sharding iterator,
as then we can query sequentially.

Fixes #1761.
Message-Id: <1476262693-30162-1-git-send-email-avi@scylladb.com>
2016-10-12 15:59:16 +02:00
Avi Kivity
1506b06617 Merge "node_exporter service on ubuntu 16" from Amnon
"This series address two issues that interfere with running the node_exporter as a service in ubuntu 16.
1. The service file should be packed in the deb file
2. When setting the node_exporter as a service it doesn't need to run with scylla use"

* 'amnon/node_exporter_ubuntu_v2' of github.com:cloudius-systems/seastar-dev:
  node-exporter service: No need to run as scylla user
  debian package: Include the node_exporter service file
2016-10-12 12:11:18 +03:00
Amnon Heiman
1bd50789e0 node-exporter service: No need to run as scylla user
the node-exporter does not need to run as scylla user.  It can run
without scylla or without the scylla user being configure.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-11 12:44:27 +03:00
Amnon Heiman
d523bf56ed debian package: Include the node_exporter service file
This will include the node_exporter service script for ubuntu
distribution with systemd support.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-11 12:44:14 +03:00
Avi Kivity
f6998bb260 Merge "Implement describe_splits_ex based on Cassandra" from Duarte
"This patch-set re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693"

* 'describe-splits/v2' of github.com:duarten/scylla:
  thrift: Implement describe_splits_ex based on Cassandra
  storage_service: Implement get_splits() function
  sstables: Add function to get key samples
  sstables/key: Add to_partition_key function
  size_estimates_recorder: Increase estimate accuracy
  sstables: Get estimates for a particular range
  sstables/key: Make key::kind public
2016-10-11 11:13:35 +03:00
Takuya ASADA
0007f2d838 dist/common/sbin: add scylla_cpuset_setup and scylla_dev_mode_setup to /usr/sbin
We haven't added symlinks to /usr/sbin for newly created scripts, so add them.

Fixes #1702

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1474879711-31793-1-git-send-email-syuu@scylladb.com>
2016-10-11 11:02:14 +03:00
Takuya ASADA
ccad720bb1 dist/common/script/scylla_io_setup: handle comma correctly when parsing cpuset
The script mistakenly split value at "," when cpuset list is separated
by comma. Instead of matching possible patterns of the argument, let's
pass all characters until reach to space delimiter or end of line.

Fixes #1716

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1476171037-32373-1-git-send-email-syuu@scylladb.com>
2016-10-11 10:42:32 +03:00
Duarte Nunes
d8cfc56376 thrift: Implement describe_splits_ex based on Cassandra
This patch re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 22:32:10 +02:00
Duarte Nunes
01ab2081cd storage_service: Implement get_splits() function
This patch implements the get_splits() function in storage_service,
used to split a particular token range in slices of approximately the
specified size, using the sample keys and estimates of the CF's
sstables.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 22:32:08 +02:00
Duarte Nunes
c36dbaf0f1 sstables: Add function to get key samples
This patch implements the get_key_samples() function, on which a
future patch will base an implementation of the describe_splits()
thrift verb closer to Cassandra's.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:14 +02:00
Duarte Nunes
fc07b66678 sstables/key: Add to_partition_key function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:11 +02:00
Duarte Nunes
c19c633299 size_estimates_recorder: Increase estimate accuracy
This patch uses the estimated_keys_for_range() function to get better
estimates.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:16 +02:00
Duarte Nunes
ceed09b23e sstables: Get estimates for a particular range
This patch adds the estimated_keys_for_range() function, which
estimates the number of keys present between the specified range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:15 +02:00
Duarte Nunes
8c223b31c8 sstables/key: Make key::kind public
Needed to create synthetic keys without any value but with ordering
properties.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:47:24 +02:00
Avi Kivity
b305d92a65 Merge "housekeeping: check version during setup" from Amnon
"The version is taken from the installation rather than the API, a mode command
line indicated that this is part of the setup and uuid is used for the
interaction with the checkversion server."

* 'amnon/check_version_on_startup_v3' of github.com:cloudius-systems/seastar-dev:
  scylla_setup: Check and report the scylla version
  scylla-housekeeping: check version during setup
2016-10-10 16:37:14 +03:00
Vlad Zolotarov
ab748e829d docs: tracing.md: initial commit
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1475686745-20383-1-git-send-email-vladz@cloudius-systems.com>
2016-10-10 16:12:02 +03:00
Tomasz Grabiec
4357d0a6d9 db: Add counter for writes blocked on dirty memory
There is already queue_length-requests_blocked_memory, but it's a
gauge so does not reflect what happened between the sampling points.

total_operations-requests_blocked_memory will allow to see if there
were any (and how many) requests which were blocked by dirty memory.

Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
2016-10-10 14:25:22 +03:00
Pekka Enberg
3b75ff1496 docs/docker: Tag --listen-address as 1.4 feature
The Docker Hub documentation is the same for all image versions. Tag
`--listen-address` as 1.4 feature.

Message-Id: <1475819164-7865-1-git-send-email-penberg@scylladb.com>
2016-10-10 13:26:16 +03:00
Vlad Zolotarov
006999f46c api::storage_service::slow_query: don't use duration_cast in GET
The slow_query_record_ttl() and slow_query_threshold() return the duration
of the appropriate type already - no need for an additional cast.
In addition there was a mistake in a cast of ttl.

Fixes #1734

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1475669400-5925-1-git-send-email-vladz@cloudius-systems.com>
2016-10-09 18:09:13 +03:00
Takuya ASADA
469e9af1f4 dist/common/scripts/scylla_setup: use 'swapon -s' instead of 'swapon --show'
Since Ubuntu 14.04 doesn't supported --show option, we need to prevent use it.
Fixes #1740

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475788340-22939-2-git-send-email-syuu@scylladb.com>
2016-10-09 18:05:14 +03:00
Takuya ASADA
8452045b85 dist/ubuntu: add realpath to dependency, requires for scylla_setup
We need dependency to realpath, since scylla_setup using it.

Fixes #1740.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475788340-22939-1-git-send-email-syuu@scylladb.com>
2016-10-09 18:05:14 +03:00
Tomasz Grabiec
41e66ebce2 gdb: Introduce 'scylla heapprof'
Presents current heap profile recording.

Works in text mode or dumps to collapsed stacks format from which
flame graph can be generated.

To generate a flamegraph:

  (gdb) scylla heapprof --flame
  Wrote heapprof.stacks

  $ flamegraph.pl --colors mem < heapprof.stacks > heapprof.svg

flamegraph.pl comes from:

  https://github.com/brendangregg/FlameGraph.git

Text mode example:

(gdb) scylla heapprof --min 100000000
All (274699676, #10213)
 \-- void* memory::cpu_pages::allocate_large_and_trim<memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}>(unsigned int, memory::cpu_pages::allocate_large_aligned(unsigned int, unsigned int)::{lambda(unsigned int, unsigned int)#1}) + 169  (268435456, #1)
     memory::allocate_large_aligned(unsigned long, unsigned long) + 87
     memory::allocate_aligned(unsigned long, unsigned long) + 48
     aligned_alloc + 9
     logalloc::segment_zone::segment_zone() + 304
     logalloc::segment_pool::allocate_segment() + 477
     logalloc::segment_pool::segment_pool() + 304
     __tls_init.part.801 + 72
     logalloc::region_group::release_requests() + 1333
     logalloc::region_group::add(logalloc::region_group*) + 514

The branches are formatted like this:

   -- <symbol> (<size>, #<count>)

Where <size> is total size of live objects and <count> is total
number of live objects, for all objects allocated from paths going
through this node.

Nodes which share the same <size> and <count> are stacked like this:

   -- <symbol_1> (<size>, #<count>)
      <symbol_2>
      <symbol_3>

Message-Id: <1475583334-19524-1-git-send-email-tgrabiec@scylladb.com>
2016-10-09 10:54:08 +03:00
Glauber Costa
33e9c2bbdd memtable: reduce sstable flush concurrency to one
Limiting the concurrency of memtable flushes to 4 was a temporary
workaround for the fact that we lacked good write behind support. Now
that write behind is properly merged we can reduce the concurrency to
what it should be, one.

This means that memtable flushes will now be serialized, and only when
one of them ends will the next one begin. Disk parallelism is obtained
through the write-behind mechanism.

Fixes #1373

Signed-off-by: Glauber Costa <glauber@scylladb.com>

Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>
2016-10-09 10:48:57 +03:00
Tomasz Grabiec
2a5a90f391 db: Do not timeout streaming readers
There is a limit to concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers) readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.

This timeout should not apply for readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.

The change sets no timeout for streaming readers at replica level,
similarly as we do for system tables readers.

Fixes #1741.

Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
2016-10-07 15:41:04 +03:00
Raphael S. Carvalho
9175977a9d cql3: fix build failure by defining out unused function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <cba6207278ea945ee750d78b189320443843a288.1475793747.git.raphaelsc@scylladb.com>
2016-10-07 08:45:18 +03:00
Avi Kivity
9ac441d3b5 range: adjust split_after to allow split_point outside input range
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range.  If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.

"before" and "after" are not well defined for wrap-around ranges, so
but we are phasing them out and soon there will not be
wrapping_range::split_after() users.

This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
2016-10-06 17:54:44 +02:00
Raphael S. Carvalho
7ea4513595 database: trigger compaction after loading new sstables
Scylla wasn't trying to compact new sstables uploaded via 'nodetool
refresh'. Thus, all new sstables were left uncompacted until user
issued 'nodetool flush' or a new sstable was written which would
trigger compaction too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:49 +03:00
Raphael S. Carvalho
9c59ccc52a storage_service: improve log message for refresh
'No new SSTables were found for keyspace1.standard1' was printed
if user uploaded new sstables to upload dir instead, and that is
confusing. We should instead print that if new sstables weren't
found in both cf and cf/upload dirs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <90386f6255407697434213227ae7ff0de7464f99.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:32 +03:00
Raphael S. Carvalho
76862d0d9c main: start compaction procedure after commit log is replayed
Commit log replay is a synchronous operation in bootstrap, so services
will only be started after it's completed. By starting compaction before,
less bandwidth will be available to both and consequently boot will be
slowed down. Fix is simply about moving compaction, which is an
asynchronous operation after commitlog replay is over.

Fixes #1620.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d2a173a4ee4d474317b970c6b39530e61067fea9.1475527955.git.raphaelsc@scylladb.com>
2016-10-06 18:25:24 +03:00
Nadav Har'El
ee7ec10b11 CQL parser: "CREATE MATERIALIZED VIEW" statement
This patch adds the parsing for the "CREATE MATERIALIZED VIEW" statement,
following Cassandra 3 syntax. For example:

   CREATE MATERIALIZED VIEW building_by_city
   AS SELECT * FROM buildings
   WHERE city IS NOT NULL
   PRIMARY KEY(city, name);

It also adds the "IS NOT NULL" operator needed for this purpose.
As in Cassandra, "IS NOT NULL" can only be used for materialized
view creation, and not in a normal SELECT. It can only be used with
the NULL operand (i.e., "IS NOT 3" will be a syntax error).

The current implementation of this statement just does some sanity
checking (such as to verify that "city" is a valid column name and that
the "building" base table exists), complains that materialized views are
not yet supported:

SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="Failed parsing statement: [CREATE MATERIALIZED VIEW building_by_city AS
SELECT * FROM buildings
WHERE city IS NOT NULL
PRIMARY KEY(city, name);] reason: unsupported operation: Materialized views not yet supported">

As mentioned above, the "IS NOT NULL" restriction is not allowed in
ordinary selects not creating a materialized views:

SELECT * FROM buildings WHERE city IS NOT NULL;
InvalidRequest: code=2200 [Invalid query] message="restriction 'city IS NOT null' is only supported in materialized view creation"

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1475742927-30695-1-git-send-email-nyh@scylladb.com>
2016-10-06 15:42:37 +03:00
Glauber Costa
7146776d7c fix sstable tests by not using the flush_reader if no region_group
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to it.

Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.

If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.

One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.

The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
2016-10-05 12:44:21 +01:00
Avi Kivity
c94fb1bf12 build: reduce inclusions of messaging_service.hh
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.

Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
2016-10-05 11:46:49 +03:00
Avi Kivity
f8118d9fc2 Merge "Virtual dirty memory management" from Glauber
"Description:
============

Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that, is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.

Results
=======

With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.

In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.

For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ

Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed.  I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have ran (large, 65k mutations). Memtables under this circumnstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.

Known Issues:
-------------

A lot of time can be elapsed between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.

After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."

* 'virtual-dirty-v6' of github.com:glommer/scylla:
  database: allow virtual dirty memory management
  streamed_mutation: make _buffer private
  add accounting of memory read to partition_snapshot_reader
  move partition_snapshot_reader code to header file
  LSA: allow a group to query its own region group
  memtables: split scanning reader in two
  sstables: use special reader for writing a memtable
  LSA: export information about object memory footprint
  LSA: export information about size of the throttle queue
  database: export virtual dirty bytes region group
2016-10-04 20:57:52 +03:00
Avi Kivity
cc33c8b4ba Merge seastar upstream
* seastar 18f7bb8...f937fb0 (5):
  > Merge "Fix signal mask corruption" from Tomasz
  > core/memory: Avoid violating strict aliasing when accessing allocation sites
  > core/memory: Avoid indirection when storing allocation sites
  > core/memory: Add a way to disable abort on allocation failure in some scope
  > core/sharded: Allow mapper to take the service by non-const reference
2016-10-04 20:08:57 +03:00
Glauber Costa
f89a67c75c database: allow virtual dirty memory management
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
7b6e8a2526 streamed_mutation: make _buffer private
It is currently protected, but now all users go through
push_mutation_fragment().  So we can safely move its visibility to guarantee
that it stays that way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
1db245b52d add accounting of memory read to partition_snapshot_reader
By default, we don't do any accounting. By specializing this class and providing
an accounter class, we can account how much memory are we reading as we read
through the elements.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
452eb95943 move partition_snapshot_reader code to header file
This is so we can template it without worrying about declaring the
specializations in the .cc file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
86aa0b830d LSA: allow a group to query its own region group
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
eee15578fb memtables: split scanning reader in two
The code that is common will live in its own reader, the iterator_reader.  All
friendly private access to memtable attributes and methods happen through the
iterator reader.

After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
16886eeb96 sstables: use special reader for writing a memtable
Right now the special reader doesn't do much, but the idea is that we will
soon replace it will a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
28e3f2f6ee LSA: export information about object memory footprint
We allocate objects of a certain size, but we use a bit more memory to hold
them.  To get a clerer picture about how much memory will an object cost us, we
need help from the allocator. This patch exports an interface that allow users
to query into a specific allocator to get that information.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Pekka Enberg
c3bebea1ef dist/docker: Add '--listen-address' to 'docker run'
Add a '--listen-address' command line parameter to the Docker image,
which can be used to set Scylla's listen address.

Refs #1723

Message-Id: <1475485165-6772-1-git-send-email-penberg@scylladb.com>
2016-10-04 13:57:55 +03:00
Marius
876775a52c dist/docker/ubuntu: refactored $IP/listen_address
In order to allow Scylla’s docker container to handle multiple network
interfaces, the start-scylla script was refactored:

- `$IP` is now called `$SCYLLA_LISTEN_ADDRESS`, so it is less likely to
   be confused or interfere with other environment variables.
- `$SCYLLA_LISTEN_ADDRESS` now checks its value and also tries to
   resolve a hostname, if no IP was set to it.
- `$SCYLLA_LISTEN_DEVICE` can now be set as environment variable and
   contain any available NIC device name (e.g. `eth0`). The script
   automatically retrieves the IP address from the device.

Usage:

1. With `$SCYLLA_LISTEN_ADDRESS` as IP:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=192.168.1.100 scylladb/scylla`

2. With `$SCYLLA_LISTEN_ADDRESS` as hostname:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_ADDRESS=containername.network.lan scylladb/scylla`

3. With `$SCYLLA_LISTEN_DEVICE`:
`docker run -t -i --rm --name scylla -e SCYLLA_LISTEN_DEVICE=eth0 scylladb/scylla`

Message-Id: <20161003151230.67672-1-marius@twostairs.com>
2016-10-04 13:56:55 +03:00
Raphael S. Carvalho
747b42299c database: remove unused code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>
2016-10-04 09:26:43 +03:00
Paweł Dziepak
7599ef6fde query_pager: fix splitting range at the end bound
Currently, the code responsible for calculating ranges for the next
request could produce a wrap-around partition range. For example, if the
original range was (unimportant, A] and the last partition key A then
the output range would be (A, A].

This patch adds checks to make sure that in such cases the range is
removed.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475497244-2790-1-git-send-email-pdziepak@scylladb.com>
2016-10-03 19:33:42 +02:00
Avi Kivity
8747054d10 exceptions: mark function called before construction static
cassandra_exception::prepare_message() is called from derived classes'
constructors before the base cassnadra_exception object is constructed.
This is technically illegal but harmless.  Fix by marking the function
static.

Found by clang.
2016-10-03 16:29:02 +03:00
Calle Wilund
5b815b81b4 auth::password_authenticator: Ensure exceptions are processed in continuation
Fixes #1718 (even more)
Message-Id: <1475497389-27016-1-git-send-email-calle@scylladb.com>
2016-10-03 14:49:59 +02:00
Pekka Enberg
f3cd21c8f1 Merge seastar upstream
* seastar 0e60722...18f7bb8 (1):
  > core/memory: Fix compilation errors
2016-10-03 12:54:38 +03:00
Calle Wilund
d24d0f8f90 auth::password_authenticator: "authenticate" should not throw undeclared excpt
Fixes #1718

Message-Id: <1475487331-25927-1-git-send-email-calle@scylladb.com>
2016-10-03 12:53:30 +03:00
Avi Kivity
a51804eca8 Merge "token_restriction: Deal with minimum tokens" from Duarte
"This patch set ensures we can correctly handle queries
where the minimum token is specified."

* 'min-token/v3' of github.com:duarten/scylla:
  cql_query_test: Add test case for min/max token bounds
  token_restriction: Deal with minimum tokens
  partitioner: Parse token from bytes
2016-10-02 12:32:40 +03:00
Avi Kivity
5071f4c0bf Merge seastar upstream
* seastar 9e1d5db...0e60722 (9):
  > core/memory: Replace assert with bad_alloc in allocate_large()
  > chunked_fifo: avoid direct use of sized operator delete
  > memory: fix build without heap profiler
  > xen: initialize port::_sem
  > Merge "Make input streams skippable" from Paweł
  > semaphore: require explict setting for start value
  > prometheus: remove invalid chars from meric names
  > core/memory: Introduce heap profiler
  > util/backtrace: Mark noexcept if func() doesn't throw
2016-10-02 11:43:22 +03:00
Vlad Zolotarov
7e180c7bd3 tracing: introduce the tracing::global_trace_state_ptr class
This object, similarly to a global_schema_ptr, allows to dynamically
create the trace_state_ptr objects on different shards in a context
of the original tracing session.

This object would create a secondary tracing session object from the
original trace_state_ptr object when a trace_state_ptr object is needed
on a "remote" shard, similarly to what we do when we need it on a remote
Node.

Fixes #1678
Fixes #1647

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1474387767-21910-1-git-send-email-vladz@cloudius-systems.com>
2016-10-02 11:31:37 +03:00
Amnon Heiman
a83bd900be scylla_setup: Check and report the scylla version
This patch adds a call to the scylla-housekeeping check version during
setup, so a warning will be printed if a newer version is available.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-02 11:11:07 +03:00
Amnon Heiman
5e3ab32365 scylla-housekeeping: check version during setup
This changes are for running scylla during setup.

It contains the following changes:
1. get the current version from the command line (as the syclla does not
run at this stage).
2. It support a mode parameter in the command line to indicate that we
running during the installation.
3. It accept an external uuid that will be used with all interaction
with the check_version server.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-10-02 11:11:07 +03:00
Takuya ASADA
15b156c9d4 dist/common/scripts/scylla_io_setup: describe how to set developer mode when validation tests failed
Describe how to set developer mode, not to confuse users.
Fixes #1701

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475167584-18092-1-git-send-email-syuu@scylladb.com>
2016-10-02 10:58:38 +03:00
Avi Kivity
58ddfea18f Merge "Fixes for leveled compaction strategy" from Raphael
* 'lcs_fixes' of github.com:raphaelsc/scylla:
  lcs: fix starvation at higher levels
  lcs: fix broken token range distribution at higher levels
2016-10-01 21:34:21 +03:00
Takuya ASADA
9639cc840e dist/redhat: add missing build time dependency for libunwind
There was missing dependency for libunwind, so add it.
Fixes #1722

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475260099-25881-1-git-send-email-syuu@scylladb.com>
2016-09-30 21:33:39 +03:00
Takuya ASADA
c89d9599b1 dist/ubuntu: add missing build time dependency for libunwind
There was missing dependency for libunwind, so add it.
Fixes #1721

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475255706-26434-1-git-send-email-syuu@scylladb.com>
2016-09-30 21:33:21 +03:00
Raphael S. Carvalho
a8ab4b8f37 lcs: fix starvation at higher levels
When max sstable size is increased, higher levels are suffering from
starvation because we decide to compact a given level if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)

Fixes #1720.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:49 -03:00
Raphael S. Carvalho
a3bf7558f2 lcs: fix broken token range distribution at higher levels
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing sstable with lowest first key, when compacting
a level > 0. This resulted in performance problem because L1->L2 may have a
huge overlap over time, for example.
Last compacted key will now be stored for each level to ensure sort of
"round robin" selection of sstables for compactions at level >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.

Fixes #1719.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:16 -03:00
Paweł Dziepak
eb1fcf3ecc query_pagers: fix clustering key range calculation
Paging code assumes that clustering row range [a, a] contains only one
row which may not be true. Another problem is that it tries to use
range<> interface for dealing with clustering key ranges which doesn't
work because of the lack of correct comparator.

Refs #1446.
Fixes #1684.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
2016-09-30 17:32:59 +02:00
Tomasz Grabiec
7e25b958ac transport: Extend request memory footprint accounting to also cover execution
CQL server is supposed to throttle requests so that they don't
overflow memory. The problem is that it currently accounts for
request's memory only around reading of its frame from the connection
and not actual request execution. As a result too many requests may be
allowed to execute and we may run out of memory.

Fixes #1708.
Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>
2016-09-30 14:23:14 +01:00
Duarte Nunes
72af476397 cql_query_test: Add test case for min/max token bounds
This patch adds a test case for specifying the minimum and maximum
tokens in a cql3 query.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:45:45 +00:00
Duarte Nunes
98b4814894 token_restriction: Deal with minimum tokens
This patch fixes a bug where queries such as the following are not
handled properly:

"SELECT * FROM ks.cf WHERE token(id) >
9207857967443869328 AND token(id) <= -9223372036854775808"

Here -9223372036854775808 represents the minimum token, which we were
just translating into a token with kind::key, thus returning incorrect
results.

Ref #1139
Ref #693
Fixes #1717

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:08 +00:00
Duarte Nunes
862f51cddf partitioner: Parse token from bytes
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is parse a particular token and explicitly handle the
case when the minimum token is specified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:02 +00:00
Duarte Nunes
0c8f280af7 partition_key_view: Implement operator<<
The operator is declared, but it isn't implemented. This patch fixes
that.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1475225647-3800-1-git-send-email-duarte@scylladb.com>
2016-09-30 10:54:54 +02:00
Duarte Nunes
a36888f3cb storage_service: Convert token through partitioner
This patch ensures we use the partitioner to convert a token to
sstring instead of casting.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1475179683-28552-1-git-send-email-duarte@scylladb.com>
2016-09-30 10:54:26 +02:00
Glauber Costa
f5fd6bd714 LSA: export information about size of the throttle queue
Also add information about for how long has the oldest been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.

This counter is named for consistency after similar counters from transport/.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
Glauber Costa
aa6a96d09b database: export virtual dirty bytes region group
Currently, we export the region group where memtables are placed as dirty bytes.
Upcoming patches will optimistically mark some bytes in this region as free, a
scheme we know as "virtual dirty".

We are still interested in knowing the real state of the dirty region, so we
will keep track of the bytes virtually freed and split the counters in two.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
890 changed files with 89449 additions and 28760 deletions

9
.gitignore vendored
View File

@@ -9,3 +9,12 @@ dist/ami/files/*.rpm
dist/ami/variables.json
dist/ami/scylla_deploy.sh
*.pyc
Cql.tokens
.kdev4
*.kdev4
CMakeLists.txt.user
.cache
.tox
*.egg-info
__pycache__CMakeLists.txt.user
.gdbinit

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

140
CMakeLists.txt Normal file
View File

@@ -0,0 +1,140 @@
##
## For best results, first compile the project using the Ninja build-system.
##
cmake_minimum_required(VERSION 3.7)
project(scylla)
if (NOT DEFINED FOR_IDE AND NOT DEFINED ENV{FOR_IDE} AND NOT DEFINED ENV{CLION_IDE})
message(FATAL_ERROR "This CMakeLists.txt file is only valid for use in IDEs, please define FOR_IDE to acknowledge this.")
endif()
# Default value. A more accurate list is populated through `pkg-config` below if `seastar.pc` is available.
set(SEASTAR_INCLUDE_DIRS "seastar")
# These paths are always available, since they're included in the repository. Additional DPDK headers are placed while
# Seastar is built, and are captured in `SEASTAR_INCLUDE_DIRS` through parsing the Seastar pkg-config file (below).
set(SEASTAR_DPDK_INCLUDE_DIRS
seastar/dpdk/lib/librte_eal/common/include
seastar/dpdk/lib/librte_eal/common/include/generic
seastar/dpdk/lib/librte_eal/common/include/x86
seastar/dpdk/lib/librte_ether)
find_package(PkgConfig REQUIRED)
set(ENV{PKG_CONFIG_PATH} "${CMAKE_SOURCE_DIR}/seastar/build/release:$ENV{PKG_CONFIG_PATH}")
pkg_check_modules(SEASTAR seastar)
find_package(Boost COMPONENTS filesystem program_options system thread)
##
## Populate the names of all source and header files in the indicated paths in a designated variable.
##
## When RECURSIVE is specified, directories are traversed recursively.
##
## Use: scan_scylla_source_directories(VAR my_result_var [RECURSIVE] PATHS [path1 path2 ...])
##
function (scan_scylla_source_directories)
set(options RECURSIVE)
set(oneValueArgs VAR)
set(multiValueArgs PATHS)
cmake_parse_arguments(args "${options}" "${oneValueArgs}" "${multiValueArgs}" "${ARGN}")
set(globs "")
foreach (dir ${args_PATHS})
list(APPEND globs "${dir}/*.cc" "${dir}/*.hh")
endforeach()
if (args_RECURSIVE)
set(glob_kind GLOB_RECURSE)
else()
set(glob_kind GLOB)
endif()
file(${glob_kind} var
${globs})
set(${args_VAR} ${var} PARENT_SCOPE)
endfunction()
## Although Seastar is an external project, it is common enough to explore the sources while doing
## Scylla development that we'll treat the Seastar sources as part of this project for easier navigation.
scan_scylla_source_directories(
VAR SEASTAR_SOURCE_FILES
RECURSIVE
PATHS
seastar/core
seastar/http
seastar/json
seastar/net
seastar/rpc
seastar/tests
seastar/util)
scan_scylla_source_directories(
VAR SCYLLA_ROOT_SOURCE_FILES
PATHS .)
scan_scylla_source_directories(
VAR SCYLLA_SUB_SOURCE_FILES
RECURSIVE
PATHS
api
auth
cql3
db
dht
exceptions
gms
index
io
locator
message
repair
service
sstables
streaming
tests
thrift
tracing
transport
utils)
scan_scylla_source_directories(
VAR SCYLLA_GEN_SOURCE_FILES
RECURSIVE
PATHS build/release/gen)
set(SCYLLA_SOURCE_FILES
${SCYLLA_ROOT_SOURCE_FILES}
${SCYLLA_GEN_SOURCE_FILES}
${SCYLLA_SUB_SOURCE_FILES})
add_executable(scylla
${SEASTAR_SOURCE_FILES}
${SCYLLA_SOURCE_FILES})
# Note that since CLion does not undestand GCC6 concepts, we always disable them (even if users configure otherwise).
# CLion seems to have trouble with `-U` (macro undefinition), so we do it this way instead.
list(REMOVE_ITEM SEASTAR_CFLAGS "-DHAVE_GCC6_CONCEPTS")
# If the Seastar pkg-config information is available, append to the default flags.
#
# For ease of browsing the source code, we always pretend that DPDK is enabled.
target_compile_options(scylla PUBLIC
-std=gnu++14
-DHAVE_DPDK
-DHAVE_HWLOC
"${SEASTAR_CFLAGS}")
# The order matters here: prefer the "static" DPDK directories to any dynamic paths from pkg-config. Some files are only
# available dynamically, though.
target_include_directories(scylla PUBLIC
.
${SEASTAR_DPDK_INCLUDE_DIRS}
${SEASTAR_INCLUDE_DIRS}
${Boost_INCLUDE_DIRS}
build/release/gen)

11
CONTRIBUTING.md Normal file
View File

@@ -0,0 +1,11 @@
# Asking questions or requesting help
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) for general questions and help.
# Reporting an issue
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
# Contributing Code to Scylla
To contribute code to Scylla, you need to sign the [Contributor License Agreement](http://www.scylladb.com/opensource/cla/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.

233
HACKING.md Normal file
View File

@@ -0,0 +1,233 @@
# Guidelines for developing Scylla
This document is intended to help developers and contributors to Scylla get started. The first part consists of general guidelines that make no assumptions about a development environment or tooling. The second part describes a particular environment and work-flow for exemplary purposes.
## Overview
This section covers some high-level information about the Scylla source code and work-flow.
### Getting the source code
Scylla uses [Git submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules) to manage its dependency on Seastar and other tools. Be sure that all submodules are correctly initialized when cloning the project:
```bash
$ git clone https://github.com/scylladb/scylla
$ cd scylla
$ git submodule update --init --recursive
```
### Dependencies
Scylla depends on the system package manager for its development dependencies.
Running `./install_dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
### Build system
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native thread, and up to 3 GB per native thread while linking.
Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.
To build for the first time:
```bash
$ ./configure.py
$ ninja-build
```
Afterwards, it is sufficient to just execute Ninja.
The full suite of options for project configuration is available via
```bash
$ ./configure.py --help
```
The most important options are:
- `--mode={release,debug,all}`: Debug mode enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer) and allows for debugging with tools like GDB. Debugging builds are generally slower and generate much larger object files than release builds.
- `--{enable,disable}-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
Source files and build targets are tracked manually in `configure.py`, so the script needs to be updated when new files or targets are added or removed.
To save time -- for instance, to avoid compiling all unit tests -- you can also specify specific targets to Ninja. For example,
```bash
$ ninja-build build/release/tests/schema_change_test
```
### Unit testing
Unit tests live in the `/tests` directory. Like with application source files, test sources and executables are specified manually in `configure.py` and need to be updated when changes are made.
A test target can be any executable. A non-zero return code indicates test failure.
Most tests in the Scylla repository are built using the [Boost.Test](http://www.boost.org/doc/libs/1_64_0/libs/test/doc/html/index.html) library. Utilities for writing tests with Seastar futures are also included.
Run all tests through the test execution wrapper with
```bash
$ ./test.py --mode={debug,release}
```
The `--name` argument can be specified to run a particular test.
Alternatively, you can execute the test executable directly. For example,
```bash
$ build/release/tests/row_cache_test -- -c1 -m1G
```
The `-c1 -m1G` arguments limit this Seastar-based test to a single system thread and 1 GB of memory.
### Preparing patches
All changes to Scylla are submitted as patches to the public mailing list. Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/).
### Running Scylla
Once Scylla has been compiled, executing the (`debug` or `release`) target will start a running instance in the foreground:
```bash
$ build/release/scylla
```
The `scylla` executable requires a configuration file, `scylla.yaml`. By default, this is read from `$SCYLLA_HOME/conf/scylla.yaml`. A good starting point for development is located in the repository at `/conf/scylla.yaml`.
For development, a directory at `$HOME/scylla` can be used for all Scylla-related files:
```bash
$ mkdir -p $HOME/scylla $HOME/scylla/conf
$ cp conf/scylla.yaml $HOME/scylla/conf/scylla.yaml
$ # Edit configuration options as appropriate
$ SCYLLA_HOME=$HOME/scylla build/release/scylla
```
The `scylla.yaml` file in the repository by default writes all database data to `/var/lib/scylla`, which likely requires root access. Change the `data_file_directories` and `commitlog_directory` fields as appropriate.
Scylla has a number of requirements for the file-system and operating system to operate ideally and at peak performance. However, during development, these requirements can be relaxed with the `--developer-mode` flag.
Additionally, when running on under-powered platforms like portable laptops, the `--overprovisined` flag is useful.
On a development machine, one might run Scylla as
```bash
$ SCYLLA_HOME=$HOME/scylla build/release/scylla --overprovisioned --developer-mode=yes
```
### Branches and tags
Multiple release branches are maintained on the Git repository at https://github.com/scylladb/scylla. Release 1.5, for instance, is tracked on the `branch-1.5` branch.
Similarly, tags are used to pin-point precise release versions, including hot-fix versions like 1.5.4. These are named `scylla-1.5.4`, for example.
Most development happens on the `master` branch. Release branches are cut from `master` based on time and/or features. When a patch against `master` fixes a serious issue like a node crash or data loss, it is backported to a particular release branch with `git cherry-pick` by the project maintainers.
## Example: development on Fedora 25
This section describes one possible work-flow for developing Scylla on a Fedora 25 system. It is presented as an example to help you to develop a work-flow and tools that you are comfortable with.
### Preface
This guide will be written from the perspective of a fictitious developer, Taylor Smith.
### Git work-flow
Having two Git remotes is useful:
- A public clone of Seastar (`"public"`)
- A private clone of Seastar (`"private"`) for in-progress work or work that is not yet ready to share
The first step to contributing a change to Scylla is to create a local branch dedicated to it. For example, a feature that fixes a bug in the CQL statement for creating tables could be called `ts/cql_create_table_error/v1`. The branch name is prefaced by the developer's initials and has a suffix indicating that this is the first version. The version suffix is useful when branches are shared publicly and changes are requested on the mailing list. Having a branch for each version of the patch (or patch set) shared publicly makes it easier to reference and compare the history of a change.
Setting the upstream branch of your development branch to `master` is a useful way to track your changes. You can do this with
```bash
$ git branch -u master ts/cql_create_table_error/v1
```
As a patch set is developed, you can periodically push the branch to the private remote to back-up work.
Once the patch set is ready to be reviewed, push the branch to the public remote and prepare an email to the `scylladb-dev` mailing list. Including a link to the branch on your public remote allows for reviewers to quickly test and explore your changes.
### Development environment and source code navigation
Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt`, for use only with development environments (not for building) so that they can properly analyze the source code.
[CLion](https://www.jetbrains.com/clion/) is a commercial IDE offers reasonably good source code navigation and advice for code hygiene, though its C++ parser sometimes makes errors and flags false issues.
Other good options that directly parse CMake files are [KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).
To use the `CMakeLists.txt` file with these programs, define the `FOR_IDE` CMake variable or shell environmental variable.
[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects, and its C++ parser has many similar issues as CLion.
### Distributed compilation: `distcc` and `ccache`
Scylla's compilations times can be long. Two tools help somewhat:
- [ccache](https://ccache.samba.org/) caches compiled object files on disk and re-uses them when possible
- [distcc](https://github.com/distcc/distcc) distributes compilation jobs to remote machines
A reasonably-powered laptop acts as the coordinator for compilation. A second, more powerful, machine acts as a passive compilation server.
Having a direct wired connection between the machines ensures that object files can be transmitted quickly and limits the overhead of remote compilation.
The coordinator has been assigned the static IP address `10.0.0.1` and the passive compilation machine has been assigned `10.0.0.2`.
On Fedora, installing the `ccache` package places symbolic links for `gcc` and `g++` in the `PATH`. This allows normal compilation to transparently invoke `ccache` for compilation and cache object files on the local file-system.
Next, set `CCACHE_PREFIX` so that `ccache` is responsible for invoking `distcc` as necessary:
```bash
export CCACHE_PREFIX="distcc"
```
On each host, edit `/etc/sysconfig/distccd` to include the allowed coordinators and the total number of jobs that the machine should accept.
This example is for the laptop, which has 2 physical cores (4 logical cores with hyper-threading):
```
OPTIONS="--allow 10.0.0.2 --allow 127.0.0.1 --jobs 4"
```
`10.0.0.2` has 8 physical cores (16 logical cores) and 64 GB of memory.
As a rule-of-thumb, the number of jobs that a machine should be specified to support should be equal to the number of its native threads.
Restart the `distccd` service on all machines.
On the coordinator machine, edit `$HOME/.distcc/hosts` with the available hosts for compilation. Order of the hosts indicates preference.
```
10.0.0.2/16 localhost/2
```
In this example, `10.0.0.2` will be sent up to 16 jobs and the local machine will be sent up to 2. Allowing for two extra threads on the host machine for coordination, we run compilation with `16 + 2 + 2 = 20` jobs in total: `ninja-build -j20`.
When a compilation is in progress, the status of jobs on all remote machines can be visualized in the terminal with `distccmon-text` or graphically as a GTK application with `distccmon-gnome`.
One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next section speeding up this process.
### Using the `gold` linker
Linking Scylla can be slow. The gold linker can replace GNU ld and often speeds the linking process. On Fedora, you can switch the system linker using
```bash
$ sudo alternatives --config ld
```
### Testing changes in Seastar with Scylla
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
One way to do this it to create a local remote for the Seastar submodule in the Scylla repository:
```bash
$ cd $HOME/src/scylla
$ cd seastar
$ git remote add local /home/tsmith/src/seastar
$ git remote update
$ git checkout -t local/my_local_seastar_branch
```

View File

@@ -1,29 +1,19 @@
# Scylla
## Building Scylla
## Quick-start
In addition to required packages by Seastar, the following packages are required by Scylla.
### Submodules
Scylla uses submodules, so make sure you pull the submodules first by doing:
```
git submodule init
git submodule update --init --recursive
```bash
$ git submodule update --init --recursive
$ sudo ./install-dependencies.sh
$ ./configure.py --mode=release
$ ninja-build -j4 # Assuming 4 system threads.
$ ./build/release/scylla
$ # Rejoice!
```
### Building and Running Scylla on Fedora
* Installing required packages:
Please see [HACKING.md](HACKING.md) for detailed information on building and developing Scylla.
```
sudo dnf install yaml-cpp-devel lz4-devel zlib-devel snappy-devel jsoncpp-devel thrift-devel antlr3-tool antlr3-C++-devel libasan libubsan gcc-c++ gnutls-devel ninja-build ragel libaio-devel cryptopp-devel xfsprogs-devel numactl-devel hwloc-devel libpciaccess-devel libxml2-devel python3-pyparsing lksctp-tools-devel protobuf-devel protobuf-compiler systemd-devel libunwind-devel
```
* Build Scylla
```
./configure.py --mode=release --with=scylla --disable-xen
ninja-build build/release/scylla -j2 # you can use more cpus if you have tons of RAM
```
## Running Scylla
* Run Scylla
```
@@ -83,14 +73,6 @@ Run the image with:
docker run -p $(hostname -i):9042:9042 -i -t <image name>
```
## Contributing to Scylla
Do not send pull requests.
Send patches to the mailing list address scylladb-dev@googlegroups.com.
Be sure to subscribe.
In order for your patches to be merged, you must sign the Contributor's
License Agreement, protecting your rights and ours. See
http://www.scylladb.com/opensource/cla/.
[Guidelines for contributing](CONTRIBUTING.md)

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=666.development
VERSION=2.1.6
if test -f version
then
@@ -10,7 +10,12 @@ else
DATE=$(date +%Y%m%d)
GIT_COMMIT=$(git log --pretty=format:'%h' -n 1)
SCYLLA_VERSION=$VERSION
SCYLLA_RELEASE=$DATE.$GIT_COMMIT
# For custom package builds, replace "0" with "counter.your_name",
# where counter starts at 1 and increments for successive versions.
# This ensures that the package manager will select your custom
# package over the standard release.
SCYLLA_BUILD=0
SCYLLA_RELEASE=$SCYLLA_BUILD.$DATE.$GIT_COMMIT
fi
echo "$SCYLLA_VERSION-$SCYLLA_RELEASE"

View File

@@ -397,6 +397,36 @@
}
]
},
{
"path": "/cache_service/metrics/key/hits_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get key hits moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_key_hits_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/key/requests_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get key requests moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_key_requests_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/key/size",
"operations": [
@@ -607,6 +637,36 @@
}
]
},
{
"path": "/cache_service/metrics/counter/hits_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get counter hits moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_counter_hits_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/counter/requests_moving_avrage",
"operations": [
{
"method": "GET",
"summary": "Get counter requests moving avrage",
"type": "#/utils/rate_moving_average",
"nickname": "get_counter_requests_moving_avrage",
"produces": [
"application/json"
],
"parameters": []
}
]
},
{
"path": "/cache_service/metrics/counter/size",
"operations": [

View File

@@ -78,11 +78,19 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"split_output",
"description":"true if the output of the major compaction should be split in several sstables",
"required":false,
"allowMultiple":false,
"type":"bool",
"paramType":"query"
}
]
}
@@ -102,7 +110,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -129,7 +137,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -153,7 +161,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -180,7 +188,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -204,7 +212,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -244,7 +252,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -271,7 +279,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -298,7 +306,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -317,7 +325,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -349,7 +357,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -381,7 +389,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -405,7 +413,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -432,7 +440,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -459,7 +467,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -491,7 +499,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -518,7 +526,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -545,7 +553,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -569,7 +577,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -593,7 +601,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -633,7 +641,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -673,7 +681,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -713,7 +721,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -753,7 +761,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -793,7 +801,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -833,7 +841,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -873,7 +881,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -916,7 +924,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -943,7 +951,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -970,7 +978,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -994,7 +1002,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1034,7 +1042,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1058,7 +1066,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1101,7 +1109,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1144,7 +1152,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1203,7 +1211,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1243,7 +1251,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1267,7 +1275,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1310,7 +1318,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1353,7 +1361,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1412,7 +1420,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1452,7 +1460,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1492,7 +1500,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1532,7 +1540,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1572,7 +1580,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1612,7 +1620,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1652,7 +1660,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1692,7 +1700,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1732,7 +1740,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1772,7 +1780,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1812,7 +1820,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1852,7 +1860,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1892,7 +1900,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1932,7 +1940,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -1972,7 +1980,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2012,7 +2020,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2052,7 +2060,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2092,7 +2100,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2116,7 +2124,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2156,7 +2164,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2196,7 +2204,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2236,7 +2244,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2276,7 +2284,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2300,7 +2308,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2324,7 +2332,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2351,7 +2359,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2378,7 +2386,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2405,7 +2413,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2432,7 +2440,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2501,7 +2509,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2525,7 +2533,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2549,7 +2557,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2573,7 +2581,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2597,7 +2605,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2621,7 +2629,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2645,7 +2653,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2669,7 +2677,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2693,7 +2701,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2717,7 +2725,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2741,7 +2749,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -2765,7 +2773,7 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keysspace:name format",
"description":"The column family name in keyspace:name format",
"required":true,
"allowMultiple":false,
"type":"string",

View File

@@ -21,8 +21,8 @@
"parameters":[
{
"name":"host",
"description":"The host name",
"required":true,
"description":"The host name. If absent, the local server broadcast/listen address is used",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
@@ -45,8 +45,8 @@
"parameters":[
{
"name":"host",
"description":"The host name",
"required":true,
"description":"The host name. If absent, the local server broadcast/listen address is used",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"

View File

@@ -42,6 +42,25 @@
}
]
},
{
"path":"/failure_detector/endpoint_phi_values",
"operations":[
{
"method":"GET",
"summary":"Get end point phi values",
"type":"array",
"items":{
"type":"endpoint_phi_values"
},
"nickname":"get_endpoint_phi_values",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/failure_detector/endpoints/",
"operations":[
@@ -202,6 +221,20 @@
"description": "The application state version"
}
}
},
"endpoint_phi_value": {
"id" : "endpoint_phi_value",
"description": "Holds phi value for a single end point",
"properties": {
"phi": {
"type": "double",
"description": "Phi value"
},
"endpoint": {
"type": "string",
"description": "end point address"
}
}
}
}
}

View File

@@ -777,7 +777,7 @@
]
},
{
"path": "/storage_proxy/metrics/read/moving_avrage_histogram",
"path": "/storage_proxy/metrics/read/moving_average_histogram",
"operations": [
{
"method": "GET",
@@ -792,7 +792,7 @@
]
},
{
"path": "/storage_proxy/metrics/range/moving_avrage_histogram",
"path": "/storage_proxy/metrics/range/moving_average_histogram",
"operations": [
{
"method": "GET",
@@ -942,7 +942,7 @@
]
},
{
"path": "/storage_proxy/metrics/write/moving_avrage_histogram",
"path": "/storage_proxy/metrics/write/moving_average_histogram",
"operations": [
{
"method": "GET",

View File

@@ -952,6 +952,22 @@
}
]
},
{
"path":"/storage_service/force_terminate_repair",
"operations":[
{
"method":"POST",
"summary":"Force terminate all repair sessions",
"type":"void",
"nickname":"force_terminate_all_repair_sessions_new",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/storage_service/decommission",
"operations":[
@@ -1201,11 +1217,12 @@
],
"parameters":[
{
"name":"non_system",
"description":"When set to true limit to non system",
"name":"type",
"description":"Which keyspaces to return",
"required":false,
"allowMultiple":false,
"type":"boolean",
"type":"string",
"enum": [ "all", "user", "non_local_strategy" ],
"paramType":"query"
}
]

View File

@@ -49,7 +49,7 @@ static std::unique_ptr<reply> exception_reply(std::exception_ptr eptr) {
throw bad_param_exception(ex.what());
}
// We never going to get here
return std::make_unique<reply>();
throw std::runtime_error("exception_reply");
}
future<> set_server_init(http_context& ctx) {

View File

@@ -29,6 +29,7 @@
#include "utils/histogram.hh"
#include "http/exception.hh"
#include "api_init.hh"
#include "seastarx.hh"
namespace api {
@@ -166,33 +167,36 @@ inline int64_t max_int64(int64_t a, int64_t b) {
* It combine total and the sub set for the ratio and its
* to_json method return the ration sub/total
*/
struct ratio_holder : public json::jsonable {
double total = 0;
double sub = 0;
template<typename T>
struct basic_ratio_holder : public json::jsonable {
T total = 0;
T sub = 0;
virtual std::string to_json() const {
if (total == 0) {
return "0";
}
return std::to_string(sub/total);
}
ratio_holder() = default;
ratio_holder& add(double _total, double _sub) {
basic_ratio_holder() = default;
basic_ratio_holder& add(T _total, T _sub) {
total += _total;
sub += _sub;
return *this;
}
ratio_holder(double _total, double _sub) {
basic_ratio_holder(T _total, T _sub) {
total = _total;
sub = _sub;
}
ratio_holder& operator+=(const ratio_holder& a) {
basic_ratio_holder<T>& operator+=(const basic_ratio_holder<T>& a) {
return add(a.total, a.sub);
}
friend ratio_holder operator+(ratio_holder a, const ratio_holder& b) {
friend basic_ratio_holder<T> operator+(basic_ratio_holder a, const basic_ratio_holder<T>& b) {
return a += b;
}
};
typedef basic_ratio_holder<double> ratio_holder;
typedef basic_ratio_holder<int64_t> integral_ratio_holder;
class unimplemented_exception : public base_exception {
public:

View File

@@ -177,6 +177,20 @@ void set_cache_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cs::get_key_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_key_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_key_size.set(r, [] (std::unique_ptr<request> req) {
// TBD
// FIXME
@@ -194,7 +208,7 @@ void set_cache_service(http_context& ctx, routes& r) {
});
cs::get_row_capacity.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return map_reduce_cf(ctx, uint64_t(0), [](const column_family& cf) {
return cf.get_row_cache().get_cache_tracker().region().occupancy().used_space();
}, std::plus<uint64_t>());
});
@@ -238,13 +252,13 @@ void set_cache_service(http_context& ctx, routes& r) {
// In origin row size is the weighted size.
// We currently do not support weights, so we use num entries instead
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return cf.get_row_cache().num_entries();
return cf.get_row_cache().partitions();
}, std::plus<uint64_t>());
});
cs::get_row_entries.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](const column_family& cf) {
return cf.get_row_cache().num_entries();
return cf.get_row_cache().partitions();
}, std::plus<uint64_t>());
});
@@ -280,6 +294,20 @@ void set_cache_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(0);
});
cs::get_counter_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_counter_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<request> req) {
// TBD
// FIXME
// See above
return make_ready_future<json::json_return_type>(meter_to_json(utils::rate_moving_average()));
});
cs::get_counter_size.set(r, [] (std::unique_ptr<request> req) {
// TBD
// FIXME

View File

@@ -40,13 +40,13 @@ static auto transformer(const std::vector<collectd_value>& values) {
for (auto v: values) {
switch (v._type) {
case scollectd::data_type::GAUGE:
collected_value.values.push(v.u._d);
collected_value.values.push(v.d());
break;
case scollectd::data_type::DERIVE:
collected_value.values.push(v.u._i);
collected_value.values.push(v.i());
break;
default:
collected_value.values.push(v.u._ui);
collected_value.values.push(v.ui());
break;
}
}

View File

@@ -182,17 +182,8 @@ static int64_t max_row_size(column_family& cf) {
return res;
}
static double update_ratio(double acc, double f, double total) {
if (f && !total) {
throw bad_param_exception("total should include all elements");
} else if (total) {
acc += f / total;
}
return acc;
}
static ratio_holder mean_row_size(column_family& cf) {
ratio_holder res;
static integral_ratio_holder mean_row_size(column_family& cf) {
integral_ratio_holder res;
for (auto i: *cf.get_sstables() ) {
auto c = i->get_stats_metadata().estimated_row_size.count();
res.sub += i->get_stats_metadata().estimated_row_size.mean() * c;
@@ -283,6 +274,16 @@ static std::vector<uint64_t> concat_sstable_count_per_level(std::vector<uint64_t
return a;
}
ratio_holder filter_false_positive_as_ratio_holder(const sstables::shared_sstable& sst) {
double f = sst->filter_get_false_positive();
return ratio_holder(f + sst->filter_get_true_positive(), f);
}
ratio_holder filter_recent_false_positive_as_ratio_holder(const sstables::shared_sstable& sst) {
double f = sst->filter_get_recent_false_positive();
return ratio_holder(f + sst->filter_get_recent_true_positive(), f);
}
void set_column_family(http_context& ctx, routes& r) {
cf::get_column_family_name.set(r, [&ctx] (const_req req){
vector<sstring> res;
@@ -562,11 +563,13 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), mean_row_size, std::plus<ratio_holder>());
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, req->param["name"], integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());
});
cf::get_all_mean_row_size.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, ratio_holder(), mean_row_size, std::plus<ratio_holder>());
// Cassandra 3.x mean values are truncated as integrals.
return map_reduce_cf(ctx, integral_ratio_holder(), mean_row_size, std::plus<integral_ratio_holder>());
});
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
@@ -602,39 +605,27 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], double(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {
double f = sst->filter_get_false_positive();
return update_ratio(s, f, f + sst->filter_get_true_positive());
});
}, std::plus<double>());
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_all_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, double(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {
double f = sst->filter_get_false_positive();
return update_ratio(s, f, f + sst->filter_get_true_positive());
});
}, std::plus<double>());
return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], double(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {
double f = sst->filter_get_recent_false_positive();
return update_ratio(s, f, f + sst->filter_get_recent_true_positive());
});
}, std::plus<double>());
return map_reduce_cf(ctx, req->param["name"], ratio_holder(), [] (column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_all_recent_bloom_filter_false_ratio.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, double(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), double(0), [](double s, auto& sst) {
double f = sst->filter_get_recent_false_positive();
return update_ratio(s, f, f + sst->filter_get_recent_true_positive());
});
}, std::plus<double>());
return map_reduce_cf(ctx, ratio_holder(), [] (column_family& cf) {
return boost::accumulate(*cf.get_sstables() | boost::adaptors::transformed(filter_recent_false_positive_as_ratio_holder), ratio_holder());
}, std::plus<>());
});
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {

View File

@@ -20,13 +20,13 @@
*/
#include "compaction_manager.hh"
#include "sstables/compaction_manager.hh"
#include "api/api-doc/compaction_manager.json.hh"
#include "db/system_keyspace.hh"
#include "column_family.hh"
namespace api {
using namespace scollectd;
namespace cm = httpd::compaction_manager_json;
using namespace json;

View File

@@ -22,16 +22,22 @@
#include "locator/snitch_base.hh"
#include "endpoint_snitch.hh"
#include "api/api-doc/endpoint_snitch_info.json.hh"
#include "utils/fb_utilities.hh"
namespace api {
void set_endpoint_snitch(http_context& ctx, routes& r) {
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(req.get_query_param("host"));
static auto host_or_broadcast = [](const_req req) {
auto host = req.get_query_param("host");
return host.empty() ? gms::inet_address(utils::fb_utilities::get_broadcast_address()) : gms::inet_address(host);
};
httpd::endpoint_snitch_info_json::get_datacenter.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_datacenter(host_or_broadcast(req));
});
httpd::endpoint_snitch_info_json::get_rack.set(r, [] (const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(req.get_query_param("host"));
httpd::endpoint_snitch_info_json::get_rack.set(r, [](const_req req) {
return locator::i_endpoint_snitch::get_local_snitch_ptr()->get_rack(host_or_broadcast(req));
});
httpd::endpoint_snitch_info_json::get_snitch_name.set(r, [] (const_req req) {

View File

@@ -88,6 +88,20 @@ void set_failure_detector(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(state);
});
});
fd::get_endpoint_phi_values.set(r, [](std::unique_ptr<request> req) {
return gms::get_arrival_samples().then([](std::map<gms::inet_address, gms::arrival_window> map) {
std::vector<fd::endpoint_phi_value> res;
auto now = gms::arrival_window::clk::now();
for (auto& p : map) {
fd::endpoint_phi_value val;
val.endpoint = p.first.to_sstring();
val.phi = p.second.phi(now);
res.emplace_back(std::move(val));
}
return make_ready_future<json::json_return_type>(res);
});
});
}
}

View File

@@ -24,7 +24,6 @@
namespace api {
using namespace scollectd;
using namespace json;
namespace hh = httpd::hinted_handoff_json;

View File

@@ -29,11 +29,11 @@
namespace api {
static logging::logger logger("lsa-api");
static logging::logger alogger("lsa-api");
void set_lsa(http_context& ctx, routes& r) {
httpd::lsa_json::lsa_compact.set(r, [&ctx](std::unique_ptr<request> req) {
logger.info("Triggering compaction");
alogger.info("Triggering compaction");
return ctx.db.invoke_on_all([] (database&) {
logalloc::shard_tracker().reclaim(std::numeric_limits<size_t>::max());
}).then([] {

View File

@@ -27,7 +27,7 @@
#include <sstream>
using namespace httpd::messaging_service_json;
using namespace net;
using namespace netw;
namespace api {
@@ -120,13 +120,13 @@ void set_messaging_service(http_context& ctx, routes& r) {
}));
get_version.set(r, [](const_req req) {
return net::get_local_messaging_service().get_raw_version(req.get_query_param("addr"));
return netw::get_local_messaging_service().get_raw_version(req.get_query_param("addr"));
});
get_dropped_messages_by_ver.set(r, [](std::unique_ptr<request> req) {
shared_ptr<std::vector<uint64_t>> map = make_shared<std::vector<uint64_t>>(num_verb);
return net::get_messaging_service().map_reduce([map](const uint64_t* local_map) mutable {
return netw::get_messaging_service().map_reduce([map](const uint64_t* local_map) mutable {
for (auto i = 0; i < num_verb; i++) {
(*map)[i]+= local_map[i];
}

View File

@@ -397,7 +397,7 @@ void set_storage_proxy(http_context& ctx, routes& r) {
});
sp::get_range_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return sum_timer_stats(ctx.sp, &proxy::stats::read);
return sum_timer_stats(ctx.sp, &proxy::stats::range);
});
sp::get_range_latency.set(r, [&ctx](std::unique_ptr<request> req) {

View File

@@ -22,6 +22,8 @@
#include "storage_service.hh"
#include "api/api-doc/storage_service.json.hh"
#include "db/config.hh"
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <service/storage_service.hh>
#include <db/commitlog/commitlog.hh>
#include <gms/gossiper.hh>
@@ -32,6 +34,7 @@
#include "column_family.hh"
#include "log.hh"
#include "release.hh"
#include "sstables/compaction_manager.hh"
namespace api {
@@ -359,16 +362,22 @@ void set_storage_service(http_context& ctx, routes& r) {
try {
res = fut.get0();
} catch(std::runtime_error& e) {
return make_ready_future<json::json_return_type>(json_exception(httpd::bad_param_exception(e.what())));
throw httpd::bad_param_exception(e.what());
}
return make_ready_future<json::json_return_type>(json::json_return_type(res));
});
});
ss::force_terminate_all_repair_sessions.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(json_void());
return repair_abort_all(service::get_local_storage_service().db()).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::force_terminate_all_repair_sessions_new.set(r, [](std::unique_ptr<request> req) {
return repair_abort_all(service::get_local_storage_service().db()).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::decommission.set(r, [](std::unique_ptr<request> req) {
@@ -457,8 +466,15 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_keyspaces.set(r, [&ctx](const_req req) {
auto non_system = req.get_query_param("non_system");
return map_keys(ctx.db.local().keyspaces());
auto type = req.get_query_param("type");
if (type == "user") {
return ctx.db.local().get_non_system_keyspaces();
} else if (type == "non_local_strategy") {
return map_keys(ctx.db.local().get_keyspaces() | boost::adaptors::filtered([](const auto& p) {
return p.second.get_replication_strategy().get_type() != locator::replication_strategy_type::local;
}));
}
return map_keys(ctx.db.local().get_keyspaces());
});
ss::update_snitch.set(r, [](std::unique_ptr<request> req) {
@@ -542,9 +558,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::is_joined.set(r, [] (std::unique_ptr<request> req) {
return service::get_local_storage_service().is_joined().then([] (bool is_joined) {
return make_ready_future<json::json_return_type>(is_joined);
});
return make_ready_future<json::json_return_type>(service::get_local_storage_service().is_joined());
});
ss::set_stream_throughput_mb_per_sec.set(r, [](std::unique_ptr<request> req) {
@@ -664,17 +678,23 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::set_trace_probability.set(r, [](std::unique_ptr<request> req) {
auto probability = req->get_query_param("probability");
try {
return futurize<json::json_return_type>::apply([probability] {
double real_prob = std::stod(probability.c_str());
return tracing::tracing::tracing_instance().invoke_on_all([real_prob] (auto& local_tracing) {
local_tracing.set_trace_probability(real_prob);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
} catch (...) {
throw httpd::bad_param_exception(sprint("Bad format of a probability value: \"%s\"", probability.c_str()));
}
}).then_wrapped([probability] (auto&& f) {
try {
f.get();
return make_ready_future<json::json_return_type>(json_void());
} catch (std::out_of_range& e) {
throw httpd::bad_param_exception(e.what());
} catch (std::invalid_argument&){
throw httpd::bad_param_exception(sprint("Bad format in a probability value: \"%s\"", probability.c_str()));
}
});
});
ss::get_trace_probability.set(r, [](std::unique_ptr<request> req) {
@@ -684,8 +704,8 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::get_slow_query_info.set(r, [](const_req req) {
ss::slow_query_info res;
res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();
res.ttl = std::chrono::duration_cast<std::chrono::microseconds>(tracing::tracing::get_local_tracing_instance().slow_query_record_ttl()).count() ;
res.threshold = std::chrono::duration_cast<std::chrono::microseconds>(tracing::tracing::get_local_tracing_instance().slow_query_threshold()).count();
res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;
res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();
return res;
});
@@ -789,10 +809,8 @@ void set_storage_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(json_void());
});
ss::get_metrics_load.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(0);
ss::get_metrics_load.set(r, [&ctx](std::unique_ptr<request> req) {
return get_cf_stats(ctx, &column_family::stats::live_disk_space_used);
});
ss::get_exceptions.set(r, [](const_req req) {

View File

@@ -28,11 +28,12 @@
#include "utils/managed_bytes.hh"
#include "net/byteorder.hh"
#include <cstdint>
#include <iostream>
#include <iosfwd>
#include <seastar/util/gcc6-concepts.hh>
template<typename T>
template<typename T, typename Input>
static inline
void set_field(managed_bytes& v, unsigned offset, T val) {
void set_field(Input& v, unsigned offset, T val) {
reinterpret_cast<net::packed<T>*>(v.begin() + offset)->raw = net::hton(val);
}
@@ -57,6 +58,8 @@ private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t REVERT_FLAG = 0x04; // transient flag used to efficiently implement ReversiblyMergeable for atomic cells.
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr int8_t COUNTER_IN_PLACE_REVERT = 0x10;
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
@@ -66,14 +69,25 @@ private:
static constexpr unsigned deletion_time_size = 4;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(bytes_view cell) {
return cell[0] & COUNTER_UPDATE_FLAG;
}
static bool is_revert_set(bytes_view cell) {
return cell[0] & REVERT_FLAG;
}
static bool is_counter_in_place_revert_set(bytes_view cell) {
return cell[0] & COUNTER_IN_PLACE_REVERT;
}
template<typename BytesContainer>
static void set_revert(BytesContainer& cell, bool revert) {
cell[0] = (cell[0] & ~REVERT_FLAG) | (revert * REVERT_FLAG);
}
template<typename BytesContainer>
static void set_counter_in_place_revert(BytesContainer& cell, bool flag) {
cell[0] = (cell[0] & ~COUNTER_IN_PLACE_REVERT) | (flag * COUNTER_IN_PLACE_REVERT);
}
static bool is_live(const bytes_view& cell) {
return cell[0] & LIVE_FLAG;
}
@@ -87,13 +101,30 @@ private:
static api::timestamp_type timestamp(const bytes_view& cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
template<typename BytesContainer>
static void set_timestamp(BytesContainer& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
static bytes_view value(bytes_view cell) {
private:
template<typename BytesView>
static BytesView do_get_value(BytesView cell) {
auto expiry_field_size = bool(cell[0] & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static bytes_view value(bytes_view cell) {
return do_get_value(cell);
}
static bytes_mutable_view value(bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(bytes_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(const bytes_view& cell) {
assert(is_dead(cell));
@@ -126,6 +157,14 @@ private:
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, value_offset, value);
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
@@ -136,6 +175,31 @@ private:
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
// make_live_from_serializer() is intended for users that need to serialise
// some object or objects to the format used in atomic_cell::value().
// With just make_live() the patter would look like follows:
// 1. allocate a buffer and write to it serialised objects
// 2. pass that buffer to make_live()
// 3. make_live() needs to prepend some metadata to the cell value so it
// allocates a new buffer and copies the content of the original one
//
// The allocation and copy of a buffer can be avoided.
// make_live_from_serializer() allows the user code to specify the timestamp
// and size of the cell value as well as provide the serialiser function
// object, which would write the serialised value of the cell to the buffer
// given to it by make_live_from_serializer().
template<typename Serializer>
GCC6_CONCEPT(requires requires(Serializer serializer, bytes::iterator it) {
serializer(it);
})
static managed_bytes make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
serializer(b.begin() + value_offset);
return b;
}
template<typename ByteContainer>
friend class atomic_cell_base;
friend class atomic_cell;
@@ -149,17 +213,23 @@ protected:
atomic_cell_base(ByteContainer&& data) : _data(std::forward<ByteContainer>(data)) { }
friend class atomic_cell_or_collection;
public:
bool is_counter_update() const {
return atomic_cell_type::is_counter_update(_data);
}
bool is_revert_set() const {
return atomic_cell_type::is_revert_set(_data);
}
bool is_counter_in_place_revert_set() const {
return atomic_cell_type::is_counter_in_place_revert_set(_data);
}
bool is_live() const {
return atomic_cell_type::is_live(_data);
}
bool is_live(tombstone t) const {
return is_live() && !is_covered_by(t);
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
}
bool is_live(tombstone t, gc_clock::time_point now) const {
return is_live() && !is_covered_by(t) && !has_expired(now);
bool is_live(tombstone t, gc_clock::time_point now, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return atomic_cell_type::is_live_and_has_ttl(_data);
@@ -167,17 +237,24 @@ public:
bool is_dead(gc_clock::time_point now) const {
return atomic_cell_type::is_dead(_data) || has_expired(now);
}
bool is_covered_by(tombstone t) const {
return timestamp() <= t.timestamp;
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
return atomic_cell_type::timestamp(_data);
}
void set_timestamp(api::timestamp_type ts) {
atomic_cell_type::set_timestamp(_data, ts);
}
// Can be called on live cells only
bytes_view value() const {
auto value() const {
return atomic_cell_type::value(_data);
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return atomic_cell_type::counter_update_value(_data);
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? atomic_cell_type::deletion_time(_data) : expiry() - ttl();
@@ -192,7 +269,7 @@ public:
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() < now;
return is_live_and_has_ttl() && expiry() <= now;
}
bytes_view serialize() const {
return _data;
@@ -200,6 +277,9 @@ public:
void set_revert(bool revert) {
atomic_cell_type::set_revert(_data, revert);
}
void set_counter_in_place_revert(bool flag) {
atomic_cell_type::set_counter_in_place_revert(_data, flag);
}
};
class atomic_cell_view final : public atomic_cell_base<bytes_view> {
@@ -211,6 +291,14 @@ public:
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
};
class atomic_cell_mutable_view final : public atomic_cell_base<bytes_mutable_view> {
atomic_cell_mutable_view(bytes_mutable_view data) : atomic_cell_base(std::move(data)) {}
public:
static atomic_cell_mutable_view from_bytes(bytes_mutable_view data) { return atomic_cell_mutable_view(data); }
friend class atomic_cell;
};
class atomic_cell_ref final : public atomic_cell_base<managed_bytes&> {
public:
atomic_cell_ref(managed_bytes& buf) : atomic_cell_base(buf) {}
@@ -239,6 +327,9 @@ public:
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value) {
return make_live(timestamp, bytes_view(value));
}
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl)
{
@@ -256,6 +347,10 @@ public:
return atomic_cell_type::make_live(timestamp, value, gc_clock::now() + *ttl, *ttl);
}
}
template<typename Serializer>
static atomic_cell make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
return atomic_cell_type::make_live_from_serializer(timestamp, size, std::forward<Serializer>(serializer));
}
friend class atomic_cell_or_collection;
friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);
};
@@ -293,11 +388,6 @@ collection_mutation::operator collection_mutation_view() const {
return { data };
}
namespace db {
template<typename T>
class serializer;
}
class column_definition;
int compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);

View File

@@ -26,16 +26,17 @@
#include "types.hh"
#include "atomic_cell.hh"
#include "hashing.hh"
#include "counters.hh"
template<>
struct appending_hash<collection_mutation_view> {
template<typename Hasher>
void operator()(Hasher& h, collection_mutation_view cell) const {
void operator()(Hasher& h, collection_mutation_view cell, const column_definition& cdef) const {
auto m_view = collection_type_impl::deserialize_mutation_form(cell);
::feed_hash(h, m_view.tomb);
for (auto&& key_and_value : m_view.cells) {
::feed_hash(h, key_and_value.first);
::feed_hash(h, key_and_value.second);
::feed_hash(h, key_and_value.second, cdef);
}
}
};
@@ -43,10 +44,14 @@ struct appending_hash<collection_mutation_view> {
template<>
struct appending_hash<atomic_cell_view> {
template<typename Hasher>
void operator()(Hasher& h, atomic_cell_view cell) const {
void operator()(Hasher& h, atomic_cell_view cell, const column_definition& cdef) const {
feed_hash(h, cell.is_live());
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
::feed_hash(h, counter_cell_view(cell));
return;
}
if (cell.is_live_and_has_ttl()) {
feed_hash(h, cell.expiry());
feed_hash(h, cell.ttl());
@@ -61,15 +66,15 @@ struct appending_hash<atomic_cell_view> {
template<>
struct appending_hash<atomic_cell> {
template<typename Hasher>
void operator()(Hasher& h, const atomic_cell& cell) const {
feed_hash(h, static_cast<atomic_cell_view>(cell));
void operator()(Hasher& h, const atomic_cell& cell, const column_definition& cdef) const {
feed_hash(h, static_cast<atomic_cell_view>(cell), cdef);
}
};
template<>
struct appending_hash<collection_mutation> {
template<typename Hasher>
void operator()(Hasher& h, const collection_mutation& cm) const {
feed_hash(h, static_cast<collection_mutation_view>(cm));
void operator()(Hasher& h, const collection_mutation& cm, const column_definition& cdef) const {
feed_hash(h, static_cast<collection_mutation_view>(cm), cdef);
}
};

View File

@@ -39,10 +39,14 @@ public:
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell() const { return atomic_cell_view::from_bytes(_data); }
atomic_cell_ref as_atomic_cell_ref() { return { _data }; }
atomic_cell_mutable_view as_mutable_atomic_cell() { return atomic_cell_mutable_view::from_bytes(_data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm.data)) {}
explicit operator bool() const {
return !_data.empty();
}
bool can_use_mutable_view() const {
return !_data.is_fragmented();
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) {
return std::move(data.data);
}
@@ -58,13 +62,13 @@ public:
template<typename Hasher>
void feed_hash(Hasher& h, const column_definition& def) const {
if (def.is_atomic()) {
::feed_hash(h, as_atomic_cell());
::feed_hash(h, as_atomic_cell(), def);
} else {
::feed_hash(as_collection_mutation(), h, def.type);
::feed_hash(h, as_collection_mutation(), def);
}
}
size_t memory_usage() const {
return _data.memory_usage();
size_t external_memory_usage() const {
return _data.external_memory_usage();
}
friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);
};

View File

@@ -0,0 +1,41 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
namespace auth {
const sstring& allow_all_authenticator_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthenticator";
return name;
}
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
allow_all_authenticator,
cql3::query_processor&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}

View File

@@ -0,0 +1,97 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <stdexcept>
#include "auth/authenticator.hh"
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
namespace cql3 {
class query_processor;
}
namespace service {
class migration_manager;
}
namespace auth {
const sstring& allow_all_authenticator_name();
class allow_all_authenticator final : public authenticator {
public:
allow_all_authenticator(cql3::query_processor&, ::service::migration_manager&) {
}
future<> start() override {
return make_ready_future<>();
}
future<> stop() override {
return make_ready_future<>();
}
const sstring& qualified_java_name() const override {
return allow_all_authenticator_name();
}
bool require_authentication() const override {
return false;
}
option_set supported_options() const override {
return option_set();
}
option_set alterable_options() const override {
return option_set();
}
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override {
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
}
future<> create(sstring username, const option_map& options) override {
return make_ready_future();
}
future<> alter(sstring username, const option_map& options) override {
return make_ready_future();
}
future<> drop(sstring username) override {
return make_ready_future();
}
const resource_ids& protected_resources() const override {
static const resource_ids ids;
return ids;
}
::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
throw std::runtime_error("Should not reach");
}
};
}

View File

@@ -0,0 +1,41 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "utils/class_registrator.hh"
namespace auth {
const sstring& allow_all_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthorizer";
return name;
}
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authorizer,
allow_all_authorizer,
cql3::query_processor&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthorizer");
}

View File

@@ -0,0 +1,98 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "authorizer.hh"
#include "exceptions/exceptions.hh"
#include "stdx.hh"
namespace cql3 {
class query_processor;
}
namespace service {
class migration_manager;
}
namespace auth {
class service;
const sstring& allow_all_authorizer_name();
class allow_all_authorizer final : public authorizer {
public:
allow_all_authorizer(cql3::query_processor&, ::service::migration_manager&) {
}
future<> start() override {
return make_ready_future<>();
}
future<> stop() override {
return make_ready_future<>();
}
const sstring& qualified_java_name() const override {
return allow_all_authorizer_name();
}
future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {
throw exceptions::invalid_request_exception("GRANT operation is not supported by AllowAllAuthorizer");
}
future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override {
throw exceptions::invalid_request_exception("REVOKE operation is not supported by AllowAllAuthorizer");
}
future<std::vector<permission_details>> list(
service&,
::shared_ptr<authenticated_user> performer,
permission_set,
stdx::optional<data_resource>,
stdx::optional<sstring>) const override {
throw exceptions::invalid_request_exception("LIST PERMISSIONS operation is not supported by AllowAllAuthorizer");
}
future<> revoke_all(sstring dropped_user) override {
return make_ready_future();
}
future<> revoke_all(data_resource) override {
return make_ready_future();
}
const resource_ids& protected_resources() override {
static const resource_ids ids;
return ids;
}
future<> validate_configuration() const override {
return make_ready_future();
}
};
}

View File

@@ -1,383 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <seastar/core/sleep.hh>
#include <seastar/core/distributed.hh>
#include "auth.hh"
#include "authenticator.hh"
#include "authorizer.hh"
#include "database.hh"
#include "cql3/query_processor.hh"
#include "cql3/statements/raw/cf_statement.hh"
#include "cql3/statements/create_table_statement.hh"
#include "db/config.hh"
#include "service/migration_manager.hh"
#include "utils/loading_cache.hh"
#include "utils/hash.hh"
const sstring auth::auth::DEFAULT_SUPERUSER_NAME("cassandra");
const sstring auth::auth::AUTH_KS("system_auth");
const sstring auth::auth::USERS_CF("users");
static const sstring USER_NAME("name");
static const sstring SUPER("super");
static logging::logger logger("auth");
// TODO: configurable
using namespace std::chrono_literals;
const std::chrono::milliseconds auth::auth::SUPERUSER_SETUP_DELAY = 10000ms;
class auth_migration_listener : public service::migration_listener {
void on_create_keyspace(const sstring& ks_name) override {}
void on_create_column_family(const sstring& ks_name, const sstring& cf_name) override {}
void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_create_function(const sstring& ks_name, const sstring& function_name) override {}
void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_keyspace(const sstring& ks_name) override {}
void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}
void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_drop_keyspace(const sstring& ks_name) override {
auth::authorizer::get().revoke_all(auth::data_resource(ks_name));
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
auth::authorizer::get().revoke_all(auth::data_resource(ks_name, cf_name));
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
};
static auth_migration_listener auth_migration;
namespace std {
template <>
struct hash<auth::data_resource> {
size_t operator()(const auth::data_resource & v) const {
return v.hash_value();
}
};
template <>
struct hash<auth::authenticated_user> {
size_t operator()(const auth::authenticated_user & v) const {
return utils::tuple_hash()(v.name(), v.is_anonymous());
}
};
}
class auth::auth::permissions_cache {
public:
typedef utils::loading_cache<std::pair<authenticated_user, data_resource>, permission_set, utils::tuple_hash> cache_type;
typedef typename cache_type::key_type key_type;
permissions_cache()
: permissions_cache(
cql3::get_local_query_processor().db().local().get_config()) {
}
permissions_cache(const db::config& cfg)
: _cache(cfg.permissions_cache_max_entries(), expiry(cfg),
std::chrono::milliseconds(
cfg.permissions_validity_in_ms()),
[](const key_type& k) {
logger.debug("Refreshing permissions for {}", k.first.name());
return authorizer::get().authorize(::make_shared<authenticated_user>(k.first), k.second);
}) {
}
static std::chrono::milliseconds expiry(const db::config& cfg) {
auto exp = cfg.permissions_update_interval_in_ms();
if (exp == 0 || exp == std::numeric_limits<uint32_t>::max()) {
exp = cfg.permissions_validity_in_ms();
}
return std::chrono::milliseconds(exp);
}
future<> stop() {
return make_ready_future<>();
}
future<permission_set> get(::shared_ptr<authenticated_user> user, data_resource resource) {
return _cache.get(key_type(*user, std::move(resource)));
}
private:
cache_type _cache;
};
static distributed<auth::auth::permissions_cache> perm_cache;
/**
* Poor mans job schedule. For maximum 2 jobs. Sic.
* Still does nothing more clever than waiting 10 seconds
* like origin, then runs the submitted tasks.
*
* Only difference compared to sleep (from which this
* borrows _heavily_) is that if tasks have not run by the time
* we exit (and do static clean up) we delete the promise + cont
*
* Should be abstracted to some sort of global server function
* probably.
*/
struct waiter {
promise<> done;
timer<> tmr;
waiter() : tmr([this] {done.set_value();})
{
tmr.arm(auth::auth::SUPERUSER_SETUP_DELAY);
}
~waiter() {
if (tmr.armed()) {
tmr.cancel();
done.set_exception(std::runtime_error("shutting down"));
}
logger.trace("Deleting scheduled task");
}
void kill() {
}
};
typedef std::unique_ptr<waiter> waiter_ptr;
static std::vector<waiter_ptr> & thread_waiters() {
static thread_local std::vector<waiter_ptr> the_waiters;
return the_waiters;
}
void auth::auth::schedule_when_up(scheduled_func f) {
logger.trace("Adding scheduled task");
auto & waiters = thread_waiters();
waiters.emplace_back(std::make_unique<waiter>());
auto* w = waiters.back().get();
w->done.get_future().finally([w] {
auto & waiters = thread_waiters();
auto i = std::find_if(waiters.begin(), waiters.end(), [w](const waiter_ptr& p) {
return p.get() == w;
});
if (i != waiters.end()) {
waiters.erase(i);
}
}).then([f = std::move(f)] {
logger.trace("Running scheduled task");
return f();
}).handle_exception([](auto ep) {
return make_ready_future();
});
}
bool auth::auth::is_class_type(const sstring& type, const sstring& classname) {
if (type == classname) {
return true;
}
auto i = classname.find_last_of('.');
return classname.compare(i + 1, sstring::npos, type) == 0;
}
future<> auth::auth::setup() {
auto& db = cql3::get_local_query_processor().db().local();
auto& cfg = db.get_config();
future<> f = perm_cache.start();
if (is_class_type(cfg.authenticator(),
authenticator::ALLOW_ALL_AUTHENTICATOR_NAME)
&& is_class_type(cfg.authorizer(),
authorizer::ALLOW_ALL_AUTHORIZER_NAME)
) {
// just create the objects
return f.then([&cfg] {
return authenticator::setup(cfg.authenticator());
}).then([&cfg] {
return authorizer::setup(cfg.authorizer());
});
}
if (!db.has_keyspace(AUTH_KS)) {
std::map<sstring, sstring> opts;
opts["replication_factor"] = "1";
auto ksm = keyspace_metadata::new_keyspace(AUTH_KS, "org.apache.cassandra.locator.SimpleStrategy", opts, true);
f = service::get_local_migration_manager().announce_new_keyspace(ksm, false);
}
return f.then([] {
return setup_table(USERS_CF, sprint("CREATE TABLE %s.%s (%s text, %s boolean, PRIMARY KEY(%s)) WITH gc_grace_seconds=%d",
AUTH_KS, USERS_CF, USER_NAME, SUPER, USER_NAME,
90 * 24 * 60 * 60)); // 3 months.
}).then([&cfg] {
return authenticator::setup(cfg.authenticator());
}).then([&cfg] {
return authorizer::setup(cfg.authorizer());
}).then([] {
service::get_local_migration_manager().register_listener(&auth_migration); // again, only one shard...
// instead of once-timer, just schedule this later
schedule_when_up([] {
// setup default super user
return has_existing_users(USERS_CF, DEFAULT_SUPERUSER_NAME, USER_NAME).then([](bool exists) {
if (!exists) {
auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
AUTH_KS, USERS_CF, USER_NAME, SUPER);
cql3::get_local_query_processor().process(query, db::consistency_level::ONE, {DEFAULT_SUPERUSER_NAME, true}).then([](auto) {
logger.info("Created default superuser '{}'", DEFAULT_SUPERUSER_NAME);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception&) {
logger.warn("Skipped default superuser setup: some nodes were not ready");
}
});
}
});
});
});
}
future<> auth::auth::shutdown() {
// just make sure we don't have pending tasks.
// this is mostly relevant for test cases where
// db-env-shutdown != process shutdown
return smp::invoke_on_all([] {
thread_waiters().clear();
}).then([] {
return perm_cache.stop();
});
}
future<auth::permission_set> auth::auth::get_permissions(::shared_ptr<authenticated_user> user, data_resource resource) {
return perm_cache.local().get(std::move(user), std::move(resource));
}
static db::consistency_level consistency_for_user(const sstring& username) {
if (username == auth::auth::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
static future<::shared_ptr<cql3::untyped_result_set>> select_user(const sstring& username) {
// Here was a thread local, explicit cache of prepared statement. In normal execution this is
// fine, but since we in testing set up and tear down system over and over, we'd start using
// obsolete prepared statements pretty quickly.
// Rely on query processing caching statements instead, and lets assume
// that a map lookup string->statement is not gonna kill us much.
return cql3::get_local_query_processor().process(
sprint("SELECT * FROM %s.%s WHERE %s = ?",
auth::auth::AUTH_KS, auth::auth::USERS_CF,
USER_NAME), consistency_for_user(username),
{ username }, true);
}
future<bool> auth::auth::is_existing_user(const sstring& username) {
return select_user(username).then(
[](::shared_ptr<cql3::untyped_result_set> res) {
return make_ready_future<bool>(!res->empty());
});
}
future<bool> auth::auth::is_super_user(const sstring& username) {
return select_user(username).then(
[](::shared_ptr<cql3::untyped_result_set> res) {
return make_ready_future<bool>(!res->empty() && res->one().get_as<bool>(SUPER));
});
}
future<> auth::auth::insert_user(const sstring& username, bool is_super)
throw (exceptions::request_execution_exception) {
return cql3::get_local_query_processor().process(sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
AUTH_KS, USERS_CF, USER_NAME, SUPER),
consistency_for_user(username), { username, is_super }).discard_result();
}
future<> auth::auth::delete_user(const sstring& username) throw(exceptions::request_execution_exception) {
return cql3::get_local_query_processor().process(sprint("DELETE FROM %s.%s WHERE %s = ?",
AUTH_KS, USERS_CF, USER_NAME),
consistency_for_user(username), { username }).discard_result();
}
future<> auth::auth::setup_table(const sstring& name, const sstring& cql) {
auto& qp = cql3::get_local_query_processor();
auto& db = qp.db().local();
if (db.has_schema(AUTH_KS, name)) {
return make_ready_future();
}
::shared_ptr<cql3::statements::raw::cf_statement> parsed = static_pointer_cast<
cql3::statements::raw::cf_statement>(cql3::query_processor::parse_statement(cql));
parsed->prepare_keyspace(AUTH_KS);
::shared_ptr<cql3::statements::create_table_statement> statement =
static_pointer_cast<cql3::statements::create_table_statement>(
parsed->prepare(db)->statement);
auto schema = statement->get_cf_meta_data();
auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());
schema_builder b(schema);
b.set_uuid(uuid);
return service::get_local_migration_manager().announce_new_column_family(b.build(), false);
}
future<bool> auth::auth::has_existing_users(const sstring& cfname, const sstring& def_user_name, const sstring& name_column) {
auto default_user_query = sprint("SELECT * FROM %s.%s WHERE %s = ?", AUTH_KS, cfname, name_column);
auto all_users_query = sprint("SELECT * FROM %s.%s LIMIT 1", AUTH_KS, cfname);
return cql3::get_local_query_processor().process(default_user_query, db::consistency_level::ONE, { def_user_name }).then([=](::shared_ptr<cql3::untyped_result_set> res) {
if (!res->empty()) {
return make_ready_future<bool>(true);
}
return cql3::get_local_query_processor().process(default_user_query, db::consistency_level::QUORUM, { def_user_name }).then([all_users_query](::shared_ptr<cql3::untyped_result_set> res) {
if (!res->empty()) {
return make_ready_future<bool>(true);
}
return cql3::get_local_query_processor().process(all_users_query, db::consistency_level::QUORUM).then([](::shared_ptr<cql3::untyped_result_set> res) {
return make_ready_future<bool>(!res->empty());
});
});
});
}

View File

@@ -1,124 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2016 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <chrono>
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include "exceptions/exceptions.hh"
#include "permission.hh"
#include "data_resource.hh"
namespace auth {
class authenticated_user;
class auth {
public:
class permissions_cache;
static const sstring DEFAULT_SUPERUSER_NAME;
static const sstring AUTH_KS;
static const sstring USERS_CF;
static const std::chrono::milliseconds SUPERUSER_SETUP_DELAY;
static bool is_class_type(const sstring& type, const sstring& classname);
static future<permission_set> get_permissions(::shared_ptr<authenticated_user>, data_resource);
/**
* Checks if the username is stored in AUTH_KS.USERS_CF.
*
* @param username Username to query.
* @return whether or not Cassandra knows about the user.
*/
static future<bool> is_existing_user(const sstring& username);
/**
* Checks if the user is a known superuser.
*
* @param username Username to query.
* @return true is the user is a superuser, false if they aren't or don't exist at all.
*/
static future<bool> is_super_user(const sstring& username);
/**
* Inserts the user into AUTH_KS.USERS_CF (or overwrites their superuser status as a result of an ALTER USER query).
*
* @param username Username to insert.
* @param isSuper User's new status.
* @throws RequestExecutionException
*/
static future<> insert_user(const sstring& username, bool is_super) throw(exceptions::request_execution_exception);
/**
* Deletes the user from AUTH_KS.USERS_CF.
*
* @param username Username to delete.
* @throws RequestExecutionException
*/
static future<> delete_user(const sstring& username) throw(exceptions::request_execution_exception);
/**
* Sets up Authenticator and Authorizer.
*/
static future<> setup();
static future<> shutdown();
/**
* Set up table from given CREATE TABLE statement under system_auth keyspace, if not already done so.
*
* @param name name of the table
* @param cql CREATE TABLE statement
*/
static future<> setup_table(const sstring& name, const sstring& cql);
static future<bool> has_existing_users(const sstring& cfname, const sstring& def_user_name, const sstring& name_column_name);
// For internal use. Run function "when system is up".
typedef std::function<future<>()> scheduled_func;
static void schedule_when_up(scheduled_func);
};
}

View File

@@ -41,7 +41,6 @@
#include "authenticated_user.hh"
#include "auth.hh"
const sstring auth::authenticated_user::ANONYMOUS_USERNAME("anonymous");
@@ -60,13 +59,6 @@ const sstring& auth::authenticated_user::name() const {
return _anon ? ANONYMOUS_USERNAME : _name;
}
future<bool> auth::authenticated_user::is_super() const {
if (is_anonymous()) {
return make_ready_future<bool>(false);
}
return auth::auth::is_super_user(_name);
}
bool auth::authenticated_user::operator==(const authenticated_user& v) const {
return _anon ? v._anon : _name == v._name;
}

View File

@@ -43,6 +43,7 @@
#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include "seastarx.hh"
namespace auth {
@@ -57,14 +58,6 @@ public:
const sstring& name() const;
/**
* Checks the user's superuser status.
* Only a superuser is allowed to perform CREATE USER and DROP USER queries.
* Im most cased, though not necessarily, a superuser will have Permission.ALL on every resource
* (depends on IAuthorizer implementation).
*/
future<bool> is_super() const;
/**
* If IAuthenticator doesn't require authentication, this method may return true.
*/

View File

@@ -41,13 +41,14 @@
#include "authenticator.hh"
#include "authenticated_user.hh"
#include "common.hh"
#include "password_authenticator.hh"
#include "auth.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
#include "utils/class_registrator.hh"
const sstring auth::authenticator::USERNAME_KEY("username");
const sstring auth::authenticator::PASSWORD_KEY("password");
const sstring auth::authenticator::ALLOW_ALL_AUTHENTICATOR_NAME("org.apache.cassandra.auth.AllowAllAuthenticator");
auth::authenticator::option auth::authenticator::string_to_option(const sstring& name) {
if (strcasecmp(name.c_str(), "password") == 0) {
@@ -64,64 +65,3 @@ sstring auth::authenticator::option_to_string(option opt) {
throw std::invalid_argument(sprint("Unknown option {}", opt));
}
}
/**
* Authenticator is assumed to be a fully state-less immutable object (note all the const).
* We thus store a single instance globally, since it should be safe/ok.
*/
static std::unique_ptr<auth::authenticator> global_authenticator;
future<>
auth::authenticator::setup(const sstring& type) throw (exceptions::configuration_exception) {
if (auth::auth::is_class_type(type, ALLOW_ALL_AUTHENTICATOR_NAME)) {
class allow_all_authenticator : public authenticator {
public:
const sstring& class_name() const override {
return ALLOW_ALL_AUTHENTICATOR_NAME;
}
bool require_authentication() const override {
return false;
}
option_set supported_options() const override {
return option_set();
}
option_set alterable_options() const override {
return option_set();
}
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) override {
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
}
future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {
return make_ready_future();
}
future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {
return make_ready_future();
}
future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override {
return make_ready_future();
}
const resource_ids& protected_resources() const override {
static const resource_ids ids;
return ids;
}
::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
throw std::runtime_error("Should not reach");
}
};
global_authenticator = std::make_unique<allow_all_authenticator>();
} else if (auth::auth::is_class_type(type, password_authenticator::PASSWORD_AUTHENTICATOR_NAME)) {
auto pwa = std::make_unique<password_authenticator>();
auto f = pwa->init();
return f.then([pwa = std::move(pwa)]() mutable {
global_authenticator = std::move(pwa);
});
} else {
throw exceptions::configuration_exception("Invalid authenticator type: " + type);
}
return make_ready_future();
}
auth::authenticator& auth::authenticator::get() {
assert(global_authenticator);
return *global_authenticator;
}

View File

@@ -69,7 +69,6 @@ class authenticator {
public:
static const sstring USERNAME_KEY;
static const sstring PASSWORD_KEY;
static const sstring ALLOW_ALL_AUTHENTICATOR_NAME;
/**
* Supported CREATE USER/ALTER USER options.
@@ -86,23 +85,14 @@ public:
using option_map = std::unordered_map<option, boost::any, enum_hash<option>>;
using credentials_map = std::unordered_map<sstring, sstring>;
/**
* Setup is called once upon system startup to initialize the IAuthenticator.
*
* For example, use this method to create any required keyspaces/column families.
* Note: Only call from main thread.
*/
static future<> setup(const sstring& type) throw(exceptions::configuration_exception);
/**
* Returns the system authenticator. Must have called setup before calling this.
*/
static authenticator& get();
virtual ~authenticator()
{}
virtual const sstring& class_name() const = 0;
virtual future<> start() = 0;
virtual future<> stop() = 0;
virtual const sstring& qualified_java_name() const = 0;
/**
* Whether or not the authenticator requires explicit login.
@@ -129,7 +119,7 @@ public:
*
* @throws authentication_exception if credentials don't match any known user.
*/
virtual future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) = 0;
virtual future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const = 0;
/**
* Called during execution of CREATE USER query (also may be called on startup, see seedSuperuserOptions method).
@@ -141,7 +131,7 @@ public:
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;
virtual future<> create(sstring username, const option_map& options) = 0;
/**
* Called during execution of ALTER USER query.
@@ -154,7 +144,7 @@ public:
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;
virtual future<> alter(sstring username, const option_map& options) = 0;
/**
@@ -164,7 +154,7 @@ public:
* @throws exceptions::request_validation_exception
* @throws exceptions::request_execution_exception
*/
virtual future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) = 0;
virtual future<> drop(sstring username) = 0;
/**
* Set of resources that should be made inaccessible to users and only accessible internally.
@@ -177,9 +167,9 @@ public:
class sasl_challenge {
public:
virtual ~sasl_challenge() {}
virtual bytes evaluate_response(bytes_view client_response) throw(exceptions::authentication_exception) = 0;
virtual bytes evaluate_response(bytes_view client_response) = 0;
virtual bool is_complete() const = 0;
virtual future<::shared_ptr<authenticated_user>> get_authenticated_user() const throw(exceptions::authentication_exception) = 0;
virtual future<::shared_ptr<authenticated_user>> get_authenticated_user() const = 0;
};
/**

View File

@@ -41,23 +41,39 @@
#include "authorizer.hh"
#include "authenticated_user.hh"
#include "common.hh"
#include "default_authorizer.hh"
#include "auth.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
#include "utils/class_registrator.hh"
const sstring auth::authorizer::ALLOW_ALL_AUTHORIZER_NAME("org.apache.cassandra.auth.AllowAllAuthorizer");
const sstring& auth::allow_all_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthorizer";
return name;
}
/**
* Authenticator is assumed to be a fully state-less immutable object (note all the const).
* We thus store a single instance globally, since it should be safe/ok.
*/
static std::unique_ptr<auth::authorizer> global_authorizer;
using authorizer_registry = class_registry<auth::authorizer, cql3::query_processor&>;
future<>
auth::authorizer::setup(const sstring& type) {
if (auth::auth::is_class_type(type, ALLOW_ALL_AUTHORIZER_NAME)) {
if (type == allow_all_authorizer_name()) {
class allow_all_authorizer : public authorizer {
public:
future<> start() override {
return make_ready_future<>();
}
future<> stop() override {
return make_ready_future<>();
}
const sstring& qualified_java_name() const override {
return allow_all_authorizer_name();
}
future<permission_set> authorize(::shared_ptr<authenticated_user>, data_resource) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
@@ -86,16 +102,14 @@ auth::authorizer::setup(const sstring& type) {
};
global_authorizer = std::make_unique<allow_all_authorizer>();
} else if (auth::auth::is_class_type(type, default_authorizer::DEFAULT_AUTHORIZER_NAME)) {
auto da = std::make_unique<default_authorizer>();
auto f = da->init();
return f.then([da = std::move(da)]() mutable {
global_authorizer = std::move(da);
});
return make_ready_future();
} else {
throw exceptions::configuration_exception("Invalid authorizer type: " + type);
auto a = authorizer_registry::create(type, cql3::get_local_query_processor());
auto f = a->start();
return f.then([a = std::move(a)]() mutable {
global_authorizer = std::move(a);
});
}
return make_ready_future();
}
auth::authorizer& auth::authorizer::get() {

View File

@@ -51,8 +51,12 @@
#include "permission.hh"
#include "data_resource.hh"
#include "seastarx.hh"
namespace auth {
class service;
class authenticated_user;
struct permission_details {
@@ -69,10 +73,14 @@ using std::experimental::optional;
class authorizer {
public:
static const sstring ALLOW_ALL_AUTHORIZER_NAME;
virtual ~authorizer() {}
virtual future<> start() = 0;
virtual future<> stop() = 0;
virtual const sstring& qualified_java_name() const = 0;
/**
* The primary Authorizer method. Returns a set of permissions of a user on a resource.
*
@@ -80,7 +88,7 @@ public:
* @param resource Resource for which the authorization is being requested. @see DataResource.
* @return Set of permissions of the user on the resource. Should never return empty. Use permission.NONE instead.
*/
virtual future<permission_set> authorize(::shared_ptr<authenticated_user>, data_resource) const = 0;
virtual future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const = 0;
/**
* Grants a set of permissions on a resource to a user.
@@ -124,7 +132,7 @@ public:
* @throws RequestValidationException
* @throws RequestExecutionException
*/
virtual future<std::vector<permission_details>> list(::shared_ptr<authenticated_user> performer, permission_set, optional<data_resource>, optional<sstring>) const = 0;
virtual future<std::vector<permission_details>> list(service&, ::shared_ptr<authenticated_user> performer, permission_set, optional<data_resource>, optional<sstring>) const = 0;
/**
* This method is called before deleting a user with DROP USER query so that a new user with the same
@@ -154,18 +162,6 @@ public:
* @throws ConfigurationException when there is a configuration error.
*/
virtual future<> validate_configuration() const = 0;
/**
* Setup is called once upon system startup to initialize the IAuthorizer.
*
* For example, use this method to create any required keyspaces/column families.
*/
static future<> setup(const sstring& type);
/**
* Returns the system authorizer. Must have called setup before calling this.
*/
static authorizer& get();
};
}

70
auth/common.cc Normal file
View File

@@ -0,0 +1,70 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/common.hh"
#include <seastar/core/shared_ptr.hh>
#include "cql3/query_processor.hh"
#include "cql3/statements/create_table_statement.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
namespace auth {
namespace meta {
const sstring DEFAULT_SUPERUSER_NAME("cassandra");
const sstring AUTH_KS("system_auth");
const sstring USERS_CF("users");
const sstring AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
}
future<> create_metadata_table_if_missing(
const sstring& table_name,
cql3::query_processor& qp,
const sstring& cql,
::service::migration_manager& mm) {
auto& db = qp.db().local();
if (db.has_schema(meta::AUTH_KS, table_name)) {
return make_ready_future<>();
}
auto parsed_statement = static_pointer_cast<cql3::statements::raw::cf_statement>(
cql3::query_processor::parse_statement(cql));
parsed_statement->prepare_keyspace(meta::AUTH_KS);
auto statement = static_pointer_cast<cql3::statements::create_table_statement>(
parsed_statement->prepare(db, qp.get_cql_stats())->statement);
const auto schema = statement->get_cf_meta_data();
const auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());
schema_builder b(schema);
b.set_uuid(uuid);
return mm.announce_new_column_family(b.build(), false);
}
}

74
auth/common.hh Normal file
View File

@@ -0,0 +1,74 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <chrono>
#include <seastar/core/future.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/resource.hh>
#include <seastar/core/sstring.hh>
#include "delayed_tasks.hh"
#include "seastarx.hh"
namespace service {
class migration_manager;
}
namespace cql3 {
class query_processor;
}
namespace auth {
namespace meta {
extern const sstring DEFAULT_SUPERUSER_NAME;
extern const sstring AUTH_KS;
extern const sstring USERS_CF;
extern const sstring AUTH_PACKAGE_NAME;
}
template <class Task>
future<> once_among_shards(Task&& f) {
if (engine().cpu_id() == 0u) {
return f();
}
return make_ready_future<>();
}
template <class Task, class Clock>
void delay_until_system_ready(delayed_tasks<Clock>& ts, Task&& f) {
static const typename std::chrono::milliseconds delay_duration(10000);
ts.schedule_after(delay_duration, std::forward<Task>(f));
}
future<> create_metadata_table_if_missing(
const sstring& table_name,
cql3::query_processor&,
const sstring& cql,
::service::migration_manager&);
}

View File

@@ -47,11 +47,8 @@
const sstring auth::data_resource::ROOT_NAME("data");
auth::data_resource::data_resource(level l, const sstring& ks, const sstring& cf)
: _ks(ks), _cf(cf)
: _level(l), _ks(ks), _cf(cf)
{
if (l != get_level()) {
throw std::invalid_argument("level/keyspace/column mismatch");
}
}
auth::data_resource::data_resource()
@@ -67,14 +64,7 @@ auth::data_resource::data_resource(const sstring& ks, const sstring& cf)
{}
auth::data_resource::level auth::data_resource::get_level() const {
if (!_cf.empty()) {
assert(!_ks.empty());
return level::COLUMN_FAMILY;
}
if (!_ks.empty()) {
return level::KEYSPACE;
}
return level::ROOT;
return _level;
}
auth::data_resource auth::data_resource::from_name(
@@ -125,16 +115,14 @@ auth::data_resource auth::data_resource::get_parent() const {
}
}
const sstring& auth::data_resource::keyspace() const
throw (std::invalid_argument) {
const sstring& auth::data_resource::keyspace() const {
if (is_root_level()) {
throw std::invalid_argument("ROOT data resource has no keyspace");
}
return _ks;
}
const sstring& auth::data_resource::column_family() const
throw (std::invalid_argument) {
const sstring& auth::data_resource::column_family() const {
if (!is_column_family_level()) {
throw std::invalid_argument(sprint("%s data resource has no column family", name()));
}

View File

@@ -45,6 +45,7 @@
#include <iosfwd>
#include <set>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace auth {
@@ -56,6 +57,7 @@ private:
static const sstring ROOT_NAME;
level _level;
sstring _ks;
sstring _cf;
@@ -116,13 +118,13 @@ public:
* @return keyspace of the resource.
* @throws std::invalid_argument if it's the root-level resource.
*/
const sstring& keyspace() const throw(std::invalid_argument);
const sstring& keyspace() const;
/**
* @return column family of the resource.
* @throws std::invalid_argument if it's not a cf-level resource.
*/
const sstring& column_family() const throw(std::invalid_argument);
const sstring& column_family() const;
/**
* @return Whether or not the resource has a parent in the hierarchy.

View File

@@ -46,46 +46,68 @@
#include <seastar/core/reactor.hh>
#include "auth.hh"
#include "common.hh"
#include "default_authorizer.hh"
#include "authenticated_user.hh"
#include "permission.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
const sstring auth::default_authorizer::DEFAULT_AUTHORIZER_NAME(
"org.apache.cassandra.auth.CassandraAuthorizer");
const sstring& auth::default_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "CassandraAuthorizer";
return name;
}
static const sstring USER_NAME = "username";
static const sstring RESOURCE_NAME = "resource";
static const sstring PERMISSIONS_NAME = "permissions";
static const sstring PERMISSIONS_CF = "permissions";
static logging::logger logger("default_authorizer");
static logging::logger alogger("default_authorizer");
auth::default_authorizer::default_authorizer() {
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
auth::authorizer,
auth::default_authorizer,
cql3::query_processor&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.CassandraAuthorizer");
auth::default_authorizer::default_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm) {
}
auth::default_authorizer::~default_authorizer() {
}
future<> auth::default_authorizer::init() {
sstring create_table = sprint("CREATE TABLE %s.%s ("
future<> auth::default_authorizer::start() {
static const sstring create_table = sprint("CREATE TABLE %s.%s ("
"%s text,"
"%s text,"
"%s set<text>,"
"PRIMARY KEY(%s, %s)"
") WITH gc_grace_seconds=%d", auth::auth::AUTH_KS,
") WITH gc_grace_seconds=%d", meta::AUTH_KS,
PERMISSIONS_CF, USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME,
USER_NAME, RESOURCE_NAME, 90 * 24 * 60 * 60); // 3 months.
return auth::setup_table(PERMISSIONS_CF, create_table);
return auth::once_among_shards([this] {
return auth::create_metadata_table_if_missing(
PERMISSIONS_CF,
_qp,
create_table,
_migration_manager);
});
}
future<> auth::default_authorizer::stop() {
return make_ready_future<>();
}
future<auth::permission_set> auth::default_authorizer::authorize(
::shared_ptr<authenticated_user> user, data_resource resource) const {
return user->is_super().then([this, user, resource = std::move(resource)](bool is_super) {
service& ser, ::shared_ptr<authenticated_user> user, data_resource resource) const {
return auth::is_super_user(ser, *user).then([this, user, resource = std::move(resource)](bool is_super) {
if (is_super) {
return make_ready_future<permission_set>(permissions::ALL);
}
@@ -94,10 +116,9 @@ future<auth::permission_set> auth::default_authorizer::authorize(
* TOOD: could create actual data type for permission (translating string<->perm),
* but this seems overkill right now. We still must store strings so...
*/
auto& qp = cql3::get_local_query_processor();
auto query = sprint("SELECT %s FROM %s.%s WHERE %s = ? AND %s = ?"
, PERMISSIONS_NAME, auth::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);
return qp.process(query, db::consistency_level::LOCAL_ONE, {user->name(), resource.name() })
, PERMISSIONS_NAME, meta::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);
return _qp.process(query, db::consistency_level::LOCAL_ONE, {user->name(), resource.name() })
.then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
@@ -107,7 +128,7 @@ future<auth::permission_set> auth::default_authorizer::authorize(
}
return make_ready_future<permission_set>(permissions::from_strings(res->one().get_set<sstring>(PERMISSIONS_NAME)));
} catch (exceptions::request_execution_exception& e) {
logger.warn("CassandraAuthorizer failed to authorize {} for {}", user->name(), resource);
alogger.warn("CassandraAuthorizer failed to authorize {} for {}", user->name(), resource);
return make_ready_future<permission_set>(permissions::NONE);
}
});
@@ -120,11 +141,10 @@ future<> auth::default_authorizer::modify(
::shared_ptr<authenticated_user> performer, permission_set set,
data_resource resource, sstring user, sstring op) {
// TODO: why does this not check super user?
auto& qp = cql3::get_local_query_processor();
auto query = sprint("UPDATE %s.%s SET %s = %s %s ? WHERE %s = ? AND %s = ?",
auth::AUTH_KS, PERMISSIONS_CF, PERMISSIONS_NAME,
meta::AUTH_KS, PERMISSIONS_CF, PERMISSIONS_NAME,
PERMISSIONS_NAME, op, USER_NAME, RESOURCE_NAME);
return qp.process(query, db::consistency_level::ONE, {
return _qp.process(query, db::consistency_level::ONE, {
permissions::to_strings(set), user, resource.name() }).discard_result();
}
@@ -142,15 +162,14 @@ future<> auth::default_authorizer::revoke(
}
future<std::vector<auth::permission_details>> auth::default_authorizer::list(
::shared_ptr<authenticated_user> performer, permission_set set,
service& ser, ::shared_ptr<authenticated_user> performer, permission_set set,
optional<data_resource> resource, optional<sstring> user) const {
return performer->is_super().then([this, performer, set = std::move(set), resource = std::move(resource), user = std::move(user)](bool is_super) {
return auth::is_super_user(ser, *performer).then([this, performer, set = std::move(set), resource = std::move(resource), user = std::move(user)](bool is_super) {
if (!is_super && (!user || performer->name() != *user)) {
throw exceptions::unauthorized_exception(sprint("You are not authorized to view %s's permissions", user ? *user : "everyone"));
}
auto query = sprint("SELECT %s, %s, %s FROM %s.%s", USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME, auth::AUTH_KS, PERMISSIONS_CF);
auto& qp = cql3::get_local_query_processor();
auto query = sprint("SELECT %s, %s, %s FROM %s.%s", USER_NAME, RESOURCE_NAME, PERMISSIONS_NAME, meta::AUTH_KS, PERMISSIONS_CF);
// Oh, look, it is a case where it does not pay off to have
// parameters to process in an initializer list.
@@ -158,15 +177,15 @@ future<std::vector<auth::permission_details>> auth::default_authorizer::list(
if (resource && user) {
query += sprint(" WHERE %s = ? AND %s = ?", USER_NAME, RESOURCE_NAME);
f = qp.process(query, db::consistency_level::ONE, {*user, resource->name()});
f = _qp.process(query, db::consistency_level::ONE, {*user, resource->name()});
} else if (resource) {
query += sprint(" WHERE %s = ? ALLOW FILTERING", RESOURCE_NAME);
f = qp.process(query, db::consistency_level::ONE, {resource->name()});
f = _qp.process(query, db::consistency_level::ONE, {resource->name()});
} else if (user) {
query += sprint(" WHERE %s = ?", USER_NAME);
f = qp.process(query, db::consistency_level::ONE, {*user});
f = _qp.process(query, db::consistency_level::ONE, {*user});
} else {
f = qp.process(query, db::consistency_level::ONE, {});
f = _qp.process(query, db::consistency_level::ONE, {});
}
return f.then([set](::shared_ptr<cql3::untyped_result_set> res) {
@@ -188,42 +207,40 @@ future<std::vector<auth::permission_details>> auth::default_authorizer::list(
}
future<> auth::default_authorizer::revoke_all(sstring dropped_user) {
auto& qp = cql3::get_local_query_processor();
auto query = sprint("DELETE FROM %s.%s WHERE %s = ?", auth::AUTH_KS,
auto query = sprint("DELETE FROM %s.%s WHERE %s = ?", meta::AUTH_KS,
PERMISSIONS_CF, USER_NAME);
return qp.process(query, db::consistency_level::ONE, { dropped_user }).discard_result().handle_exception(
return _qp.process(query, db::consistency_level::ONE, { dropped_user }).discard_result().handle_exception(
[dropped_user](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
logger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", dropped_user, e);
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", dropped_user, e);
}
});
}
future<> auth::default_authorizer::revoke_all(data_resource resource) {
auto& qp = cql3::get_local_query_processor();
auto query = sprint("SELECT %s FROM %s.%s WHERE %s = ? ALLOW FILTERING",
USER_NAME, auth::AUTH_KS, PERMISSIONS_CF, RESOURCE_NAME);
return qp.process(query, db::consistency_level::LOCAL_ONE, { resource.name() })
.then_wrapped([resource, &qp](future<::shared_ptr<cql3::untyped_result_set>> f) {
USER_NAME, meta::AUTH_KS, PERMISSIONS_CF, RESOURCE_NAME);
return _qp.process(query, db::consistency_level::LOCAL_ONE, { resource.name() })
.then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
return parallel_for_each(res->begin(), res->end(), [&qp, res, resource](const cql3::untyped_result_set::row& r) {
return parallel_for_each(res->begin(), res->end(), [this, res, resource](const cql3::untyped_result_set::row& r) {
auto query = sprint("DELETE FROM %s.%s WHERE %s = ? AND %s = ?"
, auth::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);
return qp.process(query, db::consistency_level::LOCAL_ONE, { r.get_as<sstring>(USER_NAME), resource.name() })
, meta::AUTH_KS, PERMISSIONS_CF, USER_NAME, RESOURCE_NAME);
return _qp.process(query, db::consistency_level::LOCAL_ONE, { r.get_as<sstring>(USER_NAME), resource.name() })
.discard_result().handle_exception([resource](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
logger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
}
});
});
} catch (exceptions::request_execution_exception& e) {
logger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
return make_ready_future();
}
});
@@ -231,7 +248,7 @@ future<> auth::default_authorizer::revoke_all(data_resource resource) {
const auth::resource_ids& auth::default_authorizer::protected_resources() {
static const resource_ids ids({ data_resource(auth::AUTH_KS, PERMISSIONS_CF) });
static const resource_ids ids({ data_resource(meta::AUTH_KS, PERMISSIONS_CF) });
return ids;
}

View File

@@ -41,26 +41,40 @@
#pragma once
#include <functional>
#include "authorizer.hh"
#include "cql3/query_processor.hh"
#include "service/migration_manager.hh"
namespace auth {
class default_authorizer : public authorizer {
public:
static const sstring DEFAULT_AUTHORIZER_NAME;
const sstring& default_authorizer_name();
default_authorizer();
class default_authorizer : public authorizer {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
public:
default_authorizer(cql3::query_processor&, ::service::migration_manager&);
~default_authorizer();
future<> init();
future<> start() override;
future<permission_set> authorize(::shared_ptr<authenticated_user>, data_resource) const override;
future<> stop() override;
const sstring& qualified_java_name() const override {
return default_authorizer_name();
}
future<permission_set> authorize(service&, ::shared_ptr<authenticated_user>, data_resource) const override;
future<> grant(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override;
future<> revoke(::shared_ptr<authenticated_user>, permission_set, data_resource, sstring) override;
future<std::vector<permission_details>> list(::shared_ptr<authenticated_user>, permission_set, optional<data_resource>, optional<sstring>) const override;
future<std::vector<permission_details>> list(service&, ::shared_ptr<authenticated_user>, permission_set, optional<data_resource>, optional<sstring>) const override;
future<> revoke_all(sstring) override;

View File

@@ -46,28 +46,42 @@
#include <seastar/core/reactor.hh>
#include "auth.hh"
#include "common.hh"
#include "password_authenticator.hh"
#include "authenticated_user.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "log.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
const sstring auth::password_authenticator::PASSWORD_AUTHENTICATOR_NAME("org.apache.cassandra.auth.PasswordAuthenticator");
const sstring& auth::password_authenticator_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "PasswordAuthenticator";
return name;
}
// name of the hash column.
static const sstring SALTED_HASH = "salted_hash";
static const sstring USER_NAME = "username";
static const sstring DEFAULT_USER_NAME = auth::auth::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = auth::auth::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_NAME = auth::meta::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = auth::meta::DEFAULT_SUPERUSER_NAME;
static const sstring CREDENTIALS_CF = "credentials";
static logging::logger logger("password_authenticator");
static logging::logger plogger("password_authenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
auth::authenticator,
auth::password_authenticator,
cql3::query_processor&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
auth::password_authenticator::~password_authenticator()
{}
auth::password_authenticator::password_authenticator()
{}
auth::password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm) {
}
// TODO: blowfish
// Origin uses Java bcrypt library, i.e. blowfish salt
@@ -88,12 +102,10 @@ auth::password_authenticator::password_authenticator()
// and some old-fashioned random salt generation.
static constexpr size_t rand_bytes = 16;
static thread_local crypt_data tlcrypt = { 0, };
static sstring hashpw(const sstring& pass, const sstring& salt) {
// crypt_data is huge. should this be a thread_local static?
auto tmp = std::make_unique<crypt_data>();
tmp->initialized = 0;
auto res = crypt_r(pass.c_str(), salt.c_str(), tmp.get());
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (res == nullptr) {
throw std::system_error(errno, std::system_category());
}
@@ -122,17 +134,16 @@ static sstring gensalt() {
sstring salt;
if (!prefix.empty()) {
return prefix + salt;
return prefix + input;
}
auto tmp = std::make_unique<crypt_data>();
tmp->initialized = 0;
// Try in order:
// blowfish 2011 fix, blowfish, sha512, sha256, md5
for (sstring pfx : { "$2y$", "$2a$", "$6$", "$5$", "$1$" }) {
salt = pfx + input;
if (crypt_r("fisk", salt.c_str(), tmp.get())) {
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
prefix = pfx;
return salt;
}
@@ -144,39 +155,52 @@ static sstring hashpw(const sstring& pass) {
return hashpw(pass, gensalt());
}
future<> auth::password_authenticator::init() {
gensalt(); // do this once to determine usable hashing
future<> auth::password_authenticator::start() {
return auth::once_among_shards([this] {
gensalt(); // do this once to determine usable hashing
sstring create_table = sprint(
"CREATE TABLE %s.%s ("
"%s text,"
"%s text," // salt + hash + number of rounds
"options map<text,text>,"// for future extensions
"PRIMARY KEY(%s)"
") WITH gc_grace_seconds=%d",
auth::auth::AUTH_KS,
CREDENTIALS_CF, USER_NAME, SALTED_HASH, USER_NAME,
90 * 24 * 60 * 60); // 3 months.
static const sstring create_table = sprint(
"CREATE TABLE %s.%s ("
"%s text,"
"%s text," // salt + hash + number of rounds
"options map<text,text>,"// for future extensions
"PRIMARY KEY(%s)"
") WITH gc_grace_seconds=%d",
meta::AUTH_KS,
CREDENTIALS_CF, USER_NAME, SALTED_HASH, USER_NAME,
90 * 24 * 60 * 60); // 3 months.
return auth::setup_table(CREDENTIALS_CF, create_table).then([this] {
// instead of once-timer, just schedule this later
auth::schedule_when_up([] {
return auth::has_existing_users(CREDENTIALS_CF, DEFAULT_USER_NAME, USER_NAME).then([](bool exists) {
if (!exists) {
cql3::get_local_query_processor().process(sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
auth::AUTH_KS,
CREDENTIALS_CF,
USER_NAME, SALTED_HASH
),
db::consistency_level::ONE, {DEFAULT_USER_NAME, hashpw(DEFAULT_USER_PASSWORD)}).then([](auto) {
logger.info("Created default user '{}'", DEFAULT_USER_NAME);
});
}
return auth::create_metadata_table_if_missing(
CREDENTIALS_CF,
_qp,
create_table,
_migration_manager).then([this] {
auth::delay_until_system_ready(_delayed, [this] {
return has_existing_users().then([this](bool existing) {
if (!existing) {
return _qp.process(
sprint(
"INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
meta::AUTH_KS,
CREDENTIALS_CF,
USER_NAME, SALTED_HASH),
db::consistency_level::ONE,
{ DEFAULT_USER_NAME, hashpw(DEFAULT_USER_PASSWORD) }).then([](auto) {
plogger.info("Created default user '{}'", DEFAULT_USER_NAME);
});
}
return make_ready_future<>();
});
});
});
});
}
future<> auth::password_authenticator::stop() {
return make_ready_future<>();
}
db::consistency_level auth::password_authenticator::consistency_for_user(const sstring& username) {
if (username == DEFAULT_USER_NAME) {
return db::consistency_level::QUORUM;
@@ -184,8 +208,8 @@ db::consistency_level auth::password_authenticator::consistency_for_user(const s
return db::consistency_level::LOCAL_ONE;
}
const sstring& auth::password_authenticator::class_name() const {
return PASSWORD_AUTHENTICATOR_NAME;
const sstring& auth::password_authenticator::qualified_java_name() const {
return password_authenticator_name();
}
bool auth::password_authenticator::require_authentication() const {
@@ -201,8 +225,7 @@ auth::authenticator::option_set auth::password_authenticator::alterable_options(
}
future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::authenticate(
const credentials_map& credentials) const
throw (exceptions::authentication_exception) {
const credentials_map& credentials) const {
if (!credentials.count(USERNAME_KEY)) {
throw exceptions::authentication_exception(sprint("Required key '%s' is missing", USERNAME_KEY));
}
@@ -218,12 +241,11 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
// obsolete prepared statements pretty quickly.
// Rely on query processing caching statements instead, and lets assume
// that a map lookup string->statement is not gonna kill us much.
auto& qp = cql3::get_local_query_processor();
return qp.process(
sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME),
consistency_for_user(username), { username }, true).then_wrapped(
[=](future<::shared_ptr<cql3::untyped_result_set>> f) {
return futurize_apply([this, username, password] {
return _qp.process(sprint("SELECT %s FROM %s.%s WHERE %s = ?", SALTED_HASH,
meta::AUTH_KS, CREDENTIALS_CF, USER_NAME),
consistency_for_user(username), {username}, true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {
@@ -234,63 +256,57 @@ future<::shared_ptr<auth::authenticated_user> > auth::password_authenticator::au
std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));
} catch (exceptions::request_execution_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.what()));
} catch (...) {
std::throw_with_nested(exceptions::authentication_exception("authentication failed"));
}
});
}
future<> auth::password_authenticator::create(sstring username,
const option_map& options)
throw (exceptions::request_validation_exception,
exceptions::request_execution_exception) {
const option_map& options) {
try {
auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));
auto query = sprint("INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME, SALTED_HASH);
auto& qp = cql3::get_local_query_processor();
return qp.process(query, consistency_for_user(username), { username, hashpw(password) }).discard_result();
meta::AUTH_KS, CREDENTIALS_CF, USER_NAME, SALTED_HASH);
return _qp.process(query, consistency_for_user(username), { username, hashpw(password) }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
}
}
future<> auth::password_authenticator::alter(sstring username,
const option_map& options)
throw (exceptions::request_validation_exception,
exceptions::request_execution_exception) {
const option_map& options) {
try {
auto password = boost::any_cast<sstring>(options.at(option::PASSWORD));
auto query = sprint("UPDATE %s.%s SET %s = ? WHERE %s = ?",
auth::AUTH_KS, CREDENTIALS_CF, SALTED_HASH, USER_NAME);
auto& qp = cql3::get_local_query_processor();
return qp.process(query, consistency_for_user(username), { hashpw(password), username }).discard_result();
meta::AUTH_KS, CREDENTIALS_CF, SALTED_HASH, USER_NAME);
return _qp.process(query, consistency_for_user(username), { hashpw(password), username }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
}
}
future<> auth::password_authenticator::drop(sstring username)
throw (exceptions::request_validation_exception,
exceptions::request_execution_exception) {
future<> auth::password_authenticator::drop(sstring username) {
try {
auto query = sprint("DELETE FROM %s.%s WHERE %s = ?",
auth::AUTH_KS, CREDENTIALS_CF, USER_NAME);
auto& qp = cql3::get_local_query_processor();
return qp.process(query, consistency_for_user(username), { username }).discard_result();
meta::AUTH_KS, CREDENTIALS_CF, USER_NAME);
return _qp.process(query, consistency_for_user(username), { username }).discard_result();
} catch (std::out_of_range&) {
throw exceptions::invalid_request_exception("PasswordAuthenticator requires PASSWORD option");
}
}
const auth::resource_ids& auth::password_authenticator::protected_resources() const {
static const resource_ids ids({ data_resource(auth::AUTH_KS, CREDENTIALS_CF) });
static const resource_ids ids({ data_resource(meta::AUTH_KS, CREDENTIALS_CF) });
return ids;
}
::shared_ptr<auth::authenticator::sasl_challenge> auth::password_authenticator::new_sasl_challenge() const {
class plain_text_password_challenge: public sasl_challenge {
const password_authenticator& _self;
public:
plain_text_password_challenge(const password_authenticator& a)
: _authenticator(a)
plain_text_password_challenge(const password_authenticator& self) : _self(self)
{}
/**
@@ -306,9 +322,8 @@ const auth::resource_ids& auth::password_authenticator::protected_resources() co
* would expect
* @throws javax.security.sasl.SaslException
*/
bytes evaluate_response(bytes_view client_response)
throw (exceptions::authentication_exception) override {
logger.debug("Decoding credentials from client token");
bytes evaluate_response(bytes_view client_response) override {
plogger.debug("Decoding credentials from client token");
sstring username, password;
@@ -345,14 +360,59 @@ const auth::resource_ids& auth::password_authenticator::protected_resources() co
bool is_complete() const override {
return _complete;
}
future<::shared_ptr<authenticated_user>> get_authenticated_user() const
throw (exceptions::authentication_exception) override {
return _authenticator.authenticate(_credentials);
future<::shared_ptr<authenticated_user>> get_authenticated_user() const override {
return _self.authenticate(_credentials);
}
private:
const password_authenticator& _authenticator;
credentials_map _credentials;
bool _complete = false;
};
return ::make_shared<plain_text_password_challenge>(*this);
}
//
// Similar in structure to `auth::service::has_existing_users()`, but trying to generalize the pattern breaks all kinds
// of module boundaries and leaks implementation details.
//
future<bool> auth::password_authenticator::has_existing_users() const {
static const sstring default_user_query = sprint(
"SELECT * FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
CREDENTIALS_CF,
USER_NAME);
static const sstring all_users_query = sprint(
"SELECT * FROM %s.%s LIMIT 1",
meta::AUTH_KS,
CREDENTIALS_CF);
// This logic is borrowed directly from Apache Cassandra. By first checking for the presence of the default user, we
// can potentially avoid doing a range query with a high consistency level.
return _qp.process(
default_user_query,
db::consistency_level::ONE,
{ meta::DEFAULT_SUPERUSER_NAME },
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.process(
default_user_query,
db::consistency_level::QUORUM,
{ meta::DEFAULT_SUPERUSER_NAME },
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.process(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});
});
}

View File

@@ -42,31 +42,48 @@
#pragma once
#include "authenticator.hh"
#include "cql3/query_processor.hh"
#include "delayed_tasks.hh"
namespace service {
class migration_manager;
}
namespace auth {
class password_authenticator : public authenticator {
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
const sstring& password_authenticator_name();
password_authenticator();
class password_authenticator : public authenticator {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
delayed_tasks<> _delayed{};
public:
password_authenticator(cql3::query_processor&, ::service::migration_manager&);
~password_authenticator();
future<> init();
future<> start() override;
const sstring& class_name() const override;
future<> stop() override;
const sstring& qualified_java_name() const override;
bool require_authentication() const override;
option_set supported_options() const override;
option_set alterable_options() const override;
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const throw(exceptions::authentication_exception) override;
future<> create(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;
future<> alter(sstring username, const option_map& options) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;
future<> drop(sstring username) throw(exceptions::request_validation_exception, exceptions::request_execution_exception) override;
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override;
future<> create(sstring username, const option_map& options) override;
future<> alter(sstring username, const option_map& options) override;
future<> drop(sstring username) override;
const resource_ids& protected_resources() const override;
::shared_ptr<sasl_challenge> new_sasl_challenge() const override;
static db::consistency_level consistency_for_user(const sstring& username);
private:
future<bool> has_existing_users() const;
};
}

View File

@@ -40,6 +40,7 @@
*/
#include <unordered_map>
#include <boost/algorithm/string.hpp>
#include "permission.hh"
const auth::permission_set auth::permissions::ALL_DATA =
@@ -75,7 +76,9 @@ const sstring& auth::permissions::to_string(permission p) {
}
auth::permission auth::permissions::from_string(const sstring& s) {
return permission_names.at(s);
sstring upper(s);
boost::to_upper(upper);
return permission_names.at(upper);
}
std::unordered_set<sstring> auth::permissions::to_strings(const permission_set& set) {

View File

@@ -44,6 +44,7 @@
#include <unordered_set>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include "enum_set.hh"
namespace auth {

51
auth/permissions_cache.cc Normal file
View File

@@ -0,0 +1,51 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/permissions_cache.hh"
#include "auth/authorizer.hh"
#include "auth/common.hh"
#include "auth/service.hh"
#include "db/config.hh"
namespace auth {
permissions_cache_config permissions_cache_config::from_db_config(const db::config& dc) {
permissions_cache_config c;
c.max_entries = dc.permissions_cache_max_entries();
c.validity_period = std::chrono::milliseconds(dc.permissions_validity_in_ms());
c.update_period = std::chrono::milliseconds(dc.permissions_update_interval_in_ms());
return c;
}
permissions_cache::permissions_cache(const permissions_cache_config& c, service& ser, logging::logger& log)
: _cache(c.max_entries, c.validity_period, c.update_period, log, [&ser, &log](const key_type& k) {
log.debug("Refreshing permissions for {}", k.first.name());
return ser.underlying_authorizer().authorize(ser, ::make_shared<authenticated_user>(k.first), k.second);
}) {
}
future<permission_set> permissions_cache::get(::shared_ptr<authenticated_user> user, data_resource r) {
return _cache.get(key_type(*user, r));
}
}

99
auth/permissions_cache.hh Normal file
View File

@@ -0,0 +1,99 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <chrono>
#include <functional>
#include <iostream>
#include <utility>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include "auth/authenticated_user.hh"
#include "auth/data_resource.hh"
#include "auth/permission.hh"
#include "log.hh"
#include "utils/loading_cache.hh"
namespace std {
template <>
struct hash<auth::data_resource> final {
size_t operator()(const auth::data_resource & v) const {
return v.hash_value();
}
};
template <>
struct hash<auth::authenticated_user> final {
size_t operator()(const auth::authenticated_user & v) const {
return utils::tuple_hash()(v.name(), v.is_anonymous());
}
};
inline std::ostream& operator<<(std::ostream& os, const std::pair<auth::authenticated_user, auth::data_resource>& p) {
os << "{user: " << p.first.name() << ", data_resource: " << p.second << "}";
return os;
}
}
namespace db {
class config;
}
namespace auth {
class service;
struct permissions_cache_config final {
static permissions_cache_config from_db_config(const db::config&);
std::size_t max_entries;
std::chrono::milliseconds validity_period;
std::chrono::milliseconds update_period;
};
class permissions_cache final {
using cache_type = utils::loading_cache<
std::pair<authenticated_user, data_resource>,
permission_set,
utils::loading_cache_reload_enabled::yes,
utils::simple_entry_size<permission_set>,
utils::tuple_hash>;
using key_type = typename cache_type::key_type;
cache_type _cache;
public:
explicit permissions_cache(const permissions_cache_config&, service&, logging::logger&);
future <> stop() {
return _cache.stop();
}
future<permission_set> get(::shared_ptr<authenticated_user>, data_resource);
};
}

355
auth/service.cc Normal file
View File

@@ -0,0 +1,355 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/service.hh"
#include <map>
#include <seastar/core/future-util.hh>
#include <seastar/core/shared_ptr.hh>
#include "auth/allow_all_authenticator.hh"
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/config.hh"
#include "db/consistency_level.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "service/migration_listener.hh"
#include "utils/class_registrator.hh"
namespace auth {
namespace meta {
static const sstring user_name_col_name("name");
static const sstring superuser_col_name("super");
}
static logging::logger log("auth_service");
class auth_migration_listener final : public ::service::migration_listener {
authorizer& _authorizer;
public:
explicit auth_migration_listener(authorizer& a) : _authorizer(a) {
}
private:
void on_create_keyspace(const sstring& ks_name) override {}
void on_create_column_family(const sstring& ks_name, const sstring& cf_name) override {}
void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_create_function(const sstring& ks_name, const sstring& function_name) override {}
void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_create_view(const sstring& ks_name, const sstring& view_name) override {}
void on_update_keyspace(const sstring& ks_name) override {}
void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}
void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_drop_keyspace(const sstring& ks_name) override {
_authorizer.revoke_all(auth::data_resource(ks_name));
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
_authorizer.revoke_all(auth::data_resource(ks_name, cf_name));
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}
};
static db::consistency_level consistency_for_user(const sstring& name) {
if (name == meta::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
} else {
return db::consistency_level::LOCAL_ONE;
}
}
static future<::shared_ptr<cql3::untyped_result_set>> select_user(cql3::query_processor& qp, const sstring& name) {
// Here was a thread local, explicit cache of prepared statement. In normal execution this is
// fine, but since we in testing set up and tear down system over and over, we'd start using
// obsolete prepared statements pretty quickly.
// Rely on query processing caching statements instead, and lets assume
// that a map lookup string->statement is not gonna kill us much.
return qp.process(
sprint(
"SELECT * FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name),
consistency_for_user(name),
{ name },
true);
}
service_config service_config::from_db_config(const db::config& dc) {
const qualified_name qualified_authorizer_name(meta::AUTH_PACKAGE_NAME, dc.authorizer());
const qualified_name qualified_authenticator_name(meta::AUTH_PACKAGE_NAME, dc.authenticator());
service_config c;
c.authorizer_java_name = qualified_authorizer_name;
c.authenticator_java_name = qualified_authenticator_name;
return c;
}
service::service(
permissions_cache_config c,
cql3::query_processor& qp,
::service::migration_manager& mm,
std::unique_ptr<authorizer> a,
std::unique_ptr<authenticator> b)
: _permissions_cache_config(std::move(c))
, _permissions_cache(nullptr)
, _qp(qp)
, _migration_manager(mm)
, _authorizer(std::move(a))
, _authenticator(std::move(b))
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer)) {
}
service::service(
permissions_cache_config cache_config,
cql3::query_processor& qp,
::service::migration_manager& mm,
const service_config& sc)
: service(
std::move(cache_config),
qp,
mm,
create_object<authorizer>(sc.authorizer_java_name, qp, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, mm)) {
}
bool service::should_create_metadata() const {
const bool null_authorizer = _authorizer->qualified_java_name() == allow_all_authorizer_name();
const bool null_authenticator = _authenticator->qualified_java_name() == allow_all_authenticator_name();
return !null_authorizer || !null_authenticator;
}
future<> service::create_metadata_if_missing() {
auto& db = _qp.db().local();
auto f = make_ready_future<>();
if (!db.has_keyspace(meta::AUTH_KS)) {
std::map<sstring, sstring> opts{{"replication_factor", "1"}};
auto ksm = keyspace_metadata::new_keyspace(
meta::AUTH_KS,
"org.apache.cassandra.locator.SimpleStrategy",
opts,
true);
// We use min_timestamp so that default keyspace metadata will loose with any manual adjustments.
// See issue #2129.
f = _migration_manager.announce_new_keyspace(ksm, api::min_timestamp, false);
}
return f.then([this] {
// 3 months.
static const auto gc_grace_seconds = 90 * 24 * 60 * 60;
static const sstring users_table_query = sprint(
"CREATE TABLE %s.%s (%s text, %s boolean, PRIMARY KEY (%s)) WITH gc_grace_seconds=%s",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name,
meta::superuser_col_name,
meta::user_name_col_name,
gc_grace_seconds);
return create_metadata_table_if_missing(
meta::USERS_CF,
_qp,
users_table_query,
_migration_manager);
}).then([this] {
delay_until_system_ready(_delayed, [this] {
return has_existing_users().then([this](bool existing) {
if (!existing) {
//
// Create default superuser.
//
static const sstring query = sprint(
"INSERT INTO %s.%s (%s, %s) VALUES (?, ?) USING TIMESTAMP 0",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name,
meta::superuser_col_name);
return _qp.process(
query,
db::consistency_level::ONE,
{ meta::DEFAULT_SUPERUSER_NAME, true }).then([](auto&&) {
log.info("Created default superuser '{}'", meta::DEFAULT_SUPERUSER_NAME);
}).handle_exception([](auto exn) {
try {
std::rethrow_exception(exn);
} catch (const exceptions::request_execution_exception&) {
log.warn("Skipped default superuser setup: some nodes were not ready");
}
}).discard_result();
}
return make_ready_future<>();
});
});
return make_ready_future<>();
});
}
future<> service::start() {
return once_among_shards([this] {
if (should_create_metadata()) {
return create_metadata_if_missing();
}
return make_ready_future<>();
}).then([this] {
return when_all_succeed(_authorizer->start(), _authenticator->start());
}).then([this] {
_permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
}).then([this] {
return once_among_shards([this] {
_migration_manager.register_listener(_migration_listener.get());
return make_ready_future<>();
});
});
}
future<> service::stop() {
return once_among_shards([this] {
_delayed.cancel_all();
return make_ready_future<>();
}).then([this] {
return _permissions_cache->stop();
}).then([this] {
return when_all_succeed(_authorizer->stop(), _authenticator->stop());
});
}
future<bool> service::has_existing_users() const {
static const sstring default_user_query = sprint(
"SELECT * FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name);
static const sstring all_users_query = sprint(
"SELECT * FROM %s.%s LIMIT 1",
meta::AUTH_KS,
meta::USERS_CF);
// This logic is borrowed directly from Apache Cassandra. By first checking for the presence of the default user, we
// can potentially avoid doing a range query with a high consistency level.
return _qp.process(
default_user_query,
db::consistency_level::ONE,
{ meta::DEFAULT_SUPERUSER_NAME },
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.process(
default_user_query,
db::consistency_level::QUORUM,
{ meta::DEFAULT_SUPERUSER_NAME },
true).then([this](auto results) {
if (!results->empty()) {
return make_ready_future<bool>(true);
}
return _qp.process(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});
});
}
future<bool> service::is_existing_user(const sstring& name) const {
return select_user(_qp, name).then([](auto results) {
return !results->empty();
});
}
future<bool> service::is_super_user(const sstring& name) const {
return select_user(_qp, name).then([](auto results) {
return !results->empty() && results->one().template get_as<bool>(meta::superuser_col_name);
});
}
future<> service::insert_user(const sstring& name, bool is_superuser) {
return _qp.process(
sprint(
"INSERT INTO %s.%s (%s, %s) VALUES (?, ?)",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name,
meta::superuser_col_name),
consistency_for_user(name),
{ name, is_superuser }).discard_result();
}
future<> service::delete_user(const sstring& name) {
return _qp.process(
sprint(
"DELETE FROM %s.%s WHERE %s = ?",
meta::AUTH_KS,
meta::USERS_CF,
meta::user_name_col_name),
consistency_for_user(name),
{ name }).discard_result();
}
future<permission_set> service::get_permissions(::shared_ptr<authenticated_user> u, data_resource r) const {
return _permissions_cache->get(std::move(u), std::move(r));
}
//
// Free functions.
//
future<bool> is_super_user(const service& ser, const authenticated_user& u) {
if (u.is_anonymous()) {
return make_ready_future<bool>(false);
}
return ser.is_super_user(u.name());
}
}

133
auth/service.hh Normal file
View File

@@ -0,0 +1,133 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <memory>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/authenticated_user.hh"
#include "auth/permission.hh"
#include "auth/permissions_cache.hh"
#include "delayed_tasks.hh"
#include "seastarx.hh"
namespace cql3 {
class query_processor;
}
namespace db {
class config;
}
namespace service {
class migration_manager;
class migration_listener;
}
namespace auth {
class authenticator;
class authorizer;
struct service_config final {
static service_config from_db_config(const db::config&);
sstring authorizer_java_name;
sstring authenticator_java_name;
};
class service final {
permissions_cache_config _permissions_cache_config;
std::unique_ptr<permissions_cache> _permissions_cache;
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
std::unique_ptr<authorizer> _authorizer;
std::unique_ptr<authenticator> _authenticator;
// Only one of these should be registered, so we end up with some unused instances. Not the end of the world.
std::unique_ptr<::service::migration_listener> _migration_listener;
delayed_tasks<> _delayed{};
public:
service(
permissions_cache_config,
cql3::query_processor&,
::service::migration_manager&,
std::unique_ptr<authorizer>,
std::unique_ptr<authenticator>);
service(
permissions_cache_config,
cql3::query_processor&,
::service::migration_manager&,
const service_config&);
future<> start();
future<> stop();
future<bool> is_existing_user(const sstring& name) const;
future<bool> is_super_user(const sstring& name) const;
future<> insert_user(const sstring& name, bool is_superuser);
future<> delete_user(const sstring& name);
future<permission_set> get_permissions(::shared_ptr<authenticated_user>, data_resource) const;
authenticator& underlying_authenticator() {
return *_authenticator;
}
const authenticator& underlying_authenticator() const {
return *_authenticator;
}
authorizer& underlying_authorizer() {
return *_authorizer;
}
const authorizer& underlying_authorizer() const {
return *_authorizer;
}
private:
future<bool> has_existing_users() const;
bool should_create_metadata() const;
future<> create_metadata_if_missing();
};
future<bool> is_super_user(const service&, const authenticated_user&);
}

232
auth/transitional.cc Normal file
View File

@@ -0,0 +1,232 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Copyright (C) 2017 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "authenticator.hh"
#include "authenticated_user.hh"
#include "authenticator.hh"
#include "authorizer.hh"
#include "password_authenticator.hh"
#include "default_authorizer.hh"
#include "permission.hh"
#include "db/config.hh"
#include "utils/class_registrator.hh"
namespace auth {
class service;
static const sstring PACKAGE_NAME("com.scylladb.auth.");
static const sstring& transitional_authenticator_name() {
static const sstring name = PACKAGE_NAME + "TransitionalAuthenticator";
return name;
}
static const sstring& transitional_authorizer_name() {
static const sstring name = PACKAGE_NAME + "TransitionalAuthorizer";
return name;
}
class transitional_authenticator : public authenticator {
std::unique_ptr<authenticator> _authenticator;
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
transitional_authenticator(cql3::query_processor& qp, ::service::migration_manager& mm)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, mm))
{}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a))
{}
future<> start() override {
return _authenticator->start();
}
future<> stop() override {
return _authenticator->stop();
}
const sstring& qualified_java_name() const override {
return transitional_authenticator_name();
}
bool require_authentication() const override {
return true;
}
option_set supported_options() const override {
return _authenticator->supported_options();
}
option_set alterable_options() const override {
return _authenticator->alterable_options();
}
future<::shared_ptr<authenticated_user>> authenticate(const credentials_map& credentials) const override {
auto i = credentials.find(authenticator::USERNAME_KEY);
if ((i == credentials.end() || i->second.empty()) && (!credentials.count(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
// return anon user
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
}
return make_ready_future().then([this, &credentials] {
return _authenticator->authenticate(credentials);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
// return anon user
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
}
});
}
future<> create(sstring username, const option_map& options) override {
return _authenticator->create(username, options);
}
future<> alter(sstring username, const option_map& options) override {
return _authenticator->alter(username, options);
}
future<> drop(sstring username) override {
return _authenticator->drop(username);
}
const resource_ids& protected_resources() const override {
return _authenticator->protected_resources();
}
::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
class sasl_wrapper : public sasl_challenge {
public:
sasl_wrapper(::shared_ptr<sasl_challenge> sasl)
: _sasl(std::move(sasl))
{}
bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (exceptions::authentication_exception&) {
_complete = true;
return {};
}
}
bool is_complete() const {
return _complete || _sasl->is_complete();
}
future<::shared_ptr<authenticated_user>> get_authenticated_user() const {
return futurize_apply([this] {
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
// return anon user
return make_ready_future<::shared_ptr<authenticated_user>>(::make_shared<authenticated_user>());
}
});
});
}
private:
::shared_ptr<sasl_challenge> _sasl;
bool _complete = false;
};
return ::make_shared<sasl_wrapper>(_authenticator->new_sasl_challenge());
}
};
class transitional_authorizer : public authorizer {
std::unique_ptr<authorizer> _authorizer;
public:
transitional_authorizer(cql3::query_processor& qp, ::service::migration_manager& mm)
: transitional_authorizer(std::make_unique<default_authorizer>(qp, mm))
{}
transitional_authorizer(std::unique_ptr<authorizer> a)
: _authorizer(std::move(a))
{}
~transitional_authorizer()
{}
future<> start() override {
return _authorizer->start();
}
future<> stop() override {
return _authorizer->stop();
}
const sstring& qualified_java_name() const override {
return transitional_authorizer_name();
}
future<permission_set> authorize(service& ser, ::shared_ptr<authenticated_user> user, data_resource resource) const override {
return is_super_user(ser, *user).then([](bool s) {
static const permission_set transitional_permissions =
permission_set::of<permission::CREATE,
permission::ALTER, permission::DROP,
permission::SELECT, permission::MODIFY>();
return make_ready_future<permission_set>(s ? permissions::ALL : transitional_permissions);
});
}
future<> grant(::shared_ptr<authenticated_user> user, permission_set ps, data_resource r, sstring s) override {
return _authorizer->grant(std::move(user), std::move(ps), std::move(r), std::move(s));
}
future<> revoke(::shared_ptr<authenticated_user> user, permission_set ps, data_resource r, sstring s) override {
return _authorizer->revoke(std::move(user), std::move(ps), std::move(r), std::move(s));
}
future<std::vector<permission_details>> list(service& ser, ::shared_ptr<authenticated_user> user, permission_set ps, optional<data_resource> r, optional<sstring> s) const override {
return _authorizer->list(ser, std::move(user), std::move(ps), std::move(r), std::move(s));
}
future<> revoke_all(sstring s) override {
return _authorizer->revoke_all(std::move(s));
}
future<> revoke_all(data_resource r) override {
return _authorizer->revoke_all(std::move(r));
}
const resource_ids& protected_resources() override {
return _authorizer->protected_resources();
}
future<> validate_configuration() const override {
return _authorizer->validate_configuration();
}
};
}
//
// To ensure correct initialization order, we unfortunately need to use string literals.
//
static const class_registrator<
auth::authenticator,
auth::transitional_authenticator,
cql3::query_processor&,
::service::migration_manager&> transitional_authenticator_reg("com.scylladb.auth.TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,
auth::transitional_authorizer,
cql3::query_processor&,
::service::migration_manager&> transitional_authorizer_reg("com.scylladb.auth.TransitionalAuthorizer");

View File

@@ -21,14 +21,17 @@
#pragma once
#include "seastarx.hh"
#include "core/sstring.hh"
#include "hashing.hh"
#include <experimental/optional>
#include <iosfwd>
#include <functional>
#include "utils/mutable_view.hh"
using bytes = basic_sstring<int8_t, uint32_t, 31>;
using bytes_view = std::experimental::basic_string_view<int8_t>;
using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
using bytes_opt = std::experimental::optional<bytes>;
using sstring_view = std::experimental::string_view;

View File

@@ -38,7 +38,7 @@ class bytes_ostream {
public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
static constexpr size_type max_chunk_size = 16 * 1024;
static constexpr size_type max_chunk_size() { return 16 * 1024; }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
@@ -59,7 +59,6 @@ private:
};
// FIXME: consider increasing chunk size as the buffer grows
static constexpr size_type chunk_size{512};
static constexpr size_type usable_chunk_size{chunk_size - sizeof(chunk)};
private:
std::unique_ptr<chunk> _begin;
chunk* _current;
@@ -100,6 +99,19 @@ private:
}
return _current->size - _current->offset;
}
// Figure out next chunk size.
// - must be enough for data_size
// - must be at least chunk_size
// - try to double each time to prevent too many allocations
// - do not exceed max_chunk_size
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: chunk_size;
next_size = std::min(next_size, max_chunk_size());
// FIXME: check for overflow?
return std::max<size_type>(next_size, data_size + sizeof(chunk));
}
// Makes room for a contiguous region of given size.
// The region is accounted for as already written.
// size must not be zero.
@@ -110,7 +122,7 @@ private:
_size += size;
return ret;
} else {
auto alloc_size = size <= usable_chunk_size ? chunk_size : (size + sizeof(chunk));
auto alloc_size = next_alloc_size(size);
auto space = malloc(alloc_size);
if (!space) {
throw std::bad_alloc();
@@ -205,7 +217,7 @@ public:
}
while (!v.empty()) {
auto this_size = std::min(v.size(), size_t(max_chunk_size));
auto this_size = std::min(v.size(), size_t(max_chunk_size()));
std::copy_n(v.begin(), this_size, alloc(this_size));
v.remove_prefix(this_size);
}
@@ -329,7 +341,7 @@ public:
// if its size is below max_chunk_size. We probably could also gain
// some read performance by doing "real" reduction, i.e. merging
// all chunks until all but the last one is max_chunk_size.
if (size() < max_chunk_size) {
if (size() < max_chunk_size()) {
linearize();
}
}

View File

@@ -0,0 +1,661 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <vector>
#include "row_cache.hh"
#include "mutation_reader.hh"
#include "streamed_mutation.hh"
#include "partition_version.hh"
#include "utils/logalloc.hh"
#include "query-request.hh"
#include "partition_snapshot_reader.hh"
#include "partition_snapshot_row_cursor.hh"
#include "read_context.hh"
#include "flat_mutation_reader.hh"
namespace cache {
extern logging::logger clogger;
class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
enum class state {
before_static_row,
// Invariants:
// - position_range(_lower_bound, _upper_bound) covers all not yet emitted positions from current range
// - if _next_row has valid iterators:
// - _next_row points to the nearest row in cache >= _lower_bound
// - _next_row_in_range = _next.position() < _upper_bound
// - if _next_row doesn't have valid iterators, it has no meaning.
reading_from_cache,
// Starts reading from underlying reader.
// The range to read is position_range(_lower_bound, min(_next_row.position(), _upper_bound)).
// Invariants:
// - _next_row_in_range = _next.position() < _upper_bound
move_to_underlying,
// Invariants:
// - Upper bound of the read is min(_next_row.position(), _upper_bound)
// - _next_row_in_range = _next.position() < _upper_bound
// - _last_row points at a direct predecessor of the next row which is going to be read.
// Used for populating continuity.
// - _population_range_starts_before_all_rows is set accordingly
reading_from_underlying,
end_of_stream
};
lw_shared_ptr<partition_snapshot> _snp;
position_in_partition::tri_compare _position_cmp;
query::clustering_key_filter_ranges _ck_ranges;
query::clustering_row_ranges::const_iterator _ck_ranges_curr;
query::clustering_row_ranges::const_iterator _ck_ranges_end;
lsa_manager _lsa_manager;
partition_snapshot_row_weakref _last_row;
// We need to be prepared that we may get overlapping and out of order
// range tombstones. We must emit fragments with strictly monotonic positions,
// so we can't just trim such tombstones to the position of the last fragment.
// To solve that, range tombstones are accumulated first in a range_tombstone_stream
// and emitted once we have a fragment with a larger position.
range_tombstone_stream _tombstones;
// Holds the lower bound of a position range which hasn't been processed yet.
// Only fragments with positions < _lower_bound have been emitted.
//
// It is assumed that !_lower_bound.is_clustering_row(). We depend on this when
// calling range_tombstone::trim_front() and when inserting dummy entries. Dummy
// entries are assumed to be only at !is_clustering_row() positions.
position_in_partition _lower_bound;
position_in_partition_view _upper_bound;
state _state = state::before_static_row;
lw_shared_ptr<read_context> _read_context;
partition_snapshot_row_cursor _next_row;
bool _next_row_in_range = false;
// True iff current population interval, since the previous clustering row, starts before all clustered rows.
// We cannot just look at _lower_bound, because emission of range tombstones changes _lower_bound and
// because we mark clustering intervals as continuous when consuming a clustering_row, it would prevent
// us from marking the interval as continuous.
// Valid when _state == reading_from_underlying.
bool _population_range_starts_before_all_rows;
future<> do_fill_buffer();
void copy_from_cache_to_buffer();
future<> process_static_row();
void move_to_end();
void move_to_next_range();
void move_to_range(query::clustering_row_ranges::const_iterator);
void move_to_next_entry();
// Emits all delayed range tombstones with positions smaller than upper_bound.
void drain_tombstones(position_in_partition_view upper_bound);
// Emits all delayed range tombstones.
void drain_tombstones();
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
void add_to_buffer(mutation_fragment&&);
future<> read_from_underlying();
void start_reading_from_underlying();
bool after_current_range(position_in_partition_view position);
bool can_populate() const;
void maybe_update_continuity();
void maybe_add_to_cache(const mutation_fragment& mf);
void maybe_add_to_cache(const clustering_row& cr);
void maybe_add_to_cache(const range_tombstone& rt);
void maybe_add_to_cache(const static_row& sr);
void maybe_set_static_row_continuous();
void finish_reader() {
push_mutation_fragment(partition_end());
_end_of_stream = true;
_state = state::end_of_stream;
}
public:
cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges&& crr,
lw_shared_ptr<read_context> ctx,
lw_shared_ptr<partition_snapshot> snp,
row_cache& cache)
: flat_mutation_reader::impl(std::move(s))
, _snp(std::move(snp))
, _position_cmp(*_schema)
, _ck_ranges(std::move(crr))
, _ck_ranges_curr(_ck_ranges.begin())
, _ck_ranges_end(_ck_ranges.end())
, _lsa_manager(cache)
, _tombstones(*_schema)
, _lower_bound(position_in_partition::before_all_clustered_rows())
, _upper_bound(position_in_partition_view::before_all_clustered_rows())
, _read_context(std::move(ctx))
, _next_row(*_schema, *_snp)
{
clogger.trace("csm {}: table={}.{}", this, _schema->ks_name(), _schema->cf_name());
push_mutation_fragment(partition_start(std::move(dk), _snp->partition_tombstone()));
}
cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;
cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;
virtual future<> fill_buffer() override;
virtual ~cache_flat_mutation_reader() {
maybe_merge_versions(_snp, _lsa_manager.region(), _lsa_manager.read_section());
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_end_of_stream = true;
}
}
virtual future<> fast_forward_to(const dht::partition_range&) override {
clear_buffer();
_end_of_stream = true;
return make_ready_future<>();
}
virtual future<> fast_forward_to(position_range pr) override {
throw std::bad_function_call();
}
};
inline
future<> cache_flat_mutation_reader::process_static_row() {
if (_snp->version()->partition().static_row_continuous()) {
_read_context->cache().on_row_hit();
row sr = _lsa_manager.run_in_read_section([this] {
return _snp->static_row();
});
if (!sr.empty()) {
push_mutation_fragment(mutation_fragment(static_row(std::move(sr))));
}
return make_ready_future<>();
} else {
_read_context->cache().on_row_miss();
return _read_context->get_next_fragment().then([this] (mutation_fragment_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
push_mutation_fragment(std::move(*sr));
}
maybe_set_static_row_continuous();
});
}
}
inline
future<> cache_flat_mutation_reader::fill_buffer() {
if (_state == state::before_static_row) {
auto after_static_row = [this] {
if (_ck_ranges_curr == _ck_ranges_end) {
finish_reader();
return make_ready_future<>();
}
_state = state::reading_from_cache;
_lsa_manager.run_in_read_section([this] {
move_to_range(_ck_ranges_curr);
});
return fill_buffer();
};
if (_schema->has_static_columns()) {
return process_static_row().then(std::move(after_static_row));
} else {
return after_static_row();
}
}
clogger.trace("csm {}: fill_buffer(), range={}, lb={}", this, *_ck_ranges_curr, _lower_bound);
return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this] {
return do_fill_buffer();
});
}
inline
future<> cache_flat_mutation_reader::do_fill_buffer() {
if (_state == state::move_to_underlying) {
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}).then([this] {
return read_from_underlying();
});
}
if (_state == state::reading_from_underlying) {
return read_from_underlying();
}
// assert(_state == state::reading_from_cache)
return _lsa_manager.run_in_read_section([this] {
auto next_valid = _next_row.iterators_valid();
clogger.trace("csm {}: reading_from_cache, range=[{}, {}), next={}, valid={}", this, _lower_bound,
_upper_bound, _next_row.position(), next_valid);
// We assume that if there was eviction, and thus the range may
// no longer be continuous, the cursor was invalidated.
if (!next_valid) {
auto adjacent = _next_row.advance_to(_lower_bound);
_next_row_in_range = !after_current_range(_next_row.position());
if (!adjacent && !_next_row.continuous()) {
_last_row = nullptr; // We could insert a dummy here, but this path is unlikely.
start_reading_from_underlying();
return make_ready_future<>();
}
}
_next_row.maybe_refresh();
clogger.trace("csm {}: next={}, cont={}", this, _next_row.position(), _next_row.continuous());
while (!is_buffer_full() && _state == state::reading_from_cache) {
copy_from_cache_to_buffer();
if (need_preempt()) {
break;
}
}
return make_ready_future<>();
});
}
inline
future<> cache_flat_mutation_reader::read_from_underlying() {
return consume_mutation_fragments_until(_read_context->underlying().underlying(),
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
_read_context->cache().on_row_miss();
maybe_add_to_cache(mf);
add_to_buffer(std::move(mf));
},
[this] {
_state = state::reading_from_cache;
_lsa_manager.run_in_update_section([this] {
auto same_pos = _next_row.maybe_refresh();
if (!same_pos) {
_read_context->cache().on_mispopulate(); // FIXME: Insert dummy entry at _upper_bound.
_next_row_in_range = !after_current_range(_next_row.position());
if (!_next_row.continuous()) {
start_reading_from_underlying();
}
return;
}
if (_next_row_in_range) {
maybe_update_continuity();
_last_row = _next_row;
add_to_buffer(_next_row);
try {
move_to_next_entry();
} catch (const std::bad_alloc&) {
// We cannot reenter the section, since we may have moved to the new range, and
// because add_to_buffer() should not be repeated.
_snp->region().allocator().invalidate_references(); // Invalidates _next_row
}
} else {
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
} else if (can_populate()) {
rows_entry::compare less(*_schema);
auto& rows = _snp->version()->partition().clustered_rows();
if (query::is_single_row(*_schema, *_ck_ranges_curr)) {
with_allocator(_snp->region().allocator(), [&] {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(_ck_ranges_curr->start()->value()));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto inserted = insert_result.second;
auto it = insert_result.first;
if (inserted) {
e.release();
auto next = std::next(it);
it->set_continuous(next->continuous());
clogger.trace("csm {}: inserted dummy at {}, cont={}", this, it->position(), it->continuous());
}
});
} else if (!_ck_ranges_curr->start() || _last_row.refresh(*_snp)) {
with_allocator(_snp->region().allocator(), [&] {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, _upper_bound, is_dummy::yes, is_continuous::yes));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted dummy at {}", this, _upper_bound);
e.release();
} else {
clogger.trace("csm {}: mark {} as continuous", this, insert_result.first->position());
insert_result.first->set_continuous(true);
}
});
}
} else {
_read_context->cache().on_mispopulate();
}
try {
move_to_next_range();
} catch (const std::bad_alloc&) {
// We cannot reenter the section, since we may have moved to the new range
_snp->region().allocator().invalidate_references(); // Invalidates _next_row
}
}
});
return make_ready_future<>();
});
}
inline
void cache_flat_mutation_reader::maybe_update_continuity() {
if (can_populate() && (_population_range_starts_before_all_rows || _last_row.refresh(*_snp))) {
if (_next_row.is_in_latest_version()) {
clogger.trace("csm {}: mark {} continuous", this, _next_row.get_iterator_in_latest_version()->position());
_next_row.get_iterator_in_latest_version()->set_continuous(true);
} else {
// Cover entry from older version
with_allocator(_snp->region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
rows_entry::compare less(*_schema);
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, _next_row.position(), is_dummy(_next_row.dummy()), is_continuous::yes));
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted dummy at {}", this, e->position());
e.release();
}
});
}
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const mutation_fragment& mf) {
if (mf.is_range_tombstone()) {
maybe_add_to_cache(mf.as_range_tombstone());
} else {
assert(mf.is_clustering_row());
const clustering_row& cr = mf.as_clustering_row();
maybe_add_to_cache(cr);
}
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
if (!can_populate()) {
_last_row = nullptr;
_population_range_starts_before_all_rows = false;
_read_context->cache().on_mispopulate();
return;
}
clogger.trace("csm {}: populate({})", this, cr);
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
mutation_partition& mp = _snp->version()->partition();
rows_entry::compare less(*_schema);
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));
new_entry->set_continuous(false);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
if (insert_result.second) {
_read_context->cache().on_row_insert();
new_entry.release();
}
it = insert_result.first;
rows_entry& e = *it;
if (!_ck_ranges_curr->start() || _last_row.refresh(*_snp)) {
clogger.trace("csm {}: set_continuous({})", this, e.position());
e.set_continuous(true);
} else {
_read_context->cache().on_mispopulate();
}
with_allocator(standard_allocator(), [&] {
_last_row = partition_snapshot_row_weakref(*_snp, it);
});
_population_range_starts_before_all_rows = false;
});
}
inline
bool cache_flat_mutation_reader::after_current_range(position_in_partition_view p) {
return _position_cmp(p, _upper_bound) >= 0;
}
inline
void cache_flat_mutation_reader::start_reading_from_underlying() {
clogger.trace("csm {}: start_reading_from_underlying(), range=[{}, {})", this, _lower_bound, _next_row_in_range ? _next_row.position() : _upper_bound);
_state = state::move_to_underlying;
}
inline
void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
clogger.trace("csm {}: copy_from_cache, next={}, next_row_in_range={}", this, _next_row.position(), _next_row_in_range);
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto&& rts : _snp->range_tombstones(*_schema, _lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
add_to_buffer(std::move(rts));
if (is_buffer_full()) {
return;
}
}
if (_next_row_in_range) {
_last_row = _next_row;
add_to_buffer(_next_row);
move_to_next_entry();
} else {
move_to_next_range();
}
}
inline
void cache_flat_mutation_reader::move_to_end() {
drain_tombstones();
finish_reader();
clogger.trace("csm {}: eos", this);
}
inline
void cache_flat_mutation_reader::move_to_next_range() {
auto next_it = std::next(_ck_ranges_curr);
if (next_it == _ck_ranges_end) {
move_to_end();
_ck_ranges_curr = next_it;
} else {
move_to_range(next_it);
}
}
inline
void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::const_iterator next_it) {
auto lb = position_in_partition::for_range_start(*next_it);
auto ub = position_in_partition_view::for_range_end(*next_it);
_last_row = nullptr;
_lower_bound = std::move(lb);
_upper_bound = std::move(ub);
_ck_ranges_curr = next_it;
auto adjacent = _next_row.advance_to(_lower_bound);
_next_row_in_range = !after_current_range(_next_row.position());
clogger.trace("csm {}: move_to_range(), range={}, lb={}, ub={}, next={}", this, *_ck_ranges_curr, _lower_bound, _upper_bound, _next_row.position());
if (!adjacent && !_next_row.continuous()) {
// FIXME: We don't insert a dummy for singular range to avoid allocating 3 entries
// for a hit (before, at and after). If we supported the concept of an incomplete row,
// we could insert such a row for the lower bound if it's full instead, for both singular and
// non-singular ranges.
if (_ck_ranges_curr->start() && !query::is_single_row(*_schema, *_ck_ranges_curr)) {
// Insert dummy for lower bound
if (can_populate()) {
// FIXME: _lower_bound could be adjacent to the previous row, in which case we could skip this
clogger.trace("csm {}: insert dummy at {}", this, _lower_bound);
auto it = with_allocator(_lsa_manager.region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);
return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);
});
_last_row = partition_snapshot_row_weakref(*_snp, it);
} else {
_read_context->cache().on_mispopulate();
}
}
start_reading_from_underlying();
}
}
// _next_row must be inside the range.
inline
void cache_flat_mutation_reader::move_to_next_entry() {
clogger.trace("csm {}: move_to_next_entry(), curr={}", this, _next_row.position());
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
} else {
if (!_next_row.next()) {
move_to_end();
return;
}
_next_row_in_range = !after_current_range(_next_row.position());
clogger.trace("csm {}: next={}, cont={}, in_range={}", this, _next_row.position(), _next_row.continuous(), _next_row_in_range);
if (!_next_row.continuous()) {
start_reading_from_underlying();
}
}
}
inline
void cache_flat_mutation_reader::drain_tombstones(position_in_partition_view pos) {
while (true) {
reserve_one();
auto mfo = _tombstones.get_next(pos);
if (!mfo) {
break;
}
push_mutation_fragment(std::move(*mfo));
}
}
inline
void cache_flat_mutation_reader::drain_tombstones() {
while (true) {
reserve_one();
auto mfo = _tombstones.get_next();
if (!mfo) {
break;
}
push_mutation_fragment(std::move(*mfo));
}
}
inline
void cache_flat_mutation_reader::add_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_to_buffer({})", this, mf);
if (mf.is_clustering_row()) {
add_clustering_row_to_buffer(std::move(mf));
} else {
assert(mf.is_range_tombstone());
add_to_buffer(std::move(mf).as_range_tombstone());
}
}
inline
void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_cursor& row) {
if (!row.dummy()) {
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(row.row());
}
}
// Maintains the following invariants, also in case of exception:
// (1) no fragment with position >= _lower_bound was pushed yet
// (2) If _lower_bound > mf.position(), mf was emitted
inline
void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_clustering_row_to_buffer({})", this, mf);
auto& row = mf.as_clustering_row();
auto key = row.key();
try {
drain_tombstones(row.position());
push_mutation_fragment(std::move(mf));
_lower_bound = position_in_partition::after_key(std::move(key));
} catch (...) {
// We may have emitted some of the range tombstones which start after the old _lower_bound
_lower_bound = position_in_partition::for_key(std::move(key));
throw;
}
}
inline
void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {
clogger.trace("csm {}: add_to_buffer({})", this, rt);
// This guarantees that rt starts after any emitted clustering_row
if (!rt.trim_front(*_schema, _lower_bound)) {
return;
}
_lower_bound = position_in_partition(rt.position());
_tombstones.apply(std::move(rt));
drain_tombstones(_lower_bound);
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone& rt) {
if (can_populate()) {
clogger.trace("csm {}: maybe_add_to_cache({})", this, rt);
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().row_tombstones().apply_monotonically(*_schema, rt);
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const static_row& sr) {
if (can_populate()) {
clogger.trace("csm {}: populate({})", this, sr);
_read_context->cache().on_row_insert();
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().static_row().apply(*_schema, column_kind::static_column, sr.cells());
});
} else {
_read_context->cache().on_mispopulate();
}
}
inline
void cache_flat_mutation_reader::maybe_set_static_row_continuous() {
if (can_populate()) {
clogger.trace("csm {}: set static row continuous", this);
_snp->version()->partition().set_static_row_continuous(true);
} else {
_read_context->cache().on_mispopulate();
}
}
inline
bool cache_flat_mutation_reader::can_populate() const {
return _snp->at_latest_version() && _read_context->cache().phase_of(_read_context->key()) == _read_context->phase();
}
} // namespace cache
inline flat_mutation_reader make_cache_flat_mutation_reader(schema_ptr s,
dht::decorated_key dk,
query::clustering_key_filter_ranges crr,
row_cache& cache,
lw_shared_ptr<cache::read_context> ctx,
lw_shared_ptr<partition_snapshot> snp)
{
return make_flat_mutation_reader<cache::cache_flat_mutation_reader>(
std::move(s), std::move(dk), std::move(crr), std::move(ctx), std::move(snp), cache);
}

View File

@@ -24,6 +24,7 @@
#include <boost/lexical_cast.hpp>
#include "exceptions/exceptions.hh"
#include "json.hh"
#include "seastarx.hh"
class schema;
@@ -58,30 +59,34 @@ class caching_options {
caching_options() : _key_cache(default_key), _row_cache(default_row) {}
public:
sstring to_sstring() const {
return json::to_json(std::map<sstring, sstring>({{ "keys", _key_cache }, { "rows_per_partition", _row_cache }}));
std::map<sstring, sstring> to_map() const {
return {{ "keys", _key_cache }, { "rows_per_partition", _row_cache }};
}
static caching_options from_sstring(const sstring& str) {
auto map = json::to_map(str);
if (map.size() > 2) {
throw exceptions::configuration_exception("Invalid map: " + str);
}
sstring k;
sstring r;
if (map.count("keys")) {
k = map.at("keys");
} else {
k = default_key;
}
sstring to_sstring() const {
return json::to_json(to_map());
}
if (map.count("rows_per_partition")) {
r = map.at("rows_per_partition");
} else {
r = default_row;
template<typename Map>
static caching_options from_map(const Map & map) {
sstring k = default_key;
sstring r = default_row;
for (auto& p : map) {
if (p.first == "keys") {
k = p.second;
} else if (p.first == "rows_per_partition") {
r = p.second;
} else {
throw exceptions::configuration_exception("Invalid caching option: " + p.first);
}
}
return caching_options(k, r);
}
static caching_options from_sstring(const sstring& str) {
return from_map(json::to_map(str));
}
bool operator==(const caching_options& other) const {
return _key_cache == other._key_cache && _row_cache == other._row_cache;
}

View File

@@ -22,6 +22,7 @@
#include "canonical_mutation.hh"
#include "mutation.hh"
#include "mutation_partition_serializer.hh"
#include "counters.hh"
#include "converting_mutation_partition_applier.hh"
#include "hashing_partition_visitor.hh"
#include "utils/UUID.hh"
@@ -44,7 +45,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
mutation_partition_serializer part_ser(*m.schema(), m.partition());
bytes_ostream out;
ser::writer_of_canonical_mutation wr(out);
ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
std::move(wr).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())

566
cell_locking.hh Normal file
View File

@@ -0,0 +1,566 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <boost/intrusive/unordered_set.hpp>
#if __has_include(<boost/container/small_vector.hpp>)
#include <boost/container/small_vector.hpp>
template <typename T, size_t N>
using small_vector = boost::container::small_vector<T, N>;
#else
#include <vector>
template <typename T, size_t N>
using small_vector = std::vector<T>;
#endif
#include "fnv1a_hasher.hh"
#include "streamed_mutation.hh"
#include "mutation_partition.hh"
class cells_range {
using ids_vector_type = small_vector<column_id, 5>;
position_in_partition_view _position;
ids_vector_type _ids;
public:
using iterator = ids_vector_type::iterator;
using const_iterator = ids_vector_type::const_iterator;
cells_range()
: _position(position_in_partition_view(position_in_partition_view::static_row_tag_t())) { }
explicit cells_range(position_in_partition_view pos, const row& cells)
: _position(pos)
{
_ids.reserve(cells.size());
cells.for_each_cell([this] (auto id, auto&&) {
_ids.emplace_back(id);
});
}
position_in_partition_view position() const { return _position; }
bool empty() const { return _ids.empty(); }
auto begin() const { return _ids.begin(); }
auto end() const { return _ids.end(); }
};
class partition_cells_range {
const mutation_partition& _mp;
public:
class iterator {
const mutation_partition& _mp;
stdx::optional<mutation_partition::rows_type::const_iterator> _position;
cells_range _current;
public:
explicit iterator(const mutation_partition& mp)
: _mp(mp)
, _current(position_in_partition_view(position_in_partition_view::static_row_tag_t()), mp.static_row())
{ }
iterator(const mutation_partition& mp, mutation_partition::rows_type::const_iterator it)
: _mp(mp)
, _position(it)
{ }
iterator& operator++() {
if (!_position) {
_position = _mp.clustered_rows().begin();
} else {
++(*_position);
}
if (_position != _mp.clustered_rows().end()) {
auto it = *_position;
_current = cells_range(position_in_partition_view(position_in_partition_view::clustering_row_tag_t(), it->key()),
it->row().cells());
}
return *this;
}
iterator operator++(int) {
iterator it(*this);
operator++();
return it;
}
cells_range& operator*() {
return _current;
}
cells_range* operator->() {
return &_current;
}
bool operator==(const iterator& other) const {
return _position == other._position;
}
bool operator!=(const iterator& other) const {
return !(*this == other);
}
};
public:
explicit partition_cells_range(const mutation_partition& mp) : _mp(mp) { }
iterator begin() const {
return iterator(_mp);
}
iterator end() const {
return iterator(_mp, _mp.clustered_rows().end());
}
};
class locked_cell;
struct cell_locker_stats {
uint64_t lock_acquisitions = 0;
uint64_t operations_waiting_for_lock = 0;
};
class cell_locker {
public:
using timeout_clock = lowres_clock;
private:
using semaphore_type = basic_semaphore<default_timeout_exception_factory, timeout_clock>;
class partition_entry;
struct cell_address {
position_in_partition position;
column_id id;
};
class cell_entry : public bi::unordered_set_base_hook<bi::link_mode<bi::auto_unlink>>,
public enable_lw_shared_from_this<cell_entry> {
partition_entry& _parent;
cell_address _address;
semaphore_type _semaphore { 0 };
friend class cell_locker;
public:
cell_entry(partition_entry& parent, position_in_partition position, column_id id)
: _parent(parent)
, _address { std::move(position), id }
{ }
// Upgrades cell_entry to another schema.
// Changes the value of cell_address, so cell_entry has to be
// temporarily removed from its parent partition_entry.
// Returns true if the cell_entry still exist in the new schema and
// should be reinserted.
bool upgrade(const schema& from, const schema& to, column_kind kind) noexcept {
auto& old_column_mapping = from.get_column_mapping();
auto& column = old_column_mapping.column_at(kind, _address.id);
auto cdef = to.get_column_definition(column.name());
if (!cdef) {
return false;
}
_address.id = cdef->id;
return true;
}
const position_in_partition& position() const {
return _address.position;
}
future<> lock(timeout_clock::time_point _timeout) {
return _semaphore.wait(_timeout);
}
void unlock() {
_semaphore.signal();
}
~cell_entry() {
if (!is_linked()) {
return;
}
unlink();
if (!--_parent._cell_count) {
delete &_parent;
}
}
class hasher {
const schema* _schema; // pointer instead of reference for default assignment
public:
explicit hasher(const schema& s) : _schema(&s) { }
size_t operator()(const cell_address& ca) const {
fnv1a_hasher hasher;
ca.position.feed_hash(hasher, *_schema);
::feed_hash(hasher, ca.id);
return hasher.finalize();
}
size_t operator()(const cell_entry& ce) const {
return operator()(ce._address);
}
};
class equal_compare {
position_in_partition::equal_compare _cmp;
private:
bool do_compare(const cell_address& a, const cell_address& b) const {
return a.id == b.id && _cmp(a.position, b.position);
}
public:
explicit equal_compare(const schema& s) : _cmp(s) { }
bool operator()(const cell_address& ca, const cell_entry& ce) const {
return do_compare(ca, ce._address);
}
bool operator()(const cell_entry& ce, const cell_address& ca) const {
return do_compare(ca, ce._address);
}
bool operator()(const cell_entry& a, const cell_entry& b) const {
return do_compare(a._address, b._address);
}
};
};
class partition_entry : public bi::unordered_set_base_hook<bi::link_mode<bi::auto_unlink>> {
using cells_type = bi::unordered_set<cell_entry,
bi::equal<cell_entry::equal_compare>,
bi::hash<cell_entry::hasher>,
bi::constant_time_size<false>>;
static constexpr size_t initial_bucket_count = 16;
using max_load_factor = std::ratio<3, 4>;
dht::decorated_key _key;
cell_locker& _parent;
size_t _rehash_at_size = compute_rehash_at_size(initial_bucket_count);
std::unique_ptr<cells_type::bucket_type[]> _buckets; // TODO: start with internal storage?
size_t _cell_count = 0; // cells_type::empty() is not O(1) if the hook is auto-unlink
cells_type::bucket_type _internal_buckets[initial_bucket_count];
cells_type _cells;
schema_ptr _schema;
friend class cell_entry;
private:
static constexpr size_t compute_rehash_at_size(size_t bucket_count) {
return bucket_count * max_load_factor::num / max_load_factor::den;
}
void maybe_rehash() {
if (_cell_count >= _rehash_at_size) {
auto new_bucket_count = std::min(_cells.bucket_count() * 2, _cells.bucket_count() + 1024);
auto buckets = std::make_unique<cells_type::bucket_type[]>(new_bucket_count);
_cells.rehash(cells_type::bucket_traits(buckets.get(), new_bucket_count));
_buckets = std::move(buckets);
_rehash_at_size = compute_rehash_at_size(new_bucket_count);
}
}
public:
partition_entry(schema_ptr s, cell_locker& parent, const dht::decorated_key& dk)
: _key(dk)
, _parent(parent)
, _cells(cells_type::bucket_traits(_internal_buckets, initial_bucket_count),
cell_entry::hasher(*s), cell_entry::equal_compare(*s))
, _schema(s)
{ }
~partition_entry() {
if (is_linked()) {
_parent._partition_count--;
}
}
// Upgrades partition entry to new schema. Returns false if all
// cell_entries has been removed during the upgrade.
bool upgrade(schema_ptr new_schema);
void insert(lw_shared_ptr<cell_entry> cell) {
_cells.insert(*cell);
_cell_count++;
maybe_rehash();
}
cells_type& cells() {
return _cells;
}
struct hasher {
size_t operator()(const dht::decorated_key& dk) const {
return std::hash<dht::decorated_key>()(dk);
}
size_t operator()(const partition_entry& pe) const {
return operator()(pe._key);
}
};
class equal_compare {
dht::decorated_key_equals_comparator _cmp;
public:
explicit equal_compare(const schema& s) : _cmp(s) { }
bool operator()(const dht::decorated_key& dk, const partition_entry& pe) {
return _cmp(dk, pe._key);
}
bool operator()(const partition_entry& pe, const dht::decorated_key& dk) {
return _cmp(dk, pe._key);
}
bool operator()(const partition_entry& a, const partition_entry& b) {
return _cmp(a._key, b._key);
}
};
};
using partitions_type = bi::unordered_set<partition_entry,
bi::equal<partition_entry::equal_compare>,
bi::hash<partition_entry::hasher>,
bi::constant_time_size<false>>;
static constexpr size_t initial_bucket_count = 4 * 1024;
using max_load_factor = std::ratio<3, 4>;
std::unique_ptr<partitions_type::bucket_type[]> _buckets;
partitions_type _partitions;
size_t _partition_count = 0;
size_t _rehash_at_size = compute_rehash_at_size(initial_bucket_count);
schema_ptr _schema;
// partitions_type uses equality comparator which keeps a reference to the
// original schema, we must ensure that it doesn't die.
schema_ptr _original_schema;
cell_locker_stats& _stats;
friend class locked_cell;
private:
struct locker;
static constexpr size_t compute_rehash_at_size(size_t bucket_count) {
return bucket_count * max_load_factor::num / max_load_factor::den;
}
void maybe_rehash() {
if (_partition_count >= _rehash_at_size) {
auto new_bucket_count = std::min(_partitions.bucket_count() * 2, _partitions.bucket_count() + 64 * 1024);
auto buckets = std::make_unique<partitions_type::bucket_type[]>(new_bucket_count);
_partitions.rehash(partitions_type::bucket_traits(buckets.get(), new_bucket_count));
_buckets = std::move(buckets);
_rehash_at_size = compute_rehash_at_size(new_bucket_count);
}
}
public:
explicit cell_locker(schema_ptr s, cell_locker_stats& stats)
: _buckets(std::make_unique<partitions_type::bucket_type[]>(initial_bucket_count))
, _partitions(partitions_type::bucket_traits(_buckets.get(), initial_bucket_count),
partition_entry::hasher(), partition_entry::equal_compare(*s))
, _schema(s)
, _original_schema(std::move(s))
, _stats(stats)
{ }
~cell_locker() {
assert(_partitions.empty());
}
void set_schema(schema_ptr s) {
_schema = s;
}
schema_ptr schema() const {
return _schema;
}
// partition_cells_range is required to be in cell_locker::schema()
future<std::vector<locked_cell>> lock_cells(const dht::decorated_key& dk, partition_cells_range&& range,
timeout_clock::time_point timeout);
};
class locked_cell {
lw_shared_ptr<cell_locker::cell_entry> _entry;
public:
explicit locked_cell(lw_shared_ptr<cell_locker::cell_entry> entry)
: _entry(std::move(entry)) { }
locked_cell(const locked_cell&) = delete;
locked_cell(locked_cell&&) = default;
~locked_cell() {
if (_entry) {
_entry->unlock();
}
}
};
struct cell_locker::locker {
cell_entry::hasher _hasher;
cell_entry::equal_compare _eq_cmp;
partition_entry& _partition_entry;
partition_cells_range _range;
partition_cells_range::iterator _current_ck;
cells_range::const_iterator _current_cell;
timeout_clock::time_point _timeout;
std::vector<locked_cell> _locks;
cell_locker_stats& _stats;
private:
void update_ck() {
if (!is_done()) {
_current_cell = _current_ck->begin();
}
}
future<> lock_next();
bool is_done() const { return _current_ck == _range.end(); }
public:
explicit locker(const ::schema& s, cell_locker_stats& st, partition_entry& pe, partition_cells_range&& range, timeout_clock::time_point timeout)
: _hasher(s)
, _eq_cmp(s)
, _partition_entry(pe)
, _range(std::move(range))
, _current_ck(_range.begin())
, _timeout(timeout)
, _stats(st)
{
update_ck();
}
locker(const locker&) = delete;
locker(locker&&) = delete;
future<> lock_all() {
// Cannot defer before first call to lock_next().
return lock_next().then([this] {
return do_until([this] { return is_done(); }, [this] {
return lock_next();
});
});
}
std::vector<locked_cell> get() && { return std::move(_locks); }
};
inline
future<std::vector<locked_cell>> cell_locker::lock_cells(const dht::decorated_key& dk, partition_cells_range&& range, timeout_clock::time_point timeout) {
partition_entry::hasher pe_hash;
partition_entry::equal_compare pe_eq(*_schema);
auto it = _partitions.find(dk, pe_hash, pe_eq);
std::unique_ptr<partition_entry> partition;
if (it == _partitions.end()) {
partition = std::make_unique<partition_entry>(_schema, *this, dk);
} else if (!it->upgrade(_schema)) {
partition = std::unique_ptr<partition_entry>(&*it);
_partition_count--;
_partitions.erase(it);
}
if (partition) {
std::vector<locked_cell> locks;
for (auto&& r : range) {
if (r.empty()) {
continue;
}
for (auto&& c : r) {
auto cell = make_lw_shared<cell_entry>(*partition, position_in_partition(r.position()), c);
_stats.lock_acquisitions++;
partition->insert(cell);
locks.emplace_back(std::move(cell));
}
}
if (!locks.empty()) {
_partitions.insert(*partition.release());
_partition_count++;
maybe_rehash();
}
return make_ready_future<std::vector<locked_cell>>(std::move(locks));
}
auto l = std::make_unique<locker>(*_schema, _stats, *it, std::move(range), timeout);
auto f = l->lock_all();
return f.then([l = std::move(l)] {
return std::move(*l).get();
});
}
inline
future<> cell_locker::locker::lock_next() {
while (!is_done()) {
if (_current_cell == _current_ck->end()) {
++_current_ck;
update_ck();
continue;
}
auto cid = *_current_cell++;
cell_address ca { position_in_partition(_current_ck->position()), cid };
auto it = _partition_entry.cells().find(ca, _hasher, _eq_cmp);
if (it != _partition_entry.cells().end()) {
_stats.operations_waiting_for_lock++;
return it->lock(_timeout).then([this, ce = it->shared_from_this()] () mutable {
_stats.operations_waiting_for_lock--;
_stats.lock_acquisitions++;
_locks.emplace_back(std::move(ce));
});
}
auto cell = make_lw_shared<cell_entry>(_partition_entry, position_in_partition(_current_ck->position()), cid);
_stats.lock_acquisitions++;
_partition_entry.insert(cell);
_locks.emplace_back(std::move(cell));
}
return make_ready_future<>();
}
inline
bool cell_locker::partition_entry::upgrade(schema_ptr new_schema) {
if (_schema == new_schema) {
return true;
}
auto buckets = std::make_unique<cells_type::bucket_type[]>(_cells.bucket_count());
auto cells = cells_type(cells_type::bucket_traits(buckets.get(), _cells.bucket_count()),
cell_entry::hasher(*new_schema), cell_entry::equal_compare(*new_schema));
_cells.clear_and_dispose([&] (cell_entry* cell_ptr) noexcept {
auto& cell = *cell_ptr;
auto kind = cell.position().is_static_row() ? column_kind::static_column
: column_kind::regular_column;
auto reinsert = cell.upgrade(*_schema, *new_schema, kind);
if (reinsert) {
cells.insert(cell);
} else {
_cell_count--;
}
});
// bi::unordered_set move assignment is actually a swap.
// Original _buckets cannot be destroyed before the container using them is
// so we need to explicitly make sure that the original _cells is no more.
_cells = std::move(cells);
auto destroy = [] (auto) { };
destroy(std::move(cells));
_buckets = std::move(buckets);
_schema = new_schema;
return _cell_count;
}

View File

@@ -27,125 +27,125 @@
class checked_file_impl : public file_impl {
public:
checked_file_impl(disk_error_signal_type& s, file f)
: _signal(s) , _file(f) {
checked_file_impl(const io_error_handler& error_handler, file f)
: _error_handler(error_handler), _file(f) {
_memory_dma_alignment = f.memory_dma_alignment();
_disk_read_dma_alignment = f.disk_read_dma_alignment();
_disk_write_dma_alignment = f.disk_write_dma_alignment();
}
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, buffer, len, pc);
});
}
virtual future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->write_dma(pos, iov, pc);
});
}
virtual future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, buffer, len, pc);
});
}
virtual future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->read_dma(pos, iov, pc);
});
}
virtual future<> flush(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->flush();
});
}
virtual future<struct stat> stat(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->stat();
});
}
virtual future<> truncate(uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->truncate(length);
});
}
virtual future<> discard(uint64_t offset, uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->discard(offset, length);
});
}
virtual future<> allocate(uint64_t position, uint64_t length) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->allocate(position, length);
});
}
virtual future<uint64_t> size(void) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->size();
});
}
virtual future<> close() override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->close();
});
}
// returns a handle for plain file, so make_checked_file() should be called
// on file returned by handle.
virtual std::unique_ptr<seastar::file_handle_impl> dup() override {
return get_file_impl(_file)->dup();
}
virtual subscription<directory_entry> list_directory(std::function<future<> (directory_entry de)> next) override {
return do_io_check(_signal, [&] {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->list_directory(next);
});
}
virtual future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) override {
return do_io_check(_error_handler, [&] {
return get_file_impl(_file)->dma_read_bulk(offset, range_size, pc);
});
}
private:
disk_error_signal_type &_signal;
const io_error_handler& _error_handler;
file _file;
};
inline file make_checked_file(disk_error_signal_type& signal, file& f)
inline file make_checked_file(const io_error_handler& error_handler, file f)
{
return file(::make_shared<checked_file_impl>(signal, f));
return file(::make_shared<checked_file_impl>(error_handler, f));
}
future<file>
inline open_checked_file_dma(disk_error_signal_type& signal,
inline open_checked_file_dma(const io_error_handler& error_handler,
sstring name, open_flags flags,
file_open_options options)
file_open_options options = {})
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return open_file_dma(name, flags, options).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}
future<file>
inline open_checked_file_dma(disk_error_signal_type& signal,
sstring name, open_flags flags)
{
return do_io_check(signal, [&] {
return open_file_dma(name, flags).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
});
});
}
future<file>
inline open_checked_directory(disk_error_signal_type& signal,
inline open_checked_directory(const io_error_handler& error_handler,
sstring name)
{
return do_io_check(signal, [&] {
return do_io_check(error_handler, [&] {
return engine().open_directory(name).then([&] (file f) {
return make_ready_future<file>(make_checked_file(signal, f));
return make_ready_future<file>(make_checked_file(error_handler, f));
});
});
}

View File

@@ -19,6 +19,6 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "gc_clock.hh"
#include "clocks-impl.hh"
std::atomic<int64_t> clocks_offset;

49
clocks-impl.hh Normal file
View File

@@ -0,0 +1,49 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
extern std::atomic<int64_t> clocks_offset;
template<typename Duration>
static inline void forward_jump_clocks(Duration delta)
{
auto d = std::chrono::duration_cast<std::chrono::seconds>(delta).count();
clocks_offset.fetch_add(d, std::memory_order_relaxed);
}
static inline std::chrono::seconds get_clocks_offset()
{
auto off = clocks_offset.load(std::memory_order_relaxed);
return std::chrono::seconds(off);
}
// Returns a time point which is earlier from t by d, or minimum time point if it cannot be represented.
template<typename Clock, typename Duration, typename Rep, typename Period>
inline
auto saturating_subtract(std::chrono::time_point<Clock, Duration> t, std::chrono::duration<Rep, Period> d) -> decltype(t) {
return std::max(t, decltype(t)::min() + d) - d;
}

View File

@@ -42,47 +42,63 @@ std::ostream& operator<<(std::ostream& out, const bound_kind k);
bound_kind invert_kind(bound_kind k);
int32_t weight(bound_kind k);
static inline bound_kind flip_bound_kind(bound_kind bk)
{
switch (bk) {
case bound_kind::excl_end: return bound_kind::excl_start;
case bound_kind::incl_end: return bound_kind::incl_start;
case bound_kind::excl_start: return bound_kind::excl_end;
case bound_kind::incl_start: return bound_kind::incl_end;
}
abort();
}
class bound_view {
const static thread_local clustering_key empty_prefix;
public:
const static thread_local clustering_key empty_prefix;
const clustering_key_prefix& prefix;
bound_kind kind;
bound_view(const clustering_key_prefix& prefix, bound_kind kind)
: prefix(prefix)
, kind(kind)
{ }
struct compare {
bound_view(const bound_view& other) noexcept = default;
bound_view& operator=(const bound_view& other) noexcept {
if (this != &other) {
this->~bound_view();
new (this) bound_view(other);
}
return *this;
}
struct tri_compare {
// To make it assignable and to avoid taking a schema_ptr, we
// wrap the schema reference.
std::reference_wrapper<const schema> _s;
compare(const schema& s) : _s(s)
tri_compare(const schema& s) : _s(s)
{ }
bool operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
int operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
auto type = _s.get().clustering_key_prefix_type();
auto res = prefix_equality_tri_compare(type->types().begin(),
type->begin(p1), type->end(p1),
type->begin(p2), type->end(p2),
tri_compare);
::tri_compare);
if (res) {
return res < 0;
return res;
}
auto d1 = p1.size(_s);
auto d2 = p2.size(_s);
if (d1 == d2) {
return w1 < w2;
return w1 - w2;
}
return d1 < d2 ? w1 <= 0 : w2 > 0;
return d1 < d2 ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));
}
int operator()(const bound_view b, const clustering_key_prefix& p) const {
return operator()(b.prefix, weight(b.kind), p, 0);
}
int operator()(const clustering_key_prefix& p, const bound_view b) const {
return operator()(p, 0, b.prefix, weight(b.kind));
}
int operator()(const bound_view b1, const bound_view b2) const {
return operator()(b1.prefix, weight(b1.kind), b2.prefix, weight(b2.kind));
}
};
struct compare {
// To make it assignable and to avoid taking a schema_ptr, we
// wrap the schema reference.
tri_compare _cmp;
compare(const schema& s) : _cmp(s)
{ }
bool operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
return _cmp(p1, w1, p2, w2) < 0;
}
bool operator()(const bound_view b, const clustering_key_prefix& p) const {
return operator()(b.prefix, weight(b.kind), p, 0);
@@ -106,20 +122,33 @@ public:
static bound_view top() {
return {empty_prefix, bound_kind::incl_end};
}
/*
template<template<typename> typename T, typename U>
concept bool Range() {
return requires (T<U> range) {
{ range.start() } -> stdx::optional<U>;
{ range.end() } -> stdx::optional<U>;
};
};*/
template<template<typename> typename Range>
static std::pair<bound_view, bound_view> from_range(const Range<clustering_key_prefix>& range) {
return {
range.start() ? bound_view(range.start()->value(), range.start()->is_inclusive() ? bound_kind::incl_start : bound_kind::excl_start) : bottom(),
range.end() ? bound_view(range.end()->value(), range.end()->is_inclusive() ? bound_kind::incl_end : bound_kind::excl_end) : top(),
};
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )
static bound_view from_range_start(const R<clustering_key_prefix>& range) {
return range.start()
? bound_view(range.start()->value(), range.start()->is_inclusive() ? bound_kind::incl_start : bound_kind::excl_start)
: bottom();
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )
static bound_view from_range_end(const R<clustering_key_prefix>& range) {
return range.end()
? bound_view(range.end()->value(), range.end()->is_inclusive() ? bound_kind::incl_end : bound_kind::excl_end)
: top();
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix> )
static std::pair<bound_view, bound_view> from_range(const R<clustering_key_prefix>& range) {
return {from_range_start(range), from_range_end(range)};
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )
static stdx::optional<typename R<clustering_key_prefix_view>::bound> to_range_bound(const bound_view& bv) {
if (&bv.prefix == &empty_prefix) {
return {};
}
bool inclusive = bv.kind != bound_kind::excl_end && bv.kind != bound_kind::excl_start;
return {typename R<clustering_key_prefix_view>::bound(bv.prefix.view(), inclusive)};
}
friend std::ostream& operator<<(std::ostream& out, const bound_view& b) {
return out << "{bound: prefix=" << b.prefix << ", kind=" << b.kind << "}";

View File

@@ -54,6 +54,7 @@ public:
auto end() const { return _ref.end(); }
bool empty() const { return _ref.empty(); }
size_t size() const { return _ref.size(); }
const clustering_row_ranges& ranges() const { return _ref; }
static clustering_key_filter_ranges get_ranges(const schema& schema, const query::partition_slice& slice, const partition_key& key) {
const query::clustering_row_ranges& ranges = slice.row_ranges(schema, key);

219
clustering_ranges_walker.hh Normal file
View File

@@ -0,0 +1,219 @@
/*
* Copyright (C) 2017 ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "schema.hh"
#include "query-request.hh"
#include "streamed_mutation.hh"
// Utility for in-order checking of overlap with position ranges.
class clustering_ranges_walker {
const schema& _schema;
const query::clustering_row_ranges& _ranges;
query::clustering_row_ranges::const_iterator _current;
query::clustering_row_ranges::const_iterator _end;
bool _in_current; // next position is known to be >= _current_start
bool _with_static_row;
position_in_partition_view _current_start;
position_in_partition_view _current_end;
stdx::optional<position_in_partition> _trim;
size_t _change_counter = 1;
private:
bool advance_to_next_range() {
_in_current = false;
if (!_current_start.is_static_row()) {
if (_current == _end) {
return false;
}
++_current;
}
++_change_counter;
if (_current == _end) {
_current_end = _current_start = position_in_partition_view::after_all_clustered_rows();
return false;
}
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
return true;
}
public:
clustering_ranges_walker(const schema& s, const query::clustering_row_ranges& ranges, bool with_static_row = true)
: _schema(s)
, _ranges(ranges)
, _current(ranges.begin())
, _end(ranges.end())
, _in_current(with_static_row)
, _with_static_row(with_static_row)
, _current_start(position_in_partition_view::for_static_row())
, _current_end(position_in_partition_view::before_all_clustered_rows())
{
if (!with_static_row) {
if (_current == _end) {
_current_start = position_in_partition_view::before_all_clustered_rows();
} else {
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
}
}
}
clustering_ranges_walker(clustering_ranges_walker&& o) noexcept
: _schema(o._schema)
, _ranges(o._ranges)
, _current(o._current)
, _end(o._end)
, _in_current(o._in_current)
, _with_static_row(o._with_static_row)
, _current_start(o._current_start)
, _current_end(o._current_end)
, _trim(std::move(o._trim))
, _change_counter(o._change_counter)
{ }
clustering_ranges_walker& operator=(clustering_ranges_walker&& o) {
if (this != &o) {
this->~clustering_ranges_walker();
new (this) clustering_ranges_walker(std::move(o));
}
return *this;
}
// Excludes positions smaller than pos from the ranges.
// pos should be monotonic.
// No constraints between pos and positions passed to advance_to().
//
// After the invocation, when !out_of_range(), lower_bound() returns the smallest position still contained.
void trim_front(position_in_partition pos) {
position_in_partition::less_compare less(_schema);
do {
if (!less(_current_start, pos)) {
break;
}
if (less(pos, _current_end)) {
_trim = std::move(pos);
_current_start = *_trim;
_in_current = false;
++_change_counter;
break;
}
} while (advance_to_next_range());
}
// Returns true if given position is contained.
// Must be called with monotonic positions.
// Idempotent.
bool advance_to(position_in_partition_view pos) {
position_in_partition::less_compare less(_schema);
do {
if (!_in_current && less(pos, _current_start)) {
break;
}
// All subsequent clustering keys are larger than the start of this
// range so there is no need to check that again.
_in_current = true;
if (less(pos, _current_end)) {
return true;
}
} while (advance_to_next_range());
return false;
}
// Returns true if the range expressed by start and end (as in position_range) overlaps
// with clustering ranges.
// Must be called with monotonic start position. That position must also be greater than
// the last position passed to the other advance_to() overload.
// Idempotent.
bool advance_to(position_in_partition_view start, position_in_partition_view end) {
position_in_partition::less_compare less(_schema);
do {
if (!less(_current_start, end)) {
break;
}
if (less(start, _current_end)) {
return true;
}
} while (advance_to_next_range());
return false;
}
// Returns true if the range tombstone expressed by start and end (as in position_range) overlaps
// with clustering ranges.
// No monotonicity restrictions on argument values across calls.
// Does not affect lower_bound().
// Idempotent.
bool contains_tombstone(position_in_partition_view start, position_in_partition_view end) const {
position_in_partition::less_compare less(_schema);
if (_trim && !less(*_trim, end)) {
return false;
}
auto i = _current;
while (i != _end) {
auto range_start = position_in_partition_view::for_range_start(*i);
if (!less(range_start, end)) {
return false;
}
auto range_end = position_in_partition_view::for_range_end(*i);
if (less(start, range_end)) {
return true;
}
++i;
}
return false;
}
// Returns true if advanced past all contained positions. Any later advance_to() until reset() will return false.
bool out_of_range() const {
return !_in_current && _current == _end;
}
// Resets the state of the walker so that advance_to() can be now called for new sequence of positions.
// Any range trimmings still hold after this.
void reset() {
auto trim = std::move(_trim);
auto ctr = _change_counter;
*this = clustering_ranges_walker(_schema, _ranges, _with_static_row);
_change_counter = ctr + 1;
if (trim) {
trim_front(std::move(*trim));
}
}
// Can be called only when !out_of_range()
position_in_partition_view lower_bound() const {
return _current_start;
}
// When lower_bound() changes, this also does
// Always > 0.
size_t lower_bound_change_counter() const {
return _change_counter;
}
};

3
coding-style.md Normal file
View File

@@ -0,0 +1,3 @@
# Scylla Coding Style
Please see the [Seastar style document](https://github.com/scylladb/seastar/blob/master/coding-style.md).

View File

@@ -21,6 +21,9 @@
#pragma once
#include "sstables/shared_sstable.hh"
#include "exceptions/exceptions.hh"
class column_family;
class schema;
using schema_ptr = lw_shared_ptr<const schema>;
@@ -33,12 +36,14 @@ enum class compaction_strategy_type {
size_tiered,
leveled,
date_tiered,
time_window,
};
class compaction_strategy_impl;
class sstable;
class sstable_set;
struct compaction_descriptor;
struct resharding_descriptor;
class compaction_strategy {
::shared_ptr<compaction_strategy_impl> _compaction_strategy_impl;
@@ -52,7 +57,13 @@ public:
compaction_strategy& operator=(compaction_strategy&&);
// Return a list of sstables to be compacted after applying the strategy.
compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<lw_shared_ptr<sstable>> candidates);
compaction_descriptor get_sstables_for_compaction(column_family& cfs, std::vector<shared_sstable> candidates);
std::vector<resharding_descriptor> get_resharding_jobs(column_family& cf, std::vector<shared_sstable> candidates);
// Some strategies may look at the compacted and resulting sstables to
// get some useful information for subsequent compactions.
void notify_completion(const std::vector<shared_sstable>& removed, const std::vector<shared_sstable>& added);
// Return if parallel compaction is allowed by strategy.
bool parallel_compaction() const;
@@ -75,6 +86,8 @@ public:
return "LeveledCompactionStrategy";
case compaction_strategy_type::date_tiered:
return "DateTieredCompactionStrategy";
case compaction_strategy_type::time_window:
return "TimeWindowCompactionStrategy";
default:
throw std::runtime_error("Invalid Compaction Strategy");
}
@@ -93,6 +106,8 @@ public:
return compaction_strategy_type::leveled;
} else if (short_name == "DateTieredCompactionStrategy") {
return compaction_strategy_type::date_tiered;
} else if (short_name == "TimeWindowCompactionStrategy") {
return compaction_strategy_type::time_window;
} else {
throw exceptions::configuration_exception(sprint("Unable to find compaction strategy class '%s'", name));
}

View File

@@ -39,6 +39,9 @@ public:
compatible_ring_position(const schema& s, dht::ring_position&& rp)
: _schema(&s), _rp(std::move(rp)) {
}
const dht::token& token() const {
return _rp->token();
}
friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
return x._rp->tri_compare(*x._schema, *y._rp);
}

View File

@@ -22,7 +22,7 @@
#pragma once
#include "types.hh"
#include <iostream>
#include <iosfwd>
#include <algorithm>
#include <vector>
#include <boost/range/iterator_range.hpp>
@@ -130,10 +130,10 @@ public:
bytes decompose_value(const value_type& values) {
return serialize_value(values);
}
class iterator : public std::iterator<std::input_iterator_tag, bytes_view> {
class iterator : public std::iterator<std::input_iterator_tag, const bytes_view> {
private:
bytes_view _v;
value_type _current;
bytes_view _current;
private:
void read_current() {
size_type len;
@@ -220,6 +220,9 @@ public:
assert(AllowPrefixes == allow_prefixes::yes);
return std::distance(begin(v), end(v)) == (ssize_t)_types.size();
}
bool is_empty(bytes_view v) const {
return begin(v) == end(v);
}
void validate(bytes_view v) {
// FIXME: implement
warn(unimplemented::cause::VALIDATION);

View File

@@ -184,6 +184,8 @@ bytes to_legacy(CompoundType& type, bytes_view packed) {
return legacy_form;
}
class composite_view;
// Represents a value serialized according to Origin's CompositeType.
// If is_compound is true, then the value is one or more components encoded as:
//
@@ -202,7 +204,7 @@ public:
, _is_compound(is_compound)
{ }
composite(bytes&& b)
explicit composite(bytes&& b)
: _bytes(std::move(b))
, _is_compound(true)
{ }
@@ -239,7 +241,7 @@ public:
using component_view = std::pair<bytes_view, eoc>;
private:
template<typename Value, typename = std::enable_if_t<!std::is_same<const data_value, std::decay_t<Value>>::value>>
static size_t size(Value& val) {
static size_t size(const Value& val) {
return val.size();
}
static size_t size(const data_value& val) {
@@ -304,23 +306,36 @@ public:
return f(const_cast<bytes&>(_bytes));
}
// marker is ignored if !is_compound
template<typename RangeOfSerializedComponents>
static bytes serialize_value(RangeOfSerializedComponents&& values, bool is_compound = true) {
static composite serialize_value(RangeOfSerializedComponents&& values, bool is_compound = true, eoc marker = eoc::none) {
auto size = serialized_size(values, is_compound);
bytes b(bytes::initialized_later(), size);
auto i = b.begin();
serialize_value(std::forward<decltype(values)>(values), i, is_compound);
return b;
if (is_compound && !b.empty()) {
b.back() = eoc_type(marker);
}
return composite(std::move(b), is_compound);
}
template<typename RangeOfSerializedComponents>
static composite serialize_static(const schema& s, RangeOfSerializedComponents&& values) {
// FIXME: Optimize
auto b = bytes(size_t(2), bytes::value_type(0xff));
std::vector<bytes_view> sv(s.clustering_key_size());
b += composite::serialize_value(boost::range::join(sv, std::forward<RangeOfSerializedComponents>(values)), true).release_bytes();
return composite(std::move(b));
}
static eoc to_eoc(int8_t eoc_byte) {
return eoc_byte == 0 ? eoc::none : (eoc_byte < 0 ? eoc::start : eoc::end);
}
class iterator : public std::iterator<std::input_iterator_tag, const component_view> {
bytes_view _v;
component_view _current;
private:
eoc to_eoc(int8_t eoc_byte) {
return eoc_byte == 0 ? eoc::none : (eoc_byte < 0 ? eoc::start : eoc::end);
}
void read_current() {
size_type len;
{
@@ -406,6 +421,10 @@ public:
return _bytes;
}
bytes release_bytes() && {
return std::move(_bytes);
}
size_t size() const {
return _bytes.size();
}
@@ -426,26 +445,20 @@ public:
return _is_compound;
}
// The following factory functions assume this composite is a compound value.
template <typename ClusteringElement>
static composite from_clustering_element(const schema& s, const ClusteringElement& ce) {
return serialize_value(ce.components(s));
return serialize_value(ce.components(s), s.is_compound());
}
static composite from_exploded(const std::vector<bytes_view>& v, eoc marker = eoc::none) {
static composite from_exploded(const std::vector<bytes_view>& v, bool is_compound, eoc marker = eoc::none) {
if (v.size() == 0) {
return bytes(size_t(1), bytes::value_type(marker));
return composite(bytes(size_t(1), bytes::value_type(marker)), is_compound);
}
auto b = serialize_value(v);
b.back() = eoc_type(marker);
return composite(std::move(b));
return serialize_value(v, is_compound, marker);
}
static composite static_prefix(const schema& s) {
static bytes static_marker(size_t(2), bytes::value_type(0xff));
std::vector<bytes_view> sv(s.clustering_key_size());
return static_marker + serialize_value(sv);
return serialize_static(s, std::vector<bytes_view>());
}
explicit operator bytes_view() const {
@@ -456,6 +469,15 @@ public:
friend inline std::ostream& operator<<(std::ostream& os, const std::pair<Component, eoc>& c) {
return os << "{value=" << c.first << "; eoc=" << sprint("0x%02x", eoc_type(c.second) & 0xff) << "}";
}
friend std::ostream& operator<<(std::ostream& os, const composite& v);
struct tri_compare {
const std::vector<data_type>& _types;
tri_compare(const std::vector<data_type>& types) : _types(types) {}
int operator()(const composite&, const composite&) const;
int operator()(composite_view, composite_view) const;
};
};
class composite_view final {
@@ -476,14 +498,15 @@ public:
, _is_compound(true)
{ }
std::vector<bytes> explode() const {
std::vector<bytes_view> explode() const {
if (!_is_compound) {
return { to_bytes(_bytes) };
return { _bytes };
}
std::vector<bytes> ret;
std::vector<bytes_view> ret;
ret.reserve(8);
for (auto it = begin(), e = end(); it != e; ) {
ret.push_back(to_bytes(it->first));
ret.push_back(it->first);
auto marker = it->second;
++it;
if (it != e && marker != composite::eoc::none) {
@@ -505,6 +528,15 @@ public:
return { begin(), end() };
}
composite::eoc last_eoc() const {
if (!_is_compound || _bytes.empty()) {
return composite::eoc::none;
}
bytes_view v(_bytes);
v.remove_prefix(v.size() - 1);
return composite::to_eoc(read_simple<composite::eoc_type>(v));
}
auto values() const {
return components() | boost::adaptors::transformed([](auto&& c) { return c.first; });
}
@@ -527,4 +559,46 @@ public:
bool operator==(const composite_view& k) const { return k._bytes == _bytes && k._is_compound == _is_compound; }
bool operator!=(const composite_view& k) const { return !(k == *this); }
friend inline std::ostream& operator<<(std::ostream& os, composite_view v) {
return os << "{" << ::join(", ", v.components()) << ", compound=" << v._is_compound << ", static=" << v.is_static() << "}";
}
};
inline
std::ostream& operator<<(std::ostream& os, const composite& v) {
return os << composite_view(v);
}
inline
int composite::tri_compare::operator()(const composite& v1, const composite& v2) const {
return (*this)(composite_view(v1), composite_view(v2));
}
inline
int composite::tri_compare::operator()(composite_view v1, composite_view v2) const {
// See org.apache.cassandra.db.composites.AbstractCType#compare
if (v1.empty()) {
return v2.empty() ? 0 : -1;
}
if (v2.empty()) {
return 1;
}
if (v1.is_static() != v2.is_static()) {
return v1.is_static() ? -1 : 1;
}
auto a_values = v1.components();
auto b_values = v2.components();
auto cmp = [&](const data_type& t, component_view c1, component_view c2) {
// First by value, then by EOC
auto r = t->compare(c1.first, c2.first);
if (r) {
return r;
}
return static_cast<int>(c1.second) - static_cast<int>(c2.second);
};
return lexicographical_tri_compare(_types.begin(), _types.end(),
a_values.begin(), a_values.end(),
b_values.begin(), b_values.end(),
cmp);
}

View File

@@ -39,17 +39,17 @@ public:
static constexpr auto CHUNK_LENGTH_KB = "chunk_length_kb";
static constexpr auto CRC_CHECK_CHANCE = "crc_check_chance";
private:
compressor _compressor = compressor::none;
compressor _compressor;
std::experimental::optional<int> _chunk_length;
std::experimental::optional<double> _crc_check_chance;
public:
compression_parameters() = default;
compression_parameters(compressor c) : _compressor(c) { }
compression_parameters(compressor c = compressor::lz4) : _compressor(c) { }
compression_parameters(const std::map<sstring, sstring>& options) {
validate_options(options);
auto it = options.find(SSTABLE_COMPRESSION);
if (it == options.end() || it->second.empty()) {
_compressor = compressor::none;
return;
}
const auto& compressor_class = it->second;

View File

@@ -12,7 +12,9 @@
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'Test Cluster'
# It is recommended to change the default value when creating a new cluster.
# You can NOT modify this value for an existing cluster
#cluster_name: 'Test Cluster'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
@@ -85,10 +87,26 @@ listen_address: localhost
# Leaving this blank will set it to the same value as listen_address
# broadcast_address: 1.2.3.4
# When using multiple physical network interfaces, set this to true to listen on broadcast_address
# in addition to the listen_address, allowing nodes to communicate in both interfaces.
# Ignore this property if the network configuration automatically routes between the public and private networks such as EC2.
#
# listen_on_broadcast_address: false
# port for the CQL native transport to listen for clients on
# For security reasons, you should not expose this port to the internet. Firewall it if needed.
native_transport_port: 9042
# Enabling native transport encryption in client_encryption_options allows you to either use
# encryption for the standard port or to use a dedicated, additional port along with the unencrypted
# standard native_transport_port.
# Enabling client encryption and keeping native_transport_port_ssl disabled will use encryption
# for native_transport_port. Setting native_transport_port_ssl to a different value
# from native_transport_port will use encryption for native_transport_port_ssl while
# keeping native_transport_port unencrypted.
#native_transport_port_ssl: 9142
# Throttles all outbound streaming file transfers on this node to the
# given total throughput in Mbps. This is necessary because Scylla does
# mostly sequential IO when streaming data during bootstrap or repair, which
@@ -192,6 +210,9 @@ api_address: 127.0.0.1
# Caution should be taken on increasing the size of this threshold as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
# Fail any multiple-partition batch exceeding this value. 50kb (10x warn threshold) by default.
batch_size_fail_threshold_in_kb: 50
# Authentication backend, identifying users
# Out of the box, Scylla provides org.apache.cassandra.auth.{AllowAllAuthenticator,
# PasswordAuthenticator}.
@@ -217,6 +238,15 @@ batch_size_warn_threshold_in_kb: 5
# that do not have vnodes enabled.
# initial_token:
# RPC address to broadcast to drivers and other Scylla nodes. This cannot
# be set to 0.0.0.0. If left blank, this will be set to the value of
# rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address must
# be set.
# broadcast_rpc_address: 1.2.3.4
# Uncomment to enable experimental features
# experimental: true
###################################################
## Not currently supported, reserved for future use
###################################################
@@ -249,17 +279,17 @@ batch_size_warn_threshold_in_kb: 5
# Validity period for permissions cache (fetching permissions can be an
# expensive operation depending on the authorizer, CassandraAuthorizer is
# one example). Defaults to 2000, set to 0 to disable.
# one example). Defaults to 10000, set to 0 to disable.
# Will be disabled automatically for AllowAllAuthorizer.
# permissions_validity_in_ms: 2000
# permissions_validity_in_ms: 10000
# Refresh interval for permissions cache (if enabled).
# After this interval, cache entries become eligible for refresh. Upon next
# access, an async reload is scheduled and the old value returned until it
# completes. If permissions_validity_in_ms is non-zero, then this must be
# also.
# Defaults to the same value as permissions_validity_in_ms.
# permissions_update_interval_in_ms: 1000
# completes. If permissions_validity_in_ms is non-zero, then this also must have
# a non-zero value. Defaults to 2000. It's recommended to set this value to
# be at least 3 times smaller than the permissions_validity_in_ms.
# permissions_update_interval_in_ms: 2000
# The partitioner is responsible for distributing groups of rows (by
# partition key) across nodes in the cluster. You should leave this
@@ -273,28 +303,6 @@ batch_size_warn_threshold_in_kb: 5
#
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# policy for data disk failures:
# die: shut down gossip and Thrift and kill the JVM for any fs errors or
# single-sstable errors, so the node can be replaced.
# stop_paranoid: shut down gossip and Thrift even for single-sstable errors.
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
# can still be inspected via JMX.
# best_effort: stop using the failed disk and respond to requests based on
# remaining available sstables. This means you WILL see obsolete
# data at CL.ONE!
# ignore: ignore fatal errors and let requests fail, as in pre-1.2 Scylla
# disk_failure_policy: stop
# policy for commit disk failures:
# die: shut down gossip and Thrift and kill the JVM, so the node can be replaced.
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
# can still be inspected via JMX.
# stop_commit: shutdown the commit log, letting writes collect but
# continuing to service reads, as in pre-2.0.5 Scylla
# ignore: ignore fatal errors and let the batches fail
# commit_failure_policy: stop
# Maximum size of the key cache in memory.
#
# Each key cache hit saves 1 seek and each row cache hit saves 2 seeks at the
@@ -409,29 +417,6 @@ partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# the smaller of 1/4 of heap or 512MB.
# file_cache_size_in_mb: 512
# Total permitted memory to use for memtables. Scylla will stop
# accepting writes when the limit is exceeded until a flush completes,
# and will trigger a flush based on memtable_cleanup_threshold
# If omitted, Scylla will set both to 1/4 the size of the heap.
# memtable_heap_space_in_mb: 2048
# memtable_offheap_space_in_mb: 2048
# Ratio of occupied non-flushing memtable size to total permitted size
# that will trigger a flush of the largest memtable. Lager mct will
# mean larger flushes and hence less compaction, but also less concurrent
# flush activity which can make it difficult to keep your disks fed
# under heavy write load.
#
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1)
# memtable_cleanup_threshold: 0.11
# Specify the way Scylla allocates and manages memtable memory.
# Options are:
# heap_buffers: on heap nio buffers
# offheap_buffers: off heap (direct) nio buffers
# offheap_objects: native memory, eliminating nio buffer heap overhead
# memtable_allocation_type: heap_buffers
# Total space to use for commitlogs.
#
# If space gets above this value (it will round up to the next nearest
@@ -443,17 +428,6 @@ partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# available for Scylla.
commitlog_total_space_in_mb: -1
# This sets the amount of memtable flush writer threads. These will
# be blocked by disk io, and each one will hold a memtable in memory
# while blocked.
#
# memtable_flush_writers defaults to the smaller of (number of disks,
# number of cores), with a minimum of 2 and a maximum of 8.
#
# If your data directories are backed by SSD, you should increase this
# to the number of cores.
#memtable_flush_writers: 8
# A fixed memory pool size in MB for for SSTable index summaries. If left
# empty, this will default to 5% of the heap size. If the memory usage of
# all index summaries exceeds this limit, SSTables with low read rates will
@@ -518,13 +492,6 @@ commitlog_total_space_in_mb: -1
# Whether to start the thrift rpc server.
# start_rpc: true
# RPC address to broadcast to drivers and other Scylla nodes. This cannot
# be set to 0.0.0.0. If left blank, this will be set to the value of
# rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address must
# be set.
# broadcast_rpc_address: 1.2.3.4
# enable or disable keepalive on rpc/native connections
# rpc_keepalive: true
@@ -762,22 +729,17 @@ commitlog_total_space_in_mb: -1
# certificate: conf/scylla.crt
# keyfile: conf/scylla.key
# truststore: <none, use system trust>
# require_client_auth: False
# priority_string: <none, use default>
# enable or disable client/server encryption.
# client_encryption_options:
# enabled: false
# certificate: conf/scylla.crt
# keyfile: conf/scylla.key
# require_client_auth: false
# Set trustore and truststore_password if require_client_auth is true
# truststore: conf/.truststore
# truststore_password: cassandra
# More advanced defaults below:
# protocol: TLS
# algorithm: SunX509
# store_type: JKS
# cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA]
# truststore: <none, use system trust>
# require_client_auth: False
# priority_string: <none, use default>
# internode_compression controls whether traffic between nodes is
# compressed.
@@ -823,3 +785,23 @@ commitlog_total_space_in_mb: -1
# By default, Scylla binds all interfaces to the prometheus API
# It is possible to restrict the listening address to a specific one
# prometheus_address: 0.0.0.0
# Distribution of data among cores (shards) within a node
#
# Scylla distributes data within a node among shards, using a round-robin
# strategy:
# [shard0] [shard1] ... [shardN-1] [shard0] [shard1] ... [shardN-1] ...
#
# Scylla versions 1.6 and below used just one repetition of the pattern;
# this intefered with data placement among nodes (vnodes).
#
# Scylla versions 1.7 and above use 4096 repetitions of the pattern; this
# provides for better data distribution.
#
# the value below is log (base 2) of the number of repetitions.
#
# Set to 0 to avoid rewriting all data when upgrading from Scylla 1.6 and
# below.
#
# Keep at 12 for new clusters.
murmur3_partitioner_ignore_msb_bits: 12

View File

@@ -86,14 +86,14 @@ def try_compile(compiler, source = '', flags = []):
with tempfile.NamedTemporaryFile() as sfile:
sfile.file.write(bytes(source, 'utf-8'))
sfile.file.flush()
return subprocess.call([compiler, '-x', 'c++', '-o', '/dev/null', '-c', sfile.name] + flags,
return subprocess.call([compiler, '-x', 'c++', '-o', '/dev/null', '-c', sfile.name] + args.user_cflags.split() + flags,
stdout = subprocess.DEVNULL,
stderr = subprocess.DEVNULL) == 0
def warning_supported(warning, compiler):
# gcc ignores -Wno-x even if it is not supported
adjusted = re.sub('^-Wno-', '-W', warning)
return try_compile(flags = [adjusted], compiler = compiler)
return try_compile(flags = ['-Werror', adjusted], compiler = compiler)
def debug_flag(compiler):
src_with_auto = textwrap.dedent('''\
@@ -108,6 +108,11 @@ def debug_flag(compiler):
print('Note: debug information disabled; upgrade your compiler')
return ''
def maybe_static(flag, libs):
if flag and not args.static:
libs = '-Wl,-Bstatic {} -Wl,-Bdynamic'.format(libs)
return libs
class Thrift(object):
def __init__(self, source, service):
self.source = source
@@ -162,7 +167,9 @@ modes = {
scylla_tests = [
'tests/mutation_test',
'tests/mvcc_test',
'tests/streamed_mutation_test',
'tests/flat_mutation_reader_test',
'tests/schema_registry_test',
'tests/canonical_mutation_test',
'tests/range_test',
@@ -170,6 +177,8 @@ scylla_tests = [
'tests/keys_test',
'tests/partitioner_test',
'tests/frozen_mutation_test',
'tests/serialized_action_test',
'tests/clustering_ranges_walker_test',
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
'tests/lsa_sync_eviction_test',
@@ -178,18 +187,22 @@ scylla_tests = [
'tests/perf/perf_hash',
'tests/perf/perf_cql_parser',
'tests/perf/perf_simple_query',
'tests/perf/perf_fast_forward',
'tests/perf/perf_cache_eviction',
'tests/cache_flat_mutation_reader_test',
'tests/row_cache_stress_test',
'tests/memory_footprint',
'tests/perf/perf_sstable',
'tests/cql_query_test',
'tests/storage_proxy_test',
'tests/schema_change_test',
'tests/mutation_reader_test',
'tests/key_reader_test',
'tests/mutation_query_test',
'tests/row_cache_test',
'tests/test-serialization',
'tests/sstable_test',
'tests/sstable_mutation_test',
'tests/sstable_resharding_test',
'tests/memtable_test',
'tests/commitlog_test',
'tests/cartesian_product_test',
@@ -211,6 +224,7 @@ scylla_tests = [
'tests/murmur_hash_test',
'tests/allocation_strategy_test',
'tests/logalloc_test',
'tests/log_heap_test',
'tests/managed_vector_test',
'tests/crc_test',
'tests/flush_queue_test',
@@ -222,6 +236,20 @@ scylla_tests = [
'tests/database_test',
'tests/nonwrapping_range_test',
'tests/input_stream_test',
'tests/sstable_atomic_deletion_test',
'tests/virtual_reader_test',
'tests/view_schema_test',
'tests/counter_test',
'tests/cell_locker_test',
'tests/streaming_histogram_test',
'tests/duration_test',
'tests/vint_serialization_test',
'tests/compress_test',
'tests/chunked_vector_test',
'tests/loading_cache_test',
'tests/castas_fcts_test',
'tests/big_decimal_test',
'tests/aggregate_fcts_test',
]
apps = [
@@ -252,6 +280,8 @@ arg_parser.add_argument('--ldflags', action = 'store', dest = 'user_ldflags', de
help = 'Extra flags for the linker')
arg_parser.add_argument('--compiler', action = 'store', dest = 'cxx', default = 'g++',
help = 'C++ compiler path')
arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='gcc',
help='C compiler path')
arg_parser.add_argument('--with-osv', action = 'store', dest = 'with_osv', default = '',
help = 'Shortcut for compile for OSv')
arg_parser.add_argument('--enable-dpdk', action = 'store_true', dest = 'dpdk', default = False,
@@ -263,13 +293,19 @@ arg_parser.add_argument('--debuginfo', action = 'store', dest = 'debuginfo', typ
arg_parser.add_argument('--static-stdc++', dest = 'staticcxx', action = 'store_true',
help = 'Link libgcc and libstdc++ statically')
arg_parser.add_argument('--static-thrift', dest = 'staticthrift', action = 'store_true',
help = 'Link libthrift statically')
help = 'Link libthrift statically')
arg_parser.add_argument('--static-boost', dest = 'staticboost', action = 'store_true',
help = 'Link boost statically')
arg_parser.add_argument('--tests-debuginfo', action = 'store', dest = 'tests_debuginfo', type = int, default = 0,
help = 'Enable(1)/disable(0)compiler debug information generation for tests')
arg_parser.add_argument('--python', action = 'store', dest = 'python', default = 'python3',
help = 'Python3 path')
add_tristate(arg_parser, name = 'hwloc', dest = 'hwloc', help = 'hwloc support')
add_tristate(arg_parser, name = 'xen', dest = 'xen', help = 'Xen support')
arg_parser.add_argument('--enable-gcc6-concepts', dest='gcc6_concepts', action='store_true', default=False,
help='enable experimental support for C++ Concepts as implemented in GCC 6')
arg_parser.add_argument('--enable-alloc-failure-injector', dest='alloc_failure_injector', action='store_true', default=False,
help='enable allocation failure injection')
args = arg_parser.parse_args()
defines = []
@@ -292,23 +328,26 @@ scylla_core = (['database.cc',
'memtable.cc',
'schema_mutations.cc',
'release.cc',
'supervisor.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'mutation_partition.cc',
'mutation_partition_view.cc',
'mutation_partition_serializer.cc',
'mutation_reader.cc',
'flat_mutation_reader.cc',
'mutation_query.cc',
'key_reader.cc',
'keys.cc',
'counters.cc',
'sstables/sstables.cc',
'sstables/compress.cc',
'sstables/row.cc',
'sstables/partition.cc',
'sstables/filter.cc',
'sstables/compaction.cc',
'sstables/compaction_strategy.cc',
'sstables/compaction_manager.cc',
'sstables/atomic_deletion.cc',
'sstables/integrity_checked_file_impl.cc',
'transport/event.cc',
'transport/event_notifier.cc',
'transport/server.cc',
@@ -323,15 +362,19 @@ scylla_core = (['database.cc',
'cql3/sets.cc',
'cql3/maps.cc',
'cql3/functions/functions.cc',
'cql3/functions/castas_fcts.cc',
'cql3/statements/cf_prop_defs.cc',
'cql3/statements/cf_statement.cc',
'cql3/statements/authentication_statement.cc',
'cql3/statements/create_keyspace_statement.cc',
'cql3/statements/create_table_statement.cc',
'cql3/statements/create_view_statement.cc',
'cql3/statements/create_type_statement.cc',
'cql3/statements/create_user_statement.cc',
'cql3/statements/drop_index_statement.cc',
'cql3/statements/drop_keyspace_statement.cc',
'cql3/statements/drop_table_statement.cc',
'cql3/statements/drop_view_statement.cc',
'cql3/statements/drop_type_statement.cc',
'cql3/statements/schema_altering_statement.cc',
'cql3/statements/ks_prop_defs.cc',
@@ -348,6 +391,7 @@ scylla_core = (['database.cc',
'cql3/statements/create_index_statement.cc',
'cql3/statements/truncate_statement.cc',
'cql3/statements/alter_table_statement.cc',
'cql3/statements/alter_view_statement.cc',
'cql3/statements/alter_user_statement.cc',
'cql3/statements/drop_user_statement.cc',
'cql3/statements/list_users_statement.cc',
@@ -393,16 +437,22 @@ scylla_core = (['database.cc',
'cql3/selection/selector.cc',
'cql3/restrictions/statement_restrictions.cc',
'cql3/result_set.cc',
'cql3/variable_specifications.cc',
'db/consistency_level.cc',
'db/system_keyspace.cc',
'db/schema_tables.cc',
'db/cql_type_parser.cc',
'db/legacy_schema_migrator.cc',
'db/commitlog/commitlog.cc',
'db/commitlog/commitlog_replayer.cc',
'db/commitlog/commitlog_entry.cc',
'db/config.cc',
'db/heat_load_balance.cc',
'db/index/secondary_index.cc',
'db/marshal/type_parser.cc',
'db/batchlog_manager.cc',
'db/view/view.cc',
'index/secondary_index_manager.cc',
'io/io.cc',
'utils/utils.cc',
'utils/UUID_gen.cc',
@@ -414,6 +464,7 @@ scylla_core = (['database.cc',
'utils/dynamic_bitset.cc',
'utils/managed_bytes.cc',
'utils/exceptions.cc',
'utils/config_file.cc',
'gms/version_generator.cc',
'gms/versioned_value.cc',
'gms/gossiper.cc',
@@ -423,6 +474,7 @@ scylla_core = (['database.cc',
'gms/gossip_digest_ack2.cc',
'gms/endpoint_state.cc',
'gms/application_state.cc',
'gms/inet_address.cc',
'dht/i_partitioner.cc',
'dht/murmur3_partitioner.cc',
'dht/byte_ordered_partitioner.cc',
@@ -450,7 +502,7 @@ scylla_core = (['database.cc',
'service/client_state.cc',
'service/migration_task.cc',
'service/storage_service.cc',
'service/load_broadcaster.cc',
'service/misc_services.cc',
'service/pager/paging_state.cc',
'service/pager/query_pagers.cc',
'streaming/stream_task.cc',
@@ -466,26 +518,33 @@ scylla_core = (['database.cc',
'streaming/stream_manager.cc',
'streaming/stream_result_future.cc',
'streaming/stream_session_state.cc',
'gc_clock.cc',
'clocks-impl.cc',
'partition_slice_builder.cc',
'init.cc',
'lister.cc',
'repair/repair.cc',
'exceptions/exceptions.cc',
'dns.cc',
'auth/auth.cc',
'auth/allow_all_authenticator.cc',
'auth/allow_all_authorizer.cc',
'auth/authenticated_user.cc',
'auth/authenticator.cc',
'auth/authorizer.cc',
'auth/common.cc',
'auth/default_authorizer.cc',
'auth/data_resource.cc',
'auth/password_authenticator.cc',
'auth/permission.cc',
'auth/permissions_cache.cc',
'auth/service.cc',
'auth/transitional.cc',
'tracing/tracing.cc',
'tracing/trace_keyspace_helper.cc',
'tracing/trace_state.cc',
'table_helper.cc',
'range_tombstone.cc',
'range_tombstone_list.cc',
'db/size_estimates_recorder.cc'
'disk-error-handler.cc',
'duration.cc',
'vint-serialization.cc',
]
+ [Antlr3Grammar('cql3/Cql.g')]
+ [Thrift('interface/cassandra.thrift', 'Cassandra')]
@@ -546,6 +605,8 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/idl_test.idl.hh',
'idl/commitlog.idl.hh',
'idl/tracing.idl.hh',
'idl/consistency_level.idl.hh',
'idl/cache_temperature.idl.hh',
]
scylla_tests_dependencies = scylla_core + api + idls + [
@@ -564,63 +625,92 @@ deps = {
'scylla': idls + ['main.cc'] + scylla_core + api,
}
tests_not_using_seastar_test_framework = set([
'tests/keys_test',
pure_boost_tests = set([
'tests/partitioner_test',
'tests/map_difference_test',
'tests/keys_test',
'tests/compound_test',
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
'tests/nonwrapping_range_test',
'tests/test-serialization',
'tests/range_test',
'tests/crc_test',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/cartesian_product_test',
'tests/streaming_histogram_test',
'tests/duration_test',
'tests/vint_serialization_test',
'tests/compress_test',
'tests/chunked_vector_test',
'tests/big_decimal_test',
])
tests_not_using_seastar_test_framework = set([
'tests/perf/perf_mutation',
'tests/lsa_async_eviction_test',
'tests/lsa_sync_eviction_test',
'tests/row_cache_alloc_stress',
'tests/perf_row_cache_update',
'tests/cartesian_product_test',
'tests/perf/perf_hash',
'tests/perf/perf_cql_parser',
'tests/message',
'tests/perf/perf_simple_query',
'tests/perf/perf_fast_forward',
'tests/perf/perf_cache_eviction',
'tests/row_cache_stress_test',
'tests/memory_footprint',
'tests/test-serialization',
'tests/gossip',
'tests/compound_test',
'tests/range_test',
'tests/crc_test',
'tests/perf/perf_sstable',
'tests/managed_vector_test',
'tests/dynamic_bitset_test',
'tests/idl_test',
'tests/range_tombstone_list_test',
'tests/anchorless_list_test',
'tests/nonwrapping_range_test',
])
]) | pure_boost_tests
for t in tests_not_using_seastar_test_framework:
if not t in scylla_tests:
raise Exception("Test %s not found in scylla_tests" % (t))
for t in scylla_tests:
deps[t] = scylla_tests_dependencies + [t + '.cc']
deps[t] = [t + '.cc']
if t not in tests_not_using_seastar_test_framework:
deps[t] += scylla_tests_dependencies
deps[t] += scylla_tests_seastar_deps
else:
deps[t] += scylla_core + api + idls + ['tests/cql_test_env.cc']
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc']
deps['tests/sstable_test'] += ['tests/sstable_datafile_test.cc', 'tests/sstable_utils.cc']
deps['tests/mutation_reader_test'] += ['tests/sstable_utils.cc']
deps['tests/bytes_ostream_test'] = ['tests/bytes_ostream_test.cc']
deps['tests/bytes_ostream_test'] = ['tests/bytes_ostream_test.cc', 'utils/managed_bytes.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/input_stream_test'] = ['tests/input_stream_test.cc']
deps['tests/UUID_test'] = ['utils/UUID_gen.cc', 'tests/UUID_test.cc']
deps['tests/UUID_test'] = ['utils/UUID_gen.cc', 'tests/UUID_test.cc', 'utils/uuid.cc', 'utils/managed_bytes.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'tests/murmur_hash_test.cc']
deps['tests/allocation_strategy_test'] = ['tests/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['tests/log_heap_test'] = ['tests/log_heap_test.cc']
deps['tests/anchorless_list_test'] = ['tests/anchorless_list_test.cc']
warnings = [
'-Wno-mismatched-tags', # clang-only
'-Wno-maybe-uninitialized', # false positives on gcc 5
'-Wno-tautological-compare',
'-Wno-parentheses-equality',
'-Wno-c++11-narrowing',
'-Wno-c++1z-extensions',
'-Wno-sometimes-uninitialized',
'-Wno-return-stack-address',
'-Wno-missing-braces',
'-Wno-unused-lambda-capture',
'-Wno-misleading-indentation',
'-Wno-overflow',
'-Wno-noexcept-type',
'-Wno-nonnull-compare'
]
warnings = [w
for w in warnings
if warning_supported(warning = w, compiler = args.cxx)]
warnings = ' '.join(warnings)
warnings = ' '.join(warnings + ['-Wno-error=deprecated-declarations'])
dbgflag = debug_flag(args.cxx) if args.debuginfo else ''
tests_link_rule = 'link' if args.tests_debuginfo else 'link_stripped'
@@ -674,6 +764,9 @@ if not try_compile(compiler=args.cxx, source='''\
print('Installed boost version too old. Please update {}.'.format(pkgname("boost-devel")))
sys.exit(1)
has_sanitize_address_use_after_scope = try_compile(compiler=args.cxx, flags=['-fsanitize-address-use-after-scope'], source='int f() {}')
defines = ' '.join(['-D' + d for d in defines])
globals().update(vars(args))
@@ -696,7 +789,7 @@ scylla_release = file.read().strip()
extra_cxxflags["release.cc"] = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\""
seastar_flags = ['--disable-xen']
seastar_flags = []
if args.dpdk:
# fake dependencies on dpdk, so that it is built before anything else
seastar_flags += ['--enable-dpdk']
@@ -704,9 +797,16 @@ elif args.dpdk_target:
seastar_flags += ['--dpdk-target', args.dpdk_target]
if args.staticcxx:
seastar_flags += ['--static-stdc++']
if args.staticboost:
seastar_flags += ['--static-boost']
if args.gcc6_concepts:
seastar_flags += ['--enable-gcc6-concepts']
if args.alloc_failure_injector:
seastar_flags += ['--enable-alloc-failure-injector']
seastar_cflags = args.user_cflags + " -march=nehalem"
seastar_flags += ['--compiler', args.cxx, '--cflags=%s' % (seastar_cflags)]
seastar_ldflags = args.user_ldflags
seastar_flags += ['--compiler', args.cxx, '--c-compiler', args.cc, '--cflags=%s' % (seastar_cflags), '--ldflags=%s' %(seastar_ldflags)]
status = subprocess.call([python, './configure.py'] + seastar_flags, cwd = 'seastar')
@@ -737,7 +837,14 @@ for mode in build_modes:
seastar_deps = 'practically_anything_can_change_so_lets_run_it_every_time_and_restat.'
args.user_cflags += " " + pkg_config("--cflags", "jsoncpp")
libs = "-lyaml-cpp -llz4 -lz -lsnappy " + pkg_config("--libs", "jsoncpp") + ' -lboost_filesystem' + ' -lcrypt' + ' -lboost_date_time'
libs = ' '.join(['-lyaml-cpp', '-llz4', '-lz', '-lsnappy', pkg_config("--libs", "jsoncpp"),
maybe_static(args.staticboost, '-lboost_filesystem'), ' -lcrypt',
maybe_static(args.staticboost, '-lboost_date_time'),
])
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
for pkg in pkgs:
args.user_cflags += ' ' + pkg_config('--cflags', pkg)
libs += ' ' + pkg_config('--libs', pkg)
@@ -763,10 +870,12 @@ with open(buildfile, 'w') as f:
builddir = {outdir}
cxx = {cxx}
cxxflags = {user_cflags} {warnings} {defines}
ldflags = {user_ldflags}
ldflags = -fuse-ld=gold {user_ldflags}
libs = {libs}
pool link_pool
depth = {link_pool_depth}
pool seastar_pool
depth = 1
rule ragel
command = ragel -G2 -o $out $in
description = RAGEL $out
@@ -792,7 +901,7 @@ with open(buildfile, 'w') as f:
f.write(textwrap.dedent('''\
cxxflags_{mode} = -I. -I $builddir/{mode}/gen -I seastar -I seastar/build/{mode}/gen
rule cxx.{mode}
command = $cxx -MMD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} -c -o $out $in
command = $cxx -MD -MT $out -MF $out.d {seastar_cflags} $cxxflags $cxxflags_{mode} $obj_cxxflags -c -o $out $in
description = CXX $out
depfile = $out.d
rule link.{mode}
@@ -810,7 +919,17 @@ with open(buildfile, 'w') as f:
command = thrift -gen cpp:cob_style -out $builddir/{mode}/gen $in
description = THRIFT $in
rule antlr3.{mode}
command = sed -e '/^#if 0/,/^#endif/d' $in > $builddir/{mode}/gen/$in && antlr3 $builddir/{mode}/gen/$in && sed -i 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' build/{mode}/gen/${{stem}}Parser.cpp
# We replace many local `ExceptionBaseType* ex` variables with a single function-scope one.
# Because we add such a variable to every function, and because `ExceptionBaseType` is not a global
# name, we also add a global typedef to avoid compilation errors.
command = sed -e '/^#if 0/,/^#endif/d' $in > $builddir/{mode}/gen/$in $
&& antlr3 $builddir/{mode}/gen/$in $
&& sed -i -e 's/^\\( *\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$$/\\1const \\2/' $
-e '1i using ExceptionBaseType = int;' $
-e 's/^{{/{{ ExceptionBaseType\* ex = nullptr;/; $
s/ExceptionBaseType\* ex = new/ex = new/; $
s/exceptions::syntax_exception e/exceptions::syntax_exception\& e/' $
build/{mode}/gen/${{stem}}Parser.cpp
description = ANTLR3 $in
''').format(mode = mode, **modeval))
f.write('build {mode}: phony {artifacts}\n'.format(mode = mode,
@@ -835,22 +954,15 @@ with open(buildfile, 'w') as f:
objs += dep.objects('$builddir/' + mode + '/gen')
if isinstance(dep, Antlr3Grammar):
objs += dep.objects('$builddir/' + mode + '/gen')
if binary.endswith('.pc'):
vars = modeval.copy()
vars.update(globals())
pc = textwrap.dedent('''\
Name: Seastar
URL: http://seastar-project.org/
Description: Advanced C++ framework for high-performance server applications on modern hardware.
Version: 1.0
Libs: -L{srcdir}/{builddir} -Wl,--whole-archive -lseastar -Wl,--no-whole-archive {dbgflag} -Wl,--no-as-needed {static} {pie} -fvisibility=hidden -pthread {user_ldflags} {libs} {sanitize_libs}
Cflags: -std=gnu++1y {dbgflag} {fpie} -Wall -Werror -fvisibility=hidden -pthread -I{srcdir} -I{srcdir}/{builddir}/gen {user_cflags} {warnings} {defines} {sanitize} {opt}
''').format(builddir = 'build/' + mode, srcdir = os.getcwd(), **vars)
f.write('build $builddir/{}/{}: gen\n text = {}\n'.format(mode, binary, repr(pc)))
elif binary.endswith('.a'):
if binary.endswith('.a'):
f.write('build $builddir/{}/{}: ar.{} {}\n'.format(mode, binary, mode, str.join(' ', objs)))
else:
if binary.startswith('tests/'):
local_libs = '$libs'
if binary not in tests_not_using_seastar_test_framework or binary in pure_boost_tests:
local_libs += ' ' + maybe_static(args.staticboost, '-lboost_unit_test_framework')
if has_thrift:
local_libs += ' ' + thrift_libs + ' ' + maybe_static(args.staticboost, '-lboost_system')
# Our code's debugging information is huge, and multiplied
# by many tests yields ridiculous amounts of disk space.
# So we strip the tests by default; The user can very
@@ -858,15 +970,15 @@ with open(buildfile, 'w') as f:
# to the test name, e.g., "ninja build/release/testname_g"
f.write('build $builddir/{}/{}: {}.{} {} {}\n'.format(mode, binary, tests_link_rule, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
if has_thrift:
f.write(' libs = {} -lboost_system $libs\n'.format(thrift_libs))
f.write(' libs = {}\n'.format(local_libs))
f.write('build $builddir/{}/{}_g: link.{} {} {}\n'.format(mode, binary, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
f.write(' libs = {}\n'.format(local_libs))
else:
f.write('build $builddir/{}/{}: link.{} {} {}\n'.format(mode, binary, mode, str.join(' ', objs),
'seastar/build/{}/libseastar.a'.format(mode)))
if has_thrift:
f.write(' libs = {} -lboost_system $libs\n'.format(thrift_libs))
if has_thrift:
f.write(' libs = {} {} $libs\n'.format(thrift_libs, maybe_static(args.staticboost, '-lboost_system')))
for src in srcs:
if src.endswith('.cc'):
obj = '$builddir/' + mode + '/' + src.replace('.cc', '.o')
@@ -905,7 +1017,7 @@ with open(buildfile, 'w') as f:
f.write('build {}: ragel {}\n'.format(hh, src))
for hh in swaggers:
src = swaggers[hh]
f.write('build {}: swagger {}\n'.format(hh,src))
f.write('build {}: swagger {} | seastar/json/json2code.py\n'.format(hh,src))
for hh in serializers:
src = serializers[hh]
f.write('build {}: serializer {} | idl-compiler.py\n'.format(hh,src))
@@ -922,8 +1034,12 @@ with open(buildfile, 'w') as f:
for cc in grammar.sources('$builddir/{}/gen'.format(mode)):
obj = cc.replace('.cpp', '.o')
f.write('build {}: cxx.{} {} || {}\n'.format(obj, mode, cc, ' '.join(serializers)))
if cc.endswith('Parser.cpp') and has_sanitize_address_use_after_scope:
# Parsers end up using huge amounts of stack space and overflowing their stack
f.write(' obj_cxxflags = -fno-sanitize-address-use-after-scope\n')
f.write('build seastar/build/{mode}/libseastar.a seastar/build/{mode}/apps/iotune/iotune seastar/build/{mode}/gen/http/request_parser.hh seastar/build/{mode}/gen/http/http_response_parser.hh: ninja {seastar_deps}\n'
.format(**locals()))
f.write(' pool = seastar_pool\n')
f.write(' subdir = seastar\n')
f.write(' target = build/{mode}/libseastar.a build/{mode}/apps/iotune/iotune build/{mode}/gen/http/request_parser.hh build/{mode}/gen/http/http_response_parser.hh\n'.format(**locals()))
f.write(textwrap.dedent('''\

View File

@@ -22,6 +22,7 @@
#pragma once
#include "mutation_partition_view.hh"
#include "mutation_partition.hh"
#include "schema.hh"
// Mutation partition visitor which applies visited data into
@@ -37,12 +38,12 @@ private:
static bool is_compatible(const column_definition& new_def, const data_type& old_type, column_kind kind) {
return ::is_compatible(new_def.kind, kind) && new_def.type->is_value_compatible_with(*old_type);
}
void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
if (is_compatible(new_def, old_type, kind) && cell.timestamp() > new_def.dropped_at()) {
dst.apply(new_def, atomic_cell_or_collection(cell));
}
}
void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, collection_mutation_view cell) {
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, collection_mutation_view cell) {
if (!is_compatible(new_def, old_type, kind)) {
return;
}
@@ -94,8 +95,8 @@ public:
_p.apply_row_tombstone(_p_schema, rt);
}
virtual void accept_row(clustering_key_view key, tombstone deleted_at, const row_marker& rm) override {
deletable_row& r = _p.clustered_row(_p_schema, key);
virtual void accept_row(position_in_partition_view key, const row_tombstone& deleted_at, const row_marker& rm, is_dummy dummy, is_continuous continuous) override {
deletable_row& r = _p.clustered_row(_p_schema, key, dummy, continuous);
r.apply(rm);
r.apply(deleted_at);
_current_row = &r;
@@ -116,4 +117,14 @@ public:
accept_cell(_current_row->cells(), column_kind::regular_column, *def, col.type(), collection);
}
}
// Appends the cell to dst upgrading it to the new schema.
// Cells must have monotonic names.
static void append_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, const atomic_cell_or_collection& cell) {
if (new_def.is_atomic()) {
accept_cell(dst, kind, new_def, old_type, cell.as_atomic_cell());
} else {
accept_cell(dst, kind, new_def, old_type, cell.as_collection_mutation());
}
}
};

332
counters.cc Normal file
View File

@@ -0,0 +1,332 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "service/storage_service.hh"
#include "counters.hh"
#include "mutation.hh"
#include "combine.hh"
counter_id counter_id::local()
{
return counter_id(service::get_local_storage_service().get_local_id());
}
bool counter_id::less_compare_1_7_4::operator()(const counter_id& a, const counter_id& b) const
{
if (a._most_significant != b._most_significant) {
return a._most_significant < b._most_significant;
} else {
return a._least_significant < b._least_significant;
}
}
std::ostream& operator<<(std::ostream& os, const counter_id& id) {
return os << id.to_uuid();
}
std::ostream& operator<<(std::ostream& os, counter_shard_view csv) {
return os << "{global_shard id: " << csv.id() << " value: " << csv.value()
<< " clock: " << csv.logical_clock() << "}";
}
std::ostream& operator<<(std::ostream& os, counter_cell_view ccv) {
return os << "{counter_cell timestamp: " << ccv.timestamp() << " shards: {" << ::join(", ", ccv.shards()) << "}}";
}
void counter_cell_builder::do_sort_and_remove_duplicates()
{
boost::range::sort(_shards, [] (auto& a, auto& b) { return a.id() < b.id(); });
std::vector<counter_shard> new_shards;
new_shards.reserve(_shards.size());
for (auto& cs : _shards) {
if (new_shards.empty() || new_shards.back().id() != cs.id()) {
new_shards.emplace_back(cs);
} else {
new_shards.back().apply(cs);
}
}
_shards = std::move(new_shards);
_sorted = true;
}
std::vector<counter_shard> counter_cell_view::shards_compatible_with_1_7_4() const
{
auto sorted_shards = boost::copy_range<std::vector<counter_shard>>(shards());
counter_id::less_compare_1_7_4 cmp;
boost::range::sort(sorted_shards, [&] (auto& a, auto& b) {
return cmp(a.id(), b.id());
});
return sorted_shards;
}
static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
auto dst_it = dst_shards.begin();
auto src_it = src_shards.begin();
while (src_it != src_shards.end()) {
while (dst_it != dst_shards.end() && dst_it->id() < src_it->id()) {
++dst_it;
}
if (dst_it == dst_shards.end() || dst_it->id() != src_it->id()) {
// Fast-path failed. Revert and fall back to the slow path.
if (dst_it == dst_shards.end()) {
--dst_it;
}
while (src_it != src_shards.begin()) {
--src_it;
while (dst_it->id() != src_it->id()) {
--dst_it;
}
src_it->swap_value_and_clock(*dst_it);
}
return false;
}
if (dst_it->logical_clock() < src_it->logical_clock()) {
dst_it->swap_value_and_clock(*src_it);
} else {
src_it->set_value_and_clock(*dst_it);
}
++src_it;
}
auto dst_ts = dst_ccmv.timestamp();
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(std::max(dst_ts, src_ts));
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(true);
return true;
}
static void revert_in_place_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
assert(dst.can_use_mutable_view() && src.can_use_mutable_view());
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
auto dst_it = dst_shards.begin();
auto src_it = src_shards.begin();
while (src_it != src_shards.end()) {
while (dst_it != dst_shards.end() && dst_it->id() < src_it->id()) {
++dst_it;
}
assert(dst_it != dst_shards.end() && dst_it->id() == src_it->id());
dst_it->swap_value_and_clock(*src_it);
++src_it;
}
auto dst_ts = dst_ccmv.timestamp();
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(src_ts);
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
}
bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
auto dst_ac = dst.as_atomic_cell();
auto src_ac = src.as_atomic_cell();
if (!dst_ac.is_live() || !src_ac.is_live()) {
if (dst_ac.is_live() || (!src_ac.is_live() && compare_atomic_cell_for_merge(dst_ac, src_ac) < 0)) {
std::swap(dst, src);
return true;
}
return false;
}
if (dst_ac.is_counter_update() && src_ac.is_counter_update()) {
auto src_v = src_ac.counter_update_value();
auto dst_v = dst_ac.counter_update_value();
dst = atomic_cell::make_live_counter_update(std::max(dst_ac.timestamp(), src_ac.timestamp()),
src_v + dst_v);
return true;
}
assert(!dst_ac.is_counter_update());
assert(!src_ac.is_counter_update());
if (counter_cell_view(dst_ac).shard_count() >= counter_cell_view(src_ac).shard_count()
&& dst.can_use_mutable_view() && src.can_use_mutable_view()) {
if (apply_in_place(dst, src)) {
return true;
}
}
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
auto dst_shards = counter_cell_view(dst_ac).shards();
auto src_shards = counter_cell_view(src_ac).shards();
counter_cell_builder result;
combine(dst_shards.begin(), dst_shards.end(), src_shards.begin(), src_shards.end(),
result.inserter(), counter_shard_view::less_compare_by_id(), [] (auto& x, auto& y) {
return x.logical_clock() < y.logical_clock() ? y : x;
});
auto cell = result.build(std::max(dst_ac.timestamp(), src_ac.timestamp()));
src = std::exchange(dst, atomic_cell_or_collection(cell));
return true;
}
void counter_cell_view::revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
if (dst.as_atomic_cell().is_counter_update()) {
auto src_v = src.as_atomic_cell().counter_update_value();
auto dst_v = dst.as_atomic_cell().counter_update_value();
dst = atomic_cell::make_live(dst.as_atomic_cell().timestamp(),
long_type->decompose(dst_v - src_v));
} else if (src.as_atomic_cell().is_counter_in_place_revert_set()) {
revert_in_place_apply(dst, src);
} else {
std::swap(dst, src);
}
}
stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, atomic_cell_view b)
{
assert(!a.is_counter_update());
assert(!b.is_counter_update());
if (!b.is_live() || !a.is_live()) {
if (b.is_live() || (!a.is_live() && compare_atomic_cell_for_merge(b, a) < 0)) {
return atomic_cell(a);
}
return { };
}
auto a_shards = counter_cell_view(a).shards();
auto b_shards = counter_cell_view(b).shards();
auto a_it = a_shards.begin();
auto a_end = a_shards.end();
auto b_it = b_shards.begin();
auto b_end = b_shards.end();
counter_cell_builder result;
while (a_it != a_end) {
while (b_it != b_end && (*b_it).id() < (*a_it).id()) {
++b_it;
}
if (b_it == b_end || (*a_it).id() != (*b_it).id() || (*a_it).logical_clock() > (*b_it).logical_clock()) {
result.add_shard(counter_shard(*a_it));
}
++a_it;
}
stdx::optional<atomic_cell> diff;
if (!result.empty()) {
diff = result.build(std::max(a.timestamp(), b.timestamp()));
} else if (a.timestamp() > b.timestamp()) {
diff = atomic_cell::make_live(a.timestamp(), bytes_view());
}
return diff;
}
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset) {
// FIXME: allow current_state to be frozen_mutation
auto transform_new_row_to_shards = [clock_offset] (auto& cells) {
cells.for_each_cell([clock_offset] (auto, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
auto delta = acv.counter_update_value();
auto cs = counter_shard(counter_id::local(), delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
});
};
if (!current_state) {
transform_new_row_to_shards(m.partition().static_row());
for (auto& cr : m.partition().clustered_rows()) {
transform_new_row_to_shards(cr.row().cells());
}
return;
}
clustering_key::less_compare cmp(*m.schema());
auto transform_row_to_shards = [clock_offset] (auto& transformee, auto& state) {
std::deque<std::pair<column_id, counter_shard>> shards;
state.for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
counter_cell_view ccv(acv);
auto cs = ccv.local_shard();
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard(*cs)));
});
transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
while (!shards.empty() && shards.front().first < id) {
shards.pop_front();
}
auto delta = acv.counter_update_value();
if (shards.empty() || shards.front().first > id) {
auto cs = counter_shard(counter_id::local(), delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
} else {
auto& cs = shards.front().second;
cs.update(delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
shards.pop_front();
}
});
};
transform_row_to_shards(m.partition().static_row(), current_state->partition().static_row());
auto& cstate = current_state->partition();
auto it = cstate.clustered_rows().begin();
auto end = cstate.clustered_rows().end();
for (auto& cr : m.partition().clustered_rows()) {
while (it != end && cmp(it->key(), cr.key())) {
++it;
}
if (it == end || cmp(cr.key(), it->key())) {
transform_new_row_to_shards(cr.row().cells());
continue;
}
transform_row_to_shards(cr.row().cells(), it->row().cells());
}
}

435
counters.hh Normal file
View File

@@ -0,0 +1,435 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <boost/range/algorithm/find_if.hpp>
#include "atomic_cell_or_collection.hh"
#include "types.hh"
#include "stdx.hh"
class mutation;
class mutation;
class counter_id {
int64_t _least_significant;
int64_t _most_significant;
public:
static_assert(std::is_same<decltype(std::declval<utils::UUID>().get_least_significant_bits()), int64_t>::value
&& std::is_same<decltype(std::declval<utils::UUID>().get_most_significant_bits()), int64_t>::value,
"utils::UUID is expected to work with two signed 64-bit integers");
counter_id() = default;
explicit counter_id(utils::UUID uuid) noexcept
: _least_significant(uuid.get_least_significant_bits())
, _most_significant(uuid.get_most_significant_bits())
{ }
utils::UUID to_uuid() const {
return utils::UUID(_most_significant, _least_significant);
}
bool operator<(const counter_id& other) const {
return to_uuid() < other.to_uuid();
}
bool operator>(const counter_id& other) const {
return other.to_uuid() < to_uuid();
}
bool operator==(const counter_id& other) const {
return to_uuid() == other.to_uuid();
}
bool operator!=(const counter_id& other) const {
return !(*this == other);
}
public:
// (Wrong) Counter ID ordering used by Scylla 1.7.4 and earlier.
struct less_compare_1_7_4 {
bool operator()(const counter_id& a, const counter_id& b) const;
};
public:
static counter_id local();
// For tests.
static counter_id generate_random() {
return counter_id(utils::make_random_uuid());
}
};
static_assert(std::is_pod<counter_id>::value, "counter_id should be a POD type");
std::ostream& operator<<(std::ostream& os, const counter_id& id);
template<typename View>
class basic_counter_shard_view {
enum class offset : unsigned {
id = 0u,
value = unsigned(id) + sizeof(counter_id),
logical_clock = unsigned(value) + sizeof(int64_t),
total_size = unsigned(logical_clock) + sizeof(int64_t),
};
private:
typename View::pointer _base;
private:
template<typename T>
T read(offset off) const {
T value;
std::copy_n(_base + static_cast<unsigned>(off), sizeof(T), reinterpret_cast<signed char*>(&value));
return value;
}
public:
static constexpr auto size = size_t(offset::total_size);
public:
basic_counter_shard_view() = default;
explicit basic_counter_shard_view(typename View::pointer ptr) noexcept
: _base(ptr) { }
counter_id id() const { return read<counter_id>(offset::id); }
int64_t value() const { return read<int64_t>(offset::value); }
int64_t logical_clock() const { return read<int64_t>(offset::logical_clock); }
void swap_value_and_clock(basic_counter_shard_view& other) noexcept {
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
typename View::value_type tmp[size];
std::copy_n(_base + off, size, tmp);
std::copy_n(other._base + off, size, _base + off);
std::copy_n(tmp, size, other._base + off);
}
void set_value_and_clock(const basic_counter_shard_view& other) noexcept {
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
std::copy_n(other._base + off, size, _base + off);
}
bool operator==(const basic_counter_shard_view& other) const {
return id() == other.id() && value() == other.value()
&& logical_clock() == other.logical_clock();
}
bool operator!=(const basic_counter_shard_view& other) const {
return !(*this == other);
}
struct less_compare_by_id {
bool operator()(const basic_counter_shard_view& x, const basic_counter_shard_view& y) const {
return x.id() < y.id();
}
};
};
using counter_shard_view = basic_counter_shard_view<bytes_view>;
std::ostream& operator<<(std::ostream& os, counter_shard_view csv);
class counter_shard {
counter_id _id;
int64_t _value;
int64_t _logical_clock;
private:
template<typename T>
static void write(const T& value, bytes::iterator& out) {
out = std::copy_n(reinterpret_cast<const signed char*>(&value), sizeof(T), out);
}
private:
// Shared logic for applying counter_shards and counter_shard_views.
// T is either counter_shard or basic_counter_shard_view<U>.
template<typename T>
GCC6_CONCEPT(requires requires(T shard) {
{ shard.value() } -> int64_t;
{ shard.logical_clock() } -> int64_t;
})
counter_shard& do_apply(T&& other) noexcept {
auto other_clock = other.logical_clock();
if (_logical_clock < other_clock) {
_logical_clock = other_clock;
_value = other.value();
}
return *this;
}
public:
counter_shard(counter_id id, int64_t value, int64_t logical_clock) noexcept
: _id(id)
, _value(value)
, _logical_clock(logical_clock)
{ }
explicit counter_shard(counter_shard_view csv) noexcept
: _id(csv.id())
, _value(csv.value())
, _logical_clock(csv.logical_clock())
{ }
counter_id id() const { return _id; }
int64_t value() const { return _value; }
int64_t logical_clock() const { return _logical_clock; }
counter_shard& update(int64_t value_delta, int64_t clock_increment) noexcept {
_value += value_delta;
_logical_clock += clock_increment;
return *this;
}
counter_shard& apply(counter_shard_view other) noexcept {
return do_apply(other);
}
counter_shard& apply(const counter_shard& other) noexcept {
return do_apply(other);
}
static size_t serialized_size() {
return counter_shard_view::size;
}
void serialize(bytes::iterator& out) const {
write(_id, out);
write(_value, out);
write(_logical_clock, out);
}
};
class counter_cell_builder {
std::vector<counter_shard> _shards;
bool _sorted = true;
private:
void do_sort_and_remove_duplicates();
public:
counter_cell_builder() = default;
counter_cell_builder(size_t shard_count) {
_shards.reserve(shard_count);
}
void add_shard(const counter_shard& cs) {
_shards.emplace_back(cs);
}
void add_maybe_unsorted_shard(const counter_shard& cs) {
add_shard(cs);
if (_sorted && _shards.size() > 1) {
auto current = _shards.rbegin();
auto previous = std::next(current);
_sorted = current->id() > previous->id();
}
}
void sort_and_remove_duplicates() {
if (!_sorted) {
do_sort_and_remove_duplicates();
}
}
size_t serialized_size() const {
return _shards.size() * counter_shard::serialized_size();
}
void serialize(bytes::iterator& out) const {
for (auto&& cs : _shards) {
cs.serialize(out);
}
}
bool empty() const {
return _shards.empty();
}
atomic_cell build(api::timestamp_type timestamp) const {
return atomic_cell::make_live_from_serializer(timestamp, serialized_size(), [this] (bytes::iterator out) {
serialize(out);
});
}
static atomic_cell from_single_shard(api::timestamp_type timestamp, const counter_shard& cs) {
return atomic_cell::make_live_from_serializer(timestamp, counter_shard::serialized_size(), [&cs] (bytes::iterator out) {
cs.serialize(out);
});
}
class inserter_iterator : public std::iterator<std::output_iterator_tag, counter_shard> {
counter_cell_builder* _builder;
public:
explicit inserter_iterator(counter_cell_builder& b) : _builder(&b) { }
inserter_iterator& operator=(const counter_shard& cs) {
_builder->add_shard(cs);
return *this;
}
inserter_iterator& operator=(const counter_shard_view& csv) {
return operator=(counter_shard(csv));
}
inserter_iterator& operator++() { return *this; }
inserter_iterator& operator++(int) { return *this; }
inserter_iterator& operator*() { return *this; };
};
inserter_iterator inserter() {
return inserter_iterator(*this);
}
};
// <counter_id> := <int64_t><int64_t>
// <shard> := <counter_id><int64_t:value><int64_t:logical_clock>
// <counter_cell> := <shard>*
template<typename View>
class basic_counter_cell_view {
protected:
atomic_cell_base<View> _cell;
private:
class shard_iterator : public std::iterator<std::input_iterator_tag, basic_counter_shard_view<View>> {
typename View::pointer _current;
basic_counter_shard_view<View> _current_view;
public:
shard_iterator() = default;
shard_iterator(typename View::pointer ptr) noexcept
: _current(ptr), _current_view(ptr) { }
basic_counter_shard_view<View>& operator*() noexcept {
return _current_view;
}
basic_counter_shard_view<View>* operator->() noexcept {
return &_current_view;
}
shard_iterator& operator++() noexcept {
_current += counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
return *this;
}
shard_iterator operator++(int) noexcept {
auto it = *this;
operator++();
return it;
}
shard_iterator& operator--() noexcept {
_current -= counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
return *this;
}
shard_iterator operator--(int) noexcept {
auto it = *this;
operator--();
return it;
}
bool operator==(const shard_iterator& other) const noexcept {
return _current == other._current;
}
bool operator!=(const shard_iterator& other) const noexcept {
return !(*this == other);
}
};
public:
boost::iterator_range<shard_iterator> shards() const {
auto bv = _cell.value();
auto begin = shard_iterator(bv.data());
auto end = shard_iterator(bv.data() + bv.size());
return boost::make_iterator_range(begin, end);
}
size_t shard_count() const {
return _cell.value().size() / counter_shard_view::size;
}
public:
// ac must be a live counter cell
explicit basic_counter_cell_view(atomic_cell_base<View> ac) noexcept : _cell(ac) {
assert(_cell.is_live());
assert(!_cell.is_counter_update());
}
api::timestamp_type timestamp() const { return _cell.timestamp(); }
static data_type total_value_type() { return long_type; }
int64_t total_value() const {
return boost::accumulate(shards(), int64_t(0), [] (int64_t v, counter_shard_view cs) {
return v + cs.value();
});
}
stdx::optional<counter_shard_view> get_shard(const counter_id& id) const {
auto it = boost::range::find_if(shards(), [&id] (counter_shard_view csv) {
return csv.id() == id;
});
if (it == shards().end()) {
return { };
}
return *it;
}
stdx::optional<counter_shard_view> local_shard() const {
// TODO: consider caching local shard position
return get_shard(counter_id::local());
}
bool operator==(const basic_counter_cell_view& other) const {
return timestamp() == other.timestamp() && boost::equal(shards(), other.shards());
}
};
struct counter_cell_view : basic_counter_cell_view<bytes_view> {
using basic_counter_cell_view::basic_counter_cell_view;
// Returns counter shards in an order that is compatible with Scylla 1.7.4.
std::vector<counter_shard> shards_compatible_with_1_7_4() const;
// Reversibly applies two counter cells, at least one of them must be live.
// Returns true iff dst was modified.
static bool apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Reverts apply performed by apply_reversible().
static void revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Computes a counter cell containing minimal amount of data which, when
// applied to 'b' returns the same cell as 'a' and 'b' applied together.
static stdx::optional<atomic_cell> difference(atomic_cell_view a, atomic_cell_view b);
friend std::ostream& operator<<(std::ostream& os, counter_cell_view ccv);
};
struct counter_cell_mutable_view : basic_counter_cell_view<bytes_mutable_view> {
using basic_counter_cell_view::basic_counter_cell_view;
void set_timestamp(api::timestamp_type ts) { _cell.set_timestamp(ts); }
};
// Transforms mutation dst from counter updates to counter shards using state
// stored in current_state.
// If current_state is present it has to be in the same schema as dst.
void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset);
template<>
struct appending_hash<counter_shard_view> {
template<typename Hasher>
void operator()(Hasher& h, const counter_shard_view& cshard) const {
::feed_hash(h, cshard.id().to_uuid());
::feed_hash(h, cshard.value());
::feed_hash(h, cshard.logical_clock());
}
};
template<>
struct appending_hash<counter_cell_view> {
template<typename Hasher>
void operator()(Hasher& h, const counter_cell_view& cell) const {
::feed_hash(h, true); // is_live
::feed_hash(h, cell.timestamp());
for (auto&& csv : cell.shards()) {
::feed_hash(h, csv);
}
}
};

89
cpu_controller.hh Normal file
View File

@@ -0,0 +1,89 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <seastar/core/thread.hh>
#include <seastar/core/timer.hh>
#include <chrono>
// Simple proportional controller to adjust shares of memtable/streaming flushes.
//
// Goal is to flush as fast as we can, but not so fast that we steal all the CPU from incoming
// requests, and at the same time minimize user-visible fluctuations in the flush quota.
//
// What that translates to is we'll try to keep virtual dirty's firt derivative at 0 (IOW, we keep
// virtual dirty constant), which means that the rate of incoming writes is equal to the rate of
// flushed bytes.
//
// The exact point at which the controller stops determines the desired flush CPU usage. As we
// approach the hard dirty limit, we need to be more aggressive. We will therefore define two
// thresholds, and increase the constant as we cross them.
//
// 1) the soft limit line
// 2) halfway between soft limit and dirty limit
//
// The constants q1 and q2 are used to determine the proportional factor at each stage.
//
// Below the soft limit, we are in no particular hurry to flush, since it means we're set to
// complete flushing before we a new memtable is ready. The quota is dirty * q1, and q1 is set to a
// low number.
//
// The first half of the virtual dirty region is where we expect to be usually, so we have a low
// slope corresponding to a sluggish response between q1 * soft_limit and q2.
//
// In the second half, we're getting close to the hard dirty limit so we increase the slope and
// become more responsive, up to a maximum quota of qmax.
//
// For now we'll just set them in the structure not to complicate the constructor. But q1, q2 and
// qmax can easily become parameters if we find another user.
class flush_cpu_controller {
static constexpr float hard_dirty_limit = 0.50;
static constexpr float q1 = 0.01;
static constexpr float q2 = 0.2;
static constexpr float qmax = 1;
float _current_quota = 0.0f;
float _goal;
std::function<float()> _current_dirty;
std::chrono::milliseconds _interval;
timer<> _update_timer;
seastar::thread_scheduling_group _scheduling_group;
seastar::thread_scheduling_group *_current_scheduling_group = nullptr;
void adjust();
public:
seastar::thread_scheduling_group* scheduling_group() {
return _current_scheduling_group;
}
float current_quota() const {
return _current_quota;
}
struct disabled {
seastar::thread_scheduling_group *backup;
};
flush_cpu_controller(disabled d) : _scheduling_group(std::chrono::nanoseconds(0), 0), _current_scheduling_group(d.backup) {}
flush_cpu_controller(std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty);
flush_cpu_controller(flush_cpu_controller&&) = default;
};

View File

@@ -36,15 +36,19 @@ options {
#include "cql3/statements/raw/select_statement.hh"
#include "cql3/statements/alter_keyspace_statement.hh"
#include "cql3/statements/alter_table_statement.hh"
#include "cql3/statements/alter_view_statement.hh"
#include "cql3/statements/create_keyspace_statement.hh"
#include "cql3/statements/drop_keyspace_statement.hh"
#include "cql3/statements/create_index_statement.hh"
#include "cql3/statements/create_table_statement.hh"
#include "cql3/statements/create_view_statement.hh"
#include "cql3/statements/create_type_statement.hh"
#include "cql3/statements/drop_type_statement.hh"
#include "cql3/statements/alter_type_statement.hh"
#include "cql3/statements/property_definitions.hh"
#include "cql3/statements/drop_index_statement.hh"
#include "cql3/statements/drop_table_statement.hh"
#include "cql3/statements/drop_view_statement.hh"
#include "cql3/statements/truncate_statement.hh"
#include "cql3/statements/raw/update_statement.hh"
#include "cql3/statements/raw/insert_statement.hh"
@@ -315,9 +319,7 @@ cqlStatement returns [shared_ptr<raw::parsed_statement> stmt]
| st10=createIndexStatement { $stmt = st10; }
| st11=dropKeyspaceStatement { $stmt = st11; }
| st12=dropTableStatement { $stmt = st12; }
#if 0
| st13=dropIndexStatement { $stmt = st13; }
#endif
| st14=alterTableStatement { $stmt = st14; }
| st15=alterKeyspaceStatement { $stmt = st15; }
| st16=grantStatement { $stmt = st16; }
@@ -340,6 +342,9 @@ cqlStatement returns [shared_ptr<raw::parsed_statement> stmt]
| st30=createAggregateStatement { $stmt = st30; }
| st31=dropAggregateStatement { $stmt = st31; }
#endif
| st32=createViewStatement { $stmt = st32; }
| st33=alterViewStatement { $stmt = st33; }
| st34=dropViewStatement { $stmt = st34; }
;
/*
@@ -394,6 +399,7 @@ unaliasedSelector returns [shared_ptr<selectable::raw> s]
| K_WRITETIME '(' c=cident ')' { tmp = make_shared<selectable::writetime_or_ttl::raw>(c, true); }
| K_TTL '(' c=cident ')' { tmp = make_shared<selectable::writetime_or_ttl::raw>(c, false); }
| f=functionName args=selectionFunctionArgs { tmp = ::make_shared<selectable::with_function::raw>(std::move(f), std::move(args)); }
| K_CAST '(' arg=unaliasedSelector K_AS t=native_type ')' { tmp = ::make_shared<selectable::with_cast::raw>(std::move(arg), std::move(t)); }
)
( '.' fi=cident { tmp = make_shared<selectable::with_field_selection::raw>(std::move(tmp), std::move(fi)); } )*
{ $s = tmp; }
@@ -716,7 +722,7 @@ createTableStatement returns [shared_ptr<cql3::statements::create_table_statemen
cfamDefinition[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
: '(' cfamColumns[expr] ( ',' cfamColumns[expr]? )* ')'
( K_WITH cfamProperty[expr] ( K_AND cfamProperty[expr] )*)?
( K_WITH cfamProperty[$expr->properties()] ( K_AND cfamProperty[$expr->properties()] )*)?
;
cfamColumns[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
@@ -732,15 +738,15 @@ pkDef[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
| '(' k1=ident { l.push_back(k1); } ( ',' kn=ident { l.push_back(kn); } )* ')' { $expr->add_key_aliases(l); }
;
cfamProperty[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
: property[expr->properties]
| K_COMPACT K_STORAGE { $expr->set_compact_storage(); }
cfamProperty[cql3::statements::cf_properties& expr]
: property[$expr.properties()]
| K_COMPACT K_STORAGE { $expr.set_compact_storage(); }
| K_CLUSTERING K_ORDER K_BY '(' cfamOrdering[expr] (',' cfamOrdering[expr])* ')'
;
cfamOrdering[shared_ptr<cql3::statements::create_table_statement::raw_statement> expr]
cfamOrdering[cql3::statements::cf_properties& expr]
@init{ bool reversed=false; }
: k=ident (K_ASC | K_DESC { reversed=true;} ) { $expr->set_ordering(k, reversed); }
: k=ident (K_ASC | K_DESC { reversed=true;} ) { $expr.set_ordering(k, reversed); }
;
@@ -772,12 +778,13 @@ createIndexStatement returns [::shared_ptr<create_index_statement> expr]
auto props = make_shared<index_prop_defs>();
bool if_not_exists = false;
auto name = ::make_shared<cql3::index_name>();
std::vector<::shared_ptr<index_target::raw>> targets;
}
: K_CREATE (K_CUSTOM { props->is_custom = true; })? K_INDEX (K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
(idxName[name])? K_ON cf=columnFamilyName '(' id=indexIdent ')'
(idxName[name])? K_ON cf=columnFamilyName '(' (target1=indexIdent { targets.emplace_back(target1); } (',' target2=indexIdent { targets.emplace_back(target2); } )*)? ')'
(K_USING cls=STRING_LITERAL { props->custom_class = sstring{$cls.text}; })?
(K_WITH properties[props])?
{ $expr = ::make_shared<create_index_statement>(cf, name, id, props, if_not_exists); }
{ $expr = ::make_shared<create_index_statement>(cf, name, targets, props, if_not_exists); }
;
indexIdent returns [::shared_ptr<index_target::raw> id]
@@ -787,6 +794,39 @@ indexIdent returns [::shared_ptr<index_target::raw> id]
| K_FULL '(' c=cident ')' { $id = index_target::raw::full_collection(c); }
;
/**
* CREATE MATERIALIZED VIEW <viewName> AS
* SELECT <columns>
* FROM <CF>
* WHERE <pkColumns> IS NOT NULL
* PRIMARY KEY (<pkColumns>)
* WITH <property> = <value> AND ...;
*/
createViewStatement returns [::shared_ptr<create_view_statement> expr]
@init {
bool if_not_exists = false;
std::vector<::shared_ptr<cql3::column_identifier::raw>> partition_keys;
std::vector<::shared_ptr<cql3::column_identifier::raw>> composite_keys;
}
: K_CREATE K_MATERIALIZED K_VIEW (K_IF K_NOT K_EXISTS { if_not_exists = true; })? cf=columnFamilyName K_AS
K_SELECT sclause=selectClause K_FROM basecf=columnFamilyName
(K_WHERE wclause=whereClause)?
K_PRIMARY K_KEY (
'(' '(' k1=cident { partition_keys.push_back(k1); } ( ',' kn=cident { partition_keys.push_back(kn); } )* ')' ( ',' c1=cident { composite_keys.push_back(c1); } )* ')'
| '(' k1=cident { partition_keys.push_back(k1); } ( ',' cn=cident { composite_keys.push_back(cn); } )* ')'
)
{
$expr = ::make_shared<create_view_statement>(
std::move(cf),
std::move(basecf),
std::move(sclause),
std::move(wclause),
std::move(partition_keys),
std::move(composite_keys),
if_not_exists);
}
( K_WITH cfamProperty[{ $expr->properties() }] ( K_AND cfamProperty[{ $expr->properties() }] )*)?
;
#if 0
/**
@@ -833,7 +873,7 @@ alterKeyspaceStatement returns [shared_ptr<cql3::statements::alter_keyspace_stat
alterTableStatement returns [shared_ptr<alter_table_statement> expr]
@init {
alter_table_statement::type type;
auto props = make_shared<cql3::statements::cf_prop_defs>();;
auto props = make_shared<cql3::statements::cf_prop_defs>();
std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>, shared_ptr<cql3::column_identifier::raw>>> renames;
bool is_static = false;
}
@@ -867,6 +907,18 @@ alterTypeStatement returns [::shared_ptr<alter_type_statement> expr]
)
;
/**
* ALTER MATERIALIZED VIEW <CF> WITH <property> = <value>;
*/
alterViewStatement returns [::shared_ptr<alter_view_statement> expr]
@init {
auto props = make_shared<cql3::statements::cf_prop_defs>();
}
: K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[props]
{
$expr = ::make_shared<alter_view_statement>(std::move(cf), std::move(props));
}
;
renames[::shared_ptr<alter_type_statement::renames> expr]
: fromId=ident K_TO toId=ident { $expr->add_rename(fromId, toId); }
@@ -897,16 +949,23 @@ dropTypeStatement returns [::shared_ptr<drop_type_statement> stmt]
: K_DROP K_TYPE (K_IF K_EXISTS { if_exists = true; } )? name=userTypeName { $stmt = ::make_shared<drop_type_statement>(name, if_exists); }
;
#if 0
/**
* DROP MATERIALIZED VIEW [IF EXISTS] <view_name>
*/
dropViewStatement returns [::shared_ptr<drop_view_statement> stmt]
@init { bool if_exists = false; }
: K_DROP K_MATERIALIZED K_VIEW (K_IF K_EXISTS { if_exists = true; } )? cf=columnFamilyName
{ $stmt = ::make_shared<drop_view_statement>(cf, if_exists); }
;
/**
* DROP INDEX [IF EXISTS] <INDEX_NAME>
*/
dropIndexStatement returns [DropIndexStatement expr]
@init { boolean ifExists = false; }
: K_DROP K_INDEX (K_IF K_EXISTS { ifExists = true; } )? index=indexName
{ $expr = new DropIndexStatement(index, ifExists); }
dropIndexStatement returns [::shared_ptr<drop_index_statement> expr]
@init { bool if_exists = false; }
: K_DROP K_INDEX (K_IF K_EXISTS { if_exists = true; } )? index=indexName
{ $expr = ::make_shared<drop_index_statement>(index, if_exists); }
;
#endif
/**
* TRUNCATE <CF>;
@@ -1109,6 +1168,7 @@ constant returns [shared_ptr<cql3::constants::literal> constant]
| t=INTEGER { $constant = cql3::constants::literal::integer(sstring{$t.text}); }
| t=FLOAT { $constant = cql3::constants::literal::floating_point(sstring{$t.text}); }
| t=BOOLEAN { $constant = cql3::constants::literal::bool_(sstring{$t.text}); }
| t=DURATION { $constant = cql3::constants::literal::duration(sstring{$t.text}); }
| t=UUID { $constant = cql3::constants::literal::uuid(sstring{$t.text}); }
| t=HEXNUMBER { $constant = cql3::constants::literal::hex(sstring{$t.text}); }
| { sign=""; } ('-' {sign = "-"; } )? t=(K_NAN | K_INFINITY) { $constant = cql3::constants::literal::floating_point(sstring{sign + $t.text}); }
@@ -1243,6 +1303,10 @@ normalColumnOperation[operations_type& operations, ::shared_ptr<cql3::column_ide
}
add_raw_update(operations, key, make_shared<cql3::operation::addition>(cql3::constants::literal::integer($i.text)));
}
| K_SCYLLA_COUNTER_SHARD_LIST '(' t=term ')'
{
add_raw_update(operations, key, ::make_shared<cql3::operation::set_counter_value_from_tuple_list>(t));
}
;
specializedColumnOperation[std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>,
@@ -1304,7 +1368,8 @@ relation[std::vector<cql3::relation_ptr>& clauses]
| K_TOKEN l=tupleOfIdentifiers type=relationType t=term
{ $clauses.emplace_back(::make_shared<cql3::token_relation>(std::move(l), *type, std::move(t))); }
| name=cident K_IS K_NOT K_NULL {
$clauses.emplace_back(make_shared<cql3::single_column_relation>(std::move(name), cql3::operator_type::IS_NOT, cql3::constants::NULL_LITERAL)); }
| name=cident K_IN marker=inMarker
{ $clauses.emplace_back(make_shared<cql3::single_column_relation>(std::move(name), cql3::operator_type::IN, std::move(marker))); }
| name=cident K_IN in_values=singleColumnInValues
@@ -1401,15 +1466,20 @@ native_type returns [shared_ptr<cql3_type> t]
| K_COUNTER { $t = cql3_type::counter; }
| K_DECIMAL { $t = cql3_type::decimal; }
| K_DOUBLE { $t = cql3_type::double_; }
| K_DURATION { $t = cql3_type::duration; }
| K_FLOAT { $t = cql3_type::float_; }
| K_INET { $t = cql3_type::inet; }
| K_INT { $t = cql3_type::int_; }
| K_SMALLINT { $t = cql3_type::smallint; }
| K_TEXT { $t = cql3_type::text; }
| K_TIMESTAMP { $t = cql3_type::timestamp; }
| K_TINYINT { $t = cql3_type::tinyint; }
| K_UUID { $t = cql3_type::uuid; }
| K_VARCHAR { $t = cql3_type::varchar; }
| K_VARINT { $t = cql3_type::varint; }
| K_TIMEUUID { $t = cql3_type::timeuuid; }
| K_DATE { $t = cql3_type::date; }
| K_TIME { $t = cql3_type::time; }
;
collection_type returns [shared_ptr<cql3::cql3_type::raw> pt]
@@ -1483,6 +1553,8 @@ basic_unreserved_keyword returns [sstring str]
| K_DISTINCT
| K_CONTAINS
| K_STATIC
| K_FROZEN
| K_TUPLE
| K_FUNCTION
| K_AGGREGATE
| K_SFUNC
@@ -1500,6 +1572,7 @@ basic_unreserved_keyword returns [sstring str]
K_SELECT: S E L E C T;
K_FROM: F R O M;
K_AS: A S;
K_CAST: C A S T;
K_WHERE: W H E R E;
K_AND: A N D;
K_KEY: K E Y;
@@ -1528,6 +1601,8 @@ K_KEYSPACE: ( K E Y S P A C E
K_KEYSPACES: K E Y S P A C E S;
K_COLUMNFAMILY:( C O L U M N F A M I L Y
| T A B L E );
K_MATERIALIZED:M A T E R I A L I Z E D;
K_VIEW: V I E W;
K_INDEX: I N D E X;
K_CUSTOM: C U S T O M;
K_ON: O N;
@@ -1551,6 +1626,7 @@ K_DESC: D E S C;
K_ALLOW: A L L O W;
K_FILTERING: F I L T E R I N G;
K_IF: I F;
K_IS: I S;
K_CONTAINS: C O N T A I N S;
K_GRANT: G R A N T;
@@ -1577,9 +1653,12 @@ K_BOOLEAN: B O O L E A N;
K_COUNTER: C O U N T E R;
K_DECIMAL: D E C I M A L;
K_DOUBLE: D O U B L E;
K_DURATION: D U R A T I O N;
K_FLOAT: F L O A T;
K_INET: I N E T;
K_INT: I N T;
K_SMALLINT: S M A L L I N T;
K_TINYINT: T I N Y I N T;
K_TEXT: T E X T;
K_UUID: U U I D;
K_VARCHAR: V A R C H A R;
@@ -1587,6 +1666,8 @@ K_VARINT: V A R I N T;
K_TIMEUUID: T I M E U U I D;
K_TOKEN: T O K E N;
K_WRITETIME: W R I T E T I M E;
K_DATE: D A T E;
K_TIME: T I M E;
K_NULL: N U L L;
K_NOT: N O T;
@@ -1616,6 +1697,7 @@ K_REPLACE: R E P L A C E;
K_DETERMINISTIC: D E T E R M I N I S T I C;
K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T;
// Case-insensitive alpha characters
fragment A: ('a'|'A');
@@ -1701,6 +1783,20 @@ fragment EXPONENT
: E ('+' | '-')? DIGIT+
;
fragment DURATION_UNIT
: Y
| M O
| W
| D
| H
| M
| S
| M S
| U S
| '\u00B5' S
| N S
;
INTEGER
: '-'? DIGIT+
;
@@ -1725,6 +1821,13 @@ BOOLEAN
: T R U E | F A L S E
;
DURATION
: '-'? DIGIT+ DURATION_UNIT (DIGIT+ DURATION_UNIT)*
| '-'? 'P' (DIGIT+ 'Y')? (DIGIT+ 'M')? (DIGIT+ 'D')? ('T' (DIGIT+ 'H')? (DIGIT+ 'M')? (DIGIT+ 'S')?)? // ISO 8601 "format with designators"
| '-'? 'P' DIGIT+ 'W'
| '-'? 'P' DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT 'T' DIGIT DIGIT ':' DIGIT DIGIT ':' DIGIT DIGIT // ISO 8601 "alternative format"
;
IDENT
: LETTER (LETTER | DIGIT | '_')*
;

View File

@@ -79,6 +79,7 @@ abstract_marker::raw::raw(int32_t bind_index)
return ::make_shared<maps::marker>(_bind_index, receiver);
}
assert(0);
return shared_ptr<term>();
}
assignment_testable::test_result abstract_marker::raw::test_assignment(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver) {

View File

@@ -71,13 +71,15 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
}
auto tval = _timestamp->bind_and_get(options);
if (!tval) {
if (tval.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value of timestamp");
}
if (tval.is_unset_value()) {
return now;
}
try {
data_type_for<int64_t>()->validate(*tval);
} catch (marshal_exception e) {
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid timestamp value");
}
return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(*tval));
@@ -88,14 +90,16 @@ int32_t attributes::get_time_to_live(const query_options& options) {
return 0;
auto tval = _time_to_live->bind_and_get(options);
if (!tval) {
if (tval.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value of TTL");
}
if (tval.is_unset_value()) {
return 0;
}
try {
data_type_for<int32_t>()->validate(*tval);
}
catch (marshal_exception e) {
catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid TTL value");
}

View File

@@ -40,11 +40,29 @@
*/
#include "cql3/column_condition.hh"
#include "statements/request_validations.hh"
#include "unimplemented.hh"
#include "lists.hh"
#include "maps.hh"
#include <boost/range/algorithm_ext/push_back.hpp>
namespace {
void validate_operation_on_durations(const abstract_type& type, const cql3::operator_type& op) {
using cql3::statements::request_validations::check_false;
if (op.is_slice() && type.references_duration()) {
check_false(type.is_collection(), "Slice conditions are not supported on collections containing durations");
check_false(type.is_tuple(), "Slice conditions are not supported on tuples containing durations");
check_false(type.is_user_type(), "Slice conditions are not supported on UDTs containing durations");
// We're a duration.
throw exceptions::invalid_request_exception(sprint("Slice conditions are not supported on durations"));
}
}
}
namespace cql3 {
bool
@@ -95,6 +113,7 @@ column_condition::raw::prepare(database& db, const sstring& keyspace, const colu
}
return column_condition::in_condition(receiver, std::move(terms));
} else {
validate_operation_on_durations(*receiver.type, _op);
return column_condition::condition(receiver, _value->prepare(db, keyspace, receiver.column_specification), _op);
}
}
@@ -129,6 +148,8 @@ column_condition::raw::prepare(database& db, const sstring& keyspace, const colu
| boost::adaptors::transformed(std::bind(&term::raw::prepare, std::placeholders::_1, std::ref(db), std::ref(keyspace), value_spec)));
return column_condition::in_condition(receiver, _collection_element->prepare(db, keyspace, element_spec), terms);
} else {
validate_operation_on_durations(*receiver.type, _op);
return column_condition::condition(receiver,
_collection_element->prepare(db, keyspace, element_spec),
_value->prepare(db, keyspace, value_spec),

View File

@@ -23,6 +23,8 @@
#include "exceptions/exceptions.hh"
#include "cql3/selection/simple_selector.hh"
#include <regex>
namespace cql3 {
column_identifier::column_identifier(sstring raw_text, bool keep_case) {
@@ -59,6 +61,17 @@ sstring column_identifier::to_string() const {
return _text;
}
sstring column_identifier::to_cql_string() const {
static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
if (std::regex_match(_text.begin(), _text.end(), unquoted_identifier_re)) {
return _text;
}
static const std::regex double_quote_re("\"");
std::string result = _text;
std::regex_replace(result, double_quote_re, "\"\"");
return '"' + result + '"';
}
column_identifier::raw::raw(sstring raw_text, bool keep_case)
: _raw_text{raw_text}
, _text{raw_text}

View File

@@ -47,7 +47,7 @@
#include <algorithm>
#include <functional>
#include <iostream>
#include <iosfwd>
namespace cql3 {
@@ -80,6 +80,8 @@ public:
sstring to_string() const;
sstring to_cql_string() const;
friend std::ostream& operator<<(std::ostream& out, const column_identifier& i) {
return out << i._text;
}

View File

@@ -44,6 +44,7 @@
namespace cql3 {
thread_local const ::shared_ptr<constants::value> constants::UNSET_VALUE = ::make_shared<constants::value>(cql3::raw_value::make_unset_value());
thread_local const ::shared_ptr<term::raw> constants::NULL_LITERAL = ::make_shared<constants::null_literal>();
thread_local const ::shared_ptr<terminal> constants::null_literal::NULL_VALUE = ::make_shared<constants::null_literal::null_value>();
@@ -51,14 +52,15 @@ std::ostream&
operator<<(std::ostream&out, constants::type t)
{
switch (t) {
case constants::type::STRING: return out << "STRING";
case constants::type::INTEGER: return out << "INTEGER";
case constants::type::UUID: return out << "UUID";
case constants::type::FLOAT: return out << "FLOAT";
case constants::type::BOOLEAN: return out << "BOOLEAN";
case constants::type::HEX: return out << "HEX";
};
assert(0);
case constants::type::STRING: return out << "STRING";
case constants::type::INTEGER: return out << "INTEGER";
case constants::type::UUID: return out << "UUID";
case constants::type::FLOAT: return out << "FLOAT";
case constants::type::BOOLEAN: return out << "BOOLEAN";
case constants::type::HEX: return out << "HEX";
case constants::type::DURATION: return out << "DURATION";
}
abort();
}
bytes
@@ -97,7 +99,9 @@ constants::literal::test_assignment(database& db, const sstring& keyspace, ::sha
cql3_type::kind::TEXT,
cql3_type::kind::INET,
cql3_type::kind::VARCHAR,
cql3_type::kind::TIMESTAMP>::contains(kind)) {
cql3_type::kind::TIMESTAMP,
cql3_type::kind::DATE,
cql3_type::kind::TIME>::contains(kind)) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
break;
@@ -109,7 +113,10 @@ constants::literal::test_assignment(database& db, const sstring& keyspace, ::sha
cql3_type::kind::DOUBLE,
cql3_type::kind::FLOAT,
cql3_type::kind::INT,
cql3_type::kind::SMALLINT,
cql3_type::kind::TIMESTAMP,
cql3_type::kind::DATE,
cql3_type::kind::TINYINT,
cql3_type::kind::VARINT>::contains(kind)) {
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
@@ -139,6 +146,11 @@ constants::literal::test_assignment(database& db, const sstring& keyspace, ::sha
return assignment_testable::test_result::WEAKLY_ASSIGNABLE;
}
break;
case type::DURATION:
if (kind == cql3_type::kind_enum_set::prepare<cql3_type::kind::DURATION>()) {
return assignment_testable::test_result::EXACT_MATCH;
}
break;
}
return assignment_testable::test_result::NOT_ASSIGNABLE;
}
@@ -150,10 +162,10 @@ constants::literal::prepare(database& db, const sstring& keyspace, ::shared_ptr<
throw exceptions::invalid_request_exception(sprint("Invalid %s constant (%s) for \"%s\" of type %s",
_type, _text, *receiver->name, receiver->type->as_cql3_type()->to_string()));
}
return ::make_shared<value>(std::experimental::make_optional(parsed_value(receiver->type)));
return ::make_shared<value>(cql3::raw_value::make_value(parsed_value(receiver->type)));
}
void constants::deleter::execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) {
void constants::deleter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
if (column.type->is_multi_cell()) {
collection_type_impl::mutation coll_m;
coll_m.tomb = params.make_tombstone();

View File

@@ -44,6 +44,7 @@
#include "cql3/abstract_marker.hh"
#include "cql3/update_parameters.hh"
#include "cql3/operation.hh"
#include "cql3/values.hh"
#include "cql3/term.hh"
#include "core/shared_ptr.hh"
@@ -59,7 +60,7 @@ public:
#endif
public:
enum class type {
STRING, INTEGER, UUID, FLOAT, BOOLEAN, HEX
STRING, INTEGER, UUID, FLOAT, BOOLEAN, HEX, DURATION
};
/**
@@ -67,18 +68,20 @@ public:
*/
class value : public terminal {
public:
bytes_opt _bytes;
value(bytes_opt bytes_) : _bytes(std::move(bytes_)) {}
virtual bytes_opt get(const query_options& options) override { return _bytes; }
virtual bytes_view_opt bind_and_get(const query_options& options) override { return as_bytes_view_opt(_bytes); }
cql3::raw_value _bytes;
value(cql3::raw_value bytes_) : _bytes(std::move(bytes_)) {}
virtual cql3::raw_value get(const query_options& options) override { return _bytes; }
virtual cql3::raw_value_view bind_and_get(const query_options& options) override { return _bytes.to_view(); }
virtual sstring to_string() const override { return to_hex(*_bytes); }
};
static thread_local const ::shared_ptr<value> UNSET_VALUE;
class null_literal final : public term::raw {
private:
class null_value final : public value {
public:
null_value() : value({}) {}
null_value() : value(cql3::raw_value::make_null()) {}
virtual ::shared_ptr<terminal> bind(const query_options& options) override { return {}; }
virtual sstring to_string() const override { return "null"; }
};
@@ -146,6 +149,10 @@ public:
return ::make_shared<literal>(type::HEX, text);
}
static ::shared_ptr<literal> duration(sstring text) {
return ::make_shared<literal>(type::DURATION, text);
}
virtual ::shared_ptr<term> prepare(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver);
private:
bytes parsed_value(data_type validator);
@@ -169,14 +176,13 @@ public:
assert(!_receiver->type->is_collection());
}
virtual bytes_view_opt bind_and_get(const query_options& options) override {
virtual cql3::raw_value_view bind_and_get(const query_options& options) override {
try {
auto value = options.get_value_at(_bind_index);
if (value) {
_receiver->type->validate(*value);
return *value;
}
return std::experimental::nullopt;
return value;
} catch (const marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -187,7 +193,7 @@ public:
if (!bytes) {
return ::shared_ptr<terminal>{};
}
return ::make_shared<constants::value>(std::move(to_bytes_opt(*bytes)));
return ::make_shared<constants::value>(std::move(cql3::raw_value::make_value(to_bytes(*bytes))));
}
};
@@ -195,54 +201,48 @@ public:
public:
using operation::operation;
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override {
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
auto cell = value ? make_cell(*value, params) : make_dead_cell(params);
m.set_cell(prefix, column, std::move(cell));
if (value.is_null()) {
m.set_cell(prefix, column, std::move(make_dead_cell(params)));
} else if (value.is_value()) {
m.set_cell(prefix, column, std::move(make_cell(*value, params)));
}
}
};
#if 0
public static class Adder extends Operation
{
public Adder(ColumnDefinition column, Term t)
{
super(column, t);
struct adder final : operation {
using operation::operation;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
if (value.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment");
} else if (value.is_unset_value()) {
return;
}
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
m.set_cell(prefix, column, make_counter_update_cell(increment, params));
}
};
public void execute(ByteBuffer rowKey, ColumnFamily cf, Composite prefix, UpdateParameters params) throws InvalidRequestException
{
ByteBuffer bytes = t.bindAndGet(params.options);
if (bytes == null)
throw new InvalidRequestException("Invalid null value for counter increment");
long increment = ByteBufferUtil.toLong(bytes);
CellName cname = cf.getComparator().create(prefix, column);
cf.addColumn(params.makeCounter(cname, increment));
struct subtracter final : operation {
using operation::operation;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
if (value.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment");
} else if (value.is_unset_value()) {
return;
}
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
if (increment == std::numeric_limits<int64_t>::min()) {
throw exceptions::invalid_request_exception(sprint("The negation of %d overflows supported counter precision (signed 8 bytes integer)", increment));
}
m.set_cell(prefix, column, make_counter_update_cell(-increment, params));
}
}
public static class Substracter extends Operation
{
public Substracter(ColumnDefinition column, Term t)
{
super(column, t);
}
public void execute(ByteBuffer rowKey, ColumnFamily cf, Composite prefix, UpdateParameters params) throws InvalidRequestException
{
ByteBuffer bytes = t.bindAndGet(params.options);
if (bytes == null)
throw new InvalidRequestException("Invalid null value for counter increment");
long increment = ByteBufferUtil.toLong(bytes);
if (increment == Long.MIN_VALUE)
throw new InvalidRequestException("The negation of " + increment + " overflows supported counter precision (signed 8 bytes integer)");
CellName cname = cf.getComparator().create(prefix, column);
cf.addColumn(params.makeCounter(cname, -increment));
}
}
#endif
};
class deleter : public operation {
public:
@@ -250,7 +250,7 @@ public:
: operation(column, {})
{ }
virtual void execute(mutation& m, const exploded_clustering_prefix& prefix, const update_parameters& params) override;
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
};
};

View File

@@ -19,11 +19,43 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include <iostream>
#include <iterator>
#include <regex>
#include "cql3_type.hh"
#include "cql3/util.hh"
#include "ut_name.hh"
namespace cql3 {
sstring cql3_type::to_string() const {
if (_type->is_user_type()) {
return "frozen<" + util::maybe_quote(_name) + ">";
}
if (_type->is_tuple()) {
return "frozen<" + _name + ">";
}
return _name;
}
shared_ptr<cql3_type> cql3_type::raw::prepare(database& db, const sstring& keyspace) {
try {
auto&& ks = db.find_keyspace(keyspace);
return prepare_internal(keyspace, ks.metadata()->user_types());
} catch (no_such_keyspace& nsk) {
throw exceptions::invalid_request_exception("Unknown keyspace " + keyspace);
}
}
bool cql3_type::raw::is_duration() const {
return false;
}
bool cql3_type::raw::references_user_type(const sstring& name) const {
return false;
}
class cql3_type::raw_type : public raw {
private:
shared_ptr<cql3_type> _type;
@@ -35,6 +67,9 @@ public:
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) {
return _type;
}
shared_ptr<cql3_type> prepare_internal(const sstring&, lw_shared_ptr<user_types_metadata>) override {
return _type;
}
virtual bool supports_freezing() const {
return false;
@@ -47,6 +82,10 @@ public:
virtual sstring to_string() const {
return _type->to_string();
}
virtual bool is_duration() const override {
return _type->get_type()->equals(duration_type);
}
};
class cql3_type::raw_collection : public raw {
@@ -76,7 +115,7 @@ public:
return true;
}
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) override {
virtual shared_ptr<cql3_type> prepare_internal(const sstring& keyspace, lw_shared_ptr<user_types_metadata> user_types) override {
assert(_values); // "Got null values type for a collection";
if (!_frozen && _values->supports_freezing() && !_values->_frozen) {
@@ -93,16 +132,30 @@ public:
}
if (_kind == &collection_type_impl::kind::list) {
return make_shared(cql3_type(to_string(), list_type_impl::get_instance(_values->prepare(db, keyspace)->get_type(), !_frozen), false));
return make_shared(cql3_type(to_string(), list_type_impl::get_instance(_values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
} else if (_kind == &collection_type_impl::kind::set) {
return make_shared(cql3_type(to_string(), set_type_impl::get_instance(_values->prepare(db, keyspace)->get_type(), !_frozen), false));
if (_values->is_duration()) {
throw exceptions::invalid_request_exception(sprint("Durations are not allowed inside sets: %s", *this));
}
return make_shared(cql3_type(to_string(), set_type_impl::get_instance(_values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
} else if (_kind == &collection_type_impl::kind::map) {
assert(_keys); // "Got null keys type for a collection";
return make_shared(cql3_type(to_string(), map_type_impl::get_instance(_keys->prepare(db, keyspace)->get_type(), _values->prepare(db, keyspace)->get_type(), !_frozen), false));
if (_keys->is_duration()) {
throw exceptions::invalid_request_exception(sprint("Durations are not allowed as map keys: %s", *this));
}
return make_shared(cql3_type(to_string(), map_type_impl::get_instance(_keys->prepare_internal(keyspace, user_types)->get_type(), _values->prepare_internal(keyspace, user_types)->get_type(), !_frozen), false));
}
abort();
}
bool references_user_type(const sstring& name) const override {
return (_keys && _keys->references_user_type(name)) || _values->references_user_type(name);
}
bool is_duration() const override {
return false;
}
virtual sstring to_string() const override {
sstring start = _frozen ? "frozen<" : "";
sstring end = _frozen ? ">" : "";
@@ -132,7 +185,7 @@ public:
_frozen = true;
}
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) override {
virtual shared_ptr<cql3_type> prepare_internal(const sstring& keyspace, lw_shared_ptr<user_types_metadata> user_types) override {
if (_name.has_keyspace()) {
// The provided keyspace is the one of the current statement this is part of. If it's different from the keyspace of
// the UTName, we reject since we want to limit user types to their own keyspace (see #6643)
@@ -144,23 +197,23 @@ public:
} else {
_name.set_keyspace(keyspace);
}
if (!user_types) {
// bootstrap mode.
throw exceptions::invalid_request_exception(sprint("Unknown type %s", _name));
}
try {
auto&& ks = db.find_keyspace(_name.get_keyspace());
try {
auto&& type = ks.metadata()->user_types()->get_type(_name.get_user_type_name());
if (!_frozen) {
throw exceptions::invalid_request_exception("Non-frozen User-Defined types are not supported, please use frozen<>");
}
return make_shared<cql3_type>(_name.to_string(), std::move(type));
} catch (std::out_of_range& e) {
throw exceptions::invalid_request_exception(sprint("Unknown type %s", _name));
auto&& type = user_types->get_type(_name.get_user_type_name());
if (!_frozen) {
throw exceptions::invalid_request_exception("Non-frozen User-Defined types are not supported, please use frozen<>");
}
} catch (no_such_keyspace& nsk) {
throw exceptions::invalid_request_exception("Unknown keyspace " + _name.get_keyspace());
return make_shared<cql3_type>(_name.to_string(), std::move(type));
} catch (std::out_of_range& e) {
throw exceptions::invalid_request_exception(sprint("Unknown type %s", _name));
}
}
bool references_user_type(const sstring& name) const override {
return _name.get_string_type_name() == name;
}
virtual bool supports_freezing() const override {
return true;
}
@@ -191,7 +244,7 @@ public:
}
_frozen = true;
}
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) override {
virtual shared_ptr<cql3_type> prepare_internal(const sstring& keyspace, lw_shared_ptr<user_types_metadata> user_types) override {
if (!_frozen) {
freeze();
}
@@ -200,10 +253,17 @@ public:
if (t->is_counter()) {
throw exceptions::invalid_request_exception("Counters are not allowed inside tuples");
}
ts.push_back(t->prepare(db, keyspace)->get_type());
ts.push_back(t->prepare_internal(keyspace, user_types)->get_type());
}
return make_cql3_tuple_type(tuple_type_impl::get_instance(std::move(ts)));
}
bool references_user_type(const sstring& name) const override {
return std::any_of(_types.begin(), _types.end(), [&name](auto t) {
return t->references_user_type(name);
});
}
virtual sstring to_string() const override {
return sprint("tuple<%s>", join(", ", _types));
}
@@ -271,17 +331,23 @@ thread_local shared_ptr<cql3_type> cql3_type::bigint = make("bigint", long_type,
thread_local shared_ptr<cql3_type> cql3_type::blob = make("blob", bytes_type, cql3_type::kind::BLOB);
thread_local shared_ptr<cql3_type> cql3_type::boolean = make("boolean", boolean_type, cql3_type::kind::BOOLEAN);
thread_local shared_ptr<cql3_type> cql3_type::double_ = make("double", double_type, cql3_type::kind::DOUBLE);
thread_local shared_ptr<cql3_type> cql3_type::empty = make("empty", empty_type, cql3_type::kind::EMPTY);
thread_local shared_ptr<cql3_type> cql3_type::float_ = make("float", float_type, cql3_type::kind::FLOAT);
thread_local shared_ptr<cql3_type> cql3_type::int_ = make("int", int32_type, cql3_type::kind::INT);
thread_local shared_ptr<cql3_type> cql3_type::smallint = make("smallint", short_type, cql3_type::kind::SMALLINT);
thread_local shared_ptr<cql3_type> cql3_type::text = make("text", utf8_type, cql3_type::kind::TEXT);
thread_local shared_ptr<cql3_type> cql3_type::timestamp = make("timestamp", timestamp_type, cql3_type::kind::TIMESTAMP);
thread_local shared_ptr<cql3_type> cql3_type::tinyint = make("tinyint", byte_type, cql3_type::kind::TINYINT);
thread_local shared_ptr<cql3_type> cql3_type::uuid = make("uuid", uuid_type, cql3_type::kind::UUID);
thread_local shared_ptr<cql3_type> cql3_type::varchar = make("varchar", utf8_type, cql3_type::kind::TEXT);
thread_local shared_ptr<cql3_type> cql3_type::timeuuid = make("timeuuid", timeuuid_type, cql3_type::kind::TIMEUUID);
thread_local shared_ptr<cql3_type> cql3_type::date = make("date", simple_date_type, cql3_type::kind::DATE);
thread_local shared_ptr<cql3_type> cql3_type::time = make("time", time_type, cql3_type::kind::TIME);
thread_local shared_ptr<cql3_type> cql3_type::inet = make("inet", inet_addr_type, cql3_type::kind::INET);
thread_local shared_ptr<cql3_type> cql3_type::varint = make("varint", varint_type, cql3_type::kind::VARINT);
thread_local shared_ptr<cql3_type> cql3_type::decimal = make("decimal", decimal_type, cql3_type::kind::DECIMAL);
thread_local shared_ptr<cql3_type> cql3_type::counter = make("counter", counter_type, cql3_type::kind::COUNTER);
thread_local shared_ptr<cql3_type> cql3_type::duration = make("duration", duration_type, cql3_type::kind::DURATION);
const std::vector<shared_ptr<cql3_type>>&
cql3_type::values() {
@@ -293,15 +359,21 @@ cql3_type::values() {
cql3_type::counter,
cql3_type::decimal,
cql3_type::double_,
cql3_type::empty,
cql3_type::float_,
cql3_type:inet,
cql3_type::inet,
cql3_type::int_,
cql3_type::smallint,
cql3_type::text,
cql3_type::timestamp,
cql3_type::tinyint,
cql3_type::uuid,
cql3_type::varchar,
cql3_type::varint,
cql3_type::timeuuid,
cql3_type::date,
cql3_type::time,
cql3_type::duration,
};
return v;
}
@@ -321,5 +393,23 @@ operator<<(std::ostream& os, const cql3_type::raw& r) {
return os << r.to_string();
}
namespace util {
sstring maybe_quote(const sstring& s) {
static const std::regex unquoted("\\w*");
static const std::regex double_quote("\"");
if (std::regex_match(s.begin(), s.end(), unquoted)) {
return s;
}
std::ostringstream ss;
ss << "\"";
std::regex_replace(std::ostreambuf_iterator<char>(ss), s.begin(), s.end(), double_quote, "\"\"");
ss << "\"";
return ss.str();
}
}
}

View File

@@ -47,6 +47,7 @@
#include "enum_set.hh"
class database;
class user_types_metadata;
namespace cql3 {
@@ -63,19 +64,23 @@ public:
bool is_counter() const { return _type->is_counter(); }
bool is_native() const { return _native; }
data_type get_type() const { return _type; }
sstring to_string() const { return _name; }
sstring to_string() const;
// For UserTypes, we need to know the current keyspace to resolve the
// actual type used, so Raw is a "not yet prepared" CQL3Type.
class raw {
public:
virtual ~raw() {}
bool _frozen = false;
virtual bool supports_freezing() const = 0;
virtual bool is_collection() const;
virtual bool is_counter() const;
virtual bool is_duration() const;
virtual bool references_user_type(const sstring&) const;
virtual std::experimental::optional<sstring> keyspace() const;
virtual void freeze();
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace) = 0;
virtual shared_ptr<cql3_type> prepare_internal(const sstring& keyspace, lw_shared_ptr<user_types_metadata>) = 0;
virtual shared_ptr<cql3_type> prepare(database& db, const sstring& keyspace);
static shared_ptr<raw> from(shared_ptr<cql3_type> type);
static shared_ptr<raw> user_type(ut_name name);
static shared_ptr<raw> map(shared_ptr<raw> t1, shared_ptr<raw> t2);
@@ -98,7 +103,7 @@ private:
public:
enum class kind : int8_t {
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DECIMAL, DOUBLE, FLOAT, INT, INET, TEXT, TIMESTAMP, UUID, VARCHAR, VARINT, TIMEUUID
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DECIMAL, DOUBLE, EMPTY, FLOAT, INT, SMALLINT, TINYINT, INET, TEXT, TIMESTAMP, UUID, VARCHAR, VARINT, TIMEUUID, DATE, TIME, DURATION
};
using kind_enum = super_enum<kind,
kind::ASCII,
@@ -108,15 +113,21 @@ public:
kind::COUNTER,
kind::DECIMAL,
kind::DOUBLE,
kind::EMPTY,
kind::FLOAT,
kind::INET,
kind::INT,
kind::SMALLINT,
kind::TINYINT,
kind::TEXT,
kind::TIMESTAMP,
kind::UUID,
kind::VARCHAR,
kind::VARINT,
kind::TIMEUUID>;
kind::TIMEUUID,
kind::DATE,
kind::TIME,
kind::DURATION>;
using kind_enum_set = enum_set<kind_enum>;
private:
std::experimental::optional<kind_enum_set::prepared> _kind;
@@ -129,17 +140,23 @@ public:
static thread_local shared_ptr<cql3_type> blob;
static thread_local shared_ptr<cql3_type> boolean;
static thread_local shared_ptr<cql3_type> double_;
static thread_local shared_ptr<cql3_type> empty;
static thread_local shared_ptr<cql3_type> float_;
static thread_local shared_ptr<cql3_type> int_;
static thread_local shared_ptr<cql3_type> smallint;
static thread_local shared_ptr<cql3_type> text;
static thread_local shared_ptr<cql3_type> timestamp;
static thread_local shared_ptr<cql3_type> tinyint;
static thread_local shared_ptr<cql3_type> uuid;
static thread_local shared_ptr<cql3_type> varchar;
static thread_local shared_ptr<cql3_type> timeuuid;
static thread_local shared_ptr<cql3_type> date;
static thread_local shared_ptr<cql3_type> time;
static thread_local shared_ptr<cql3_type> inet;
static thread_local shared_ptr<cql3_type> varint;
static thread_local shared_ptr<cql3_type> decimal;
static thread_local shared_ptr<cql3_type> counter;
static thread_local shared_ptr<cql3_type> duration;
static const std::vector<shared_ptr<cql3_type>>& values();
public:

View File

@@ -46,7 +46,7 @@
#include "service/storage_proxy.hh"
#include "cql3/query_options.hh"
namespace transport {
namespace cql_transport {
namespace messages {
@@ -89,7 +89,7 @@ public:
* @param state the current query state
* @param options options for this query (consistency, variables, pageSize, ...)
*/
virtual future<::shared_ptr<transport::messages::result_message>>
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
/**
@@ -97,7 +97,7 @@ public:
*
* @param state the current query state
*/
virtual future<::shared_ptr<transport::messages::result_message>>
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const = 0;

View File

@@ -67,10 +67,6 @@ class error_collector : public error_listener<RecognizerType, ExceptionBaseType>
*/
const sstring_view _query;
/**
* The error messages.
*/
std::vector<sstring> _error_msgs;
public:
/**
@@ -81,7 +77,10 @@ public:
*/
error_collector(const sstring_view& query) : _query(query) {}
virtual void syntax_error(RecognizerType& recognizer, ANTLR_UINT8** token_names, ExceptionBaseType* ex) override {
/**
* Format and throw a new \c exceptions::syntax_exception.
*/
[[noreturn]] virtual void syntax_error(RecognizerType& recognizer, ANTLR_UINT8** token_names, ExceptionBaseType* ex) override {
auto hdr = get_error_header(ex);
auto msg = get_error_message(recognizer, ex, token_names);
std::stringstream result;
@@ -90,22 +89,15 @@ public:
if (recognizer instanceof Parser)
appendQuerySnippet((Parser) recognizer, builder);
#endif
_error_msgs.emplace_back(result.str());
}
virtual void syntax_error(RecognizerType& recognizer, const sstring& msg) override {
_error_msgs.emplace_back(msg);
throw exceptions::syntax_exception(result.str());
}
/**
* Throws the first syntax error found by the lexer or the parser if it exists.
*
* @throws SyntaxException the syntax error.
* Throw a new \c exceptions::syntax_exception.
*/
void throw_first_syntax_error() {
if (!_error_msgs.empty()) {
throw exceptions::syntax_exception(_error_msgs[0]);
}
[[noreturn]] virtual void syntax_error(RecognizerType&, const sstring& msg) override {
throw exceptions::syntax_exception(msg);
}
private:

View File

@@ -41,6 +41,7 @@
#pragma once
#include "seastarx.hh"
#include <seastar/core/sstring.hh>
#include <antlr3.hpp>
@@ -52,6 +53,7 @@ namespace cql3 {
template<typename RecognizerType, typename ExceptionBaseType>
class error_listener {
public:
virtual ~error_listener() = default;
/**
* Invoked when a syntax error occurs.

View File

@@ -43,7 +43,7 @@
#include "types.hh"
#include <vector>
#include <iostream>
#include <iosfwd>
#include <boost/functional/hash.hpp>
namespace cql3 {

View File

@@ -41,6 +41,7 @@
#pragma once
#include "utils/big_decimal.hh"
#include "aggregate_function.hh"
#include "native_aggregate_function.hh"
@@ -111,9 +112,70 @@ make_sum_function() {
return make_shared<sum_function_for<Type>>();
}
template <typename Type>
class impl_div_for_avg {
public:
static Type div(const Type& x, const int64_t y) {
return x/y;
}
};
template <>
class impl_div_for_avg<big_decimal> {
public:
static big_decimal div(const big_decimal& x, const int64_t y) {
return x.div(y, big_decimal::rounding_mode::HALF_EVEN);
}
};
// We need a wider accumulator for average, since summing the inputs can overflow
// the input type
template <typename T>
struct accumulator_for;
template <>
struct accumulator_for<int8_t> {
using type = __int128;
};
template <>
struct accumulator_for<int16_t> {
using type = __int128;
};
template <>
struct accumulator_for<int32_t> {
using type = __int128;
};
template <>
struct accumulator_for<int64_t> {
using type = __int128;
};
template <>
struct accumulator_for<float> {
using type = float;
};
template <>
struct accumulator_for<double> {
using type = double;
};
template <>
struct accumulator_for<boost::multiprecision::cpp_int> {
using type = boost::multiprecision::cpp_int;
};
template <>
struct accumulator_for<big_decimal> {
using type = big_decimal;
};
template <typename Type>
class impl_avg_function_for final : public aggregate_function::aggregate {
Type _sum{};
typename accumulator_for<Type>::type _sum{};
int64_t _count = 0;
public:
virtual void reset() override {
@@ -121,9 +183,9 @@ public:
_count = 0;
}
virtual opt_bytes compute(cql_serialization_format sf) override {
Type ret = 0;
Type ret{};
if (_count) {
ret = _sum / _count;
ret = impl_div_for_avg<Type>::div(_sum, _count);
}
return data_type_for<Type>()->decompose(ret);
}

View File

@@ -0,0 +1,82 @@
/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "castas_fcts.hh"
#include "cql3/functions/native_scalar_function.hh"
namespace cql3 {
namespace functions {
namespace {
using bytes_opt = std::experimental::optional<bytes>;
class castas_function_for : public cql3::functions::native_scalar_function {
castas_fctn _func;
public:
castas_function_for(data_type to_type,
data_type from_type,
castas_fctn func)
: native_scalar_function("castas" + to_type->as_cql3_type()->to_string(), to_type, {from_type})
, _func(func) {
}
virtual bool is_pure() override {
return true;
}
virtual void print(std::ostream& os) const override {
os << "cast(" << _arg_types[0]->name() << " as " << _return_type->name() << ")";
}
virtual bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) override {
auto from_type = arg_types()[0];
auto to_type = return_type();
auto&& val = parameters[0];
if (!val) {
return val;
}
auto val_from = from_type->deserialize(*val);
auto val_to = _func(val_from);
return to_type->decompose(val_to);
}
};
shared_ptr<function> make_castas_function(data_type to_type, data_type from_type, castas_fctn func) {
return ::make_shared<castas_function_for>(std::move(to_type), std::move(from_type), std::move(func));
}
} /* Anonymous Namespace */
shared_ptr<function> castas_functions::get(data_type to_type, const std::vector<shared_ptr<cql3::selection::selector>>& provided_args, schema_ptr s) {
if (provided_args.size() != 1) {
throw exceptions::invalid_request_exception("Invalid CAST expression");
}
auto from_type = provided_args[0]->get_type();
auto from_type_key = from_type;
if (from_type_key->is_reversed()) {
from_type_key = dynamic_cast<const reversed_type_impl&>(*from_type).underlying_type();
}
auto f = get_castas_fctn(to_type, from_type_key);
return make_castas_function(to_type, from_type, f);
}
}
}

View File

@@ -0,0 +1,63 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Modified by ScyllaDB
*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <tuple>
#include <unordered_map>
#include "cql3/functions/function.hh"
#include "cql3/functions/abstract_function.hh"
#include "exceptions/exceptions.hh"
#include "core/print.hh"
#include "cql3/cql3_type.hh"
#include "cql3/selection/selector.hh"
namespace cql3 {
namespace functions {
class castas_functions {
public:
static shared_ptr<function> get(data_type to_type, const std::vector<shared_ptr<cql3::selection::selector>>& provided_args, schema_ptr s);
};
}
}

View File

@@ -59,13 +59,13 @@ public:
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override;
virtual void collect_marker_specification(shared_ptr<variable_specifications> bound_names) override;
virtual shared_ptr<terminal> bind(const query_options& options) override;
virtual bytes_view_opt bind_and_get(const query_options& options) override;
virtual cql3::raw_value_view bind_and_get(const query_options& options) override;
private:
static bytes_opt execute_internal(cql_serialization_format sf, scalar_function& fun, std::vector<bytes_opt> params);
public:
virtual bool contains_bind_marker() const override;
private:
static shared_ptr<terminal> make_terminal(shared_ptr<function> fun, bytes_opt result, cql_serialization_format sf);
static shared_ptr<terminal> make_terminal(shared_ptr<function> fun, cql3::raw_value result, cql_serialization_format sf);
public:
class raw : public term::raw {
function_name _name;

View File

@@ -43,7 +43,7 @@
#include "core/sstring.hh"
#include "db/system_keyspace.hh"
#include <iostream>
#include <iosfwd>
#include <functional>
namespace cql3 {

View File

@@ -59,6 +59,14 @@ functions::init() {
declare(make_to_blob_function(type->get_type()));
declare(make_from_blob_function(type->get_type()));
}
declare(aggregate_fcts::make_count_function<int8_t>());
declare(aggregate_fcts::make_max_function<int8_t>());
declare(aggregate_fcts::make_min_function<int8_t>());
declare(aggregate_fcts::make_count_function<int16_t>());
declare(aggregate_fcts::make_max_function<int16_t>());
declare(aggregate_fcts::make_min_function<int16_t>());
declare(aggregate_fcts::make_count_function<int32_t>());
declare(aggregate_fcts::make_max_function<int32_t>());
declare(aggregate_fcts::make_min_function<int32_t>());
@@ -67,6 +75,26 @@ functions::init() {
declare(aggregate_fcts::make_max_function<int64_t>());
declare(aggregate_fcts::make_min_function<int64_t>());
declare(aggregate_fcts::make_count_function<boost::multiprecision::cpp_int>());
declare(aggregate_fcts::make_max_function<boost::multiprecision::cpp_int>());
declare(aggregate_fcts::make_min_function<boost::multiprecision::cpp_int>());
declare(aggregate_fcts::make_count_function<big_decimal>());
declare(aggregate_fcts::make_max_function<big_decimal>());
declare(aggregate_fcts::make_min_function<big_decimal>());
declare(aggregate_fcts::make_count_function<float>());
declare(aggregate_fcts::make_max_function<float>());
declare(aggregate_fcts::make_min_function<float>());
declare(aggregate_fcts::make_count_function<double>());
declare(aggregate_fcts::make_max_function<double>());
declare(aggregate_fcts::make_min_function<double>());
declare(aggregate_fcts::make_count_function<sstring>());
declare(aggregate_fcts::make_max_function<sstring>());
declare(aggregate_fcts::make_min_function<sstring>());
//FIXME:
//declare(aggregate_fcts::make_count_function<bytes>());
//declare(aggregate_fcts::make_max_function<bytes>());
@@ -76,20 +104,22 @@ functions::init() {
declare(make_varchar_as_blob_fct());
declare(make_blob_as_varchar_fct());
declare(aggregate_fcts::make_sum_function<int8_t>());
declare(aggregate_fcts::make_sum_function<int16_t>());
declare(aggregate_fcts::make_sum_function<int32_t>());
declare(aggregate_fcts::make_sum_function<int64_t>());
declare(aggregate_fcts::make_sum_function<float>());
declare(aggregate_fcts::make_sum_function<double>());
declare(aggregate_fcts::make_sum_function<boost::multiprecision::cpp_int>());
declare(aggregate_fcts::make_sum_function<big_decimal>());
declare(aggregate_fcts::make_avg_function<int8_t>());
declare(aggregate_fcts::make_avg_function<int16_t>());
declare(aggregate_fcts::make_avg_function<int32_t>());
declare(aggregate_fcts::make_avg_function<int64_t>());
#if 0
declare(AggregateFcts.sumFunctionForFloat);
declare(AggregateFcts.sumFunctionForDouble);
declare(AggregateFcts.sumFunctionForDecimal);
declare(AggregateFcts.sumFunctionForVarint);
declare(AggregateFcts.avgFunctionForFloat);
declare(AggregateFcts.avgFunctionForDouble);
declare(AggregateFcts.avgFunctionForVarint);
declare(AggregateFcts.avgFunctionForDecimal);
#endif
declare(aggregate_fcts::make_avg_function<float>());
declare(aggregate_fcts::make_avg_function<double>());
declare(aggregate_fcts::make_avg_function<boost::multiprecision::cpp_int>());
declare(aggregate_fcts::make_avg_function<big_decimal>());
// also needed for smp:
#if 0
@@ -299,10 +329,10 @@ function_call::collect_marker_specification(shared_ptr<variable_specifications>
shared_ptr<terminal>
function_call::bind(const query_options& options) {
return make_terminal(_fun, to_bytes_opt(bind_and_get(options)), options.get_cql_serialization_format());
return make_terminal(_fun, cql3::raw_value::make_value(bind_and_get(options)), options.get_cql_serialization_format());
}
bytes_view_opt
cql3::raw_value_view
function_call::bind_and_get(const query_options& options) {
std::vector<bytes_opt> buffers;
buffers.reserve(_terms.size());
@@ -316,7 +346,7 @@ function_call::bind_and_get(const query_options& options) {
buffers.push_back(std::move(to_bytes_opt(val)));
}
auto result = execute_internal(options.get_cql_serialization_format(), *_fun, std::move(buffers));
return options.make_temporary(result);
return options.make_temporary(cql3::raw_value::make_value(result));
}
bytes_opt
@@ -328,7 +358,7 @@ function_call::execute_internal(cql_serialization_format sf, scalar_function& fu
fun.return_type()->validate(*result);
}
return result;
} catch (marshal_exception e) {
} catch (marshal_exception& e) {
throw runtime_exception(sprint("Return of function %s (%s) is not a valid value for its declared return type %s",
fun, to_hex(result),
*fun.return_type()->as_cql3_type()
@@ -347,7 +377,7 @@ function_call::contains_bind_marker() const {
}
shared_ptr<terminal>
function_call::make_terminal(shared_ptr<function> fun, bytes_opt result, cql_serialization_format sf) {
function_call::make_terminal(shared_ptr<function> fun, cql3::raw_value result, cql_serialization_format sf) {
if (!dynamic_pointer_cast<const collection_type_impl>(fun->return_type())) {
return ::make_shared<constants::value>(std::move(result));
}
@@ -413,7 +443,7 @@ function_call::raw::prepare(database& db, const sstring& keyspace, ::shared_ptr<
// If all parameters are terminal and the function is pure, we can
// evaluate it now, otherwise we'd have to wait execution time
if (all_terminal && scalar_fun->is_pure()) {
return make_terminal(scalar_fun, execute(*scalar_fun, parameters), query_options::DEFAULT.get_cql_serialization_format());
return make_terminal(scalar_fun, cql3::raw_value::make_value(execute(*scalar_fun, parameters)), query_options::DEFAULT.get_cql_serialization_format());
} else {
return ::make_shared<function_call>(scalar_fun, parameters);
}
@@ -426,7 +456,7 @@ function_call::raw::execute(scalar_function& fun, std::vector<shared_ptr<term>>
for (auto&& t : parameters) {
assert(dynamic_cast<terminal*>(t.get()));
auto&& param = static_cast<terminal*>(t.get())->get(query_options::DEFAULT);
buffers.push_back(std::move(param));
buffers.push_back(std::move(to_bytes_opt(param)));
}
return execute_internal(cql_serialization_format::internal(), fun, buffers);

Some files were not shown because too many files have changed in this diff Show More