Compare commits


2165 Commits

Author SHA1 Message Date
Nadav Har'El
afa2c1b0bf materialized_views: propagate "view virtual columns" between nodes
db::schema_tables::ALL and db::schema_tables::all_tables() are both supposed
to list the same schema tables - the former is the list of their names, and
the latter is the list of their schemas. This code duplication makes it easy
to forget to update one of them, and indeed recently the new
"view_virtual_columns" was added to all_tables() but not to ALL.

This patch makes ALL a function instead of a constant vector. The newly
named all_table_names() function uses all_tables(), so the list of
schema tables appears only once.

So that nobody worries about the performance impact, all_table_names()
caches the list in a per-thread vector that is only prepared once per thread.
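The deduplication can be sketched as follows; the struct, table names, and element type here are invented for illustration, not Scylla's actual definitions:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for db::schema_tables: derive the name list
// from the one authoritative table list, and cache it in a per-thread
// vector built on first use.
struct table_schema {
    std::string name;
};

static std::vector<table_schema> all_tables() {
    return {{"keyspaces"}, {"tables"}, {"columns"}, {"view_virtual_columns"}};
}

static const std::vector<std::string>& all_table_names() {
    // thread_local: prepared once per thread, so repeated callers pay
    // nothing after the first call on that thread.
    static thread_local const std::vector<std::string> names = [] {
        std::vector<std::string> v;
        for (const auto& t : all_tables()) {
            v.push_back(t.name);
        }
        return v;
    }();
    return names;
}
```

With this shape, adding a table to all_tables() automatically updates the name list, which is the property the patch restores.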

Because after this patch all_table_names() has the "view_virtual_columns"
that was previously missing, this patch also fixes #4339, which was about
virtual columns in materialized views not being propagated to other nodes.

Unfortunately, to test the fix for #4339 we need a test with multiple
nodes, so we cannot test it here in a unit test, and will instead use
the dtest framework, in a separate patch.

Fixes #4339

Branches: 3.0
Tests: all unit tests (release and debug mode), new dtest for #4339. The unit test mutation_reader_test failed in debug mode but not in release mode, but this probably has nothing to do with this patch (?).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Message-Id: <20190320063437.32731-1-nyh@scylladb.com>
(cherry picked from commit 7c874057f5)
2020-01-06 00:37:59 +02:00
Tomasz Grabiec
ad70fe8503 cql: alter type: Format field name as text instead of hex
Fixes #4841

Message-Id: <1565702635-26214-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 64ff1b6405)
2020-01-05 18:55:40 +02:00
Gleb Natapov
3cd9c78056 cache_hitrate_calculator: do not ignore a future returned from gossiper::add_local_application_state
We should wait for the future returned from add_local_application_state() to
resolve before issuing a new calculation; otherwise, two
add_local_application_state() calls may run simultaneously for the same state.
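The pattern can be illustrated with plain std::future (not Seastar's futures; the publisher type here is invented): waiting on the previously returned future before issuing the next operation guarantees at most one is in flight.

```cpp
#include <algorithm>
#include <cassert>
#include <future>

// Sketch: serialize successive publish operations by waiting for the
// previous future to resolve before issuing a new one, so two never
// run for the same state at once.
struct state_publisher {
    int in_flight = 0;
    int max_in_flight = 0;
    std::future<void> publish() {
        ++in_flight;
        max_in_flight = std::max(max_in_flight, in_flight);
        // Deferred: "completes" when the caller waits on it.
        return std::async(std::launch::deferred, [this] { --in_flight; });
    }
};
```

Dropping the returned future on the floor, as the pre-fix code did, is exactly what allows two publishes to overlap.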

Fixes #4838.

Message-Id: <20190812082158.GE17984@scylladb.com>
(cherry picked from commit 00c4078af3)
2020-01-05 18:50:27 +02:00
Benny Halevy
c5e5ed2775 tracing: one_session_records: keep local tracing ptr
Similar to trace_state keep shared_ptr<tracing> _local_tracing_ptr
in one_session_records when constructed so it can be used
during shutdown.
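The keep-alive pattern, reduced to a sketch (types and member names are illustrative, not Scylla's actual classes): the records object holds its own shared_ptr to the service, so the service survives even if the global reference is dropped first during shutdown.

```cpp
#include <cassert>
#include <memory>

struct tracing {
    bool alive = true;
    ~tracing() { alive = false; }
};

// The records object co-owns the tracing service via its own
// shared_ptr, so the service outlives it regardless of shutdown order.
struct one_session_records {
    std::shared_ptr<tracing> _local_tracing_ptr;
    explicit one_session_records(std::shared_ptr<tracing> t)
        : _local_tracing_ptr(std::move(t)) {}
    bool service_alive() const { return _local_tracing_ptr->alive; }
};
```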

Fixes #5243

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 7aef39e400)
2019-12-24 18:42:33 +02:00
Tomasz Grabiec
666266c3cf types: Fix abort on type alter which affects a compact storage table with no regular columns
Fixes #4837

Message-Id: <1565702247-23800-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 34cff6ed6b)
2019-12-24 17:44:40 +02:00
Dejan Mircevski
19b5d70338 tests: Add cquery_nofail() utility
Most tests await the result of cql_test_env::execute_cql().  Most
would also benefit from reporting errors with top-level location
included.

Ref #4837 (a prerequisite for backporting)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit a9849ecba7)
2019-12-24 17:44:40 +02:00
Amnon Heiman
b3cdee7e27 init: do not allow replace-address for seeds
If a node is a seed node, it cannot be started with
replace-address-first-boot or the replace-address flag.

The issue is that, as a seed node, it will generate new tokens instead of
replacing the existing ones the user expects it to replace when supplying
the flags.

This patch will throw a bad_configuration_error exception
in this case.

Fixes #3889

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 399d79fc6f)
2019-12-23 17:24:52 +02:00
Rafael Ávila de Espíndola
4c42f18d82 cql: Fix use of UDT in reversed columns
We were missing calls to underlying_type in a few locations and so the
insert would think the given literal was invalid and the select would
refuse to fetch a UDT field.

Fixes #4672

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190708200516.59841-1-espindola@scylladb.com>
(cherry picked from commit 4e7ffb80c0)
2019-12-23 15:57:47 +02:00
Benny Halevy
ea8f8ab7a3 sstables: mc: prevent signed integer overflow
Fix runtime error: signed integer overflow
introduced by 2dc3776407

Delta-encoded values may wrap around if the encoded value is
less than the base value.  This could happen in two places:
In the mc-format serialization header itself, where the base values are implicit
Cassandra epoch time, and in the sstables data files, where the base values
are taken from the encoding_stats (later written to the serialization_header).

In these cases, when the calculation is done using signed integer/long we may see
"runtime error: signed integer overflow" messages in debug mode
(with -fsanitize=undefined / -fsanitize=signed-integer-overflow).

Overflow here is expected and harmless, since we guarantee neither that
the base values in the serialization header are greater than or equal to
Cassandra's epoch, nor that the delta-encoded values are always greater
than or equal to the respective base values in the serialization header.

To prevent these warnings, the subtraction/addition should be done with unsigned
(two's complement) arithmetic and the result converted to the signed type.
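A minimal sketch of the technique (not Scylla's actual helpers; function names are invented): do the arithmetic in the unsigned type, where wrap-around is well defined, and convert back.

```cpp
#include <cassert>
#include <cstdint>

// Delta-decode with unsigned (two's complement) arithmetic so that
// wrap-around is well defined, then convert back to the signed type.
// (Conversion of an out-of-range unsigned value to a signed type is
// implementation-defined before C++20 but gives the expected two's
// complement result on mainstream compilers; it is well defined in C++20.)
static int64_t apply_delta(int64_t base, uint64_t delta) {
    return static_cast<int64_t>(static_cast<uint64_t>(base) + delta);
}

static uint64_t compute_delta(int64_t base, int64_t value) {
    // May "wrap" when value < base; apply_delta undoes it exactly.
    return static_cast<uint64_t>(value) - static_cast<uint64_t>(base);
}
```

Doing the same subtraction in int64_t would be a signed overflow whenever value < base, which is exactly what the sanitizer flagged.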

Note that, to keep the code simple where possible, we also rely on the
implicit conversion of signed integers to unsigned when one of the added
values is unsigned and the other is signed.

Fixes: #4098

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190120142950.15776-1-bhalevy@scylladb.com>
(cherry picked from commit 844a2de263)
2019-12-15 15:52:35 +02:00
Piotr Sarna
db6821ce8f table: Reduce read amplification in view update generation
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index.
  CREATE INDEX index1  ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates.
cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
  'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
  keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors
  skip-read-validation -node 127.0.0.1;

Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.

Refs #5409
Fixes #4615
Fixes #5418

(cherry picked from commit 79c3a508f4)
2019-12-05 22:36:41 +02:00
Rafael Ávila de Espíndola
3c91bad0dc commitlog: make sure a file is closed
If allocate or truncate throws, we have to close the file.
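The fix is an exception-safety pattern. A sketch with a fake "file" (so the close is observable; none of these names are the commitlog's actual API): holding the handle in an RAII owner guarantees it is closed even when a later allocate/truncate step throws.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

struct fake_file {
    static int open_count;
    fake_file() { ++open_count; }
    ~fake_file() { --open_count; }  // "close"
    void truncate(bool fail) {
        if (fail) {
            throw std::runtime_error("truncate failed");
        }
    }
};
int fake_file::open_count = 0;

static void create_segment(bool fail_truncate) {
    auto f = std::make_unique<fake_file>();
    f->truncate(fail_truncate);  // if this throws, ~fake_file still runs
}
```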

Fixes #4877

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191114174810.49004-1-espindola@scylladb.com>
(cherry picked from commit 6160b9017d)
2019-11-24 17:50:06 +02:00
Tomasz Grabiec
bbe41a82be row_cache: Fix abort on bad_alloc during cache update
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).

Fix by destroying the coroutine first.

Fixes #5327.

Tests:
  - row_cache_test (dev)

Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e3d025d014)
2019-11-24 17:44:30 +02:00
Nadav Har'El
6fb42269e9 merge: row_marker: correct row expiry condition
Merged patch set by Piotr Dulikowski:

This change corrects the condition under which a row was considered expired by its
TTL.

The logic that decides when a row becomes expired was inconsistent with the
logic that decides if a single cell is expired. A single cell becomes expired
when expiry_timestamp <= now, while a row became expired when
expiry_timestamp < now (notice the strict inequality). For rows inserted
with TTL, this caused non-key cells to expire (change their values to null)
one second before the row disappeared. Now, row expiry logic uses non-strict
inequality.
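The corrected comparison, reduced to its essentials (function names invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// A cell expires when expiry <= now. The row-level check previously
// used the strict form (expiry < now); the fix makes it non-strict so
// both agree at the boundary second.
static bool cell_expired(int64_t expiry, int64_t now) {
    return expiry <= now;
}

static bool row_expired(int64_t expiry, int64_t now) {
    return expiry <= now;  // was: expiry < now
}
```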

Fixes #4263,
Fixes #5290.

Tests:

    unit(dev)
    python test described in issue #5290

(cherry picked from commit 9b9609c65b)
(cherry picked from commit 95acf71680)
2019-11-20 21:40:40 +02:00
Asias He
ee2255a189 gossip: Fix max generation drift measure
Assume n1 and n2 in a cluster with generation number g1, g2. The
cluster runs for more than 1 year (MAX_GENERATION_DIFFERENCE). When n1
reboots with generation g1' which is time based, n2 will see
g1' > g2 + MAX_GENERATION_DIFFERENCE and reject n1's gossip update.

To fix, check the generation drift with generation value this node would
get if this node were restarted.
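The corrected check can be sketched like this (constants and names are invented; generations are time-based epoch seconds): validate the remote generation against the generation this node would get if it restarted now, not against its stored, possibly years-old, generation.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t MAX_GENERATION_DIFFERENCE = 365LL * 24 * 3600;  // ~1 year of seconds

// A freshly booted node takes a time-based generation, so "now" is the
// generation this node would get if it were restarted.
static bool generation_acceptable(int64_t remote_gen, int64_t now) {
    return remote_gen <= now + MAX_GENERATION_DIFFERENCE;
}
```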

This is a backport of CASSANDRA-10969.

Fixes #5164

(cherry picked from commit 0a52ecb6df)
2019-11-20 11:39:37 +02:00
Kamil Braun
3218e6cd4c view: fix bug in virtual columns.
When creating a virtual column of non-frozen map type,
the wrong type was used for the map's keys.

Fixes #5165.

(cherry picked from commit ef9d5750c8)
2019-11-19 11:17:54 +02:00
Rafael Ávila de Espíndola
1d94aac551 sstable: close file_writer if an exception is thrown
The previous code was not exception safe and would eventually cause a
file to be destroyed without being closed, causing an assert failure.

Unfortunately it doesn't seem to be possible to test this without
error injection, since using an invalid directory fails before this
code is executed.

Fixes #4948

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190904002314.79591-1-espindola@scylladb.com>
(cherry picked from commit 000514e7cc)
2019-11-19 11:17:54 +02:00
Avi Kivity
2e5110d063 reconcilable_result: use chunked_vector to hold partitions
Usually, a reconcilable_result holds very few partitions (1 is common),
since the page size is limited to 1MB. But if we have paging disabled or
if we are reconciling a range full of tombstones, we may see many more.
This can cause large allocations.

Change to chunked_vector to prevent those large allocations, as they
can be quite expensive.
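A minimal sketch of the chunked idea (this is not Scylla's utils::chunked_vector, just the shape of the technique): storage grows in fixed-size chunks, so no single allocation scales with the number of elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t ChunkSize = 1024>
class chunked_vector {
    // Only the chunk-pointer vector grows with size, and each chunk
    // allocation is a fixed, modest size.
    std::vector<std::unique_ptr<std::array<T, ChunkSize>>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(const T& v) {
        if (_size % ChunkSize == 0) {
            _chunks.push_back(std::make_unique<std::array<T, ChunkSize>>());
        }
        (*_chunks[_size / ChunkSize])[_size % ChunkSize] = v;
        ++_size;
    }
    const T& operator[](std::size_t i) const {
        return (*_chunks[i / ChunkSize])[i % ChunkSize];
    }
    std::size_t size() const { return _size; }
};
```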

Fixes #4780.

(cherry picked from commit 093d2cd7e5)
2019-11-19 11:17:54 +02:00
Avi Kivity
e4bb7ce73c utils::chunked_vector: add rbegin() and related iterators
Needed as an std::vector replacement.

(cherry picked from commit eaa9a5b0d7)

Prerequisite for #4780.
2019-11-19 11:17:54 +02:00
Avi Kivity
ecc54c1a68 utils: chunked_vector: make begin()/end() const correct
begin() of a const vector should return a const_iterator, to avoid
giving the caller the ability to mutate it.

This slipped through since iterator's constructor does a const_cast.

Noticed by code inspection.
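The const-correctness rule being fixed, in sketch form (a simplified container, not chunked_vector itself): the const overloads of begin()/end() return pointer-to-const, so a const container cannot hand out mutating access.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

template <typename T>
class span_like {
    T* _data;
    std::size_t _n;
public:
    span_like(T* d, std::size_t n) : _data(d), _n(n) {}
    T* begin() { return _data; }
    T* end() { return _data + _n; }
    // const overloads: no const_cast escape hatch for callers.
    const T* begin() const { return _data; }
    const T* end() const { return _data + _n; }
};
```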

(cherry picked from commit df6faae980)

Prerequisite for #4780.
2019-11-19 11:17:54 +02:00
Glauber Costa
71cfd108c6 do not crash in user-defined operations if the controller is disabled
Scylla currently crashes if we run manual operations like nodetool
compact with the controller disabled. While we neither like nor
recommend running with the controller disabled, due to some corner cases
in the controller algorithm we are not yet at the point in which we can
deprecate this and are sometimes forced to disable it.

The reason for the crash is that manual operations invoke
_backlog_of_shares, which returns the backlog needed to create a certain
number of shares. That scans the existing control points, but when we
run without the controller there are no control points, and we crash.

Backlog doesn't matter if the controller is disabled, and the return
value of this function will be immaterial in this case. So to avoid the
crash, we return something right away if the controller is disabled.

Fixes #5016

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit c9f2d1d105)
2019-11-19 11:17:54 +02:00
Avi Kivity
d40a7a5e9e Merge "Add proper aggregation for paged indexing" from Piotr
"
Fixes #4540

This series adds proper handling of aggregation for paged indexed queries.
Before this series returned results were presented to the user in per-page
partial manner, while they should have been returned as a single aggregated
value.

Tests: unit(dev)
"

* 'add_proper_aggregation_for_paged_indexing_for_3.0' of https://github.com/psarna/scylla:
  test: add 'eventually' block to index paging test
  tests: add indexing+paging test case for clustering keys
  tests: add indexing + paging + aggregation test case
  cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
  cql3: add proper aggregation to paged indexing
  cql3: add a query options constructor with explicit page size
  cql3: enable explicit copying of query_options
  cql3: split execute_base_query implementation
2019-11-19 11:17:54 +02:00
Takuya ASADA
a163d245ec dist/common/scripts/scylla_setup: don't proceed with empty NIC name
Currently, the NIC selection prompt in scylla_setup just proceeds with setup
when the user presses the Enter key at the prompt.
The prompt should ask for the NIC name again until the user inputs a valid NIC name.

Fixes #4517
Message-Id: <20190617124925.11559-1-syuu@scylladb.com>

(cherry picked from commit 7320c966bc)
2019-11-19 11:17:54 +02:00
Piotr Sarna
045831b706 test: add 'eventually' block to index paging test
Without 'eventually', the test is flaky because the index may still
not be up to date when its conditions are checked.

Fixes #4670

(cherry picked from commit ebbe038d19)
2019-11-15 09:15:29 +01:00
Piotr Sarna
148245ab6a tests: add indexing+paging test case for clustering keys
Indexing a non-prefix part of the clustering key has a separate
code path (see issue #3405), so it deserves a separate test case.
2019-11-14 12:32:08 +01:00
Piotr Sarna
bbe5de1403 tests: add indexing + paging + aggregation test case
Indexed queries used to erroneously return partial per-page results
for aggregation queries. This test case used to reproduce the problem
and now guards against regressions.

Refs #4540
2019-11-14 12:32:07 +01:00
Piotr Sarna
ca0df416c0 cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
The constant will be later used in test scenarios.
2019-11-14 12:25:37 +01:00
Piotr Sarna
37ed60374e cql3: add proper aggregation to paged indexing
Aggregated and paged filtering needs to aggregate the results
from all pages in order to avoid returning partial per-page
results. It's a little bit more complicated than regular aggregation,
because each paging state needs to be translated between the base
table and the underlying view. The routine keeps fetching pages
from the underlying view, which are then used to fetch base rows,
which go straight to the result set builder.
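The paged-aggregation idea, as a self-contained sketch (names and the page-fetch signature are invented, not the cql3 API): keep fetching pages and feed every row into one accumulator, returning a single aggregated value instead of one partial result per page.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

static std::size_t count_all_rows(
        const std::function<std::vector<int>(std::size_t)>& fetch_page) {
    std::size_t total = 0;
    for (std::size_t page = 0;; ++page) {
        std::vector<int> rows = fetch_page(page);
        if (rows.empty()) {
            break;  // an empty page marks the end of the data
        }
        total += rows.size();  // accumulate across pages; do not emit yet
    }
    return total;
}
```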

Fixes #4540
2019-11-14 12:25:37 +01:00
Piotr Sarna
7c991a276b cql3: add a query options constructor with explicit page size
For internal use, there already exists a query_options constructor
that copies data from another query_options with overwritten paging
state. This commit adds an option to overwrite page size as well.
2019-11-14 10:49:28 +01:00
Piotr Sarna
72e039be85 cql3: enable explicit copying of query_options 2019-11-14 10:49:28 +01:00
Piotr Sarna
a28ecc4714 cql3: split execute_base_query implementation
In order to handle aggregation queries correctly, the function that
returns base query results is split into two, so it's possible to
access raw query results, before they're converted into end-user
CQL message.
2019-11-14 10:49:28 +01:00
Avi Kivity
584c555698 Update seastar submodule
* seastar 3920dcb3f8...083dc0875e (2):
  > core: fix a race in execution stages
  > execution_stage: prevent unbounded growth

Fixes #4749.
Fixes #4856.
2019-11-13 13:15:54 +02:00
null
e772f11ee0 release: prepare for 3.0.11 by yaronkaikov 2019-10-30 11:01:40 +02:00
Botond Dénes
d79b6a7481 repair: repair_cf_range(): extract result of local checksum calculation only once
The loop that collects the results of the checksum calculations also logs
any errors. The error logging includes `checksums[0]`, which corresponds
to the checksum calculation on the local node. This violates the
assumption of the code following the loop, which assumes that the future
of `checksums[0]` is intact after the loop terminates. However, this is
only true when the checksum calculation is successful; when it fails,
the loop extracts the error and logs it. When the code after the loop
checks again whether said calculation failed, it will get a false
negative and will go ahead and attempt to extract the value, triggering
an assert failure.
Fix by making sure that even in the case of a failed checksum calculation,
the result of `checksums[0]` is extracted only once.
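The underlying rule holds for std::future too, and the fix pattern can be sketched with it (the variant-based helper is invented for illustration): a future's value or exception can be extracted only once, so code that wants to both log a failure and re-check it later must stash the outcome on the first extraction.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <stdexcept>
#include <variant>

using outcome = std::variant<int, std::exception_ptr>;

static outcome extract_once(std::future<int> f) {
    try {
        return f.get();  // the one and only extraction
    } catch (...) {
        return std::current_exception();  // preserved for later checks
    }
}
```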

Fixes: #5238
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029151709.90986-1-bdenes@scylladb.com>
(cherry picked from commit e48f301e95)
2019-10-29 20:43:50 +02:00
Avi Kivity
85168c500c Merge "Fix handling of schema alters and eviction in cache" from Tomasz
"
Fixes #5134, Eviction concurrent with preempted partition entry update after
  memtable flush may allow stale data to be populated into cache.

Fixes #5135, Cache reads may miss some writes if schema alter followed by a
  read happened concurrently with preempted partition entry update.

Fixes #5127, Cache populating read concurrent with schema alter may use the
  wrong schema version to interpret sstable data.

Fixes #5128, Reads of multi-row partitions concurrent with memtable flush may
  fail or cause a node crash after schema alter.
"

* tag 'fix-cache-issues-with-schema-alter-and-eviction-v2' of github.com:tgrabiec/scylla:
  tests: row_cache: Introduce test_alter_then_preempted_update_then_memtable_read
  tests: row_cache_stress_test: Verify all entries are evictable at the end
  tests: row_cache_stress_test: Exercise single-partition reads
  tests: row_cache_stress_test: Add periodic schema alters
  tests: memtable_snapshot_source: Allow changing the schema
  tests: simple_schema: Prepare for schema altering
  row_cache: Record upgraded schema in memtable entries during update
  memtable: Extract memtable_entry::upgrade_schema()
  row_cache, mvcc: Prevent locked snapshots from being evicted
  row_cache: Make evict() not use invalidate_unwrapped()
  mvcc: Introduce partition_snapshot::touch()
  row_cache, mvcc: Do not upgrade schema of entries which are being updated
  row_cache: Use the correct schema version to populate the partition entry
  delegating_reader: Optimize fill_buffer()
  row_cache, memtable: Use upgrade_schema()
  flat_mutation_reader: Introduce upgrade_schema()

(cherry picked from commit 8ed6f94a16)
(cherry picked from commit 3f4d9f210f)
2019-10-22 19:47:02 +02:00
Botond Dénes
5b9e2cd6e6 querier_cache: correctly account entries evicted on insertion in the population
Currently, the population stat is not increased for entries that are
evicted immediately on insert, however the code that does the eviction
still decreases the population stat, leading to an imbalance and in some
cases the underflow of the population stat. To fix, unconditionally
increase the population stat upon inserting an entry, regardless of
whether it is immediately evicted or not.
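The accounting fix in sketch form (a toy cache, not querier_cache itself): increment the population stat unconditionally on insert, so the eviction path's unconditional decrement can never underflow it.

```cpp
#include <cassert>

struct bounded_cache {
    unsigned population = 0;
    unsigned capacity;
    explicit bounded_cache(unsigned cap) : capacity(cap) {}
    void insert() {
        ++population;              // count the entry before any eviction runs
        if (population > capacity) {
            --population;          // immediate eviction: increments and decrements stay paired
        }
    }
};
```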

Fixes: #5123

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191001153215.82997-1-bdenes@scylladb.com>
(cherry picked from commit 00b432b61d)
2019-10-05 12:36:34 +03:00
Avi Kivity
77f33ca106 Merge " hinted handoff: fix races during shutdown and draining" from Vlad
"
Fix races that may lead to use-after-free events and file system level exceptions
during shutdown and drain.

The root cause of use-after-free events in question is that space_watchdog blocks on
end_point_hints_manager::file_update_mutex() and we need to make sure this mutex is alive as long as
it's accessed even if the corresponding end_point_hints_manager instance
is destroyed in the context of manager::drain_for().

File system exceptions may occur when space_watchdog attempts to scan a
directory while it's being deleted from the drain_for() context.
In case of such an exception, new hints generation is blocked,
including for materialized views, until the next space_watchdog round (in 1s).

Issues that are fixed are #4685 and #4836.

Tested as follows:
 1) Patched the code in order to trigger the race with (a lot) higher
    probability and running slightly modified hinted handoff replace
dtest with a debug binary 100 times. A side effect of this
testing was the discovery of #4836.
 2) Using the same patch as above tested that there are no crashes and
    nodes survive stop/start sequences (they were not without this series)
    in the context of all hinted handoff dtests. Ran the whole set of
tests with a dev binary 10 times.
"

Fixes #4685
Fixes #4836.

* 'hinted_handoff_race_between_drain_for_and_space_watchdog_no_global_lock-v2' of https://github.com/vladzcloudius/scylla:
  hinted handoff: fix a race on a directory removal between space_watchdog and drain_for()
  hinted handoff: make taking file_update_mutex safe
  db::hints::manager::drain_for(): fix alignment
  db::hints::manager: serialize calls to drain_for()
  db::hints: cosmetics: identation and missing method qualifier

(cherry picked from commit 3cb081eb84)
2019-10-05 12:25:51 +03:00
Gleb Natapov
93760f13ee messaging_service: enable reuseaddr on messaging service rpc
Fixes #4943

Message-Id: <20190918152405.GV21540@scylladb.com>
(cherry picked from commit 73e3d0a283)
2019-10-03 15:24:53 +03:00
Avi Kivity
e597ae1176 Update seastar submodule
* seastar af3fc691b9...3920dcb3f8 (2):
  > net: socket::{set,get}_reuseaddr() should not be virtual
  > Merge "fix some tcp connection bugs and add reuseaddr option to a client socket" from Gleb

Prerequisite for #4943.
2019-10-03 15:23:35 +03:00
Tomasz Grabiec
79c7015cce Merge "hinted handoff: don't reuse_segments and discard corrupted segments" from Vlad
This series addresses two issues in the hinted handoff that should
complete fixing the infamous #4231.

In particular the second patch removes the requirement to manually
delete hints files after upgrading to 3.0.4.

Tested with manual unit testing.

* https://github.com/vladzcloudius/scylla.git hinted_handoff_drop_broken_segments-v3:
  hinted handoff: disable "reuse_segments"
  commitlog: introduce a segment_error
  hinted handoff: discard corrupted segments

(cherry picked from commit ac0d435c3e)
2019-09-28 19:52:57 +03:00
Asias He
00a14000cd storage_service: Replicate and advertise tokens early in the boot up process
When a node is restarted, there is a race between gossip starting (other
nodes will mark this node up again and send requests) and the tokens being
replicated to other shards. Here is an example:

- n1, n2
- n2 is down, n1 think n2 is down
- n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
  reads/writes to n2, but n2 hasn't replicated the token_metadata to all
  the shards.
- n2 complains:
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  token_metadata - sorted_tokens is empty in first_token_index!
  storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
  (sorted_tokens is empty in first_token_index!)

The code path looks like below:

0 storage_service::init_server
1    prepare_to_join()
2          add gossip application state of NET_VERSION, SCHEMA and so on.
3         _gossiper.start_gossiping().get()
4    join_token_ring()
5           _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6           replicate_to_all_cores().get()
7           storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS

The race described above is between line 3 and line 6.

To fix, we can replicate the token_metadata early after it is filled
with the tokens read from system table before gossip starts. So that
when other nodes think this restarting node is up, the tokens are
already replicated to all the shards.

In addition, this patch also fixes the issue that other nodes might see
a node miss the TOKENS and STATUS application state in gossip if that
node failed in the middle of a restarting process, i.e., it is killed
after line 3 and before line 7. As a result we could not replace the
node.

Tests: update_cluster_layout_tests.py
Fixes: #4709
Fixes: #4723
(cherry picked from commit 3b39a59135)
2019-09-22 12:46:36 +03:00
Avi Kivity
1c40a0fcd2 Update seastar submodule
* seastar ea859b5840...af3fc691b9 (1):
  > iotune: fix exception handling in case test file creation fails

Fixes #5001.
2019-09-18 18:37:23 +03:00
Gleb Natapov
e10735852b messaging_service: configure different streaming domain for each rpc server
A streaming domain identifies a server across shards. Each server should
have different one.

Fixes: #4953

Message-Id: <20190908085327.GR21540@scylladb.com>
(cherry picked from commit 9e9f64d90e)
2019-09-09 20:37:40 +03:00
Avi Kivity
42433a25a8 Update seastar submodule
* seastar 445b5126c2...ea859b5840 (1):
  > perftune: fix missing import for logging

Fixes #4958.
2019-09-04 13:50:29 +03:00
Paweł Dziepak
d04d3fa653 mutation_partition: verify row::append_cell() precondition
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds assertion to make sure we fail early in such cases.
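The precondition check in sketch form (a simplified row model, not mutation_partition's actual types): assert that appended column ids are strictly increasing, so a violation fails early instead of silently producing an invalid row.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct row {
    std::vector<std::pair<unsigned, int>> cells;  // (column id, value), sorted by id
    void append_cell(unsigned id, int value) {
        // Precondition: id is larger than every existing cell's column id.
        assert(cells.empty() || cells.back().first < id);
        cells.emplace_back(id, value);
    }
};
```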

(cherry picked from commit 060e3f8ac2)
2019-08-23 15:06:18 +02:00
Avi Kivity
1bcc5a1b5c Merge "database: assign proper io priority for streaming view updates" from Piotr
"
Streamed view updates piggybacked on the write io priority, which is
reserved for user writes; they are now properly bound to the streaming
write priority.

Verified manually by checking appropriate io metrics: scylla_io_queue_total_bytes{class="streaming_write" ...} vs scylla_io_queue_total_bytes{class="query" ...}

Tests: unit(dev)
"

Fixes #4615.

* 'assign_proper_io_priority_to_streaming_view_updates' of https://github.com/psarna/scylla:
  db,view: wrap view update generation in stream scheduling group
  database: assign proper io priority for streaming view updates

(cherry picked from commit 2c7435418a)
2019-08-22 16:21:42 +03:00
Botond Dénes
450b9ac9bf multishard_combining_reader: shard reader: don't stop on non-full prefixes
This patch is a backport of the fix for #4733 (merged to master as
0cf4fab). As the shard reader code has been substantially refactored
post the 3.0 branch cut, that fix cannot be backported at all;
instead, this is a separate fix developed specifically for 3.0.

To quickly reiterate, the problem at hand is that when recreating a
previously evicted shard reader of a multishard reader, the position of
the last fragment seen by that reader is used as the position after
which the read resumes. For this we just created a clustering range
starting from *after* the key (open bound). This works well in most
cases, but when that last key is a non-full prefix, this will also ignore
any still-unread clustering rows that fall into that prefix.

This patch doesn't attempt to fix the problem in a systematic way like
the fix in master does (making sure reader recreation works properly
with prefixes as well); instead, for the sake of minimizing the impact,
we simply avoid ending the buffer on a prefix key. This fix is more
naive and can cause over-read when the stream contains lots of
successive range tombstones with prefix positions. On the other hand,
this leads to a *much* simpler fix, and anyway, as reader eviction is
much rarer in 3.0 this should have a lesser impact.

A unit test is also added to make sure the problem is fixed.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190819120748.28168-1-bdenes@scylladb.com>
2019-08-19 15:09:47 +03:00
Jenkins
b3bfd8c08d release: prepare for 3.0.10 by hagitsegev 2019-08-14 14:58:50 -04:00
Tomasz Grabiec
53c10b72dc Merge "Fix the system.size_estimates table" from Kamil
Fixes a segfault when querying for an empty keyspace.

Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combinind_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.

Fixes #4689
2019-08-14 15:31:54 +02:00
Kamil Braun
a690e20966 Fix infinite looping when performing a range query on system.size_estimates.
Queries to the system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside multishard_combining_reader::fill_buffer.
This happened because multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for size_estimates_mutation_reader.
This commit fixes the issue and closes #4689.
2019-08-14 12:51:33 +02:00
Kamil Braun
7172009a0d Fix segmentation fault when querying system.size_estimates for an empty keyspace. 2019-08-14 12:51:33 +02:00
Kamil Braun
cb688ef62e Refactor size_estimates_virtual_reader
Move the implementation of size_estimates_mutation_reader
to a separate compilation unit to speed up compilation times
and increase readability.

Refactor tests to use seastar::thread.
2019-08-14 12:51:27 +02:00
Kamil Braun
ff8265dd66 Fix command line argument parsing in main.
Command line arguments are parsed twice in Scylla: once in main and once in Seastar's app_template::run.
The first parse is there to check if the "--version" flag is present --- in this case the version is printed
and the program exists.  The second parsing is correct; however, most of the arguments were improperly treated
as positional arguments during the first parsing (e.g., "--network host" would treat "host" as a positional argument).
This happened because the arguments weren't known to the command line parser.
This commit fixes the issue by moving the parsing code until after the arguments are registered.
Resolves #4141.

Signed-off-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit f155a2d334)
2019-08-13 20:13:24 +03:00
Avi Kivity
a198db31dc Merge "Fix disable_sstable_write synchronization with on_compaction_completion" from Benny
"
disable_sstable_write needs to acquire _sstable_deletion_sem to properly synchronize
with background deletions done by on_compaction_completion to ensure no sstables will
be created or deleted during reshuffle_sstables after
storage_service::load_new_sstables disables sstable writes.

Fixes #4622

Test: unit(dev), nodetool_additional_test.py migration_test.py
"

* 'fix-disable-sstable-write-for-3.0' of https://github.com/bhalevy/scylla:
  table: document _sstables_lock/_sstable_deletion_sem locking order
  table: disable_sstable_write: acquire _sstable_deletion_sem
  table: uninline enable_sstable_write
2019-08-12 16:53:47 +03:00
Avi Kivity
094a2a4263 Merge "Catch unclosed partition sstable write #4794" from Tomasz
"
Not emitting partition_end for a partition is incorrect. SStable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.

It's better to catch this problem at the time of writing, and not
generate bad sstables.

Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.

Enabled for both mc and la/ka.

Passing --abort-on-internal-error on the command line will switch to
aborting instead of throwing an exception.

The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
"

* 'catch-unclosed-partition-sstable-write' of https://github.com/tgrabiec/scylla:
  sstables: writer: Validate that partition is closed when the input mutation stream ends
  config, exceptions: Add helper for handling internal errors
  utils: config_file: Introduce named_value::observe()

(cherry picked from commit 95c0804731)
(cherry picked from commit cf4c238b28)
2019-08-08 16:47:26 +03:00
Asias He
cc0b4d249b streaming: Send error code from the sender to receiver
In case of error on the sender side, the sender does not propagate the
error to the receiver. The sender will close the stream. As a result,
the receiver will get nullopt from the source in
get_next_mutation_fragment and pass mutation_fragment_opt with no value
to the generating_reader. In turn, the generating_reader generates end
of stream. However, the last element that the generating_reader has
generated can be any type of mutation_fragment. This makes the sstable
that consumes the generating_reader violate the mutation_fragment
stream rule.

To fix, we need to propagate the error. However, RPC streaming does not
support propagating errors in the framework, so the user has to send an
error code explicitly.

Fixes: #4789
(cherry picked from commit bac987e32a)

streaming: Move stream_mutation_fragments_cmd to a new file

Avoid including the stream_session.hh in messaging_service.hh.

More importantly, fix the build, because currently messaging_service.cc
and messaging_service.hh do not include stream_mutation_fragments_cmd.
I am not sure why it builds on my machine. Spotted this when backporting
the change to 3.0 branch.

Refs: #4789
(cherry picked from commit 49a73aa2fc)

streaming: Do not call rpc stream flush in send_mutation_fragments

The stream close() guarantees the data sent will be flushed. No need to
call the stream flush() since the stream is not reused.

Follow up fix for commit bac987e32a (streaming: Send error code from
the sender to receiver).

Fixes: #4789
(cherry picked from commit 288371ce75)
Message-Id: <87058e290ae3f59f874b860121786b22f24957c7.1565189319.git.asias@scylladb.com>
2019-08-08 11:41:25 +02:00
Asias He
e10afc7f50 messaging_service: Check if messaging_service is stopped before get_rpc_client
get_rpc_client assumes the messaging_service is not stopped. We should check
is_stopping() before we call get_rpc_client.

We do such check in existing code, e.g., send_message and friends. Do
the same check in the newly introduced
make_sink_and_source_for_stream_mutation_fragments() and friends for row
level repair.

Fixes: #4767
(cherry picked from commit 5d3e4d7b73)

Note: only the change for make_sink_and_source_for_stream_mutation_fragments is backported.
Message-Id: <06079d4e48ea81ba567a2f45be2ab3a51f042e28.1565189319.git.asias@scylladb.com>
2019-08-08 11:40:49 +02:00
Tomasz Grabiec
407dfe0d68 lsa: Fix spurious abort with --enable-abort-on-lsa-bad-alloc
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed when retried after memory
reclamation.

We should ignore bad_alloc thrown inside allocating section body and
fail only when the whole section fails.

Fixes #2924

Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dafe22dd83)
2019-08-08 11:39:39 +02:00
Raphael S. Carvalho
9370996a18 table: do not rely on undefined behavior in cleanup_sstables
It shouldn't rely on argument evaluation order, which is unspecified; relying on it is UB.

Fixes #4718.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry-picked from commit 0e732ed1cf)
2019-08-07 21:53:12 +03:00
Rafael Ávila de Espíndola
ac105dd2a7 mc writer: Fix exception safety when closing _index_writer
This fixes a possible cause of #4614.

From the backtrace in that issue, it looks like a file is being closed
twice. The first point in the backtrace where that seems likely is in
the MC writer.

My first idea was to add a writer::close and make it the responsibility
of the code using the writer to call it. That way we would move work
out of the destructor.

That is a bit hard since the writer is destroyed from
flat_mutation_reader::impl::~consumer_adapter and that would need to
get a close function too.

This patch instead just fixes an exception safety issue. If
_index_writer->close() throws, _index_writer is still valid and
~writer will try to close it again.

If the exception was thrown after _completed.set_value(), that would
explain the assert about _completed.set_value() being called twice.

With this patch the path outside of the destructor now moves the
writer to a local variable before trying to close it.

Fixes #4614
Message-Id: <20190710171747.27337-1-espindola@scylladb.com>

(cherry picked from commit 281f3a69f8)
2019-08-07 21:43:44 +03:00
Benny Halevy
1e62fc8aac table: document _sstables_lock/_sstable_deletion_sem locking order
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0e4567c881)
2019-08-07 17:09:47 +03:00
Benny Halevy
c724eee649 table: disable_sstable_write: acquire _sstable_deletion_sem
`disable_sstable_write` needs to acquire `_sstable_deletion_sem`
to properly synchronize with background deletions done by
`on_compaction_completion` to ensure no sstables will be created
or deleted during `reshuffle_sstables` after
`storage_service::load_new_sstables` disables sstable writes.

Fixes #4622

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6dad9baa1c)
2019-08-07 17:06:38 +03:00
Benny Halevy
ebb14d93c9 table: uninline enable_sstable_write
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit bbbd749f70)
2019-08-07 17:04:08 +03:00
Tomasz Grabiec
d77aaada86 sstables: ka/la: reader: Make sure push_ready_fragments() does not fail to emit partition_end
Currently, if there is a fragment in _ready and _out_of_range was set
after the row end was consumed, push_ready_fragments() would return
without emitting partition_end.

This is problematic once we make consume_row_start() emit
partition_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.

Fixes #4786

(cherry picked from commit 9b8ac5ecbc)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-08-01 13:06:56 +03:00
Avi Kivity
acd05e089f Update seastar submodule
* seastar 16641efb15...445b5126c2 (1):
  > reactor: fix deadlock of stall detector vs dlopen

Fixes #4759.
2019-07-31 18:33:28 +03:00
Avi Kivity
f591c9c710 sstable: index_reader: close index_reader::reader more robustly
If we had an error while reading, then we would have failed to close
the reader, which in turn can cause memory corruption. Make the
closing more robust by using then_wrapped (that doesn't skip on
exception) and log the error for analysis.

Fixes #4761.

(cherry picked from commit b272db368f)
2019-07-27 18:20:17 +03:00
Jenkins
dea4489078 release: prepare for 3.0.9 by hagitsegev 2019-07-24 12:09:49 +03:00
Raphael S. Carvalho
3172cc6bac sstables/compaction: Fix segfault when replacing expired sstable in incremental compaction
A fully expired sstable is not added to the compacting set, meaning it's not
actually compacted, but it's kept in a list of sstables which incremental
compaction uses to check if any sstable can be replaced.
Incremental compaction was unconditionally removing the expired sstable from
the compacting set, which led to a segfault because the end iterator was passed
to erase().

The fix is to change sstable_set::erase() to follow the standard semantics of
erase functions, which work even if the target element is not present.

Fixes #4085.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190130163100.5824-1-raphaelsc@scylladb.com>
(cherry picked from commit 930f8caff9)
2019-07-22 15:07:00 +03:00
Asias He
840d466c4d streaming: Do not open rpc stream connection if ranges are not relevant to a shard
Given a list of ranges to stream, stream_transfer_task will create a
reader with the ranges and create an rpc stream connection on all the shards.

When the user provides ranges to repair with -st -et options, e.g.,
using scylla-manager, such ranges can belong to only one shard; repair
will pass such ranges to streaming.

As a result, only one shard will have data to send while the rpc stream
connections are created on all the shards, which can cause the kernel
to run out of ports on some systems.

To mitigate the problem, do not open the connection if the ranges do not
belong to the shard at all.

Refs: #4708
(cherry picked from commit 64a4c0ede2)
2019-07-21 10:24:21 +03:00
Kamil Braun
e30c289835 Fix timestamp_type_impl::timestamp_from_string.
Now it accepts the 'z' or 'Z' timezone, denoting UTC+00:00.
Fixes #4641.

Signed-off-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit 4417e78125)
2019-07-17 21:56:03 +03:00
Eliran Sinvani
f769828a68 auth: Prevent race between role_manager and password_authenticator
When scylla is started for the first time with PasswordAuthenticator
enabled, it can be that a record of the default superuser
will be created in the table with the can_login and is_superuser
columns set to null. It happens because the module in charge of creating
the row is the role manager and the module in charge of setting the
default password salted hash value is the password authenticator.
Those two modules are started together. In the case when the
password authenticator finishes initialization first, then until the
role manager completes its initialization, the row
contains those null columns, and any login attempt in this period
will cause a memory access violation since those columns are not
expected to ever be null. This patch removes the race by starting
the password authenticator and authorizer only after the role manager
has finished its initialization.

Tests:
  1. Unit tests (release)
  2. Auth and cqlsh auth related dtests.

Fixes #4226

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190714124839.8392-1-eliransin@scylladb.com>
(cherry picked from commit 997a146c7f)
2019-07-15 21:18:24 +03:00
kbr-
7d743563bf Implement tuple_type_impl::to_string_impl. (#4645)
Resolves #4633.

Signed-off-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit 8995945052)
2019-07-08 11:11:30 +03:00
Jenkins
23da53c4f3 release: prepare for 3.0.8 by hagitsegev 2019-06-27 11:12:21 +03:00
Piotr Sarna
d4df119735 main: stop view builder conditionally
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.

Fixes #4589

(cherry picked from commit efa7951ea5)
2019-06-26 11:05:13 +03:00
Avi Kivity
bdcbf4aa4e Merge "Backport fixing infinite paging for indexed queries" from Piotr
"
This series backports fixing infinite paging for indexed queries
to branch-3.0.

Tests: unit(dev)
"

Fixes #4569

* 'fix_infinite_paging_for_indexed_queries_for_3.0' of https://github.com/psarna/scylla:
  tests: add test case for finishing index paging
  cql3: fix infinite paging for indexed queries
2019-06-25 11:56:11 +03:00
Avi Kivity
e80cd9dfed Merge "Backport fixing ignoring ck restrictions in filtering" from Piotr
"
Tests: unit(dev)
Refs #4541
"

* 'fix_ignoring_ck_restrictions_in_filtering_for_3.0_2' of https://github.com/psarna/scylla:
  tests: add a test case for filtering clustering key
  cql3: fix qualifying clustering key restrictions for filtering
  cql3: fix fetching clustering key columns for filtering
2019-06-25 11:56:11 +03:00
Piotr Sarna
87fd298a6e tests: add a test case for filtering clustering key
The test case makes sure that clustering key restriction
columns are fetched for filtering if they form a clustering key prefix,
but not a primary key prefix (partition key columns are missing).

Ref #4541
2019-06-25 10:05:34 +02:00
Piotr Sarna
7dce5484c2 cql3: fix qualifying clustering key restrictions for filtering
Clustering key restrictions can sometimes avoid filtering if they form
a prefix, but that can happen only if the whole partition key is
restricted as well.

Ref #4541
2019-06-25 10:05:34 +02:00
Piotr Sarna
23df964b96 cql3: fix fetching clustering key columns for filtering
When a column is not present in the select clause, but used for
filtering, it usually needs to be fetched from replicas.
Sometimes it can be avoided, e.g. if primary key columns form a valid
prefix - then, they will be optimized out before filtering itself.
However, clustering key prefix can only be qualified for this
optimization if the whole partition key is restricted - otherwise
the clustering columns still need to be present for filtering.

This commit also fixes tests in cql_query_test suite, because they now
expect more values - columns fetched for filtering will be present as
well (only internally, the clients receive only data they asked for).

Fixes #4541
2019-06-25 10:05:27 +02:00
Piotr Sarna
fcab0d1392 tests: add test case for finishing index paging
The test case makes sure that paging over indexes does not result
in an infinite loop.

Refs #4569

(cherry picked from commit b8cadc928c)
2019-06-24 10:14:35 +02:00
Piotr Sarna
a0c4a8501e cql3: fix infinite paging for indexed queries
Indexed queries need to translate between view table paging state
and base table paging state, in order to be able to page the results
correctly. One of the stages of this translation is overwriting
the paging state obtained from the base query, in order to return
view paging state to the user, so it can be used for fetching next
pages. Unfortunately, in the original implementation the paging
state was overwritten only if more pages were available,
while if 'remaining' pages were equal to 0, nothing was done.
This is not enough, because the paging state of the base query
needs to be overwritten unconditionally - otherwise a guard paging state
value of 'remaining == 0' is returned back to the client along with
'has_more_pages = true', which will result in an infinite loop.
This patch correctly overwrites the base paging state unconditionally.

Fixes #4569

(cherry picked from commit 88f3ade16f)
2019-06-24 09:37:06 +02:00
Nadav Har'El
b6fa715f7b storage_proxy: fix race and crash in case of MV and other node shutdown
Recently, in merge commit 2718c90448,
we added the ability to cancel pending view-update requests when we detect
that the target node went down. This is important for view updates because
these have a very long timeout (5 minutes), and we wanted to make this
timeout even longer.

However, the implementation caused a race: Between *creating* the update's
request handler (create_write_response_handler()) and actually starting
the request with this handler (mutate_begin()), there is a preemption point
and we may end up deleting the request handler before starting the request.
So mutate_begin() must gracefully handle the case of a missing request
handler, and not crash with a segmentation fault as it did before this patch.

Eventually the lifetime management of request handlers could be refactored
to avoid this delicate fix (which requires more comments to explain than
code), or even better, it would be more correct to cancel individual writes
when a node goes down, not drop the entire handler (see issue #4523).
However, for now, let's not do such invasive changes and just fix the bug
that we set out to fix.

Fixes #4386.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190620123949.22123-1-nyh@scylladb.com>
(cherry picked from commit 6e87bca65d)
2019-06-23 21:13:10 +03:00
Avi Kivity
9b3ca26d7f Merge "Fix deciding whether a query uses indexing" from Piotr
"
This series backports fixing deciding whether a query uses indexing
for branch-3.0

Fixes #4539
Branches: 3.0
"

* 'fix_deciding_whether_a_query_uses_indexing_for_3.0' of https://github.com/psarna/scylla:
  tests: add case for partition key index and filtering
  cql3: fix deciding if a query uses indexing
2019-06-18 14:41:47 +03:00
Piotr Sarna
7b8e570e6c tests: add case for partition key index and filtering
The test ensures that partition key index does not influence
filtering decisions for regular columns.

Ref #4539
2019-06-18 13:35:32 +02:00
Piotr Sarna
a947f2cd84 cql3: fix deciding if a query uses indexing
The code that decides whether a query should use indexing
was buggy - a partition key index might have influenced the decision
even if the whole partition key was passed in the query (which
effectively means that indexing it is not necessary).

Fixes #4539
2019-06-18 13:19:31 +02:00
Avi Kivity
5ce5f61b08 Update seastar submodule
* seastar f541231...16641ef (1):
  > build: add libatomic to install-depenencies.sh

Fixes #4562.
2019-06-17 13:52:04 +03:00
Piotr Jastrzebski
7b65ec866b sstables: distinguish empty and missing cellpath
Before this patch the mc sstables writer was ignoring
empty cellpaths. This is wrong behaviour because
it is possible to have an empty key in a map. In such a case,
our writer creates a wrong sstable that we can't read back.
This is because a complex cell expects a cellpath for each
simple cell it has. When the writer ignores an empty cellpath
it writes nothing, and instead it should write a length
of zero to the file so that we know there's an empty cellpath.

Fixes #4533

Tests: unit(release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <46242906c691a56a915ca5994b36baf87ee633b7.1560532790.git.piotr@scylladb.com>
(cherry picked from commit a41c9763a9)
2019-06-16 09:06:37 +03:00
Jenkins
4c16c1fe1b release: prepare for 3.0.7 by hagitsegev 2019-05-26 22:30:19 +03:00
Paweł Dziepak
f2d2a9f5b8 Merge "Fix empty counters handling in MC" from Piotr
"
Before this patchset empty counters were incorrectly persisted for
MC format. No value was written to disk for them. The correct way
is to still write a header that informs the counter is empty.

We also need to make sure that reading wrongly persisted empty
counters works because customers may have sstables with wrongly
persisted empty counters.

Fixes #4363
"

* 'haaawk/4363/v3' of github.com:scylladb/seastar-dev:
  sstables: add test for empty counters
  docs: add CorrectEmptyCounters to sstable-scylla-format
  sstables: Add a feature for empty counters in Scylla.db.
  sstables: Write header for empty counters
  sstables: Remove unused variables in make_counter_cell
  sstables: Handle empty counter value in read path

(cherry picked from commit 899ebe483a)
2019-05-24 06:23:38 +03:00
Gleb Natapov
cb3b687492 cache_hitrate_calculator: make cache hitrate calculation preemptible
The calculation is done in a non-preemptible loop over all tables, so if
the number of tables is very large it may take a while, since we also build
a string for gossiper state. Make the loop preemptible and also make
the string calculation more efficient by preallocating memory for it.
Message-Id: <20190516132748.6469-3-gleb@scylladb.com>

(cherry picked from commit 31bf4cfb5e)
2019-05-17 14:38:38 +02:00
Gleb Natapov
1bb84cdbcf cache_hitrate_calculator: do not copy stats map for each cpu
invoke_on_all() copies provided function for each shard it is executed
on, so by moving stats map into the capture we copy it for each shard
too. Avoid it by putting it into the top level object which is already
captured by reference.
Message-Id: <20190516132748.6469-2-gleb@scylladb.com>

(cherry picked from commit 4517c56a57)
2019-05-17 12:40:45 +02:00
Gleb Natapov
b6307d54be cache_hitrate_calculator: wait for ongoing calculation to complete during stop
Currently stop returns ready future immediately. This is not a problem
since calculation loop holds a shared pointer to the local service, so
it will not be destroyed until calculation completes and global database
object db, which is also used by the calculation, is never destroyed. But the
latter is just a workaround for a shutdown sequence that cannot handle
it and will be changed one day. Make the cache hitrate calculation service
ready for it.

Message-Id: <20190422113538.GR21208@scylladb.com>
(cherry picked from commit c6b3b9ff13)
2019-05-17 12:40:41 +02:00
Avi Kivity
a20000c1a2 Merge "multishard reader: fix handling of non strictly monotonous positions" from Botond
"
The shard readers of the multishard reader assumed that the positions in
the data stream are strictly monotonic. This assumption is invalid.
Range tombstones can share positions with other range
tombstones and/or a clustering row. The effect of this false assumption
was that when a shard reader was evicted while the last seen
fragment was a range tombstone, on recreation it would skip any unseen
fragments that have the same position as that of the last seen range
tombstone.

This series contains some additional fixes for the
`flat_mutation_reader_from_mutations()` reader, to make the backported unit
tests pass.

Fixes: #4418

Tests: unit(release: network_topology_strategy_test times out - don't
think it is related to these changes)
"

* 'multishard_reader_handle_non_strictly_monotonous_positions-branch-3.0/v1' of https://github.com/denesb/scylla:
  tests: add unit test for multishard reader correctly handling non-strictly monotonous positions
  flat_mutation_reader: add make_flat_mutation_reader_from_fragments() overload with range and slice
  flat_mutation_reader: add flat_mutation_reader_from_mutations() overload with range and slice
  flat_mutation_reader_from_mutations: destroy all remaining mutations
  flat_mutation_reader_from_mutations: fix empty range case
  multishard_combining_reader: fix handling of non-strictly monotonous positions
  position_in_partition_view: add region() accessor
2019-05-06 11:42:29 +03:00
Botond Dénes
b3cbc2e58a tests: add unit test for multishard reader correctly handling non-strictly monotonous positions
(cherry picked from commit aa18bb33b9)
2019-05-06 11:19:04 +03:00
Botond Dénes
e4c1c4f052 flat_mutation_reader: add make_flat_mutation_reader_from_fragments() overload with range and slice
To be able to support this new overload, the reader is made
partition-range aware. It will now correctly only return fragments that
fall into the partition-range it was created with. For completeness'
sake and to be able to test it, also implement
`fast_forward_to(const dht::partition_range)`. Slicing is done by
filtering out non-overlapping fragments from the initial list of
fragments. Also add a unit test that runs it through the mutation_source
test suite.

(cherry picked from commit 51e81cf027)
2019-05-06 11:19:04 +03:00
Benny Halevy
bfe3b4cc59 time_window_backlog_tracker: fix use after free
Fixes #4465

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190430094209.13958-1-bhalevy@scylladb.com>
(cherry picked from commit 3a2fa82d6e)
2019-05-06 09:38:31 +03:00
Botond Dénes
6a4bc5bd71 flat_mutation_reader: add flat_mutation_reader_from_mutations() overload with range and slice
To be able to run the mutation-source test suite with this reader. In
the next patch, this reader will be used in testing another reader, so
it is important to make sure it works correctly first.

(cherry picked from commit bc08f8fd07)
2019-05-06 09:17:48 +03:00
Paweł Dziepak
6c818bcec0 flat_mutation_reader_from_mutations: destroy all remaining mutations
If the reader is fast-forwarded to another partition range mutation_ may
be left with some partial mutations. Make sure that those are properly
destroyed.

(cherry picked from commit 048ed2e3d3)
2019-05-06 09:17:17 +03:00
Paweł Dziepak
1598d358f0 flat_mutation_reader_from_mutations: fix empty range case
An iterator shall not be dereferenced until it is verified that it is
dereferencable.

(cherry picked from commit d50cd31eee)
2019-05-06 09:17:17 +03:00
Botond Dénes
7252715c69 multishard_combining_reader: fix handling of non-strictly monotonous positions
The shard readers under a multishard reader are paused every time the
read moves to another shard. When paused they can be evicted at any time.
When this happens, they will be re-created lazily on the next
operation, with a start position such that they continue reading from
where the evicted reader left off. This start position is determined
from the last fragment seen by the previous reader. When this position
is a clustering position, the reader will be recreated such that it reads
the clustering range (from the half-read partition): (last-ckey, +inf).
This can cause problems if the last fragment seen by the evicted reader
was a range-tombstone. Range tombstones can share the same clustering
position with other range tombstones and potentially one clustering row.
This means that when the reader is recreated, it will start from the
next clustering position, ignoring any unread fragments that share the
same position as the last seen range tombstone.
To fix, ensure that when pausing the reader, we extract all fragments
for the last position. To this end, when the last extracted fragment
is a range tombstone (with pos x), we continue reading until we see a
fragment with a position y that is greater. This way it is ensured that
we have seen all fragments for pos x and it is safe to resume the read,
starting from after position x.

(cherry picked from commit eba310163d)
2019-05-06 09:17:17 +03:00
Botond Dénes
37e143cba5 position_in_partition_view: add region() accessor
(cherry picked from commit b30af48c83)
2019-05-06 09:17:17 +03:00
Jenkins
bf68fae01b release: prepare for 3.0.6 by penberg 2019-05-03 14:14:31 +03:00
Gleb Natapov
d566466fca batchlog_manager: fix array out of bound access
endpoint_filter() function assumes that each bucket of
std::unordered_multimap contains elements with the same key only, so
its size can be used to know how many elements with a particular key
are there.  But this is not the case: elements with different keys may
share a bucket. Fix it by counting keys in another way.

Fixes #3229

Message-Id: <20190501133127.GE21208@scylladb.com>
(cherry picked from commit 95c6d19f6c)
2019-05-03 11:59:29 +03:00
Avi Kivity
e32e682911 Merge "SI: Add virtual columns to underlying MV" from Duarte
"
Virtual columns are MV-specific columns that contribute to the
liveness of view rows. However, we were not adding those columns when
creating an index's underlying MV, causing indexes to miss base rows.

Fixes #4144
Branches: master, branch-3.0
"

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

* 'sec-index/virtual-columns/v1' of https://github.com/duarten/scylla:
  tests/secondary_index_test: Add reproducer for #4144
  index/secondary_index_manager: Add virtual columns to MV

(cherry picked from commit ebf179318c)
2019-05-01 12:58:35 +01:00
Tomasz Grabiec
3c46bbf244 lsa: Fix compact_and_evict() being called with a too low step
compact_and_evict gets memory_to_release in bytes while
reclamation step is in segments.

Broken in f092decd90.

It doesn't make much difference with the current default step of 1
segment since we cannot reclaim less than that, so shouldn't cause
problems in practice.

Ref #4445

Message-Id: <1556013920-29676-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 21fbf59fa8)
2019-04-23 23:10:38 +03:00
Gleb Natapov
5567cf4b1b cache_hitrate_calculator: fix use after free in non_system_filter lambda
non_system_filter lambda is defined static, which means it is initialized
only once, so the 'this' that it captures will belong to the shard
where the function runs first. During service destruction the function
may run on a different shard and access another shard's service that
may already have been freed.
Fixes #4425

Message-Id: <20190421152139.GN21208@scylladb.com>
(cherry picked from commit 306f5b99b5)
2019-04-22 09:52:42 +03:00
Tomasz Grabiec
733c04ad50 lsa: Fix potential bad_alloc even though evictable memory exists
When we start the LSA reclamation it can be that
segment_pool::_free_segments is 0 under some conditions and
segment_pool::_current_emergency_reserve_goal is set to 1. The
reclamation step is 1 segment, and compact_and_evict_locked() frees 1
segment back into the segment_pool. However,
segment_pool::reclaim_segments() doesn't free anything to the standard
allocator because the condition _free_segments >
_current_emergency_reserve_goal is false. As a result,
tracker::impl::reclaim() returns 0 as the amount of released memory,
tracker::reclaim() returns
memory::reclaiming_result::reclaimed_nothing and the seastar allocator
thinks it's a real OOM and throws std::bad_alloc.

The fix is to change compact_and_evict() to make sure that reserves
are met, by releasing more if they're not met at entry.

This change also allows us to drop the variant of allocate_segment()
which accepts the reclamation step as a means to refill reserves
faster. This is now not needed, because compact_and_evict() will look
at the reserve deficit to increase the amount of memory to reclaim.

Fixes #4445

Message-Id: <1555671713-16530-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f092decd90)
2019-04-20 16:44:49 +03:00
Raphael S. Carvalho
05913b6f58 database: fix 2x increase in disk usage during cleanup compaction
Don't hold reference to sstables cleaned up, so that file descriptors
for their index and data files will be closed and consequently disk
space released.

Fixes #3735.

Backport note:
To reduce risk considerably, we'll not backport a mechanism to release
sstable introduced in incremental compaction work.
Instead, only one sstable is passed to table::cleanup_sstables() at a
time (it won't affect performance because the operation is serialized
anyway), to make it easy to release reference to cleaned sstable held
by compaction manager.

tests: release mode; manually checked cleanup's disk space issue is gone.

(cherry picked from commit 5bc028f78b)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190417155024.27359-1-raphaelsc@scylladb.com>
2019-04-17 18:01:48 +01:00
Duarte Nunes
79cf277ea2 db/schema_tables: Diff tables using ID instead of name
Currently we diff schemas based on table/view name, and if the names
match, then we detect altered schemas by comparing the schema
mutations. This fails to detect transitions which involve dropping and
recreating a schema with the same name, if a node receives these
notifications simultaneously (for example, if the node was temporarily
down or partitioned).

Note that because the ID is persisted and created when executing a
create_table_statement, then even if a schema is re-created with the
exact same structure as before, we will still consider it altered
because the mutations will differ.

This also stops schema pulling from working, since it relies on schema
merging.

The solution is to diff schemas using their ID, and not their name.

Keyspaces and user types are also susceptible to this, but in their
case it's fine: these are values with no identity, and are just
metadata. Dropping and recreating a keyspace can be viewed as dropping
all tables from the keyspace, altering it, and eventually adding new
tables to the keyspace.

Note that this solution doesn't apply to tables dropped and created
with the same ID (using the `WITH ID = {}` syntax). For that, we would
need to detect deltas instead of applying changes and then reading the
new state to find differences. However, this solution is enough,
because tables are created with the `WITH ID = {}` syntax only for very specific,
peculiar reasons. The original motivation meant for the new table to
be treated exactly as the old, so the current behavior is in fact the
desired one.

Tests: unit(release), dtests(schema_test, schema_management_test)

Fixes #3797

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230932.47153-2-duarte@scylladb.com>
(cherry picked from commit 40a30d4129)
2019-04-17 18:01:48 +01:00
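The ID-vs-name distinction above can be sketched with a toy diff; `table_id` and the textual schema definitions below are hypothetical stand-ins for the real types:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using table_id = int;
struct diff_result { std::set<table_id> created, dropped, altered; };

// Keyed by ID, a drop+recreate with the same name shows up as one dropped
// and one created entry; keyed by name it could look like "no change".
diff_result diff_by_id(const std::map<table_id, std::string>& before,
                       const std::map<table_id, std::string>& after) {
    diff_result d;
    for (const auto& [id, def] : before) {
        auto it = after.find(id);
        if (it == after.end()) d.dropped.insert(id);
        else if (it->second != def) d.altered.insert(id);
    }
    for (const auto& [id, def] : after) {
        (void)def;
        if (!before.count(id)) d.created.insert(id);
    }
    return d;
}
```

With this, re-creating table `t` under a new ID is reported as a drop plus a create, which is the transition the commit needs to detect.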
Duarte Nunes
03ada48b40 db/schema_tables: Drop tables before creating new ones
Doing it by the inverse order doesn't support dropping and creating a
schema with the same name.

Refs #3797

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230932.47153-1-duarte@scylladb.com>
(cherry picked from commit e404f09a23)
2019-04-17 18:01:48 +01:00
Duarte Nunes
394afae3a8 service/migration_manager: Validate duplicate ID in time
We allow tables to be created with the ID property, mostly for
advanced recovery cases. However, we need to validate that the ID
doesn't match an existing one. We currently do this in
database::add_column_family(), but this is already too late in the
normal workflow: if we allow the schema change to go through, then
it is applied to the system tables and loaded the next time the node
boots, regardless of us throwing from database::add_column_family().

To fix this, we perform this validation when announcing a new table.

Note that the check wasn't removed from database::add_column_family();
it's there since 2015 and there might have been other reasons to add
it that are not related to the ID property.

Refs #2059

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230142.46743-1-duarte@scylladb.com>
(cherry picked from commit 7ba944a243)
2019-04-17 18:01:48 +01:00
Tomasz Grabiec
69d0b1e15c schema_tables: Serialize schema merges fairly
All schema changes made to the node locally are serialized on a
semaphore which lives on shard 0. For historical reasons, they don't
queue but rather try to take the lock without blocking and retry on
failure with a random delay from the range [0, 100 us]. Contenders
which do not originate on shard 0 will have an extra disadvantage as
each lock attempt will be longer by the across-shard round trip
latency. If there is constant contention on shard 0, contenders
originating from other shards may keep losing the race for the lock.

Schema merge executed on behalf of a DDL statement may originate on
any shard. Same for the schema merge which is coming from a push
notification. Schema merge executed as part of the background schema
pull will originate on shard 0 only, where the application state
change listeners run. So if there are constant schema pulls, DDL
statements may take a long time to get through.

The fix is to serialize merge requests fairly, by using the blocking
semaphore::wait(), which is fair.

We don't have to back-off any more, since submit_to() no longer has a
global concurrency limit.

Fixes #4436.

Message-Id: <1555349915-27703-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 3fd82021b1)
2019-04-16 10:19:45 +03:00
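The fairness property relied on here is that a blocking semaphore::wait() grants the lock in arrival order, so no contender starves. A standalone toy sketch of that FIFO behavior (not Seastar's implementation):

```cpp
#include <cassert>
#include <deque>

// Waiters queue in FIFO order; release() hands the lock to the oldest
// waiter, unlike try-lock + random-delay retry where a contender with
// higher per-attempt latency can lose indefinitely.
class fair_semaphore {
    bool _held = false;
    std::deque<int> _waiters;  // waiter ids, in arrival order
public:
    // Returns the id now holding the lock, or -1 if the caller must wait.
    int acquire(int id) {
        if (!_held && _waiters.empty()) { _held = true; return id; }
        _waiters.push_back(id);
        return -1;
    }
    // Returns the id the lock passes to, or -1 if nobody was waiting.
    int release() {
        if (_waiters.empty()) { _held = false; return -1; }
        int next = _waiters.front();
        _waiters.pop_front();
        return next;
    }
};
```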
Avi Kivity
403f66ecad Revert "Revert "compaction: fix use-after-free when calculating backlog after schema change""
This reverts commit 841ceac4f9. It was reverted because the test
failed; this turned out to be due to a miscompile of the test.

With this additional fix, the test compiles correctly:

--- a/tests/sstable_datafile_test.cc
+++ b/tests/sstable_datafile_test.cc
@@ -4785,11 +4785,11 @@ SEASTAR_TEST_CASE(backlog_tracker_correctness_after_stop_tracking_compaction) {

             auto fut = sstables::compact_sstables(sstables::compaction_descriptor(ssts), *cf, sst_gen);

             bool stopped_tracking = false;
             for (auto& info : cf._data->cm.get_compactions()) {
-                if (info->cf == &*cf) {
+                if (info->cf == cf->schema()->cf_name()) {
                     info->stop_tracking();
                     stopped_tracking = true;
                 }
             }
             BOOST_REQUIRE(stopped_tracking);

info->cf is an sstring, and &*cf is a table*. It's not clear how the compiler
was able to compare an sstring and a pointer.
2019-04-12 21:45:59 +03:00
Shlomi Livne
841ceac4f9 Revert "compaction: fix use-after-free when calculating backlog after schema change"
This reverts commit 2b326fc7fa.
2019-04-12 09:55:06 +03:00
Avi Kivity
0fce4b228e Update seastar submodule
* seastar e9bb565...f541231 (1):
  > Merge "perftune.py: log NVMe IRQ SMP" from Vlad

Fixes #4057.
2019-04-12 00:19:58 +03:00
Botond Dénes
2336c092a0 types: fix date_type_impl::less() (timestamp cql type)
date_type_impl::less() invokes `compare_unsigned()` to compare the
underlying raw byte values. `compare_unsigned()` is a tri-state comparator,
however `date_type_impl::less()` implicitly converted the returned
value to bool. In effect, `date_type_impl::less()` would *always* return
`true` when the two compared values were not equal.

Found while working on a unit test which employs a randomly generated
schema to test a component.


Fixes #4419.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8a17c81bad586b3772bf3d1d1dae0e3dc3524e2d.1554907100.git.bdenes@scylladb.com>
(cherry picked from commit f201f8abab)
2019-04-12 00:19:58 +03:00
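The bug pattern is easy to reproduce in isolation; this is a minimal sketch (not the actual Scylla code) of a tri-state comparator being misused as a boolean:

```cpp
#include <cassert>
#include <cstring>

// Tri-state comparator: returns <0, 0, or >0, like compare_unsigned().
int compare_unsigned(const void* a, const void* b, std::size_t n) {
    return std::memcmp(a, b, n);
}

// The bug: implicitly converting the tri-state result to bool makes
// less() return true for *any* non-equal pair.
bool less_buggy(const void* a, const void* b, std::size_t n) {
    return compare_unsigned(a, b, n);
}

// The fix: "less" means a strictly negative comparison result.
bool less_fixed(const void* a, const void* b, std::size_t n) {
    return compare_unsigned(a, b, n) < 0;
}
```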
Raphael S. Carvalho
2b326fc7fa compaction: fix use-after-free when calculating backlog after schema change
The problem happens after a schema change because we fail to properly
remove ongoing compaction, which stopped being tracked, from list that
is used to calculate backlog, so it may happen that a compaction read
monitor (ceases to exist after compaction ends) is used after freed.

Fixes #4410.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190409024936.23775-1-raphaelsc@scylladb.com>
(cherry picked from commit 8a117c338a)
2019-04-12 00:19:58 +03:00
Asias He
a62edaf7a9 streaming: Reject stream if the _sys_dist_ks or _view_update_generator are not ready
They are of type db::system_distributed_keyspace and
db::view::view_update_generator.

n1 is in normal status
n2 boots up and _sys_dist_ks or _view_update_generator are not
initialized
n1 runs stream, n2 is the follower.
n2 uses the _sys_dist_ks or _view_update_generator
"Assertion `local_is_initialized()' failed" is observed

Fixes #4360

Message-Id: <4ae13e1640ac8707a9ba0503a2744f6faf89ecf4.1554330030.git.asias@scylladb.com>
(cherry picked from commit f212dfb887)
2019-04-11 07:38:53 +03:00
Takuya ASADA
d527ef19f7 dist/docker/redhat: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190409094139.4797-2-syuu@scylladb.com>
2019-04-10 16:10:30 +03:00
Takuya ASADA
8568dc94f4 dist/ami: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190409094139.4797-1-syuu@scylladb.com>
2019-04-10 16:10:29 +03:00
Takuya ASADA
6e51a95668 dist/redhat: switch to python36
Since EPEL switched python3 default version to 3.6, we need to follow
the change.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190408160934.32701-1-syuu@scylladb.com>
2019-04-08 19:12:59 +03:00
Avi Kivity
071191b967 Update seastar submodule
> fstream: remove default extent allocation hint
  > deleter: prevent early memory free caused by deleter append.
  > iotune: throw exception if no data was collected
  > perftune.py: fix irqbalance tuning on Ubuntu 18
  > fix use after free in rpc server handler

Fixes #4336
Fixes #4400
Fixes #4401
Fixes #4402
Fixes #4404
Ref #3867
2019-04-03 16:32:45 +03:00
Shlomi Livne
c6c841c34f Revert "Merge 'Add canceling long-standing view update requests' from Piotr"
This series introduces regression in dtests
materialized_views_test/TestMaterializedViews/interrupt_build_process_with_resharding_*.

This reverts commit b2227c7a5e.

Ref #3826
Ref #3966
Ref #4028

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <a3aea137bfde956241acc6d57e1c387a8202486c.1554116404.git.shlomi@scylladb.com>
2019-04-01 14:07:47 +03:00
Nadav Har'El
83a8f779bb view_complex_test: fix another ttl
In a previous patch I fixed most TTLs in the view_complex_test.cc tests
from low numbers to 100 seconds. I missed one. This one never caused
problems in practice, but for good form, let's fix it too.

Ref #3918.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115160234.26478-1-nyh@scylladb.com>
(cherry picked from commit 45f05b06d2)
2019-03-31 12:28:37 +03:00
Nadav Har'El
6066968e33 view_complex_test: increase low ttl which may fail test on busy machine
Several of the tests in tests/view_complex_test.cc set a cell with a
TTL, and then skip time ahead artificially with forward_jump_clocks(),
to go past the TTL time and check the cell disappeared as expected.

The TTLs chosen for these tests were arbitrary numbers - some had 3 seconds,
some 5 seconds, and some 60 seconds. The actual number doesn't matter - it
is completely artificial (we move the clock with forward_jump_clocks() and
never really wait for that amount of time) and could very well be a million
seconds. But *low* numbers, like the 3 seconds, present a problem on extremely
overcommitted test machines. Our eventually() function already allows for
the possibility that things can hang for up to 8 seconds, but with a 3 second
TTL, we can find ourselves with data being expired and the test failing just
after 3 seconds of wall time have passed - while the test intended that the
data will expire only when we explicitly call forward_jump_clocks().

So this patch changes all the TTLs in this test to be the same high number -
100 seconds. This hopefully fixes #3918.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115125607.22647-1-nyh@scylladb.com>
(cherry picked from commit 4108458b8e)
2019-03-31 12:27:48 +03:00
Tomasz Grabiec
d3d877b9db Merge "db/view: Apply tracked tombstones for new updates" from Duarte
When generating view updates for base mutations when no pre-existing
data exists, we were forgetting to apply the tracked tombstones.

Fixes #4321
Tests: unit(dev)

* https://github.com/duarten/scylla materialized-views/4321/v1.1:
  db/view: Apply tracked tombstones for new updates
  tests/view_schema_test: Add reproducer for #4321

(cherry picked from commit 2b8bf0dbf8)
2019-03-27 21:56:21 +00:00
Piotr Sarna
5ec646cb4e types: fix varint and decimal serialization
Varint and decimal types serialization did not update the output
iterator after generating a value, which may lead to corrupted
sstables - variable-length integers were properly serialized,
but if anything followed them directly in the buffer (e.g. in a tuple),
their value will be overwritten.

Fixes #4348

Tests: unit (dev)
dtest: json_test.FromJsonUpdateTests.complex_data_types_test
       json_test.FromJsonInsertTests.complex_data_types_test
       json_test.ToJsonSelectTests.complex_data_types_test

Note that dtests still do not succeed 100% due to formatting differences
in compared results (e.g. 1.0e+07 vs 1.0E7), but it's no longer a query
correctness issue.

(cherry picked from commit 287a02dc05)
2019-03-26 16:38:37 +02:00
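The class of bug fixed here, a serializer that writes its bytes but does not advance the output iterator, can be sketched in isolation (the names below are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using out_it = std::vector<uint8_t>::iterator;

// The bug: the bytes are written, but the caller gets back the
// *unadvanced* iterator, so the next field overwrites this one.
out_it write_buggy(out_it out, const std::vector<uint8_t>& field) {
    std::copy(field.begin(), field.end(), out);
    return out;
}

// The fix: return the advanced position (std::copy already provides it).
out_it write_fixed(out_it out, const std::vector<uint8_t>& field) {
    return std::copy(field.begin(), field.end(), out);
}
```

Serializing two fields back to back shows the corruption: with the buggy version the second field lands on top of the first, exactly the kind of overwrite described for values following a varint in a tuple.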
Jenkins
68b54b2e52 release: prepare for 3.0.5 by hagitsegev 2019-03-26 10:52:59 +02:00
Duarte Nunes
66a48746b8 service/storage_proxy: Don't consider view hints for MV backpressure
When a view replica becomes unavailable, updates to it are stored as
hints at the paired base replica. This on-disk queue of pending view
updates grows as long as there are view updates and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.

However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.

There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.

The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.

Fixes #4351
Fixes #4352

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>
(cherry picked from commit 93a1c27b31)
2019-03-25 15:02:06 +02:00
Duarte Nunes
b2227c7a5e Merge 'Add canceling long-standing view update requests' from Piotr
"
This series allows canceling view update requests when a node is
discovered DOWN. View updates are sent in the background with long
timeout (5 minutes), and in case we discover that the node is
unavailable, there's no point in waiting that long for the request
to finish. What's more, waiting for these requests occurs on shutdown,
which may result in waiting 5 minutes until Scylla properly shuts down,
which is bad for both users and dtests.

This series implements storage_proxy as a lifecycle subscriber,
so it can react to membership changes. It also keeps track of all
"interruptible" writes per endpoint, so once a node is detected as DOWN,
an artificial timeout can be triggered for all aforementioned write
requests.

Fixes #3826
Fixes #3966
Fixes #4028
"

* 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla:
  service: remove unused stop_hints_manager
  storage_proxy: add drain_on_shutdown implementation
  main: register storage proxy as lifecycle subscriber
  storage_proxy: add endpoint_lifecycle_subscriber interface
  storage_proxy: register view update handlers for view write type
  storage_proxy: add intrusive list of view write handlers
  storage_proxy: add view_update_write_response_handler

(cherry picked from commit 2718c90448)
2019-03-24 17:27:18 +02:00
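The per-endpoint tracking idea can be sketched in miniature (all names below are hypothetical, not the storage_proxy types):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

struct handler_registry {
    std::multimap<std::string, int> pending;  // target endpoint -> handler id
    std::set<int> timed_out;                  // handlers failed early

    // When gossip marks an endpoint DOWN, fail every pending view-update
    // request targeting it instead of waiting out the long write timeout.
    void on_down(const std::string& endpoint) {
        auto range = pending.equal_range(endpoint);
        for (auto it = range.first; it != range.second; ++it) {
            timed_out.insert(it->second);
        }
        pending.erase(range.first, range.second);
    }
};
```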
Avi Kivity
97357a7321 Merge "sstables: mc: Write and read static compact tables the same way as Cassandra" from Tomasz
"
Static compact tables are tables with compact storage and no
clustering columns.

Before this patch, Scylla was writing rows of static compact tables as
clustered rows instead of as static rows. That's because in our in-memory
model such tables have regular rows and no static row. In Cassandra's
schema (since 3.x), those tables have columns which are marked as
static and there are no regular columns.

This worked fine as long as Scylla was writing and reading those
sstables. But when importing sstables from Cassandra, our reader was
skipping the static row, since it's not present in our schema, and
returning no rows as a result. Also, Cassandra, and Scylla tools,
would have problems reading those sstables.

Fix this by writing rows for such tables the same way as Cassandra
does. In order to support rolling downgrade, we do that only when all
nodes are upgraded.

Fixes #4139.

Tests:

  - unit (dev)
"

* tag 'static-compact-mc-fix-v3.1' of github.com:tgrabiec/scylla:
  tests: sstables: Test reading of static compact sstable generated by Cassandra
  tests: sstables: Add test for writing and reading of static compact tables
  sstables: mc: Write static compact tables the same way as Cassandra
  sstable: mc: writer: Set _static_row_written inside write_static_row()
  sstables: Add sstable::features()
  sstables: mc: writer: Prepare write_static_row() for working with any column_kind
  storage_service: Introduce the CORRECT_STATIC_COMPACT feature flag
  sstables: mc: writer: Build indexed_columns together with serialization_header
  sstables: mc: writer: De-optimize make_serialization_header()
  sstable: mc: writer: Move attaching of mc-specific components out of generic code

(cherry picked from commit eddb98e8c6)
2019-03-24 16:34:42 +02:00
Tomasz Grabiec
089e41999a tests: sstables: Extract make_sstable_mutation_source()
Message-Id: <1540459849-27612-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 46d0c157ae)
2019-03-24 13:48:41 +02:00
Asias He
c537b3dd8e storage_service: Wait for gossip to settle only if do_bind is set
In commit 71bf757b2c, we call
wait_for_gossip_to_settle() which takes some time to complete in
storage_service::prepare_to_join().

tests/cql_query_test calls init_server with do_bind == false, which in
turn calls storage_service::prepare_to_join(). Since in the test there
is only one node, there is no point in waiting for gossip to settle.

To make the cql_query_test fast again, do not call
wait_for_gossip_to_settle if do_bind is false.

Before this patch, cql_query_test takes forever to complete.
After this patch, it takes 10s.

Ref #4289.

Tests: tests/cql_query_test
Message-Id: <3ae509e0a011ae30eef3f383c6a107e194e0e243.1553147332.git.asias@scylladb.com>
(cherry picked from commit c0f744b407)
2019-03-23 15:13:40 +02:00
Gleb Natapov
75a737c958 messaging_service: keep shared pointer to an rpc connection while opening mutation fragment stream
Current code captures a reference to rpc::client in a continuation, but
there is no guaranty that the reference will be valid when continuation runs.
Capture shared pointer to rpc::client instead.

Fixes #4350.

Message-Id: <20190314135538.GC21521@scylladb.com>
(cherry picked from commit bb93d990ad)
2019-03-23 15:11:28 +02:00
Tomasz Grabiec
ea0f1c039d row_cache: Fix abort in cache populating read concurrent with memtable flush
When we're populating a partition range and the population range ends
with a partition key (not a token) which is present in sstables and
there was a concurrent memtable flush, we would abort on the following
assert in cache::autoupdating_underlying_reader:

     utils::phased_barrier::phase_type creation_phase() const {
         assert(_reader);
         return _reader_creation_phase;
     }

That's because autoupdating_underlying_reader::move_to_next_partition()
clears the _reader field when it tries to recreate a reader but it finds
the new range to be empty:

         if (!_reader || _reader_creation_phase != phase) {
            if (_last_key) {
                auto cmp = dht::ring_position_comparator(*_cache._schema);
                auto&& new_range = _range.split_after(*_last_key, cmp);
                if (!new_range) {
                    _reader = {};
                    return make_ready_future<mutation_fragment_opt>();
                }

Fix by not asserting on _reader. creation_phase() will now be
meaningful even after we clear the _reader. The meaning of
creation_phase() is now "the phase in which the reader was last
created or 0", which makes it valid in more cases than before.

If the reader was never created we will return 0, which is smaller
than any phase returned by cache::phase_of(), since cache starts from
phase 1. This shouldn't affect current behavior, since we'd abort() if
called for this case, it just makes the value more appropriate for the
new semantics.

Tests:

  - unit.row_cache_test (debug)

Fixes #4236
Message-Id: <1553107389-16214-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 69775c5721)
2019-03-22 09:31:59 -03:00
Asias He
27cf758f12 streaming: Get rid of the keep alive timer in streaming
There is no guarantee that rpc streaming makes progress in some time
period. Remove the keep alive timer in streaming to avoid killing the
session when the rpc streaming is just slow.

The keep alive timer is used to close the session in the following case:

n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver)
kill -9 n2

We need this because we do not kill the session when gossip thinks a node
is down: the node being down might only be temporary,
and it is a waste to drop the work that has already been done, especially
when the stream session takes a long time.

Since in range_streamer, we do not stream all data in a single stream
session, we stream 10% of the data at a time, and we have retry logic.
I think it is fine to kill a stream session when gossip thinks a node is
down. This patch changes to close all stream session with the node that
gossip think it is down.
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>

(cherry picked from commit b8158dd65d)
2019-03-20 19:47:11 +01:00
Avi Kivity
76fd69244a Merge "Fix empty remote common_features in check_knows_remote_features" from Asias
Three nodes in the cluster node1, node2, node3

Shutdown the whole cluster

Start node1

Start node2, node2 sees empty remote common_features.

   gossip - Feature check passed.  Local node 127.0.0.2 features =
   {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
   DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
   LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
   RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
   STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
   Remote common_features = {}

The problem is node3 hasn't started yet, node1 sees node3 has empty
features. get_supported_features() returns an empty common feature set
if any node's feature set is empty. To fix, we should fall back
to the features saved in the system table.

Start node3, node3 sees empty remote common_features.

   gossip - Feature check passed. Local node 127.0.0.3 features =
   {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
   DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
   LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
   RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
   STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
   Remote common_features = {}

The problem is node3 hasn't inserted its own features into gossip
endpoint_state_map. get_supported_features() returns the common features
of all nodes in endpoint_state_map. To fix, we should fall back to
the features stored in the system table for such a node in this case.

Fixes #4225
Fixes #4341

* tag 'backport.fix.empty.remote.common.features.for.3.0.v1' of github.com:scylladb/seastar-dev:
  gossiper: Enable features only after gossip is settled
  Merge "Fix empty remote common_features in check_knows_remote_features" from Asias
  Merge "Fix window during init where waiting for a feature can be ignored" from Avi
2019-03-20 12:40:45 +02:00
Asias He
0eb2ea8f00 gossiper: Enable features only after gossip is settled
n1, n2, n3 in the cluster,

shutdown n1, n2, n3

start n1, n2

start n3: we saw features being enabled from the system table while n1 and n2 are already up and running in the cluster.

INFO  2019-02-27 09:24:41,023 [shard 0] gossip - Feature check passed. Local node 127.0.0.3 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO  2019-02-27 09:24:41,025 [shard 0] storage_service - Starting up server gossip
INFO  2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.1 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO  2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.2 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}

The problem is that we enable the features too early in the startup process.
We should enable features after gossip is settled.

Fixes #4289
Message-Id: <04f2edb25457806bd9e8450dfdcccc9f466ae832.1551406991.git.asias@scylladb.com>

(cherry picked from commit 71bf757b2c)
2019-03-20 17:28:20 +08:00
Tomasz Grabiec
20eaf0b85f Merge "Fix empty remote common_features in check_knows_remote_features" from Asias
Three nodes in the cluster node1, node2, node3

Shutdown the whole cluster

Start node1

Start node2, node2 sees empty remote common_features.

   gossip - Feature check passed.  Local node 127.0.0.2 features =
   {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
   DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
   LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
   RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
   STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
   Remote common_features = {}

The problem is node3 hasn't started yet, node1 sees node3 has empty
features. get_supported_features() returns an empty common feature set
if any node's feature set is empty. To fix, we should fall back
to the features saved in the system table.

Start node3, node3 sees empty remote common_features.

   gossip - Feature check passed. Local node 127.0.0.3 features =
   {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
   DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
   LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
   RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
   STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
   Remote common_features = {}

The problem is node3 hasn't inserted its own features into gossip
endpoint_state_map. get_supported_features() returns the common features
of all nodes in endpoint_state_map. To fix, we should fall back to
the features stored in the system table for such a node in this case.

Fixes #4225
Fixes #4341

* dev asias/fix_check_knows_remote_features.upstream.v4.1:
  gossiper: Remove unused register_feature and unregister_feature
  gossiper: Remove unused wait_for_feature_on_all_node and
    wait_for_feature_on_node
  gossiper: Log feature is enabled only if the feature is not enabled
    previously
  gossiper: Fix empty remote common_features in
    check_knows_remote_features

(cherry picked from commit b0e6f17a22)
2019-03-20 17:27:30 +08:00
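A minimal sketch of the fallback logic described above (types and names hypothetical, not the gossiper API):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using feature_set = std::set<std::string>;

// For each node, prefer the feature set gossip reports; if gossip has an
// empty set for the node, fall back to the one persisted in the system
// table so it does not collapse the cluster-wide intersection to {}.
feature_set common_features(const std::map<std::string, feature_set>& gossip,
                            const std::map<std::string, feature_set>& system_table) {
    feature_set common;
    bool first = true;
    for (const auto& [node, gossiped] : gossip) {
        const feature_set* feats = &gossiped;
        if (feats->empty()) {
            auto it = system_table.find(node);
            if (it != system_table.end()) feats = &it->second;
        }
        if (first) {
            common = *feats;
            first = false;
        } else {
            feature_set intersection;
            for (const auto& f : common) {
                if (feats->count(f)) intersection.insert(f);
            }
            common = intersection;
        }
    }
    return common;
}
```

Without the fallback, the not-yet-started node's empty set would make the intersection empty, which is the "Remote common_features = {}" symptom in the log above.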
Tomasz Grabiec
751fdc9f6c Merge "Fix window during init where waiting for a feature can be ignored" from Avi
storage_service keeps a bunch of "feature" variables, indicating cluster-wide
supported features, and has the ability to wait until the entire cluster supports
a given feature.

The propagation of features depends on gossip, but gossip is initialized after
storage_service, so the current code late-initializes the features. However, that
means that whoever waits on a feature between storage_service initialization and
gossip initialization loses their wait entry. In #3952, we have proof that this
in fact happens.

Fix this by removing the circular dependency. We now store features in a new
service, feature_service, that is started before both gossip and storage_service.
Gossip updates feature_service while storage_service reads for it.

Fixes #3953.

* https://github.com/avikivity/3953/v4.1:
  storage_service: deinline enable_all_features()
  gossiper: keep features registered
  tests/gossip: switch to seastar::thread
  storage_service: deinline init/deinit functions
  gossiper: split feature storage into a new feature_service
  gossiper: maybe enable features after start_gossiping()
  storage_service: fix gap when feature::when_enabled() doesn't work

(cherry picked from commit 6012a63660)
2019-03-20 17:25:35 +08:00
Tomasz Grabiec
5e3a52024e sstable/compaction: Use correct schema in the writing consumer
Introduced in 2a437ab427.

regular_compaction::select_sstable_writer() creates the sstable writer
when the first partition is consumed from the combined mutation
fragment stream. It gets the schema directly from the table
object. That may be a different schema than the one used by the
readers if there was a concurrent schema alter during that small time
window. As a result, the writing consumer attached to readers will
interpret fragments using the wrong version of the schema.

One effect of this is storing values of some columns under a different
column.

This patch replaces all column_family::schema() accesses with accesses
to the _schema member which is obtained once per compaction and is
the same schema which readers use.

Fixes #4304.

Tests:

  - manual tests with hard-coded schema change injection to reproduce the bug
  - build/dev/scylla boot
  - tests/sstable_mutation_test

Message-Id: <1551698056-23386-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 58e7ad20eb)
2019-03-04 18:16:43 +02:00
Avi Kivity
3869b5ab51 Merge "Fix commitlog chunks overwriting each other" from Paweł
"
This series fixes a problem in the commitlog cycle() function that
confused in-memory and on-disk size of chunks it wrote to disk. The
former was used to decide how much data needs to be actually written,
and the latter was used to compute the offset of the next chunk. If two
chunk writes happened concurrently, the one positioned earlier in
the file could corrupt the header of the next one.

Fixes #4231.

Tests: unit(dev), dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup,test_commitlog_replay_with_alter_table)
"

* tag 'fix-commitlog-cycle/v1' of https://github.com/pdziepak/scylla:
  commitlog: write the correct buffer size
  utils/fragmented_temporary_buffer_view: add remove suffix

(cherry picked from commit d95dec22d9)
2019-03-04 17:58:46 +02:00
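The invariant being restored can be sketched as follows (the alignment value and names are hypothetical): the byte count actually written for a chunk must equal the on-disk size used to place the next chunk.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t alignment = 4096;  // hypothetical device alignment

// A chunk's on-disk footprint is its in-memory size rounded up to the
// alignment; both the write size and the next chunk's offset must use it.
std::size_t on_disk_size(std::size_t in_memory_size) {
    return (in_memory_size + alignment - 1) / alignment * alignment;
}

std::size_t next_chunk_offset(std::size_t offset, std::size_t in_memory_size) {
    return offset + on_disk_size(in_memory_size);
}
```

If the two sizes are mixed up, a concurrent write of the earlier chunk can cover bytes that belong to the later chunk's header, which is the corruption described above.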
Jenkins
3cca6f5384 release: prepare for 3.0.4 by hagitsegev 2019-03-04 10:14:33 +02:00
Nadav Har'El
15188b5ea5 Materialized views: fix accidental zeroing of flow-control delay
The materialized-views flow control carefully calculates an amount of
microseconds to delay a client to slow it down to the desired rate -
but then a typo (std::min instead of std::max) causes this delay to
be zeroed, which in effect completely nullifies the flow control
algorithm.

Before this fix, experiments suggested that view flow control was
not having any effect and view backlog not bounded at all. After this
fix, we can see the flow control having its desired effect, and the
view backlog converging.

Fixes #4143.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190226161452.498-1-nyh@scylladb.com>
(cherry picked from commit da54d0fc7d)
2019-02-27 22:17:44 +02:00
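The typo class is easy to show in isolation; a hedged sketch (not the actual flow-control code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// The computed delay must be clamped from below at zero.
int64_t clamp_delay_buggy(int64_t delay_us) {
    return std::min<int64_t>(delay_us, 0);  // typo: zeroes every positive delay
}

int64_t clamp_delay_fixed(int64_t delay_us) {
    return std::max<int64_t>(delay_us, 0);  // keeps the computed delay
}
```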
Avi Kivity
96d9ebb67e Update seastar submodule
* seastar b79be02...4a06029 (1):
  > net: fix tcp load balancer accounting leak while moving socket to other shard

Fixes #4269.
2019-02-25 23:22:09 +02:00
Avi Kivity
f18b370198 Merge "sstables: mc: writer: Avoid large allocations for keeping promoted index entries" from Tomasz
"
Currently we keep the entries in a circular_buffer, which uses
contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.

A similar problem exists for the offset vector built
in write_promoted_index().

This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.

The serialization of the first entry is deferred, so that
serialization is avoided if there will be fewer than 2
entries. A promoted index is not added for such partitions.

There still remains a problem that a large enough promoted index can cause OOM.

Refs #4217

Tests:
  - unit (release)
  - scylla-bench write

Branches: 3.0
"

* tag 'fix-large-alloc-for-promoted-index-v3' of github.com:tgrabiec/scylla:
  sstables: mc: writer: Avoid large allocations for maintaining promoted index
  sstables: mc: writer: Avoid double-serialization of the promoted index

(cherry picked from commit fdefee696e)
2019-02-24 15:45:32 +02:00
Avi Kivity
8e657e5685 Merge " Fix INSERT JSON with null values" from Piotr
"
Fixes #4256

This miniseries fixes a problem with inserting NULL values through
INSERT JSON interface.

Tests: unit (dev)
"

* 'fix_insert_json_with_null' of https://github.com/psarna/scylla:
  tests: add test for INSERT JSON with null values
  cql3: add missing value erasing to json parser

(cherry picked from commit 5520fc37ba)
2019-02-22 15:52:46 +02:00
Avi Kivity
4fde670abf Merge "Add DEFAULT UNSET support to JSON" from Piotr
"
This series adds DEFAULT UNSET and DEFAULT NULL keyword support
to INSERT JSON statement, as stated in #3909.

Tests: unit (release)
"

* 'add_json_default_unset_2' of https://github.com/psarna/scylla:
  tests: add DEFAULT UNSET case to JSON cql tests
  tests: split JSON part of cql query test
  cql3: add DEFAULT UNSET to INSERT JSON

(cherry picked from commit 447f953a2c)
2019-02-22 15:52:16 +02:00
Avi Kivity
923318e636 sstables: checksummed_file_writer: fix dma alignment
checksummed_file_writer does not override allocate_buffer(), so it inherits
data_source_impl's default allocate_buffer, which does not care about alignment.
The buffer is then passed to the real file_data_sink_impl, and thence to the file
itself, which cannot complete the write since it is not properly aligned.

This doesn't fail in release mode, since the Seastar allocator will supply a
properly aligned buffer even if not asked to do so. The ASAN allocator usually
does supply an aligned buffer, but not always, which causes the test to fail.

Fix by forwarding the allocate_buffer() function to the underlying data_source.
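A minimal sketch of the shape of the fix (hypothetical types, not the actual Seastar interfaces): the checksumming wrapper forwards buffer allocation to the real file sink, which knows the required alignment, instead of inheriting an alignment-oblivious default.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: a wrapper that checksums data must delegate
// allocate_buffer() to the real file sink, which knows the DMA alignment,
// rather than inherit a default that just calls malloc().
struct sink {
    virtual ~sink() = default;
    virtual void* allocate_buffer(std::size_t n) { return std::malloc(n); }
};

struct file_sink : sink {
    void* allocate_buffer(std::size_t n) override {
        return std::aligned_alloc(4096, n); // n must be a multiple of 4096
    }
};

struct checksumming_sink : sink {
    sink& underlying;
    explicit checksumming_sink(sink& s) : underlying(s) {}
    // The fix: forward instead of using the unaligned base-class default.
    void* allocate_buffer(std::size_t n) override {
        return underlying.allocate_buffer(n);
    }
};
```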

Fixes #4262.
Branches: branch-3.0
Message-Id: <20190221184115.6695-1-avi@scylladb.com>

(cherry picked from commit 34b254381f)
2019-02-22 09:28:55 +02:00
Amnon Heiman
35cc09b150 scylla-housekeeping: Read JSON as UTF-8 string for older Python 3 compatibility
Python 3.6 is the first version to accept bytes to the json.loads(),
which causes the following error on older Python 3 versions:

  Traceback (most recent call last):
    File "/usr/lib/scylla/scylla-housekeeping", line 175, in <module>
      args.func(args)
    File "/usr/lib/scylla/scylla-housekeeping", line 121, in check_version
      raise e
    File "/usr/lib/scylla/scylla-housekeeping", line 116, in check_version
      versions = get_json_from_url(version_url + params)
    File "/usr/lib/scylla/scylla-housekeeping", line 55, in get_json_from_url
      return json.loads(data)
    File "/usr/lib64/python3.4/json/__init__.py", line 312, in loads
      s.__class__.__name__))
  TypeError: the JSON object must be str, not 'bytes'

To support those older Python versions, convert the bytes read to utf8
strings before calling the json.loads().

Fixes #4239
Branches: master, 3.0

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20190218112312.24455-1-amnon@scylladb.com>
(cherry picked from commit 750b76b1de)
2019-02-20 11:03:12 +02:00
Asias He
22f41f04ba range_streamer: Futurize add_ranges
get_all_ranges_with_sources_for and get_all_ranges_with_strict_sources_for
can take a long time to compute, which causes reactor stalls. To fix,
run them in a thread and yield. These functions are used on the slow
path, so it is ok to yield more often than needed.

Fixes #3639

Message-Id: <63aa7794906ac020c9d9b2984e1351a8298a249b.1536135617.git.asias@scylladb.com>
(cherry picked from commit 8edf3defdf)
2019-02-20 11:03:12 +02:00
Avi Kivity
fae11c0d6b fragmented_temporary_buffer: fix read_exactly() during premature end-of-stream
read_exactly(), when given a stream that does not contain the amount of data
requested, will loop endlessly, allocating more and more memory as it does, until
it fails with an exception (at which point it will release the memory).

Fix by returning an empty result, like input_stream::read_exactly() (which it
replaces). Add a test case that fails without a fix.

Affected callers are the native transport, commitlog replay, and internal
deserialization.
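The contract can be illustrated with a simplified, single-buffer stand-in (hypothetical; the real code works on fragmented buffers and futures): on premature end-of-stream, return an empty result rather than retrying forever.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Simplified stand-in for the fixed behavior: if the stream ends before n
// bytes are available, return an empty result (mirroring
// input_stream::read_exactly) instead of looping and accumulating memory.
std::string read_exactly(std::string_view& stream, std::size_t n) {
    if (stream.size() < n) {
        stream.remove_prefix(stream.size()); // drain whatever is left
        return {};                           // premature end-of-stream
    }
    std::string out(stream.substr(0, n));
    stream.remove_prefix(n);
    return out;
}
```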

Fixes #4233.

Branches: master, branch-3.0
Tests: unit(dev)
Message-Id: <20190216150825.14841-1-avi@scylladb.com>
(cherry picked from commit 03531c2443)
2019-02-20 11:03:11 +02:00
Nadav Har'El
82016c07f2 Materialized views: limit size of row batching during bulk view building
The bulk materialized-view building process (when adding a materialized
view to a table with existing data) currently reads the base table in
batches of 128 (view_builder::batch_size) rows. This is clearly better
than reading entire partitions (which may be huge), but still, 128 rows
may grow pretty large when we have rows with large strings or blobs,
and there is no real reason to buffer 128 rows when they are large.

Instead, when the rows we read so far exceed some size threshold (in this
patch, 1MB), we can operate on them immediately instead of waiting for
128.
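The dual flush condition can be sketched as follows (the 128 comes from view_builder::batch_size and the 1MB figure from this patch; the helper name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// A batch is flushed when either cap is hit, so a few very large rows no
// longer accumulate into one oversized mutation.
bool batch_full(std::size_t rows, std::size_t bytes) {
    constexpr std::size_t batch_rows = 128;         // view_builder::batch_size
    constexpr std::size_t batch_bytes = 1UL << 20;  // 1MB size threshold
    return rows >= batch_rows || bytes >= batch_bytes;
}
```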

As a side-effect, this patch also solves another bug: in the worst case, all
the base rows of one batch may be written into one output view partition,
in one mutation. But there is a hard limit on the size of one mutation
(commitlog_segment_size_in_mb, by default 32MB), so we cannot allow the
batch size to exceed this limit. By not batching further after 1MB,
we avoid reaching this limit when individual rows do not reach it but
128 of them did.

Fixes #4213.

This patch also includes a unit test reproducing #4213, and demonstrating
that it is now solved.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190214093424.7172-1-nyh@scylladb.com>
(cherry picked from commit fec562ec8f)
2019-02-16 21:54:41 +02:00
Botond Dénes
282ccbb072 service/storage_service: fix pre-bootstrap wait for schema agreement
When bootstrapping, a node should wait for schema agreement with its
peers before it joins the ring. This is to ensure it can
immediately accept writes. Failing to reach schema agreement before
joining is not fatal, as the node can pull unknown schemas on writes
on-demand. However, if such a schema contains references to UDFs, the
node will reject writes using it, due to #3760.

To ensure that schema agreement is reached before joining the ring,
`storage_service::join_token_ring()` has two checks. First, it checks that
at least one peer was connected previously. For this it compares
`database::get_version()` with `database::empty_version`. The (implied)
assumption is that this will become something other than
`database::empty_version` only after having connected (and pulled
schemas from) at least one peer. This assumption doesn't hold anymore,
as we now set the version earlier in the boot process.
The second check verifies that we have the same schema version as all
known, live peers. This check assumes (since 3e415e2) that we have
already "met" all (or at least some) of our peers and if there is just
one known node (us) it concludes that this is a single-node cluster,
which automatically has schema agreement.
It's easy to see how these two checks will fail. The first fails to
ensure that we have met our peers, and the second wrongfully concludes
that we are a one-node cluster, and hence have schema agreement.

To fix this, modify the first check. Instead of relying on the presence
of a non-empty database version, supposedly implying that we already
talked to our peers, explicitly make sure that we have really talked to
*at least* one other node, before proceeding to the second check, which
will now do the correct thing, actually checking the schema versions.

Fixes: #4196

Branches: 3.0, 2.3

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>
(cherry picked from commit 2125e99531)
2019-02-16 19:04:08 +02:00
Avi Kivity
873e0f0e14 auth: password_authenticator: protect against NULL salted_hash
In case salted_hash was NULL, we'd access uninitialized memory when dereferencing
the optional in get_as<>().

Protect against that by using get_opt() and failing authentication if we see a NULL.
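The defensive pattern looks roughly like this (hypothetical signature; the equality test stands in for the real hash comparison):

```cpp
#include <cassert>
#include <optional>
#include <string>

// Sketch of the get_opt()-style fix: a NULL (empty) salted_hash fails
// authentication cleanly instead of dereferencing an empty optional,
// which is undefined behavior.
bool check_password(const std::optional<std::string>& salted_hash,
                    const std::string& hashed_attempt) {
    if (!salted_hash) {
        return false; // no stored hash: deny access, don't dereference
    }
    return *salted_hash == hashed_attempt;
}
```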

Fixes #4168.

Tests: unit (release)
Branches: 3.0, 2.3
Message-Id: <20190211173820.8053-1-avi@scylladb.com>
(cherry picked from commit da9628c6dc)
2019-02-11 23:55:53 +02:00
Jenkins
c36c58c64e release: prepare for 3.0.3 by hagitsegev 2019-02-11 18:58:02 +02:00
Duarte Nunes
0685c8f5bc Merge 'Fix misdetection of remote counter shards' from Paweł
"
The code reading counter cells from sstables verifies that there are no
unsupported local or remote shards. The latter are detected by checking
if all shards are present in the counter cell header (only remote shards
do not have entries there). However, the logic responsible for doing
that was incorrectly computing the total number of counter shards in a
cell if the header was larger than a single counter shard. This resulted
in incorrect complaints that remote shards are present.

Fixes #4206

Tests: unit(release)
"

* tag 'counter-header-fix/v1' of https://github.com/pdziepak/scylla:
  tests/sstables: test counter cell header with large number of shards
  sstables/counters: fix remote counter shard detection

(cherry picked from commit d2d885fb93)
2019-02-11 14:09:55 +02:00
Calle Wilund
92cf2934c6 tls: Use a default prio string disabling TLS1.0 forcing min 128bits
Fixes #4010

Unless the user sets this explicitly, we should try to avoid
deprecated protocol versions. While gnutls should do this for
connections initiated this way, clients such as drivers etc. might
use obsolete versions.

Message-Id: <20190107131513.30197-1-calle@scylladb.com>
(cherry picked from commit ba6a8ef35b)
2019-02-05 19:45:13 +02:00
Calle Wilund
ed2fb65732 commitlog_replayer: Bugfix: finding truncation positions uses local var ref
"uuid" was captured by reference in a continuation. This works 99.9% of
the time because the continuation is not actually delayed (and assuming
we begin the checks with non-truncated (system) cf:s, it works).
But if we do delay the continuation, the resulting cf map will be
corrupted.
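The bug class reduces to the familiar dangling-reference capture (hypothetical names; Scylla's continuations are Seastar futures rather than std::function):

```cpp
#include <cassert>
#include <functional>
#include <string>

// A continuation that captures a local by reference dangles if it runs
// after the enclosing scope ends; capturing by value survives any delay.
std::function<std::string()> make_continuation() {
    std::string uuid = "cf-uuid";
    // Buggy form: return [&uuid] { return uuid; };  // dangles after return
    return [uuid] { return uuid; };                  // safe: copies the value
}
```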

Fixes #4187.

Message-Id: <20190204141831.3387-1-calle@scylladb.com>
(cherry picked from commit 9cadbaa96f)
2019-02-04 20:25:17 +02:00
Gleb Natapov
ce2957d106 messaging_service: do not forget to close stream when sending it to another side failed
Fixes #4124

Message-Id: <20190131091857.GC3172@scylladb.com>
(cherry picked from commit a70374d982)
2019-02-03 13:00:20 +02:00
Avi Kivity
b31d94e317 Update seastar submodule
* seastar 5226277...b79be02 (1):
  > rpc: support closing streaming when only sink or source was created

Ref #4124.
2019-02-03 12:59:43 +02:00
Asias He
da80f27f44 migration_manager: Fix nullptr dereference in maybe_schedule_schema_pull
Commit 976324bbb8 changed the code to use
get_application_state_ptr to get a pointer to the application_state. It
may return nullptr, which was dereferenced unconditionally.

In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:

   4 nodes in the tests

   n1, n2, n3, n4 are started

   n1 is stopped

   n1 is changed to use different shard config

   n1 is restarted ( 2019-01-27 04:56:00,377 )

The backtrace happened on n2 right after n1 restarted:

   0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
   1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
   2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
   3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
   4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
   5 Segmentation fault on shard 0.
   6 Backtrace:
   7 0x00000000041c0782
   8 0x00000000040d9a8c
   9 0x00000000040d9d35
   10 0x00000000040d9d83
   11 /lib64/libpthread.so.0+0x00000000000121af
   12 0x0000000001a8ac0e
   13 0x00000000040ba39e
   14 0x00000000040ba561
   15 0x000000000418c247
   16 0x0000000004265437
   17 0x000000000054766e
   18 /lib64/libc.so.6+0x0000000000020f29
   19 0x00000000005b17d9

We do not know exactly when this backtrace happened, but according to the logs from n3 and n4:

   INFO 2019-01-27 04:56:22,154 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
   INFO 2019-01-27 04:56:21,594 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL

We can be sure the backtrace on n2 happened before 04:56:21 - 19 seconds
(the delay before gossip notices a peer is down), so the abort time is
around 04:56:0X. The migration_manager::maybe_schedule_schema_pull that
triggers the backtrace must have been scheduled before n1 was restarted,
because it dereferences the application_state pointer after sleeping 60
seconds, so maybe_schedule_schema_pull was called around 04:55:0X, which
is before n1 was restarted.

So my theory is: migration_manager::maybe_schedule_schema_pull is
scheduled while n1 still has a SCHEMA application_state. When n1
restarts, n2 gets a new application state from n1 which does not have
SCHEMA yet. When migration_manager::maybe_schedule_schema_pull wakes up
from the 60-second sleep, n1 has a non-empty endpoint_state but an empty
application_state for SCHEMA. We dereference the nullptr
application_state and abort.

Fixes: #4148
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Message-Id: <9ef33277483ae193a49c5f441486ee6e045d766b.1548896554.git.asias@scylladb.com>
(cherry picked from commit 28d6d117d2)
2019-02-01 13:00:38 +02:00
Jenkins
5174b1cd13 release: prepare for 3.0.2 by slivne 2019-01-30 16:14:34 +02:00
Nadav Har'El
9ba608cae4 cql3: really ensure retrieval of columns for filtering
Commit fd422c954e aimed to fix
issue #3803. In that issue, if a query SELECTed only certain columns but
did filtering (ALLOW FILTERING) over other unselected columns, the filtering
didn't work. The fix involved adding the columns being filtered to the set
of columns we read from disk, so they can be filtered.

But that commit included an optimization: If you have clustering keys
c1 and c2, and the query asks for a specific partition key and c1 < 3 and
c2 > 3, the "c1 < 3" part does NOT need to be filtered because it is already
done as a slice (a contiguous read from disk). The committed code erroneously
concluded that both c1 and c2 don't need to be filtered, which was wrong
(c2 *does* need to be read and filtered).

In this patch, we fix this optimization. Previously, we used the "prefix
length", which in the above example was 2 (both c1 and c2 were filtered)
but we need a new and more elaborate function,
num_prefix_columns_that_need_not_be_filtered(), to determine that we can
only skip filtering for one column (c1) and cannot skip the second.
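A plausible reconstruction of that rule (hypothetical enum and simplified logic, not the actual Scylla implementation): equalities keep extending the contiguous slice, the first range restriction is the last column the slice can cover, and every restricted column after it must still be filtered.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class restriction { eq, range };

// Count how many leading restricted clustering columns are covered by the
// contiguous slice read: equalities keep extending it, and one trailing
// range (like c1 < 3) still fits; anything beyond (like c2 > 3) must be
// filtered.
std::size_t num_prefix_columns_that_need_not_be_filtered(
        const std::vector<restriction>& r) {
    std::size_t n = 0;
    for (auto x : r) {
        ++n;
        if (x == restriction::range) {
            break;
        }
    }
    return n;
}
```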

Fixes #4121. This patch also adds a unit test to confirm this.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Message-Id: <20190123131212.6269-1-nyh@scylladb.com>
(cherry picked from commit 76f1fcc346)
2019-01-23 21:11:05 +02:00
Avi Kivity
f7c5cbc645 build: fix libdeflate object file corruption during parallel build
libdeflate's build places some object files in the source directory, which is
shared between the debug and release build. If the same object file (for the two
modes) is written concurrently, or if one more reads it while the other writes it,
it will be corrupted.

Fix by not building the executables at all. They aren't needed, and we already
placed the libraries' objects in the build directory (which is unshared). We only
need the libraries anyway.

Fixes #4130.
Branches: master, branch-3.0
Message-Id: <20190123145435.19049-1-avi@scylladb.com>

(cherry picked from commit c83ae62aed)
2019-01-23 21:11:05 +02:00
Duarte Nunes
cf4b4d4878 Merge 'hinted handoff: cache cf mappings' from Vlad
"
Cache cf mappings when we break in the middle of sending a segment, so
that the sender has them the next time it wants to send this segment,
resuming from where it left off.

Also add the "discard" metric so that we can track hints that are being
discarded in the send flow.
"

Fixes #4122

* 'hinted_handoff_cache_cf_mappings-v1' of https://github.com/vladzcloudius/scylla:
  hinted handoff: cache column family mappings for segments that were not sent out in full
  hinted handoff: add a "discarded" metric

(cherry picked from commit 88c7c1e851)
2019-01-23 17:14:29 +02:00
Asias He
45bb1ba1b7 streaming: Futurize estimate_partitions
The loop can take a long time if the number of sstables and/or ranges
is large. To fix, futurize the loop.

Fixes: #4005

Message-Id: <3b05cb84f3f57cc566702142c6365a04b075018e.1545290730.git.asias@scylladb.com>
(cherry picked from commit bcba6b4f4d)
2019-01-22 18:20:21 +02:00
Botond Dénes
28294ed42e auth/service: unregister migration listener on stop()
Otherwise any event that triggers notification to this listener would
trigger a heap-use-after-free.

Refs: #4107

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b6bbd609371a2312aed7571b05119d59c7d103d7.1548067626.git.bdenes@scylladb.com>
(cherry picked from commit f229dff210)
2019-01-22 17:54:36 +02:00
Jenkins
3c4f8cf6ed release: prepare for 3.0.1 by hagitsegev 2019-01-20 12:42:00 +02:00
Botond Dénes
7b94264ae5 multishard_mutation_query(): use correct reader concurrency semaphore
The multishard mutation query used the semaphore obtained from
`database::user_read_concurrency_sem()` to pause-resume shard readers.
This presented a problem when `multishard_mutation_query()` was reading
from system tables. In this case the readers themselves would obtain
their permits from the system read concurrency semaphore. Since the
pausing of shard readers used the user read semaphore, pausing failed to
fulfill its objective of alleviating pressure on the semaphore the reads
obtained their permits from. In some cases this led to a deadlock
during system reads.
To ensure the correct semaphore is used for pausing-resuming readers,
obtain the semaphore from the `table` object. To avoid looking up the
table on every pause or resume call, cache the semaphores when readers
are created.

Fixes: #4096

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c784a3cd525ce29642d7216fbe92638fa7884e88.1547729119.git.bdenes@scylladb.com>
(cherry picked from commit 4537ec7426)
2019-01-17 18:08:01 +02:00
Duarte Nunes
22a085fbd3 Merge 'Fix filtering with LIMIT and paging' from Piotr
"
Before this series the limit was applied per page instead
of globally, which might have resulted in returning too many
rows.

To fix that:
 1. restrictions filter now has a 'remaining' parameter
    in order to stop accepting rows after enough of them
    have already been accepted
 2. pager passes its row limit to restrictions filter,
    so no more rows than necessary will be served to the client
 3. results no longer need to be trimmed on select_statement
    level

Tests: unit (release)
"

Fixes #4100

* 'fix_filtering_limit_with_paging_3' of https://github.com/psarna/scylla:
  tests: add filtering+limit+paging test case
  tests: allow null paging state in filtering tests
  cql3: fix filtering with LIMIT with regard to paging

(cherry picked from commit 7505815013)
2019-01-17 18:07:41 +02:00
Tomasz Grabiec
2d181da656 row_cache: Fix crash on memtable flush with LCS
The presence checker is constructed and destroyed in the standard
allocator context, but the presence check was invoked in the LSA
context. If the presence checker allocates and caches some managed
objects, there will be an alloc-dealloc mismatch.

That is the case with LeveledCompactionStrategy, which uses
incremental_selector.

Fix by invoking the presence check in the standard allocator context.

Fixes #4063.

Message-Id: <1547547700-16599-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 32f711ce56)
2019-01-15 21:16:13 +02:00
Nadav Har'El
d427a23d42 scylla_util.py: make view_hints_directory setting optional
It is optional to set "view_hints_directory", so we shouldn't insist that
it is defined in scylla.yaml on upgrade.

Fixes #4091.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190114125225.10794-1-nyh@scylladb.com>
(cherry picked from commit 9062750089)
2019-01-14 16:59:40 +02:00
Shlomi Livne
37ab553f02 release: prepare for 3.0.0
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
2019-01-12 22:24:08 +02:00
Raphael S. Carvalho
6a3f4fb3f9 database: Fix race condition in sstable snapshot
A race condition takes place when one of the sstables selected by the
snapshot is deleted by compaction. The snapshot fails because it tries to
link an sstable that was previously unlinked by compaction's sstable
deletion.

Refs #4051.

(master commit 1b7cad3531)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
2019-01-11 13:48:12 +02:00
Avi Kivity
8168d13887 Merge "Fix UDTs representation in serialization header" from Piotr
"
Tests: unit(release)
"

Fixes #4073.

* commit 'FETCH_HEAD~1':
  Add test for serialization header with UDT
  Fix UDT names in serialization header

(cherry picked from commit 4a6aeced59)
2019-01-11 07:48:23 +02:00
Benny Halevy
13bdec6eb4 sstables: mc: sign-extend serialization_header min_local_deletion_time_base and min_ttl_base
Refs #4074
Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190110141439.1324-1-bhalevy@scylladb.com>
(cherry picked from commit 2dc3776407)
2019-01-11 07:47:45 +02:00
Benny Halevy
57e7081d86 sstables: mc: sign-extend delta local_deletion_time and delta ttl
Follow Cassandra's encoding so that values that are less than the
baseline encoding_stats will wrap around in 64 bits rather than 32.
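The difference between the two widenings is a one-liner (illustrative helpers, not the sstable writer code):

```cpp
#include <cassert>
#include <cstdint>

// A 32-bit delta that can fall below the encoding_stats baseline must be
// sign-extended so arithmetic wraps in 64 bits; zero-extending (the bug)
// turns small negative deltas into huge positive ones.
std::int64_t widen_signed(std::uint32_t raw) {
    return static_cast<std::int32_t>(raw); // sign-extend (the fix)
}
std::int64_t widen_unsigned(std::uint32_t raw) {
    return raw; // zero-extend (the bug)
}
```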

Fixes #4074
Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190109192703.18371-1-bhalevy@scylladb.com>
(cherry picked from commit 60323b79d1)
2019-01-09 23:16:00 +02:00
Avi Kivity
2fcae36d96 tests: mutation_source_test: generate valid utf-8 data
test_fast_forwarding_across_partitions_to_empty_range uses an uninitialized
string to populate an sstable, but this can be invalid utf-8 so that sstable
cannot be sstabledumped.

Make it valid by using make_random_string().

Fixes #4040.
Message-Id: <20190107193240.14409-1-avi@scylladb.com>

(cherry picked from commit d8adbeda11)
2019-01-08 14:53:55 +02:00
Avi Kivity
ba62dcd5c7 Update seastar submodule
* seastar 618bc23...5226277 (1):
  > iotune: Initialize io_rates member variables

Fixes #4064.
2019-01-08 11:39:50 +02:00
Nadav Har'El
515399ce17 materialized views: move hints to top-level directory
While we keep ordinary hints in a directory parallel to the data directory,
we decided to keep the materialized view hints in a subdirectory of the data
directory, named "view_pending_updates". But during boot, we expect all
subdirectories of data/ to be keyspace names, and when we notice this one,
we print a warning:

   WARN: database - Skipping undefined keyspace: view_pending_updates

This spurious warning annoyed users. Moreover, we could have bigger
problems if the user actually tries to create a keyspace with that name.

So in this patch, we move the view hints to a separate top-level directory,
which defaults to /var/lib/scylla/view_hints, but as usual can be configured.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190107142257.16342-1-nyh@scylladb.com>
(cherry picked from commit da090a5458)
2019-01-07 22:01:56 +02:00
Benny Halevy
772c4b5fdc sstables: mc: expired_liveness_ttl should be max int32_t rather than max uint32_t
Corresponding to Cassandra's EXPIRED_LIVENESS_TTL = Integer.MAX_VALUE;

Fixes #4060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190107172457.20430-1-bhalevy@scylladb.com>
(cherry picked from commit 40410465d7)
2019-01-07 21:59:59 +02:00
Avi Kivity
874d88c98d Update seastar submodule
* seastar 08f1258...618bc23 (1):
  > perftune.py: tune only active NVMe HW queues on i3 AWS instances

Ref #3831.
2019-01-06 12:59:22 +02:00
Avi Kivity
5a178ff635 compaction_controller: increase minimum shares to 50 (~5%) for small-data workloads
The workload in #3844 has these characteristics:
 - very small data set size (a few gigabytes per shard)
 - large working set size (all the data, enough for high cache miss rate)
 - high overwrite rate (so a compaction results in 12X data reduction)

As a result, the compaction backlog controller assigns very few shares to
compaction (low data set size -> low backlog), so compaction proceeds very slowly.
Meanwhile, we have tons of cache misses, and each cache miss needs to read from a
large number of sstables (since compaction isn't progressing). The end result is
a high read amplification, and in this test, timeouts.

While we could declare that the scenario is very artificial, there are other
real-world scenarios that could trigger it. Consider a 100% write load
(population phase) followed by 100% read. Towards the end of the last compaction,
the backlog will drop more and more until compaction slows to a crawl, and until
it completes, all the data (for that compaction) will have to be read from its
input sstables, resulting in read amplification.

We should probably have read amplification affect the backlog, but for now the
simpler solution is to increase the minimum shares to 50 so that compaction
always makes forward progress. This will result in higher-than-needed compaction
bandwidth in some low write rate scenarios so we will see fluctuations in request
rate (what the controller was designed to avoid), but these fluctuations
will be limited to 5%.

Since the base class backlog_controller has a fixed (0, 0) point, remove it
and add it to derived classes (setting it to (0, 50) for compaction).

Fixes #3844 (or at least improves it).
Message-Id: <20181231162710.29410-1-avi@scylladb.com>

(cherry picked from commit b0980ba7c6)
2019-01-04 13:28:43 +02:00
Avi Kivity
d67439b910 Revert "release: prepare for 3.0-rc4"
This reverts commit 21a5a4c76a. We were already
at rc4, and the commit only changes the syntax (from the incorrect one to
the correct one).
2019-01-04 12:34:38 +02:00
Hagit Segev
21a5a4c76a release: prepare for 3.0-rc4 2019-01-03 23:53:47 +02:00
Tomasz Grabiec
f818d6ee3f tests: cql_test_env: Start the compaction manager
Broken in fee4d2e

Not doing this results in compaction requests being ignored.

One effect of this is that perf_fast_forward produces many sstables instead of one.

Refs #3984
Refs #3983

Message-Id: <1544719540-10178-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 245a0d953a)
2019-01-03 14:56:42 +01:00
Tomasz Grabiec
20c2745592 Merge "Improve times to start / stop the nodes" from Glauber
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.

Because there are no incoming writes, the number of shares for memtable
flushes decreases as memory used decreases, and that can cause the startup
procedure to take a long time.

We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot, flushes will still
be fighting for resources with compactions and startup will take longer
than it could.

By making sure that flushes are at this point running alone, we improve the
user experience by making sure the startup is as fast as it can be.

There is a similar problem at the drain level, which is also fixed in this
series.

Fixes #3958

* git@github.com:glommer/scylla.git faster-restart
  compaction_manager: delay initialization of the compaction manager.
  drain: stop compactions early

(cherry picked from commit 3e70ae1d06)
2019-01-03 14:56:16 +01:00
Avi Kivity
cf5c72561c release: prepare for scylla-3.0-rc4 2019-01-03 13:15:58 +02:00
Botond Dénes
53b85e5d32 querier_cache: unregister queriers evicted due to expired TTL
Currently queriers evicted due to their TTL expiring are not
unregistered from the `reader_concurrency_semaphore`. This can cause a
use-after-free when the semaphore tries to evict the same querier at
some later point in time, as the querier entry it has a pointer to is
now invalid.

Fix by unregistering the querier from the semaphore before destroying
the entry.

Refs: #4018
Refs: #4031

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4adfd09f5af8a12d73c29d59407a789324cd3d01.1546504034.git.bdenes@scylladb.com>
(cherry picked from commit e5a0ea390a)
2019-01-03 13:14:02 +02:00
Avi Kivity
2456cf63f2 querier_cache: unregister querier from reader_concurrency_semaphore during eviction
In insert_querier(), we may evict older queriers to make room for the new one.
However, we forgot to unregister the evicted queriers from
reader_concurrency_semaphore. As a result, when reader_concurrency_semaphore
eventually wanted to evict something, it saw an inactive_read_handle that was
not connected to a querier_cache::entry, and crashed on use-after-free.

Fix by evicting through the inactive_read_handle associated with the querier
to be evicted. This removes traces of the querier from both
reader_concurrency_semaphore and querier_cache. We also have to massage the
statistics since querier_inactive_read::evict() updates different counters.

Fixes #4018.

Tests: unit(release)
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190102175023.26093-1-avi@scylladb.com>
(cherry picked from commit 918d255168)
2019-01-03 13:14:00 +02:00
Pekka Enberg
c1f6ce4251 Merge 'Fixes for the view_update_from_staging_generator' from Duarte
"This series contains a couple of fixes to the
view_update_from_staging_generator, the object responsible for
generating view updates from sstables written through streaming.

Fixes #4021"
* 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla:
  db/view/view_update_from_staging_generator: Break semaphore on stop()
  db/view/view_update_from_staging_generator: Restore formatting
  db/view/view_update_from_staging_generator: Avoid creating more than one fiber

(cherry picked from commit 96172b7bca)
2018-12-29 20:22:54 +02:00
Duarte Nunes
fc82eb5586 streaming/stream_session: Only stage sstables for tables with views
When streaming, sstables for which we need to generate view updates
are placed in a special staging directory. However, we only need to do
this for tables that actually have views.

Refs #4021
Message-Id: <20181227215412.5632-1-duarte@scylladb.com>

(cherry picked from commit bab7e6877b)
2018-12-28 20:52:15 +02:00
Avi Kivity
f58e592345 Merge "Fix use-after-free when destroying partition_snapshots in the background" from Tomasz
"
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive, the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumes destruction.

The fix is to update snapshots's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.

When the memtable is destroyed without being moved to cache there is no
problem, because the snapshot would be queued into the memtable's cleaner,
which is drained of all snapshots on destruction.

Introduced in f3da043 (in >= 3.0-rc1)

Fixes #4030.

Tests:

  - mvcc_test (debug)

"

* tag 'fix-snapshot-merging-use-after-free-v1.1' of github.com:tgrabiec/scylla:
  tests: mvcc: Add test_snapshot_merging_after_container_is_destroyed
  tests: mvcc: Introduce mvcc_container::migrate()
  tests: mvcc: Make mvcc_partition move-constructible
  tests: mvcc: Introduce mvcc_container::make_not_evictable()
  tests: mvcc: Allow constructing mvcc_container without a cache_tracker
  mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup
  mvcc: partition_snapshot: Introduce migrate()
  mutation_cleaner: impl: Store a back-reference to the owning mutation_cleaner

(cherry picked from commit 8e2f6d0513)
2018-12-28 13:37:29 +02:00
Duarte Nunes
6375b1e5b7 streaming/stream_session: Don't use table reference across defer points
When creating a sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur in a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.

Refs #4021

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
(cherry picked from commit 66e45469b2)
2018-12-28 10:58:26 +02:00
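The "pin the object across a defer point" fix above boils down to capturing shared ownership before suspending, so a concurrent drop cannot destroy the object underneath the resumed continuation. A minimal sketch, with `table`/`database` as invented stand-ins for the Scylla types:

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct table { bool alive = true; };

struct database {
    std::shared_ptr<table> t = std::make_shared<table>();
    void drop_table() { t.reset(); } // models a concurrent schema drop
};

// Unsafe version would capture a raw reference to *db.t; if drop_table() runs
// before the deferred work does, that reference dangles (use-after-free).
// Safe version: copy the shared_ptr now, keeping the table alive until the
// deferred work has finished with it.
std::function<bool()> make_deferred_work(database& db) {
    auto pinned = db.t;                          // take shared ownership here
    return [pinned] { return pinned->alive; };   // valid even after drop_table()
}
```

The lambda's captured `shared_ptr` plays the role of the pinned table reference held across the defer points.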
Gleb Natapov
7ca24efb39 streaming: always read from rpc::source until end-of-stream during mutation sending
rpc::source cannot be abandoned until EOS is reached, but the current code
does not obey this: if an error code is received, it throws an exception
that aborts the reading loop. Fix it by moving the exception throwing out
of the loop.

Fixes: #4025

Message-Id: <20181227135051.GC29458@scylladb.com>
(cherry picked from commit 37b4043677)
2018-12-27 18:59:59 +02:00
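The pattern of the fix — remember the error but keep draining until end-of-stream, and only throw once the loop has finished — can be sketched like this. The `frame`/`drain` types are simplified stand-ins, not Seastar's actual `rpc::source` API:

```cpp
#include <cassert>
#include <optional>
#include <queue>
#include <stdexcept>
#include <string>

struct frame { bool is_error; std::string payload; };

// Drain the whole "stream" even after an error frame; defer the throw until
// end-of-stream so the source is never abandoned mid-stream.
int drain(std::queue<frame>& source) {
    std::optional<std::string> deferred_error;
    int consumed = 0;
    while (!source.empty()) {            // loop until end-of-stream
        frame f = source.front();
        source.pop();
        ++consumed;
        if (f.is_error && !deferred_error) {
            deferred_error = f.payload;  // remember the error, keep reading
        }
    }
    if (deferred_error) {
        throw std::runtime_error(*deferred_error); // throw outside the loop
    }
    return consumed;
}
```

Throwing inside the loop instead would leave frames unread — the abandoned-source condition the commit fixes.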
Avi Kivity
32ebaaa585 Update libdeflate submodule
* libdeflate 17ec6c9...e7e54ea (1):
  > build: improve out-of-tree build with multiple output trees

(cherry picked from commit d6a22c50cb)
2018-12-25 14:41:24 +02:00
Nadav Har'El
a88c722a4c build_ami.sh: need to check out the right branch of scylla-jmx
This patch is for branch 3.0's build_ami.sh.
It checks out the latest master branch of scylla-jmx, which not only
sounds wrong but also doesn't work: the latest master of scylla-jmx
can only build a "relocatable package" but branch 3.0 doesn't work with
those.

This patch needs to be applied only in branch 3.0.

It should probably be made more general, though... build_ami.sh should
have been able to figure out what the *current* branch is, and if it is
branch-3.0 or next-3.0, check out branch-3.0 of the other repositories.
But I'm not sure how to do this correctly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181217214610.4498-1-nyh@scylladb.com>
2018-12-25 12:37:11 +02:00
Tomasz Grabiec
07582d6c10 sstables: index_reader: Fix abort when _trust_pi == trust_promoted_index::no
The data is not moved-from if _trust_pi == trust_promoted_index::no,
which triggers the assert on data.empty(). We should make it empty
unconditionally.

Message-Id: <1545408731-14333-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 419c771791)
2018-12-24 11:45:14 +02:00
Tomasz Grabiec
18c89edbf7 sstables: mc: reader: Use enum class instead of variant
variant is an overkill here.

Message-Id: <1545409014-16289-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 07d153c769)
2018-12-24 11:45:09 +02:00
Duarte Nunes
5558fa8c44 service/storage_proxy: Protect against empty mutation when storing hint
mutation_holder::get_mutation_for() can return nullptr's, so protect
against those when storing a hint.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181221194853.98775-2-duarte@scylladb.com>
(cherry picked from commit e6a8883228)
2018-12-23 12:27:27 +02:00
Duarte Nunes
f678eb52cd service/storage_proxy: Protect against empty mutation in mutation_holder
The per_destination_mutation holder can contain empty mutations,
so make sure release_mutation() skips over those.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181221194853.98775-1-duarte@scylladb.com>
(cherry picked from commit 6c4a34f378)
2018-12-23 12:27:25 +02:00
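Both hint-path fixes above come down to the same guard: `get_mutation_for()` may return a null pointer for some destinations, and the code storing hints must skip those instead of dereferencing. A minimal sketch, with `mutation`/`mutation_holder` as simplified stand-ins for the storage_proxy types:

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct mutation { int value; };

struct mutation_holder {
    // One (possibly null) mutation per destination replica.
    std::vector<std::shared_ptr<mutation>> per_destination;
    std::shared_ptr<mutation> get_mutation_for(size_t dest) {
        return per_destination.at(dest); // may legitimately be nullptr
    }
};

// Store a hint for each destination, skipping destinations whose mutation
// is empty rather than dereferencing a null pointer.
size_t store_hints(mutation_holder& h, std::vector<mutation>& hint_log) {
    size_t stored = 0;
    for (size_t d = 0; d < h.per_destination.size(); ++d) {
        if (auto m = h.get_mutation_for(d)) { // guard against nullptr
            hint_log.push_back(*m);
            ++stored;
        }
    }
    return stored;
}
```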
Tomasz Grabiec
dfb23f4b38 sstables: mc: index_reader: Handle CK_SIZE split across buffers properly
We incorrectly fell through to the next state instead of returning to read
more data.

This can manifest in a number of ways, such as an abort or an incorrect read.

Introduced in 917528c

Fixes #4011.

Message-Id: <1545402032-4114-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit d2f96a60f6)
2018-12-21 20:40:35 +02:00
Tomasz Grabiec
502ddf158a sstables: mc: reader: Avoid unnecessary index reads on fast forwarding
When the next pending fragments are after the start of the new range,
we know there is no need to skip.

Caught by perf_fast_forward --datasets large-part-ds3 \
                            --run-tests=large-partition-slicing

Refs #3984
Message-Id: <1545308006-16389-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 7afe2bad51)
2018-12-21 20:40:35 +02:00
Paweł Dziepak
0ccb0a127a Merge "Optimize slicing sstable readers" from Tomasz
"
Contains several improvements for fast-forwarding and slicing readers. Mainly
for the MC format, but not only:

  - Exiting the parser early when going out of the fast-forwarding window [MC-format-only]
  - Avoiding reading of the head of the partition when slicing
  - Avoiding parsing rows which are going to be skipped [MC-format-only]
"

* 'sstable-mc-optimize-slicing-reads' of github.com:tgrabiec/scylla:
  sstables: mc: reader: Skip ignored rows before parsing them
  sstables: mc: reader: Call _cells.clear() when row ends rather than when it starts
  sstables: mc: mutation_fragment_filter: Take position_in_partition rather than a clustering_row
  sstables: mc: reader: Do not call consume_row_marker_and_tombstone() for static rows
  sstables: mc: parser: Allow the consumer to skip the whole row
  sstables: continuous_data_consumer: Introduce skip()
  sstables: continuous_data_consumer: Make position() meaningful inside state_processor::process_state()
  sstables: mc: parser: Allocate dynamic_bitset once per read instead of once per row
  sstables: reader: Do not read the head of the partition when index can be used
  sstables: mc: mutation_fragment_filter: Check the fast-forward window first
  sstables: mc: writer: Avoid calling unsigned_vint::serialized_size()

(cherry picked from commit e6d26a528f)
2018-12-21 20:40:35 +02:00
Avi Kivity
b94997be0d Merge " Extract MC sstable writer to a separate compilation unit" from Tomasz
"
The motivation is to keep code related to each format separate, to make it
easier to comprehend and reduce incremental compilation times.

Also reduces dependency on sstable writer code by removing writer bits from
sstales.hh.

The ka/la format writers are still left in sstables.cc, they could be also extracted.
"

* 'extract-sstable-writer-code' of github.com:tgrabiec/scylla:
  sstables: Make variadic write() not picked on substitution error
  sstables: Extract MC format writer to mc/writer.cc
  sstables: Extract maybe_add_summary_entry() out of components_writer
  sstables: Publish functions used by writers in writer.hh
  sstables: Move common write functions to writer.hh
  sstables: Extract sstable_writer_impl to a header
  sstables: Do not include writer.hh from sstables.hh
  sstables: mc: Extract bound_kind_m related stuff into mc/types.hh
  sstables: types: Extract sstable_enabled_features::all()
  sstables: Move components_writer to .cc
  tests: sstable_datafile_test: Avoid dependency on components_writer

(cherry picked from commit b023e8b45d)
2018-12-21 20:40:35 +02:00
Benny Halevy
d3a5b10cb8 sstables_stats: writer_impl: move common members to base class
To be used by sstable_writer for stats collection.

Note that this patch is factored out so it can be verified with no
other change in functionality.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6853c1677d)
2018-12-21 20:40:35 +02:00
Benny Halevy
48f3f899ac sstable: make write_crc, write_digest, and new_sstable_component_file private methods
Prepare for per-sstable sub directory.
Also, these functions get most of their parameters from the sst at hand so they might
as well be first class members.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ad5f1e4fbb)
2018-12-21 20:40:35 +02:00
Paweł Dziepak
c4f745276c Merge "Optimize sstable writing of large partitions" from Tomasz
"
This series contains several optimizations of the MC format sstable writer, mainly:
  - Avoiding output_stream when serializing into memory (e.g. a row)
  - Faster serialization of primitive types when serializing into memory

I measured the improvement in throughput (frag/s) using perf_fast_forward for
datasets with a single large partition with many small rows:

  - 10% for a row with a single cell of 8 bytes
  - 10% for a row with a single cell of 100 bytes
  -  9% for a row with a single cell of 1000 bytes
  - 13% for a row with 6 cells of 100 bytes
"

* tag 'avoid-output-stream-in-sstable-writer-v2' of github.com:tgrabiec/scylla:
  bytes_ostream: Optimize writing of fixed-size types
  sstables: mc: Write temporary data to bytes_ostream rather than file_writer
  sstables: mc: Avoid double-serialization of a range tombstone marker
  sstables: file_writer: Generalize bytes& writer to accept bytes_view
  sstables: Templetize write() functions on the writer
  sstables: Turn m_format_write_helpers.cc into an impl header
  sstables: De-futurize file_writer
  bytes_ostream: Implement clear()
  bytes_ostream: Make initial chunk size configurable

(cherry picked from commit e3f53542c9)
2018-12-21 20:40:35 +02:00
Hagit Segev
392c7dee3c release: prepare for 3.0-rc3 2018-12-21 20:19:50 +02:00
Gleb Natapov
04e982f909 streaming: hold to sink while close() is running and call close on error as well
Currently, if something throws in the mutation sending loop while
streaming, the sink is not closed. Also, while close() is running, the
code does not hold onto the sink object. close() is async, so the sink
should be kept alive until it completes. The patch uses do_with() to hold
onto the sink while close() is running, and runs close() on the error
path too.

Fixes #4004.

Message-Id: <20181220155931.GL3075@scylladb.com>
(cherry picked from commit 393269d34b)
2018-12-20 19:50:47 +02:00
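The do_with() idea above — keep the sink alive for the full duration of an asynchronous close() — can be sketched with plain shared ownership. The `sink` type and the callback plumbing here are simplified stand-ins, not Seastar's actual `rpc::sink` or `do_with` API:

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct sink {
    bool closed = false;
    // close() is "async" in miniature: it hands back a completion callback
    // that must still be able to touch the sink when it eventually runs.
    std::function<void()> close() {
        auto self = this;
        return [self] { self->closed = true; };
    }
};

// Hold shared ownership of the sink until the completion runs, so the sink
// outlives the async close even if the caller drops its own reference
// (e.g. because an error unwound the sending loop).
std::function<void()> close_keeping_alive(std::shared_ptr<sink> s) {
    auto completion = s->close();
    return [s, completion] { completion(); }; // s kept alive by the capture
}
```

Calling this on both the success and the error path mirrors the commit: close() always runs, and the sink is pinned until it finishes.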
Amnon Heiman
97a8cc149e node_exporter_install: switch to node_exporter 0.17
The newer version of node_exporter comes with important bug fixes; this
is especially important because I3.metal is not supported with the older
version of node_exporter.

The dashboards can now support both the new and the old version of
node_exporter.

Fixes #3927

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181210085251.23312-1-amnon@scylladb.com>
(cherry picked from commit 09c2b8b48a)
2018-12-20 19:12:00 +02:00
Avi Kivity
dbe347811c Merge "materialized views: Apply backpressure from view replicas" from Duarte
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.

To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMes. A loader running on an
n1-standard-8 instance ran cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.

More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.

Design document:
    https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo

Fixes #2538
"

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
  service/storage_proxy: Release mutation as early as possible
  service/storage_proxy: Delay replica writes based on view update backlog
  service/storage_proxy: Get the backlog of a particular base replica
  service/storage_proxy: Add counters for delayed base writes
  main: Start and stop the view_update_backlog_broker
  service: Distribute a node's view update backlog
  service: Advertise view update backlog over gossip
  service/storage_proxy: Send view update backlog from replicas
  service/storage_proxy: Prepare to receive replica view update backlog
  service/storage_proxy: Expose local view update backlog
  tests/view_schema_test: Add simple test for db::view::node_update_backlog
  db/view: Introduce node_update_backlog class
  db/hints: Initialize current backlog
  database: Add counter for current view backlog
  database: Expose current memory view update backlog
  idl: Add db::view::update_backlog
  db/view: Add view_update_backlog
  database: Wait on view update semaphore for view building
  service/storage_proxy: Use near-infinite timeouts for view updates
  database: generate_and_propagate_view_updates no longer needs a timeout
  ...

(cherry picked from commit b66f59aa3d)
2018-12-20 19:11:56 +02:00
Avi Kivity
8f2d24bb8f config: remove "to be removed before release" notice mc sstable config
The "enable_sstables_mc_format" config item help text wants to remove itself
before release. Since scylla-3.0 did not get enough mc format mileage, we
decided to leave it in, so the notice should be removed.

Fixes #4003.
Message-Id: <20181219082554.23923-1-avi@scylladb.com>

(cherry picked from commit dd51c659f7)
2018-12-19 19:08:36 +02:00
Nadav Har'El
689e11c892 scylla.spec: add libatomic
In some cases which I've yet to understand, build fails without libatomic.
We need to add it to the mock build machine.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181218154757.25236-1-nyh@scylladb.com>
2018-12-19 12:55:00 +00:00
Nadav Har'El
1766c793a8 build_rpm.sh: put temporary mock in build/, not /var/lib.
build_rpm.sh uses "mock" to build an entire Scylla build environment,
which easily spans more than 15 gigabytes. mock, by defaults, puts this
build directory in a subdirectory of /var/lib/mock. There is no reason
why temporary build products need to be in the root directory: Some machines
(like mine) don't have that much free space in the root directory making it
impossible to use this script on such machines. and it's too easy
to leave temporary files there without noticing.

With this patch, the mock directories are put in build/mock/ instead of
/var/lib/mock.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181217195952.15154-1-nyh@scylladb.com>
2018-12-19 12:54:37 +02:00
Avi Kivity
0b09008cde Merge "Make sstable reader fail on unknown column names in MC format" from Piotr
"
Before, the reader just ignored such columns, but this creates a risk of data loss.

Refs #2598
"

* 'haaawk/2598/v3' of github.com:scylladb/seastar-dev:
  sstables: Add test_sstable_reader_on_unknown_column
  sstables: Exception on sstable's column not present in schema
  sstables: store column name in column_translation::column_info
  sstables: Make test_dropped_column_handling test dropped columns

(cherry picked from commit b0cb69ec25)
2018-12-18 16:23:51 +00:00
Piotr Jastrzebski
713e60f690 sstables: Extract mp_row_consumer_m::check_schema_mismatch
This method will contain common logic used in multiple places
and reduce code duplication.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <bbda2f4ea4f9325055f096dc549f63b1bb03d3b6.1543311990.git.piotr@scylladb.com>
(cherry picked from commit 4366302c4c)
2018-12-18 16:23:47 +00:00
Paweł Dziepak
7b6841f947 Merge "Check for schema mismatch after dropping dead cells" from Piotr
"
Previously we were checking for schema incompatibility between the current
schema and the sstable serialization header before reading any data. This
isn't the best approach because the data in the sstable may already be
irrelevant, for example due to a column drop.

This patchset moves the check to after the actual data is read and verified
to have a timestamp new enough to classify it as non-obsolete.

Fixes #3924
"

* 'haaawk/3924/v3' of github.com:scylladb/seastar-dev:
  sstables: Enable test_schema_change for MC format
  sstables3: Throw error on schema mismatch only for live cells
  sstables: Pass column_info to consume_*_column
  sstables: Add schema_mismatch to column_info
  sstables: Store column data type in column_info
  sstables: Remove code duplication in column_translation

(cherry picked from commit 62ea153629)
2018-12-18 15:27:53 +00:00
Tomasz Grabiec
f124b7026f Merge 'Add tests for schema changes' from Paweł
This series adds a generic test for schema changes that generates
various schemas and data before and after an ALTER TABLE operation. It is
then used to check correctness of mutation::upgrade() and sstable
readers and lead to the discovery of #3924 and #3925.

Fixes #3925.

* https://github.com/pdziepak/scylla.git schema-change-test/v3.1
  schema_builder: make member function names less confusing
  converting_mutation_partition_applier: fix collection type changes
  converting_mutation_partition_applier: do not emit empty collections
  sstable: use format() instead of sprint()
  tests/random-utils: make functions and variables inline
  tests: add models for schemas and data
  tests: generate schema changes
  tests/mutation: add test for schema changes
  tests/sstable: add test for schema changes

(cherry picked from commit 564b328b2e)
2018-12-18 14:57:50 +00:00
Avi Kivity
28cca751d1 Merge "Don't binary compare compressed sstables in test_write_many_partitions_* tests" from Piotr
"
Compression is not deterministic, so instead of binary-comparing the sstable files we just read the data back
and make sure everything that was written is still present.

Tests: unit(release)
"

* 'haaawk/binary-compare-of-compressed-sstables/v3' of github.com:scylladb/seastar-dev:
  sstables: Remove compressed parameter from get_write_test_path
  sstables: Remove unused sstable test files
  sstables: Ensure compare_sstables isn't used for compressed files
  sstables: Don't binary compare compressed sstables
  sstables: Remove debug printout from test_write_many_partitions

(cherry picked from commit 1ff6b8fb96)
2018-12-18 14:53:52 +00:00
Duarte Nunes
21d08aa41e Merge 'Fix evictable shard reader related issues' from Botond
"
Some additional issues were recently discovered, related to the recent
changes to the way inactive readers are evicted and to making shard
readers evictable.
One such issue is that the `querier_cache` is not prepared for the
querier to be immediately evicted by the reader concurrency semaphore,
when registered with it as an inactive read (#3987).
The other issue is that the multishard mutation query code was not
fully prepared for evicted shard readers being re-created, or failing
while being re-created (#3991).

This series fixes both of these issues and adds a unit test which covers
the second one. I am working on a unit test which would cover the first
issue, but it's proving to be a difficult one and I don't want to delay
the fixes for these issues any longer as they also affect 3.0.

Fixes: #3987
Fixes: #3991
"

* 'evictable-reader-related-issues/branch-3.0/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: reset failed readers to inexistent state
  multishard_mutation_query: handle missing readers when dismantling
  multishard_mutation_query: add support for keeping stats for discarded partitions
  multishard_mutation_query: expect evicted reader state when creating reader
  multishard_mutation_query: pretty-print the reader state in log messages
  querier_cache: check that the query wasn't evicted during registering
  reader_concurrency_semaphore: use the correct types in the constructor
  reader_concurrency_semaphore: add consume_resources()
  reader_concurrency_semaphore::inactive_read_handle: add operator bool()
2018-12-18 13:36:42 +00:00
Botond Dénes
f0b5170fa6 multishard_mutation_query: reset failed readers to inexistent state
When attempting to dismantle readers, some of the to-be-dismantled
readers might be in a failed state. The code waiting on the reader to
stop is expecting failures, however it didn't do anything besides
logging the failure and bumping a counter. Code in the lower layers did
not know how to deal with a failed reader and would trigger
`std::bad_variant_access` when trying to process (save or cleanup) it.
To prevent this, reset the state of failed readers to `inexistent_state`
so code in the lower layers doesn't attempt to further process them.

(cherry picked from commit b4c3aab4a7)
2018-12-18 14:46:56 +02:00
Botond Dénes
3b617e873c multishard_mutation_query: handle missing readers when dismantling
When dismantling the combined buffer and the compaction state we are no
longer guaranteed to have the reader each partition originated from. The
reader might have been evicted and not resumed, or resuming it might
have failed. In any case we can no longer assume the originating reader
of each partition will be present. If a reader no longer exists,
discard the partitions that it emitted.

(cherry picked from commit 9cef043841)
2018-12-18 14:46:56 +02:00
Botond Dénes
4eb9836e64 multishard_mutation_query: add support for keeping stats for discarded partitions
In the next patches we will add code that will have to discard some of
the dismantled partitions/fragments/bytes. Prepare the
`dismantle_buffer_stats` struct for being able to track the discarded
partitions/fragments/bytes in addition to those that were successfully
dismantled.

(cherry picked from commit 438bef333b)
2018-12-18 14:46:56 +02:00
Botond Dénes
46af353209 multishard_mutation_query: expect evicted reader state when creating reader
Previously readers were created once, so `make_remote_reader()` had a
validation to ensure readers were not attempted at being created more
than once. This validation was done by checking that the reader-state is
either `inexistent` or `successful_lookup`. However with the
introduction of pausing shard readers, it is now possible that a reader
will have to be created and then re-created several times, however this
validation was not updated to expect this.
Update the validation so it also expects the reader-state to be
`evicted`, the state the reader will be if it was evicted while paused.

(cherry picked from commit ce52436af4)
2018-12-18 14:46:54 +02:00
Botond Dénes
76f70c676e multishard_mutation_query: pretty-print the reader state in log messages
(cherry picked from commit 1effb1995b)
2018-12-18 14:34:33 +02:00
Botond Dénes
afc9f0e177 querier_cache: check that the query wasn't evicted during registering
The reader concurrency semaphore can evict the querier when it is
registered as an inactive read. Make the `querier_cache` aware of this
so that it doesn't continue to process the inserted querier when this
happens.
Also add a unit test for this.

(cherry picked from commit 5780f2ce7a)
2018-12-18 14:34:33 +02:00
Botond Dénes
c899191ad5 reader_concurrency_semaphore: use the correct types in the constructor
Previously there was a type mismatch for `count` and `memory`, between
the actual type used to store them in the class (signed) and the type
of the parameters in the constructor (unsigned).
Although negative numbers are completely valid for these members,
initializing them to negative numbers don't make sense, this is why they
used unsigned types in the constructor. This restriction can backfire
however when someone intends to give these parameters the maximum
possible value, which, when interpreted as a signed value will be `-1`.
What's worse the caller might not even be aware of this unsigned->signed
conversion and be very suprised when they find out.
So to prevent surprises, expose the real type of these members, trusting
the clients of knowing what they are doing.

Also add a `no_limits` constructor, so clients don't have to make sure
they don't overflow internal types.

(cherry picked from commit e1d8237e6b)
2018-12-18 14:34:33 +02:00
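The surprise this commit removes can be shown in miniature: passing the maximum unsigned value through an unsigned constructor parameter into a signed member silently turns "unlimited" into `-1`. The `semaphore_like`/`semaphore_fixed` names are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Old shape: unsigned parameter, signed member.
struct semaphore_like {
    int64_t count;
    explicit semaphore_like(uint64_t c) : count(int64_t(c)) {}
};

// Fixed shape: expose the member's real (signed) type in the constructor,
// trusting callers to pass sensible values.
struct semaphore_fixed {
    int64_t count;
    explicit semaphore_fixed(int64_t c) : count(c) {}
};
```

A caller writing `semaphore_like(std::numeric_limits<uint64_t>::max())` to mean "no limit" ends up with a count of `-1` after the unsigned-to-signed conversion, which is exactly the kind of surprise the signature change prevents.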
Botond Dénes
a3563e5f7d reader_concurrency_semaphore: add consume_resources()
(cherry picked from commit dfd649a6b4)
2018-12-18 14:34:33 +02:00
Botond Dénes
78c5b09694 reader_concurrency_semaphore::inactive_read_handle: add operator bool()
(cherry picked from commit 21b44adbfe)
2018-12-18 14:34:33 +02:00
Amnon Heiman
a51878205a node-exporter.service: Update command line to fix service startup
The upgrade to node_exporter 0.17 commit
09c2b8b48a ("node_exporter_install: switch
to node_exporter 0.17") caused the service to no longer start. It turns
out node_exporter broke backwards compatibility of the command line
between 0.15 and 0.16. Fix it up.

While fixing the command line, the flags for all the collectors that are
enabled by default were removed.

Fixes #3989

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
[ penberg@scylladb.com: edit commit message ]
Message-Id: <20181213114831.27216-1-amnon@scylladb.com>
(cherry picked from commit 571755e117)
2018-12-18 08:33:04 +02:00
Avi Kivity
46efc08882 Update seastar submodule
* seastar 6700dc3...08f1258 (1):
  > reactor: disable nowait aio due to a kernel bug

Fixes #3996.
2018-12-17 17:00:14 +02:00
Avi Kivity
c95433c967 Update seastar submodule
* seastar 1651a2a...6700dc3 (3):
  > build: link against libatomic
  > core/semaphore: Allow combining semaphore_units()
  > core/shared_ptr: Allow releasing a lw_shared_ptr to a non-const object

Fixes #3996.
2018-12-17 15:53:01 +02:00
Piotr Sarna
df3b6fb4a8 cql3: refuse to create index on COMPACT STORAGE with ck
To follow C* compatibility, creating an index on COMPACT STORAGE
table should be disallowed not only on base primary keys,
but also when the base table contains clustering keys.
Message-Id: <ab40c39730aff2e164d11ee5159ff62b8ec9e8e8.1544698186.git.sarna@scylladb.com>

(cherry picked from commit 6743af5dbd)
2018-12-17 09:45:43 +02:00
Piotr Sarna
44ee43bb17 cql3: add refusing to create an index on static column
Secondary indexes on static columns are not yet supported,
so creating such index should return an appropriate error.

Fixes #3993
Message-Id: <700b0a71e80da52d2d5250edacc12626b55681fa.1544785127.git.sarna@scylladb.com>

(cherry picked from commit 63bd43e57e)
2018-12-17 09:44:52 +02:00
Asias He
aac363ca86 storage_service: Notify NEW_NODE only when a node is new node
This is a backport of CASSANDRA-11038.

Before this, a restarted node would be reported as a new node with a
NEW_NODE cql notification.

To fix, only send the NEW_NODE notification when the node was not already
part of the cluster.

Fixes: #3979
Tests: pushed_notifications_test.py:TestPushedNotifications.restart_node_test
Message-Id: <453d750b98b5af510c4637db25b629f07dd90140.1544583244.git.asias@scylladb.com>
(cherry picked from commit 71c1681f6c)
2018-12-16 13:59:19 +02:00
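The CASSANDRA-11038 backport above is essentially a membership check before emitting the event: announce NEW_NODE only when the endpoint was not already a cluster member, so a restart does not re-announce it. A minimal sketch, with `notifier` as an invented stand-in for the storage_service notification path:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct notifier {
    std::set<std::string> members;
    std::vector<std::string> notifications;

    // Called when an endpoint's gossip status becomes NORMAL.
    void on_normal(const std::string& endpoint) {
        bool was_member = members.count(endpoint) > 0;
        members.insert(endpoint);
        if (!was_member) {
            notifications.push_back("NEW_NODE " + endpoint);
        }
        // A restarted member reaches NORMAL again, but gets no NEW_NODE event.
    }
};
```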
Duarte Nunes
9e6cc5b024 service/storage_proxy: Embed the expire timer in the response handler
Embedding the expire timer for a write response in the
abstract_write_response_handler simplifies the code as it allows
removing the rh_entry type.

It will also make the timeout easily accessible inside the handler,
for future patches.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181213111818.39983-1-duarte@scylladb.com>
(cherry picked from commit f8878238ed)
2018-12-13 13:24:09 +00:00
Duarte Nunes
13b72c7b92 Merge branch 'gossip: Send node UP event to cql client after cql server is up' from Asias
"
This is a backport of CASSANDRA-8236.

Before this patch, scylla sends the node UP event to cql client when it
sees a new node joins the cluster, i.e., when a new node's status
becomes NORMAL. The problem is, at this time, the cql server might not
be ready yet. Once the client receives the UP event, it tries to
connect to the new node's cql port and fails.

To fix, a new application_state::RPC_READY is introduced: a new node sets
RPC_READY to false when it starts gossip in the very beginning and sets
RPC_READY to true when the cql server is ready.

RPC_READY is a bad name, but I think it is better to follow Cassandra.

Nodes with or without this patch are supposed to work together with no
problem.

Refs #3843
"

* 'asias/node_up_down.upstream.v4.1' of github.com:scylladb/seastar-dev:
  storage_service: Use cql_ready facility
  storage_service: Handle application_state::RPC_READY
  storage_service: Add notify_cql_change
  storage_service: Add debug log in notify_joined
  storage_service: Add extra check in notify_joined
  storage_service: Add notify_joined
  storage_service: Add debug log in notify_up
  storage_service: Add extra check in notify_up
  storage_service: Add notify_up
  storage_service: Make notify_left log debug level
  storage_service: Introduce notify_left
  storage_service: Add debug log in notify_down
  storage_service: Introduce notify_down
  storage_service: Add set_cql_ready
  gossip: Add gossiper::is_cql_ready
  gms: Add endpoint_state::is_cql_ready
  gms: Add application_state::RPC_READY
  gms: Introduce cql_ready in versioned_value

(cherry picked from commit a42b2895c2)
2018-12-13 12:06:59 +00:00
Avi Kivity
6b011fbe0a build: pass C compiler configuration in dist package build
Just like we allow customizing the C++ compiler, we should allow customizing
the C compiler.

Ref #3978
Message-Id: <20181211172821.30830-1-avi@scylladb.com>

(cherry picked from commit fa96e07e6b)
2018-12-12 14:41:38 +02:00
Tomasz Grabiec
9dd4e1b01f sstables: index_reader: Avoid schema copy in advance_to()
Introduced in 7e15e43.

Exposed by perf_fast_forward:

  running: large-partition-skips on dataset large-part-ds1
  Testing scanning large partition with skips.
  Reads whole range interleaving reads with skips according to read-skip pattern:
  read    skip      time (s)     frags     frag/s (...)
  1       0         5.268780   8000000    1518378

  1       1        31.695985   4000000     126199
Message-Id: <1544614272-21970-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 0a853b8866)
2018-12-12 14:38:49 +02:00
Nadav Har'El
e91c741ef5 secondary indexes: fail attempts to create a CUSTOM INDEX
Cassandra supports a "CREATE CUSTOM INDEX" to create a secondary index
with a custom implementation. The only custom implementation that Cassandra
supports is SASI. But Scylla doesn't support this, or any other custom
index implementation. If a CREATE CUSTOM INDEX statement is used, we
shouldn't silently ignore the "CUSTOM" tag; we should generate an error.

This patch also includes a regression test verifying that "CREATE CUSTOM INDEX"
statements with valid syntax now fail (before this patch, they succeeded).

Fixes #3977

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-2-nyh@scylladb.com>
(cherry picked from commit a0379209e6)
2018-12-12 00:32:35 +00:00
Nadav Har'El
b18e9e115d Fix typo in error message
Interestingly, this typo was copied from the original Cassandra source
code :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-1-nyh@scylladb.com>
(cherry picked from commit 36db4fba23)
2018-12-12 00:32:35 +00:00
Avi Kivity
0b86ab0d2a build: build libdeflate with user selected C compiler
If the user specified a C compiler, use it to build libdeflate.

Fixes #3978.
Message-Id: <20181211145604.14847-1-avi@scylladb.com>

(cherry picked from commit 34a31a807d)
2018-12-11 19:24:24 +02:00
Duarte Nunes
97cd9108d6 db/system_distributed_keyspace: Create the schema with min_timestamp
Different nodes can concurrently create the distributed system
keyspace on boot, before the "if not exists" clause can take effect.

However, the resulting schema mutations will be different since
different nodes use different timestamps. This patch forces the
timestamps to be the same across all nodes, so we avoid some schema
mismatches.

This fixes a bug exposed by ca5dfdf, whereby the initialization of the
distributed system keyspace is done before waiting for schema
agreement. While waiting for schema agreement in
storage_service::join_token_ring(), the node still hasn't joined the
ring and schemas can't be pulled from it, so nodes can deadlock. A
similar situation can happen between a seed node and a non-seed node,
where the seed node progresses to a different "wait for schema
agreement" barrier, but still can't make progress because it can't
pull the schema from the non-seed node still trying to join the ring.

Finally, it is assumed that changes to the schema of the current
distributed system keyspace tables will be protected by a cluster
feature and a subsequent schema synchronization, such that all nodes
will be at a point where schemas can be transferred around.

Fixes #3976

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181211113407.20075-1-duarte@scylladb.com>
(cherry picked from commit 89ae3fbf11)
2018-12-11 14:53:30 +00:00
Hagit Segev
f81fe96b0b release: prepare for 3.0-rc2 2018-12-11 12:32:34 +02:00
Avi Kivity
91ce3a7957 sstables: fix overflow in clustering key blocks header bit access
_ck_blocks_header is a 64-bit variable, so the mask should be 64 bits too.
Otherwise, a shift in the range 32-63 will produce wrong results.

Fix by using a 64-bit mask.

Found by Fedora 29's ubsan.

Fixes #3973.
Message-Id: <20181209120549.21371-1-avi@scylladb.com>

(cherry picked from commit 7c7da0b462)
2018-12-10 14:10:27 +02:00
Takuya ASADA
af7e58f4c5 dist/offline_installer/redhat: fix missing dependencies
The offline installer for Scylla 3.0 causes a dependency error on CentOS;
add the missing packages.

Fixes #3969

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181207020711.23055-1-syuu@scylladb.com>
(cherry picked from commit a2d0ebf4d9)
2018-12-10 14:10:15 +02:00
Amos Kong
bd3373b511 scylla_setup: only ask for nic in interactive mode
Current scylla_setup still asks for the NIC even when it is already assigned on the command line.

Fixes #3908

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <6b867e17a5583c495c771a37d5fa1e8366b1d61b.1542337635.git.amos@scylladb.com>
(cherry picked from commit 09a3b11c2f)
2018-12-09 19:26:34 +02:00
Gleb Natapov
4820130abe storage_proxy: fix crash during write timeout callback invocation
The rh_entry address is captured inside the timeout's callback lambda, so the
structure must not be moved after it is created. Change the code to
create the rh_entry in place instead of moving it into the map.

Fixes #3972.

Message-Id: <20181206164043.GN25283@scylladb.com>
(cherry picked from commit 9fb79bf379)
2018-12-09 15:25:52 +02:00
Tomasz Grabiec
9b299241e5 Merge "Fixes for collecting stats in SST3 + more tests" from Vladimir
This patchset fixes several remaining issues found during thorough
testing of SSTables 3.x statistics and enriches ~30 unit tests with
statistics validation against Cassandra-generated golden copies.

* https://github.com/argenet/scylla/tree/projects/sstables-30/sst3-tests-statistics/v1:
  sstables: Enforce estimated_partitions in generate_summary() to be
    always positive.
  sstables: Don't enforce default max_local_deletion_time value for 'mc'
    files.
  sstables: Update TTL/local deletion stats for non-expiring and live
    liveness_info.
  sstables: Collect statistics when writing RT markers to SSTables 3.x.
  tests: Return sstable_assertions from validate_read() helper.
  tests: Introduce helper for validating stats metadata in SSTables 3.x
    tests.
  tests: Add stats metadata validation to test_write_static_row.
  tests: Add stats metadata validation to
    test_write_composite_partition_key.
  tests: Add stats metadata validation to
    test_write_composite_clustering_key.
  tests: Add stats metadata validation to test_write_wide_partitions.
  tests: Add stats metadata validation to write_ttled_row
  tests: Add stats metadata validation to write_ttled_column
  tests: Add stats metadata validation to write_deleted_column
  tests: Add stats metadata validation to write_deleted_row
  tests: Add stats metadata validation to write_collection_wide_update
  tests: Add stats metadata validation to
    write_collection_incremental_update
  tests: Add stats metadata validation to write_multiple_partitions
  tests: Add stats metadata validation to write_multiple_rows
  tests: Add stats metadata validation to
    write_missing_columns_large_set
  tests: Add stats metadata validation to write_different_types
  tests: Add stats metadata validation to write_empty_clustering_values
  tests: Add stats metadata validation to write_large_clustering_key
  tests: Add stats metadata validation to write_compact_table
  tests: Add stats metadata validation to write_user_defined_type_table
  tests: Add stats metadata validation to write_simple_range_tombstone
  tests: Add stats metadata validation to
    write_adjacent_range_tombstones
  tests: Add stats metadata validation to
    write_non_adjacent_range_tombstones
  tests: Add stats metadata validation to
    write_mixed_rows_and_range_tombstones
  tests: Add stats metadata validation to
    write_adjacent_range_tombstones_with_rows
  tests: Add stats metadata validation to
    write_range_tombstone_same_start_with_row
  tests: Add stats metadata validation to
    write_range_tombstone_same_end_with_row
  tests: Add stats metadata validation to
    write_two_non_adjacent_range_tombstones
  tests: Delete unused (bogus) Statistics.db file from write_ SST3
    tests.

(cherry picked from commit bb24d378b2)
2018-12-08 14:08:46 +02:00
Avi Kivity
745a98e151 Merge "Fix deadlocking multishard readers" from Botond
"
Multishard combining readers, running concurrently, with limited
concurrency and no timeout may deadlock, due to inactive shard readers
sitting on permits. To avoid this we have to make sure that all shard
readers belonging to a multishard combining reader that are not
currently active, can be evicted to free up their permits, ensuring that
all readers can make progress.
Making inactive shard readers evictable is the solution for this
problem, however the original series introducing this solution
(414b14a6bd) did not go all the way and
left some loose ends. These loose ends are tied up by this mini-series.
Namely, two issues remained:
* The last reader to reach EOS was not paused (made evictable).
* Readers created/resumed as part of a read-ahead were not paused
  immediately after finishing the read-ahead.

This series fixes both of these.

Fixes: #3865
Tests: unit(release, debug)
"

* 'fix-multishard-reader-deadlock/v1' of https://github.com/denesb/scylla:
  multishard_combining_reader: pause readers after reading ahead
  multishard_combining_reader: pause *all* EOS'd readers

(cherry picked from commit 21b4b2b9a1)
2018-12-08 14:08:46 +02:00
Avi Kivity
b9c99af18b Merge "Fix tombstone histogram when writing SSTables 3.x" from Vladimir
"
This patchset extends a number of existing tests to check SSTables
statistics for 'mc' format and fixes an issue discovered with the help
of one of the tests.

Tests: unit {release}
"

* 'projects/sstables-30/check-stats/v2' of https://github.com/argenet/scylla:
  tests: Run sstable_timestamp_metadata_correcness_with_negative with all SSTables versions.
  tests: Run sstable_tombstone_histogram_test for all SSTables versions.
  tests: Run min_max_clustering_key_test on all SSTables versions.
  tests: Expand test_sstable_max_local_deletion_time_2 to run for all SSTables versions.
  tests: Run test_sstable_max_local_deletion_time on all SSTables versions.
  tests: Extend test checking tombstones histogram to cover all SSTables versions.
  sstables: Properly track row-level tombstones when writing SSTables 3.x.
  tests: Run min_max_clustering_key_test_2 for all SSTables versions.
  tests: Make reusable_sst() helper accept SSTables version parameter.

(cherry picked from commit f073ea5f87)
2018-12-08 14:08:44 +02:00
Asias He
cded9c7ac7 gossip: Fix race in real_mark_alive and shutdown msg
In dtest, we have

   self.check_rows_on_node(node1, 2000)
   self.check_rows_on_node(node2, 2000)

which introduce the following cluster operations:

1) Initially:

- node1 up
- node2 up

2) self.check_rows_on_node(node1, 2000)
- node2 down
- node2 up (A: node2 will call gossiper::real_mark_alive when node2 boots
up to mark node1 up)

3) self.check_rows_on_node(node2, 2000)
- node1 down (B: node1 will send shutdown gossip message to node2, node2
will mark node1 down)
- node1 up (C: when node1 is up, node2 will call
gossiper::real_mark_alive)

Since there is no guarantee on the order of Operation A and Operation B, it
is possible that node2 will mark node1 as status=shutdown and mark node1 as
UP.

In Operation C, node2 will call gossiper::real_mark_alive to mark node1
up, but since node2 might think node1 is already up, node2 will exit
early in gossiper::real_mark_alive and not log "InetAddress 127.0.0.1 is
now UP, status={}"

As a result, dtest fails to see node2 report that node1 is up when it boots
node1, and the test fails.

   TimeoutError: 23 Nov 2018 10:44:19 [node2] Missing: ['127.0.0.1.* now UP']

In the log we can see node1 marked as DOWN and UP almost at the same time on node2:

   INFO  2018-11-23 22:31:29,999 [shard 0] gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
   INFO  2018-11-23 22:31:30,006 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = shutdown

Fixes #3940

Tests: dtest with 20 consecutive successful runs
Message-Id: <996dc325cbcc3f94fc0b7569217aa65464eaaa1c.1543213511.git.asias@scylladb.com>
(cherry picked from commit eeeb2da7bb)
2018-12-08 13:42:43 +02:00
Gleb Natapov
4acfc5ed8f hints: make hints manager more resilient to unexpected directory content
Currently, if the hints directory contains unexpected directories, Scylla
fails to start with an unhandled std::invalid_argument exception. Make the
manager ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>

(cherry picked from commit b4a8802edc)
2018-12-08 13:42:43 +02:00
Gleb Natapov
cb9199bc7f hints: add auxiliary function for scanning high level hints directory
We scan the hints directory in two places: to search for files to replay and
to search for directories to remove after resharding. The code that
translates a directory name to a shard is duplicated. It is simple now, so
not a big issue, but if it grows it is better to have it in one place.
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>

(cherry picked from commit 9433d02624)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
695ff5383f Merge "Correct the usage of row ttl and add write-read test" from Piotr
Fixes the condition which determines whether a row ttl should be used for a cell
and adds a test that uses each generated mutation to populate mutation source
and then verifies that it can read back the same mutation.

* seastar-dev.git haaawk/sst3/write-read-test/v3:
  Fix use_row_ttl condition
  Add test_all_data_is_read_back

(cherry picked from commit b8c405c019)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
730e48bf60 configure.py: Always add a rule for building gen_crc_combine_table
Fixes a build failure when only the scylla binary was selected for
building like this:

  ./configure.py --with scylla

In this case the rule for gen_crc_combine_table was missing, but it is
needed to build crc_combine_table.o

Message-Id: <1544010138-21282-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit edbef7400b)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
af6d4f40e1 utils/gz: Fix compilation on non-x86 archs
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.

Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 9a4c00beb7)
2018-12-08 13:42:43 +02:00
Avi Kivity
9d8507de09 Merge "Optimize checksum_combine() for CRC32" from Tomek
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.

This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.

Performance results using perf_checksum and second buffer of length 64 KiB:

zlib CRC32 combine:   38'851   ns
libdeflate CRC32:      4'797   ns
fast_crc32_combine():     11   ns

So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.

Tested on i7-5960X CPU @ 3.00GHz

Performance was also evaluated using sstable writer benchmark:

  perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
     --value-size=10000 --rows 1000000 --datasets small-part

It yielded 9% improvement in median frag/s (129'055 vs 117'977).

Refs #3874
"

* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
  tests: perf_checksum: Test fast_crc32_combine()
  tests: Rename libdeflate_test to checksum_utils_test
  tests: libdeflate: Add more tests for checksum_combine()
  tests: libdeflate: Check both libdeflate and default checksummers
  sstables: Use fast_crc_combine() in the default checksummer
  utils/gz: Add fast implementation of crc32_combine()
  utils/gz: Add pre-computed polynomials
  utils/gz: Import Barett reduction implementation from libdeflate
  utils: Extract clmul() from crc.hh

(cherry picked from commit b098b5b987)
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
07c980845d utils/crc: Add clmul_u32() implementation
Needed for backporting dependent changes.

Extracted from:

      commit 79136e895f
      Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
      Date:   Thu Nov 1 03:26:16 2018 +0000

        utils/crc: calculate crc in parallel
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
c52b8239d0 configure.py: Compile against Westmere on x86
Needed for backporting dependent changes.

Extracted from:

  commit 79136e895f
  Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
  Date:   Thu Nov 1 03:26:16 2018 +0000

    utils/crc: calculate crc in parallel
2018-12-08 13:42:43 +02:00
Tomasz Grabiec
5a07a4fac8 configure.py: Use armv8-a+crc+crypto ISA on aarch64
Needed for backporting dependent changes.

Extracted from:

  commit 1c48e3fbec
  Author: Yibo Cai (Arm Technology China) <Yibo.Cai@arm.com>
  Date:   Mon Oct 29 02:58:19 2018 +0000

    utils/crc: leverage arm64 crc extension
2018-12-08 13:42:43 +02:00
Avi Kivity
b9c046b17b Merge "Optimize checksum computation for the MC sstable format" from Tomek
"
One part of the improvement comes from replacing zlib's CRC32 with the one
from libdeflate, which is optimized for modern architecture and utilizes the
PCLMUL instruction.

perf_checksum test was introduced to measure performance of various
checksumming operations.

Results for 514 B (relevant for writing with compression enabled):

    test                                      iterations      median         mad         min         max
    crc_test.perf_deflate_crc32_combine            58414    16.711us     3.483ns    16.708us    16.725us
    crc_test.perf_adler_combine                165788278     6.059ns     0.031ns     6.027ns     7.519ns
    crc_test.perf_zlib_crc32_combine               59546    16.767us    26.191ns    16.741us    16.801us
    ---
    crc_test.perf_deflate_crc32_checksum        12705072    83.267ns     4.580ns    78.687ns    98.964ns
    crc_test.perf_adler_checksum                 3918014   206.701ns    23.469ns   183.231ns   258.859ns
    crc_test.perf_zlib_crc32_checksum            2329682   428.787ns     0.085ns   428.702ns   510.085ns

Results for 64 KB (relevant for writing with compression disabled):

    test                                      iterations      median         mad         min         max
    crc_test.perf_deflate_crc32_combine            25364    38.393us    17.683ns    38.375us    38.545us
    crc_test.perf_adler_combine                169797143     5.842ns     0.009ns     5.833ns     6.901ns
    crc_test.perf_zlib_crc32_combine               26067    38.663us    95.094ns    38.546us    40.523us
    ---
    crc_test.perf_deflate_crc32_checksum          202821     4.937us    14.426ns     4.912us     5.093us
    crc_test.perf_adler_checksum                   44684    22.733us   206.263ns    22.492us    25.258us
    crc_test.perf_zlib_crc32_checksum              18839    53.049us    36.117ns    53.013us    53.274us

The new CRC32 implementation (deflate_crc32) doesn't provide a fast
checksum_combine() yet; it delegates to zlib, so it's as slow as the latter.

Because for CRC32 checksum_combine() is several orders of magnitude slower
than checksum(), we avoid calling checksum_combine() completely for this
checksummer. We still do it for adler32, which has combine() which is faster
than checksum().

SStable write performance was evaluated by running:

  perf_fast_forward --populate --data-directory /tmp/perf-mc \
     --rows=10000000 -c1 -m4G --datasets small-part

Below is a summary of the average frag/s for a memtable flush. Each result is
an average of about 20 flushes with stddev of about 4k.

Before:

 [1] MC,lz4: 330'903
 [2] LA,lz4: 450'157
 [3] MC,checksum: 419'716
 [4] LA,checksum: 459'559

After:

 [1'] MC,lz4: 446'917 ([1] + 35%)
 [2'] LA,lz4: 456'046 ([2] + 1.3%)
 [3'] MC,checksum: 462'894 ([3] + 10%)
 [4'] LA,checksum: 467'508 ([4] + 1.7%)

After this series, the performance of the MC format writer is similar to that
of the LA format before the series.

There seems to be a small but consistent improvement for LA too. I'm not sure
why.
"

* tag 'improve-mc-sstable-checksum-libdeflate-v3' of github.com:tgrabiec/scylla:
  tests: perf: Introduce perf_checksum
  tests: Add test for libdeflate CRC32 implementation
  sstables: compress: Use libdeflate for crc32
  sstables: compress: Rename crc32_utils to zlib_crc32_checksummer
  licenses: Add libdeflate license
  Integrate libdeflate with the build system
  Add libdeflate submodule
  sstables: Avoid checksum_combine() for the crc32 checksummer
  sstables: compress: Avoid unnecessary checksum_combine()
  sstables: checksum_utils: Add missing include

(cherry picked from commit 5e759b0c07)
2018-12-08 13:42:43 +02:00
Avi Kivity
979cb636b8 Update seastar submodule
* seastar e64281d...1651a2a (1):
  > tests: perf: Make do_not_optimize() take the argument by const&
2018-12-08 13:42:43 +02:00
Botond Dénes
59cf9d9070 querier: fix evict_one() and evict_all_for_table()
Both of these have the same problem. They remove the to-be-evicted
entries from `_entries` but they don't unregister the `entry` from the
`read_concurrency_semaphore`. This results in the
`reader_concurrency_semaphore` being left with a dangling pointer to the
entries will trigger segfault when it tries to evict the associated
inactive reads.

Also add a unit test for `evict_all_for_table()` to check that it works
properly (`evict_one()` is only used in tests, so no dedicated test for
it).

Fixes: #3962

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <57001857e3791c6385721b624d33b667ccda2e7d.1544010868.git.bdenes@scylladb.com>
(cherry picked from commit 77dbc7d09a)
2018-12-06 11:38:44 +02:00
Duarte Nunes
c9ec9d4087 Merge seastar upstream
* seastar 880826e...e64281d (2):
  > core/semaphore: Change the access of semaphore_units main ctor
  > Merge "Add semaphore_units<>::split() function" from Duarte

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-05 20:25:17 +00:00
Gleb Natapov
2e8fefbc5a storage_proxy: store hint for CL=ANY if all nodes replied with failure
Current code assumes that a request failed if all replicas replied with
failure, but this is not true for CL=ANY requests, which can still be
satisfied by storing a hint. Take this into account.

Fixes #3565
(cherry picked from commit 17197fb005)
2018-12-05 20:14:58 +00:00
Gleb Natapov
6be0635029 storage_proxy: complete write request early if all replicas replied with success or failure
Currently, if a write request reaches CL and all replicas have replied, but
some replied with failures, the request will wait for the timeout before
being retired. Detect this case and retire the request immediately instead.

Fixes #3566

(cherry picked from commit d1d04eae3c)
2018-12-05 20:14:57 +00:00
Gleb Natapov
04a544c0a2 storage_proxy: check that write failure response comes from recognized replica
Before accounting failure response we need to make sure it comes from a
replica that participates in the request.

(cherry picked from commit 76ab3d716b)
2018-12-05 20:14:57 +00:00
Gleb Natapov
028f9b95d1 storage_proxy: move code executed on write timeout into separate function
Currently the callback is in lambda, but we will want to call the code
not only during timer expiration.

(cherry picked from commit 7bc68aa0eb)
2018-12-05 20:14:57 +00:00
Avi Kivity
54258ca8eb Merge "db/hints: Use frozen_mutation in hinted handoff" from Duarte
"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.

Tests: unit(release)
"

* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
  db/hints/manager: Use frozen_mutation instead of mutation
  db/hints/manager: Use database::find_schema()
  db/commitlog/commitlog_entry: Allow moving the contained mutation
  service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
  service/storage_proxy: Build a shared_mutation from a frozen_mutation
  service/storage_proxy: Lift frozen_mutation_and_schema
  service/storage_proxy: Allow non-const ranges in mutate_prepare()

(cherry picked from commit 1891779e64)
2018-12-05 20:14:57 +00:00
Gleb Natapov
c9a030f1f0 storage_proxy: count number of timed out write attempts after CL is reached
It is useful to have this counter to investigate the reason for read
repairs. Non zero value means that writes were lost after CL is reached
and RR is expected.

Message-Id: <20181009120900.GF22665@scylladb.com>
(cherry picked from commit 207b57a892)
2018-12-05 20:14:57 +00:00
Gleb Natapov
1c7daef554 storage_proxy: do not pass write_stats down to send_to_live_endpoints
write_stats is referenced from write handler which is available in
send_to_live_endpoints already. No need to pass it down.

Message-Id: <20181009133017.GA14449@scylladb.com>
(cherry picked from commit 319ece8180)
2018-12-05 20:14:57 +00:00
Duarte Nunes
f8195a77b0 db/view/view_builder: Don't timeout waiting for view to be built
Remove the timeout argument to
db::view::view_builder::wait_until_built(), a test-only function to
wait until a given materialized view has finished building.

This change is motivated by the fact that some tests running on slow
environments will timeout. Instead of incrementally increasing the
timeout, remove it completely since tests are already run under an
exterior timeout.

Fixes #3920

Tests: unit release(view_build_test, view_schema_test)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181115173902.19048-1-duarte@scylladb.com>
(cherry picked from commit 6fbf792777)
2018-12-05 19:20:36 +00:00
Duarte Nunes
5b724c80ab db/view: Don't copy keyspace name
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181022104527.14555-1-duarte@scylladb.com>
(cherry picked from commit f3a5ec0fd9)
2018-12-05 19:19:26 +00:00
Nadav Har'El
4a7ae81b3f materialized views: update stats.write statistics in all cases
mutate_MV usually calls send_to_endpoint() to push view updates to remote
view replicas. This function gets passed a statistics object,
service::storage_proxy_stats::write_stats and, in particular, updates
its "writes" statistic which counts the number of ongoing writes.

In the case that the paired view replica happens to be the *same* node,
we avoid calling send_to_endpoint() and call mutate_locally() instead.
That function does not take a write_stats object, so the "writes" statistic
doesn't get incremented for the duration of the write. So we should do
this explicitly.

Co-authored-by: Nadav Har'El <nyh@scylladb.com>
Co-authored-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 1d5f8d0015)
2018-12-05 19:19:26 +00:00
Piotr Sarna
3cf26a60a2 auth: add abort_source to waiting for schema agreement
When the auth service is requested to stop during bootstrap,
it might have still not reached schema agreement.
Currently, waiting for this agreement is done in an infinite loop,
without taking abort_source into account.
This patch introduces checking if abort was requested
and breaking the loop in such case, so auth service can terminate.

Tests:
unit (release)
dtest (bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test)
Message-Id: <1b7ded14b7c42254f02b5d2e10791eb767aae7fc.1543914769.git.sarna@scylladb.com>

(cherry picked from commit 7b0a3fbf8a)
2018-12-04 14:33:05 +00:00
Tomasz Grabiec
2103d0d52b sstables: Write Statistics.db offset map entries in the same order as Cassandra
Before this patch we were writing offset map entries in unspecified
order, the one returned by std::unordered_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.

Fixes #3955.

Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit aa19f98d18)
2018-12-04 14:30:19 +02:00
Avi Kivity
16ee3b3ebe Merge "Make inactive shard readers evictable" from Botond
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
  reads.
* Behave very badly when too many of them is running concurrently
  (trashing).
* May deadlock if enough of them is running without a timeout.

The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
  duration of the scan, which may take a long time.
* Will be less affected by infinite concurrency (more than the node can
  handle) as each scan now can make progress by evicting inactive shard
  readers belonging to other scans.
* Will not deadlock at all.

In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrecy_semaphore` to be notified of
  newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
  `foreign_reader`.

I can unbundle these mini series and send them separately, if the
maintainers so prefer, although considering that this series will have to
be backported to 3.0, I think this present form is better.

Fixes: #3835
"

* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
  tests/multishard_mutation_query_test: test stateless query too
  tests/querier_cache: fail resource-based eviction test gracefully
  tests/querier_cache: simplify resource-based eviction test
  tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
  tests/mutation_reader_test: restore indentation
  tests/mutation_reader_test: enrich pause-related multishard reader test
  multishard_combining_reader: use pause-resume API
  query::partition_slice: add clear_ranges() method
  position_in_partition: add region() accessor
  foreign_reader: add pause-resume API
  tests/mutation_reader_test: implement the pause-resume API
  query_mutations_on_all_shards(): implement pause-resume API
  make_multishard_streaming_reader(): implement the pause-resume API
  database: add accessors for user and streaming concurrency semaphores
  reader_lifecycle_policy: extend with a pause-resume API
  query_mutations_on_all_shards(): restore indentation
  query_mutations_on_all_shards(): simplify the state-machine
  multishard_combining_reader: use the reader lifecycle policy
  multishard_combining_reader: add reader lifecycle policy
  multishard_combining_reader: drop unnecessary `reader_promise` member
  ...

(cherry picked from commit 414b14a6bd)
2018-12-04 12:13:13 +02:00
Duarte Nunes
b0a9c40ab1 service/storage_proxy: Consider target liveness in send_to_endpoint()
So that we don't attempt to send mutations to unreachable endpoints and
instead store a hint for them, we now check the endpoint status and
populate dead_endpoints accordingly in
storage_proxy::send_to_endpoint().

Fixes #3820

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181007100640.2182-1-duarte@scylladb.com>
(cherry picked from commit 30d6ed8f92)
2018-12-03 18:38:05 +00:00
Duarte Nunes
53924e5c7f service/storage_proxy: Fix formatting of send_to_endpoint()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181006204756.32232-1-duarte@scylladb.com>
(cherry picked from commit a69d468101)
2018-12-03 18:37:59 +00:00
Avi Kivity
befe0012f5 Merge "Fix multiple summary regeneration bugs." from Vladimir
"
This patchset addresses two recently discovered bugs both triggered by
summary regeneration:

Tests: unit {release}

+

Validated with debug build of Scylla (ASAN) that no use-after-free
occurs when re-generating Summary.db.
"

* 'projects/sstables-30/summary-regeneration/v1' of https://github.com/argenet/scylla:
  tests: Add test reading SSTables in 'mc' format with missing summary.
  sstables: When loading, read statistics before summary.
  database: Capture io_priority_class by reference to avoid dangling ref.

(cherry picked from commit 009cbd3dcb)
2018-12-02 13:32:09 +02:00
Duarte Nunes
1953c5fa61 Merge 'Fix filtering with LIMIT' from Piotr
"
This series adds proper handling of filtering queries with LIMIT.
Previously the limit was erroneously applied before filtering,
which led to truncated results.

To avoid that, paged filtering queries now use an enhanced pager,
which remembers how many rows were dropped and uses that information
to fetch more pages if the limit is not yet reached.

For unpaged filtering queries, paging is done internally, as in the case
of aggregations, to avoid keeping huge results in memory.

Also, previously, all limited queries used a page size computed as
min(page size, limit). That's not good for filtering,
because with LIMIT 1 we would then query for rows one-by-one.
To avoid that, filtered queries ask for the whole page and the results
are truncated if need be afterwards.

Tests: unit (release)
"

* 'fix_filtering_with_limit_2' of https://github.com/psarna/scylla:
  tests: add filtering with LIMIT test
  tests: split filtering tests from cql_query_test
  cql3: add proper handling of filtering with LIMIT
  service/pager: use dropped_rows to adjust how many rows to read
  service/pager: virtualize max_rows_to_fetch function
  cql3: add counting dropped rows in filtering pager

(cherry picked from commit 1afda28cf3)
2018-12-02 12:07:46 +02:00
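The fix described in the merge above — counting dropped rows and applying the LIMIT only to rows that survive the filter — can be sketched as follows. This is an illustrative toy (names like `filtered_limit_query` are not Scylla's pager API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch of the pager behaviour described above: the LIMIT
// applies to rows that survive filtering, and dropped rows cause further
// pages to be fetched until the limit is reached or the data runs out.
std::vector<int> filtered_limit_query(const std::vector<int>& table,
                                      std::size_t page_size,
                                      std::size_t limit,
                                      const std::function<bool(int)>& matches) {
    std::vector<int> result;
    std::size_t pos = 0;
    while (result.size() < limit && pos < table.size()) {
        // Ask for a whole page rather than min(page size, limit) rows,
        // so a selective filter does not degenerate into row-by-row reads.
        std::size_t end = std::min(pos + page_size, table.size());
        for (; pos < end && result.size() < limit; ++pos) {
            if (matches(table[pos])) {
                result.push_back(table[pos]);
            }
            // Dropped rows do not count against the limit; the outer loop
            // keeps fetching pages as long as the limit is unmet.
        }
    }
    return result;
}
```

With a selective filter the sketch keeps reading pages past the naive limit, which is exactly why applying LIMIT before filtering truncated results.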
Duarte Nunes
b72a94b53e Merge 'Fix checking if system tables need view updates' from Piotr
"
This miniseries ensures that system tables are not checked
for having view updates, because they never do.
What's more, a distributed system table is used in the process,
so it's unsafe to query the table while streaming it.

Tests: unit (release), dtest(update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test)
"

* 'fix_checking_if_system_tables_need_view_updates_3' of https://github.com/psarna/scylla:
  streaming: don't check view building of system tables
  database: add is_internal_keyspace
  streaming: remove unused sstable_is_staging bool class

(cherry picked from commit d09d4bbd91)
2018-11-28 15:39:34 +00:00
Piotr Sarna
3f82b697f2 main: fix deinitialization order for view update generator
View update generator should be stopped only after
drain_on_shutdown() is performed on storage service.
Message-Id: <4d2bda4c73422a2ebf46d6dcd06c95d960839889.1543230849.git.sarna@scylladb.com>

(cherry picked from commit 6ab8235369)
2018-11-27 12:34:50 +00:00
Takuya ASADA
ee1ef853e5 dist/common/systemd/scylla-housekeeping-restart.service.mustache: specify correct repo for Debian variants
We specify the correct repo for both Red Hat/Debian variants in -daily, but
mistakenly don't for -restart, so do the same in -restart.

Fixes #3906

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181109224509.27380-1-syuu@scylladb.com>
(cherry picked from commit 7740cd2142)
2018-11-27 09:59:05 +02:00
Raphael S. Carvalho
6e7e7f3822 sstables: deprecate sstable metadata's ancestors
The reason for that is that it's not available in sstable format mc,
so we can no longer rely on it in common code for the currently
supported formats.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181121170057.20900-1-raphaelsc@scylladb.com>
(cherry picked from commit d29482dce8)
2018-11-24 12:36:40 +02:00
Paweł Dziepak
82a36edc9d Merge "Optimize sstable writing of the MC format" from Tomasz
"
Tested with perf_fast_forward from:

  github.com/tgrabiec/scylla.git perf_fast_forward-for-sst3-opt-write-v1

Using the following command line:

  build/release/tests/perf/perf_fast_forward_g --populate --sstable-format=mc \
     --data-directory /tmp/perf-mc --rows=10000000 -c1 -m4G \
     --datasets small-part

The average reported flush throughput was (stdev for the averages is around 4k):
  - for mc before the series: 367848 frag/s
  - for lc before the series: 463458 frag/s (= mc.before +25%)
  - for mc after the series: 429276 frag/s (= mc.before +16%)
  - for lc after the series: 466495 frag/s (= mc.before +26%)

Refs #3874.
"

* tag 'sst3-opt-write-v2' of github.com:tgrabiec/scylla:
  sstables: mc: Avoid serialization of promoted index when empty
  sstables: mc: Avoid double serialization of rows
  tests: sstable 3.x: Do not compare Statistics component
  utils: Introduce memory_data_sink
  schema: Optimize column count getters
  sstables: checksummed_file_data_sink_impl: Bypass output_stream

(cherry picked from commit 4aa5d83590)
2018-11-24 12:36:40 +02:00
Avi Kivity
d4efa3c9b2 Update seastar submodule
* seastar d6647df...880826e (1):
  > fstream: Introduce make_file_data_sink()
2018-11-24 12:36:40 +02:00
Avi Kivity
324dae3e12 Merge "compress: Restore lz4 as default compressor" from Duarte
"
Enables sstable compression with LZ4 by default, which was the
long-time behavior until a regression turned off compression by
default.

Fixes #3926
"

* 'restore-default-compression/v2' of https://github.com/duarten/scylla:
  tests/cql_query_test: Assert default compression options
  compress: Restore lz4 as default compressor
  tests: Be explicit about absence of compression

(cherry picked from commit bb85a21a8f)
2018-11-21 16:45:22 +02:00
Tomasz Grabiec
c0ffc9a2b7 utils: phased_barrier: Make advance_and_await() have strong exception guarantees
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) because _gate will be
disengaged.

One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.

This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>

Fixes #3931.

(cherry picked from commit 57e25fa0f8)
2018-11-21 12:17:27 +02:00
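The strong-guarantee rearrangement can be illustrated with a toy sketch (not the actual phased_barrier code): allocate the replacement state first, then commit it with operations that cannot throw.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>

// Toy model of the fix described above: if allocating the next phase
// throws, the barrier keeps its previous, valid state instead of being
// left with a disengaged gate. All names here are illustrative.
struct phase {
    int id;
};

struct toy_barrier {
    std::unique_ptr<phase> current = std::make_unique<phase>(phase{0});

    void advance(bool simulate_alloc_failure) {
        if (simulate_alloc_failure) {
            throw std::bad_alloc(); // stand-in for a failing allocation
        }
        // Strong exception guarantee: build the replacement first...
        auto next = std::make_unique<phase>(phase{current->id + 1});
        // ...then commit with a move assignment, which cannot throw.
        current = std::move(next);
    }
};
```

After a failed `advance()`, `current` is still the old, valid phase, so a retry is safe rather than a SIGSEGV.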
Glauber Costa
f81fa5f75c remove monitor if sstable write failed
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.

Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case because
it is done in two parts: if writing the SSTable fails right away, we
already do the correct thing.

But the SSTables are not committed at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletions. At
that point we should propagate that information to the monitor as well,
but we don't.

Fixes #3732 (hopefully)
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
(cherry picked from commit 9f403334c8)
2018-11-20 19:27:54 +02:00
Glauber Costa
6fd1cfcfce sstables: correctly parse estimated histograms
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.

The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.

We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor member
for each element, but we don't really expect the integer specialization
to do any of that.

However, to avoid confusion with future developers who may feel tempted
to convert this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.

Fixes #3918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
(cherry picked from commit c6811bd877)
2018-11-17 17:20:00 +02:00
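The difference is easy to demonstrate with a plain std::vector standing in for the histogram's pre-sized bucket array (a sketch of the bug, not the actual estimated_histogram code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// When the destination vector is already sized by its owner (as the
// estimated_histogram buckets are), reserve() + push_back() appends
// *after* the zeroed prefix instead of filling from index 0.
std::vector<long> parse_with_push_back(std::vector<long> buckets,
                                       const std::vector<long>& on_disk) {
    buckets.reserve(buckets.size() + on_disk.size());
    for (long v : on_disk) {
        buckets.push_back(v); // bug: lands after the existing zeroes
    }
    return buckets;
}

std::vector<long> parse_with_assign(std::vector<long> buckets,
                                    const std::vector<long>& on_disk) {
    buckets.assign(on_disk.begin(), on_disk.end()); // fills from index 0
    return buckets;
}
```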
Nadav Har'El
9d458ffea9 Materialized Views and Secondary Index: no longer experimental
After this patch, the Materialized Views and Secondary Index features
are considered generally-available and no longer require passing an
explicit "--experimental=on" flag to Scylla.

The "--experimental=on" flag and the db::config::check_experimental()
function remain unused, as we graduated the only two features which used
this flag. However, we leave the support for experimental features in
the code, to make it easier to add new experimental features in the future.
Another reason to leave the command-line parameter behind is so existing
scripts that still use it will not break.

Fixes #3917

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115144456.25518-1-nyh@scylladb.com>
(cherry picked from commit 78ed7d6d0c)
2018-11-15 19:50:30 +02:00
Duarte Nunes
9776a048e7 Merge 'Generating view updates during streaming' from Piotr
During streaming, there are cases when we should invoke the view write
path. In particular, if we're streaming because of repair or if a view
has not yet finished building and we're bootstrapping a new node.

The design constraints are:
1) The streamed writes should be visible to new writes, but the
   sstable should not participate in compaction, or we would lose the
   ability to exclude the streamed writes on a restart;
2) The streamed writes must not be considered when generating view
   updates for them;
3) Resilient to node restarts;
4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges.

We achieve this by writing the streamed writes to an sstable in a
different folder, call it "staging". We achieve 1) by publishing the
sstable to the column family sstable set, but excluding it from
compactions. We do these steps upon boot, by looking at the staging
directory, thus achieving 3).

Fixes #3275

* 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits)
  tests: add materialized views test
  tests: add view update generator to cql test env
  main: add registering staging sstables read from disk
  database: add a check if loaded sstable is already staging
  database: add get_staging_sstable method
  streaming: stream tables with views through staging sstables
  streaming: add system distributed keyspace ref to streaming
  streaming: add view update generator reference to streaming
  main: add generating missed mv updates from staging sstables
  storage_service: move initializing sys_dist_ks before bootstrap
  db/view: add view_update_from_staging_generator service
  db/view: add view updating consumer
  table: add stream_view_replica_updates
  table: split push_view_replica_updates
  table: add as_mutation_source_excluding
  table: move push_view_replica_updates to table.cc
  database: add populating tables with staging sstables
  database: add creating /staging directory for sstables
  database: add sstable-excluding reader
  table: add move_sstable_from_staging_in_thread function
  ...

(cherry picked from commit a38f6078fb)
2018-11-15 17:46:20 +02:00
Asias He
10cf97375e streaming: Expose reason for streaming
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for, so the receiver can decide on further operations,
e.g., sending view updates, beyond applying the streamed data on disk.

Fixes #3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>

(cherry picked from commit 7f826d3343)
2018-11-15 17:45:31 +02:00
Paweł Dziepak
e6355a9a01 Merge "Write static rows for all partitions if there are static columns" from Vladimir
"
It appears that in case when there are any static columns in serialization header,
Cassandra would write a (possibly empty) static row to every partition
in the SSTables file.

This patchset aligns Scylla's logic with that of Cassandra.

Note that Scylla optimizes the case when no partition contains a static
row because it keeps track of updated columns that Scylla currently does
not do - see #3901 for details.

Fixes #3900.
"

* 'projects/sstables-30/write-all-static-rows/v1' of https://github.com/argenet/scylla:
  tests: Test writing empty static rows for partitions in tables with static columns.
  sstables: Ignore empty static rows on reading.
  sstables: Write empty static rows when there are static columns in the table.

(cherry picked from commit 6469a1b451)
2018-11-12 15:59:35 -08:00
Raphael S. Carvalho
e57907a1d5 sstables: fix procedure to get fully expired sstables with MC format
MC format lacks ancestors metadata, so we need to work around it by using
ancestors in the metadata collector, which is only available for an sstable
written during this instance. It works fine here because we only want
to know if a recently compacted sstable has an ancestor which wasn't
yet deleted.

Fixes #3852.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20181102154951.22950-1-raphaelsc@scylladb.com>
(cherry picked from commit 1c5934c934)
2018-11-06 16:03:18 +02:00
Pekka Enberg
f94b46e7e0 docker: Switch to 3.0 RPM repository 2018-11-01 19:40:10 +02:00
Avi Kivity
6847c12668 Merge "dist: use perftune.py for disks tuning" from Vlad
"
Use perftune.py for tuning disks:
   - Distribute/pin disks' IRQs:
      - For NVMe drives: evenly among all present CPUs.
      - For non-NVMe drives: according to chosen tuning mode.
   - For all disks used by scylla:
      - Tune nomerges
      - Tune I/O scheduler.

It's important to tune NIC and disks together in order to keep IRQ
pinning in the same mode.

Disk are detected and tuned based on the current content of
/etc/scylla/scylla.yaml configuration file.
"

Fixes #3831.

* 'use_perftune_for_disks-v3' of https://github.com/vladzcloudius/scylla:
  dist: change the sysconfig parameter name to reflect the new semantics
  scylla_util.py::sysconfig_parser: introduce has_option()
  dist: scylla_setup and scylla_sysconfig_setup: change parameter names to reflect new semantics
  dist: don't distribute posix_net_conf.sh any more
  dist: use perftune.py to tune disks and NIC

(cherry picked from commit f170e3e589)
2018-11-01 19:19:04 +02:00
Avi Kivity
80b86def1f Update seastar submodule
* seastar 0c8a2c8...d6647df (3):
  > scripts: perftune.py: properly merge parameters from the command line and the configuration file
  > scripts: perftune.py: prioritize I/O schedulers
  > Merge "scripts: perftune.py: support different I/O schedulers" from Vlad

Ref #3831.
2018-11-01 19:18:07 +02:00
Vlad Zolotarov
c6de9ea39b config: enable hinted handoff by default
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181019180401.12400-1-vladz@scylladb.com>
(cherry picked from commit 4d1bb719a4)
2018-11-01 10:41:44 +02:00
Avi Kivity
94bed81c1d Update seastar submodule
* seastar 39b89de...0c8a2c8 (1):
  > prometheus: Allow preemption between each metric

See scylladb/seastar#469.
2018-10-31 19:21:21 +02:00
Hagit Segev
0f3a21f0bb release: prepare for 3.0-rc1 2018-10-31 12:08:43 +02:00
Tomasz Grabiec
976db7e9e0 Merge "Proper support for static rows in SSTables 3.x" from Vladimir
This patchset addresses two issues with static rows support in SSTables
3.x. ('mc' format):

1. Since collections are allowed in static rows, we need to check for
complex deletion, set corresponding flag and write tombstones, if any.
2. Column indices need to be partitioned for static columns the same way
they are partitioned for regular ones.

 * github.com/argenet/scylla.git projects/sstables-30/columns-proper-order-followup/v1:
  sstables: Partition static columns by atomicity when reading/writing
    SSTables 3.x.
  sstables: Use std::reference_wrapper<> instead of a helper structure.
  sstables: Check for complex deletion when writing static rows.
  tests: Add/fix comments to
    test_write_interleaved_atomic_and_collection_columns.
  tests: Add test covering interleaved atomic and collection cells in
    static row.

(cherry picked from commit 62c7685b0d)
2018-10-30 14:51:21 +01:00
Nadav Har'El
996b86b804 Materialized views: fix race condition in resharding while view building
When a node reshards (i.e., restarts with a different number of CPUs), and
is in the middle of building a view for a pre-existing table, the view
building needs to find the right token from which to start building on all
shards. We ran the same code on all shards, hoping they would all make
the same decision on which token to continue. But in some cases, one
shard might make the decision, start building, and make progress -
all before a second shard goes to make the decision, which will now
be different.

This resulted, in some rare cases, in the new materialized view missing
a few rows when the build was interrupted with a resharding.

The fix is to add the missing synchronization: All shards should make
the same decision on whether and how to reshard - and only then should
start building the view.

Fixes #3890
Fixes #3452

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181028140549.21200-1-nyh@scylladb.com>
(cherry picked from commit b8337f8c9d)
2018-10-29 09:52:25 +00:00
Avi Kivity
b7b217cc43 Merge "Re-order columns when reading/writing SSTables 3.x" from Vladimir
"
In Cassandra, row columns are stored in a BTree that uses the following
ordering on them:
    - all atomic columns go first, then all multi-cell ones
    - columns of both types (atomic and multi-cell) are
      lexicographically ordered by name regarding each other

Scylla needs to store columns and their respective indices using the
same ordering as well as when reading them back.

Fixes #3853

Tests: unit {release}

+

Checked that the following SSTables are dumped fine using Cassandra's
sstabledump:

cqlsh:sst3> CREATE TABLE atomic_and_collection3 ( pk int, ck int, rc1 text, rc2 list<text>, rc3 text, rc4 list<text>, rc5 text, rc6 list<text>, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:sst3> INSERT INTO atomic_and_collection3 (pk, ck, rc1, rc4, rc5) VALUES (0, 0, 'hello', ['beautiful','world'], 'here');
<< flush >>

sstabledump:

[
  {
    "partition" : {
      "key" : [ "0" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 96,
        "clustering" : [ 0 ],
        "liveness_info" : { "tstamp" : "1540599270139464" },
        "cells" : [
          { "name" : "rc1", "value" : "hello" },
          { "name" : "rc5", "value" : "here" },
          { "name" : "rc4", "deletion_info" : { "marked_deleted" : "1540599270139463", "local_delete_time" : "1540599270" } },
          { "name" : "rc4", "path" : [ "45e22cb0-d97d-11e8-9f07-000000000000" ], "value" : "beautiful" },
          { "name" : "rc4", "path" : [ "45e22cb1-d97d-11e8-9f07-000000000000" ], "value" : "world" }
        ]
      }
    ]
  }
]
"

* 'projects/sstables-30/columns-proper-order/v1' of https://github.com/argenet/scylla:
  tests: Test interleaved atomic and multi-cell columns written to SSTables 3.x.
  sstables: Re-order columns (atomic first, then collections) for SSTables 3.x.
  sstables: Use a compound structure for storing information used for reading columns.

(cherry picked from commit 75dbff984c)
2018-10-28 15:51:47 +02:00
Tomasz Grabiec
c274430933 Merge "Properly write static rows missing columns for SSTables 3.x." from Vladimir
Before this fix, write_missing_columns() helper would always deal with
regular columns even when writing static rows.

This would cause errors on reading those files.

Now, the missing columns are written correctly for regular and static
rows alike.

* github.com/argenet/scylla.git projects/sstables-30/fix-writing-static-missing-columns/v1:
  schema: Add helper method returning the count of columns of specified
    kind.
  sstables: Honour the column kind when writing missing columns in 'mc'
    format.
  tests: Add test for a static row with missing columns (SStables 3.x.).

(cherry picked from commit cf2d5c19fb)
2018-10-26 13:30:12 +03:00
Avi Kivity
893a18a7c4 Merge "Properly writing/reading shadowable deletions with SSTables 3.x." from Vladimir
"
This patchset addresses two problems with shadowable deletion handling
in SSTables 3.x. ('mc' format).

Firstly, we previously did not set a flag indicating the presence of
extended flags byte with HAS_SHADOWABLE_DELETION bitmask on writing.
This would break subsequent reading and cause all types of failures up
to crash.

Secondly, when reading rows with this extended flag set, we need to
preserve that information and create a shadowable_tombstone for the row.

Tests: unit {release}
+

Verified manually with 'hexdump' and using modified 'sstabledump' that
second (shadowable) tombstone is written for MV tables by Scylla.

+
DTest (materialized_views_test.py:TestMaterializedViews.hundred_mv_concurrent_test)
that originally failed due to this issue has successfully passed locally.
"

* 'projects/sstables-30/shadowable-deletion/v4' of https://github.com/argenet/scylla:
  tests: Add tests writing both regular and shadowable tombstones to SSTables 3.x.
  tests: Add test covering writing and reading a shadowable tombstone with SSTables 3.x.
  sstables: Support Scylla-specific extension for writing shadowable tombstones.
  sstables: Introduce a feature for shadowable tombstones in Scylla.db.
  memtable: Track regular and shadowable tombstones separately in encoding_stats_collector.
  sstables: Error out when reading SSTables 3.x with Cassandra shadowable deletion.
  sstables: Support checking row extension flags for Cassandra shadowable deletion.

(cherry picked from commit 8210f4c982)
2018-10-24 19:32:57 +03:00
Tomasz Grabiec
39b39058fc sstable_mutation_reader: Do not read partition index when scanning
Even when we're using a full clustering range, need_skip() will return
true when we start a new partition and advance_context() will be
called with position_in_partition::before_all_clustered_rows(). We
should detect that there is no need to skip to that position before
the call to advance_to(*_current_partition_key), which will read the
index page.

Fixes #3868.

Message-Id: <1539881775-8578-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 9e756d3863)
2018-10-24 19:32:40 +03:00
Avi Kivity
6bf4a73d88 thrift: limit message size
Limit message size according to the configuration, to avoid a huge message from
allocating all of the server's memory.

We also need to limit memory used in aggregate by thrift, but that is left to
another patch.

Fixes #3878.
Message-Id: <20181024081042.13067-1-avi@scylladb.com>

(cherry picked from commit a9836ad758)
2018-10-24 19:32:25 +03:00
Gleb Natapov
ca4846dd63 stream_session: remove unused capture
The 'consumer function' parameter for distribute_reader_and_consume_on_shards()
captures a schema_ptr (which is a seastar::shared_ptr), but the function
is later copied to another shard, at which point the schema_ptr is also copied
and its counter is incremented by the wrong shard. The capture is not
even used, so let's just drop it.

Fixes #3838

Message-Id: <20181011075500.GN14449@scylladb.com>
(cherry picked from commit ceb361544a)
2018-10-24 09:47:02 +03:00
Takuya ASADA
2663ff7bc1 dist/common/sysctl.d: add new conf file to set fs.aio-max-nr
We need to raise fs.aio-max-nr to a larger value since Seastar may allocate
more than 65535 AIO events (= kernel default value)

Fixes #3842

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181023030449.15445-1-syuu@scylladb.com>
(cherry picked from commit 950dbdb466)
2018-10-24 09:45:51 +03:00
Tomasz Grabiec
043a575fcd Merge "Correctly handle dropped columns in SSTable 3" from Piotr J.
Previously we were making assumptions about missing columns
(the size of its value, whether it's a collection or a counter) but
they didn't have to be always true. Now we're using column type
from serialization header to use the right values.

Fixes #3859

* seastar-dev.git haaawk/projects/sstables-30/handling-dropped-columns/v4:
  sstables 3: Correctly handle dropped columns in column_translation
  sstables 3: Add test for dropped columns handling

(cherry picked from commit fc37b80d24)
2018-10-24 09:45:25 +03:00
Vlad Zolotarov
00dc400993 storage_proxy::query_result_local: create a single tracing span on a replica shard
Every use of a tracing::global_trace_state_ptr object instead of a
tracing::tracing_state_ptr, or a call to tracing::global_trace_state_ptr::get(),
creates a new tracing session (span) object.

This should never be done unless query handling moves to a different shard.

Fixes #3862

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181018003500.10030-1-vladz@scylladb.com>
(cherry picked from commit a87c11bad2)
2018-10-24 09:45:14 +03:00
Duarte Nunes
522a48a244 Merge 'Fix for a select statement with filtered columns' from Eliran
"
This patchset fixes #3803. When a select statement with filtering
is executed and the column that is needed for the filtering is not
present in the select clause, rows that should have been filtered out
according to this column will still be present in the result set.

Tests:
 1. The testcase from the issue.
 2. Unit tests (release) including the
 newly added test from this patchset.
"

* 'issues/3803/v10' of https://github.com/eliransin/scylla:
  unit test: add test for filtering queries without the filtered column
  cql3 unit test: add assertion for the number of serialized columns
  cql3: ensure retrieval of columns for filtering
  cql3: refactor find_idx to be part of statement restrictions object
  cql3: add prefix size common functionality to all clustering restrictions
  cql3: rename selection metadata manipulation functions

(cherry picked from commit 3fe92663d4)
2018-10-24 09:44:46 +03:00
Paweł Dziepak
5faa28ce45 cql3: restore original timeout behaviour for aggregate queries
Commit 1d34ef38a8 "cql3: make pagers use
time_point instead of duration" has unintentionally altered the timeout
semantics for aggregate queries. Such requests fetch multiple pages before
sending a response to the client. Originally, each of those fetches had
a timeout-duration to finish, after the problematic commit the whole
request needs to complete in a single timeout-duration. This,
unsurprisingly, makes some queries that were successful before fail with
a timeout. This patch restores the original behaviour.

Fixes #3877.

Message-Id: <20181022125318.4384-1-pdziepak@scylladb.com>
(cherry picked from commit c94d2b6aa6)
2018-10-24 09:43:59 +03:00
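The two timeout semantics can be contrasted with a toy model (illustrative names and numbers, not the pager's actual interface): each page of an aggregate query takes `page_ms` to fetch and the client-configured timeout is `timeout_ms`.

```cpp
#include <cassert>
#include <cstddef>

// Regressed semantics: one deadline shared by the whole multi-page request.
bool completes_with_shared_deadline(std::size_t pages, int page_ms,
                                    int timeout_ms) {
    int elapsed = 0;
    for (std::size_t i = 0; i < pages; ++i) {
        elapsed += page_ms;
        if (elapsed > timeout_ms) {
            return false; // later pages find the budget already spent
        }
    }
    return true;
}

// Restored semantics: every page fetch gets a full timeout-duration.
bool completes_with_per_page_deadline(std::size_t pages, int page_ms,
                                      int timeout_ms) {
    for (std::size_t i = 0; i < pages; ++i) {
        if (page_ms > timeout_ms) {
            return false; // only an individual slow fetch times out
        }
    }
    return true;
}
```

For example, five pages of 300 ms against a 1000 ms timeout succeed per-page but fail under the shared deadline, which is the regression the patch reverts.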
Avi Kivity
52be02558e config: mark range_request_timeout_in_ms and request_timeout_in_ms as Used
This makes them available in scylla --help.

Fixes #3884.
Message-Id: <20181023101150.29856-1-avi@scylladb.com>

(cherry picked from commit d9e0ea6bb0)
2018-10-24 09:43:54 +03:00
Avi Kivity
a7cbfbe63f Merge "hinted handoff: give a sender a low priority" from Vlad
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.

In order to achieve this, put its sending in the STREAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.

Fixes #3817
"

* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
  db::hints::manager: use "streaming" I/O scheduling class for reads
  commitlog::read_log_file(): set the a read I/O priority class explicitly
  db::hints::manager: add hints sender to the "streaming" CPU scheduling group

(cherry picked from commit 1533487ba8)
2018-10-24 09:43:39 +03:00
Duarte Nunes
28fd2044d2 Merge 'hinted handoff: add manager::state and split storing and replaying enablement' from Vlad
"
Refs #3828
(Probably fixes it)

We found a few flaws in a way we enable hints replaying.
First of all it was allowed before manager::start() is complete.
Then, since manager::start() is called after messaging_service is
initialized there was a time window when hints are rejected and this
creates an issue for MV.

Both issues above were found in the context of #3828.

This series fixes them both.

Tested {release}:
dtest: materialized_views_test.py:TestMaterializedViews.write_to_hinted_handoff_for_views_test
dtest: hintedhandoff_additional_test.py
"

* 'hinted_handoff_dont_create_hints_until_started-v1' of https://github.com/vladzcloudius/scylla:
  hinted handoff: enable storing hints before starting messaging_service
  db::hints::manager: add a "started" state
  db::hints::manager: introduce a _state

(cherry picked from commit 3a53b3cebc)
2018-10-24 09:43:03 +03:00
Calle Wilund
76ff2e5c3d messaging_service: Make rpc streaming sink respect tls connection
Fixes #3787

Message service streaming sink was created using a direct call to
rpc::client::make_sink. This in turn needs a new socket, which it
creates completely ignoring what underlying transport is active for the
client in question.

Fix by retaining the tls credential pointer in the client wrapper, and
using this in a sink method to determine whether to create a new tls
socket, or just go ahead with a plain one.

Message-Id: <20181010003249.30526-1-calle@scylladb.com>
(cherry picked from commit 3cb50c861d)
2018-10-23 07:36:21 +00:00
Avi Kivity
7b34d54a96 locator: fix abstract_replication_strategy::get_ranges() and friends violating sort order
get_ranges() is supposed to return ranges in sorted order. However, a35136533d
broke this and returned the range that was supposed to be last in the second
position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). This broke cleanup, which
relied on the sort order to perform a binary search. Other users of the
get_ranges() family did not rely on the sort order.

Fixes #3872.
Message-Id: <20181019113613.1895-1-avi@scylladb.com>

(cherry picked from commit 1ce52d5432)
2018-10-23 07:36:21 +00:00
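The reason the regression only broke cleanup is that binary search silently requires sorted input. A minimal illustration, with plain ints standing in for tokens (not Scylla's dht::token or range types):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Cleanup performed binary-search-style lookups over the result of
// get_ranges(); std::binary_search only gives correct answers on a
// sorted sequence, which the regressed ordering was not.
bool range_starts_contain(const std::vector<int>& range_starts, int token) {
    return std::binary_search(range_starts.begin(), range_starts.end(), token);
}
```

Callers that merely iterate over the ranges are unaffected, which matches the observation that only cleanup broke.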
Duarte Nunes
26c31f6798 Merge "db/hints: Expose current backlog" from Duarte
"
Hints are stored on disk by a hints::manager, ensuring they are
eventually sent. A hints::resource_manager ensures the hints::managers
it tracks don't consume more than their allocated resources by
monitoring disk space and disabling new hints if needed. This series
fixes some bugs related to the backlog calculation, but mainly exposes
the backlog through a hints::manager so upper layers can apply flow
control.

Refs #2538
"

* 'hh-manager-backlog/v3' of https://github.com/duarten/scylla:
  db/hints/manager: Expose current backlog
  db/hints/manager: Move decision about blocking hints to the manager
  db/hints/resource_manager: Correctly account resources in space_watchdog
  db/hints/resource_manager: Replace timer with seastar::thread
  db/hints/resource_manager: Ensure managers are correctly registered
  db/hints/resource_manager: Fix formatting
  db/hints: Disallow moving or copying the managers
2018-10-23 07:36:21 +00:00
Glauber Costa
28fa66591a sstables: print sstable path in case of an exception
Without that, we don't know where to look for the problems

Before:

 compaction failed: sstables::malformed_sstable_exception (Too big ttl: 3163676957)

After:

 compaction_manager - compaction failed: sstables::malformed_sstable_exception (Too big ttl: 4294967295 in sstable /var/lib/scylla/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/mc-832-big-Data.db)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181016181004.17838-1-glauber@scylladb.com>
(cherry picked from commit 7edae5421d)
2018-10-23 07:36:14 +00:00
Piotr Sarna
0fee1d9e43 cql3: add asking for pk/ck in the base query
Base query partition and clustering keys are used to generate
paging state for an index query, so they always need to be present
when a paged base query is processed.
Message-Id: <f3bf69453a6fd2bc842c8bdbd602d62c91cf9218.1538568953.git.sarna@scylladb.com>

Fixes #3855.
(cherry picked from commit 4a23297117)
2018-10-16 19:59:42 +03:00
Piotr Sarna
76e72e28f4 cql3: add checking for may_need_paging when executing base query
It's not sufficient to check for positive page_size when preparing
a base query for indexed select statement - may_need_paging() should
be called as well.
Message-Id: <d435820019e4082a64ca9807541f0c9ad334e6a8.1538568953.git.sarna@scylladb.com>

(cherry picked from commit 50d3de0693)
2018-10-16 19:58:58 +03:00
Piotr Sarna
f969e80965 cql3: move base query command creation to a separate function
Message-Id: <6b48b8cbd6312da4a17bfd3c85af628b4215e9f4.1538568953.git.sarna@scylladb.com>
(cherry picked from commit 11b8831c04)
2018-10-16 19:58:56 +03:00
Vladimir Krivopalov
2029134063 sstables: Reset opened range tombstone when moving to another partition.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <f6dc6b0bd88ca44f2ef84c2a8bee43fde82c89cc.1539396572.git.vladimir@scylladb.com>
(cherry picked from commit 092276b13d)
2018-10-15 13:26:22 +03:00
Vladimir Krivopalov
f30fe7bd17 sstables: Factor out code resetting values for a new partition.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83a3a4ce6942b036be447bcfeb66142828e75293.1539396572.git.vladimir@scylladb.com>
(cherry picked from commit 926b6430fd)
2018-10-15 13:26:20 +03:00
Piotr Sarna
aeb418af9e service/pager: avoid dereferencing null partition key
The pager::state() function returns a valid paging object even
if the pager itself is exhausted. It may also not contain the partition
key, so using it unconditionally was a bug - now, in case there is no
partition key present, paging state will contain an empty partition key.

Fixes #3829

Message-Id: <28401eb21ab8f12645c0a33d9e92ada9de83e96b.1539074813.git.sarna@scylladb.com>
(cherry picked from commit b3685342a6)
2018-10-15 12:47:25 +03:00
Glauber Costa
714e6d741f api: use longs instead of ints for snapshot sizes
Int types in json will be serialized to int types in C++. They will then
only be able to handle 4GB, and we tend to store more data than that.

Without this patch, listsnapshots is broken in all versions.

Fixes: #3845

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
(cherry picked from commit 98332de268)
2018-10-12 22:01:59 +03:00
Tomasz Grabiec
95c5872450 Merge "Enable sstable_mutation_test with SSTables 3.x." from Vladimir
Introduce uppermost_bound() method instead of upper_bound() in mutation_fragment_filter and clustering_ranges_walker.

For now, this has been only used to produce the final range tombstone
for sliced reads inside consume_partition_end().

Usage of the upper bound of the current range causes problems of two
kinds:
    1. If not all the slicing ranges have been traversed with the
    clustering range walker, which is possible when the last read
    mutation fragment was before some of the ranges and reading was limited
    to a specific range of positions taken from index, the emitted range
    tombstone will not cover the untraversed slices.

    2. At the same time, if all ranges have been walked past, the end
    bound is set to after_all_clustered_rows and the emitted RT may span
    more data than it should.

To avoid both situations, the uppermost bound is used instead, which
refers to the upper bound of the last range in the sequence.

* github.com/scylladb/seastar-dev.git haaawk/projects/sstables-30/enable-mc-with-sstable-mutation-test/v2
  sstables: Use uppermost_bound() instead of upper_bound() in
    mutation_fragment_filter.
  tests: Enable sstable_mutation_test for SSTables 'mc' format.

Rebased by Piotr J.

(cherry picked from commit b89556512a)
2018-10-12 17:46:49 +03:00
Tomasz Grabiec
87f8968553 Merge "Make SST3 pass test_clustering_slices test" from Piotr
* seastar-dev.git haaawk/sst3/test_clustering_slices/v8:
  sstables: Extract on_end_of_stream from consume_partition_end
  sstables: Don't call consume_range_tombstone_end in
    consume_partition_end
  sstables: Change the way fragments are returned from consumer

(cherry picked from commit 193efef950)
2018-10-12 17:46:46 +03:00
Tomasz Grabiec
2895428d44 Merge "Handle dead row markers when writing to SSTables 3.x" from Vladimir
There is a mismatch between row markers used in SSTables 2.x (ka/la) and
liveness_info used by SSTables 3.x (mc) in that a row marker can be
written as a deleted cell but liveness_info cannot.

To handle this, for a dead row marker the corresponding liveness_info is
written as expiring liveness_info with a fake TTL set to 1.
This approach is adapted from the solution for CASSANDRA-13395 that
exercised similar issue during SSTables upgrades.

* github.com/argenet/scylla.git projects/sstables-30/dead-row-marker/v7:
  sstables: Introduce TTL limitation and special 'expired TTL' value.
  sstables: Write dead row marker as expired liveness info.
  tests: Add test covering dead row marker writing to SSTables 3.x.

(cherry picked from commit a7a14e3af2)
2018-10-11 15:03:58 +03:00
Botond Dénes
e18f182cfc multishard_mutation_query(): don't attempt to stop broken readers
Currently, when stopping a reader fails, saving it simply won't be
attempted, and it will be left in the `_readers` array as-is. This can
lead to an assertion failure as the reader state will contain futures
that were already waited upon, and that the cleanup code will attempt to
wait on again. To prevent this, when stopping a reader fails, reset it
to nonexistent state, so that the cleanup code doesn't attempt to do
anything with it.

Refs: #3830

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a1afc1d3d74f196b772e6c218999c57c15ca05be.1539088164.git.bdenes@scylladb.com>
(cherry picked from commit d467b518bc)
2018-10-10 10:12:00 +03:00
Vladimir Krivopalov
cf8cdbf87d sstables: Add missing 'mc' format into format strings map in sstable::filename().
Fixes #3832.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <269421fb2ac8ab389231cbe9ed501da7e7ff936a.1539048008.git.vladimir@scylladb.com>
(cherry picked from commit e9aba6a9c3)
2018-10-10 05:53:28 +03:00
Avi Kivity
eb2814067d Update seastar submodule
* seastar 5712816...39b89de (1):
  > prometheus: Fix histogram text representation

Fixes #3827.
2018-10-09 16:35:54 +03:00
Avi Kivity
0c722d4547 Point seastar submodule at scylla-seastar.git
This allows us to freeze this branch's Seastar and only backport selected fixes.
2018-10-09 16:29:53 +03:00
Nadav Har'El
54cf463430 materialized views: refuse to filter by non-key column
A materialized view can provide a filter so as to pick up only a subset
of the rows from the base table. Usually, the filter operates on columns
from the base table's primary key. If we use a filter on regular (non-key)
columns, things get hairy, and as issue #3430 showed, wrong: merely updating
this column in the base table may require us to delete, or resurrect, the
view row. But normally we need to do the above when the "new view key column"
was updated, when there is one. We use shadowable tombstones with one
timestamp to do this, so it cannot take into account the two timestamps from
those two columns (the filtered column and the new key column).

So in the current code, filtering by a non-key column does not work correctly.
In this patch we provide two test cases (one involving TTLs, and one involving
only normal updates), which demonstrate vividly that it does *not* work
correctly. With normal updates, trying to resurrect a view row that has
previously disappeared, fails. With TTLs, things are even worse, and the view
row fails to disappear when the filtered column is TTLed.

In Cassandra, the same thing doesn't work correctly as well (see
CASSANDRA-13798 and CASSANDRA-13832) so they decided to refuse creating
a materialized view filtering a non-key column. In this patch we also
do this - fail the creation of such an unsupported view. For this reason,
the two tests mentioned above are commented out in a "#if", with, instead,
a trivial test verifying a failure to create such a view.

Note that as explained above, when the filtered column and new view key
column are *different* we have a problem. But when they are the *same* - namely
we filter by a non-key base column which actually *is* a key in the view -
we are actually fine. This patch includes additional test cases verifying
that this case is really fine and provides correct results. Accordingly,
this case is *not* forbidden in the view creation code.

Fixes #3430.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181008185633.24616-1-nyh@scylladb.com>
(cherry picked from commit b8668dc0f8)
2018-10-09 10:18:58 +01:00
Nadav Har'El
d2a0622edd materialized views: enable two tests in view_schema_test
We had two commented out tests based on Cassandra's MV unit tests, for
the case that the view's filter (the "SELECT" clause used to define the
view) filtered by a non-primary-key column. These tests used to fail
because of problems we had in the filtering code, but they now succeed,
so we can enable them. This patch also adds some comments about what
the tests do, and adds a few more cases to one of the tests.

Refs #3430.

However, note that the success of these tests does not really prove that
the non-PK-column filtering feature works fully correctly and that issue
forbidding it, as explained in
https://issues.apache.org/jira/browse/CASSANDRA-13798. We can probably
fix this feature with our "virtual cells" mechanism, but will need to add
a test to confirm the possible problem and its (probably needed fix).
We do not add such a test in this patch.

In the meantime, issue #3430 should remain open: we still *allow* users
to create MV with such a filter, and, as the tests in this patch show,
this "mostly" works correctly. We just need to prove and/or fix what happens
with the complex row liveness issues a la issue #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181004213637.32330-1-nyh@scylladb.com>
(cherry picked from commit e4ef7fc40a)
2018-10-09 10:18:54 +01:00
Duarte Nunes
60edaec757 Merge 'Fix issues with endpoint state replication to other shards' from Tomasz
Fixes #3798
Fixes #3694

Tests:

  unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)

* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
  gms/gossiper: Replicate enpoint states in add_saved_endpoint()
  gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
  gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
  gms/gossiper: Always override states from older generations

(cherry picked from commit 48ebe6552c)
2018-10-09 10:14:30 +03:00
Avi Kivity
5802532cb3 Merge "Fix mutation fragments clobbering on fast_forward" from Vladimir
"
This patchset fixes a bug in SSTables 3.x reading when fast-forwarding
is enabled. It is possible that a mutation fragment, row or RT marker,
is read and then stored because it falls outside the current
fast-forwarding range.

If the reader is further fast-forwarded but the
row still falls outside of it, the reader would still continue reading
and get the next fragment, if any, that would clobber the currently
stored one. With this fix, the reader does not attempt to read on
after storing the current fragment.

Tests: unit {release}
"

* 'projects/sstables-30/row-skipped-on-double-ff/v2' of https://github.com/argenet/scylla:
  tests: Add test for reading rows after multiple fast-forwarding with SSTables 3.x.
  sstables: mp_row_consumer_m to notify reader on end of stream when storing a mutation fragment.
  sstables: In mp_row_consumer_m::push_mutation_fragments(), return the called helper's value.

(cherry picked from commit 0fa60660b8)
2018-10-09 09:35:51 +03:00
Eliran Sinvani
83ea91055e cql3 : add workaround to antlr3 null dereference bug
The Antlr3 exception class has a null dereference bug that crashes
the system when trying to extract the exception message using
ANTLR_Exception<...>::displayRecognitionError(...) function. When
a parsing error occurs the CqlParser throws an exception which in
turn processesed for some special cases in scylla to generate a custom
message. The default case however, creates the message using
displayRecognitionError, causing the system to crash.
The fix is a simple workaround, making sure the pointer is not null
before the call to the function. A "proper" fix can't be implemented
because the exception class itself is implemented outside scylla
in antlr headers that reside on the host machine OS.

Tested manually with 2 test cases: a typo causing scylla to crash, and
a cql comment without a newline at the end that also caused scylla to crash.
Ran unit tests (release).

Fixes #3740
Fixes #3764

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <cfc7e0d758d7a855d113bb7c8191b0fd7d2e8921.1538566542.git.eliransin@scylladb.com>
(cherry picked from commit 20f49566a2)
2018-10-04 14:07:25 +03:00
Piotr Sarna
e7863d3d54 tests: add missing get() calls in threaded context
One test case missed a few get() calls in order to wait
for continuations, which only accidentally worked,
because it was followed by 'eventually()' blocks.
Message-Id: <69c145575ac81154c4b5f500d01c6b045a267088.1536839959.git.sarna@scylladb.com>

(cherry picked from commit a5570cb288)
2018-10-04 14:06:50 +03:00
Piotr Sarna
57f124b905 tests: add collections test for secondary indexing
Test case regarding creating indexes on collection columns
is added to the suite.

Refs #3654
Refs #2962
Message-Id: <1b6844634b6e9a353028545813571647c92fb330.1536839959.git.sarna@scylladb.com>

(cherry picked from commit 8a2abd45fb)
2018-10-04 14:06:48 +03:00
Piotr Sarna
40d8de5784 cql3: prevent creation of indexes on non-frozen collections
Until indexes for non-frozen collections is implemented,
creating such indexes should be disallowed to prevent unnecessary
errors on insertions/selections.

Fixes #3653
Refs #2962
Message-Id: <218cf96d5e38340806fb9446b8282d2296ba5f43.1536839959.git.sarna@scylladb.com>

(cherry picked from commit 2d355bdf47)
2018-10-04 14:06:47 +03:00
Avi Kivity
1468ec62de Merge "Handle simple column type schema changes in SST3" from Piotr
"
This patchset enables very simple column type conversions.
It covers only handling variable and fixed size type differences.
Two types still have to be compatible on the bit level to be able to convert a field from one to the other.
"

* 'haaawk/sst3/column_type_schema_change/v4' of github.com:scylladb/seastar-dev:
  Fix check_multi_schema to actually check the column type change
  Handle very basic column type conversions in SST3
  Enable check_multi_schema for SST3

(cherry picked from commit b9702222f8)
2018-10-03 17:44:26 +03:00
Avi Kivity
c6ef56ae1e Revert "compaction: demote compaction start/end messages to DEBUG level"
This reverts commit b443a9b930. The compaction
history table doesn't have enough information to be a replacement for this
log message yet.

(cherry picked from commit 7c8143c3c4)
2018-10-03 17:44:21 +03:00
Avi Kivity
ad62313b86 utils: crc32: mark power crc32 assembly as not requiring an executable stack
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).

However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.

Fix by adding the correct incantation to the file.

Fixes #3799.

Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
(cherry picked from commit aaab8a3f46)
2018-10-02 23:22:56 +03:00
Avi Kivity
de87f798e1 release: prepare for 3.0-rc0 2018-10-02 12:00:50 +03:00
Calle Wilund
2996b8154f storage_proxy: Add missing re-throw in truncate_blocking
Iff truncation times out, we want to log it, but the exception should
not be swallowed, but re-thrown.

Fixes #3796.

Message-Id: <20181001112325.17809-1-calle@scylladb.com>
2018-10-01 19:07:04 +02:00
Paweł Dziepak
ad4a50dab6 Merge "multi range reader: add support for range generating functor" from Botond
"
This series adds support for range generator functors to multi range
reader. A range generator functor can lazily generate an uknown amount
of ranges on-the-fly for the reader to read.
The range generator support was added by refactoring
`flat_multi_range_mutation_reader` to work in terms of a generator
functor. The existing overload taking a `dht::partition_range_vector`
is adapted to the generator interface behind the scenes.
"

* 'multi-range-reader-generator/v9' of https://github.com/denesb/scylla:
  tests/flat_mutation_reader_test: extend multi-range reader tests
  make_flat_multi_range_reader: add documentation
  make_flat_multi_range_reader: add generator overload
  flat_multi_range_reader: refactor to work in terms of generator
  make_flat_multi_range_reader(): better handle the 0 range case
  flat_mutation_reader: add move_buffer_content_to()
  flat_multi_range_mutation_reader: drop fwd_mr ctor parameter
2018-10-01 12:53:31 +01:00
Duarte Nunes
e6630c627b Merge 'Add secondary index paging' from Piotr
"
Indexed select statement consists of two queries - the view query
used to extract base keys and the base query that uses those keys
to return base rows.
The main idea of this series is to replace raw proxy.query() call
during the view query to one that uses a pager.
Additionally, paging info from the view query needs to be returned
to the client, in order to be used later for requesting new pages.
"

* 'paging_indexes_7' of https://github.com/psarna/scylla:
  tests: add test for secondary index with paging
  cql3: remove execute(primary_keys) from select statement
  cql3: add incremental base queries to index query
  storage_proxy: make get_restricted_ranges public
  cql3: add base query handling function to indexed statement
  cql3: add generating base key from index keys
  cql3: add paging state generation function
  cql3: move getting index view schema to prepare stage
  pager: make state() defined for exhausted pagers
  cql3: add maybe_set_paging_state function
  cql3: rename set_has_more_pages to set_paging_state
  pager: add setters for partition/clustering keys
  cql3: add paging to read_posting_list
  cql3: add non-const get_result_metadata method
  cql3: make find_index_* functions return paging state
  cql3: make read_posting_list return future<rows>
  cql3: make pagers use time_point instead of duration
2018-10-01 10:42:21 +01:00
Avi Kivity
900ffad979 config: re-add murmur3_ignore_msb_bits to scylla.yaml
Commit d6b0c4dda4 changed the built-in default
murmur3_ignore_msb_bits to 12 (from 0) and removed the scylla.yaml default.

Removal of the scylla.yaml default was a mistake for two reasons:
 - if someone downgrades a cluster, keeping scylla.yaml derived from the
   master branch, they will experience resharding since the built-in default,
   which has changed, will take effect. While that scenario is not supported,
   it already happened and caused much consternation.
 - if, in the future, we wish to change the default, we will cause resharding
   again. Embedding the default in scylla.yaml allows us to change the default
   for new clusters while allowing upgraded clusters to retain older values.

Therefore, this patch restores murmur3_ignore_msb_bits in scylla.yaml. Future
changes to the configuration item should change both scylla.yaml and the
built-in default.

Message-Id: <20180930090053.21136-1-avi@scylladb.com>
2018-10-01 10:01:36 +03:00
Takuya ASADA
0a471c32cb dist/ami/files/scylla_install_ami: enable ssh_deletekeys
For some reason upstream AMI is disabling 'ssh_deletekeys' feature on
cloud-init, but generating SSH host keys is important for public AMI
images, so enable it again.

See: https://cloudinit.readthedocs.io/en/latest/topics/modules.html?highlight=ssh_deletekeys#ssh

Fixes scylladb/scylla-ami#31

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180927122816.27809-1-syuu@scylladb.com>
2018-09-30 16:29:46 +03:00
Paweł Dziepak
2bcaf4309e utils/reusable_buffer: do not warn about large allocations
Reusable buffers are meant to be used when protocol or third-party
library limitations force us to allocate large contiguous buffers. There
isn't much that can be done about this so there is little point in
warning about that.

Fixes #3788.
Message-Id: <20180928085141.6469-1-pdziepak@scylladb.com>
2018-09-30 11:12:23 +03:00
Asias He
91dae0149d token_metadata: Invalidate cached ring in update_normal_tokens
In commit 4a0b561376, "storage_service:
Get rid of moving operation", we removed remove_from_moving() in
update_normal_tokens(). However, remove_from_moving() calls
invalidate_cached_rings(). We should call invalidate_cached_rings() in
update_normal_tokens(), otherwise we will get wrong token range to
address map in the token_metadata cache.

This issue exists in master only. It is not in any of the releases.

Message-Id: <c03f2ed478cfdb84494f36dce9a8cfc05ed9e0cd.1538288364.git.asias@scylladb.com>
2018-09-30 11:06:46 +03:00
Alexys Jacob
6d6764133b dist/common/scripts: coding style fixes
dist/common/scripts/scylla_blocktune.py:24:10: E401 multiple imports on one line
dist/common/scripts/scylla_blocktune.py:27:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:35:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:48:1: E305 expected 2 blank lines after class or function definition, found 1
dist/common/scripts/scylla_blocktune.py:52:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:59:5: E306 expected 1 blank line before a nested definition, found 0
dist/common/scripts/scylla_blocktune.py:74:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:81:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_blocktune.py:87:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_config_get.py:26:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_config_get.py:43:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_config_get.py:53:1: E305 expected 2 blank lines after class or function definition, found 1
dist/common/scripts/scylla_util.py:18:22: E401 multiple imports on one line
dist/common/scripts/scylla_util.py:19:22: E401 multiple imports on one line
dist/common/scripts/scylla_util.py:24:1: F401 'string' imported but unused
dist/common/scripts/scylla_util.py:32:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:50:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:61:30: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:75:53: E703 statement ends with a semicolon
dist/common/scripts/scylla_util.py:79:32: E272 multiple spaces before keyword
dist/common/scripts/scylla_util.py:80:25: E703 statement ends with a semicolon
dist/common/scripts/scylla_util.py:85:32: E201 whitespace after '['
dist/common/scripts/scylla_util.py:85:51: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:130:34: E201 whitespace after '['
dist/common/scripts/scylla_util.py:130:65: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:170:1: E266 too many leading '#' for block comment
dist/common/scripts/scylla_util.py:172:11: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:174:10: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:178:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:181:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:184:17: E201 whitespace after '['
dist/common/scripts/scylla_util.py:184:50: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:186:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:193:16: E201 whitespace after '['
dist/common/scripts/scylla_util.py:193:76: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:195:18: E201 whitespace after '{'
dist/common/scripts/scylla_util.py:195:27: E203 whitespace before ':'
dist/common/scripts/scylla_util.py:195:41: E203 whitespace before ':'
dist/common/scripts/scylla_util.py:195:48: E202 whitespace before '}'
dist/common/scripts/scylla_util.py:203:25: E201 whitespace after '['
dist/common/scripts/scylla_util.py:203:54: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:204:76: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:208:27: E703 statement ends with a semicolon
dist/common/scripts/scylla_util.py:217:27: E201 whitespace after '['
dist/common/scripts/scylla_util.py:217:62: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:238:25: E201 whitespace after '['
dist/common/scripts/scylla_util.py:238:87: E202 whitespace before ']'
dist/common/scripts/scylla_util.py:257:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:258:11: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:259:11: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:268:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:277:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:280:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:283:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:286:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:297:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:302:5: E722 do not use bare except'
dist/common/scripts/scylla_util.py:305:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:325:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:329:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:335:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:338:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:341:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:343:81: E231 missing whitespace after ','
dist/common/scripts/scylla_util.py:352:1: E305 expected 2 blank lines after class or function definition, found 1
dist/common/scripts/scylla_util.py:352:21: E231 missing whitespace after ':'
dist/common/scripts/scylla_util.py:352:41: E231 missing whitespace after ':'
dist/common/scripts/scylla_util.py:352:65: E231 missing whitespace after ':'
dist/common/scripts/scylla_util.py:353:1: E302 expected 2 blank lines, found 0
dist/common/scripts/scylla_util.py:358:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:360:22: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:365:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:367:11: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:370:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:373:15: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:374:14: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:375:14: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:376:20: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:385:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:388:9: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:389:9: E225 missing whitespace around operator
dist/common/scripts/scylla_util.py:393:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:396:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:399:1: E302 expected 2 blank lines, found 1
dist/common/scripts/scylla_util.py:432:1: E302 expected 2 blank lines, found 1

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180918213707.6069-1-ultrabug@gentoo.org>
2018-09-30 11:00:37 +03:00
Botond Dénes
eba8d68313 tests/flat_mutation_reader_test: extend multi-range reader tests
Add unit tests for the generator version and extend existing ones with
tests for the corner cases (0 and 1 range).
2018-09-28 14:27:55 +03:00
Botond Dénes
bb7447bbe4 make_flat_multi_range_reader: add documentation 2018-09-28 14:27:55 +03:00
Botond Dénes
39bfd5d1df make_flat_multi_range_reader: add generator overload
Allows creating a multi range reader from an arbitrary callable that
returns std::optional<dht::partition_range>. The callable is expected to
return a new range on each call, such that passing each successive range
to `flat_mutation_reader::fast_forward_to` is valid. When exhausted the
callable is expected to return std::nullopt.
2018-09-28 14:27:55 +03:00
Botond Dénes
8c5387890d flat_multi_range_reader: refactor to work in terms of generator
Instead of working with a dht::partition_range_vector directly, work
with an abstract generator that returns a pointer to the next range on
each invocation. When exhausted it returns nullptr. This opens up the
possibility to create multi range readers from a generator functor that
creates ranges lazily. This is indeed what the next patch does.
2018-09-28 14:27:55 +03:00
Botond Dénes
f3bf2e83dd make_flat_multi_range_reader(): better handle the 0 range case
Previously, when the passed in range of partition ranges contained 0
ranges, an empty reader was returned. This means that the returned
reader was forwardable or not depending on the number of passed in
ranges. This is inconsistent and can lead to nasty surprises.
To solve this problem add `forwardable_empty_mutation_reader`, a
specialized reader that delays creating the underlying reader until
fast_forward_to() is called on it, and thus a range is available.

When `make_flat_multi_range_mutation_reader()` is called with
`mutation_reader::forwarding::no` a simple empty reader is created, like
before.
2018-09-28 14:27:55 +03:00
Botond Dénes
03be9510a7 flat_mutation_reader: add move_buffer_content_to()
`move_buffer_content_to()` makes it possible to implement more efficient
wrapping readers, readers that wrap another flat mutation reader but do
no transformation to the underlying fragment stream.
These readers, when filling their buffers, can simply fill the
underlying reader's buffer, then move its content into their own. When
the reader's own buffer is empty, this is very efficient, as it can be
done by simply swapping the buffers, avoiding the work of moving the
fragments one-by-one.
2018-09-28 14:27:54 +03:00
Botond Dénes
68b6c83ee8 flat_multi_range_mutation_reader: drop fwd_mr ctor parameter
The factory function creating this reader ensures that the passed-in
ranges vector has more than one range, which effectively makes the
`fwd_mr` constructor parameter have no effect. The underlying reader
will always be created with `mutation_reader::forwarding::yes` as it has
to be able to fast-forward between the ranges.
2018-09-28 14:25:03 +03:00
Duarte Nunes
b8749a61dc tests/aggregate_fcts_test: Fix formatting of create_table()
And drop the template.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223315.28254-1-duarte@scylladb.com>
2018-09-28 09:45:27 +02:00
Duarte Nunes
17578c3579 tests/aggregate_fcts_test: Add test case for wrapped types
Provide a test case which checks a type being wrapped in a
reverse_type plays no role in assignment.

Refs #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-2-duarte@scylladb.com>
2018-09-28 07:09:08 +03:00
Duarte Nunes
5e7bb20c8a cql3/selection/selector: Unwrap types when validating assignment
When validating assignment between two types, it's possible one of
them is wrapped in a reverse_type, if it comes, for example, from the
type associated with a clustering column. When checking for weak
assignment the types are correctly unwrapped, but not when checking
for an exact match, which this patch fixes.

Technically, the receiver is never a reversed_type for the current
callers, but this is the morally correct implementation, as the type
being reversed or not plays no role in assignment.

Tests: unit(release)

Fixes #3789

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-1-duarte@scylladb.com>
2018-09-28 07:08:19 +03:00
Piotr Sarna
da3821c598 tests: add test for secondary index with paging
A test case with enough rows to have multiple pages
is added to secondary_index_test suite.
2018-09-27 15:29:28 +02:00
Piotr Sarna
4b4f57747a cql3: remove execute(primary_keys) from select statement
Right now, with specialized execute() that takes primary keys
for indexed_table_select_statement, the original execute()
method implemented in select_statement is not used anywhere,
so it's removed.
2018-09-27 15:29:28 +02:00
Piotr Sarna
9e0b3cad1e cql3: add incremental base queries to index query
Base queries that are part of index queries are allowed to be short,
which can result in wasted work - e.g. when we query all replicas
in parallel, but have to discard most of the result, since the first
one (in token order) resulted in a short read.
Thus, we start by querying 1 range, check if the read is short,
and if not, continue by querying 2x more ranges than before.

Refs #2960
2018-09-27 15:29:28 +02:00
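The doubling strategy described in this commit can be sketched as follows (a minimal Python sketch with hypothetical names; the real logic lives in C++ in indexed_table_select_statement):

```python
def query_ranges_incrementally(ranges, query_batch, page_limit):
    """Query token ranges in exponentially growing batches: 1, 2, 4, ...

    Avoids querying all ranges in parallel up front, only to discard
    most of the result when an early range produces a short read.
    """
    results = []
    pos = 0
    batch = 1
    while pos < len(ranges) and len(results) < page_limit:
        rows, short_read = query_batch(ranges[pos:pos + batch])
        results.extend(rows)
        pos += batch
        if short_read:
            break  # results past a short read would have to be discarded
        batch *= 2  # not short: query 2x more ranges than before
    return results
```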
Piotr Sarna
c41e0ade6c storage_proxy: make get_restricted_ranges public
This function is useful for splitting ranges in indexed queries.
2018-09-27 15:29:28 +02:00
Piotr Sarna
5b16aeb395 cql3: add base query handling function to indexed statement
Handling a base query during the indexed statement execution
may require updating its paging state.
2018-09-27 15:29:28 +02:00
Piotr Sarna
bce7232555 cql3: add generating base key from index keys
A function that computes base partition/clustering key from index view
primary key is provided.
2018-09-27 15:29:28 +02:00
Piotr Sarna
2f085848d8 cql3: add paging state generation function
For indexed queries, the paging state needs to be updated
based on the results of base query when the read was short.
2018-09-27 15:29:28 +02:00
Piotr Sarna
f21bcbefdf cql3: move getting index view schema to prepare stage
Searching for index view schema for an indexed statement can be done
once in prepare stage, so it's moved to indexed_table_select_statement
prepare method.
2018-09-27 15:29:28 +02:00
Piotr Sarna
b6d90b2869 pager: make state() defined for exhausted pagers
If service::pager is exhausted, state() function used to return
a nullptr instead of a pointer to a valid paging state and the
documented return type in this case was 'unspecified'.
Sometimes a paging state may be needed anyway, even if the pager
is already exhausted - thus, the state() return value becomes defined
after this commit. Exhausted pagers will return a valid paging state
object with the _remaining field set to 0.
2018-09-27 15:29:28 +02:00
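The new contract can be illustrated with a toy pager (a hypothetical Python sketch; the real interface is service::pager in C++):

```python
class PagingState:
    def __init__(self, remaining):
        self.remaining = remaining

class Pager:
    def __init__(self, total_rows, page_size):
        self._remaining = total_rows
        self._page_size = page_size

    def fetch_page(self):
        n = min(self._page_size, self._remaining)
        self._remaining -= n
        return n

    def is_exhausted(self):
        return self._remaining == 0

    def state(self):
        # After this commit: always a valid paging state, even when
        # the pager is exhausted, instead of a null pointer.
        return PagingState(self._remaining)
```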
Piotr Sarna
c1be660c3a cql3: add maybe_set_paging_state function
set_paging_state is split into its unconditional variant and a maybe_
one in order to avoid double checks.
2018-09-27 15:29:28 +02:00
Piotr Sarna
744ac3bf7b cql3: rename set_has_more_pages to set_paging_state
This function's primary goal is to set the paging state passed
as a parameter, so its name is changed to match the semantics better.
2018-09-27 15:29:28 +02:00
Glauber Costa
c3f27784de database: guarantee a minimum amount of shares when manual operations are requested.
We have found issues when a flush is requested outside the usual
memtable flush loop and because there is not a lot of data the
controller will not have a high amount of shares.

To prevent this, this patch guarantees some minimum amount of shares
when extraneous operations (nodetool flush, commitlog-driven flush, etc)
are requested.

Another option would be to add shares instead of guarantee a minimum.
But in my view the approach I am taking here has two main advantages:

1) It won't cause spikes when those operations are requested
2) It is cumbersome to add shares in the current infrastructure, as just
adding backlog can cause shares to spike. Consider this example:

  Backlog is within the first range of very low backlog (~0.2). Shares
  for this would be around ~20. If we want to add 200 shares, that is
  equivalent to a backlog of 0.8. Once we add those two backlogs
  together, we end up with 1 (max backlog).

Fixes #3761

Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180927131904.8826-1-glauber@scylladb.com>
2018-09-27 15:20:31 +02:00
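The guarantee, and the worked example in the message, can be sketched with hypothetical numbers (the toy curve below maps backlog 0.2 to ~20 shares, as in the message; the real controller curve differs):

```python
MAX_BACKLOG = 1.0
MIN_MANUAL_SHARES = 200  # hypothetical floor for manually requested flushes

def shares_from_backlog(backlog):
    # Toy linear curve: backlog 0.2 -> 20 shares.
    return backlog * 100

def backlog_from_shares(shares):
    return min(shares / 100, MAX_BACKLOG)

def effective_shares(backlog, manual_operation):
    # Approach taken by the patch: guarantee a minimum amount of shares
    # for extraneous operations, without touching the backlog itself.
    shares = shares_from_backlog(backlog)
    if manual_operation:
        shares = max(shares, MIN_MANUAL_SHARES)
    return shares
```

The rejected alternative, adding 200 shares' worth of backlog (0.8) on top of the existing 0.2, saturates the backlog at its maximum of 1.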
Piotr Sarna
336cc70438 pager: add setters for partition/clustering keys 2018-09-27 15:18:06 +02:00
Piotr Sarna
7c1e4c2deb cql3: add paging to read_posting_list
Instead of a single query, paging is used in order to query
an index.
2018-09-27 15:18:06 +02:00
Piotr Sarna
b83aa69a2e cql3: add non-const get_result_metadata method 2018-09-27 15:18:06 +02:00
Piotr Sarna
430a49f91a cql3: make find_index_* functions return paging state
In order to implement secondary index paging, intermediary query
functions now also return paging state for the view query.
2018-09-27 15:18:06 +02:00
Piotr Sarna
c3dd1775c8 cql3: make read_posting_list return future<rows>
Instead of returning a coordinator result and making a caller parse it
later, read_posting_list now extracts rows by itself.
This change is later needed when querying is replaced with a pager.
2018-09-27 15:18:06 +02:00
Piotr Sarna
1d34ef38a8 cql3: make pagers use time_point instead of duration
A standard way for passing a timeout parameter is specifying
a time_point, while pagers used to take a duration in order
to compute time points on the fly. This patch adds a timeout
parameter, which is a time_point, to fetch_page().
2018-09-27 15:18:06 +02:00
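The difference can be sketched as follows (hypothetical names): with a duration, each call derives its own deadline, so the overall deadline drifts across pages; with a time_point, the caller fixes it once.

```python
import time

def fetch_page_old(do_read, timeout_seconds):
    # Old style: the pager computes a fresh deadline on every call.
    deadline = time.monotonic() + timeout_seconds
    return do_read(deadline)

def fetch_page_new(do_read, deadline):
    # New style: the caller passes an absolute time_point, so all pages
    # of one logical request can share a single deadline.
    return do_read(deadline)
```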
Tomasz Grabiec
78d9205a50 Merge "Multiple fixes to tests/normalizing_reader" from Vladimir
This patchset addresses multiple errors in normalizing_reader
implementation found during review.

I have decided to not make a clustering key full inside
before_key()/after_key() helpers. The reason is that for this they
would need the schema to be passed as another parameter, so the existing
methods don't suit. OTOH, introducing new members in a class used
for testing purposes only seems like overkill.

* github.com/argenet/scylla.git projects/sstables-30/normalizing_reader_fixes/v1:
  range_tombstone: Add constructor accepting position_in_partition_views
    for range bounds.
  tests: Make sure range tombstone is properly split over rows with
    non-full keys.
  tests: Multiple fixes for draining and clearing range tombstones in
    normalizing_reader.
2018-09-27 12:51:47 +02:00
Vladimir Krivopalov
653fb37ea5 range_tombstone: Remove code that duplicates logic.
The actions performed by the call to set_start() were duplicated by the
immediately following code lines that are removed with this patch.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20eaa1338c1719ded34f5c9ada69ec03907936f5.1537989044.git.vladimir@scylladb.com>
2018-09-27 12:05:25 +02:00
Vladimir Krivopalov
b74706a8f5 tests: Multiple fixes for draining and clearing range tombstones in normalizing_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-26 19:24:10 -07:00
Vladimir Krivopalov
26d4d276e9 tests: Make sure range tombstone is properly split over rows with non-full keys.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-26 17:19:43 -07:00
Vladimir Krivopalov
fbccae0d15 range_tombstone: Add constructor accepting position_in_partition_views for range bounds.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-26 17:17:18 -07:00
Avi Kivity
e0b34003b5 tests: sstable_mutation_test: await background jobs
We only wait in the last test case, so if an individual test is executed,
a memory leak may be reported.

Fix by waiting in all test cases.
Message-Id: <20180926203723.18026-1-avi@scylladb.com>
2018-09-26 21:48:32 +01:00
Eliran Sinvani
44d93b4d4c cql3: fix incorrect results returned from prepared select with an IN clause
When executing a prepared select statement with a multicolumn IN, the
system returned incorrect results due to a memory violation (a bytes view
referring to an out of scope bytes object).
Added test for the prepared statement results correctness.

Tests:
1. unit (release) with the new test.
2. Python script.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <36c9cf9ed3fe72e3b4801e3cd120678429ce218a.1537947897.git.eliransin@scylladb.com>
2018-09-26 15:23:41 +03:00
Eliran Sinvani
22ad5434d1 cql3 : fix a crash upon preparing select with an IN restriction due to memory violation
When preparing a select query with a multicolumn IN restriction, the
node crashed due to using a parameter after it had been moved from.

Tests:
1. UnitTests (release)
2. Preparing a select statement that crashed the system before,
and verify it is not crashing.

Fixes #3204
Fixes #3692

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <7ebd210cd714a460ee5557ac612da970cee03270.1537947897.git.eliransin@scylladb.com>
2018-09-26 15:23:38 +03:00
Avi Kivity
8f5e80e61a Revert "setup: add the lazytime XFS version"
This reverts commit f828fe0d59. It causes
scylla_raid_setup to fail on CentOS 7.

Fixes #3784.
2018-09-26 11:10:07 +01:00
Avi Kivity
e8d988caf8 Merge "Enable existing SSTables unit tests for 'mc' format" from Vladimir and Piotr
"
This patchset fixes several issues in SSTables 3.x ('mc') writing and
parsing and extends existing SSTables unit tests to cover the new
format.

The only test enabled temporarily is check_multi_schema because it
turned out that reading SSTables 3.x with a different schema has not
been implemented in full. This will be addressed in a separate patchset.

This patchset depends on the "Support SSTables 3.x in Scylla runtime"
patchset.

Tests: unit {release}
"

* 'projects/sstables-30/unit-tests/v3' of https://github.com/argenet/scylla: (25 commits)
  tests: Enable existing SSTables tests for 'mc' format.
  tests: Fix test_wrong_range_tombstone_order for 'mc' format.
  tests: Extend reader assertions to check clustering keys made full.
  tests: Disable test_old_format_non_compound_range_tombstone_is_read for 'mc' format.
  tests: Disable check_multi_schema for 'mc' format.
  tests: Fix test_promoted_index_read for 'mc' format by using normalizing_reader.
  tests: Fix promoted_index_read to not rely on a specific index length
  tests: Add 'mc' files for test_wrong_range_tombstone_order
  tests: Add 'mc' files for test_wrong_counter_shard_order
  tests: Add 'mc' files for summary_test
  tests: Add 'mc' files for test_promoted_index_read
  tests: Add 'mc' files for test_partition_skipping
  tests: Add 'mc' files for large_partition tests (promoted_index_read, sub_partition_read, sub_partitions_read)
  tests: Add 'mc' files for test_counter_read
  tests: Add 'mc' files for test_broken_promoted_index_is_skipped
  tests: SSTables 'mc' files for sliced_mutation_reads_test.
  tests: Introduce normalizing_reader helper for SSTables tests.
  mutation_fragment: Add range_tombstone_stream::empty() method.
  sstables: Make key full when setting a range tombstone start from end open marker.
  sstables: For 'mc' format, use excl_start when splitting an RT over a row with a full key.
  ...
2018-09-26 11:10:07 +01:00
Avi Kivity
337ee6153a Merge "Support SSTables 3.x in Scylla runtime" from Vladimir and Piotr
"
This patchset makes it possible to use SSTables 'mc' format, commonly
referred to as 'SSTables 3.x', when running Scylla instance.

Several bugs found on this way are fixed. Also, a configuration option
is introduced to allow running Scylla either with 'mc' or 'la' format
as default.

Tests: unit {release}

+ tested Scylla with both 'la' and 'mc' formats to work fine:

cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE test;
cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8);
    <<flush>>
cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8;
    <<flush>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6);
cqlsh:test> SELECT * FROM cfsst3 ;

 pk | ck | rc
----+----+------
  2 |  3 | null
  4 |  6 | null

(2 rows)
    <<Scylla restart>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10);
cqlsh:test> SELECT * from cfsst3 ;

 pk | ck | rc
----+----+------
  5 |  7 | null
  8 | 10 | null
  2 |  3 | null
  4 |  6 | null
  7 |  9 | null
  6 |  8 | null

(6 rows)
"

* 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla:
  database: Honour enable_sstables_mc_format configuration option.
  sstables: Support SSTables 'mc' format as a feature.
  db: Add configuration option for enabling SSTables 'mc' format.
  tests: Add test for reading a complex column with zero subcolumns (SST3).
  sstables: Fix parsing of complex columns with zero subcolumns.
  sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
  sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
  bytes: Add helper for turning bytes_view into sstring_view.
  sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
  sstables: Fix string formatting for exception messages in m_format_read_helpers.
  sstables: Don't validate timestamps against the max value on parsing.
  sstables: Always store only min bases in serialization_header.
  sstables: Support 'mc' version parsing from filename.
  SST3: Make sure we call consume_partition_end
2018-09-26 11:10:07 +01:00
Vladimir Krivopalov
38c8d1ce05 tests: Enable existing SSTables tests for 'mc' format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
c33e0f3f15 tests: Fix test_wrong_range_tombstone_order for 'mc' format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
ad2b9e44ee tests: Extend reader assertions to check clustering keys made full.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
9239195473 tests: Disable test_old_format_non_compound_range_tombstone_is_read for 'mc' format.
This test is not applicable to the 'mc' format as it covers a backward
compatibility case which may only occur with SSTables generated by older
Scylla versions in 'ka' format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
952536c9f5 tests: Disable check_multi_schema for 'mc' format.
Altering types in schema has been disabled in Origin (see
CASSANDRA-12443). We do the same.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 18:02:46 -07:00
Vladimir Krivopalov
86aae36e04 tests: Fix test_promoted_index_read for 'mc' format by using normalizing_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
5422203714 tests: Fix promoted_index_read to not rely on a specific index length
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
be5fe11f22 tests: Add 'mc' files for test_wrong_range_tombstone_order
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
3dd6e6f899 tests: Add 'mc' files for test_wrong_counter_shard_order
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
f08a2b35da tests: Add 'mc' files for summary_test
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
7e40947a80 tests: Add 'mc' files for test_promoted_index_read
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
20f3edba61 tests: Add 'mc' files for test_partition_skipping
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
8c37801ae5 tests: Add 'mc' files for large_partition tests (promoted_index_read, sub_partition_read, sub_partitions_read)
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
28c32a353a tests: Add 'mc' files for test_counter_read
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
60c9a25b38 tests: Add 'mc' files for test_broken_promoted_index_is_skipped
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
24342dc27d tests: SSTables 'mc' files for sliced_mutation_reads_test.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
4393233a86 tests: Introduce normalizing_reader helper for SSTables tests.
This is a helper flat_mutation_reader that wraps another reader and
splits range tombstones over rows before emitting them.

It is used to produce the same mutation streams for both old (ka/la) and
new (mc) SSTables formats in unit tests.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
7a5c4f0a63 mutation_fragment: Add range_tombstone_stream::empty() method.
The method checks if the underlying range_tombstone_list is empty.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
eddf846c8a sstables: Make key full when setting a range tombstone start from end open marker.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
fa48a78d71 sstables: For 'mc' format, use excl_start when splitting an RT over a row with a full key.
This fixes the monotonicity issue as otherwise the range tombstone
emitted after such clustering row has a start position that should be
ordered before that of the row.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
45082ef18c sstables: Don't write promoted index consisting of a single block in 'mc' format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:55:52 -07:00
Piotr Jastrzebski
8f5ac1d86f SST3: Make sure we emit range tombstone when slicing/fft
If we move past the slice being read while a range tombstone is open,
we need to emit an RT corresponding to this slice.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:55:52 -07:00
Piotr Jastrzebski
ade8027960 Add mutation_fragment_filter::upper_bound
This method returns end of current position range.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:55:52 -07:00
Piotr Jastrzebski
82ff29cde8 Add clustering_ranges_walker::upper_bound
This method returns end of current position range.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:55:52 -07:00
Piotr Jastrzebski
bff49345cd Add position_in_partition_view::as_end_bound_view
This will be used in sstables 3.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:55:52 -07:00
Vladimir Krivopalov
cd80d6ff65 database: Honour enable_sstables_mc_format configuration option.
Only enable SSTables 'mc' format if the entire cluster supports it and
it is enabled in the configuration file.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
c98937e04c sstables: Support SSTables 'mc' format as a feature.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
650b245657 db: Add configuration option for enabling SSTables 'mc' format.
This flag will only be used for testing purposes until the Scylla 3.0
release and will be removed once SSTables 'mc' testing is completed.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
0edd3c57a9 tests: Add test for reading a complex column with zero subcolumns (SST3).
The files are generated by Scylla as a compaction_history table.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
24590fe88c sstables: Fix parsing of complex columns with zero subcolumns.
Before this fix, a complex column with zero subcolumns would be
incorrectly parsed, as the parser would read the deletion time twice.

Now, this case is handled properly.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
be3613bdb6 sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
This avoids noisy warnings like "signed value overflow" when ASAN is
turned on.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
0048f4814e sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
abstract_type::parse_type() only deals with simple types and fails to
parse wrapped types such as
org.apache.cassandra.db.marshal.FrozenType(org.apache.cassandra.db.marshal.ListType(org.apache.cassandra.db.marshal.UTF8Type))

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
0f298113c7 bytes: Add helper for turning bytes_view into sstring_view.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
9166badebe sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
It may happen that we hit the end of partition and then get
fast_forward_to() called in which case we attempt to call it from an
already destroyed object. We need to check the _mf_filter value before
doing so.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
fc901eb700 sstables: Fix string formatting for exception messages in m_format_read_helpers.
Before this fix, the code could exhibit undefined behaviour and crash,
because it would add a large value to a const char* and try to create a
std::string out of it.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
84341821b1 sstables: Don't validate timestamps against the max value on parsing.
Internally, timestamps are represented as signed integers (int64_t) but
stored as unsigned ones. So it is quite possible to store data with
timestamp that is represented as a number larger than the max value of
int64_t type.
One such example is api::min_timestamp(), which is used when generating
system schema tables ("keyspaces"). When cast to uint64_t, it turns into
a large value.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
bdca27ae41 sstables: Always store only min bases in serialization_header.
There previously was an inconsistency in treating min values stored in a
serialization_header. They are written to or read from a Statistics.db
as deltas against fixed bases, but when we parse timestamps from the data
file, we need the full bases, not just deltas.

This inconsistency causes wrong timestamp values if we write an sstable
and then read from it using one and the same sstable object because we
turn min values into bases on write and then don't adjust them back
because we already have them in memory.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
057c26f894 sstables: Support 'mc' version parsing from filename.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Piotr Jastrzebski
d8e6d1ed98 SST3: Make sure we call consume_partition_end
even when we slice and fast forward to.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-09-25 17:23:40 -07:00
Raphael S. Carvalho
745e35fa82 database: Fix sstable resharding for mc format
SSTable format 'mc' doesn't write ancestors to metadata, so resharding
will not work with this new format, because it relies on ancestors to
replace new unshared sstables with old shared ones.
The fix is to not rely on ancestors metadata for this operation.

Fixes #3777.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>
2018-09-25 18:37:48 +03:00
Nadav Har'El
05f8ed270b Add docs/metrics.md - documentation on metrics
Today I realised that although we have per-table metrics, they are not
*really* available by default. I was surprised to find that we don't have
(as far as I can tell) a document explaining why it is so, or how to enable
them anyway. Moreover, the more I investigated this issue, the more I
realised how little I know on Scylla's metrics - how they are calculated,
how they are collected, their different types, and so on.

So I sat down to figure out everything I wanted to learn about Scylla metrics,
and then wrote it all down in a new document, docs/metrics.md.

There are some missing pieces in this document marked by TODO, and probably
additional missing pieces that I'm not aware of, but I think this is already
a good start and can be (and should be) improved on later.

We really need to have more of these documents describing various Scylla
subsystems to new developers - what each subsystem does, why it does what
it does, where is the code, and so on. I am facing these problems every
day as a seasoned developer - I can't even imagine what our new developers
face when trying to understand a subsystem they are not yet familiar with.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180920131103.20590-1-nyh@scylladb.com>
2018-09-25 17:51:20 +03:00
Paweł Dziepak
a3746d3b05 paging: make may_need_paging() more conservative
There is a bad interaction between may_need_paging() and query result
size limiter. The former is trying to avoid the complexity of paged
queries when the number of returned rows is going to be smaller than the
page size. The latter uses the fact that paged queries need not return
all requested rows to limit the size of a query results. Since
may_need_paging() may turn a paged query into non-paged one as a side
effect it disables the oversized result protection.

This patch limits the cases when may_need_paging() disables paging to
the situations when we know for sure that query result size limiter
won't be needed, i.e.: the result is not going to contain more than one
row. If the client knows for sure that the paging is not needed and
the performance impact is worthwhile it can disable paging on its side.
Otherwise, let's default to the safer behaviour.

Fixes #3620.

Message-Id: <20180925134431.24329-1-pdziepak@scylladb.com>
2018-09-25 17:01:04 +03:00
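The tightened condition can be sketched as a predicate (hypothetical signature; the real check sits in the CQL layer):

```python
def may_disable_paging(restricted_to_single_row, limit):
    """Turn a paged query into a non-paged one only when the result
    cannot contain more than one row, so the oversized-result
    protection (which relies on paging) stays effective otherwise."""
    if restricted_to_single_row:
        return True
    return limit is not None and limit <= 1
```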
Avi Kivity
c6f651ead4 Merge "Use fragmented buffers in commitlog writes" from Paweł
"
This series changes commitlog write path so that it uses fragmented
buffers and therefore avoids large allocations. This is done by first
switching the code to use seastar memory_output_stream interface, which
can handle fragmented buffer without any additional actions from the
user code needed and then making it use buffers of fixed size 128 kB.

Tests: unit(release, debug) dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table)
"

* tag 'fragmented-commitlog-writes/v3' of https://github.com/pdziepak/scylla:
  commitlog: switch to fragmented buffers
  commitlog: drop buffer pools
  commitlog: drop recovery from bad alloc
  utils: drop data_output
  commitlog: use memory_output_stream
  serialization_visitors: add support for memory_output_stream
  utils: fragmented_temporary_buffer::view: add remove_prefix()
  utils: fragmented_temporary_buffer: add empty() and size_bytes()
  utils: fragmented_temporary_buffer: add get_ostream()
  idl: serializer: don't assume Iterator::value_type is bytes_view
  idl: serializer:  create buffer view from streams
  utils: crc: accept FragmentRange
2018-09-25 12:43:06 +03:00
Avi Kivity
8276ada1c4 tests: sstable_3_x_test: await sstable background tasks
When an sstable is deleted, this work is done as a background task
since it cannot be done from the destructor.  If we don't wait for
that background task, it is detected as a leak by ASAN.

Fix by waiting for background tasks in every test.

A more complete fix would involve having a factory class create
sstables and assume the responsibility for background tasks, and
something similar to with_cql_test_env(), but that is deferred until later.

Tests: sstable_3_x_test (debug).
Message-Id: <20180923111745.8313-1-avi@scylladb.com>
2018-09-24 10:43:58 +02:00
Takuya ASADA
21a12aa458 dist/redhat: specify correct repo file path on scylla-housekeeping services
Currently, both the scylla-housekeeping-daily/-restart services mistakenly
specify the repo file path as "@@REPOFILES@@", which is copied verbatim from
the .in template and needs to be replaced with the actual path.

Fixes #3776

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180921031605.9330-1-syuu@scylladb.com>
2018-09-23 11:38:26 +03:00
Glauber Costa
f828fe0d59 setup: add the lazytime XFS version
Starting with kernel 4.17 XFS will support the lazytime mount option.
That will be beneficial for Scylla as updating times synchronously is
one of our current sources of stalls.

Fortunately, older kernels are able to parse the option and just ignore
it. We verified that to be the case in a 4.15 kernel on ubuntu.
Therefore, just add the option unconditionally.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180920170017.13215-1-glauber@scylladb.com>
2018-09-20 20:12:44 +03:00
Gleb Natapov
0bf9a78c78 sstables: wrap file into checked file after applying extensions
File extensions can also produce errors that checked file wants to
intercept and act upon. The patch changes the order in which files are
wrapped, making checked file the outermost wrapper so it can handle
exceptions generated by all inner wrappers.

Message-Id: <20180920124430.GD2326@scylladb.com>
2018-09-20 15:57:38 +03:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewers of
the code. As humans are notoriously prone to mistakes, this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.

There were surprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the vast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
Asias He
de05df216f streaming: Use rpc::source on the shard where it is created
rpc::source can only work on the shard where it is created, thus we
cannot apply the load distribution optimization. Disable it and let the
multishard_writer forward the data to the correct shard.

Fixes #3731.

Message-Id: <0d1b4d3e7adcfdc4e392b83aeb2544b95f3f46dd.1537430162.git.asias@scylladb.com>
2018-09-20 12:29:24 +03:00
Avi Kivity
8b2bf73c6f Merge "Fix compaction metadata read/write for SSTables 3.x" from Vladimir
"
In SSTables 3.x, the 'ancestors' field of compaction metadata is no
longer stored in the Statistics.db file

The newly added test has previously failed due to this inconsistency.

Tests: unit {release}
"

* 'projects/sstables-30/empty_clustering_key/v1' of https://github.com/argenet/scylla:
  tests: Add test for reading table with empty clustering key from SSTables 3.x.
  tests: Update Statistics.db files for SSTables 3.x write tests.
  sstables: Do not parse ancestors from compaction metadata for SSTables 3.x
2018-09-20 09:53:46 +03:00
Vladimir Krivopalov
bf351c4a4f tests: Add test for reading table with empty clustering key from SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-19 20:57:23 -07:00
Vladimir Krivopalov
3bbb013ecd tests: Update Statistics.db files for SSTables 3.x write tests.
Those files have been generated with 'ancestors' field in compaction
metadata and so were invalid.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-19 20:57:23 -07:00
Vladimir Krivopalov
48fa088ec6 sstables: Do not parse ancestors from compaction metadata for SSTables 3.x
The ancestors array has been removed starting from the 'ma' format
(CASSANDRA-7066).

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-19 17:11:43 -07:00
Vlad Zolotarov
043ced243e fix_system_distributed_tables.sh: adjust newly added 'request_size' and 'response_size' columns
Adjust the script to the new schema of system_traces.sessions. Two
new columns have been added:
   - request_size:  int
   - response_size: int

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180919005504.12498-1-vladz@scylladb.com>
2018-09-19 15:46:11 +01:00
Paweł Dziepak
4469f76e7c commitlog: switch to fragmented buffers
So far commitlog was using contiguous buffers for storing the data that
is about to be written to disk. It was able to coalesce small writes so
that multiple small mutations would use the same buffer, but if a
mutation was large the commitlog would attempt to allocate a single,
appropriately large buffer. This excessively stresses the memory
allocator and may cause memory fragmentation to become an issue. The
solution is to use fixed-size buffers of 128 kB, the standard buffer
size in Scylla, and to keep large values fragmented.
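
A minimal sketch of the fragmentation arithmetic (illustrative only,
assuming nothing about the real commitlog code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: keep large entries as a chain of fixed-size
// fragments instead of one large contiguous allocation, so the memory
// allocator only ever sees 128 kB requests.
constexpr std::size_t fragment_size = 128 * 1024;

// Returns the fragment sizes needed to hold `total` bytes: all
// fragments are full-sized except possibly the last one.
std::vector<std::size_t> fragment_sizes(std::size_t total) {
    std::vector<std::size_t> sizes;
    while (total > 0) {
        std::size_t n = total < fragment_size ? total : fragment_size;
        sizes.push_back(n);
        total -= n;
    }
    return sizes;
}
```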
2018-09-18 17:22:59 +01:00
Paweł Dziepak
7c1add6769 commitlog: drop buffer pools
Buffer pools were added in 7191a130bb
"Commitlog: recycle buffers to reduce fragmentation." They introduce a
lot of complexity and will become unnecessary once the code is switched
to use fixed-size 128kB buffers.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
9fee8b8d76 commitlog: drop recovery from bad alloc
If a node cannot allocate a 128 kB buffer it is already in very bad
shape, so there isn't much value in trying to recover by attempting
smaller allocations; doing so just adds more complexity to the segment
allocation.
It actually may be better to let some requests fail and give the node a
chance to recover rather than trying to use every last byte of free
memory and end up with bad_alloc in a noexcept context.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
2e5b375309 utils: drop data_output 2018-09-18 17:22:59 +01:00
Paweł Dziepak
fe48aaae46 commitlog: use memory_output_stream
memory_output_stream deals with all required pointer arithmetic and
allows easy transition to fragmented buffers.
2018-09-18 17:22:59 +01:00
Paweł Dziepak
b9ab058834 serialization_visitors: add support for memory_output_stream 2018-09-18 17:22:59 +01:00
Paweł Dziepak
cbe2ef9e5c utils: fragmented_temporary_buffer::view: add remove_prefix() 2018-09-18 17:22:59 +01:00
Alexys Jacob
24b90ef527 configure.py: coding style fixes
configure.py:23:10: E401 multiple imports on one line
configure.py:39:61: W291 trailing whitespace
configure.py:47:1: E302 expected 2 blank lines, found 1
configure.py:53:16: W291 trailing whitespace
configure.py:55:1: E302 expected 2 blank lines, found 1
configure.py:62:1: E302 expected 2 blank lines, found 1
configure.py:63:53: E251 unexpected spaces around keyword / parameter equals
configure.py:63:55: E251 unexpected spaces around keyword / parameter equals
configure.py:63:68: E251 unexpected spaces around keyword / parameter equals
configure.py:63:70: E251 unexpected spaces around keyword / parameter equals
configure.py:63:92: E251 unexpected spaces around keyword / parameter equals
configure.py:63:94: E251 unexpected spaces around keyword / parameter equals
configure.py:64:33: E251 unexpected spaces around keyword / parameter equals
configure.py:64:35: E251 unexpected spaces around keyword / parameter equals
configure.py:65:54: E251 unexpected spaces around keyword / parameter equals
configure.py:65:56: E251 unexpected spaces around keyword / parameter equals
configure.py:65:69: E251 unexpected spaces around keyword / parameter equals
configure.py:65:71: E251 unexpected spaces around keyword / parameter equals
configure.py:65:94: E251 unexpected spaces around keyword / parameter equals
configure.py:65:96: E251 unexpected spaces around keyword / parameter equals
configure.py:66:33: E251 unexpected spaces around keyword / parameter equals
configure.py:66:35: E251 unexpected spaces around keyword / parameter equals
configure.py:68:1: E302 expected 2 blank lines, found 1
configure.py:72:18: E712 comparison to True should be 'if cond is True:' or 'if cond:'
configure.py:80:1: E302 expected 2 blank lines, found 1
configure.py:83:1: E302 expected 2 blank lines, found 1
configure.py:87:1: E302 expected 2 blank lines, found 1
configure.py:87:33: E251 unexpected spaces around keyword / parameter equals
configure.py:87:35: E251 unexpected spaces around keyword / parameter equals
configure.py:87:45: E251 unexpected spaces around keyword / parameter equals
configure.py:87:47: E251 unexpected spaces around keyword / parameter equals
configure.py:88:56: E251 unexpected spaces around keyword / parameter equals
configure.py:88:58: E251 unexpected spaces around keyword / parameter equals
configure.py:90:1: E302 expected 2 blank lines, found 1
configure.py:94:1: E302 expected 2 blank lines, found 1
configure.py:94:42: E251 unexpected spaces around keyword / parameter equals
configure.py:94:44: E251 unexpected spaces around keyword / parameter equals
configure.py:94:54: E251 unexpected spaces around keyword / parameter equals
configure.py:94:56: E251 unexpected spaces around keyword / parameter equals
configure.py:104:42: E251 unexpected spaces around keyword / parameter equals
configure.py:104:44: E251 unexpected spaces around keyword / parameter equals
configure.py:105:42: E251 unexpected spaces around keyword / parameter equals
configure.py:105:44: E251 unexpected spaces around keyword / parameter equals
configure.py:110:1: E302 expected 2 blank lines, found 1
configure.py:114:29: E251 unexpected spaces around keyword / parameter equals
configure.py:114:31: E251 unexpected spaces around keyword / parameter equals
configure.py:114:61: E251 unexpected spaces around keyword / parameter equals
configure.py:114:63: E251 unexpected spaces around keyword / parameter equals
configure.py:116:1: E302 expected 2 blank lines, found 1
configure.py:123:26: E251 unexpected spaces around keyword / parameter equals
configure.py:123:28: E251 unexpected spaces around keyword / parameter equals
configure.py:123:49: E251 unexpected spaces around keyword / parameter equals
configure.py:123:51: E251 unexpected spaces around keyword / parameter equals
configure.py:123:84: E251 unexpected spaces around keyword / parameter equals
configure.py:123:86: E251 unexpected spaces around keyword / parameter equals
configure.py:129:1: E302 expected 2 blank lines, found 1
configure.py:135:1: E302 expected 2 blank lines, found 1
configure.py:137:35: E251 unexpected spaces around keyword / parameter equals
configure.py:137:37: E251 unexpected spaces around keyword / parameter equals
configure.py:137:53: E251 unexpected spaces around keyword / parameter equals
configure.py:137:55: E251 unexpected spaces around keyword / parameter equals
configure.py:137:83: E251 unexpected spaces around keyword / parameter equals
configure.py:137:85: E251 unexpected spaces around keyword / parameter equals
configure.py:143:1: E302 expected 2 blank lines, found 1
configure.py:148:1: E302 expected 2 blank lines, found 1
configure.py:152:5: E301 expected 1 blank line, found 0
configure.py:159:5: E301 expected 1 blank line, found 0
configure.py:161:5: E301 expected 1 blank line, found 0
configure.py:163:5: E301 expected 1 blank line, found 0
configure.py:165:5: E301 expected 1 blank line, found 0
configure.py:168:1: E302 expected 2 blank lines, found 1
configure.py:169:5: F841 local variable 'mach' is assigned to but never used
configure.py:175:1: E302 expected 2 blank lines, found 1
configure.py:178:5: E301 expected 1 blank line, found 0
configure.py:183:5: E301 expected 1 blank line, found 0
configure.py:185:5: E301 expected 1 blank line, found 0
configure.py:187:5: E301 expected 1 blank line, found 0
configure.py:189:5: E301 expected 1 blank line, found 0
configure.py:192:1: E305 expected 2 blank lines after class or function definition, found 1
configure.py:329:5: E123 closing bracket does not match indentation of opening bracket's line
configure.py:335:5: E123 closing bracket does not match indentation of opening bracket's line
configure.py:340:41: E251 unexpected spaces around keyword / parameter equals
configure.py:340:43: E251 unexpected spaces around keyword / parameter equals
configure.py:340:60: E251 unexpected spaces around keyword / parameter equals
configure.py:340:62: E251 unexpected spaces around keyword / parameter equals
configure.py:340:85: E251 unexpected spaces around keyword / parameter equals
configure.py:340:87: E251 unexpected spaces around keyword / parameter equals
configure.py:341:30: E251 unexpected spaces around keyword / parameter equals
configure.py:341:32: E251 unexpected spaces around keyword / parameter equals
configure.py:342:29: E251 unexpected spaces around keyword / parameter equals
configure.py:342:31: E251 unexpected spaces around keyword / parameter equals
configure.py:343:38: E251 unexpected spaces around keyword / parameter equals
configure.py:343:40: E251 unexpected spaces around keyword / parameter equals
configure.py:343:54: E251 unexpected spaces around keyword / parameter equals
configure.py:343:56: E251 unexpected spaces around keyword / parameter equals
configure.py:344:29: E251 unexpected spaces around keyword / parameter equals
configure.py:344:31: E251 unexpected spaces around keyword / parameter equals
configure.py:345:37: E251 unexpected spaces around keyword / parameter equals
configure.py:345:39: E251 unexpected spaces around keyword / parameter equals
configure.py:345:52: E251 unexpected spaces around keyword / parameter equals
configure.py:345:54: E251 unexpected spaces around keyword / parameter equals
configure.py:346:29: E251 unexpected spaces around keyword / parameter equals
configure.py:346:31: E251 unexpected spaces around keyword / parameter equals
configure.py:349:43: E251 unexpected spaces around keyword / parameter equals
configure.py:349:45: E251 unexpected spaces around keyword / parameter equals
configure.py:349:59: E251 unexpected spaces around keyword / parameter equals
configure.py:349:61: E251 unexpected spaces around keyword / parameter equals
configure.py:349:84: E251 unexpected spaces around keyword / parameter equals
configure.py:349:86: E251 unexpected spaces around keyword / parameter equals
configure.py:350:29: E251 unexpected spaces around keyword / parameter equals
configure.py:350:31: E251 unexpected spaces around keyword / parameter equals
configure.py:351:44: E251 unexpected spaces around keyword / parameter equals
configure.py:351:46: E251 unexpected spaces around keyword / parameter equals
configure.py:351:60: E251 unexpected spaces around keyword / parameter equals
configure.py:351:62: E251 unexpected spaces around keyword / parameter equals
configure.py:351:86: E251 unexpected spaces around keyword / parameter equals
configure.py:351:88: E251 unexpected spaces around keyword / parameter equals
configure.py:352:29: E251 unexpected spaces around keyword / parameter equals
configure.py:352:31: E251 unexpected spaces around keyword / parameter equals
configure.py:353:43: E251 unexpected spaces around keyword / parameter equals
configure.py:353:45: E251 unexpected spaces around keyword / parameter equals
configure.py:353:59: E251 unexpected spaces around keyword / parameter equals
configure.py:353:61: E251 unexpected spaces around keyword / parameter equals
configure.py:353:79: E251 unexpected spaces around keyword / parameter equals
configure.py:353:81: E251 unexpected spaces around keyword / parameter equals
configure.py:354:29: E251 unexpected spaces around keyword / parameter equals
configure.py:354:31: E251 unexpected spaces around keyword / parameter equals
configure.py:355:45: E251 unexpected spaces around keyword / parameter equals
configure.py:355:47: E251 unexpected spaces around keyword / parameter equals
configure.py:355:61: E251 unexpected spaces around keyword / parameter equals
configure.py:355:63: E251 unexpected spaces around keyword / parameter equals
configure.py:355:78: E251 unexpected spaces around keyword / parameter equals
configure.py:355:80: E251 unexpected spaces around keyword / parameter equals
configure.py:356:29: E251 unexpected spaces around keyword / parameter equals
configure.py:356:31: E251 unexpected spaces around keyword / parameter equals
configure.py:359:45: E251 unexpected spaces around keyword / parameter equals
configure.py:359:47: E251 unexpected spaces around keyword / parameter equals
configure.py:359:61: E251 unexpected spaces around keyword / parameter equals
configure.py:359:63: E251 unexpected spaces around keyword / parameter equals
configure.py:359:83: E251 unexpected spaces around keyword / parameter equals
configure.py:359:85: E251 unexpected spaces around keyword / parameter equals
configure.py:360:29: E251 unexpected spaces around keyword / parameter equals
configure.py:360:31: E251 unexpected spaces around keyword / parameter equals
configure.py:361:48: E251 unexpected spaces around keyword / parameter equals
configure.py:361:50: E251 unexpected spaces around keyword / parameter equals
configure.py:361:69: E251 unexpected spaces around keyword / parameter equals
configure.py:361:71: E251 unexpected spaces around keyword / parameter equals
configure.py:361:87: E251 unexpected spaces around keyword / parameter equals
configure.py:361:89: E251 unexpected spaces around keyword / parameter equals
configure.py:362:29: E251 unexpected spaces around keyword / parameter equals
configure.py:362:31: E251 unexpected spaces around keyword / parameter equals
configure.py:363:48: E251 unexpected spaces around keyword / parameter equals
configure.py:363:50: E251 unexpected spaces around keyword / parameter equals
configure.py:363:64: E251 unexpected spaces around keyword / parameter equals
configure.py:363:66: E251 unexpected spaces around keyword / parameter equals
configure.py:363:89: E251 unexpected spaces around keyword / parameter equals
configure.py:363:91: E251 unexpected spaces around keyword / parameter equals
configure.py:364:29: E251 unexpected spaces around keyword / parameter equals
configure.py:364:31: E251 unexpected spaces around keyword / parameter equals
configure.py:365:46: E251 unexpected spaces around keyword / parameter equals
configure.py:365:48: E251 unexpected spaces around keyword / parameter equals
configure.py:365:62: E251 unexpected spaces around keyword / parameter equals
configure.py:365:64: E251 unexpected spaces around keyword / parameter equals
configure.py:365:82: E251 unexpected spaces around keyword / parameter equals
configure.py:365:84: E251 unexpected spaces around keyword / parameter equals
configure.py:365:97: E251 unexpected spaces around keyword / parameter equals
configure.py:365:99: E251 unexpected spaces around keyword / parameter equals
configure.py:366:29: E251 unexpected spaces around keyword / parameter equals
configure.py:366:31: E251 unexpected spaces around keyword / parameter equals
configure.py:367:48: E251 unexpected spaces around keyword / parameter equals
configure.py:367:50: E251 unexpected spaces around keyword / parameter equals
configure.py:367:70: E251 unexpected spaces around keyword / parameter equals
configure.py:367:72: E251 unexpected spaces around keyword / parameter equals
configure.py:368:1: E101 indentation contains mixed spaces and tabs
configure.py:368:1: W191 indentation contains tabs
configure.py:368:4: E128 continuation line under-indented for visual indent
configure.py:368:8: E251 unexpected spaces around keyword / parameter equals
configure.py:368:10: E251 unexpected spaces around keyword / parameter equals
configure.py:369:48: E251 unexpected spaces around keyword / parameter equals
configure.py:369:50: E251 unexpected spaces around keyword / parameter equals
configure.py:369:73: E251 unexpected spaces around keyword / parameter equals
configure.py:369:75: E251 unexpected spaces around keyword / parameter equals
configure.py:370:1: E101 indentation contains mixed spaces and tabs
configure.py:370:13: E128 continuation line under-indented for visual indent
configure.py:370:17: E251 unexpected spaces around keyword / parameter equals
configure.py:370:19: E251 unexpected spaces around keyword / parameter equals
configure.py:371:47: E251 unexpected spaces around keyword / parameter equals
configure.py:371:49: E251 unexpected spaces around keyword / parameter equals
configure.py:371:71: E251 unexpected spaces around keyword / parameter equals
configure.py:371:73: E251 unexpected spaces around keyword / parameter equals
configure.py:372:13: E128 continuation line under-indented for visual indent
configure.py:372:17: E251 unexpected spaces around keyword / parameter equals
configure.py:372:19: E251 unexpected spaces around keyword / parameter equals
configure.py:373:50: E251 unexpected spaces around keyword / parameter equals
configure.py:373:52: E251 unexpected spaces around keyword / parameter equals
configure.py:373:76: E251 unexpected spaces around keyword / parameter equals
configure.py:373:78: E251 unexpected spaces around keyword / parameter equals
configure.py:374:13: E128 continuation line under-indented for visual indent
configure.py:374:17: E251 unexpected spaces around keyword / parameter equals
configure.py:374:19: E251 unexpected spaces around keyword / parameter equals
configure.py:375:52: E251 unexpected spaces around keyword / parameter equals
configure.py:375:54: E251 unexpected spaces around keyword / parameter equals
configure.py:375:68: E251 unexpected spaces around keyword / parameter equals
configure.py:375:70: E251 unexpected spaces around keyword / parameter equals
configure.py:375:94: E251 unexpected spaces around keyword / parameter equals
configure.py:375:96: E251 unexpected spaces around keyword / parameter equals
configure.py:375:109: E251 unexpected spaces around keyword / parameter equals
configure.py:375:111: E251 unexpected spaces around keyword / parameter equals
configure.py:376:29: E251 unexpected spaces around keyword / parameter equals
configure.py:376:31: E251 unexpected spaces around keyword / parameter equals
configure.py:377:43: E251 unexpected spaces around keyword / parameter equals
configure.py:377:45: E251 unexpected spaces around keyword / parameter equals
configure.py:377:59: E251 unexpected spaces around keyword / parameter equals
configure.py:377:61: E251 unexpected spaces around keyword / parameter equals
configure.py:377:79: E251 unexpected spaces around keyword / parameter equals
configure.py:377:81: E251 unexpected spaces around keyword / parameter equals
configure.py:378:29: E251 unexpected spaces around keyword / parameter equals
configure.py:378:31: E251 unexpected spaces around keyword / parameter equals
configure.py:379:30: E251 unexpected spaces around keyword / parameter equals
configure.py:379:32: E251 unexpected spaces around keyword / parameter equals
configure.py:379:46: E251 unexpected spaces around keyword / parameter equals
configure.py:379:48: E251 unexpected spaces around keyword / parameter equals
configure.py:379:62: E251 unexpected spaces around keyword / parameter equals
configure.py:379:64: E251 unexpected spaces around keyword / parameter equals
configure.py:380:30: E251 unexpected spaces around keyword / parameter equals
configure.py:380:32: E251 unexpected spaces around keyword / parameter equals
configure.py:380:44: E251 unexpected spaces around keyword / parameter equals
configure.py:380:46: E251 unexpected spaces around keyword / parameter equals
configure.py:380:58: E251 unexpected spaces around keyword / parameter equals
configure.py:380:60: E251 unexpected spaces around keyword / parameter equals
configure.py:395:36: E251 unexpected spaces around keyword / parameter equals
configure.py:395:38: E251 unexpected spaces around keyword / parameter equals
configure.py:395:76: E251 unexpected spaces around keyword / parameter equals
configure.py:395:78: E251 unexpected spaces around keyword / parameter equals
configure.py:398:18: E127 continuation line over-indented for visual indent
configure.py:424:32: W291 trailing whitespace
configure.py:649:18: E124 closing bracket does not match visual indentation
configure.py:650:17: E127 continuation line over-indented for visual indent
configure.py:650:17: W503 line break before binary operator
configure.py:651:17: W503 line break before binary operator
configure.py:652:17: E124 closing bracket does not match visual indentation
configure.py:784:8: E713 test for membership should be 'not in'
configure.py:790:45: W291 trailing whitespace
configure.py:819:32: E261 at least two spaces before inline comment
configure.py:832:5: E123 closing bracket does not match indentation of opening bracket's line
configure.py:836:35: E251 unexpected spaces around keyword / parameter equals
configure.py:836:37: E251 unexpected spaces around keyword / parameter equals
configure.py:836:49: E251 unexpected spaces around keyword / parameter equals
configure.py:836:51: E251 unexpected spaces around keyword / parameter equals
configure.py:845:45: E251 unexpected spaces around keyword / parameter equals
configure.py:845:47: E251 unexpected spaces around keyword / parameter equals
configure.py:845:59: E251 unexpected spaces around keyword / parameter equals
configure.py:845:61: E251 unexpected spaces around keyword / parameter equals
configure.py:848:43: E251 unexpected spaces around keyword / parameter equals
configure.py:848:45: E251 unexpected spaces around keyword / parameter equals
configure.py:869:1: E302 expected 2 blank lines, found 1
configure.py:879:1: E305 expected 2 blank lines after class or function definition, found 1
configure.py:965:118: E225 missing whitespace around operator
configure.py:967:18: E124 closing bracket does not match visual indentation
configure.py:969:27: F821 undefined name 'python'
configure.py:969:73: E251 unexpected spaces around keyword / parameter equals
configure.py:969:75: E251 unexpected spaces around keyword / parameter equals
configure.py:976:7: E201 whitespace after '{'
configure.py:976:12: E203 whitespace before ':'
configure.py:976:73: E202 whitespace before '}'
configure.py:981:58: E251 unexpected spaces around keyword / parameter equals
configure.py:981:60: E251 unexpected spaces around keyword / parameter equals
configure.py:987:10: E222 multiple spaces after operator
configure.py:1001:17: E124 closing bracket does not match visual indentation
configure.py:1026:29: E251 unexpected spaces around keyword / parameter equals
configure.py:1026:31: E251 unexpected spaces around keyword / parameter equals
configure.py:1100:82: W291 trailing whitespace
configure.py:1110:29: E251 unexpected spaces around keyword / parameter equals
configure.py:1110:31: E251 unexpected spaces around keyword / parameter equals
configure.py:1110:49: E251 unexpected spaces around keyword / parameter equals
configure.py:1110:51: E251 unexpected spaces around keyword / parameter equals
configure.py:1111:64: E251 unexpected spaces around keyword / parameter equals
configure.py:1111:66: E251 unexpected spaces around keyword / parameter equals
configure.py:1112:13: E128 continuation line under-indented for visual indent
configure.py:1112:22: E251 unexpected spaces around keyword / parameter equals
configure.py:1112:24: E251 unexpected spaces around keyword / parameter equals
configure.py:1140:106: W291 trailing whitespace
configure.py:1149:86: E127 continuation line over-indented for visual indent
configure.py:1191:116: E251 unexpected spaces around keyword / parameter equals
configure.py:1191:118: E251 unexpected spaces around keyword / parameter equals
configure.py:1191:139: E251 unexpected spaces around keyword / parameter equals
configure.py:1191:141: E251 unexpected spaces around keyword / parameter equals
configure.py:1197:83: E231 missing whitespace after ','
configure.py:1200:76: E231 missing whitespace after ','
configure.py:1215:99: W291 trailing whitespace
configure.py:1242:31: E251 unexpected spaces around keyword / parameter equals
configure.py:1242:33: E251 unexpected spaces around keyword / parameter equals

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180917155438.12410-1-ultrabug@gentoo.org>
2018-09-18 13:49:23 +03:00
Avi Kivity
e5e59ea9cf Merge "More SSTables 3.x write tests enriched with read after write." from Vladimir
"
Some of the write tests were missing the read after write validation
which has now been added for better coverage.

Tests: unit {release}
"

* 'projects/sstables-30/more-enriched-tests/v1' of https://github.com/argenet/scylla:
  tests: Enrich test_write_adjacent_range_tombstones_with_rows with read after write
  tests: Enrich test_write_many_range_tombstones with read after write
  tests: Enrich test_write_mixed_rows_and_range_tombstones with read after write
  tests: Enrich test_write_non_adjacent_range_tombstones with read after write
  tests: Enrich test_write_adjacent_range_tombstones with read after write
  tests: Enrich test_write_simple_range_tombstone with read after write.
  tests: Enrich test_write_deleted_column with read after write.
2018-09-18 13:45:52 +03:00
Paweł Dziepak
e464ad4f5d utils: fragmented_temporary_buffer: add empty() and size_bytes() 2018-09-18 11:29:37 +01:00
Paweł Dziepak
f4bb219a8b utils: fragmented_temporary_buffer: add get_ostream() 2018-09-18 11:29:37 +01:00
Paweł Dziepak
196c5a5eee idl: serializer: don't assume Iterator::value_type is bytes_view 2018-09-18 11:29:36 +01:00
Paweł Dziepak
953942b256 idl: serializer: create buffer view from streams 2018-09-18 11:29:36 +01:00
Paweł Dziepak
252cf0c681 utils: crc: accept FragmentRange 2018-09-18 11:29:36 +01:00
Avi Kivity
9d90ba470b Merge "Fix deleted counters handling in SSTables 3.x" from Vladimir
"
This patchset fixes the bug in SSTables 3.x parser that did not properly
handle deleted counter cells.

A write test is enriched to validate read after write so that this case
is covered.

Tests: unit {release}
"

* 'projects/sstables-30/fix-deleted-counters-read/v1' of https://github.com/argenet/scylla:
  tests: Read after write in test_write_counter_table.
  sstables: Fix deleted counter cells processing in SSTables 3.x parser.
2018-09-18 12:20:54 +03:00
Vladimir Krivopalov
8c08ccbd3b tests: Enrich test_write_adjacent_range_tombstones_with_rows with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:06:24 -07:00
Vladimir Krivopalov
f0966a935e tests: Enrich test_write_many_range_tombstones with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:06:10 -07:00
Vladimir Krivopalov
262874a90c tests: Enrich test_write_mixed_rows_and_range_tombstones with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:05:56 -07:00
Vladimir Krivopalov
6fbf4d3589 tests: Enrich test_write_non_adjacent_range_tombstones with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:05:42 -07:00
Vladimir Krivopalov
4bf9c87a1a tests: Enrich test_write_adjacent_range_tombstones with read after write
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:05:26 -07:00
Vladimir Krivopalov
5b087daf91 tests: Enrich test_write_simple_range_tombstone with read after write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:04:57 -07:00
Vladimir Krivopalov
e63d960b8e tests: Enrich test_write_deleted_column with read after write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 11:04:25 -07:00
Eliran Sinvani
83628f5881 cql3: maintain correctness of multicolumn restriction on mixed order columns
When a query with multicolumn inequality is issued on clustering columns
having mixed order (ASC and DESC together), if the ranges are not
broken into non-overlapping, lexicographically monotonic ones, the node
returns incorrect rows. This is due to the prefix-comparison nature of
the search. The solution is to break the range imposed by the
restriction into several single-column restrictions OR-ed together,
which are logically equivalent and preserve the monotonicity
assumption. This commit also fixes incorrect results returned by a
multicolumn query on all-descending columns.

A unit test has been added to account for both fixed issues.
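
The effect of per-column ordering on comparison can be sketched as
follows (illustrative code, not the Scylla implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: clustering rows are ordered column by column,
// and each column may be ASC or DESC. With mixed orders, a multicolumn
// inequality such as (a, b) > (1, 5) does not describe one contiguous
// slice of this order, which is why the restriction must be decomposed
// into per-column ranges OR-ed together, e.g. (a > 1) OR (a = 1 AND
// b > 5), with each comparison interpreted under its column's order.
bool row_less(const std::vector<int>& x, const std::vector<int>& y,
              const std::vector<bool>& desc) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] == y[i]) {
            continue;
        }
        bool lt = x[i] < y[i];
        return desc[i] ? !lt : lt; // flip the comparison for DESC columns
    }
    return false; // rows are equal
}
```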

Fixes #2050
Tests: Unit test, manual tests of the use case in the issue.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <3b96620a3bd8b0614359a3b0757f324d45189dbb.1536478193.git.eliransin@scylladb.com>
2018-09-17 20:35:55 +03:00
Vladimir Krivopalov
e796fa2b02 tests: Read after write in test_write_counter_table.
This covers the case of deleted counter cells.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 10:11:48 -07:00
Vladimir Krivopalov
79ccce147c sstables: Fix deleted counter cells processing in SSTables 3.x parser.
Deleted counter cells should be processed the same way as regular
deleted cells.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-17 10:10:57 -07:00
Alexys Jacob
cd74dfebfb scripts: coding style fixes
scripts/create-relocatable-package.py:24:1: F401 'shutil' imported but unused
scripts/create-relocatable-package.py:24:1: F401 'tempfile' imported but unused
scripts/create-relocatable-package.py:24:16: E401 multiple imports on one line
scripts/create-relocatable-package.py:26:1: E302 expected 2 blank lines, found 1
scripts/create-relocatable-package.py:47:1: E305 expected 2 blank lines after class or function definition, found 1
scripts/create-relocatable-package.py:93:6: E225 missing whitespace around operator

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180917152520.5032-1-ultrabug@gentoo.org>
2018-09-17 18:40:23 +03:00
Alexys Jacob
c80d7b97cc scyllatop: more coding style fixes
tools/scyllatop/metric.py:2:1: F401 're' imported but unused
tools/scyllatop/metric.py:53:20: E221 multiple spaces before operator
tools/scyllatop/metric.py:69:20: E221 multiple spaces before operator

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180917153308.7240-1-ultrabug@gentoo.org>
2018-09-17 18:39:53 +03:00
Raphael S. Carvalho
5bc028f78b database: fix 2x increase in disk usage during cleanup compaction
Don't hold references to cleaned-up sstables, so that the file
descriptors for their index and data files are closed and,
consequently, disk space is released.
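
The underlying behaviour can be sketched like this (a hypothetical
`sketch_sstable` type standing in for the real sstable object):

```cpp
#include <cassert>
#include <memory>

// Illustrative sketch: as long as any shared reference to an sstable
// object is held, its destructor does not run and the underlying files
// stay open. Dropping the last reference releases the descriptors (and
// with them the disk space of already-deleted files).
struct sketch_sstable {
    bool* closed;
    explicit sketch_sstable(bool* c) : closed(c) {}
    ~sketch_sstable() { *closed = true; } // stands in for closing index/data files
};
```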

Fixes #3735.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180914194047.26288-1-raphaelsc@scylladb.com>
2018-09-17 17:26:46 +03:00
Alexys Jacob
46d101c1f2 scyllatop: coding style fixes
tools/scyllatop/prometheus.py:3:1: F401 'sys' imported but unused
tools/scyllatop/prometheus.py:7:1: E302 expected 2 blank lines, found 1
tools/scyllatop/prometheus.py:12:5: E301 expected 1 blank line, found 0
tools/scyllatop/prometheus.py:17:1: W293 blank line contains whitespace
tools/scyllatop/prometheus.py:22:82: E225 missing whitespace around operator

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180914110847.1862-1-ultrabug@gentoo.org>
2018-09-17 15:45:43 +03:00
Botond Dénes
a84c26799d tests/mutation_reader_test: fix flaky restricted reader timeout test
The test in question is `restricted_reader_timeout`.

Use `eventually_true()` instead of `sleep()` to wait for the timeout
to expire, making the test more robust on overloaded machines.

Also fix graceful failing, another longstanding issue with this test.
The readers created for the test need different destruction logic
depending on whether the test failed or succeeded. Previously this was
dealt with by using the logic that worked in the success case and using
asserts to abort when the test failed, thus sparing developers from
investigating the invalid memory accesses that would otherwise happen
due to the wrong destruction logic.
The solution is to use the BOOST_CHECK() macro in the check that validates
whether the timeout works as expected. This allows execution to continue
even if the test failed, and thus allows the proper cleanup code to run
even when the test failed.

Fixes: #3719
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <911921dffc924f1b0a3e86408757467e9be2b65b.1537169933.git.bdenes@scylladb.com>
2018-09-17 09:40:45 +01:00
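The polling approach described above can be sketched in a few lines (a Python model for illustration only; the real `eventually_true()` is a C++ test helper, and the names here are hypothetical):

```python
import time

def eventually_true(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it holds or `timeout` elapses.

    Unlike a fixed sleep(), this tolerates overloaded machines: it
    returns as soon as the condition becomes true, and only gives up
    after the full timeout budget is exhausted."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one final check at the deadline

# Example: wait for a flag flipped by an earlier step.
state = {"expired": False}
state["expired"] = True
assert eventually_true(lambda: state["expired"])
```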
Nadav Har'El
0006e21c4d tests/view_complex_test: add missing timestamp
test_partial_delete_selected_column() does a long string of various
updates and deletes, each specifying a different timestamp. In one
of these updates, the timestamp was forgotten. This means that the
server picks the current time, a large number.

As the test is currently written, it doesn't matter which timestamp
was chosen; the test would still succeed (as long as timestamp >= 15,
and it must be, since the chosen timestamp is the time since the epoch).
But the intention was probably to use timestamp = 15, so let's make
this intention clear.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180905095552.11883-2-nyh@scylladb.com>
2018-09-17 00:38:55 +01:00
Nadav Har'El
2ae4ed151e tests/view_complex_test - add test passpoints
We recently saw a failure in test_partial_delete_selected_column() but
this is a very long test doing many operations and comparisons of their
results, and without BOOST_TEST_PASSPOINT() we can't know which of them
really failed.

So let's sprinkle BOOST_TEST_PASSPOINT() calls between the different parts
of test_partial_delete_selected_column(). If this test ever fails again,
we'll know where.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180905095552.11883-1-nyh@scylladb.com>
2018-09-17 00:38:55 +01:00
Jesse Haber-Kucharsky
9d27045c76 auth: Shorten random_device instance life-span
On Fedora 28, creating an instance of `std::random_device` opens a file
descriptor for `/dev/urandom` (observed via `strace`).

When static thread-local instances of `std::random_device` are declared,
these descriptors remain open (barring optimization by the compiler)
for the entire duration of the Scylla process's life.

However, the `std::random_device` instance is only necessary for
initializing the `RandomNumberEngine` for generating salts. With this
change, the file-descriptor is closed immediately after the engine is
initialized.

I considered generalizing this pattern of initialization into a
function, but with only two uses (and simple ones) I think this would
only obscure things.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Tests: unit (release)
Message-Id: <f1b985d99f66e5e64d714fd0f087e235b71557d2.1536697368.git.jhaberku@scylladb.com>
2018-09-12 12:14:21 +01:00
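The pattern of scoping the entropy source to initialization only can be modeled as follows (a Python sketch under the assumption that the entropy source is `/dev/urandom`, as observed above; not Scylla's actual code):

```python
import random

def make_engine():
    """Seed the engine from a short-lived handle on the entropy source;
    the descriptor is closed as soon as seeding completes, instead of
    staying open for the life of the process (sketch of the pattern,
    names are hypothetical)."""
    with open("/dev/urandom", "rb") as rd:   # fd closed on scope exit
        seed = int.from_bytes(rd.read(8), "little")
    return random.Random(seed)

eng = make_engine()
assert 0 <= eng.randrange(10) < 10
```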
Botond Dénes
dfad223ea2 multishard_mutation_reader: shard_reader: don't do concurrent read-aheads
multishard_mutation_reader starts read-aheads on the
shards-to-be-read-soon. When doing this it didn't check whether the
respective shards had an ongoing read-ahead already. This lead to a
single shard executing multiple concurrent read-aheads. This is damaging
for multiple reasons:
    * Can lead to concurrent access of the remote reader's data members.
    * The `shard_reader` was designed around a single read-ahead and
    thus will synchronise foreground reads with only the last one.

The practical implication seen so far was that queries reading
a large number of rows (large enough to reliably trigger the
bug) would stop the read early, due to the `combined_mutation_reader`'s
internal accounting being messed up by concurrent access.

Also add a unit test. Instead of coming up with a very specific, and
very contrived unit test, use the test-case that detected this bug in
the first place: count(*) on a table with lots of rows (>1000). This
unit-test should serve well for detecting any similar bugs in the
future.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <ff1c49be64e2fb443f9aa8c5c8d235e682442248.1536746388.git.bdenes@scylladb.com>
2018-09-12 11:43:18 +01:00
Botond Dénes
6a07b8ae83 multishard_mutation_reader: update shard_reader's comment
The `abandoned` member was renamed to `stopped`. Update the comment
accordingly.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1d655785f28fe1e5fa041f2f49852f0ad88be53e.1536743950.git.bdenes@scylladb.com>
2018-09-12 11:32:08 +02:00
Botond Dénes
d9a2ffad84 mutation_partition: don't move tracing_state early
Currently the `trace_state` is moved into the `querier` object's
constructor when one has to be created. Since the trace_state is used
below this line, this had the effect that on the first page of the
query, when a querier object has to be created, tracing would not work
inside the `querier_cache`, which received a moved-from `trace_state`
(effectively a nullptr).
Change the move to a copy so the other half of the function doesn't use
a moved-from `trace_state`.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4987419781aa287141aa9dc8ce99c5068b564c84.1536739052.git.bdenes@scylladb.com>
2018-09-12 11:32:08 +02:00
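The moved-from-handle problem described above can be modeled in Python, where "moving" empties the source handle while copying keeps it usable (all names here are illustrative, not Scylla's API):

```python
class TraceState:
    """Stand-in for a shared tracing handle."""
    def __init__(self):
        self.events = []
    def trace(self, msg):
        self.events.append(msg)

def make_querier(trace_box):
    """Simulate C++ std::move: take the state out of the caller's
    handle, leaving the caller with an empty (null) handle."""
    state, trace_box[0] = trace_box[0], None
    return state

# Moving: everything after construction sees a null handle, so
# tracing on the first page of the query is silently lost.
box = [TraceState()]
querier_state = make_querier(box)
assert box[0] is None

# Copying the shared handle instead keeps both sides usable.
shared = TraceState()
querier_state = shared                 # copy the handle, don't move it
shared.trace("querier_cache: created querier")
assert querier_state.events == ["querier_cache: created querier"]
```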
Botond Dénes
49704755b0 combined_mutation_reader: propagate timeout in fill_buffer()
All user reads go through the combined reader. Not propagating the
timeout down from there means that the storage layer's timeout
functionality is effectively disabled. Spotted while reading the code.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7fc10eca1c231dd04ac433913d9e6a51b6b17139.1536657041.git.bdenes@scylladb.com>
2018-09-11 15:44:28 +02:00
Botond Dénes
99ab43a1cc flat_mutation_reader: add timeout parameter to operator()()
For consistency with fast_forward_to() and fill_buffer(), and for
correctness: operator()() calls fill_buffer() and thus should provide a
timeout for the storage layer.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <6e97552ac2372e5846c955d94400b5315dbd2a89.1536657041.git.bdenes@scylladb.com>
2018-09-11 15:44:12 +02:00
Tomasz Grabiec
eb321a0830 Merge "Enrich SSTables 3.x write tests with subsequent read" from Vladimir
As our support for reading SSTables 3.x rows is nearly complete, the
write tests can be extended to read data after write.
This patchset adds reading to a handful of write tests.

* https://github.com/argenet/scylla/tree/projects/sstables-30/enrich-write-tests/v6:
  tests: Factor out the helper building SSTables path for write tests.
  tests: Add validate_read() helper to use in SSTables 3.x write tests.
  tests: Preserve tmpdir in SSTables 3.x write tests upon comparison.
  tests: Read SSTables for write_static_row test after validating write.
  tests: Read SSTables for write_composite_partition_key test after
    validating write.
  tests: Read SSTables for write_composite_clustering_key test after
    validating write.
  tests: Read SSTables for write_wide_partitions test after validating
    write.
  tests: Read SSTables for write_ttled_column test after validating
    write.
  tests: Read SSTables for write_collection_wide_update test after
    validating write.
  tests: Read SSTables for write_collection_incremental_update test
    after validating write.
  tests: Read SSTables for write_missing_columns_large_set test after
    validating write.
  tests: Read SSTables for write_multiple_partitions test after
    validating write.
  tests: Read SSTables for write_multiple_rows test after validating
    write.
  tests: Read SSTables for write_different_types test after validating
    write.
  tests: Read SSTables for write_empty_clustering_values test after
    validating write.
  tests: Read SSTables for write_large_clustering_keys test after
    validating write.
  tests: Read SSTables for write_user_defined_type_table test after
    validating write.
  tests: Read SSTables for write_deleted_row test after validating
    write.
  sstables: Fix SSTables 3.x parsing: check use_row_ttl() for TTLed
    columns.
  tests: Read SSTables for write_ttled_row test after validating write.
  Read SSTables for write_compact_table test after validating write.
  tests: Read SSTables for tests of many partitions after validating
    write.
2018-09-11 15:42:43 +02:00
Duarte Nunes
3f0643f34f Merge 'Misc improvements to stateful range scans' from Botond
"
This series contains miscellaneous improvements to the stateful range
scans. These improvements are things that I forgot to include in
the original series (tracing), that were requested by other developers
(comments), or that I discovered while reading the code (lockup and
cleanup).
"

* 'multishard_mutation_query_fixes/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: add some tracing
  multishard_mutation_query: add comment to `read_context`
  multishard_mutation_query: always cleanup readers properly
  multishard_mutation_query: fix possible deadlock when creating a reader fails
2018-09-11 10:26:05 +01:00
Botond Dénes
7d71b42651 multishard_mutation_query: add some tracing
Add tracing for the following events:
1) Dismantling of the combined buffer.
2) Dismantling of the compaction state.
3) Cleaning up the readers.

(1) and (2) can possibly have adverse effects on the performance of the
query and hence it is important that details about the dismantled
fragments are exposed in the tracing data.
(3) is less critical, but it is still good to know how many readers were
created by the read (in case they aren't saved). Since normally (in
stateful queries) this will always be 0, only trace this when it is
non-zero (and thus interesting).
2018-09-11 08:18:16 +03:00
Botond Dénes
b41be7c8e5 multishard_mutation_query: add comment to read_context
Explain the purpose of the class and its intended usage and any gotchas
the reader/modifier of the code has to keep in mind.
2018-09-11 08:18:16 +03:00
Botond Dénes
b6e1a8f32d multishard_mutation_query: always cleanup readers properly
Currently the reader cleanup code, which ensures that the readers and
their dependent objects are destroyed in the correct order and in a
single smp::submit_to() message, is only run when an attempt is made to
save the readers. However proper cleanup is needed not only then, but also when
the query is not stateful. Rename the current `cleanup()` method to
`stop()`, make it public and call it from a `finally()` block after the
page is finalized to ensure readers are properly cleaned up at all
times.
Also make sure that failures in `stop()` are never propagated so that
a failure in the cleanup doesn't fail the read itself.
2018-09-11 08:18:16 +03:00
Vladimir Krivopalov
c4a4ef6e3c tests: Read SSTables for tests of many partitions after validating write.
This covers five tests, including three for compressed tables:
  - write_many_partitions_deflate
  - write_many_partitions_lz4
  - write_many_partitions_snappy
  - write_many_live_partitions
  - write_many_deleted_partitions

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
f1214bfceb Read SSTables for write_compact_table test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
a39638c0ba tests: Read SSTables for write_ttled_row test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
bcae761d72 sstables: Fix SSTables 3.x parsing: check use_row_ttl() for TTLed columns.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
9b55f06456 tests: Read SSTables for write_deleted_row test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
8869f1a591 tests: Read SSTables for write_user_defined_type_table test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
dae49358d8 tests: Read SSTables for write_large_clustering_keys test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
8c2bc4a16a tests: Read SSTables for write_empty_clustering_values test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
6f23446962 tests: Read SSTables for write_different_types test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
4865f2f5a3 tests: Read SSTables for write_multiple_rows test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
3594b887df tests: Read SSTables for write_multiple_partitions test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
eee775dab7 tests: Read SSTables for write_missing_columns_large_set test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
2d764da415 tests: Read SSTables for write_collection_incremental_update test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
88a3b05210 tests: Read SSTables for write_collection_wide_update test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
abdae2dd9e tests: Read SSTables for write_ttled_column test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
cdf148dc67 tests: Read SSTables for write_wide_partitions test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
5b1a4686eb tests: Read SSTables for write_composite_clustering_key test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
e908d07fe7 tests: Read SSTables for write_composite_partition_key test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
aa5dc16dbb tests: Read SSTables for write_static_row test after validating write.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
42ab8ed3cd tests: Preserve tmpdir in SSTables 3.x write tests upon comparison.
It can be used to do other checks on written files, like reading them
back.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
bc16304e99 tests: Add validate_read() helper to use in SSTables 3.x write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Vladimir Krivopalov
6cddd7500a tests: Factor out the helper building SSTables path for write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-10 17:28:48 -07:00
Botond Dénes
b3f1fe14e8 multishard_mutation_query: fix possible deadlock when creating a reader fails
Failing to create a reader (`do_make_remote_reader()`) can lead to a
deadlock if the reader is in any of the future_*_state states, as the
`then()` block is not executed and hence the promise of the first
future in the chain is not set. Avoid this by changing the `then()` to a
`then_wrapped()` and using `set_exception()` and `set_value()`
accordingly, such that the future is resolved on both the happy and
error path.
2018-09-10 16:41:13 +03:00
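The fix's core idea, resolving the promise on both the happy and the error path, can be sketched with Python futures (an illustrative model, not the Seastar API; all names are hypothetical):

```python
from concurrent.futures import Future

def make_reader(fail):
    if fail:
        raise RuntimeError("could not create reader")
    return "reader"

def create_reader_then_wrapped(fail):
    """Resolve the promise on both paths, mirroring the
    then() -> then_wrapped() change: a bare then() continuation never
    runs on failure, leaving the promise unset and any waiter on it
    deadlocked."""
    promise = Future()
    try:
        promise.set_result(make_reader(fail))   # happy path
    except RuntimeError as e:
        promise.set_exception(e)                # crucial error path
    return promise

assert create_reader_then_wrapped(False).result() == "reader"
f = create_reader_then_wrapped(True)
assert isinstance(f.exception(), RuntimeError)  # resolved, not hung
```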
Avi Kivity
4553238653 messaging: fix unbounded allocation in TLS RPC server
The non-TLS RPC server has an rpc::resource_limits configuration that limits
its memory consumption, but the TLS server does not. That means a many-node
TLS configuration can OOM if all nodes gang up on a single replica.

Fix by passing the limits to the TLS server too.

Fixes #3757.
Message-Id: <20180907192607.19802-1-avi@scylladb.com>
2018-09-10 12:11:16 +01:00
Gleb Natapov
9e438933a2 mutation_query_test: add test for result size calculation
Check that digest only and digest+data query calculate result size to be
the same.

Message-Id: <20180906153800.GK2326@scylladb.com>
2018-09-06 20:54:57 +03:00
Gleb Natapov
d7674288a9 mutation_partition: accurately account for result size in digest only queries
When measuring_output_stream is used to calculate result's element size
it incorrectly takes into account not only serialized element size, but
a placeholder that ser::qr_partition__rows/qr_partition__static_row__cells
constructors puts in the beginning. Fix it by taking starting point in a
stream before element serialization and subtracting it afterwords.

Fixes #3755

Message-Id: <20180906153609.GJ2326@scylladb.com>
2018-09-06 20:52:44 +03:00
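The size-accounting fix, taking the stream position before serialization and subtracting it afterwards, can be sketched like this (a Python model; `serialized_size` is a hypothetical name):

```python
import io

def serialized_size(stream, serialize):
    """Measure only the bytes written by `serialize`, by recording the
    stream position before and subtracting it afterwards. This avoids
    counting placeholder bytes already present in the stream."""
    start = stream.tell()
    serialize(stream)
    return stream.tell() - start

buf = io.BytesIO()
buf.write(b"\x00\x00\x00\x00")    # 4-byte placeholder from a builder
size = serialized_size(buf, lambda s: s.write(b"row-bytes"))
assert size == 9                  # only the element, not the placeholder
```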
Takuya ASADA
2136479012 dist/debian: delete mounts.conf on scylla-server.postrm
Since we added mounts.conf on 687372bc48,
we need to delete the file when uninstalling the package.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180905204631.9265-1-syuu@scylladb.com>
2018-09-06 16:50:14 +03:00
Gleb Natapov
98092353df mutation_partition: correctly measure static row size when doing digest calculation
The code uses the incorrect output stream when only a digest is requested,
and thus gets an incorrect data size. Failing to correctly account
for the static row size while calculating the digest may cause a digest
mismatch between the digest and data queries.

Fixes #3753.

Message-Id: <20180905131219.GD2326@scylladb.com>
2018-09-06 13:09:41 +03:00
Takuya ASADA
ab361e9897 dist/redhat: add mounts.conf to ghost file
Since we added mounts.conf on 687372bc48,
we need to delete the file when uninstalling the package.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180905191037.1570-1-syuu@scylladb.com>
2018-09-05 22:14:48 +03:00
Jesse Haber-Kucharsky
682805b22c auth: Use finite time-out for all QUORUM reads
Commit e664f9b0c6 transitioned internal
CQL queries in the auth. sub-system to be executed with finite time-outs
instead of infinite ones.

It should have also modified the functions in `auth/roles-metadata.cc`
to have finite time-outs.

This change fixes some previously failing dtests, particularly around
repair. Without this change, the QUORUM query fails to terminate when
the necessary consistency level cannot be achieved.

Fixes #3736.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <e244dc3e731b4019f3be72c52a91f23ee4bb68d1.1536163859.git.jhaberku@scylladb.com>
2018-09-05 21:55:26 +03:00
Tomasz Grabiec
82270c8699 storage_proxy: Fix misqualification of reads as foreground or background in some cases
The foreground reads metric is derived from the number of live read
executors minus the number of background reads. Background reads are
counted down when their resolver times out. However, a read executor
may still be around for a while, resulting in such reads being
accounted as foreground.

Usually, the gap in which this happens is short, because executor
reference holders time out quickly as well. It's not always the case
though. For instance, local read executor doesn't time out quickly
when the target shard has an overloaded CPU, and it takes a while
before the request goes through all the queues, even if IO is not
involved. Observed in #3628.

Fixes #3734.

Another problem is that all reads which received CL responses are
accounted as background until all replicas respond, but if such a read
needs reconciliation, it's still practically a foreground read and
should be accounted as such. Found during code review.

Fixes #3745.

This patch fixes both issues by rearranging accounting to track
foreground reads instead of background reads, and considering all
reads as foreground until the resulting promise is resolved.

Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>
2018-09-05 20:42:51 +03:00
Avi Kivity
c168805ca6 Merge "Filtering and fast-forwarding of range tombstones in SSTables 3.x" from Vladimir
"
This patchset adds proper support for sliced reads of partitions
containing range tombstones.

Given the SSTables 3.x representation of range tombstones by separate
start and end markers, we refer to the index for the information about
the currently opened range tombstone, if any, when skipping to the next
promoted index block.

Note that for this we have to take the promoted index block immediately
preceding the one we are jumping to.

Tests: unit {release}
"

* 'projects/sstables-30/range-tombstones-slicing/v3' of https://github.com/argenet/scylla:
  tests: Test filtering and forwarding on a partition with interleaved rows and RTs.
  tests: Add tests for reading wide partitions with range tombstones.
  sstables: Support slicing for range tombstones.
  sstables: Set/reset range tombstone start from end open marker.
  sstables: Fix end_open_marker population in promoted index blocks.
  sstables: Add need_skip() helper to data_consume_context.
  sstables: For end_open_marker, return both position in partition and deletion time.
2018-09-05 20:38:39 +03:00
Vladimir Krivopalov
3d13ee3909 tests: Test filtering and forwarding on a partition with interleaved rows and RTs.
In this test, rows lie inside range tombstones so we split them on
reading.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
d39e58a97a tests: Add tests for reading wide partitions with range tombstones.
Test the case where rows lie outside range tombstones.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
ec2047e1e6 sstables: Support slicing for range tombstones.
Both filtering on queried ranges and fast-forwarding are supported.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
d57380f44c sstables: Set/reset range tombstone start from end open marker.
When we skip through a wide partition using promoted index, we may land
to a position that lies in the middle of a range tombstone so we need to
be aware of it. For this, we check if the previous promoted block has an
end open marker and either set the range tombstone start using it or
reset if missing.

Note several things about the implementation.

Firstly, we have to peek back at the previous promoted index block for the
end open marker, and so we have to always preserve one more promoted
index block when we read the next batch so that we can still access it.

Secondly, we use the previous promoted block end position to build
position in partition for the range tombstone start.

Lastly, we don't have a notion of end open marker in older consumers
that work with SSTables of ka/la formats so we only call the
corresponding methods if the consumer supports them.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
939e4893ef sstables: Fix end_open_marker population in promoted index blocks.
We should not access the internal object stored in the std::optional when
passing the end_open_marker, especially since it can be disengaged.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Vladimir Krivopalov
84bff86fbc sstables: Add need_skip() helper to data_consume_context.
This method tells whether we will need to skip to reach the input
position or not.
It can be used for skipping with the index when reading SSTables 3.x,
because we only want to set/reset the open range tombstone bound when we
actually move to another promoted index block.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-05 09:48:17 -07:00
Tomasz Grabiec
cd201d1987 db/batchlog_manager: Do not return a value from timer callback
Timer callbacks are std::function<void()>.

Exposed by changing callback_t to noncopyable_function<>.

Message-Id: <1536138045-29209-1-git-send-email-tgrabiec@scylladb.com>
2018-09-05 12:32:21 +03:00
Asias He
89b769a073 storage_service: Wait for range setup before announcing join status
When a joining node announces its join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible the joining node hasn't learnt the tokens of the other nodes
yet, which causes errors like the one below:

   token_metadata - sorted_tokens is empty in first_token_index!
   storage_proxy - Failed to apply mutation from 127.0.4.1#0:
   std::runtime_error (sorted_tokens is empty in first_token_index!)

To fix, wait for the token range setup before announcing the join
status.

Fixes: #3382
Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test

Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
2018-09-05 10:51:43 +03:00
Vlad Zolotarov
dae70e1166 tests: loading_cache_test: configure a validity timeout in test_loading_cache_loading_different_keys to a greater value
Change the validity timeout from 1s to 1h in order to avoid false alarms
on busy systems: with a short value there is a chance that the
(loading_cache.size() == num_loaders) check runs after some elements
of the cache have already been evicted.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180904193026.7304-1-vladz@scylladb.com>
2018-09-05 10:19:59 +03:00
Vladimir Krivopalov
ac0c71bdc1 sstables: For end_open_marker, return both position in partition and deletion time.
Prior to this fix, the end_open_marker was only accessible as a
plain deletion_time structure. Now it also contains the start position
of a promoted index block so that it can be used for setting range
tombstone open bound.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-04 18:16:21 -07:00
Piotr Sarna
f494d03c3f tests: add test case for filtering with DESC clustering order
Refs #3741

Message-Id: <1b8eab8d668eb000b306686c15324e6acde8e616.1535981852.git.sarna@scylladb.com>
2018-09-04 16:05:19 +03:00
Piotr Sarna
8e52b66516 cql3: fix filtering with descending clustering order
When slice::is_satisfied_by() restriction check is performed
on raw data represented as bytes, it should always use a regular
type comparator, not a reversed one. Reversed types are used to
preserve descending clustering order, but comparison with constants
should be used with a regular underlying type comparator (for x < 1
to actually mean 'lesser than 1' instead of 'bigger than 1, because
the clustering order is reversed').

Fixes #3741

Message-Id: <3e25fc66688c9253287f2c4f31ede8339b9bbe23.1535981852.git.sarna@scylladb.com>
2018-09-04 16:05:15 +03:00
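The comparator mix-up can be modeled with plain integers: filtering must use the regular order even though rows are stored in descending order (an illustrative Python sketch, not the CQL3 code):

```python
# Rows stored in DESC clustering order (the reversed type governs
# storage order only, not the meaning of a filtering restriction).
rows_storage_order = [3, 2, 1, 0]

def satisfied_by_regular(value, bound):
    return value < bound    # regular underlying-type comparison

def satisfied_by_reversed(value, bound):
    return value > bound    # what the reversed comparator would say

# Correct: `x < 1` evaluated with the regular comparator.
assert [x for x in rows_storage_order if satisfied_by_regular(x, 1)] == [0]
# Buggy: the reversed comparator returns values *greater* than 1.
assert [x for x in rows_storage_order if satisfied_by_reversed(x, 1)] == [3, 2]
```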
Piotr Sarna
5b5c9f2707 cql3: fix a 'pratition_key' typo
partition_key was misspelled as 'pratition_key' in the original
series.

Message-Id: <de59fe6161df5442b19d8ba4336e2f828b7ede32.1535981852.git.sarna@scylladb.com>
2018-09-04 16:05:09 +03:00
Takuya ASADA
bd8a5664b8 dist/common/scripts/scylla_raid_setup: create scylla-server.service.d when it doesn't exist
When /etc/systemd/system/scylla-server.service.d/capabilities.conf is
not installed, we don't have /etc/systemd/system/scylla-server.service.d/,
so we need to create it.

Fixes #3738

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180904015841.18433-1-syuu@scylladb.com>
2018-09-04 10:12:32 +03:00
Tomasz Grabiec
4fb3f7e8eb managed_vector: Make external_memory_usage() ignore reserved space
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.

This should hold so that the accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.

Fixes #3625 (hopefully).

Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
2018-09-03 17:09:54 +03:00
Takuya ASADA
d78762d627 dist/debian: fix broken debian/changelog
It also needs $MUSTACHE_DIST.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180903094558.3862-1-syuu@scylladb.com>
2018-09-03 14:04:01 +03:00
Duarte Nunes
e49a14e308 Merge 'Stateful range scans' from Botond
"
This series extends query statefulness, introduced for point queries by
f8613a841, to range scans as well. This means that queriers will be
saved and reused for range scans too.
This series builds heavily on the infrastructure introduced by stateful
point queries, namely the querier object and the querier_cache. It also
builds on another critical piece of infrastructure, the
multishard_combining_reader, introduced by 2d126a79b.
To make the range scan on a given node suspendable and resumable we move
away from the current code in
`storage_proxy::query_nonsingular_mutations_locally()` and use a
multishard_combining_reader to execute the read. When the page is filled
this reader is dismantled and its shard readers are saved in the
querier cache.
There are of course a lot more details to it but this is the gist of it.

Tests: unit(release, debug), dtest(paging_test.py, paging_additional_test.py)
"

* '1865/range-scans/v7.1' of https://github.com/denesb/scylla: (33 commits)
  query_pagers: generate query_uuid for range-scans as well
  storage_proxy: use preferred/last replicas
  storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent
  db::consistency_level::filter_for_query() add preferred_endpoints
  storage_proxy: use query_mutations_from_all_shards() for range scans
  tests: add unit test for multishard_mutation_query()
  tests/mutation_assertions.hh: add missing include
  multishard_mutation_query: add badness counters
  database: add query_mutations_on_all_shards()
  mutation_compactor: add detach_state()
  flat_mutation_reader: add unpop_mutation_fragment()
  Move reconcilable_result_builder declaration to mutation_query.hh
  mutation_source_test: add an additional REQUIRE()
  mutation: add missing assert to mutation from reader
  querier: add shard_mutation_querier
  querier: prepare for multi-ranges
  tests/querier_cache: add tests specific for multiple entry-types
  querier: split querier into separate data and mutation querier types
  querier: move consume_page logic into a free function
  querier: move all matching related logic into free functions
  ...
2018-09-03 09:09:17 +01:00
Botond Dénes
cd49c23a66 query_pagers: generate query_uuid for range-scans as well
And thus enable stateful range scans.
2018-09-03 10:31:44 +03:00
Botond Dénes
6486d6c8bd storage_proxy: use preferred/last replicas 2018-09-03 10:31:44 +03:00
Botond Dénes
577a06ce1b storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent 2018-09-03 10:31:44 +03:00
Botond Dénes
6e59cee244 db::consistency_level::filter_for_query() add preferred_endpoints
To the second overload (the one without read-repair related params) too.
2018-09-03 10:31:44 +03:00
Botond Dénes
2f66bde26f storage_proxy: use query_mutations_from_all_shards() for range scans 2018-09-03 10:31:44 +03:00
Botond Dénes
6779b63dfe tests: add unit test for multishard_mutation_query() 2018-09-03 10:31:44 +03:00
Botond Dénes
c678b665b4 tests/mutation_assertions.hh: add missing include 2018-09-03 10:31:44 +03:00
Botond Dénes
253407bdc8 multishard_mutation_query: add badness counters
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves

The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its buffer, and in the buffers
of any of its intermediate readers, are pushed back into the originating
shard reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. Together these can
amount to a substantial number of fragments.
(1) counts the number of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.

The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
2018-09-03 10:31:44 +03:00
Botond Dénes
97364c7ad9 database: add query_mutations_on_all_shards()
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, that match the
stop position of the last page).
2018-09-03 10:31:44 +03:00
Botond Dénes
33d72efa49 mutation_compactor: add detach_state()
Allow the state of the compaction to be detached. The detached state is
a set of mutation fragments, which if replayed through a new compactor
object will result in the latter being in the same state as the previous
one was.
This allows for storing the compaction state in the compacted reader by
using `unpop_mutation_fragment()` to push back the fragments that
comprise the detached state into the reader. This way, if a new
compaction object is created it can just consume the reader and continue
where the previous compaction left off.
2018-09-03 10:31:44 +03:00
Botond Dénes
48054ed810 flat_mutation_reader: add unpop_mutation_fragment()
This is the inverse of `pop_mutation_fragment()`. Allow fragments to be
pushed back into the buffer of the reader to undo a previous consumption
of the fragments.
2018-09-03 10:31:44 +03:00
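The push-back semantics described in the commit above can be sketched with a plain FIFO buffer. This is a minimal illustration only: `buffered_reader` and the string-valued fragment type are stand-ins, not the real `flat_mutation_reader` API.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical minimal model of a reader buffer: pop_mutation_fragment()
// consumes from the front, unpop_mutation_fragment() pushes a fragment
// back to the front so a later consumer sees it again.
class buffered_reader {
    std::deque<std::string> _buffer;
public:
    void push(std::string frag) { _buffer.push_back(std::move(frag)); }
    // Consume the next fragment.
    std::string pop_mutation_fragment() {
        auto f = std::move(_buffer.front());
        _buffer.pop_front();
        return f;
    }
    // Inverse of pop: undo a previous consumption of a fragment.
    void unpop_mutation_fragment(std::string frag) {
        _buffer.push_front(std::move(frag));
    }
    bool empty() const { return _buffer.empty(); }
};
```

A fragment popped and then unpopped is observed again on the next pop, which is what lets a dismantled multishard reader return unconsumed fragments to their originating shard readers.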
Botond Dénes
3bcd577907 Move reconcilable_result_builder declaration to mutation_query.hh
It will be used by code outside of mutation_partition.cc so it needs to
be public. The definition remains in mutation_partition.cc.
2018-09-03 10:31:44 +03:00
Botond Dénes
b8b34223a4 mutation_source_test: add an additional REQUIRE()
test_streamed_mutation_forwarding_is_consistent_with_slicing already has
a REQUIRE() for the mutation read with the slicing reader. Add another
one for the forwarding reader. This makes it more consistent and also
helps finding problems with either the forwarding or slicing reader.
2018-09-03 10:31:44 +03:00
Botond Dénes
d347866664 mutation: add missing assert to mutation from reader
read_mutation_from_flat_mutation_reader's internal adapter can build a
single mutation only and hence can consume only a single partition.
If more than one partition is pushed down from the producer the
adapter will very likely crash. To avoid unnecessary investigations add
an assert() to fail early and make it clear what the real problem is.
All other consume_ methods have an assert() already for their
invariants so this is just following suit.
2018-09-03 10:31:44 +03:00
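The fail-early invariant added by the commit above can be sketched like so. The types and names here (`partition_start`, `single_mutation_builder`) are simplified illustrations, not the actual adapter inside `read_mutation_from_flat_mutation_reader`.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Stand-in for the fragment that opens a new partition.
struct partition_start { std::string key; };

// Hypothetical sketch of the invariant: an adapter that builds a single
// mutation asserts if a second partition is pushed down from the
// producer, failing early with a clear message instead of crashing later.
class single_mutation_builder {
    std::optional<std::string> _key;
public:
    void consume_new_partition(const partition_start& ps) {
        // This adapter can consume only a single partition.
        assert(!_key && "single_mutation_builder got a second partition");
        _key = ps.key;
    }
    const std::string& key() const { return *_key; }
};
```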
Botond Dénes
ecb1e79bcc querier: add shard_mutation_querier
The querier to be used for saving shard readers belonging to a
multishard range scan. This querier doesn't provide a `consume_page`
method as it doesn't support reading from it directly. It is more
of a store that allows caching the reader and any objects it depends on.
2018-09-03 10:31:44 +03:00
Botond Dénes
07cdf766c5 querier: prepare for multi-ranges
In the next patch a querier will be added that reads multiple ranges as
opposed to a single range that data and mutation queriers read.
To keep `querier_cache` code seamless regarding this difference change all
range-matching logic to work in terms of `dht::partition_ranges_view`.
This allows for a cheap and seamless way of having a single code-base for
the insert/lookup logic. Code actually matching ranges is updated to be
able to handle both singular and multi-ranges while maintaining backward
compatibility.
2018-09-03 10:31:44 +03:00
Botond Dénes
88a7effd8d tests/querier_cache: add tests specific for multiple entry-types 2018-09-03 10:31:44 +03:00
Botond Dénes
c12008b8cb querier: split querier into separate data and mutation querier types
Instead of hiding what compaction method the querier uses (and only
expose it via rejecting 'can_be_used_for_page()`) make it very explicit
that these are really two different queriers. This allows using
different indexes for the two queriers in `querier_cache` and
eliminating the possibility of picking up a querier with the wrong
compaction method (read kind).
This also makes it possible to add new querier type(s) that suit the
multishard-query's needs without making a confusing mess of `querier` by
making it a union of all querying logic.

Splitting the queriers this way changes what happens when a lookup finds
a querier of the wrong kind (e.g. emit_only_live::yes for an
emit_only_live::no command). As opposed to dropping the found (but
wrong) querier the querier will now simply not be found by the lookup.
This is a result of using separate search indexes for the different
mutation kinds. This change should have no practical implications.

Splitting is done by making querier templated on `emit_only_live_rows`.
It doesn't make sense to duplicate the entire querier as the two share
99% of the code.
2018-09-03 10:31:44 +03:00
Botond Dénes
e46251ebf6 querier: move consume_page logic into a free function
In preparation for the now-single querier being split into multiple more
specialized ones, make it possible for multiple queriers to share the
same implementation. The code can now also be reused by outside code,
not just by queriers.
2018-09-03 10:31:44 +03:00
Botond Dénes
c53f17ddb8 querier: move all matching related logic into free functions
So that they can be used for multiple querier classes easily, without
inheritance. The functions are not visible from the header.
Also update the comments on `querier` w.r.t. the removed
checking functions. Change the language to be more general. In practice
these checks are never done by client code, instead they are done by the
`querier_cache`.
2018-09-03 10:31:44 +03:00
Botond Dénes
43f464c52d querier: inline querier::current_position() and make it public 2018-09-03 10:31:44 +03:00
Botond Dénes
86a61ded7d querier: s/position/position_view/
Also treat it as a view, that is, take it by value in functions
instead of by reference.
2018-09-03 10:31:44 +03:00
Botond Dénes
6e4ec53679 querier: move position outside of querier
In preparation for having multiple querier types that can share code
without inheritance.
2018-09-03 10:31:44 +03:00
Botond Dénes
a172dfec4e querier: move clustering_position_tracker outside of querier
In preparation for having multiple querier types that can share code
without inheritance.
2018-09-03 10:31:44 +03:00
Botond Dénes
7bd955e993 querier_cache: move insert/lookup related logic into free functions
In preparation for introducing support for multiple entry types in the
querier_cache, move all insert/lookup-related logic into free functions.
Later these functions will be templated so they can handle multiple
entry types with the same code.
2018-09-03 10:31:44 +03:00
Botond Dénes
cded477b94 querier: return std::optional<querier> instead of using create_fun()
Requiring the caller of lookup() to pass in a `create_fun()` was not
such a good idea in hindsight. It leads to awkward call sites and even
more awkward code when trying to find out whether the lookup was
successful or not.
Returning an optional gives calling code much more flexibility and makes
the code cleaner.
2018-09-03 10:31:44 +03:00
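The API change described above can be sketched with a toy cache. This is an illustration under simplified assumptions: the key/payload types and the move-out-on-hit behavior are stand-ins, not the real `querier_cache` interface.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

// Reduced stand-in for a cached querier.
struct querier { std::string state; };

// Hypothetical sketch of the new style: lookup() returns
// std::optional<querier> instead of taking a create_fun() fallback, so
// the caller decides how to handle a miss, keeping call sites clean.
class querier_cache {
    std::unordered_map<int, querier> _entries;
public:
    void insert(int key, querier q) { _entries.emplace(key, std::move(q)); }
    std::optional<querier> lookup(int key) {
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            return std::nullopt; // miss: caller creates a fresh querier
        }
        auto q = std::move(it->second); // hit: hand the cached entry over
        _entries.erase(it);
        return q;
    }
};
```

Calling code simply checks the optional (`if (auto q = cache.lookup(k)) { … }`) rather than threading a creation callback through the lookup.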
Botond Dénes
5f726e9a89 querier: move all to query namespace
To avoid name clashes.
2018-09-03 10:31:44 +03:00
Botond Dénes
867f69b9d1 dht::i_partitioner: add partition_ranges_view 2018-09-03 10:31:44 +03:00
Botond Dénes
a011a9ebf2 mutation_reader: multishard_combining_reader: support custom dismantler
Add a dismantler functor parameter. When the multishard reader is
destroyed this functor will be called for each shard reader, passing a
future to a `stopped_foreign_reader`. This future becomes available when
the shard reader is stopped, that is, when it finished all in-progress
read-aheads and/or pending next partition calls.

The intended use case for the dismantler functor is a client that needs
to be notified when readers are destroyed and/or has to have access to
any unconsumed fragments from the foreign readers wrapping the shard
readers.
2018-09-03 10:31:44 +03:00
Botond Dénes
f13b878a94 mutation_reader: pass all standard reader params to remote_reader_factory
Extend `remote_reader_factory` interface so that it accepts all standard
mutation reader creation parameters. This allows factory lambdas to be
truly stateless, not having to capture any standard parameters that are
needed for creating the reader.
Standard parameters are those accepted by
`mutation_source::make_reader()`.
2018-09-03 10:31:44 +03:00
Botond Dénes
e67c6d9f39 flat_mutation_reader::impl: add protected buffer() member
To allow implementations to access the buffer in a read-only way.
2018-09-03 10:31:44 +03:00
Botond Dénes
8915293257 multishard_combining_reader: fix incorrect comment 2018-09-03 10:31:44 +03:00
Botond Dénes
75d60b0627 docs: add paged-queries.md design doc 2018-09-03 10:31:44 +03:00
Duarte Nunes
6593226849 Merge branch 'loading_cache: fix a consistency of size() and iterators APIs' from Vlad
"
After we fixed the reloading flow, it enabled situations where items are no longer cached but
are still held in the underlying loading_shared_values object. Since loading_cache::size() returns
the size of its loading_shared_values object, and loading_cache::begin()/end()/find() return
iterators based on loading_shared_values iterators, these APIs may return very weird values; e.g.
size() may return the same value after one of the items has been removed using the remove(key) API.

This series fixes this by switching the above-mentioned APIs to work on top of the lru_list object
instead of loading_shared_values.
"

* 'loading_cache_fix_api_semantics-v1' of https://github.com/vladzcloudius/scylla:
  loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
  loading_cache: make size() return the size of lru_list instead of loading_shared_values
2018-09-01 11:05:28 +01:00
Avi Kivity
fd8eae50db build: add relocatable package target
A relocatable package contains the Scylla (and iotune)
executables (in a bin/ directory), any libraries they may need (lib/)
the configuration file defaults (conf/) and supporting scripts (dist/).
The libraries are picked up from the host; including libc and the dynamic
linker (ld.so).

We also provide a thunk script that forces the library path
(LD_LIBRARY_PATH) to point at our libraries, and overrides the
interpreter to point at our ld.so.

With these files, it is possible to run a fully functional Scylla
instance on any Linux distribution. This is similar to chroot or
containers, except that we run in the same namespace as the host.

The packages are created by running

    ninja build/release/scylla-package.tar

or

    ninja --mode debug build/debug/scylla-package.tar
Message-Id: <20180828065352.30730-1-avi@scylladb.com>
2018-08-31 23:14:42 +01:00
Vlad Zolotarov
945d26e4ee loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
Reloading may hold a value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.

This may create weird situations like this:

<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
    std::cout << e << std::endl;
}

<all 10 entries are printed, including the one for "key1">

In order to avoid such situations we are going to make loading_cache::iterator
a transform_iterator over lru_list::iterator instead of loading_shared_values::iterator
because lru_list contains entries only for cached items.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-30 20:56:44 -04:00
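The fix described above can be sketched with a toy cache where iteration and size() are answered from the LRU list rather than the backing value store. The member names echo the commit, but the types here are simplified stand-ins, not the real loading_cache.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical sketch: the LRU list holds only live cached entries, while
// the backing store may keep entries pinned by an in-flight reload.
// Answering size()/iteration from the LRU list avoids exposing stale items.
class loading_cache {
    std::unordered_map<std::string, int> _shared_values; // may hold stale entries
    std::list<std::string> _lru_list;                    // only live entries
public:
    void insert(const std::string& key, int v) {
        _shared_values[key] = v;
        _lru_list.push_back(key);
    }
    void remove(const std::string& key) {
        _lru_list.remove(key);
        // The shared value may be pinned by a reload in flight, so it is
        // intentionally left in _shared_values here.
    }
    std::size_t size() const { return _lru_list.size(); } // not _shared_values.size()
    auto begin() const { return _lru_list.begin(); }
    auto end() const { return _lru_list.end(); }
};
```

With this layout, the broken loop from the commit message prints only the nine live entries, because the removed key no longer appears in the list being iterated.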
Vlad Zolotarov
1e56c7dd58 loading_cache: make size() return the size of lru_list instead of loading_shared_values
The reloading flow may hold items in the underlying loading_shared_values
after they have been removed (e.g. via the remove(key) API), so loading_shared_values.size()
doesn't represent the correct size of the loading_cache. lru_list.size(), on the other hand, does.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-30 15:55:30 -04:00
Paweł Dziepak
dbbd664600 Update seastar submodule
* seastar 12f18ce...5712816 (6):
  > tests: add signal_test to test list
  > Merge "Enhancements for memory_output_stream" from Paweł
  > seastar-addr2line: don't print an empty line between backtrace lines
  > seastar-addr2line: add --verbose option
  > seastar-addr2line: make prefix matching non-greedy
  > future: make available() const
2018-08-30 11:41:27 +01:00
Glauber Costa
8dea1b3c61 database: fix directory for information when loading new SSTables from upload dir
When we load new SSTables, we use the directory information from the
entry descriptor to build information about those SSTables. When the
descriptor is created by flush_upload_dir, the sstable directory used in
the descriptor contains the `upload` part. Therefore, we will try to
load SSTables that are in the upload directory after we have already moved
them out, and fail.

Since the generation also changes, we have been historically fixing the
generation manually, but not the SSTable directory. The reason for that
is that up until recently, the SSTable directory was passed statically
to open_sstables, ignoring whatever the entry descriptor said. Now that
the sstable directory is also derived from the entry descriptor, we
should fix that too.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180829165326.12183-1-glauber@scylladb.com>
2018-08-30 10:34:25 +03:00
Nadav Har'El
2f02d006b3 materialized views: more tests
Additional tests for cases surrounding issue #3362, where base rows
disappear (or not) and view rows need to disappear (or not) as well.
These new tests focus on checking that view_updates::do_delete_old_entry()
is correct.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-2-nyh@scylladb.com>
2018-08-29 14:33:48 +01:00
Nadav Har'El
16a6f76873 materialized views: simplify do_delete_old_entry()
In previous patches, we gave up on an old (and broken) attempt to track
the timestamps of many unselected base-table columns through one row marker
in the view table - and replaced them by "virtual cells", one per unselected
cell.

The do_delete_old_entry() function still contains old code which maintained
that row marker, and is no longer needed. That old code is no only no longer
needed, it also no longer did anything because all columns now appear in
the view (as virtual columns) so the code ignored them when calculating the
row marker.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-1-nyh@scylladb.com>
2018-08-29 14:33:41 +01:00
Duarte Nunes
79d796e710 Merge 'Materialized Views: row liveness correction' from Nadav
"
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness - existence or disappearance -
of a view-table row is tied to the liveness of the base table row. And
that, in turn, depends not only on selected columns (base-table columns
SELECTed to also appear in the view) but also on unselected columns.

This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch set we tried to build a single "row marker" in the view column which
tried to summarize the liveness information in all unselected columns.
But this proved unworkable, as explained in issue #3362 and as will be
demonstrated in unit tests at the end of this series.

Because we can't replace several unselected cells by one row marker, what
we do in this series is to add for each of the unselected cells a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, and rather need a collection of virtual
cells.

Fixes #3362
"

* 'virtual-cols-v3' of https://github.com/nyh/scylla:
  Materialized Views: test that virtual columns are not visible
  Materialized Views: unit test reproducing fixed issue #3362
  Materialized Views: no need for elaborate row marker calculations
  Materialized Views: add unselected columns as virtual columns
  Materialized Views: fill virtual columns
  Do not allow selecting a virtual column
  schema: persist "view virtual" columns to a separate system table
  schema: add "view virtual" flag to schema's column_definition
  Add "empty" type name to CQL parser, but only for internal parsing
2018-08-29 14:32:38 +01:00
Paweł Dziepak
6f1c3e6945 Merge "Convert more execution_stages to inherit scheduling_groups" from Avi
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.

This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"

* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
  database: make database's mutation apply stage inherit its scheduling group from the caller
  database: make database::_mutation_query_stage inherit the scheduling group
  database: make database::_data_query_stage inheriting its caller's scheduling_group
  storage_proxy: make _mutate_stage inherit its caller's scheduling_group
2018-08-28 13:49:31 +01:00
Duarte Nunes
f6aadd8077 Merge 'utils::loading_cache: improve reload() robustness' from Vlad
"This series introduces a few improvements related to the reload flow.

From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.

It may also safely evict as many items from the cache as needed."

Fixes #3606

* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
  utils::loading_cache: hold a shared_value_ptr to the value when we reload
  utils::loading_cache::on_timer(): remove not needed capture of "this"
  utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
2018-08-28 10:52:20 +01:00
Piotr Sarna
aa2bfc0a71 tests: add multi-column pk test to INSERT JSON case
Refs #3687
Message-Id: <6ba1328549ed701691ca7cbdacc7d6fa72f2c3de.1534171422.git.sarna@scylladb.com>
2018-08-28 11:34:13 +03:00
Piotr Sarna
fa72422baa cql3: fix handling multi-column partition key in INSERT JSON
Multiple column partition keys were previously handled incorrectly,
now the implementation is based on from_exploded instead of
from_singular.

Fixes #3687
Message-Id: <09e0bdb0f1c18d49b9e67c21777d93ba1545a13c.1534171422.git.sarna@scylladb.com>
2018-08-28 11:34:11 +03:00
Avi Kivity
1fd9974b6b Merge "tests/loading_cache_test: Fix flakiness" from Duarte
"
Fix loading_cache_test flakiness by retrying assertions.

Tests: unit(loading_cache_test(debug, release))

Fixes #3723
"

* 'loading-cache-test-flake/v4' of https://github.com/duarten/scylla:
  tests/loading_cache_test: Unflake test_loading_cache_loading_reloading
  tests/loading_cache_test: Use eventually() instead of open-coding it
  tests/mutation_reader_test: Extract eventually_true() to eventually.hh
  tests/cql_test_env: Lift eventually() to its own header file
2018-08-28 09:35:09 +03:00
Takuya ASADA
4a5157857a dist/debian: support package renaming on build script
To automatically rename packages on an enterprise release, the package name
prefix was added as a variable in build_deb.sh.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180828010445.11920-1-syuu@scylladb.com>
2018-08-28 09:25:07 +03:00
Avi Kivity
22396d57c2 Update seastar submodule
* seastar 9bb1611...12f18ce (17):
  > correctly configure I/O Scheduler for usage with the YAML file
  > Added support for user-defined signal handlers
  > Added reactor method to modify blocked_reactor_notify_ms
  > configure.py: Use the user-specified compiler for dialect detection
  > seastar-addr2line: clear current trace when omitting already seen trace
  > seastar-addr2line: fix redirecting output to a file
  > seastar-addr2line: don't require a space before the addresses
  > tests: Ensure test thread is always joined
  > README.md: Add cute badges
  > iotune: adjust num-io-queues recommendation
  > dns: add SRV record lookup
  > reactor: define max_aio_per_queue for C++14
  > reactor,alien: silence GCC warnings
  > core,json,net: silence GCC warnings
  > fstream: "using data_sink_impl::put" to silence gcc warning
  > Merge 'Ensure Seastar compiles in C++14 mode' from Jesse
  > Revert "foreign_ptr: allow waiting for the destruction of the managed ptr"
2018-08-28 09:10:14 +03:00
Tomasz Grabiec
75cde85349 Merge "Support reading range tombstones" from Piotr and Vladimir
Implement and test support for reading range tombstones in SSTables 3.

Does not yet support reads which are using slicing or fast forwarding.

From github.com/scylladb/seastar-dev.git haaawk/sstables3/tombstones_v11:

Piotr Jastrzebski (5):
  sstables: Add consumer_m::consume_range_tombstone
  sstables: Support null columns in ck
  sstables: Support reading range_tombstones
  sstables: Test reading range_tombstones
  sstables: Add test for RT with non-full key

Vladimir Krivopalov (2):
  sstables: Add operator<< overload for bound_kind_m.
  keys: Add clustering_key_prefix::make_full helper.
2018-08-27 20:43:38 +02:00
Duarte Nunes
40044c0460 tests/loading_cache_test: Unflake test_loading_cache_loading_reloading
The `loading_cache_test::test_loading_cache_loading_reloading` test
case is flaky, and fails in both debug and release mode. In an
over-provisioned environment, it's possible that when the reactor
runs, the timers for the `sleep()` and for reloading the
`loading_cache` are both expired, and continuations are scheduled with
an arbitrary order, causing the test to fail.

Fixes #3723

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:05 +01:00
Duarte Nunes
0cb03b966d tests/loading_cache_test: Use eventually() instead of open-coding it
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:05 +01:00
Duarte Nunes
b89fa0d67b tests/mutation_reader_test: Extract eventually_true() to eventually.hh
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:05 +01:00
Duarte Nunes
636c5ded6c tests/cql_test_env: Lift eventually() to its own header file
Retrying is needed everywhere.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-27 19:24:00 +01:00
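The retry idea behind the `eventually()` helper in the series above can be sketched as follows. The real helper in the tree is built on seastar futures; this is a plain-thread approximation of the same pattern, and the signature and defaults are assumptions for illustration.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical sketch of an eventually()-style test helper: re-run an
// assertion a bounded number of times, sleeping between attempts, and let
// only the final failure propagate. This de-flakes tests that race
// against timers in over-provisioned environments.
inline void eventually(std::function<void()> f, int attempts = 10,
                       std::chrono::milliseconds delay = std::chrono::milliseconds(10)) {
    for (int i = 0; ; ++i) {
        try {
            f();
            return; // assertion finally held
        } catch (...) {
            if (i + 1 == attempts) {
                throw; // out of retries: surface the last failure
            }
            std::this_thread::sleep_for(delay);
        }
    }
}
```

A test then writes `eventually([&] { assert_condition(); });` instead of open-coding a sleep-and-retry loop at every call site.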
Avi Kivity
5792a59c96 migration_manager: downgrade frightening "Can't send migration request" ERROR
This error is transient, since as soon as the node is up we will be able
to send the migration request.  Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).

The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.

Fixes #3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>
2018-08-27 14:49:36 +02:00
Takuya ASADA
10b67c7934 dist/ami: package scylla-ami as rpm
Now that scylla-ami is not a submodule of the scylla repo, it works as an
independent repository just like scylla-jmx and scylla-tools, providing a
.rpm package to install AMI scripts on the AMI.

Most files are gone from dist/ami/files, but scylla_install_ami is copied
from scylla-ami; since it needs to install the scylla .rpms, it cannot be
packaged in the scylla-ami rpm.

In scylla_install_ami, we dropped the ixgbevf/ena driver code; we will
provide 'scylla-ixgbevf' and 'scylla-ena' DKMS .rpm packages instead.
It will automatically build kernel modules for current kernel.

A repo of the driver packages is on
https://copr.fedorainfracloud.org/coprs/scylladb/scylla-ami-drivers/

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180821201101.4631-1-syuu@scylladb.com>
2018-08-27 11:48:52 +03:00
Avi Kivity
62750eb517 Merge "Prepare for removing Iterator from simple_memory_input_stream" from Paweł
"
Right now, simple_memory_input_stream takes Iterator as a template
parameter. That iterator is supposed to point to fragments in an
underlying fragmented buffer. This makes no sense, since simple streams
deal only with contiguous buffers.

This series removes any assumption that simple_memory_input_stream has
iterator_type member from Scylla so that it can be removed.
"

* tag 'prepare-simple-stream-no-iterator/v1' of https://github.com/pdziepak/scylla:
  idl: deserialized_bytes_proxy do not assume presence of iterator_type
  idl-compiler: specify return type of with_serialized_stream() lambdas
2018-08-26 16:29:06 +03:00
Avi Kivity
16478355be Merge "Refactor password handling" from Jesse
"
This series is a refactor of password management, motivated by a
combination of correctness bugs, improving testability, improving
clarity, and adding documentation.

Tests: unit (release)
"

* 'jhk/passwords_refactor/v2' of https://github.com/hakuch/scylla:
  auth: Clean up implementation comments
  auth: Remove unnecessary local variable
  auth: Allow different random engines for salt
  auth: Correct modulo bias in salt generation
  auth: Extract random byte generation for salt
  auth: Split out test for best supported scheme
  auth: Rename function to use full words
  auth: Add domain-specific exception for passwords
  auth: Document passwords interface
  auth: Move passsword stuff to its own namespace
  auth: Identify password hashing errors correctly
  auth: Add unit tests for password handling
  auth: Move password handling to its own files
  auth: Construct `std::random_device` instances once
2018-08-26 11:18:31 +03:00
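The "modulo bias" fix named in the password series above refers to a well-known pitfall: drawing `engine() % 256` skews toward low values whenever the engine's range is not a multiple of 256. A hedged sketch of the unbiased approach using `std::uniform_int_distribution` follows; the function name, salt length, and engine template are assumptions for illustration, not Scylla's actual auth code.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch: generate salt bytes without modulo bias by letting
// std::uniform_int_distribution handle range reduction (it rejects and
// redraws internally instead of taking a biased modulo). Templating on
// the engine allows substituting a different random source in tests.
template <typename Engine>
std::vector<std::uint8_t> make_salt(Engine& eng, std::size_t n = 16) {
    std::uniform_int_distribution<int> dist(0, 255); // unbiased byte values
    std::vector<std::uint8_t> salt;
    salt.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        salt.push_back(static_cast<std::uint8_t>(dist(eng)));
    }
    return salt;
}
```

Note the distribution is parameterized on `int`, since the standard does not permit `std::uniform_int_distribution<std::uint8_t>`.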
Tomasz Grabiec
2afce13967 database: Avoid OOM when soft pressure but nothing to flush
There could be soft pressure, but the soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block on earlier flushes to complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.

flush will take longer to complete. Due to scheduling group isolation,
the soft-pressure flusher will keep getting the CPU.

This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test

Fixes #3717

Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:03:58 +03:00
Tomasz Grabiec
1e50f85288 database: Make soft-pressure memtable flusher not consider already flushed memtables
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:34 +03:00
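The selection logic fixed by the commit above can be sketched in miniature. The struct and function here are stand-ins for illustration, not the real logalloc region or dirty-memory manager; the point is only that a memtable whose flush has started reports zero evictable occupancy and so sorts last.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the fix: once a memtable starts flushing, its
// region reports evictable_occupancy() of 0, so the soft-pressure flusher
// stops picking it as the largest candidate.
struct memtable {
    std::size_t occupancy;
    bool flushing = false;
    std::size_t evictable_occupancy() const {
        return flushing ? 0 : occupancy; // flushing memtables sort last
    }
};

// Pick the memtable with the most reclaimable memory.
inline const memtable* pick_flush_candidate(const std::vector<memtable>& v) {
    auto it = std::max_element(v.begin(), v.end(),
        [](const memtable& a, const memtable& b) {
            return a.evictable_occupancy() < b.evictable_occupancy();
        });
    return it == v.end() ? nullptr : &*it;
}
```

Before the fix, the large already-flushing memtable would keep winning the comparison and the flusher would stall on it; with the zeroed occupancy, the smaller but still-flushable memtable is chosen instead.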
Tomasz Grabiec
364418b5c5 logalloc: Make evictable_occupancy() indicate no free space
Doesn't fix any bug, but it's closer to the truth that all segments
are used rather than that none are used.

Message-Id: <1535040132-11153-1-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:32 +03:00
Avi Kivity
54ac334f4b Update scylla-ami submodule
* dist/ami/files/scylla-ami c7e5a70...b7db861 (2):
  > scylla-ami-setup.service: run only on first startup
  > Use fstab to mount RAID volume on every reboot
2018-08-26 10:57:32 +03:00
Takuya ASADA
ff55e3c247 dist/common/scripts/scylla_raid_setup: refuse start scylla-server.service when RAID volume is not mounted
Since the Linux system aborts booting when it fails to mount fstab entries,
the user may not be able to see an error message when we use fstab to mount
/var/lib/scylla on the AMI.

Instead of aborting boot, we can just refuse to start scylla-server.service
when the RAID volume is not mounted, using the RequiresMountsFor directive of the
systemd unit file.

See #3640

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180824185511.17557-1-syuu@scylladb.com>
2018-08-26 10:55:34 +03:00
Avi Kivity
37f9a3c566 database: make database's mutation apply stage inherit its scheduling group from the caller
Like the two preceding patches, convert the mutation apply stage
to an inheriting_concrete_scheduling_group.  This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage, which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
2018-08-24 19:04:49 +03:00
Avi Kivity
ebff1cfc37 database: make database::_mutation_query_stage inherit the scheduling group
Like the preceding patch and for the same reasons, adjust
database::_mutation_query_stage to inherit the scheduling group from its
caller.
2018-08-24 19:04:49 +03:00
Avi Kivity
596fb6f2f7 database: make database::_data_query_stage inheriting its caller's scheduling_group
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller.  By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
2018-08-24 19:04:49 +03:00
Avi Kivity
908e497f3d storage_proxy: make _mutate_stage inherit its caller's scheduling_group
Right now, storage_proxy's mutate_stage violates isolation by running
in a plain execution_stage without a scheduling_group. This means do_mutate()
will run under the main scheduling_group, at least until we reach the database
apply execution stage, which is correct.

Fix by moving to an inheriting execution stage; this works because the
messaging service will tell RPC to set the correct execution stage for us. We
could explicitly specify statement_scheduling_group, but inheriting the
scheduling group allows us to have multiple statement scheduling groups later.
2018-08-24 19:04:49 +03:00
Paweł Dziepak
4ca991ea65 idl: deserialized_bytes_proxy do not assume presence of iterator_type
deserialized_bytes_proxy assumes that the provided input stream has
an iterator_type that represents the iterator pointing to the next
fragment of the fragmented underlying buffer. This makes little sense
if the input stream is a contiguous one (i.e.
simple_memory_input_stream), so let's not make such assumptions.
2018-08-24 16:19:40 +01:00
Paweł Dziepak
3b7579aa0e idl-compiler: specify return type of with_serialized_stream() lambdas
IDL-generated code uses with_serialized_stream() to optimise for cases
when the underlying buffer is not fragmented. The provided lambda will
be called with either a simple or a fragmented stream as an argument. The
consequence of this is that both instantiations of the generic lambda need
to return the same type. This is a problem if the type is deduced and
depends on the provided input stream (e.g. a different type for fragmented
and simple streams). The solution is to explicitly specify the return
type as the type returned by deserialising the general utils::input_stream.
This way each instantiation of the lambda can return whatever it wants, as
long as it is convertible to the type that the serialiser would return
if a utils::input_stream were given.
2018-08-24 16:07:20 +01:00
Tomasz Grabiec
10f6b125c8 database: Run system table flushes in the main scheduling group
memtable flushes for system and regular region groups run under the
memtable_scheduling_group, but the controller adjusts shares based on
the occupancy of the regular region group.

It can happen that regular is not under pressure, but system is. In
this case the controller will incorrectly assign low shares to the
memtable flush of system. This may result in high latency and low
throughput for writes in the system group.

I observed writes to the system keyspace timing out (on scylla-2.3-rc2)
in the dtest: limits_test.py:TestLimits.max_cells_test, which went
away after this.

Fixes #3717.

Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>
2018-08-23 15:07:05 +03:00
Piotr Sarna
94262cf5d0 tests: add null collection test scenario to INSERT JSON
Refs #3664
Message-Id: <a34b9f5e8b9d7e3dd8906b559957220d74734b41.1534848313.git.sarna@scylladb.com>
2018-08-23 11:22:07 +03:00
Piotr Sarna
465045368f cql3: add proper setting of empty collections in INSERT JSON
Previously, empty collections were incorrectly added as dead cells,
which resulted in serialization errors later.

Fixes #3664
Message-Id: <a9c90d66c6737641cafe40edb779df490ada0309.1534848313.git.sarna@scylladb.com>
2018-08-23 11:22:05 +03:00
Duarte Nunes
36a293bb23 cell_locking: Use xxhash instead of fnv1a
cell_locking was the single user of fnv1a, so this allows us to get rid
of it. As the TODO inside fnv1a_hasher.hh indicates, and as any
independent benchmark shows, fnv1a is very slow. We have since added
xx_hash, which we know to be fast, so use it instead.

Tests: unit(release/cell_locker_test)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180823081715.26089-1-duarte@scylladb.com>
2018-08-23 11:21:00 +03:00
Piotr Jastrzebski
2997fda1b1 sstables: Add test for RT with non-full key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 18:28:11 +02:00
Piotr Jastrzebski
c50929233f sstables: Test reading range_tombstones
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 18:28:11 +02:00
Piotr Jastrzebski
7434be348c sstables: Support reading range_tombstones
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 18:27:41 +02:00
Piotr Jastrzebski
d19a108d87 sstables: Support null columns in ck
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 14:32:10 +02:00
Piotr Jastrzebski
3636697663 sstables: Add consumer_m::consume_range_tombstone
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-08-22 12:53:15 +02:00
Vladimir Krivopalov
8acf4ddb8e keys: Add clustering_key_prefix::make_full helper.
This method fills a non-full clustering key with trailing empty values to
make it full.
This can be used for clustering keys of rows in a compact table since,
unlike in regular tables, they can be non-full.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-22 12:13:23 +02:00
Amnon Heiman
ab207356a5 API: storage_service stream endpoints
This patch changes how the list of tokens is returned from the
storage_service API.

Instead of creating a vector and constructing a JSON object from it, use
the streaming capabilities of HTTP.

This is important for large clusters and prevents large allocations.

Fixes #3701

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180820195631.26792-1-amnon@scylladb.com>
2018-08-22 11:24:38 +03:00
Takuya ASADA
e4f38b7c22 dist/redhat: support package renaming on build script
To automatically rename packages for enterprise releases, the package name
prefix was added as a variable to build_rpm.sh.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180822072105.9420-1-syuu@scylladb.com>
2018-08-22 11:03:39 +03:00
Piotr Sarna
4a274ee7e2 tests: add parsing varint from JSON string test
Refs #3666
Message-Id: <f4205e9484f5385796fade7986e3e38dcbc65bac.1534845398.git.sarna@scylladb.com>
2018-08-21 11:20:11 +01:00
Piotr Sarna
37a5c38471 types: enable deserializing varint from JSON string
Previously deserialization failed because the JSON string
representing a number was unnecessarily quoted.

Fixes #3666
Message-Id: <a0a100dbac7c151d627522174303657d1da05c27.1534845398.git.sarna@scylladb.com>
2018-08-21 11:20:11 +01:00
Tomasz Grabiec
6937cc2d1c Merge 'Fix multi-cell static list updates in the presence of ckeys' from Duarte
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.

This is the code that was removed from some list operations:

 std::experimental::optional<clustering_key> row_key;
 if (!column.is_static()) {
   row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
 }
 ...
 auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);

Put it back, in the form of common code in the update_parameters class.

Fixes #3703

* https://github.com/duarten/scylla cql-list-fixes/v1:
  tests/cql_query_test: Test multi-cell static list updates with ckeys
  cql3/lists: Fix multi-cell static list updates in the presence of ckeys
  keys: Add factory for an empty clustering_key_prefix_view
2018-08-21 12:14:30 +02:00
Vladimir Krivopalov
c8422c9a91 sstables: Add operator<< overload for bound_kind_m.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-20 16:22:53 -07:00
Duarte Nunes
ff7304b190 tests/cql_query_test: Test multi-cell static list updates with ckeys
Refs #3703

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-20 21:39:37 +01:00
Duarte Nunes
05731cb5ad cql3/lists: Fix multi-cell static list updates in the presence of ckeys
This patch fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.

This is the code that was removed from some list operations:

std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
  row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
...
auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);

Put it back, in the form of common code in the update_parameters class.

Fixes #3703

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-20 21:39:37 +01:00
Duarte Nunes
ce461b06d7 keys: Add factory for an empty clustering_key_prefix_view
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-20 21:39:37 +01:00
Avi Kivity
231174cda9 build: auto-detect g++ -gz support
Older combinations of g++/binutils don't support -gz, so auto-detect its
presence.

Fixes #3697.
Message-Id: <20180817161113.2287-1-avi@scylladb.com>
2018-08-20 18:48:18 +02:00
Tomasz Grabiec
c31dff8211 Merge 'Skip inside wide partitions using index (rows only)' from Vladimir
This patchset adds support for skipping inside wide partitions using
index for sliced queries. This can significantly reduce disk I/O for
queries that only need to read a small amount of data from a wide
partition.

Other changes include general code clean-up and simplification.

 * github.com/argenet/scylla.git tree/projects/sstables-30/skip_using_index/v6:
  sstables: Support resetting data_consume_rows_context_m to
    indexable_element::cell.
  tests: Add tests to cover skipping with index through SSTables 3.x.
  sstables: Support skipping inside wide partitions using index.
  to_string: Add operator<< overload for std::optional.
  sstables: Use std::optional instead of std::experimental::optional.
2018-08-20 18:39:51 +02:00
Avi Kivity
e605cd4ff8 multishard_writer_test: reduce mutation count in release mode
We see occasional bad_alloc failures in release mode; this is due
to the random mutation generator generating large mutations.

Reduce the mutation count to 300. I tested 100 runs and all passed,
so it reduces the false positive rate to < 1%.
2018-08-20 16:53:05 +03:00
Gleb Natapov
7277ee2939 storage_proxy: do not fail read without speculation on connection error
After ac27d1c93b, if a read executor has just enough targets to
achieve the request's CL and a connection to one of them is dropped
during execution, a ReadFailed error will be returned immediately and
the client will not have a chance to issue a speculative read (retry).
The patch changes the code to not return the ReadFailed error
immediately, but to wait for the timeout instead, giving the client a
chance to issue a speculative read in case the read executor does not
have additional targets to send speculative reads to by itself.

Fixes #3699.
Message-Id: <20180819131646.GK2326@scylladb.com>
2018-08-20 10:12:31 +03:00
Vladimir Krivopalov
f1b9f82ff5 sstables: Use std::optional instead of std::experimental::optional.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 18:20:05 -07:00
Vladimir Krivopalov
7b1d4915a1 to_string: Add operator<< overload for std::optional.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 18:20:05 -07:00
Vladimir Krivopalov
3e92434eed sstables: Support skipping inside wide partitions using index.
This fix adds proper support for skipping inside wide partitions using
index for sliced reads. This significantly reduces disk I/O for filtered
queries.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 18:20:04 -07:00
Vladimir Krivopalov
ec78fb9f13 tests: Add tests to cover skipping with index through SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 18:19:22 -07:00
Vladimir Krivopalov
4bf1e9de3f sstables: Support resetting data_consume_rows_context_m to indexable_element::cell.
Set the proper parsing state when resetting to indexable_element::cell.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 10:09:19 -07:00
Eliran Sinvani
f5f6cf2096 cql3: remove rejection of an IN relation if not on last partition KEY
The constraint is no longer relevant, since Cassandra removed
it in version 2.2. In addition, the mechanism for handling this
case is already implemented and is identical to the one used for
clustering keys with single-column EQ and IN relations
(Cartesian product of singular ranges).

A unit test for this case was added.

Fixes #1735
Tests:
1. Unit Tests.
2. Manual testing with the case described in the issue.
3. dtest: ql_additional_tests.py:TestCQL.composite_row_key_test

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <83b43fdc1ca0e0cc287f66f11816fc71b8bd2925.1534430405.git.eliransin@scylladb.com>
2018-08-16 19:32:43 +01:00
Eliran Sinvani
d743ceae76 cql3: ignore LIMIT in select statement with aggregate
LIMIT should restrict the output result and not the query whose result
set is aggregated. When using an aggregate, the output is guaranteed to
be only one row long. Since LIMIT accepts only non-negative numbers,
it has no effect and can be ignored.

Fixes #2028
Tests: the test case described in the issue; unit tests.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <6c235376c81f052020e2ed23d0a3d071b36d4415.1534416997.git.eliransin@scylladb.com>
2018-08-16 19:31:56 +01:00
Nadav Har'El
8c604921ac Materialized Views: test that virtual columns are not visible
In the previous patches, we added "virtual columns" to materialized views
to solve row liveness issues (issue #3362). Here we add a test that confirms
that although these virtual columns exist in the view, they are not
visible to the user. They cannot be explicitly SELECTed from the view table,
and a "SELECT *" will skip them.

Refs #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:51:46 +03:00
Nadav Har'El
5ca974547a Materialized Views: unit test reproducing fixed issue #3362
This patch includes several tests reproducing issue #3362 - the effect
of unselected columns on view-table row liveness - and confirming
that it was fixed.

We found two example scenarios to demonstrate the bug. One scenario,
test_3362_with_ttls(), involves an unselected column with a TTL. The other,
test_3362_no_ttls() demonstrates the same bug without using TTL, and using
explicit updates and deletions instead. These two tests are heavily
commented, to explain what they test, and why.

In addition to these two basic tests, we also include similar tests
involving multiple items in a collection column, instead of multiple
separate columns, which demonstrate the same problem exists there (and
why, unfortunately, the "virtual columns" we add in that case need to
be collections too).

We also test that the virtual columns - and the problems they fix -
work not only on columns originally created with the view, but also
with unselected columns added later with ALTER TABLE on the base table.

Refs #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:48:07 +03:00
Nadav Har'El
6c00341383 Materialized Views: no need for elaborate row marker calculations
Now that we have separate virtual cells to represent unselected columns
in a materialized view, we no longer need the elaborate row-marker liveness
calculations which aimed (but failed) to do the same thing. So that code
can be removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:45:41 +03:00
Nadav Har'El
30f721afab Materialized Views: add unselected columns as virtual columns
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness (existence or disappearance)
of a view-table row is tied to the liveness of the base table row - and
that depends not only on selected columns (base-table columns SELECTed to
also appear in the view) but also on unselected columns.

This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch we tried to build a single "row marker" in the view column which
summarizes the liveness information in all unselected columns, but this
proved unworkable, as explained in issue #3362 and as will be demonstrated
in unit tests in a later patch.

Because we can't replace several unselected cells by one row marker, what
we do in this patch is to add, for each unselected cell, a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, and rather need a collection of virtual
cells.

This patch just adds the virtual columns to the view schema. Code from
the previous patch, when it notices the virtual columns in the view's
schema, adds the appropriate content into these columns.

We may need to add virtual columns to a view when first created, but also
when an unselected column is added to the base table with "ALTER TABLE",
so both are supported in this patch.

Fixes #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:42:22 +03:00
Nadav Har'El
782baa44ef Materialized Views: fill virtual columns
The add_cells_to_view() function usually adds selected cells from the base
table to the view mutation. For issue #3362, we sometimes want to also
add unselected cells as "virtual" cells -  truncated versions of the
base-table cells just without the values.

This patch contains the code to fill the virtual columns' data using the
regular columns from the base table.

This patch does not yet actually *add* any virtual columns to the schema,
so until that is done (in the next patch), this patch will not yet cause
any behavior change. This is important for bisectability.

Refs #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:38:27 +03:00
Nadav Har'El
3f3a76aa8f Do not allow selecting a virtual column
For issue #3362, we will need to add to a materialized view also unselected
base-table columns as "virtual columns". We need these columns to exist
to keep view rows alive, but we don't want the user to be able to see
them.

In this patch we prevent SELECTing the virtual columns of the view,
and also exclude the virtual columns from a "SELECT *" on a view.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:34:22 +03:00
Nadav Har'El
36a657fc10 schema: persist "view virtual" columns to a separate system table
In the previous patch, we added a "view virtual" flag on columns. In this
patch we add persistence to this flag, i.e., writing it to the on-disk
schema table and reading it back on startup. But the implementation is
not as simple as adding a flag:

In the on-disk system tables, we have a "columns" table listing all the
columns in the database and their types. Cqlsh's "DESCRIBE MATERIALIZED
VIEW" works by reading this "columns" table, and listing all of the
requested view's columns. Therefore, we cannot add "virtual columns" -
which are columns not added by the user and not intended to be seen -
to this list.

We therefore need to create in this patch a separate list for virtual
columns, in a new table "view_virtual_columns". This table is essentially
identical to the existing "columns" table, just separate. We need to write
each column to the appropriate table (columns with the view_virtual flag to
"view_virtual_columns", columns without it to the old "columns"), read
from both on startup, and remember to delete columns from both when a table
is dropped.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:30:06 +03:00
Nadav Har'El
0a1d93138d schema: add "view virtual" flag to schema's column_definition
In this patch we add a flag, "view virtual", that we can mark on on a
column defined in a schema. In following patches, we will add such virtual
columns to materialized views to allow view rows to remain alive despite
having no data (refs #3362).

After this patch, the "view virtual" flag exists in our in-memory
representation of the schema, but not persisted to disk - we will
fix this in the next patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:23:09 +03:00
Nadav Har'El
b4fc711903 Add "empty" type name to CQL parser, but only for internal parsing
Even before this patch, Scylla supported the "empty" type (a column with
no content) but only internally - i.e., in code but not in CQL syntax.
The "empty" type was used in dense tables without regular columns, and a
special optimization in db::cql_type_parser::parse() allowed this type
name to be parsed when reading the schema tables, without allowing the
"empty" type to be used by users in CQL statements.

However, parse() only supported "empty" itself, and more complex types
like list<empty> were not recognized by parse(). In the following patches,
we plan to add to virtual columns to materialized views, with types empty,
list<empty> or map<something, empty>. We need all these types to work, and
before this patch, they don't. But we want all of these types to only work
internally - when Scylla's code creates these hidden columns; we do not
want to add the "empty" type to CQL's syntax.

This is what we do in this patch: The CQL parser's comparator_type rule
now has a parameter, "internal", used to differentiate internal calls
via db::cql_type_parser::parse() from calls from CQL query parsing.
If a user tries something like:

    CREATE TABLE e (pk empty PRIMARY KEY);

He will get the error:

    Invalid (reserved) user type name empty

Note that here, as usual, unknown types are treated as "user types",
and "empty" is not allowed as a user type name - we "reserve" it in case
one day in the future we will want to allow users a direct syntax to
create empty columns. We already have, following Cassandra, a bunch of
other names reserved from being user type names, including "byte",
"complex", and others (see _reserved_type_names()), and using "empty"
as a type name will result in a similar error message.

Just like all other type names, the name "empty" is not a reserved
keyword in other senses: a user can create a table or a column with
the name "empty", just like he can create one with the name "int".

Refs #3362.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-08-16 15:12:27 +03:00
Duarte Nunes
a4355fe7e7 cql3/query_options: Use _value_views in prepare()
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called
transport::request::read_options() leaves _values completely empty.

In query_options::prepare() we were, however, using _values to
associate values with the client-specified column names, and not
_value_views. Fix this by using _value_views instead.

As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x04 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.

Fixes #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
2018-08-15 10:38:09 +01:00
Duarte Nunes
8751a58a2b cql3/query_options: Preserve unset values when building value_views
A raw value can be in one of three states: a valid value, an unset
value, or a null value. When translating raw_values to their views, we
were treating both unset and null values as null raw_value_views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814231051.14385-1-duarte@scylladb.com>
2018-08-15 10:37:29 +01:00
Duarte Nunes
805ce6e019 cql3/query_processor: Validate presence of statement values timeously
We need to validate before calling query_options::prepare() whether
the set of prepared statement values sent in the query matches the
number of names we need to bind; otherwise we risk an out-of-bounds
access if the client also specified names together with the values.

Refs #3688

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814225607.14215-1-duarte@scylladb.com>
2018-08-15 10:37:13 +01:00
Eliran Sinvani
d734d316a6 cql3: ensure repeated values in IN clauses don't return repeated rows
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated, since the assumption
is that each value in the IN list corresponds to a different partition.
This results in the same row appearing in the result a number of times
corresponding to the duplication of the partition value.

Added queries to the IN-restriction unit test and fixed a bad result check.

Fixes #2837
Tests: queries as in the use case from the GitHub issue, in both prepared
and plain forms (using the Python driver); unit tests.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
2018-08-15 10:21:22 +01:00
Duarte Nunes
a025bf6a7d Merge seastar upstream
Seastar introduced a "compat" namespace, which conflicts with Scylla's
own "compat" namespaces. The merge thus includes changes to scope
uses of Scylla's "compat" namespaces.

* seastar 8ad870f...9bb1611  (5):
  > util/variant_utils: Ensure variant_cast behaves well with rvalues
  > util/std-compat: Fix infinite recursion
  > doc/tutorial: Undo namespace changes
  > util/variant_utils: Add cast_variant()
  > Add compatbility with C++17's library types

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-14 13:07:09 +01:00
Duarte Nunes
25a0a0f83d tests/cql_test_env: Increase eventually() attempts
The current value has proved to be insufficient for our CI
infrastructure.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112201.8595-1-duarte@scylladb.com>
2018-08-14 12:37:32 +01:00
Duarte Nunes
495a92c5b6 tests/gossip_test: Use RAII for orderly destruction
Change the test so that services are correctly torn down, in the
correct order (e.g., storage_service accesses the messaging_service when
stopping).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-2-duarte@scylladb.com>
2018-08-14 12:27:14 +01:00
Duarte Nunes
3956a77235 tests/gossip_test: Don't bind address to avoid conflicts
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-1-duarte@scylladb.com>
2018-08-14 12:27:02 +01:00
Piotr Sarna
310d0a74b9 cql3: throw proper request exception for INSERT JSON
JSON code is amended in order to return proper
"Missing mandatory PRIMARY KEY part" message instead of generic
"Attempt to access value of a disengaged optional object".

Fixes #3665
Message-Id: <69157d659d51ce5a2d408614ce3ba7bf8e3a5d88.1534161127.git.sarna@scylladb.com>
2018-08-13 23:57:37 +01:00
Piotr Sarna
b73669c329 tests: add parsing numeric values from string
Numeric values (ints, doubles) should accept a string representation
when passed in an INSERT JSON statement.

Refs #3666
Message-Id: <586fea8fd08fe01f7a133f82f517e26d08d7cb76.1534153955.git.sarna@scylladb.com>
2018-08-13 23:57:37 +01:00
Piotr Sarna
b3f438bfec types: enable parsing numeric JSON values from string
In order to be Cassandra-compatible, JSON values passed in INSERT JSON
statement should accept string parameters for numeric types - int,
double, etc.

Fixes #3666
Message-Id: <4da9a2f68de31492a2e9432493663a62b138c2f2.1534153955.git.sarna@scylladb.com>
2018-08-13 23:57:37 +01:00
Duarte Nunes
5de02ab98c tracing: Pass string_view instead of string to add_query
This resulted in superfluous copies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180812085326.6260-1-duarte@scylladb.com>
2018-08-13 23:57:37 +01:00
Jesse Haber-Kucharsky
b95bbb2e72 auth: Clean up implementation comments 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
9519a03351 auth: Remove unnecessary local variable
The variable could be declared `const`, but removing it outright seems
more clear and this way we don't have to come up with a name.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
52d3ff057a auth: Allow different random engines for salt
This makes the function usable in more contexts due to
flexibility (including in tests), since the state is not captured and
the characteristics of salt generation can be customized to the caller's
needs.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
836fd954e1 auth: Correct modulo bias in salt generation
Instead of reducing the large value via `%`, which can produce
non-uniformly distributed values when the range is small, we specify the
range in the distribution, which is uniform by construction.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
fe58a0b207 auth: Extract random byte generation for salt 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
fd60d61ebf auth: Split out test for best supported scheme
The `generate_salt` function invokes this function internally now.

This change means that `generate_salt` is now thread-safe and therefore
does not have to be invoked by a single thread only when starting the
`password_authenticator`.

This further means that `generate_salt` does not need to be part of the
public interface of the module, and can be moved to the implementation
file.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
adf058bd1f auth: Rename function to use full words 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
9b8cbb8542 auth: Add domain-specific exception for passwords 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
dbea3f5a01 auth: Document passwords interface 2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
b272d622f8 auth: Move password stuff to its own namespace
For clarity and nicer function names.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
de01aaf181 auth: Identify password hashing errors correctly
See fce10f2c6e for reference.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
c10fcbf7a5 auth: Add unit tests for password handling
This will mean we can make changes more confidently.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
2a40bcb281 auth: Move password handling to its own files
While the `password_authenticator` is a complex component with lots of
dependencies, password hashing and checking itself is a process with
limited logical state and dependencies, which makes it easy to isolate
and test.
2018-08-13 13:24:45 -04:00
Jesse Haber-Kucharsky
03cf57db62 auth: Construct std::random_device instances once
`std::random_device` has a lot of implementation-specific behavior, and
as a result we cannot assume much about its performance characteristics.

We initialize thread-specific static instances of `std::random_device`
once so that we don't have the overhead of invoking the ctor during
every invocation of `gensalt`.
2018-08-13 13:24:45 -04:00
Duarte Nunes
f86811a3c9 Merge seastar upstream
* seastar d40faff...8ad870f (9):
  > reactor: switch indentation
  > properly configure I/O Scheduler when --max-io-requests is passed
  > IOTune: tell users that the evaluation will take a while
  > exceptions: fix compilation with static libstdc++
  > apps/iotune: print out which config file updated
  > foreign_ptr: allow waiting for the destruction of the managed ptr
  > Merge "Improve UX for backtraces read from stdin" from Botond

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-12 14:01:36 +01:00
Avi Kivity
183d5ba178 build: compress debug sections
Compressing debug section reduces build size by 30% with no
significant increase in build time.

Results on a 4-core system (ninja release, size in MB):

before:

18056	build

real	59m43.138s
user	229m3.180s
sys	6m49.460s

after:

12387	build

real	60m30.112s
user	232m8.962s
sys	6m49.364s

Presumably, the difference in debug mode is even greater.
Message-Id: <20180811180444.30578-1-avi@scylladb.com>
2018-08-11 19:41:55 +01:00
Takuya ASADA
2ef1b094d7 dist/common/scripts/scylla_setup: don't proceed with RAID setup until user types 'done'
Need to wait for user confirmation before running RAID setup.

See #3659
Fixes #3681

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810194507.1115-1-syuu@scylladb.com>
2018-08-11 18:48:05 +03:00
Takuya ASADA
b7cf3d7472 dist/common/scripts/scylla_setup: don't mention the interactive mode prompt when running in non-interactive mode
Skip showing the message when running in non-interactive mode.

Fixes #3674

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810191945.32693-1-syuu@scylladb.com>
2018-08-11 18:48:03 +03:00
Takuya ASADA
ef9475dd3c dist/common/scripts/scylla_setup: check existence of housekeeping.cfg before asking to run version check
Skip asking to run the version check when housekeeping.cfg already
exists.
Fixes #3657

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180807232313.15525-1-syuu@scylladb.com>
2018-08-11 18:48:02 +03:00
Takuya ASADA
f30b701872 dist/debian: fix install scylla-server.service
In a previous commit we moved debian/scylla-server.service to
debian/scylla-server.scylla-server.service to explicitly specify the
subpackage name, but that doesn't work for dh_installinit without the
'--name' option.

As a result, the current scylla-server .deb package is missing
scylla-server.service, so we need to rename the service back to its
original file name.

Fixes #3675

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810221944.24837-1-syuu@scylladb.com>
2018-08-11 15:07:37 +03:00
Duarte Nunes
1521dc56ae Merge 'Pass query options to restrictions filter' from Piotr
"
This miniseries fixes ALLOW FILTERING support for prepared statements
by passing correct query options to the filter instead of empty ones.
"

* 'pass_query_options_to_restrictions_filter' of https://github.com/psarna/scylla:
  tests: add testing prepared statements with ALLOW FILTERING
  cql3: pass query options to restrictions filter
2018-08-09 18:15:18 +01:00
Duarte Nunes
95677877c2 Merge 'JSON support fixes' from Piotr
"
This series addresses SELECT/INSERT JSON support issues, namely
handling null values properly and parsing decimals from strings.
It also comes with updated cql tests.

Tests: unit (release)
"

* 'json_fixes_3' of https://github.com/psarna/scylla:
  cql3: remove superfluous null conversions in to_json_string
  tests: update JSON cql tests
  cql3: enable parsing decimal JSON values from string
  cql3: add missing return for dead cells
  cql3: simplify parsing optional JSON values
  cql3: add handling null value in to_json
  cql3: provide to_json_string for optional bytes argument
2018-08-09 18:05:34 +01:00
Piotr Sarna
9ba218c161 cql3: remove superfluous null conversions in to_json_string
Some types checked whether the passed bytes argument was empty and, if
so, returned "null" as a JSON string. Now, with to_json_string(bytes_opt),
this is no longer needed. Also, some types returned "null" instead
of signaling a deserialization error.
2018-08-09 18:07:12 +02:00
Piotr Sarna
fc187fa31e tests: update JSON cql tests
Tests are updated to check for recently fixed issues, i.e.
 * proper handling of null values
 * parsing decimal values from string

Refs #3664
Refs #3666
Refs #3667
2018-08-09 18:07:12 +02:00
Piotr Sarna
957cc712b6 cql3: enable parsing decimal JSON values from string
In order to be Cassandra-compatible, decimal type should be parsable
from both numeric values and strings.

Fixes #3666
2018-08-09 18:07:12 +02:00
Piotr Sarna
f962b85fa3 cql3: add missing return for dead cells
Fixes #3664
2018-08-09 18:07:12 +02:00
Piotr Sarna
cdbeed4e3b cql3: simplify parsing optional JSON values
With new to_json_string implementation that accepts bytes_opt,
parsing optional values can be simplified to remove explicit
branching.
2018-08-09 18:07:12 +02:00
Piotr Sarna
e4396e17cb cql3: add handling null value in to_json
Previously the to_json function would fail when null was passed as a parameter.

Fixes #3667
2018-08-09 18:07:12 +02:00
Piotr Sarna
52052b53a8 cql3: provide to_json_string for optional bytes argument
In order to handle optional arguments in a neat way, a wrapper
for to_json_string is provided.
2018-08-09 18:07:07 +02:00
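The to_json_string(bytes_opt) wrapper mentioned above can be sketched like this. The types are simplified stand-ins (`bytes_opt` here is just an optional string, and the quoting is a placeholder for per-type dispatch); the point is only the branch that serializes a disengaged optional as JSON null instead of failing.

```cpp
#include <optional>
#include <string>

// Stand-in for Scylla's bytes_opt.
using bytes_opt = std::optional<std::string>;

std::string to_json_string_sketch(const bytes_opt& b) {
    if (!b) {
        return "null";            // null cell -> JSON null, no error
    }
    return "\"" + *b + "\"";      // real code dispatches on the CQL type
}
```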
Piotr Sarna
4a9014675f tests: add testing prepared statements with ALLOW FILTERING
ALLOW FILTERING support for prepared statements was buggy,
so a test case for prepared statements is added to the cql test suite.
2018-08-09 18:06:09 +02:00
Piotr Sarna
8c18aaa511 cql3: pass query options to restrictions filter
Query options may contain bound values needed for checking filtering
restrictions. Previously, empty query_options{} were used, which
caused prepared statements to fail.

Fixes #3677
2018-08-09 17:44:45 +02:00
Eliran Sinvani
3f2bb07599 cql3: Count unpaged select queries
If the counter goes up, this points to a possible reason for slowdowns
in queries (since it means that a potentially large amount of data will
be sent to the client at once).

Fixes #2478
Tests: cqlsh with PAGING OFF and ON and validating with a print.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <01253cee0b8c1110aaee3da41d1f434ca798b430.1533817568.git.eliransin@scylladb.com>
2018-08-09 13:53:44 +01:00
Tomasz Grabiec
024b3c9fd9 mutation_partition: Fix exception safety of row::apply_monotonically()
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.

The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.

Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.

Introduced in 99a3e3a.

Fixes #3678.

Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
2018-08-09 15:29:10 +03:00
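The bug class described above can be sketched as follows. The types here are illustrative stand-ins, not Scylla's real `row`/`cell_and_hash`: the buggy shape `row.push_back(cell_and_hash(std::move(value), h))` moves the source into a temporary first, so if `push_back` then throws during reallocation, the temporary is destroyed and the source data is lost. With `emplace_back`, the vector allocates first and only then constructs the element from the forwarded arguments, so a failed reallocation leaves the source untouched.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in type; the constructor taking the parts directly is what
// enables in-place construction via emplace_back.
struct cell_and_hash {
    std::string cell;
    long hash;
    cell_and_hash(std::string c, long h) : cell(std::move(c)), hash(h) {}
};

void apply_cell(std::vector<cell_and_hash>& row, std::string value, long h) {
    // In-place construction: no moved-from temporary exists before the
    // vector's allocation has succeeded.
    row.emplace_back(std::move(value), h);
}
```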
Tomasz Grabiec
fd543603dd tests: random_mutation_generator: Use collection_member::yes for collection cells
This caused an assert failure when collection cells were so large as to
require fragmentation. Currently collection cells are not fragmented,
and deserialization asserts that.

Message-Id: <1533817077-27583-1-git-send-email-tgrabiec@scylladb.com>
2018-08-09 15:27:20 +03:00
Vladimir Krivopalov
55d2fdee9a clustering_key_filter_ranges: Fix move assignment to avoid undefined behaviour.
Get rid of the new(this) trick that results in undefined behaviour
because the class contains a const reference member.

Use std::reference_wrapper instead to ease the transition.

Refs #3032.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <5642bf79659231627dd7f8693c17cb46f274bcda.1533765105.git.vladimir@scylladb.com>
2018-08-09 00:53:17 +01:00
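An illustrative sketch of the fix's shape (simplified types, not the real `clustering_key_filter_ranges`): a class holding a plain `const schema&` member cannot have a well-defined move assignment without the undefined `new (this)` reconstruction trick, whereas `std::reference_wrapper` is reseatable, so assignment becomes ordinary member-wise assignment.

```cpp
#include <functional>
#include <string>
#include <utility>

struct schema {};

class filter_ranges {
    std::reference_wrapper<const schema> _schema;  // was: const schema&
    std::string _ranges;
public:
    filter_ranges(const schema& s, std::string r)
        : _schema(s), _ranges(std::move(r)) {}
    filter_ranges& operator=(filter_ranges&& o) noexcept {
        _schema = o._schema;            // rebinds the wrapper; well defined
        _ranges = std::move(o._ranges);
        return *this;
    }
    const schema& get_schema() const { return _schema.get(); }
    const std::string& ranges() const { return _ranges; }
};
```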
Takuya ASADA
ad7bc313f7 dist/common/scripts: pass format variables to colorprint()
When we use str.format() to pass variables in the message, it always
causes an exception like "KeyError: 'red'", since the message contains color
variables that are not passed to str.format().
To avoid the error we need to pass all format variables to colorprint()
and run str.format() inside the function.

Fixes #3649

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803015216.14328-1-syuu@scylladb.com>
2018-08-08 18:37:50 +03:00
Avi Kivity
d6b0c4dda4 config: default murmur3_ignore_msb_bits to 12 even if not specified in scylla.yaml
When murmur3_ignore_msb_bits was introduced in 1.7, we set its default to zero
(to avoid resharding on upgrade) and set it to 12 in the scylla.yaml template
(to make sure we get the right value for new clusters).

Now, however, things have changed:
 - clusters installed before 1.7 are a small minority
 - they should have resharded long ago
 - resharding is much better these days
 - we have more migrations from Cassandra compared to old clusters

To allow clusters that migrated using their cassandra.yaml, and to clean up
the default scylla.yaml, make the default 12.

Users upgrading from pre-1.7 clusters will need to update their scylla.yaml,
or to reshard (which is a good idea anyway).

Fixes #3670.
Message-Id: <20180808063003.26046-1-avi@scylladb.com>
2018-08-08 13:46:06 +02:00
Asias He
d47d46e1a8 streaming: Use streaming_write_priority for the sstable writer
Use the streaming io priority; otherwise the writer uses the default io priority.

Message-Id: <e1836a9a93e7204d4bc9bba9c841d57c8b24aff8.1533715438.git.asias@scylladb.com>
2018-08-08 11:08:06 +03:00
Takuya ASADA
15825d8bf1 dist/common/scripts/scylla_setup: print message when EC2 instance is optimized for Scylla
Currently scylla_ec2_check exits silently when the EC2 instance is optimized
for Scylla, so the result of the check is unclear; we need to output a
message.

Note that this change affects the AMI login prompt too.

Fixes #3655

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808024256.9601-1-syuu@scylladb.com>
2018-08-08 10:17:52 +03:00
Takuya ASADA
652eb5ae0e dist/common/scripts/scylla_setup: fix typo on interactive setup
Scylls -> Scylla

Fixes #3656

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808002443.1374-1-syuu@scylladb.com>
2018-08-08 09:15:13 +03:00
Vladimir Krivopalov
7f77087caa tests: Add tests performing compaction on SSTables 3.x.
These tests check the correctness of resulting compacted SSTables based
on the files produced by compacting input files with Cassandra.

Note that output files are not identical to those generated by Cassandra
because Scylla compaction does not yet optimise delta-encoded values
using serialization header.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <3fa05ce72352292d1026ce80ac87552889d10d96.1533667535.git.vladimir@scylladb.com>
2018-08-08 08:50:41 +03:00
Rafi Einstein
c7f41c988f Add a counter to count large partition warnings in compaction
Fixes #3562

Tests: dtest(compaction_test.py)
Message-Id: <20180807190324.82014-1-rafie@scylladb.com>
2018-08-07 20:15:09 +01:00
Avi Kivity
c9caaa8e6e docker: adjust for script conversion to Python
Since our scripts were converted to Python, we can no longer
source them from a shell. Execute them directly instead. Also,
we now need to import configuration variables ourselves, since
scylla_prepare, being an independent process, won't do it for
us.

Fixes #3647
Message-Id: <20180802153017.11112-1-avi@scylladb.com>
2018-08-07 15:34:03 +01:00
Takuya ASADA
a300926495 dist/common/scripts/scylla_setup: use specified NIC ifname correctly
The interactive NIC selection prompt mistakenly always returns 'eth0' as the
selected NIC name; this needs to be fixed.

Fixes #3651

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803020724.15155-1-syuu@scylladb.com>
2018-08-06 20:59:19 +03:00
Amnon Heiman
80b1ef0f47 storage_service: Add nodes_status related metrics
This patch adds a metric for a node own operation mode, the
operation_mode metric represent the enum modes as gauge values according
to: UNKNOWN = 0, STARTING = 1, JOINING = 2, NORMAL = 3, LEAVING = 4, DECOMMISSIONED =
5, DRAINING = 6, DRAINED = 7, MOVING = 8

Fixes: #3482

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180806142706.23579-1-amnon@scylladb.com>
2018-08-06 18:19:56 +03:00
Tomasz Grabiec
88053b3bc9 tests: sstables: Replace sleep with accurate synchronzation
Message-Id: <1533545829-31109-1-git-send-email-tgrabiec@scylladb.com>
2018-08-06 10:09:39 +01:00
Avi Kivity
13b729bf71 Merge "tracing: store request and response sizes" from Vlad
"
Store sizes of the request and the response for each traced query.

In the example below I traced the cassandra-stress write workload with a default schema using the probabilistic tracing.

Here is an entry created for one of queries:

cassandra@cqlsh> SELECT parameters FROM system_traces.sessions where session_id=30c3a8ea-96bb-11e8-8a97-000000000000;

 parameters
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {'consistency_level': 'LOCAL_ONE', 'page_size': '5000', 'param[0]': 'f749eb03d6a995d8b3496075da8f20aa9228c5db12401e8a37000fa5baa13531...', 'param[1]': '845809b53a9aff7eef8f85308eaef79e03c696653ca23957f1ed5d539dc00463...', 'param[2]': 'd303585def93a5d40e41ceb12880ad3ede3d9f6308a1b1c5e42e911a191f1de1...', 'param[3]': 'be77c7da059d4b52687cd9b3eaa7d04cdfe7b5e38e84a8eea318299a01c7845f...', 'param[4]': '32faaaea1b3d73d9d628a4945b69a8531740348d49ee30c03f697dd2d63e8dee...', 'param[5]': '50503850374d34323330', 'query': 'UPDATE "standard1" SET "C0" = ?,"C1" = ?,"C2" = ?,"C3" = ?,"C4" = ? WHERE KEY=?', 'serial_consistency_level': 'SERIAL'}

(1 rows)
cassandra@cqlsh> SELECT request_size,response_size FROM system_traces.sessions where session_id=30c3a8ea-96bb-11e8-8a97-000000000000;

 request_size | response_size
--------------+---------------
          239 |             4

(1 rows)

Now let's try to read the same keyspace1.standard1 entry (based on the "key" value in "param[5]") from cqlsh and trace it using TRACING ON.

cassandra@cqlsh> TRACING ON
Now Tracing is enabled
cassandra@cqlsh> SELECT * from keyspace1.standard1 where key=0x50503850374d34323330;

 key                    | C0                                                                     | C1                                                                     | C2                                                                     | C3                                                                     |
C4
------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+-
-----------------------------------------------------------------------
 0x50503850374d34323330 | 0xf749eb03d6a995d8b3496075da8f20aa9228c5db12401e8a37000fa5baa135315430 | 0x845809b53a9aff7eef8f85308eaef79e03c696653ca23957f1ed5d539dc00463e10e | 0xd303585def93a5d40e41ceb12880ad3ede3d9f6308a1b1c5e42e911a191f1de12924 | 0xbe77c7da059d4b52687cd9b3eaa7d04cdfe7b5e38e84a8eea318299a01c7845fb8a2 |
0x32faaaea1b3d73d9d628a4945b69a8531740348d49ee30c03f697dd2d63e8dee5dde

(1 rows)

Tracing session: 639ca0a0-96bb-11e8-8a97-000000000000

 activity                                                                                                                                 | timestamp                  | source        | source_elapsed
------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------+----------------
                                                                                                                       Execute CQL3 query | 2018-08-02 21:20:20.906000 | 192.168.1.138 |              0
                                                                                                            Parsing a statement [shard 0] | 2018-08-02 21:20:20.906358 | 192.168.1.138 |             --
                                                                                                         Processing a statement [shard 0] | 2018-08-02 21:20:20.906405 | 192.168.1.138 |             47
 Creating read executor for token -5698461774438220979 with all: {192.168.1.138} targets: {192.168.1.138} repair decision: NONE [shard 0] | 2018-08-02 21:20:20.906445 | 192.168.1.138 |             87
                                                                                                    read_data: querying locally [shard 0] | 2018-08-02 21:20:20.906448 | 192.168.1.138 |             90
                                                           Start querying the token range that starts with -5698461774438220979 [shard 0] | 2018-08-02 21:20:20.906452 | 192.168.1.138 |             94
                                                                                                               Querying is done [shard 0] | 2018-08-02 21:20:20.906509 | 192.168.1.138 |            151
                                                                                           Done processing - preparing a result [shard 0] | 2018-08-02 21:20:20.906533 | 192.168.1.138 |            175
                                                                                                                         Request complete | 2018-08-02 21:20:20.906186 | 192.168.1.138 |            186

cassandra@cqlsh> TRACING OFF
Disabled Tracing.

cassandra@cqlsh> SELECT request_size,response_size FROM system_traces.sessions where session_id=639ca0a0-96bb-11e8-8a97-000000000000;

 request_size | response_size
--------------+---------------
           82 |           369

(1 rows)
"

* 'tracing_request_response_size-v2' of https://github.com/vladzcloudius/scylla:
  tracing: move all tracing related API functions to a cold path
  tracing: store a query response size
  tracing: store request size
2018-08-05 18:26:29 +03:00
Jesse Haber-Kucharsky
fce10f2c6e auth: Don't use unsupported hashing algorithms
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.

This is consistent with the documentation of the function in its man
page.

As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".

The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):

    Some implementations return `NULL` on failure, and others return an
    _invalid_ hashed passphrase, which will begin with a `*` and will
    not be the same as SALT.

Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.

With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.

Fixes #3637.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
2018-08-05 08:57:36 +03:00
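The detection logic described above can be sketched in a few lines. The function name is illustrative; the contract it encodes comes from the `crypt_r` info pages quoted in the message: treat both a null result and a result beginning with '*' (such as "*0") as a failed hash.

```cpp
// Returns true if crypt_r's result indicates an unsupported algorithm or
// other failure: either nullptr (older glibc) or an invalid hashed
// passphrase starting with '*' (e.g. "*0" on Fedora 28).
bool crypt_failed(const char* hashed) {
    return hashed == nullptr || hashed[0] == '*';
}
```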
Vlad Zolotarov
896c1822b5 tracing: move all tracing related API functions to a cold path
This patch completes what was started in a4282c2c6e

Make trace_state_ptr to be a wrapper class around lw_shared_ptr<trace_state> that
hints that bool(trace_state_ptr) is likely to return FALSE.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-03 12:32:54 -04:00
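The wrapper idea can be sketched as below. This is illustrative only, not Scylla's `trace_state_ptr` (which wraps `lw_shared_ptr`, not `std::shared_ptr`): the boolean conversion carries a branch-prediction hint that tracing is usually off, so compilers lay the tracing branches out of the hot path. `__builtin_expect` is the GCC/Clang hint; `[[unlikely]]` at branch sites is the portable C++20 alternative.

```cpp
#include <memory>
#include <utility>

struct trace_state {};

class trace_state_ptr {
    std::shared_ptr<trace_state> _p;
public:
    trace_state_ptr() = default;
    explicit trace_state_ptr(std::shared_ptr<trace_state> p) : _p(std::move(p)) {}
    explicit operator bool() const {
        // Most queries are not traced; tell the branch predictor so.
        return __builtin_expect(static_cast<bool>(_p), false);
    }
    trace_state* operator->() const { return _p.get(); }
};
```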
Vlad Zolotarov
6db90a2e63 tracing: store a query response size
Add a new "response_size" column to system_traces.sessions and store a size of an uncompressed response
for a traced query.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-03 12:29:36 -04:00
Vlad Zolotarov
05020921bb tracing: store request size
Add a new column "request_size" to system_traces.sessions and store
the uncompressed request frame data size.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-03 12:29:36 -04:00
Avi Kivity
3b42fcfeb2 Merge "Fix exception safety in imr::utils::object" from Paweł
"

There is an exception safety problem in imr::utils::object. If multiple
memory allocations are needed and one of them fails, the main object is
going to be freed (as expected). However, at this stage it is not
constructed yet, so when LSA asks its migrator for the size it may get
a meaningless value. The solution is to remember the size until the
object is fully created and use sized deallocation in case of failures.

Fixes #3618.

Tests: unit(release, debug/imr_test)
"
2018-08-02 12:10:24 +03:00
Takuya ASADA
1bb463f7e5 dist/debian: install *.service on correct subpackage
We were mistakenly installing scylla-housekeeping-*.service into the
scylla-conf package; all *.service files should explicitly specify the
subpackage name.

Fixes #3642

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180801233042.307-1-syuu@scylladb.com>
2018-08-02 11:39:52 +03:00
Paweł Dziepak
fd44d13145 tests/imr: add test for exception safety in imr::utils::object::make() 2018-08-01 16:50:58 +01:00
Paweł Dziepak
7ec906e657 imr: detect lsa migrator mismatch
Each IMR type needs its own LSA migrator. It is possible that a user will
provide a migrator for a different type than the one whose instance is
being created. This patch adds compile-time detection of that bug.
2018-08-01 16:50:58 +01:00
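A compile-time check in the spirit of this commit might look as follows. The names (`lsa_migrator`, `managed_type`, `check_migrator`) are hypothetical, not the IMR code's actual interface: the migrator records the type it was instantiated for, and a `static_assert` rejects a mismatched pairing at compile time.

```cpp
#include <type_traits>

// Each migrator records the IMR type it was instantiated for.
template <typename T>
struct lsa_migrator {
    using managed_type = T;
};

// Rejects, at compile time, a migrator instantiated for a different type
// than the object being created.
template <typename T, typename Migrator>
bool check_migrator(const Migrator&) {
    static_assert(std::is_same<T, typename Migrator::managed_type>::value,
                  "LSA migrator provided for a different IMR type");
    return true;
}
```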
Benny Halevy
6b179b0183 HACKING.md: update ./install-dependencies.sh filename
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20180801150813.25408-1-bhalevy@scylladb.com>
2018-08-01 18:09:29 +03:00
Paweł Dziepak
6fbf2d72e9 imr::utils::object_context: fix context_for for backpointer
Each member of a structure may require different deserialisation
context. They are provided by context_for<Tag>() method of the context
used to deserialise the structure itself.

imr::utils::object needs to add a backpointer to the structure it manages
so that it can be used in the LSA memory. This is done by creating a
structure that has two members: the backpointer and the actual structure
that imr::utils::object is to manage. imr::utils::object_context creates
an appropriate deserialisation context for it.

context_for() is called for each member of a structure. The object_context
implementation of context_for() always created a deserialisation context
for the underlying structure regardless of which member it was, so it was
also done for the backpointer. This is wrong, since the context may read
the object on its creation.

The fix is to use no_context_t for the backpointer.
2018-08-01 15:17:25 +01:00
Paweł Dziepak
61749019cb imr::utils::object: fix exception safety if allocation fails
imr::utils::object::make() handles creation of IMR objects. They are
created in three phases:
  1. The size of the object and all additional needed memory allocations
     is determined
  2. All needed buffers are allocated
  3. Data is written to the allocated space

When IMR objects are deallocated LSA asks their migrator for the size.
Migrator may read some parts of the object to figure out its size. This
is a problem if there is allocation failure in make() at point 2.
If one of the required allocations fails, the buffers that were already
acquired need to be freed. However, since the object hasn't been fully
created yet, the migrator won't return a valid value.

The solution for this is to remember object size until all allocations
are completed. This way the LSA won't need to ask migrators for it in
case of failure. imr::alloc::object_allocator already does that but
imr::utils::object doesn't. This patch fixes that.
2018-08-01 15:17:13 +01:00
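The shape of the fix can be sketched as below. This is entirely illustrative, not the IMR code itself: the allocation size stays in a local variable until construction finishes, so the failure path can free the buffer with sized deallocation instead of asking a migrator to measure a half-built object (`fail` stands in for a second allocation failing mid-construction).

```cpp
#include <cstddef>
#include <new>

void* make_imr_object_sketch(std::size_t size, bool fail) {
    void* p = ::operator new(size);
    try {
        if (fail) {
            throw std::bad_alloc();   // a later allocation failed
        }
        return p;                     // construction completed; caller owns p
    } catch (...) {
        // The size is still known here: no need to inspect the dead object.
        ::operator delete(p, size);   // sized deallocation
        throw;
    }
}
```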
Piotr Sarna
156888fb44 docs: fix system.large_partitions doc entry
For some reason the doc entry for large_partitions was outdated.
It contained incorrect ORDERING information and a wrong usage example,
since large_partitions' schema changed multiple times during
the reviewing process.

Message-Id: <1910f270419536ebccffde163ec1bfc67d273306.1533128957.git.sarna@scylladb.com>
2018-08-01 16:12:39 +03:00
Asias He
95849371aa range_streamer: Remove unordered_multimap usage
We need the mapping from dht::token_range to
std::vector<inet_address> and from inet_address to dht::token_range_vector in
various places. Currently, we use std::unordered_multimap and convert it to
std::unordered_map. It is better to use std::unordered_map in the first
place. The changes are as follows:

- Change from

  std::unordered_multimap<dht::token_range, inet_address>

to

  std::unordered_map<dht::token_range, std::vector<inet_address>>

- Change from

   std::unordered_multimap<inet_address, dht::token_range>

to

   std::unordered_map<inet_address, dht::token_range_vector>

Message-Id: <b8ecc41775e46ec064db3ee07510c404583390aa.1533106019.git.asias@scylladb.com>
2018-08-01 13:01:41 +03:00
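The container change above can be sketched as follows. `token_range` and `inet_address` are stand-ins for the dht types; the point is grouping values per key at insertion time instead of filling an `unordered_multimap` and converting afterwards.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified stand-ins for dht::token_range and gms::inet_address.
using token_range = std::string;
using inet_address = std::string;

std::unordered_map<token_range, std::vector<inet_address>>
group_endpoints(const std::vector<std::pair<token_range, inet_address>>& pairs) {
    std::unordered_map<token_range, std::vector<inet_address>> m;
    for (const auto& p : pairs) {
        m[p.first].push_back(p.second);  // grouped per key as we insert
    }
    return m;
}
```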
Gleb Natapov
44a6afad8c cache_hitrate_calculator: fix race when new table is added during calculations
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.

Fixes #3636

Message-Id: <20180801093147.GD23569@scylladb.com>
2018-08-01 12:45:03 +03:00
Avi Kivity
620e950fc8 Merge "No infinite time-outs for internal distributed queries" from Jesse
"
This series replaces infinite time-outs in internal distributed
(non-local) CQL queries with finite ones.

The implementation of tracing, which also performs internal queries,
already has finite time-outs, so it is unchanged.

Fixes #3603.
"

* 'jhk/finite_time_outs/v2' of https://github.com/hakuch/scylla:
  Use finite time-outs for internal auth. queries
  Use finite query time-outs for `system_distributed`
2018-08-01 11:23:42 +03:00
Asias He
4a0b561376 storage_service: Get rid of moving operation
The moving operation changes a node's token to a new token and is
supported only when a node has one token. The legacy moving operation was
useful in the early days, before vnodes were introduced, when a node had
only one token. I don't think it is useful anymore.

In the future, we might support adjusting the number of vnodes to rebalance
the token ranges each node owns.

Removing it simplifies the cluster operation logic and code.

Fixes #3475

Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
2018-08-01 11:18:17 +03:00
Asias He
02befb6474 gossip: Log seeds seen
This is useful for debugging bootstrap issues, especially in large
clusters.

Also, do not use `_seeds` as the set_seeds function parameter, since
there is a class member called _seeds.

Refs #3417
Message-Id: <15e6bdf06376949ced1bdb845f810da09266783d.1532474820.git.asias@scylladb.com>
2018-08-01 10:57:56 +03:00
Takuya ASADA
2cd99d800b dist/common/scripts/scylla_ntp_setup: fix typo
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1533070539-2147-1-git-send-email-syuu@scylladb.com>
2018-08-01 10:31:07 +03:00
Avi Kivity
2c9b886b6d logalloc: reindent
No functional changes.
Message-Id: <20180731125116.32009-1-avi@scylladb.com>
2018-08-01 00:35:54 +01:00
Jesse Haber-Kucharsky
e664f9b0c6 Use finite time-outs for internal auth. queries 2018-07-31 11:38:16 -04:00
Jesse Haber-Kucharsky
ca44f4de3c Use finite query time-outs for system_distributed 2018-07-31 11:38:15 -04:00
Paweł Dziepak
b20a15bdda Merge "Prevent scheduling leaks when out of memtable space" from Avi
"
When we are out of memtable space (real of virtual), lsa will defer running
our mutation application and run it later when memory is in fact available.
However, it will run it in the main group, giving the write more shares than it
would otherwise get.

This patchset fixes the problem by running those deferred mutation applications
in the correct scheduling group.

Fixes #3638
"

* tag '3638/v2' of https://github.com/avikivity/scylla:
  database: tag dirty memory managers with scheduling groups
  logalloc: run releaser() in user-provided scheduling group
2018-07-31 11:55:19 +01:00
Avi Kivity
2d311c26b3 database: tag dirty memory managers with scheduling groups
dirty memory managers run code on behalf of their callers
in a background fiber, so provide that background fiber with
the scheduling group appropriate to their caller.

 - system: main (we want to let system writes through quickly)
 - dirty: statement (normal user writes)
 - streaming: streaming (streaming writes)
2018-07-31 13:18:21 +03:00
Paweł Dziepak
98217f0d66 Update seastar submodule
* seastar 6b97e00...d40faff (10):
  > tutorial: update build as needed for newer pandoc
  > core: fix __libc_free return type signature
  > future-utils: when_all: avoid calling member function on an uninitialized data member
  > future-util: reduce continuations in when_all (variadic version)
  > future-utils: remove allocation in when_all() if all futures are available
  > future: reduce allocations in when_all()
  > future: fill missing futurize::from_tuple() functions
  > future: expose more types in continuation_base
  > log: predict logger::is_enabled() as false
  > README: add Resources section with information about the mailing list etc.
2018-07-31 10:12:52 +01:00
Avi Kivity
0fc54aab98 logalloc: run releaser() in user-provided scheduling group
Let the user specify which scheduling group should run the
releaser, since it is running functions on the user's behalf.

Perhaps a cleaner interface is to require the user to call
a long-running function for the releaser, and so we'd just
inherit its scheduling group, but that's a much bigger change.
2018-07-31 11:57:58 +03:00
Avi Kivity
f258df099a Update ami submodule
* dist/ami/files/scylla-ami d53834f...c7e5a70 (1):
  > ds2_configure.py: uncomment 'cluster_name' when it's commented out
2018-07-31 09:34:33 +03:00
Avi Kivity
e7ae4beef0 main: run prometheus and API servers under streaming group
Both the Prometheus and the API servers are used for maintenance
operations, similarly to streaming. Run them under the streaming
scheduling group to prevent them from impacting normal operations,
and rename the streaming scheduling group to reflect the more
generic role.

This helps to prevent spikes from Prometheus or API requests from
interfering with the normal workload. Using an existing group is
preferable to creating a new group because in the worst case, all
the non-main-workload groups compete with the main workload.
Consolidating them allows us to give them significant shares in
total without increasing competition in the worst case.

The group's label is unchanged to preserve compatibility with
dashboards.

A nice side effect is that repair, which is initiated by API calls,
gets placed into the maintenance group naturally. Compaction tasks
which are run by compaction manager are not changed.
Message-Id: <20180714160723.23655-1-avi@scylladb.com>
2018-07-30 15:07:33 +01:00
Avi Kivity
a4282c2c6e tracing: move tracing code to cold path
Most queries run without tracing (and those that run with tracing
are not sensitive to a few cycles), so mark the tracing paths as
cold.
Message-Id: <20180723133000.30482-1-avi@scylladb.com>
2018-07-30 15:05:57 +01:00
Rafi Einstein
123f2c2a1c Add a counter for reverse queries
Fixes #3492

Tests: dtest(cql_additional_tests.py)
Message-Id: <20180729202615.22459-1-rafie@scylladb.com>
2018-07-30 12:34:43 +03:00
Takuya ASADA
032b26deeb dist/common/scripts/scylla_ntp_setup: fix typo
Comment on Python is "#" not "//".

Fixes #3629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180730091022.4512-1-syuu@scylladb.com>
2018-07-30 12:30:53 +03:00
Avi Kivity
04d88e8ff7 scripts: add a script to compute optimal number of compile jobs
This will allow continuous integration to use the optimal number
of compiler jobs, without having to resort to complex calculations
from its scripting environment.

Message-Id: <20180722172050.13148-1-avi@scylladb.com>
2018-07-30 10:15:11 +03:00
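The log doesn't show the script's actual formula, so the following is only a guess at the common heuristic such scripts use: cap the number of parallel compile jobs by both the CPU count and available memory, assuming roughly 2 GiB per C++ compile job. The function name and the 2 GiB figure are assumptions.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical heuristic: one job per CPU, but no more jobs than the
// memory can sustain at ~2 GiB each, and always at least one job.
unsigned compile_jobs(unsigned cpus, std::size_t mem_gib) {
    unsigned by_mem = static_cast<unsigned>(mem_gib / 2);
    return std::max(1u, std::min(cpus, by_mem));
}
```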
Avi Kivity
a4c9330bfc Merge "Optimise paged queries" from Paweł
"
This series adds some optimisations to the paging logic that attempt to
close the performance gap between paged and non-paged queries. The
former are more complex and so are always going to be slower, but the
performance loss was unacceptably large.

Fixes #3619.

Performance with paging:
        ./perf_paging_before  ./perf_paging_after   diff
 read              271246.13            312815.49  15.3%

Without paging:
        ./perf_nopaging_before  ./perf_nopaging_after   diff
 read                343732.17              342575.77  -0.3%

Tests: unit(release), dtests(paging_test.py, paging_additional_test.py)
"

* tag 'optimise-paging/v1' of https://github.com/pdziepak/scylla:
  cql3: select statement: don't copy metadata if not needed
  cql3: query_options: make simple getter inlineable
  cql3: metadata: avoid copying column information
  query_pager: avoid visiting result_view if not needed
  query::result_view: add get_last_partition_and_clustering_key()
  query::result_reader: fix const correctness
  tests/uuid: add more tests including make_rand_uuid()
  utils: uuid: don't use std::random_device()
2018-07-26 19:24:03 +03:00
Nadav Har'El
25bd139508 cross-tree: clean up use of std::random_device()
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).

In many places in the code, we first created a random_device variable, and then
using it created a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object instead of the
random_engine object, because both have the same API; this hurts performance.

This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for uuid.cc was sent previously by Pawel and is
not included in this patch, and the fix for gossiper.{cc,hh} is included here.

To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:

   std::default_random_engine _engine{std::random_device{}()};

Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.
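The idiom can be shown as a minimal self-contained sketch (the roll_die helper is hypothetical, added here only for illustration):

```cpp
#include <random>

// Seed a fast PRNG once from the slow entropy source; the std::random_device
// is an unnamed temporary, so it cannot be accidentally reused afterwards.
// (roll_die is a hypothetical helper, not part of the patch.)
inline int roll_die() {
    static thread_local std::default_random_engine engine{std::random_device{}()};
    std::uniform_int_distribution<int> dist(1, 6);
    return dist(engine);
}
```

Every call after the first reuses the already-seeded engine, so /dev/urandom is touched only once per thread.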

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
2018-07-26 16:54:58 +01:00
Takuya ASADA
8e4d1350c9 dist/common/scripts/scylla_ntp_setup: ignore ntpdate error
Even if ntpdate fails to adjust the clock, ntpd may be able to recover it
later, so ignore the ntpdate error and keep the script running.

Fixes #3629

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180726080206.28891-1-syuu@scylladb.com>
2018-07-26 14:44:53 +03:00
Paweł Dziepak
3e32245bb8 cql3: select statement: don't copy metadata if not needed 2018-07-26 12:37:20 +01:00
Paweł Dziepak
15775c958a cql3: query_options: make simple getter inlineable 2018-07-26 12:37:06 +01:00
Paweł Dziepak
ef0c999742 cql3: metadata: avoid copying column information
The column-related metadata is shared by all requests done with the same
prepared query. However, the metadata class also contains some additional
flags and paging state which may differ. This patch allows sharing
column information among multiple instances of the metadata class.
2018-07-26 12:17:04 +01:00
Paweł Dziepak
757d9e3b5d query_pager: avoid visiting result_view if not needed
query::result_visitor provides get_last_partition_and_clustering_key()
which allows getting those without iterating through the whole result.
Moreover, the row count may be precomputed in the result; if it isn't,
query::result_view::count_partitions_and_rows() can compute it.
2018-07-26 12:14:48 +01:00
Paweł Dziepak
9b6dc52255 query::result_view: add get_last_partition_and_clustering_key()
Paging needs to get last partition and clustering key (if the latter
exists). Previously, this was done by result_view visitor but that is
suboptimal. Let's add a direct getter for those.
2018-07-26 12:12:08 +01:00
Paweł Dziepak
b5ed4c8806 query::result_reader: fix const correctness 2018-07-26 12:11:27 +01:00
Paweł Dziepak
495df277f9 tests/uuid: add more tests including make_rand_uuid() 2018-07-26 12:03:37 +01:00
Paweł Dziepak
b485deb124 utils: uuid: don't use std::random_device()
std::random_device() is extremely slow. This patch modifies
make_rand_uuid() so that it requires only two invocations of the PRNG.
2018-07-26 12:02:32 +01:00
Avi Kivity
b167647bf6 dist: redhat: fix up bad file ownership of rpms/srpms
mock outputs files owned by root. This causes attempts
by scripts that want to junk the working directory (typically
continuous integration) to fail on permission errors.

Fixup those permissions after the fact.
Message-Id: <20180719163553.5186-1-avi@scylladb.com>
2018-07-26 08:20:42 +03:00
Avi Kivity
bea1f715dc storage_proxy: count cross-shard operations
Count operations which were started on one shard and
were performed on another, due to non-shard-aware driver
and/or RPC.
Message-Id: <20180723155118.8545-1-avi@scylladb.com>
2018-07-25 16:21:04 +01:00
Avi Kivity
d6ef74fe36 Merge "Fix JSON string quoting" from Piotr
"

This mini-series covers a regression caused by newest versions
of jsoncpp library, which changed the way of quoting UTF-8 strings.

Tests: unit (release)
"

* 'add_json_quoting_3' of https://github.com/psarna/scylla:
  tests: add JSON unit test
  types: use value_to_quoted_string in JSON quoting
  json: add value_to_quoted_string helper function

Ref #3622.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
2018-07-25 17:49:55 +03:00
Piotr Sarna
b367cff05d tests: add JSON unit test
Since value_to_quoted_string now has an internal implementation,
a unit test is provided to check if strings are quoted
and escaped properly.
2018-07-25 13:16:06 +02:00
Piotr Sarna
d307b5712c types: use value_to_quoted_string in JSON quoting
In order to avoid regressions caused by external libraries,
our own value_to_quoted_string implementation is used.

Fixes #3622
2018-07-25 13:16:06 +02:00
Piotr Sarna
783762a958 json: add value_to_quoted_string helper function
After commit open-source-parsers/jsoncpp@42a161f, jsoncpp's version
of valueToQuotedString no longer fits our needs, because too many
UTF-8 characters are unnecessarily escaped. To remedy that,
this commit provides our own string quoting implementation.
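A rough sketch of the approach described (the real value_to_quoted_string differs in details): escape quotes, backslashes, and control characters, but pass multi-byte UTF-8 sequences through untouched instead of \u-escaping them.

```cpp
#include <cstdio>
#include <string>

// Sketch only: escapes the JSON-mandated characters and leaves bytes >= 0x80
// (UTF-8 continuation/lead bytes) as-is, so non-ASCII text is not escaped.
inline std::string to_quoted_string(const std::string& in) {
    std::string out = "\"";
    for (unsigned char c : in) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n";  break;
        case '\t': out += "\\t";  break;
        default:
            if (c < 0x20) {  // remaining control characters
                char buf[8];
                std::snprintf(buf, sizeof(buf), "\\u%04x", c);
                out += buf;
            } else {
                out += static_cast<char>(c);  // includes UTF-8 bytes
            }
        }
    }
    out += "\"";
    return out;
}
```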

Reported-by: Nadav Har'El <nyh@scylladb.com>

Refs #3622
2018-07-25 13:16:00 +02:00
Piotr Sarna
f66aace685 cql3: fix INSERT JSON grammar
Previously, the CQL grammar wrongly required INSERT JSON queries
to provide a list of columns, even though they are already
present in the JSON itself.
Unfortunately, tests were written with this false assumption as well,
so they are updated too.
Message-Id: <33b496cba523f0f27b6cbf5539a90b6feb20269e.1532514111.git.sarna@scylladb.com>
2018-07-25 11:36:59 +01:00
Avi Kivity
b443a9b930 compaction: demote compaction start/end messages to DEBUG level
Compactions start and end all the time, especially with many shards,
and don't contribute much to understanding what is going on these
days. Compaction throughput is available through the metrics and
other information is available via the compaction history table.

Demote compaction start and end messages to DEBUG level to keep
the log clean. Cleaning and resharding compactions are kept as
INFO, at least for now, since they are manual operations and
therefore rarer.
Message-Id: <20180724132859.14109-1-avi@scylladb.com>
2018-07-25 09:53:39 +01:00
Takuya ASADA
58f094e06d dist/debian: fix ImportError on pystache
It seems pystache is not pulled in as a dependency, so we need to install it
in build_deb.sh.

Fixes #3627

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180724164852.16094-1-syuu@scylladb.com>
2018-07-25 07:42:19 +03:00
Avi Kivity
e2ad45c3db Merge "Add clustering prefix logic to indexes and filtering" from Piotr
"
This series follows up ALLOW FILTERING support series and depends on
this one: https://groups.google.com/d/msg/scylladb-dev/Qxt3_MP03jI/5ZhRTJ3gBwAJ

The following optimizations regarding clustering key prefix and filtering are
applied:
 * if clustering key restrictions require filtering, but they still
   contain any part of the prefix, this prefix can be used to narrow
   down the query by using it in computing clustering bounds
 * if an indexed query has partition key restrictions and any clustering
   key restrictions that form a prefix, then from now on this prefix
   will be used to narrow down the index query

"

Ref #3611.

* 'use_prefix_with_filtering_and_si_4' of https://github.com/psarna/scylla:
  tests: add prefix cases to indexed filtered queries tests
  cql3: use ck prefix in filtered queries
  cql3: use clustering key prefix in index queries
  cql3: add conversion to ck longest prefix restrictions
  cql3: add prefix_size method to ck restrictions
2018-07-23 15:28:50 +03:00
Piotr Sarna
517a5b66ba tests: add prefix cases to indexed filtered queries tests
More cases related to querying clustering key prefix in an indexed
query are added to secondary index test suite.
2018-07-23 14:10:52 +02:00
Piotr Sarna
8523c24576 cql3: use ck prefix in filtered queries
If a filtering query has restrictions that include any clustering
prefix, the longest prefix will be used to narrow down the query.

Fixes #3611
2018-07-23 14:10:52 +02:00
Piotr Sarna
6cc8ccc771 cql3: use clustering key prefix in index queries
If an indexed query has both partition and clustering key restrictions,
and at least some of these restrictions form a prefix, this prefix
is used in the index query to narrow down the number of rows read.

Refs #3611
2018-07-23 14:10:52 +02:00
Piotr Sarna
ab74f75727 cql3: add conversion to ck longest prefix restrictions
For optimization purposes it's sometimes useful to extract
the longest prefix of clustering key restrictions in order
to narrow down queries.
2018-07-23 14:10:52 +02:00
Piotr Sarna
2e4c493870 cql3: add prefix_size method to ck restrictions
Clustering key restrictions are usually set for at least part
of the clustering key prefix. A method of extracting the longest
prefix's size is added.
2018-07-23 14:10:52 +02:00
Vladimir Krivopalov
ec7f853f49 sstables: Do not pass liveness_info to consume_row_end().
The liveness_info is unconditionally added to the _in_progress_row as of
commit cbfc741d70, so there is no need to pass it to consume_row_end() and
add it conditionally.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <7cd3e599817cbd4b857c3295153602cd2b9a6ef1.1532311852.git.vladimir@scylladb.com>
2018-07-23 13:10:36 +03:00
Avi Kivity
bb79eccf55 tests: sstable_mutation_test: hack around leak during sstable close
sstable close is an asynchronous operation launched in the background,
so we can't wait for it. If the test ends before all operations are
complete, the background operations are detected as leaks.

We need either a proper close(), or maybe a sstables::quiesce() that
waits until there are no sstables alive on the shard, but until then,
a hack.
2018-07-23 12:40:46 +03:00
Avi Kivity
af6ce47082 Merge "Support filtering and fast-forwarding with SSTables 3.x" from Piotr and Vladimir
"
This patchset authored by Piotr fixes ck filtering and fast forwarding in SSTables 3.x.
For now only clustering rows are supported and range tombstones will come next.

Test: unit {release}
"

* 'projects/sstables-30/filtering/v5' of https://github.com/argenet/scylla:
  sstables: Minor clean-up and renaming to clustering_ranges_walker.
  sstables: Add test for filtering and forwarding
  sstables: Fix schema for static row tests
  sstables: Fix ck filtering and fast forwarding
  sstables: Introduce mutation_fragment_filter
2018-07-22 21:11:51 +03:00
Avi Kivity
761931659a Merge "Do not linearise incoming CQL3 requests" from Paweł
"
This series changes the native CQL3 protocol layer so that it works with
fragmented buffers instead of a single temporary_buffer per request.
The main part is fragmented_temporary_buffer which represents a
fragmented buffer consisting of multiple temporary_buffers. It provides
helpers for reading a fragmented buffer from an input_stream and interpreting
the data in the fragmented buffer, as well as views that satisfy the
FragmentRange concept.

There are still situations where a fragmented buffer is linearised. That
includes decompressing client requests (this uses reusable buffers in a
similar way to the code that sends compressed responses), CQL statement
restrictions and values that are hard-coded in prepared statements
(hopefully, the values in those cases will be small), value validation
in some cases (blobs are not validated, irrelevant for many fixed-size
small types, but may be a problem for large text cells) as well as
operations on collections.

Tests: unit(release), dtests(cql_prepared_test.py, cql_tests.py, cql_additional_tests.py)
"

* tag 'fragmented-cql3-receive/v1' of https://github.com/pdziepak/scylla: (23 commits)
  types: bytes_view: override fragmented validate()
  cql3: value_view: switch to fragmented_temporary_buffer::view
  types: add validate that accepts fragmented_temporary_buffer::view
  cql3 query_options: add linearize()
  cql3: query_options: use bytes_ostream for temporaries
  cql3: operation: make make_cell accept fragmented_temporary_buffer::view
  atomic_cell: accept fragmented_temporary_buffer::view values
  cql3: avoid ambiguity in a call to update_parameters::make_cell()
  transport: switch to fragmented_temporary_buffer
  transport: extract compression buffers from response class
  tests/reusable_buffer: test fragmented_temporary_buffer support
  utils: reusable_buffer: support fragmented_temporary_buffer
  tests: add test for fragmented_temporary_buffer
  util fragment_range: add general linearisation functions
  utils: add fragmented_temporary_buffer
  tests: add basic test for transport requests and responses
  tests/random-utils: print seed
  tests/random-utils: generate sstrings
  cql3: add value_view printer and equality comparison
  transport: move response outside of cql_server class
  ...
2018-07-22 19:40:37 +03:00
Avi Kivity
30cddd4531 Merge "Support reading promoted index from SSTables 3.x" from Vladimir and Piotr
"
This patchset adds support for reading Index.db files written in
SSTables 3.x ('mc') format.

Note that the offsets map introduced in SSTables 3.x is neither used nor
read yet. It is located last in promoted index and so current parsers
just ignore it for the time being.

Later it should be used to perform binary search of a desired promoted
index block in large partition, thus reducing the complexity from linear
to logarithmic.

Tests: unit {release}
"

* 'projects/sstables-30/index_reader/v5' of https://github.com/argenet/scylla:
  sstables: Add getter for end_open_marker to index_reader.
  tests: Add test reading index for a partition comprised of RT markers of boundary types.
  tests: Add test for reading index of a partition comprised of only range tombstones.
  tests: Use std::adjacent_find in index_reader_assertions::has_monotonic_positions()
  tests: Read rows only index
  sstables: Do not seek through the promoted index for static row positions.
  sstables: Read promoted index stored in SSTables 3.x ('mc') format.
  sstables: Make promoted_index_block support clustering keys for both ka/la and mc formats.
  utils: Add overloaded_functor helper.
  position_in_partition: Add a constructor from range_tag_t{}, bound_kind and clustering_key_prefix.
  sstables: Support reading signed vints in continuous_data_consumer.
  sstables: Factor out the code building a vector of fixed clustering values lengths.
  sstables: Remove unused includes from index_entry.hh
  tests: Add test for reading SSTables 3.x index file with empty promoted index.
  tests: Rename sstable_assertions.hh -> tests/index_reader_assertions.hh
  sstables: Support parsing index entries from SSTables 3.x format.
  sstables: move bound_kind_m to header
2018-07-22 16:15:41 +03:00
Vladimir Krivopalov
df1a151f75 sstables: Minor clean-up and renaming to clustering_ranges_walker.
- Renamed _current to _current_range to better reflect its nature as
  there are other similarly named members (_current_start and
  _current_end).

- Don't use a temporary variable for incrementing the change counter.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 16:34:37 -07:00
Piotr Jastrzebski
01611f2083 sstables: Add test for filtering and forwarding
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-20 16:34:37 -07:00
Piotr Jastrzebski
3466dc2368 sstables: Fix schema for static row tests 2018-07-20 16:34:37 -07:00
Piotr Jastrzebski
abf3fc1b98 sstables: Fix ck filtering and fast forwarding
Both were broken before this change.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 16:34:37 -07:00
Piotr Jastrzebski
564bcfa4d0 sstables: Introduce mutation_fragment_filter
This class encapsulates the logic related to
clustering key filtering and fast forwarding.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 16:19:07 -07:00
Vladimir Krivopalov
4d3467d793 sstables: Add getter for end_open_marker to index_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
c7285abc9e tests: Add test reading index for a partition comprised of RT markers of boundary types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
91f96d7d2b tests: Add test for reading index of a partition comprised of only range tombstones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
fc051954c2 tests: Use std::adjacent_find in index_reader_assertions::has_monotonic_positions()
Not only is this easier to read and understand, it also doesn't
force the promoted_index_block class to support copying which is
heavyweight and otherwise not needed.
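The pattern can be sketched like this (position type simplified to int; the real assertion compares promoted-index positions):

```cpp
#include <algorithm>
#include <vector>

// Sketch of the std::adjacent_find pattern: a sequence is monotonically
// non-decreasing iff no adjacent pair is out of order. The predicate works
// on the elements in place, so nothing needs to be copied.
inline bool has_monotonic_positions(const std::vector<int>& positions) {
    return std::adjacent_find(positions.begin(), positions.end(),
                              [](int a, int b) { return b < a; })
           == positions.end();
}
```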

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
d4e0fa96e3 tests: Read rows only index
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
5561c713d9 sstables: Do not seek through the promoted index for static row positions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
917528c427 sstables: Read promoted index stored in SSTables 3.x ('mc') format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
86d14f8166 sstables: Make promoted_index_block support clustering keys for both ka/la and mc formats.
This is a pre-requisite for parsing promoted index blocks written in
SSTables 'mc' format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:51:13 -07:00
Vladimir Krivopalov
79c2f0095c utils: Add overloaded_functor helper.
The overloaded_functor class template can be used to encompass multiple
lambdas accepting different types into a single callable object that can
be used with any of those types.

One application is visitors for std::variant where different handling is
required for different types.
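A minimal C++17 sketch of the technique (the actual helper in utils may differ in details):

```cpp
#include <string>
#include <variant>

// Inherit from all the lambdas and pull in each operator(); the deduction
// guide lets us write overloaded_functor{...} without spelling out types.
template <typename... Fs>
struct overloaded_functor : Fs... {
    using Fs::operator()...;
};
template <typename... Fs>
overloaded_functor(Fs...) -> overloaded_functor<Fs...>;

// Example use as a std::variant visitor (describe is illustrative only).
inline std::string describe(const std::variant<int, std::string>& v) {
    return std::visit(overloaded_functor{
        [](int)                { return std::string("int"); },
        [](const std::string&) { return std::string("string"); },
    }, v);
}
```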

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
593d8faf7d position_in_partition: Add a constructor from range_tag_t{}, bound_kind and clustering_key_prefix.
This facilitates position_in_partition creation when parsing range tombstones bounds from SSTables files.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
997ebaaa14 sstables: Support reading signed vints in continuous_data_consumer.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
540dfcc9bf sstables: Factor out the code building a vector of fixed clustering values lengths.
This code will be re-used in promoted_index_blocks_parser to parse
clustering key prefixes from SSTables 3.x format.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
741d5f3b5d sstables: Remove unused includes from index_entry.hh
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
b29b948872 tests: Add test for reading SSTables 3.x index file with empty promoted index.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
054eb2df66 tests: Rename sstable_assertions.hh -> tests/index_reader_assertions.hh
The previous name of the file was also confusing, as we have several
sstable_assertions classes throughout tests, but this header only
contains a class for index reader assertions.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Vladimir Krivopalov
f50ffa267f sstables: Support parsing index entries from SSTables 3.x format.
With this patch, index_reader is capable of reading index_entries from
both 'ka'/'la' and 'mc' formats.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-07-20 13:50:17 -07:00
Piotr Jastrzebski
d0f8c71e28 sstables: move bound_kind_m to header
and add helper methods.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-20 13:50:17 -07:00
Duarte Nunes
6bd087facb Merge 'Make indexed queries with pk restrictions non-filtering' from Piotr
"
Queries that use secondary index and have a full partition key restriction
or full primary key restriction should not require filtering - it's
sufficient to add these restrictions to the index query.
This also adds secondary index tests to cover this case.

Tests: unit (release)
"

* 'si_and_pk_restrictions_2' of https://github.com/psarna/scylla:
  tests: add index + partition key test
  cql3: make index+primary key restrictions filtering-independent
  cql3: use primary key restrictions in filtering index queries
  cql3: add is_all_eq to primary key restrictions
  cql3: add explicit conversion between key restrictions
  cql3: add apply_to() method to single column restriction
  cql3: make primary key restrictions' values unambiguous
2018-07-19 16:54:43 +01:00
Tomasz Grabiec
d5534d6a77 Merge "Improve categorization of messaging verbs into connections" from Avi
Now that verb categorizations also affect scheduling, getting them
correct is more important. The first three patches in this series
improve the infrastructure a little, and the fourth fixes some
categorization errors wrt. repair/streaming verbs.

* https://github.com/avikivity/scylla msg-idx-sanity/v1:
  messaging: choose connection index via a look-up table
  messaging: convert do_get_rpc_client_idx into a switch
  messaging: remove default when computing rpc client index
  messaging: categorize more streaming/repair verbs as streaming
2018-07-19 15:03:15 +02:00
Tomasz Grabiec
ef4fb1f91d sstables: mp_row_consumer_m: Add trace-level logging
Very useful for debugging. The old mp_row_consumer_k_l had this.

Message-Id: <1532000326-28649-1-git-send-email-tgrabiec@scylladb.com>
2018-07-19 14:58:00 +03:00
Asias He
1f06ee3960 range_streamer: Limit nr of nodes to stream in parallel
For example, to bootstrap a 50th node in a cluster

 [shard 0] range_streamer - Bootstrap with
 [127.0.0.8, 127.0.0.2, 127.0.0.24, 127.0.0.21, 127.0.0.49, 127.0.0.44,
 127.0.0.9, 127.0.0.7, 127.0.0.47, 127.0.0.15, 127.0.0.5, 127.0.0.30,
 127.0.0.14, 127.0.0.12, 127.0.0.36, 127.0.0.11, 127.0.0.48, 127.0.0.28,
 127.0.0.33, 127.0.0.10, 127.0.0.41, 127.0.0.4, 127.0.0.40, 127.0.0.3,
 127.0.0.6, 127.0.0.43, 127.0.0.22, 127.0.0.26, 127.0.0.42, 127.0.0.25,
 127.0.0.17, 127.0.0.37, 127.0.0.23, 127.0.0.13, 127.0.0.38, 127.0.0.1,
 127.0.0.18, 127.0.0.20, 127.0.0.39, 127.0.0.27, 127.0.0.34, 127.0.0.32,
 127.0.0.19, 127.0.0.16, 127.0.0.31, 127.0.0.45, 127.0.0.29, 127.0.0.35,
 127.0.0.46]
 for keyspace=keyspace1 started, nodes_to_stream=49, nodes_in_parallel=49

the new node will get data from 49 existing nodes.

Currently, it will stream from all the 49 existing nodes at the same
time. It is not a good idea to stream from all the nodes in parallel
which can overwhelm the bootstrap node, i.e., 49 nodes sending, 1 node
receiving.

To fix this, limit the number of nodes to stream from in parallel. We should
eventually have better control over memory usage and parallelism, but for
now, limit the number of nodes to a maximum of 16 as a starter. With this
limit, each shard can work with as many as 16 remote nodes in parallel,
which should provide enough parallelism for streaming performance.

This change affects the bootstrap/decommission/removenode operations and
does not affect repair.
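The cap described above amounts to a tiny helper (the function name is hypothetical; the real range_streamer logic is more involved):

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of the limit from the commit message: stream from at most 16
// peers at a time, regardless of how many nodes hold data to stream.
constexpr std::size_t max_nodes_in_parallel = 16;

inline std::size_t nodes_in_parallel(std::size_t nodes_to_stream) {
    return std::min(nodes_to_stream, max_nodes_in_parallel);
}
```

For the 50-node bootstrap above, this turns nodes_in_parallel from 49 into 16.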

Refs #2782

Message-Id: <980610dc97490d4f16281a0c3203b9bee73e04e4.1531989557.git.asias@scylladb.com>
2018-07-19 11:44:05 +03:00
Avi Kivity
31d4d37161 Merge "Reduce continuous memory usage in gossip" from Asias
"
Use chunked_vector instead of vector. It won't have compatibility issues
because chunked_vector and vector have the same on wire format.

Refs #278
"

* 'asias/gossip_memory_v2' of github.com:scylladb/seastar-dev:
  gossip: Reduce continuous memory usage
  to_string: Add std::list and utils::chunked_vector support
  serializer: Add chunked_vector support
2018-07-19 09:12:09 +03:00
Tomasz Grabiec
9a0548397c tests: row_cache: Add test for eviction from invalidated partitions
Message-Id: <1531933216-28026-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 21:06:36 +03:00
Piotr Sarna
82c049692b tests: add index + partition key test
Tests covering querying both index and partition keys are added
- it's checked that such queries do not require filtering.
2018-07-18 18:45:08 +02:00
Piotr Sarna
0c85bdcdc2 cql3: make index+primary key restrictions filtering-independent
If full partition key (or full primary key) is used in an indexed
query, it should not require filtering, because queries like that
can be efficiently narrowed down with stricter index restrictions.
2018-07-18 18:45:08 +02:00
Piotr Sarna
2542630a18 cql3: use primary key restrictions in filtering index queries
If both an index and a partition key are used in a query, it should not
require filtering, because the indexed query can be narrowed down
with partition key information. This commit appends the partition key
restrictions to the index query.
2018-07-18 18:45:08 +02:00
Piotr Sarna
27590816f0 cql3: add is_all_eq to primary key restrictions
is_all_eq is later needed to decide if restrictions can be used
in an indexed query.
2018-07-18 18:45:08 +02:00
Piotr Sarna
20a349777e cql3: add explicit conversion between key restrictions
Partition and clustering key restrictions sometimes need to be converted
and this commit provides a way to do that.
2018-07-18 18:45:08 +02:00
Piotr Sarna
f1357defd6 cql3: add apply_to() method to single column restriction
This method allows copying single column restriction,
possibly with a new column definition.
2018-07-18 18:44:38 +02:00
Tomasz Grabiec
dc453d4f5d tests: flat_mutation_reader: Use fluent assertions for better error messages
Message-Id: <1531908313-29810-2-git-send-email-tgrabiec@scylladb.com>
2018-07-18 13:52:23 +01:00
Tomasz Grabiec
604c8baed8 tests: flat_mutation_reader_assertions: Introduce produces(mutation_fragment)
Message-Id: <1531908313-29810-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 13:52:23 +01:00
Tomasz Grabiec
c46813717c tests: sstables: Check that reading large index pages does not cause large allocations
Reproducer of #3597.

Message-Id: <1531914040-5427-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 14:56:41 +03:00
Piotr Sarna
30f9924ad5 cql3: make primary key restrictions' values unambiguous
A using directive must be used to disambiguate the overridden method.
2018-07-18 13:28:37 +02:00
Paweł Dziepak
a0c1c0c921 types: bytes_view: override fragmented validate()
The default implementation linearises the buffer and calls
validate(bytes_view). This is bad and not needed for bytes_type which
doesn't do any validation anyway.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
0b9eed72f4 cql3: value_view: switch to fragmented_temporary_buffer::view 2018-07-18 12:28:06 +01:00
Paweł Dziepak
0551efee3b types: add validate that accepts fragmented_temporary_buffer::view 2018-07-18 12:28:06 +01:00
Paweł Dziepak
8f4cb36ef2 cql3 query_options: add linearize()
Some code in the CQL3 layer requires bytes_view and it is fairly
reasonable to assume that it won't deal with large buffers (e.g.
statement restrictions). query_options already has make_temporary()
which takes ownership of a cql3::raw_value so that the rest of the code
can use cql3::raw_value_view. This patch adds similar linearize()
function which, if necessary, linearises a cql3::raw_value_view and
returns a bytes_view with its lifetime tied to that of the query_options.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
3810045f8f cql3: query_options: use bytes_ostream for temporaries
bytes_ostream is going to be more efficient than
std::vector<std::vector<char>> since it can put multiple small values in
a single buffer thus reducing the number of memory allocations.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
dff6cd3e2f cql3: operation: make make_cell accept fragmented_temporary_buffer::view 2018-07-18 12:28:06 +01:00
Paweł Dziepak
cc87263bd8 atomic_cell: accept fragmented_temporary_buffer::view values 2018-07-18 12:28:06 +01:00
Paweł Dziepak
7d7910aa4d cql3: avoid ambiguity in a call to update_parameters::make_cell()
Using initializer lists in calls like foo({}) is ambiguous if foo() has
multiple overloads with more than one accepting a type that is
default-constructible. update_parameters::make_cell() is about to get an
overload that accepts fragmented_temporary_buffer::view as a value, so
let's make sure its call site won't be ambiguous.
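A contrived illustration of this kind of ambiguity (all names here are hypothetical, not the real update_parameters API):

```cpp
#include <string>
#include <vector>

// Two overloads whose parameter types are both constructible from an empty
// braced-init-list; calling m.make_cell({}) would not compile (ambiguous).
struct cell_maker {
    int make_cell(const std::vector<int>&) { return 1; }
    int make_cell(const std::string&)      { return 2; }
};

// The fix: spell out the argument type at the call site instead of {}.
inline int make_empty_list_cell(cell_maker& m) {
    return m.make_cell(std::vector<int>{});
}
```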
2018-07-18 12:28:06 +01:00
Paweł Dziepak
8c6e544fec transport: switch to fragmented_temporary_buffer
The logic responsible for reading requests was operating on
temporary_buffer<char> and bytes_view. This required all request
messages to be linearised to a contiguous buffer, possibly causing large
allocations. Changing to fragmented_temporary_buffer mostly alleviates this
problem unless the reader code explicitly asks for a contiguous bytes_view.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
f95bb21d99 transport: extract compression buffers from response class
Both compression and decompression code is going to reuse the same pair
of reusable buffers.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
a8c4f41a0b tests/reusable_buffer: test fragmented_temporary_buffer support 2018-07-18 12:28:06 +01:00
Paweł Dziepak
32ba47fb87 utils: reusable_buffer: support fragmented_temporary_buffer
reusable_buffer already supports bytes_ostream which is often used for
handling data sent from Scylla. This patch adds support for
fragmented_temporary_buffer which is going to be mainly used for data
received by Scylla.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
166c9a3b8c tests: add test for fragmented_temporary_buffer 2018-07-18 12:28:06 +01:00
Paweł Dziepak
b152aafd67 util fragment_range: add general linearisation functions
All FragmentRange implementations can be linearised in the same way, so
let's provide linearized() and with_linearized() functions for all of
them.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
fc484f0819 utils: add fragmented_temporary_buffer
Seastar output_streams produce temporary_buffer<char>s.
fragmented_temporary_buffer represents a single fragmented buffer that
consists of, possibly multiple, temporary_buffer<char>s.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
b5a72a880b tests: add basic test for transport requests and responses 2018-07-18 12:28:06 +01:00
Paweł Dziepak
054d39b8f7 tests/random-utils: print seed
Knowing the seed will make it easier to investigate failures in
randomised tests.
2018-07-18 12:28:06 +01:00
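The seed-printing idea is simple enough to sketch (an illustrative pattern, not the test framework's actual code): log the seed up front so a failing randomised run can be replayed deterministically.

```cpp
#include <cassert>
#include <cstdio>
#include <random>

// Sketch: always log the seed of a randomised test, so a failure can be
// reproduced by feeding the same seed back in.
std::mt19937 make_logged_engine(unsigned seed) {
    std::printf("random seed: %u\n", seed);  // paste this value to reproduce
    return std::mt19937(seed);
}
```

In a real harness the seed would come from `std::random_device` by default and be overridable from the environment or command line.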
Paweł Dziepak
9445ce3f84 tests/random-utils: generate sstrings 2018-07-18 12:28:06 +01:00
Paweł Dziepak
46acd76cc8 cql3: add value_view printer and equality comparison
BOOST_CHECK_*() expect compared objects to be equality-comparable and
printable.
2018-07-18 12:28:06 +01:00
Paweł Dziepak
24929fd2ce transport: move response outside of cql_server class 2018-07-18 12:28:06 +01:00
Paweł Dziepak
5986e7a383 transport: drop request_reader::read_value() 2018-07-18 12:28:06 +01:00
Paweł Dziepak
72450e2f7f transport: extract request reading to request_reader 2018-07-18 12:28:06 +01:00
Paweł Dziepak
1eeef4383c transport: fix use-after-free in read_name_and_value_list() 2018-07-18 12:28:06 +01:00
Avi Kivity
31151cadd4 Merge "row_cache: Fix violation of continuity on concurrent eviction and population" from Tomasz
"
The problem happens under the following circumstances:

  - we have a partially populated partition in cache, with a gap in the middle

  - a read with no clustering restrictions trying to populate that gap

  - eviction of the entry for the lower bound of the gap concurrent with population

The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.

Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.

The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts before all
clustering rows. Otherwise, we're populating from _last_row and should
consult it.

Fixes #3608.
"

* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
  row_cache: Fix violation of continuity on concurrent eviction and population
  position_in_partition: Introduce is_before_all_clustered_rows()
2018-07-18 10:11:34 +03:00
Asias He
506eed325a dht: Fix typo in boot_strapper.cc
Eror -> Error

Message-Id: <ab1050c526f6e70c3a365595376acde7706d86e9.1531877929.git.asias@scylladb.com>
2018-07-18 10:00:27 +03:00
Tomasz Grabiec
894961006b Merge "db/view/view_builder: Fixes to bookkeeping" from Duarte
This series contains a couple of fixes to the bookkeeping of the view
build process, which could cause data to be left behind in the system
tables.

* git@github.com:duarten/scylla.git materialized-views/view-build-fixes/v1:

Duarte Nunes (3):
  db/system_keyspace: Add function to remove view build status of a
    shard
  db/view: Don't have shard 0 clear other shard's status on drop
  db/view: Restrict writes to the distributed system keyspace to shard 0
2018-07-17 18:01:28 +02:00
Tomasz Grabiec
25d09e51ac Merge "db/view/build_progress_virtual_reader: Fixes to clustering key adjusts" from Duarte
This series contains a couple of fixes to the adjusting of clustering
keys in the build_progress_virtual_reader, some of which could
potentially cause heap overflows when querying the legacy system table.

* git@github.com:duarten/scylla.git materialized-views/build-progress-virtual-reader-fixes/v1:

Duarte Nunes (3):
  db/view/build_progress_virtual_reader: Use correct schema to adjust ck
  db/view/build_progress_virtual_reader: Fix full ck detection
  db/view/build_progress_virtual_reader: Also adjust end RT bound
2018-07-17 18:00:30 +02:00
Avi Kivity
9ffa6b9ad6 Merge "Fix leaks and corruption of continuity in cache in case of bad_alloc from key linearization" from Tomasz
"
This series fixes two issues related to bad_allocs and keys which require
linearization (larger than 12.8 KiB). With such keys, comparators may throw if
memory allocation fails. This may cause lookups in partition and rows trees to
fail with bad_alloc.

The first issue (#3583) was that partition version merging
(mutation_partition::apply_monotonically()) was not taking into account that
lookups may fail. If we fail, the partition which is being applied may be
incorrectly left with the clustering range from the beginning of the range up
to the current row marked as continuous, if the current row has the continuity
flag set, because we've moved all of the preceding rows into the target, and
the correct lower bound row is no longer there in the source. This may mark
some discontinuous ranges as continuous. Merging is retried by
allocating_section, and there will be no problem if it eventually succeeds,
original continuity will be reflected in the sum. The problem will persist if
it doesn't eventually succeed, when we're really out of memory.

The user-perceivable effect of this would be temporary loss of writes in the
clustering range which was marked as continuous but shouldn't. Introduced in
2.2-rc1.

The second issue (#3585) is that the code which inserts partitions in memtable
and cache will leak the entry if boost::intrusive_set::insert() throws. This
will also cause SIGSEGV when cache tries to evict from such a leaked entry.
"

* tag 'tgrabiec/fix-bad-continuity-on-oom-in-apply-v2' of github.com:tgrabiec/scylla:
  managed_bytes: Mark read_linearize() as an allocation point
  tests: Relax expectation about continuity after failed merging
  tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging
  tests: Switch to seastar's allocation failure injector
  mutation_partition: Introduce set_continuity()
  clustering_interval_set: Introduce contained_in()
  clustering_interval_set: Introduce add() overload accepting another interval set
  mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
  mutation_partition: Preserve continuity in case row merging with no tracker throws
  memtable, cache: Fix exception safety of partition entry insertions
2018-07-17 18:19:37 +03:00
Tomasz Grabiec
477d7b439b row_cache: Fix violation of continuity on concurrent eviction and population
ensure_population_lower_bound() returned true if current clustering
range covers all rows, which means that the populator has a right to
set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts before all
clustering rows. Otherwise we're populating from _last_row, and
should consult it.

The fix introduces a new flag, set when starting to populate, which
indicates if we're populating from the beginning of the range or
not. We cannot simply check if _last_row is set in
ensure_population_lower_bound() because _last_row can be set and then
become empty again.

Fixes #3608
2018-07-17 16:43:21 +02:00
Tomasz Grabiec
8d47d21149 position_in_partition: Introduce is_before_all_clustered_rows() 2018-07-17 16:43:21 +02:00
Tomasz Grabiec
612b223819 managed_bytes: Mark read_linearize() as an allocation point 2018-07-17 16:39:43 +02:00
Tomasz Grabiec
be678a81ee tests: Relax expectation about continuity after failed merging
Currently we check that the sum of continuities is exactly the same as
expected on failure. Relax this to require that continuity is not
broader, since in some bad_alloc or preemption scenarios we will
have to mark some ranges as discontinuous.
2018-07-17 16:39:43 +02:00
Tomasz Grabiec
f366ac76e8 tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
d9db79a85d tests: Switch to seastar's allocation failure injector
It catches more allocation sites.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
6b1fe6cbe5 mutation_partition: Introduce set_continuity() 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
ac772cbd81 clustering_interval_set: Introduce contained_in() 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
d24ebe8565 clustering_interval_set: Introduce add() overload accepting another interval set 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
c6c54021a8 mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
When clustering keys are larger than 12.8 KiB they may get fragmented
and key comparator will need to linearize them on comparison. This may
cause lookups in the rows tree to fail with bad_alloc. Partition
version merging (mutation_partition::apply_monotonically()) was not
taking this into account. If we fail on lookup, the partition which is
being applied may be incorrectly left with the clustering range from
the beginning up to the current row marked as continuous, if the current
row has the continuity flag set, because we've moved all of the
preceding rows into the target, and the correct lower bound row is no
longer there in the source. This may mark some discontinuous ranges as
continuous.

Merging is retried by allocating_section, and there will be no problem
if it eventually succeeds, original continuity will be reflected in the
sum. The problem will persist if it doesn't eventually succeed, when
we're really out of memory.

To protect against this, we could reset the continuity flag of the
current row in the source when exiting on exception.

Fixes #3583
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
de5c52f422 mutation_partition: Preserve continuity in case row merging with no tracker throws
Example:

 p:      row{key=A, cont=0} row{key=C, cont=1}
 this:                      row{key=C, cont=0}

When we get to processing key=C, key=A was already moved to this, so p
has stale continuity on key=C, which marks (-inf,C) as continuous,
whereas it should mark only (A, C). That's not a problem if merging
succeeds, but if exception happens at this point, we will violate the
invariant which says that the sum of p and this should yield the same
logical partition. It wouldn't because continuity of the sum is
calculated as a set union, and (-inf, A) would be incorrectly turned
into a continuous range.

This is not a problem currently because continuity is always full when
there is no tracker (memtables), so won't change anyway, and when
there is a tracker (cache) we never merge but overwrite instead, so
there is no memory allocation and thus no possibility for failure. But
better be safe.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
567da3e063 memtable, cache: Fix exception safety of partition entry insertions
boost::intrusive::set::insert() may throw if keys require
linearization and that fails, in which case we will leak the entry.

When this happens in cache, we will also violate the invariant for
entry eviction, which assumes all tracked entries are linked, and
cause a SEGFAULT.

Use the non-throwing and faster insert_before() instead. Where we
can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure
that entry is deallocated on insert failure.

Fixes #3585.
2018-07-17 16:30:01 +02:00
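The guard pattern described above can be distilled into a small sketch. This is illustrative only (a `std::set` of raw pointers stands in for `boost::intrusive::set`, and a comparator that throws models a failed key linearization); the point is that ownership stays with a smart pointer until the insert has succeeded, so the failure path cannot leak.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <set>

struct entry { int key; };

struct throwing_cmp {
    bool operator()(const entry* a, const entry* b) const {
        if (a->key < 0 || b->key < 0) { throw std::bad_alloc(); }  // models failed linearization
        return a->key < b->key;
    }
};

using entry_set = std::set<entry*, throwing_cmp>;  // stand-in for boost::intrusive::set

// Returns the inserted entry, or nullptr if the insert failed (no leak).
entry* safe_insert(entry_set& s, int key) {
    auto e = std::make_unique<entry>(entry{key});  // owns the entry until success
    try {
        s.insert(e.get());
        return e.release();  // container now references it; caller tracks ownership
    } catch (const std::bad_alloc&) {
        return nullptr;      // unique_ptr frees the entry on this path
    }
}
```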
Tomasz Grabiec
c82c0be0be tests: mutation_diff: Ignore differences in memory addresses
Differences in memory addresses are not necessarily differences in
values.

Refs #3571

Message-Id: <1531824919-12737-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 16:32:04 +03:00
Amos Kong
0fcdab8538 scylla_setup: nic setup dialog is only for interactive mode
The current code raises a dialog even in non-interactive mode when we pass
options to scylla_setup. This blocked the automatic artifact test.

Fixes #3549

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <58f90e1e2837f31d9333d7e9fb68ce05208323da.1531824972.git.amos@scylladb.com>
2018-07-17 16:31:18 +03:00
Paweł Dziepak
422d1eaeb9 Merge "Improve usability of pkeys in system.large_partitions table" from Avi
"
Partition keys are currently stored in serialized form in the
system.large_partitions table. This is an obstacle to operators
who usually can't deserialize partition keys in their heads.

Improve the situation by deserializing the partition key for them.
"

* tag 'pkey-print/v1' of https://github.com/avikivity/scylla:
  large_partition_handler: output friendly partition key
  keys: schema-aware printing of a partition_key
2018-07-17 13:51:22 +01:00
Avi Kivity
002ac87aac Update seastar submodule
* seastar aac6cf1...6b97e00 (5):
  > Merge "changes to fix travis CI builds" from Kefu
  > tls.cc: Make "close" timeout delay exception proof
  > core/sharded: mark foreign_ptr::get_owner_shard() const
  > core/memory: Expose counter of large allocations
  > tests: add test for multi-fragmented net::packet

Fixes #3461.
Ref scylladb/seastar#474.
2018-07-17 15:43:01 +03:00
Tomasz Grabiec
3f509ee3a2 mutation_partition: Fix exception-safety of row copy constructor
In case population of the vector throws, the vector object would not
be destroyed. It's a managed object, so in addition to causing a leak,
it would corrupt memory if later moved by the LSA, because it would
try to fixup forward references to itself.

Caused sporadic failures and crashes of row_cache_test, especially
with allocation failure injector enabled.

Introduced in 27014a23d7.
Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 13:21:21 +01:00
Asias He
fd71c5718f gossip: Reduce continuous memory usage
Gossip SYN and ACK use std::vector to store a list of gossip_digest;
the larger the cluster, the more contiguous memory is needed. To reduce
the memory pressure which might cause std::bad_alloc, switch the std::vector
to chunked_vector.

In addition, change add_local_application_state to use std::list instead
of std::vector.

Refs #2782
2018-07-17 20:15:32 +08:00
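The chunked_vector idea referenced above can be sketched minimally (this is an invented stand-in, not Scylla's `utils::chunked_vector`, and the chunk size is arbitrary): elements live in fixed-size chunks, so growth never demands one huge contiguous allocation the way `std::vector` does.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, size_t ChunkSize = 1024>
class chunked_vector {
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    size_t _size = 0;
public:
    void push_back(T v) {
        if (_size % ChunkSize == 0) {
            auto c = std::make_unique<std::vector<T>>();
            c->reserve(ChunkSize);  // each allocation is bounded by ChunkSize
            _chunks.push_back(std::move(c));
        }
        _chunks.back()->push_back(std::move(v));
        ++_size;
    }
    T& operator[](size_t i) { return (*_chunks[i / ChunkSize])[i % ChunkSize]; }
    size_t size() const { return _size; }
};
```

Only the small vector of chunk pointers ever needs to be reallocated contiguously, which keeps allocation sizes bounded even for very large digest lists.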
Avi Kivity
acb3163639 large_partition_handler: output friendly partition key
Use abstract_type::to_string() to prettify partition key components.

Manually tested by setting --compaction-large-partition-warning-threshold-mb
to zero and inspecting the output for compound and non-compound partition
keys.
2018-07-17 14:44:52 +03:00
Avi Kivity
bfd14b4123 keys: schema-aware printing of a partition_key
Add a with_schema() helper to decorate a partition key with its
schema for pretty-printing purposes, and matching operator<<.

This is useful to print partition keys where the operator, who
may not be familiar with the encoding, may see them.
2018-07-17 14:43:12 +03:00
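The `with_schema()` decorator idea might be sketched like this. All types and field names here are invented for illustration; the real code decorates Scylla's `partition_key` with its `schema` for `operator<<`.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// A key alone cannot print itself readably; pairing it with its schema
// lets operator<< name each component.
struct schema { std::vector<std::string> column_names; };
struct partition_key { std::vector<std::string> components; };

struct key_with_schema {
    const schema& s;
    const partition_key& k;
};

inline key_with_schema with_schema(const schema& s, const partition_key& k) {
    return {s, k};
}

inline std::ostream& operator<<(std::ostream& os, const key_with_schema& d) {
    os << "pk{";
    for (size_t i = 0; i < d.k.components.size(); ++i) {
        if (i) { os << ", "; }
        os << d.s.column_names[i] << "=" << d.k.components[i];
    }
    return os << "}";
}
```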
Tomasz Grabiec
d94c7c07a3 lsa: Disable alloc failure injector inside the LSA sanitizer
Message-Id: <1531814822-30259-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 11:27:56 +01:00
Asias He
77018b7304 to_string: Add std::list and utils::chunked_vector support
It will be used by the gossip code.
2018-07-17 16:14:31 +08:00
Asias He
e4802d2fe3 serializer: Add chunked_vector support
It will be used by the gossip SYN and ACK message soon.
2018-07-17 16:12:50 +08:00
Botond Dénes
cc4acb6e26 storage_proxy: use the original row limits for the final results merging
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuations captures via value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications done to the read
command object done by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed which in turn will result in the
lambda doing the final trimming according to a decremented `row_limits`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.

Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes appearantly reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring and thus causing the problem to surface.

Fixes #3605.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
2018-07-16 16:54:50 +03:00
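The bug and fix described above reduce to a small pattern, sketched here with invented types: a shared command object is mutated by the querying step, so the trimming step must work from a copy of the limit taken before the call, not from the (possibly mutated) field afterwards.

```cpp
#include <algorithm>
#include <cassert>

struct read_command { int row_limit; };

// Models query_partition_key_range_concurrent(): decrements the limit on
// the shared command object as vnodes are traversed.
int run_query(read_command& cmd) {
    cmd.row_limit -= 7;
    return 10;  // rows fetched
}

int trimmed_row_count_buggy(read_command& cmd) {
    int rows = run_query(cmd);
    return std::min(rows, cmd.row_limit);   // reads the mutated limit
}

int trimmed_row_count_fixed(read_command& cmd) {
    int original_limit = cmd.row_limit;     // copy before the call
    int rows = run_query(cmd);
    return std::min(rows, original_limit);  // trims with the original page limit
}
```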
Takuya ASADA
9479ff6b1e dist/common/scripts/scylla_prepare: fix error when /etc/scylla/ami_disabled exists
In this part, a shell command wasn't converted to python3; this needs to be fixed.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180715075015.13071-1-syuu@scylladb.com>
2018-07-16 09:29:38 +03:00
Avi Kivity
c4013f6fe1 messaging: categorize more streaming/repair verbs as streaming
Since the messaging service will assign a scheduling group based
on the client index, it's more important now to get the verbs categorized
correctly.

Re-categorize REPLICATION_FINISHED, REPAIR_CHECKSUM_RANGE, and most
importantly STREAM_MUTATION_FRAGMENTS to the repair/streaming oriented
connections so we get the correct scheduling.
2018-07-15 15:44:10 +03:00
Avi Kivity
ff3d7839ab messaging: remove default when computing rpc client index
A default means that when adding new verbs, we may forget to
categorize a verb correctly.  Without the default, the compiler
will complain due to -Wswitch.
2018-07-15 15:40:29 +03:00
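The -Wswitch point above can be illustrated with a small sketch (the verb names are invented; Scylla's real verb enum is much larger): with no `default:` label, a compiler given `-Wswitch` warns about any enumerator missing a `case`, so adding a new verb forces a conscious categorization.

```cpp
#include <cassert>

enum class verb { mutation, read, stream_mutation_fragments };

int client_index(verb v) {
    switch (v) {
    case verb::mutation: return 0;
    case verb::read: return 1;
    case verb::stream_mutation_fragments: return 2;
    // no default: adding a verb without a case here triggers -Wswitch
    }
    return -1;  // unreachable; keeps -Wreturn-type quiet
}
```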
Avi Kivity
fe2db68be8 messaging: convert do_get_rpc_client_idx into a switch
A switch is more readable for multiple choice with no
clearly preferred choice.
2018-07-15 15:26:50 +03:00
Avi Kivity
3b1e04091c messaging: choose connection index via a look-up table
Looking up is faster than a bunch of if()s.
2018-07-15 15:21:06 +03:00
Takuya ASADA
1511d92473 dist/redhat: drop scylla_lib.sh from .rpm
Since we dropped scylla_lib.sh at 58e6ad22b2,
we need to remove it from the RPM spec file too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712155129.17056-1-syuu@scylladb.com>
2018-07-15 14:46:22 +03:00
Avi Kivity
ef9b36376c Merge "database: support multiple data directories" from Glauber
"
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
  host snapshots for the tables we are examining
- snapshots are created with files in its own directories, but the
  manifest file goes to the main directory. For this one, note that in
  Cassandra the same thing happens, except that there is no "main"
  directory. Still the manifest file is still just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory

Despite the restrictions, one example of usage of this is recovery.  If
we have network attached devices for instance, we can quickly attach a
network device to an existing node and make the data immediately
available as it is compacted back to main storage.

Tests: unit (release)
"

* 'multi-data-file-v2' of github.com:glommer/scylla:
  database: change ident
  database: support multiple data directories
  database: allow resharding to specify a directory
  database: support multiple directories in get_snapshot_details
  database: move get_snapshot_info into a seastar::thread
  snapshots: always create the snapshot directory
  sstables: pass sstable dir with entry descriptor
  database: make nodetool listsnapshots print correct information
  sstables: correctly create descriptors for snapshots
2018-07-15 13:31:04 +03:00
Avi Kivity
8ee807321f Merge "scylla streaming with rpc streaming" from Asias
"
This work is on top of Gleb's rpc streaming which is merged recently.

What this series does is to replace scylla streaming service's data plane to
use the new rpc streaming instead of the old rpc verb to send the mutations for
scylla streaming. Other parts of scylla streaming, the control plane, are not
changed.

In my test, to bootstrap a new node to the existing one node cluster, smp 2,
scylla stores data on ramdisk to minimize disk io impact.

I saw a 2x improvement in streaming bandwidth.

Before:
   [shard 0] stream_session - [Stream #2ae92320-5fc8-11e8-911a-000000000000]
   Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1570312 KiB, 109521.02 KiB/s
   [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 14.338 seconds

After:
   [shard 0] stream_session - [Stream #e5589ac0-5fc7-11e8-b463-000000000000]
   Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1546875 KiB, 220415.36 KiB/s
   [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 7.018 seconds

Tests: dtest update_cluster_layout_tests.py

Fixes: #3591
"

* tag 'asias/scylla_streaming_with_rpc_streaming_v8' of github.com:scylladb/seastar-dev:
  streaming: Add rpc streaming support
  storage_service: Introduce STREAM_WITH_RPC_STREAM feature
  streaming: Add estimate_partitions to send_info
  messaging_service: Add streaming with rpc streaming support
  messaging_service: Add streaming_domain
  database: Add add_sstable_and_update_cache
  database: Add make_streaming_sstable_for_write
2018-07-15 12:36:52 +03:00
Vlad Zolotarov
235520292e utils::loading_cache: hold a shared_value_ptr to the value when we reload
This removes the requirement to hold the key value inside the
_load callback if its value is needed in the asynchronous continuation
inside the callback in the context of a reload.

This also resolves the use-after-free issue when a _load() callback removes
the item for a given key.

See a9b72db34d.1528794135.git.bdenes%40scylladb.com
for a discussion about this.

In addition this patch makes the loading_cache more robust for any existing
and potential situations when cached entries are being removed from inside the
callback. This is achieved by extending the idea implemented by Duarte in the
"utils/loading_cache: Avoid using invalidated iterators" by capturing timestamped_val_ptr
(which is essentially a lw_shared_ptr to an intrusive set entry which holds both the key
and the cached value) instead of a naked pointer.

Tests {debug, release}:
   - Unit tests:
      - loading_cache_test
      - view_build_test
      - auth_test
      - auth_resource_test

   - dtest:
      - auth_test.py

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-13 11:27:58 -04:00
Vlad Zolotarov
b44ad5677a utils::loading_cache::on_timer(): remove not needed capture of "this"
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-13 11:27:43 -04:00
Vlad Zolotarov
4aa0e5914b utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
The list of elements that needs to be reloaded may be rather large.
Use chunked_vector in order to make the allocator's life easier.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-13 09:53:59 -04:00
Avi Kivity
8c993e0728 messaging: tag RPC services with scheduling groups
Assign a scheduling_group for each RPC service. Assignment is
done by connection (get_rpc_client_idx()) - all verbs on the
same connection are assigned the same group. While this may seem
arbitrary, it avoids priority inversion; if two verbs on the same
connection have different scheduling groups, the verb with the low
shares may cause a backlog and stall the connection, including
following requests from verbs that ought to have higher shares.

The scheduling_group parameters are encapsulated in different
classes as they are passed around to avoid adding dependencies.
Message-Id: <20180708140433.6426-1-avi@scylladb.com>
2018-07-13 13:57:08 +02:00
Vladimir Krivopalov
cf7b42619d clustering_ranges_walker: Improve class consistency and readability.
This patch addresses several issues.
  1. The class no longer uses placement-new trick for move-assignment.
     It was incorrect to use because the class contains const references
     and re-initializing the same region of memory would result in undefined
     behaviour on accessing these members.

  2. Use boost::iterator_range for tracking the current range of
     cr_ranges. It is easier to deal with and avoids possible bugs like
     assigning only one of two iterators.
Message-Id: <4096182c4ee2fb1157e135c487c41012b266ba69.1531440684.git.vladimir@scylladb.com>
2018-07-13 11:23:33 +02:00
Asias He
deff5e7d60 streaming: Add rpc streaming support
This patch changes scylla streaming to use the recently added rpc
streaming feature provided by seastar to send mutation fragments for
scylla streaming instead of the rpc verbs.

It also changes the receiver to write to the sstable file directly,
skipping writing to memtable.
2018-07-13 08:36:47 +08:00
Asias He
71e22fe981 storage_service: Introduce STREAM_WITH_RPC_STREAM feature
With this feature, the node supports scylla streaming using the rpc
streaming.
2018-07-13 08:36:47 +08:00
Asias He
faa6769cdb streaming: Add estimate_partitions to send_info
The sender needs to estimate the number of partitions to send, because
the receiver needs this to prepare the sstables.
2018-07-13 08:36:46 +08:00
Asias He
ddfb4590ce messaging_service: Add streaming with rpc streaming support
Preparation for adding rpc streaming in scylla streaming.

- register_stream_mutation_fragments is used to register the rpc
streaming verb

- make_sink_and_source_for_stream_mutation_fragments is used to get the
sink and source object for the sender

- make_sink_for_stream_mutation_fragments is used to get a sink object
for the receiver
2018-07-13 08:36:46 +08:00
Asias He
671e1b08fe messaging_service: Add streaming_domain
The rpc streaming needs a streaming_domain id for the same logical
server. Choose one for our messaging service.
2018-07-13 08:36:46 +08:00
Asias He
6540051f77 database: Add add_sstable_and_update_cache
Since we can write mutations to sstable directly in streaming, we need
to add those sstables to the system so it can be seen by the query.
Also we need to update the cache so the query reflects the latest data.
2018-07-13 08:36:45 +08:00
Asias He
dfc2739625 database: Add make_streaming_sstable_for_write
This will be used to create sstable for streaming receiver to write the
mutations received from network to sstable file instead of writing to
memtable.
2018-07-13 08:36:45 +08:00
Takuya ASADA
ee61660b76 dist/common/scripts/scylla_ec2_check: support custom NIC ifname on EC2
Since some AMIs use consistent network device naming, the primary NIC
ifname is not 'eth0'.
But we hardcoded the NIC name as 'eth0' in scylla_ec2_check, so we need
to add a --nic option to specify a custom NIC ifname.

Fixes #3584

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712142446.15909-1-syuu@scylladb.com>
2018-07-12 18:22:28 +03:00
Tomasz Grabiec
b17f7257a9 sstables: index_reader: Reduce size of index_entry by indirecting promoted_index
Reduces size of index_entry from 384 bytes to 64 bytes by using
indirection for the optional promoted index instead of embedding it.

Improves query time from 9ms to 4ms in a micro benchmark with a very
large index page.

Message-Id: <1531406354-10089-1-git-send-email-tgrabiec@scylladb.com>
2018-07-12 17:46:58 +03:00
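The size reduction described above comes from a general layout trick, sketched here with invented sizes (not Scylla's actual structs): a large, optional member embedded in every entry inflates all of them, while holding it behind a pointer costs one word per entry.

```cpp
#include <array>
#include <cassert>
#include <memory>

// Illustrative stand-in for a rarely-populated, large optional member.
struct promoted_index { std::array<char, 320> data; };

struct entry_embedded { long token; promoted_index pi; };                   // big everywhere
struct entry_indirect { long token; std::unique_ptr<promoted_index> pi; };  // one word
```

Entries without a promoted index pay only for a null pointer, and the many small entries pack far more densely in memory and caches.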
Tomasz Grabiec
101dcdbb48 gdb: Fix scylla heapprof command
The type of _frames was changed to static_vector<>.

Message-Id: <1531233685-20786-2-git-send-email-tgrabiec@scylladb.com>
2018-07-12 16:51:30 +03:00
Tomasz Grabiec
059133ffa8 gdb: Introduce iteration wrapper for static_vector
Message-Id: <1531233685-20786-1-git-send-email-tgrabiec@scylladb.com>
2018-07-12 16:51:30 +03:00
Duarte Nunes
63b63b0461 utils/loading_cache: Avoid using invalidated iterators
When periodically reloading the values in the loading_cache, we would
iterate over the list of entries and call the load() function for
those which need to be reloaded.

For some concrete caches, load() can remove the entry from the LRU set,
and can be executed inline from the parallel_for_each(). This means we
could potentially keep iterating using an invalidated iterator.

Fix this by using a temporary container to hold those entries to be
reloaded.

Spotted when reading the code.

Also use if constexpr and fix the comment in the function containing
the changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712124143.13638-1-duarte@scylladb.com>
2018-07-12 13:59:09 +01:00
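The fix above is a general pattern, distilled here with invented names: a callback invoked during iteration may erase elements from the container being iterated, invalidating the live iterator, so the candidates are snapshotted into a temporary container first.

```cpp
#include <cassert>
#include <list>
#include <vector>

// Models reloading cache entries where load() may remove the entry being
// processed. Iterating a snapshot keeps the loop safe.
int reload_all(std::list<int>& entries) {
    std::vector<int> to_reload(entries.begin(), entries.end());  // snapshot
    int reloaded = 0;
    for (int key : to_reload) {
        entries.remove(key);  // models load() erasing the entry; the
        ++reloaded;           // snapshot's iterator is unaffected
    }
    return reloaded;
}
```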
Botond Dénes
2e7bf9c6f9 loading_cache::reload(): obtain key before calling _load()
The continuation attached to _load() needs the key of the loaded entry
to check whether it was disposed during the load. However if _load()
invalidates the entry the continuation's capture line will access
invalid memory while trying to obtain the key.
To avoid this save a copy of the key before calling _load() and pass it
to both _load() and the continuation.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>
2018-07-12 13:42:42 +01:00
Avi Kivity
a4a2f743a8 Merge "Avoid large allocations when reading sstable index pages" from Tomasz
"
If there is a lot of partitions in the index page, index_list may grow large
and require large contiguous blocks of memory, because it's based on
std::vector. That puts pressure on the memory allocator, and if memory is
fragmented, may not be possible to satisfy without a lot of eviction. Switch
to chunked_vector to avoid this.

Refs #3597
"

* 'tgrabiec/avoid-large-alloc-in-index-reader' of github.com:tgrabiec/scylla:
  sstables: Switch index_list to chunked_vector to avoid large allocations
  utils: chunked_vector: Do not require T to be default-constructible for clear()
  utils: chunked_vector: Implement front()
2018-07-12 15:30:18 +03:00
Duarte Nunes
1fb3b924f4 utils/loading_cache: Remove superfluous continuation
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712122031.13424-1-duarte@scylladb.com>
2018-07-12 15:22:35 +03:00
Takuya ASADA
8f80d23b07 dist/common/scripts/scylla_util.py: fix typo
Fix a typo, and rename get_mode_cpu_set() to get_mode_cpuset(), since
the term 'cpuset' does not include a '_' in other places.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711141923.12675-1-syuu@scylladb.com>
2018-07-12 10:14:55 +03:00
Tomasz Grabiec
8c85b01ad3 gdb: Fix scylla lsa-segment on python 3
Referring to a function parameter via "global" no longer works on
python 3. We should be using "nonlocal", which is absent on python 2
though. To make the script work on both, inline next().

Message-Id: <1531317984-29224-1-git-send-email-tgrabiec@scylladb.com>
2018-07-12 10:14:22 +03:00
Duarte Nunes
a7fdf4fc49 Merge 'ALLOW FILTERING for indexed queries' from Piotr
"
Previous series on ALLOW FILTERING introduced it for regular queries,
but it's also possible to have an indexed query which requires
filtering. This series contains minor fixes that allow treating
indexed+filtered queries properly. The most important part is having
more selective approach of extracting values from restrictions
in read_posting_list() helper function. Before ALLOW FILTERING,
restrictions contained only a single entry that matched the indexed
column, but it's not the case with filtering (and it won't be the case
with multiple indexing support).

This series also comes with test cases for indexed+filtered queries.

Tests: unit (release)
"

* 'allow_filtering_and_si_3' of https://github.com/psarna/scylla:
  tests: add filtering indexed queries tests
  cql3: use single restriction value in index creation
  cql3: add secondary index condition to need_filtering
  cql3: add value_for method
  cql3: add missing inline declarations to restrictions
  cql3: make index detection more specific
  index: add target_column getter to index
2018-07-12 00:17:36 +01:00
Duarte Nunes
55caaec411 db/view/build_progress_virtual_reader: Also adjust end RT bound
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 23:28:31 +01:00
Duarte Nunes
eda6b88b0e db/view/build_progress_virtual_reader: Fix full ck detection
As an optimization, the virtual reader doesn't change the underlying
key if it is not full, and hence doesn't include the extra clustering
key. However, this detection is broken because it checked for 3
clustering columns, instead of 2.

This patch fixes that by obtaining the clustering key size from the
underlying schema instead of hardcoding the size.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 23:28:31 +01:00
Duarte Nunes
ff3a0d437a db/view/build_progress_virtual_reader: Use correct schema to adjust ck
The virtual reader adjusts clustering keys obtained from the
underlying, scylla-specific schema, and potentially sheds the extra
clustering key that's absent from the Cassandra-compatible schema.

This patch ensures we use the correct schema to iterate over the
key.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 23:28:31 +01:00
Duarte Nunes
df66d7db59 db/view: Restrict writes to the distributed system keyspace to shard 0
Writing to the distributed system keyspace should be confined to a
single shard of each host, namely shard 0. We were violating this
constraint by having all shards set the host status to "started". This
could be problematic when the build finishes quickly or there's a
concurrent view drop, such that a write done by shard 0 can have a
smaller timestamp than one done by some other shard.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 21:45:26 +01:00
Duarte Nunes
e683c1367f db/view: Don't have shard 0 clear other shard's status on drop
Shard 0 can clear the in-progress build status of all shards when a
view finishes building, because we are ensured all writes to the
system table have completed with earlier timestamps.

This is not the case when dropping a view. A drop can happen
concurrently with the build, in which case shard 0 may process the
notification before another shard receives it, and before that shard
writes to the system table.

Fix this by ensuring each shard clears its own status on drop.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 21:45:26 +01:00
Duarte Nunes
2fa7f10429 db/system_keyspace: Add function to remove view build status of a shard
This patch adds a function that clears the view build in-progress
status for the current shard, similar to the existing one that clears
it across all shards.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-07-11 21:27:39 +01:00
Piotr Sarna
fcfbc804e4 tests: add filtering indexed queries tests
Tests covering ALLOW FILTERING usage while using secondary indexes
as well are added to cql_query_test.
Tests are based on Cassandra's test suite for filtering secondary
indexes + some more simple cases.
2018-07-11 18:06:21 +02:00
Piotr Sarna
7d9715db27 cql3: use single restriction value in index creation
ALLOW FILTERING support caused index-related restrictions to possibly
have more values. In order to remain correct, only those restrictions
which match the indexed columns should be used.
2018-07-11 18:06:21 +02:00
Piotr Sarna
1d75035672 cql3: add secondary index condition to need_filtering
A query that restricts a partition key and an indexed column
needs filtering (after reading an index) and it wasn't
properly detected before.
2018-07-11 18:06:21 +02:00
Piotr Sarna
80ce9b72a1 cql3: add value_for method
In order to extract value from a restriction for just one column,
value_for(column_name, options) method is implemented.
It's needed because once ALLOW FILTERING support was introduced,
index-related restrictions may contain more than 1 value.
2018-07-11 18:06:21 +02:00
Piotr Sarna
c1ad28f28e cql3: add missing inline declarations to restrictions
In order to prevent future compilation errors, externally defined
class methods from single column primary key restrictions are explicitly
marked inline.
2018-07-11 18:06:21 +02:00
Piotr Sarna
02811d8996 cql3: make index detection more specific
Conditions that detect if restrictions need an indexed query weren't
specific enough to work properly with mixed index-filtering queries,
because they would too eagerly assume that partition/clustering key
restrictions have a backing index.
2018-07-11 18:06:21 +02:00
Piotr Sarna
372644c909 index: add target_column getter to index
Target column for an index is later needed to find matching
restrictions.
2018-07-11 18:06:21 +02:00
Tomasz Grabiec
3b2890e1db sstables: Switch index_list to chunked_vector to avoid large allocations
If there is a lot of partitions in the index page, index_list may grow
large and require large contiguous blocks of memory. That puts
pressure on the memory allocator, and if memory is fragmented, may not
be possible to satisfy without a lot of eviction.
2018-07-11 16:55:20 +02:00
Tomasz Grabiec
b0f5df10d2 utils: chunked_vector: Do not require T to be default-constructible for clear()
resize(), used by clear(), requires T to be default-constructible in
case the vector is expanded. It's not actually needed for clearing,
and there will be users which use clear() with
non-default-constructible T, so implement clear() without using
resize().
2018-07-11 16:55:20 +02:00
Tomasz Grabiec
03832dab97 utils: chunked_vector: Implement front()
std::vector<> has it, so should this, for easy migration.
2018-07-11 16:55:20 +02:00
Piotr Sarna
dcdd8be59c cql3: make index-related tests less timing dependent
Indexes and materialized views take time to build, so checks
that rely on that are now wrapped with 'eventually' blocks.

Message-Id: <6d3def2bc49b76dda11d7a1c9974a8b3d221003f.1531312518.git.sarna@scylladb.com>
2018-07-11 15:45:52 +03:00
Takuya ASADA
58e6ad22b2 dist/common/scripts: drop scylla_lib.sh
Drop scylla_lib.sh: all bash scripts that depended on the library have
already been converted to python3, and all scylla_lib.sh features are
implemented in scylla_util.py.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711114756.21823-1-syuu@scylladb.com>
2018-07-11 14:54:56 +03:00
Avi Kivity
83d72f3755 Update scylla-ami submodule
* dist/ami/files/scylla-ami 5200f3f...d53834f (1):
  > Merge "AMI scripts python3 conversion" from Takuya
2018-07-11 13:16:08 +03:00
Avi Kivity
693cf77022 Merge "more conversion from bash to python3" from Takuya
"Converted more scripts to python3."

* 'script_python_conversion2_v2' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_util.py: make run()/out() functions shorter
  dist/ami: install python34 to run scylla_install_ami
  dist/common/scripts/scylla_ec2_check: move ec2 related code to class aws_instance
  dist/common/scripts: drop class concolor, use colorprint()
  dist/ami/files/.bash_profile: convert almost all lines to python3
  dist/common/scripts: convert node_exporter_install to python3
  dist/common/scripts: convert scylla_stop to python3
  dist/common/scripts: convert scylla_prepare to python3
2018-07-11 13:14:23 +03:00
Tomasz Grabiec
1de5177175 tests: row_cache: Fix use-after-scope on partition_range passed to readers
The partition_range must outlive the reader.

Message-Id: <1531301583-15476-1-git-send-email-tgrabiec@scylladb.com>
2018-07-11 12:39:30 +03:00
Avi Kivity
28621066e6 observable: allow an observable to disconnect() twice without penalty
Message-Id: <20180711070754.13286-1-avi@scylladb.com>
2018-07-11 10:15:01 +01:00
Avi Kivity
1895483781 observable: add comments explaining the purpose and use of the mechanism
Message-Id: <20180710133706.8791-1-avi@scylladb.com>
2018-07-11 10:15:01 +01:00
Avi Kivity
99d3f0a1b1 tests: add observable_test to test suite
Message-Id: <20180711071131.13702-1-avi@scylladb.com>
2018-07-11 10:15:01 +01:00
Tomasz Grabiec
fde4a312db gdb: Replace long() with int()
Python 3 doesn't have 'long' anymore, so commands using it fail with
newer GDB. long on python2 is the same as int on python3, both are
arbitrary-precision. On python2, int is fixed-precision, but that still
seems to be enough (64 bit), so use it instead.

Message-Id: <1531215600-31899-1-git-send-email-tgrabiec@scylladb.com>
2018-07-10 15:05:02 +03:00
Nadav Har'El
5e47061438 repair: fix small error-handling logic mistake
As noticed by Tomasz Grabiec, we test a future's available() after
having already waited for it with when_all(), which is pointless.

The code after the wrong if() exchanges the contents of a token-range
between this node and several other live neighbors; We can't do this
exchange if either this node is broken or there is no other live neighbor.
So this is what we needed to test, and the !available() check should have been failed().

Also the test for live_neighbors_checksum.empty() added in commit 7c873f0d1f
is unnecessary - we build live_neighbors and live_neighbors_checksum
together, so if one of them is empty, so is the other.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180710114940.26027-1-nyh@scylladb.com>
2018-07-10 15:04:03 +03:00
Piotr Sarna
559439b6ea tests: add more ALLOW FILTERING tests
More test cases are added to cql_query_test in order to check
ALLOW FILTERING clauses more accurately.

Message-Id: <4c59c1f3eb01558be992d0596e5423c276087387.1531220558.git.sarna@scylladb.com>
2018-07-10 14:44:33 +03:00
Piotr Sarna
aadbfc6b84 cql3: throw instead of log for collection filtering
Original series that introduced filtering logged a warning
when collection restrictions appeared. Instead, an exception
should be thrown until collection restrictions are supported
for ALLOW FILTERING clauses.

Message-Id: <ddaf342d4d6766fadb756f66e5afa0b99ce054f8.1531220558.git.sarna@scylladb.com>
2018-07-10 14:44:29 +03:00
Avi Kivity
7db394ce50 observable: switch to noncopyable_function
std::function's move constructor is not noexcept, so observer's move
constructor and assignment operator also cannot be. Switch to Seastar's
noncopyable_function which provides better guarantees.

Tests: observer_tests (release)
Message-Id: <20180710073628.30702-1-avi@scylladb.com>
2018-07-10 09:42:49 +01:00
Avi Kivity
0a2c9387e8 Merge "Support reading deleted cells" from Piotr
"
Implement and test support for reading deleted cells in SSTables 3.
"

* 'haaawk/sstables3/read-deleted-cells-v2' of ssh://github.com/scylladb/seastar-dev:
  sstables: Test reading deleted cells from SST3
  sstables: Support deleted cells in reading SST3
  test_uncompressed_compound_ck_read: fix comment
  utils: add observer/observable templates
2018-07-10 11:21:00 +03:00
Piotr Jastrzebski
0abdd919c8 sstables: Test reading deleted cells from SST3
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-10 10:03:29 +02:00
Piotr Jastrzebski
54fc6dde35 sstables: Support deleted cells in reading SST3
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-10 10:03:29 +02:00
Piotr Jastrzebski
f64901fdac test_uncompressed_compound_ck_read: fix comment
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-07-10 10:03:14 +02:00
Avi Kivity
96737d140f utils: add observer/observable templates
An observable is used to decouple an information producer from a consumer
(in the same way as a callback), while allowing multiple consumers (called
observers) to coexist and to manage their lifetime separately.

Two classes are introduced:

 observable: a producer class; when an observable is invoked all observers
        receive the information
 observer: a consumer class; receives information from an observable

Modelled after boost::signals2, with the following changes
 - all signals return void; information is passed from the producer to
   the consumer but not back
 - thread-unsafe
 - modern C++ without preprocessor hacks
 - connection lifetime is always managed rather than leaked by default
 - renamed to avoid the funky "slot" name
Message-Id: <20180709172726.5079-1-avi@scylladb.com>
2018-07-09 18:48:44 +01:00
Paweł Dziepak
00a63663d6 bytes_ostream: increase max chunk size to 128 kB
128 kB is the size of the LSA segment and therefore the default size of
any kind of chunks, fragments and buffers.

Message-Id: <20180709155615.22500-1-pdziepak@scylladb.com>
2018-07-09 19:59:51 +03:00
Tomasz Grabiec
1336744a05 mutation_fragment: Fix clustering_row::equal() using incorrect column kind
Incorrect column_kind was passed, which may cause wrong type to be
used for comparison if schema contains static columns. Affects only
tests.

Spotted during code review.
Message-Id: <1531144991-2658-1-git-send-email-tgrabiec@scylladb.com>
2018-07-09 15:25:17 +01:00
Avi Kivity
ed7855a8a6 Update seastar submodule
* seastar 216d499...aac6cf1 (5):
  > reactor: pollable_fd: limit fragment count to IOV_MAX
  > tests: silence more "-Werror=sign-compare" warnings
  > reactor: include <boost/next_prior.hpp>
  > Use `#pragma once` everywhere
  > .gitignore: adds __pycache__ directory
2018-07-09 17:01:44 +03:00
Gleb Natapov
617666efb0 storage_proxy: use logger's exception printer to report read failure
Use existing exception pretty printer since it handles nested
exceptions.

Message-Id: <20180709122826.GT28899@scylladb.com>
2018-07-09 15:31:14 +03:00
Duarte Nunes
156817e00e db/size_estimates_virtual_reader: Use left-exclusive token ranges
We were considering the token ranges in the size_estimates system
table to be inclusive, which is incorrect and incompatible with
Cassandra.

While we ignore the inclusiveness of the partition_range bounds when
selecting sstables, we do take it into account in
estimated_keys_for_range(). We would thus select the correct sstables,
but could over-estimate the range size nonetheless.

Tests: virtual_reader_test(release)

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180709115919.5106-1-duarte@scylladb.com>
2018-07-09 15:26:32 +03:00
Takuya ASADA
1a5a40e5f6 dist/common/scripts/scylla_util.py: use os.open(O_EXCL) to verify disk is unused
To simplify is_unused_disk(), just try to open the disk instead of
checking multiple block subsystems.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180709102729.30066-1-syuu@scylladb.com>
2018-07-09 13:29:15 +03:00
Avi Kivity
7d0df2a06d Update scylla-ami submodule
* dist/ami/files/scylla-ami 67293ba...5200f3f (1):
  > Add custom script options to AMI user-data
2018-07-09 13:21:30 +03:00
Gleb Natapov
ac27d1c93b storage_proxy: fix rpc connection failure handling by read operation
Currently rpc::closed_error is not counted towards replica failure
during read and thus read operation waits for timeout even if one
of the nodes dies. Fix this by counting rpc::closed_error towards
failed attempts.

Fixes #3590.

Message-Id: <20180708123522.GC28899@scylladb.com>
2018-07-09 10:05:31 +03:00
Avi Kivity
2f8537b178 database: demote "Setting compaction strategy" log message to debug level
It's not very helpful in normal operation, and generates much noise,
especially when there are many tables.
Message-Id: <20180708070051.8508-1-avi@scylladb.com>
2018-07-08 10:27:03 +01:00
Avi Kivity
512baf536f storage_proxy: implement write timeouts
Require a timeout parameter for storage_proxy::mutate_begin() and
all its callers (all the way to thrift and cql modification_statement
and batch_statement).

This should fix spurious debug-mode test failures, where overcommit
and general debug slowness result in the default timeouts being
exceeded. Since the tests use infinite timeouts, they should not
time out any more.

Tests: unit (release), with an extra patch that aborts
    when a non-infinite timeout is detected.
Message-Id: <20180707204424.17116-1-avi@scylladb.com>
2018-07-08 10:27:03 +01:00
Takuya ASADA
929ba016ed dist/common/scripts/scylla_util.py: strip double quote from sysconfig parameter
Currently sysconfig_parser.get() returns the parameter with its double
quotes included, which causes problems when appending text with
sysconfig_parser.set().

Fixes #3587

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180706172219.16859-1-syuu@scylladb.com>
2018-07-08 10:47:41 +03:00
Duarte Nunes
1beed0ca16 Merge 'hinted handoff: add rebalancing and unmark as experimental' from Vlad
"
This series adds the last missing part of the HH feature list (as in the design doc) - rebalancing;
and finally removes the "experimental" tag from the HH.
"

* 'hinted_handoff_rebalance-v3' of https://github.com/vladzcloudius/scylla:
  main: remove the "experimental" tag from the hinted handoff feature
  db::hints::manager: implement rebalance() method
2018-07-07 20:38:07 +01:00
Takuya ASADA
a98b4b705c dist/common/scripts/scylla_util.py: make run()/out() functions shorter
Refactored these functions to make them simpler.
2018-07-08 01:13:36 +09:00
Takuya ASADA
e2a032f7ea dist/ami: install python34 to run scylla_install_ami
Since we switched scylla_install_ami to python3, we need to install
python3 before launching the script.
2018-07-08 01:13:36 +09:00
Takuya ASADA
4e04fb7d68 dist/common/scripts/scylla_ec2_check: move ec2 related code to class aws_instance
There is duplicated code in both scylla_ec2_check and class aws_instance
in scylla_util.py, so drop this code from scylla_ec2_check and use
class aws_instance.
2018-07-08 01:13:36 +09:00
Takuya ASADA
99d5ca03e7 dist/common/scripts: drop class concolor, use colorprint()
To print colored console output with simpler code, drop class concolor
and use colorprint() instead.
2018-07-08 01:13:36 +09:00
Takuya ASADA
14d117363b dist/ami/files/.bash_profile: convert almost all lines to python3
Since it's .bash_profile we cannot turn the file itself into a python3
script, but almost all of its logic is rewritten in python3;
.bash_profile just launches it.
2018-07-08 01:13:35 +09:00
Takuya ASADA
25c3249d8d dist/common/scripts: convert node_exporter_install to python3
Convert bash script to python3.
2018-07-08 01:13:35 +09:00
Takuya ASADA
505fcc92f7 dist/common/scripts: convert scylla_stop to python3
Convert bash script to python3.
2018-07-08 01:13:35 +09:00
Takuya ASADA
eb369942bd dist/common/scripts: convert scylla_prepare to python3
Convert bash script to python3.
2018-07-08 01:13:35 +09:00
Vlad Zolotarov
7495c8e56d dist: scylla_lib.sh: get_mode_cpu_set: split the declaration and assignment to the local variable
In bash, a local variable declaration is a separate operation with its own
exit status (always 0), therefore constructs like

local var=`cmd`

will always result in the 0 exit status ($? value) regardless of the actual
result of "cmd" invocation.

To overcome this we should split the declaration and the assignment to be like this:

local var
var=`cmd`

Fixes #3508

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-3-git-send-email-vladz@scylladb.com>
2018-07-07 18:04:19 +03:00
Vlad Zolotarov
f3ca17b1a1 dist: scylla_lib.sh: get_mode_cpu_set: don't let the error messages out
References #3508

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-2-git-send-email-vladz@scylladb.com>
2018-07-07 18:04:18 +03:00
Avi Kivity
e79fccdf7b Update seastar submodule
* seastar d7f35d7...216d499 (10):
  > temporary_buffer: Add clone method()
  > temporary_buffer: Make move-assignment operator noexcept.
  > deleter: Make move-assignment operator noexcept.
  > reactor: don't become inefficient when max_task_backlog is exceeded
  > reactor: switch cumulative time metrics resolution from nanoseconds to milliseconds
  > preempt: annotate for branch prediction
  > tests: silence "-Werror=sign-compare" warnings
  > Merge "Support one I/O Scheduler per device" from Glauber
  > rpc: make rpc server scheduling aware
  > Add SEASTAR_USER_CFLAGS and SEASTAR_ENABLE_WERROR
2018-07-07 17:48:25 +03:00
Vlad Zolotarov
c65a110839 main: remove the "experimental" tag from the hinted handoff feature
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-06 19:19:40 -04:00
Vlad Zolotarov
83ba6d84a1 db::hints::manager: implement rebalance() method
Rebalance hints segments that need to be sent among all present shards.

Ensure that after rebalancing the difference between the number of segments
of any two shards is not greater than 1.

Try to minimize the number of "file rename" operations needed to achieve the desired result.

Note: "Resharding" is a particular case of rebalancing.

Tests: dtest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-06 19:18:46 -04:00
Piotr Sarna
77aa97f62a cql3: fix ALLOW FILTERING iterator
In original series cell iterator for regular cells
was erroneously taken by copy instead of by reference,
which will result in iterating over the first value indefinitely.
Also, the same iterator was not updated for collections,
which is fixed too.
Message-Id: <83297adf8121de4fd37257c87f250d61ea9ec80b.1530892191.git.sarna@scylladb.com>
2018-07-06 17:23:12 +01:00
Duarte Nunes
0ec3ff0611 Merge 'Add ALLOW FILTERING metrics' from Piotr
"
This series addresses issue #3575 by adding 3 ALLOW FILTERING related
metrics to help profile queries:
 * number of read requests that required filtering
 * total number of rows read that required filtering
 * number of rows read that required filtering and matched

Tests: unit (release)
"

* 'allow_filtering_metrics_4' of https://github.com/psarna/scylla:
  cql3: publish ALLOW FILTERING metrics
  cql3: add updating ALLOW FILTERING metrics
  cql3: define ALLOW FILTERING metrics
2018-07-06 11:19:37 +01:00
Piotr Sarna
4a435e6f66 cql3: publish ALLOW FILTERING metrics
ALLOW FILTERING related metrics are registered and published.

Fixes #3575
2018-07-06 12:00:37 +02:00
Piotr Sarna
03f2f8633b cql3: add updating ALLOW FILTERING metrics
Metrics related to ALLOW FILTERING queries are now properly
updated on read requests.
2018-07-06 12:00:29 +02:00
Piotr Sarna
8cb242ab0b cql3: define ALLOW FILTERING metrics
The following metrics are defined for ALLOW FILTERING:
 * number of read requests that required filtering
 * total number of rows read that required filtering
 * number of rows read that required filtering and matched
2018-07-06 10:43:18 +02:00
Glauber Costa
82f7f7b36d database: change indentation
Previous patches used reviewer-oriented indentation. Re-indent.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 17:11:01 -04:00
Glauber Costa
99c8a1917f database: support multiple data directories
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests, if any, to support this.

Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.

For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.

In this design:
 - We scan all data directories for existing data.
 - resharding only happens within a particular data directory.
 - snapshot details are accumulated with data for all directories that
   host snapshots for the tables we are examining
 - snapshots are created with files in their own directories, but the
   manifest file goes to the main directory. For this one, note that in
   Cassandra the same thing happens, except that there is no "main"
   directory; the manifest file is still just in one of them.
 - SSTables are flushed into the main directory.
 - Compactions write data into the main directory

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:39 -04:00
Glauber Costa
3b46984a1e database: allow resharding to specify a directory
resharding assumes that all SSTables will be in cf->dir(), but in
reality we will soon have SSTables in other places. Add a directory
parameter to get_all_shared_sstables and pass that directory from the
resharding process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
c8b2d441a8 database: support multiple directories in get_snapshot_details
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
a8ccf4d1e6 database: move get_snapshot_info into a seastar::thread
I am about to add another level of indentation and this code already
shifts right too much. It is not performance critical, so let's use a
thread for that. seastar::threads did not exist when this was first
written.

Also remove one unused continuation from inside the inner scan,
simplifying its code.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
919c7d6bb9 snapshots: always create the snapshot directory
We currently don't always create the snapshot directory as an
optimization. We have a test at sync time handling this use case.

This works well when all SSTables are created in the same directory, but
if we have more than one data directory then it may not work if we don't
have SSTables in all data directories.

We can fix it by unconditionally creating the directory.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:58:08 -04:00
Glauber Costa
86239e4e22 sstables: pass sstable dir with entry descriptor
We have been assuming that all SSTables for a table will be in the same
directory, and we pass the directory name to make_descriptor only
because that's the way in ka and la to find out the keyspace and table
names.

However, SSTables for a given column family could be spread into
multiple directories. So let's pass it down with the descriptor so we
can load from the right place.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:45:26 -04:00
Glauber Costa
25a02c61d6 database: make nodetool listsnapshots print correct information
nodetool listsnapshots is currently printing zero sizes for all snapshots
The reason is that we move the snapshot directory name in the lambda's
capture list, which the compiler may evaluate before the same name is
used as the function argument.

Fixes #3572

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:20:07 -04:00
Glauber Costa
4a62866104 sstables: correctly create descriptors for snapshots
Our regular expression for parsing SSTable files tests for the directory
for the la file format, since that file format does not include the
ks/cf pair in the file name itself.

However, the regular expression does not cover the case in which the
SSTable files are coming from snapshots. This patch extends the regex so
they are also covered.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-07-05 16:19:09 -04:00
Raphael S. Carvalho
dfd1e1229e sstables/compaction_manager: fix typo in function name to reevaluate postponed compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180702185343.26682-1-raphaelsc@scylladb.com>
2018-07-05 18:54:14 +03:00
Takuya ASADA
4df982fe07 dist/common/scripts/scylla_sysconfig_setup: fix typo
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705133313.16934-1-syuu@scylladb.com>
2018-07-05 16:38:14 +03:00
Avi Kivity
7a1bcd9ad3 Merge "Improve mutation printing in GDB" from Tomasz
"
This is a series of patches which make it possible for a human to examine
contents of cache or memtables from GDB.
"

* 'tgrabiec/gdb-cache-printers' of github.com:tgrabiec/scylla:
  gdb: Add pretty printer for managed_vector
  gdb: Add pretty printer for rows
  gdb: Add mutation_partition pretty printer
  gdb: Add pretty printer for partition_entry
  gdb: Add pretty printer for managed_bytes
  gdb: Add iteration wrapper for intrusive_set_external_comparator
  gdb: Add iteration wrapper for boost intrusive set
2018-07-05 14:08:58 +03:00
Avi Kivity
f55a2fe3a7 main: improve reporting of dns resolution errors
A report that C-Ares returned some errors tells the user nothing.

Improve the error message by including the name of the configuration
variable and its value.
Message-Id: <20180705084959.10872-1-avi@scylladb.com>
2018-07-05 10:24:41 +01:00
Duarte Nunes
c126b00793 Merge 'ALLOW FILTERING support' from Piotr
"
The main idea of this series is to provide a filtering_visitor
as a specialised result_set_builder::visitor implementation
that keeps restriction info and applies it on query results.
Also, since allow_filtering checking is not correct now (e.g. #2025)
on select_statement level, this series tries to fix any issues
related to it.

Still in TODO:
 * handling CONTAINS relation in single column restriction filtering
 * handling multi-column restrictions - especially EQ, which can be
   split into multiple single-column restrictions
 * more tests - it's never enough; especially esoteric cases
   like filtering queries which also use secondary indexes,
   paging tests, etc.

Tests: unit (release)
"

* 'allow_filtering_6' of https://github.com/psarna/scylla:
  tests: add allow_filtering tests to cql_query_test
  cql3: enable ALLOW FILTERING
  service: add filtering_pager
  cql3: optimize filtering partition keys and static rows
  cql3: add filtering visitor
  cql3: move result_set_builder functions to header
  cql3: amend need_filtering()
  cql3: add single column primary key restrictions getters
  cql3: expose single column primary key restrictions
  cql3: add needs_filtering to primary key restrictions
  cql3: add simpler single_column_restriction::is_satisfied_by
2018-07-05 10:18:08 +01:00
Piotr Sarna
a7dd02309f tests: add allow_filtering tests to cql_query_test
Test cases for ALLOW FILTERING are added to cql_query_test suite.
2018-07-05 10:50:43 +02:00
Piotr Sarna
27bf20aa3f cql3: enable ALLOW FILTERING
Enables 'ALLOW FILTERING' queries by transferring control
to result_set_builder::filtering_visitor.
Both regular and primary key columns are allowed,
but some things are left unimplemented:
 - multi-column restrictions
 - CONTAINS queries

Fixes #2025
2018-07-05 10:50:43 +02:00
Piotr Sarna
7b018f6fd6 service: add filtering_pager
For paged results of an 'ALLOW FILTERING' query, a filtering pager
is provided. It's based on a filtering_visitor for result_builder.
2018-07-05 10:50:43 +02:00
Piotr Sarna
a08fba19e3 cql3: optimize filtering partition keys and static rows
If any restriction on partition key or static row part fails,
it will be so for every row that belongs to a partition.
Hence, full check of the rest of the rows is skipped.
2018-07-05 10:50:43 +02:00
Piotr Sarna
2a0b720102 cql3: add filtering visitor
In order to filter results of an 'ALLOW FILTERING' query,
a visitor that can take optional filter for result_builder
is provided. It defaults to nop_filter, which accepts
all rows.
2018-07-05 10:50:43 +02:00
Piotr Sarna
1cf5653f89 cql3: move result_set_builder functions to header
Moving function definitions to header is a preparation step
before turning result_set_builder into a template.
2018-07-05 10:50:43 +02:00
Piotr Sarna
4d3d32f465 cql3: amend need_filtering()
Previous implementation of need_filtering() was too eager to assume
that index query should be used, whereas sometimes a query should
just be filtered.
2018-07-05 10:50:39 +02:00
Avi Kivity
dd083122f9 Update scylla-ami submodule
* dist/ami/files/scylla-ami 0fd9d23...67293ba (1):
  > scylla_install_ami: fix broken argument parser

Fixes #3578.
2018-07-05 09:48:06 +03:00
Avi Kivity
f4caa418ff Merge "Fix the "LCS data-loss bug"" from Botond
"
This series fixes the "LCS data-loss bug" where full scans (and
everything that uses them) would miss some small percentage (> 0.001%)
of the keys. This could easily lead to permanent data-loss as compaction
and decommission both use full scans.
aeffbb673 worked around this bug by disabling the incremental reader
selectors (the class identified as the source of the bug) altogether.
This series fixes the underlying issue and reverts aeffbb673.

The root cause of the bug is that the `incremental_reader_selector` uses
the current read position to poll for new readers using
`sstable_set::incremental_selector::select()`. This means that when the
currently open sstables contain no partitions that would intersect with
some of the yet unselected sstables, those sstables would be ignored.
Solve the problem by not calling `select()` with the current read
position and always passing the `next_position` returned in the previous
call. This means that the traversal of the sstable-set happens at a pace
defined by the sstable-set itself and this guarantees that no sstable
will be jumped over. When asked for new readers the
`incremental_reader_selector` will now iteratively call `select()` using
the `next_position` from the previous `select()` call until it either
receives some new, yet unselected sstables, or `next_position` surpasses
the read position (in which case `select()` will be tried again later).
The `sstable_set::incremental_selector` was not suitable in its present
state to support calling `select()` with the `next_position` from a
previous call as in some cases it could not make progress due to
inclusiveness related ambiguities. So in preparation to the above fix
`sstable_set` was updated to work in terms of ring-position instead of
tokens. Ring-position can express positions in a much more fine-grained
way than a token, including positions after/before tokens and keys. This
allows for a clear expression of `next_position` such that calling
`select()` with it guarantees forward progress in the token-space.

Tests: unit(release, debug)

Refs: #3513
"

* 'leveled-missing-keys/v4' of https://github.com/denesb/scylla:
  tests/mutation_reader_test: combined_mutation_reader_test: use SEASTAR_THREAD_TEST_CASE
  tests/mutation_reader_test: refactor combined_mutation_reader_test
  tests/mutation_reader_test: fix reader_selector related tests
  Revert "database: stop using incremental selectors"
  incremental_reader_selector: don't jump over sstables
  mutation_reader: reader_selector: use ring_position instead of token
  sstables_set::incremental_selector: use ring_position instead of token
  compatible_ring_position: refactor to compatible_ring_position_view
  dht::ring_position_view: use token_bound from ring_position
  i_partitioner: add free function ring-position tri comparator
  mutation_reader_merger::maybe_add_readers(): remove early return
  mutation_reader_merger: get rid of _key
2018-07-05 09:33:12 +03:00
Takuya ASADA
3bcc123000 dist/ami: hardcode target for scylla_current_repo since we don't have --target option anymore
We broke build_ami.sh when we dropped Ubuntu support: the scylla_current_repo
command does not finish because it is given too few arguments ('--target' with
no distribution name, since $TARGET is always blank now).
The target needs to be hardcoded as centos.

Fixes #3577

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705035251.29160-1-syuu@scylladb.com>
2018-07-05 09:31:43 +03:00
Paweł Dziepak
07a429e837 test.py: do not disable human-readable format with --jenkins flag
When test.py is run with --jenkins flag Boost UTF is asked to generate
an XML file with the test results. This automatically disables the
human-readable output printed to stdout. There is no real reason to do
so and it is actually less confusing when the Boost UTF messages are in
the test output together with Scylla logger messages.

Message-Id: <20180704172913.23462-1-pdziepak@scylladb.com>
2018-07-05 09:31:15 +03:00
Raphael S. Carvalho
7d6af5da3a sstables/compaction_manager: properly reevaluate postponed compactions for leveled strategy
Function to reevaluate postponed compaction was called too early for strategies that
don't allow parallel compaction (only leveled strategy (LCS) at this moment).
Such strategies must first have the ongoing compaction deregistered before reevaluating
the postponed ones. The manager uses its task list of ongoing compactions to decide
whether there's an ongoing compaction for a given column family. So compaction could
stop making progress altogether *if and only if* we stop flushing new data.

So it could happen that a column family would be left with lots of pending compactions,
leading the user to think all compaction is done, but after reboot, there will be
lots of compaction activity.

We'll both improve the method to detect parallel compaction here and also add a call
to reevaluate postponed compactions after a compaction is done.

Fixes #3534.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180702185327.26615-1-raphaelsc@scylladb.com>
2018-07-04 16:30:21 +01:00
Botond Dénes
b32f94d31e tests/mutation_reader_test: combined_mutation_reader_test: use SEASTAR_THREAD_TEST_CASE 2018-07-04 17:42:37 +03:00
Botond Dénes
77ad085393 tests/mutation_reader_test: refactor combined_mutation_reader_test
Make combined_mutation_reader_test more interesting:
* Set the levels on the sstables
* Arrange the sstables so that they test for the "jump over sstables"
bug.
* Arrange the sstables so that they test for the "gap between sstables".

While at it also make the code more compact.
2018-07-04 17:42:37 +03:00
Botond Dénes
4b57fc9aea tests/mutation_reader_test: fix reader_selector related tests
Don't assume the partition keys use lexical ordering. Add some
additional checks.
2018-07-04 17:42:37 +03:00
Botond Dénes
a9c465d7d2 Revert "database: stop using incremental selectors"
The data-loss bug is fixed, the incremental selector can be used again.

This reverts commit aeffbb6732.
2018-07-04 17:42:37 +03:00
Botond Dénes
c37aff419e incremental_reader_selector: don't jump over sstables
Passing the current read position to
`incremental_selector::select()` can lead to "jumping" over sstables.
This can happen when the currently open sstables have no partition that
intersects with a yet unselected sstable that nevertheless has an
intersecting range; in other words, there is a gap in the selected sstables
that this unselected one completely fits into. In this case the
unselected sstable will be completely omitted from the read.
The solution is to avoid calling `select()` with a position that
is larger than the `next_position` returned from the previous `select()`
call. Instead, call `select()` repeatedly with the `next_position` from
the previous call, until either at least one new sstable is selected or
the current read position is surpassed. This guarantees that no sstables
will be jumped over; in other words, the incremental selector advances
at a pace defined by itself.
2018-07-04 17:42:37 +03:00
Botond Dénes
81a03db955 mutation_reader: reader_selector: use ring_position instead of token
sstable_set::incremental_selector was migrated to ring_position; follow
suit and migrate the reader_selector to use ring_position as well. Beyond
correctness, this also improves efficiency in the case of dense tables,
avoiding prematurely selecting sstables that share a token but start
at different keys, although one could argue that this is a niche case.
2018-07-04 17:42:37 +03:00
Botond Dénes
a8e795a16e sstables_set::incremental_selector: use ring_position instead of token
Currently `sstable_set::incremental_selector` works in terms of tokens.
Sstables can be selected with tokens and internally the token-space is
partitioned (in `partitioned_sstable_set`, used for LCS) with tokens as
well. This is problematic for several reasons.
The sub-range sstables cover from the token-space is defined in terms of
decorated keys. It is even possible that multiple sstables cover
multiple non-overlapping sub-ranges of a single token. The current
system is unable to model this and will at best result in selecting
unnecessary sstables.
The usage of token for providing the next position where the
intersecting sstables change [1] causes further problems. Attempting to
walk over the token-space by repeatedly calling `select()` with the
`next_position` returned from the previous call will quite possibly lead
to an infinite loop as a token cannot express inclusiveness/exclusiveness
and thus the incremental selector will not be able to make progress when
the upper and lower bounds of two neighbouring intervals share the same
token with different inclusiveness e.g. [t1, t2](t2, t3].

To solve these problems, update the incremental_selector to work in terms of
ring position. This makes it possible to partition the token-space
among sstables at decorated key granularity. It also makes it possible
for select() to return a next_position that is guaranteed to make
progress.

partitioned_sstable_set now builds the internal interval map using the
decorated key of the sstables, not just the tokens.
incremental_selector::select() now uses `dht::ring_position_view` as
both the selector and the next_position. ring_position_view can express
positions between keys so it can also include information about
inclusiveness/exclusiveness of the next interval guaranteeing forward
progress.

[1] `sstable_set::incremental_selector::selection::next_position`
2018-07-04 17:42:33 +03:00
Duarte Nunes
33d7de0805 Merge 'Expose sharding information to connections' from Avi
"
In the same way that drivers can route requests to a coordinator that
is also a replica of the data used by the request, we can allow
drivers to route requests directly to the shard. This patchset
adds and documents a way for drivers to know which shard a connection
is connected to, and how to perform this routing.
"

* tag 'shard-info-alt/v1' of https://github.com/avikivity/scylla:
  doc: documented protocol extension for exposing sharding
  transport: expose more information about sharding via the OPTIONS/SUPPORTED messages
  dht: add i_partitioner::sharding_ignore_msb()
2018-07-04 13:01:21 +01:00
Botond Dénes
8084ce3a8e query_pager: use query::is_single_partition() to check for singular range
Use query::is_single_partition() to check whether the queried ranges are
singular or not. The current method of using
`dht::partition_range::is_singular()` is incorrect, as it is possible to
build a singular range that doesn't represent a single partition.
`query::is_single_partition()` correctly checks for this so use it
instead.

Found during code-review.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f671f107e8069910a2f84b14c8d22638333d571c.1530675889.git.bdenes@scylladb.com>
2018-07-04 10:04:50 +01:00
Takuya ASADA
3cb7ddaf68 dist/debian/build_deb.sh: make build_deb.sh more simplified
Use is_debian()/is_ubuntu() to detect the target distribution; also install
pystache by path, since the package name differs between Fedora and
CentOS.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703193224.4773-1-syuu@scylladb.com>
2018-07-04 11:12:26 +03:00
Takuya ASADA
ed1d0b6839 dist/ami/files/.bash_profile: drop Ubuntu support
Drop Ubuntu support on login prompt, too.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703192813.4589-1-syuu@scylladb.com>
2018-07-04 11:12:26 +03:00
Piotr Sarna
f42eaff75e cql3: add single column primary key restrictions getters
Getters for single column partition/clustering key restrictions
are added to statement_restrictions.
2018-07-04 09:48:32 +02:00
Piotr Sarna
a99acbc376 cql3: expose single column primary key restrictions
Underlying single_column_restrictions are exposed
for single_column_primary_key_restrictions via a const method.
2018-07-04 09:48:32 +02:00
Piotr Sarna
f7a2f15935 cql3: add needs_filtering to primary key restrictions
Primary key restrictions sometimes require filtering. These functions
return true if ALLOW FILTERING needs to be enabled in order to satisfy
these restrictions.
2018-07-04 09:48:32 +02:00
Piotr Sarna
6aec9e711f cql3: add simpler single_column_restriction::is_satisfied_by
Currently restriction::is_satisfied_by() accepts only keys and rows
as arguments. In this commit, a version that only takes bytes of data
is provided.
This simpler version applies to single_column_restriction only,
because it compares raw bytes underneath anyway. For other restriction
types, simplified is_satisfied_by is not defined.
2018-07-04 09:48:32 +02:00
Botond Dénes
bf2645c616 compatible_ring_position: refactor to compatible_ring_position_view
compatible_ring_position's sole purpose is to allow creating
boost::icl::interval_map with dht::ring_position as the key and list of
sstables as the value. This function is served equally well if
compatible_ring_position wraps a `dht::ring_position_view` instead of a
`dht::ring_position` with the added benefit of not having to copy the
possibly heavy `dht::decorated_key` around. It also makes it possible
to do lookups with `dht::ring_position_view` which is much more
versatile and allows avoiding copies just to make lookups.
The only downside is that `dht::ring_position_view` requires the
lifetime of the "viewed" object to be taken care of. This is not a
concern, however: as long as an interval is present in the map, the
represented sstable is guaranteed to be alive too, as the interval map
participates in the ownership of the stored sstables.

Rename compatible_ring_position to compatible_ring_position_view to
reflect the changes.
While at it upgrade the std::experimental::optional to std::optional.
2018-07-04 08:19:39 +03:00
Botond Dénes
48b07ba5d3 dht::ring_position_view: use token_bound from ring_position
Currently dht::ring_position_view's dht::token constructor takes the
token bound in the form of a raw `uint8_t`. This allows for passing a
weight of "0", which is illegal, as a single token does not represent a
single ring position but an interval, since an arbitrary number of keys can
have the same token. dht::ring_position uses an enum in its dht::token
constructor. Import that same enum into the dht::ring_position_view
scope and take a `token_bound` instead of `uint8_t`.
This is especially important as in later patches the internal weight of
the ring_position_view will be exposed and illegal values can cause all
sorts of problems.
2018-07-04 08:19:34 +03:00
Alexys Jacob
8c03c1e2ce Support Gentoo Linux on node_health_check script.
Gentoo Linux was not supported by the node_health_check script,
which resulted in the following error message being displayed:

"This s a Non-Supported OS, Please Review the Support Matrix"

This patch adds support for Gentoo Linux while adding a TODO note
to add support for authenticated clusters which the script does
not support yet.

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180703124458.3788-1-ultrabug@gentoo.org>
2018-07-03 20:18:13 +03:00
Tomasz Grabiec
2ffb621271 Merge "Fix atomic_cell_or_collection::external_memory_usage()" from Paweł
After the transition to the new in-memory representation in
aab6b0ee27 'Merge "Introduce new in-memory
representation for cells" from Paweł'
atomic_cell_or_collection::external_memory_usage() stopped accounting
for the externally stored data. Since it wasn't covered by the unit
tests, the bug remained unnoticed until now.

This series fixes the memory usage calculation and adds proper unit
tests.

* https://github.com/pdziepak/scylla.git fix-external-memory-usage/v1:
  tests/mutation: properly mark atomic_cells that are collection members
  imr::utils::object: expose size overhead
  data::cell: expose size overhead of external chunks
  atomic_cell: add external chunks and overheads to
    external_memory_usage()
  tests/mutation: test external_memory_usage()
2018-07-03 14:58:10 +02:00
Botond Dénes
c236a96d7d tests/cql_query_test: add unit test for querying empty ranges
A bug was found recently (#3564) in the paging logic, where the code
assumed the queried ranges list is non-empty. This assumption is
incorrect as there can be valid (if rare) queries that can result in the
ranges list to be empty. Add a unit test that executes such a query with
paging enabled to detect any future bugs related to assumptions about
the ranges list being non-empty.

Refs: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f5ba308c4014c24bb392060a7e72e7521ff021fa.1530618836.git.bdenes@scylladb.com>
2018-07-03 13:43:17 +01:00
Botond Dénes
59a30f0684 query_pager: be prepared for _ranges being empty
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not, it
checks whether the query is for singular partitions or a range scan
to decide whether to enable stateful queries or not. This check
assumed that there is at least one range in _ranges, which does not hold
under some circumstances. Add a check for _ranges being empty.

Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
2018-07-03 11:05:01 +01:00
Avi Kivity
eafd16266d tests: reduce multishard_mutation_test runtime in debug mode
Debug mode is so slow that generating 1000 mutations is too much for it.
High memory use can also confuse the sanitizers that track each allocation.

Reduce mutation count from 1000 to 10 in debug mode.
2018-07-03 12:01:44 +03:00
Avi Kivity
a36b1f1967 Merge "more scylla_setup fixes" from Takuya
"
Added NIC / Disk existence check, --force-raid mode on
scylla_raid_setup.
"

* 'scylla_setup_fix4' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_raid_setup: verify specified disks are unused
  dist/common/scripts/scylla_raid_setup: add --force-raid to construct raid even only one disk is specified
  dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
  dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
  dist/common/scripts/scylla_sysconfig_setup: verify NIC existence
2018-07-03 11:03:08 +03:00
Takuya ASADA
d0f39ea31d dist/common/scripts/scylla_raid_setup: verify specified disks are unused
Currently only scylla_setup interactive mode verifies selected disks are
unused; in non-interactive mode we get an mdadm/mkfs.xfs program error and
a Python backtrace when disks are busy.

So we should also verify that disks are unused in scylla_raid_setup, and
print out a simpler error message.
2018-07-03 14:50:34 +09:00
Takuya ASADA
3289642223 dist/common/scripts/scylla_raid_setup: add --force-raid to construct raid even only one disk is specified
A user may want to start a RAID volume with only one disk; add an option to
force constructing RAID even when only one disk is specified.
2018-07-03 14:50:34 +09:00
Takuya ASADA
e0c16c4585 dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
Ignore the input when the specified path is not a block device.
2018-07-03 14:50:34 +09:00
Takuya ASADA
24ca2d85c6 dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
Verify that disk paths are block devices, and exit with an error if not.
2018-07-03 14:50:34 +09:00
Takuya ASADA
99b5cf1f92 dist/common/scripts/scylla_sysconfig_setup: verify NIC existence
Verify NIC existence before writing the sysconfig file, to prevent causing
errors while running scylla.

See #2442
2018-07-03 14:50:34 +09:00
Takuya ASADA
084c824d12 scripts: merge scylla_install_pkg to scylla-ami
scylla_install_pkg was initially written for the one-liner installer, but now
it is only used for creating the AMI, and it is just a few lines of code, so
it should be merged into the scylla_install_ami script.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-2-syuu@scylladb.com>
2018-07-02 13:20:09 +03:00
Takuya ASADA
fafcacc31c dist/ami: drop Ubuntu AMI support
Drop Ubuntu AMI since it's not maintained for a long time, and we have
no plan to officially provide it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-1-syuu@scylladb.com>
2018-07-02 13:20:08 +03:00
Avi Kivity
677991f353 Update scylla-ami submodule
* dist/ami/files/scylla-ami 36e8511...0fd9d23 (2):
  > scylla_install_ami: merge scylla_install_pkg
  > scylla_install_ami: drop Ubuntu AMI
2018-07-02 13:19:34 +03:00
Botond Dénes
01bd34d117 i_partitioner: add free function ring-position tri comparator
Having to create an object just to compare two ring positions (or views)
is annoying and unnecessary. Provide a free function version as well.
2018-07-02 11:41:09 +03:00
Botond Dénes
78ecf2740a mutation_reader_merger::maybe_add_readers(): remove early return
It's unnecessary (doesn't prevent anything). The code without it
expresses intent better (and is shorter by two lines).
2018-07-02 11:41:09 +03:00
Botond Dénes
d26b35b058 mutation_reader_merger: get rid of _key
`_key` is only used in a single place and this does not warrant storing
it in a member. Also get rid of current_position() which was used to
query `_key`.
2018-07-02 11:40:43 +03:00
Avi Kivity
0b148d0070 Merge "scylla_setup fixes" from Takuya
"
I found problems in the previously submitted patchsets 'scylla_setup fixes'
and 'more fixes for scylla_setup', so I fixed them and merged them into one
patchset.

Also added few more patches.
"

* 'scylla_setup_fix3' of https://github.com/syuu1228/scylla:
  dist/common/scripts/scylla_setup: allow input multiple disk paths on RAID disk prompt
  dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk specified
  dist/common/scripts/scylla_raid_setup: fix module import
  dist/common/scripts/scylla_setup: check disk is used in MDRAID
  dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer on scylla_fstrim_setup
  dist/common/scripts/scylla_setup: use print() instead of logging.error()
  dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
  dist/common/scripts/scylla_coredump_setup: run os.remove() when deleting directory is symlink
  dist/common/scripts/scylla_setup: don't include the disk on unused list when it contains partitions
  dist/common/scripts/scylla_setup: skip running rest of the check when the disk detected as used
  dist/common/scripts/scylla_setup: add a disk to selected list correctly
  dist/common/scripts/scylla_setup: fix wrong indent
  dist/common/scripts: sync instance type list for detect NIC type to latest one
  dist/common/scripts: verify systemd unit existence using 'systemctl cat'
2018-07-02 10:21:49 +03:00
Avi Kivity
a45c3aa8c7 Merge "Fix handling of stale write replies in storage_proxy" from Gleb
"
If a coordinator sends write requests with ID=X and restarts it may get a reply to
the request after it restarts and sends another request with the same ID (but to
different replicas). This condition will trigger an assert in a coordinator. Drop
the assertion in favor of a warning and initialize handler id in a way to make
this situation less likely.

Fixes: #3153
"

* 'gleb/write-handler-id' of github.com:scylladb/seastar-dev:
  storage_proxy: initialize write response id counter from wall clock value
  storage_proxy: drop virtual from signal(gms::inet_address)
  storage_proxy: do not assert on getting an unexpected write reply
2018-07-01 17:59:54 +03:00
Gleb Natapov
19e7493d5b storage_proxy: initialize write response id counter from wall clock value
Initializing the write response id to the same value on each reboot may
cause a stale id to be taken for an active one if the node restarts after
sending only a couple of write requests and before receiving the replies.
On the next reboot it will start assigning ids from the same value, and
receiving old replies will confuse it. Mitigate this by setting the
initial id to the wall clock value in milliseconds. This will not solve
the problem completely, but will mitigate it.
2018-07-01 17:24:40 +03:00
Nadav Har'El
3194ce16b3 repair: fix combination of "-pr" and "-local" repair options
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.

Fixes #3557.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
2018-07-01 16:39:33 +03:00
Gleb Natapov
569437aaa5 storage_proxy: drop virtual from signal(gms::inet_address)
The function is not overridden, so should not be virtual.
2018-07-01 16:35:59 +03:00
Gleb Natapov
5ee09e5f3b storage_proxy: do not assert on getting an unexpected write reply
In theory we should not get a write reply from a node we did not send a
write to, but in practice a stale reply can be received if the node reboots
between sending a write and getting the reply. Do not assert; log a
warning instead and ignore the reply.

Fixes: #3153
2018-07-01 16:35:09 +03:00
Tomasz Grabiec
b464b66e90 row_cache: Fix memtable reads concurrent with cache update missing writes
Introduced in 5b59df3761.

It is incorrect to erase entries from the memtable being moved to
cache if partition update can be preempted because a later memtable
read may create a snapshot in the memtable before memtable writes for
that partition are made visible through cache. As a result the read
may miss some of the writes which were in the memtable. The code was
checking for presence of snapshots when entering the partition, but
this condition may change if update is preempted. The fix is to not
allow erasing if update is preemptible.

This also caused SIGSEGVs because we were assuming that no such
snapshots will be created and hence were not invalidating iterators on
removal of the entries, which results in undefined behavior when such
snapshots are actually created.

Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test

Fixes #3532

Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>
2018-07-01 15:36:05 +03:00
Avi Kivity
f3da043230 Merge "Make in-memory partition version merging preemptable" from Tomasz
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.

The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".

When the last user of a snapshot goes away, the snapshot is slid to the
oldest unreferenced version first so that the version is no longer reachable
from partition_entry::read(). The cleaner will then keep merging preceding
(newer) versions into it, until it merges a version which is referenced. The
merging is preemptable. If the initial merging is preempted, the snapshot is
enqueued into the cleaner, the worker woken up, and merging will continue
asynchronously.

When memtable is merged with cache, its cleaner is merged with cache cleaner,
so any outstanding background merges will be continued by the cache cleaner
without disruption.

This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"

* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
  tests: perf_row_cache_update: Test with an active reader surviving memtable flush
  memtable, cache: Run mutation_cleaner worker in its own scheduling group
  mutation_cleaner: Make merge() redirect old instance to the new one
  mvcc: Use RAII to ensure that partition versions are merged
  mvcc: Merge partition version versions gradually in the background
  mutation_partition: Make merging preemptable
  tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
2018-07-01 15:32:51 +03:00
Avi Kivity
8eba27829a doc: documented protocol extension for exposing sharding
Document a protocol extension that exposes the sharding algorithm
to drivers, and recommend how to use it to achieve connection-per-core.
2018-07-01 15:26:30 +03:00
Avi Kivity
28d064e7c0 transport: expose more information about sharding via the OPTIONS/SUPPORTED messages
Provide all information needed for a connection pool to set up a connection
per shard.
2018-07-01 15:26:28 +03:00
Botond Dénes
5fd9c3b9d4 tests/mutation_reader_test: require min shard-count for multishard tests
Tests testing different aspects of `foreign_reader` and
`multishard_combining_reader` are designed to run with a certain minimum
shard count. Running them with any shard count below this minimum makes
them useless at best, and can even make them fail.
Refuse to run these tests when the shard count is below the required
minimum to avoid an accidental and unnecessary investigation into a
false-positive test failure.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d24159415b6a9d74eafb8355b6e3fba98c1ff7ff.1530274392.git.bdenes@scylladb.com>
2018-07-01 12:44:41 +03:00
Avi Kivity
f73340e6f8 Merge "Index reader and associated types clean-up." from Vladimir
"
This patchset paves the way for support for reading SSTables 3.x index files.
It aims at streamlining and tidying up the existing index_reader and
helpers and brings no functional or high-level changes.

In v3:
  - do not capture 'found' and just return 'true' in the continuation
    inside advance_and_check_if_present()
  - split code that makes the use of advance_upper_past() internal-only
    into two commits for better readability

GitHub URL: https://github.com/argenet/scylla/tree/projects/sstables-30/index_reader_cleanup/v3

Tests: unit {release}

Performance tests (perf_fast_forward) did not reveal any noticeable
changes. The complete output is below.

========================================
Original code (before the patchset)
========================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.336514   1000000    2971642   1000     126956      35       0        0        0        0        0        0        0  99.5%
1       1         1.411239    500000     354299    993     127056       2       0        0        1        1        0        0        0  99.9%
1       8         0.464468    111112     239224    993     127056       2       0        0        1        1        0        0        0  99.8%
1       16        0.330490     58824     177990    993     127056      12       0        0        1        1        0        0        0  99.7%
1       32        0.257010     30304     117910    993     127056      15       0        0        1        1        0        0        0  99.7%
1       64        0.213650     15385      72010    997     127072     268       0        0        3        3        0        0        0  99.5%
1       256       0.159498      3892      24402    993     127056     245       0        0        1        1        0        0        0  95.5%
1       1024      0.088678       976      11006    993     127056     347       0        0        1        1        0        0        0  63.4%
1       4096      0.082627       245       2965    649      22452     389     252        0        1        1        0        0        0  20.0%
64      1         0.411080    984616    2395191   1059     127056      57       1        0        1        1        0        0        0  99.1%
64      8         0.390130    888896    2278461    993     127056       2       0        0        1        1        0        0        0  99.8%
64      16        0.369033    800000    2167828    993     127056       3       0        0        1        1        0        0        0  99.8%
64      32        0.338126    666688    1971714    993     127056      10       0        0        1        1        0        0        0  99.7%
64      64        0.297335    500032    1681711    997     127072      18       0        0        3        3        0        0        0  99.7%
64      256       0.199420    200000    1002910    993     127056     211       0        0        1        1        0        0        0  99.5%
64      1024      0.113953     58880     516704    993     127056     284       0        0        1        1        0        0        0  64.1%
64      4096      0.094596     15424     163051    687      23684     415     248        0        1        1        0        0        0  23.7%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000586         1       1706      3        164       2       1        0        1        1        0        0        0   9.0%
0       32        0.000587        32      54539      3        164       2       1        0        1        1        0        0        0   9.9%
0       256       0.000688       256     372343      4        196       2       1        0        1        1        0        0        0  20.7%
0       4096      0.004320      4096     948185     19        676      10       1        0        1        1        0        0        0  36.7%
500000  1         0.000882         1       1134      5        228       3       2        0        1        1        0        0        0  14.3%
500000  32        0.000881        32      36321      5        228       3       2        0        1        1        0        0        0  14.3%
500000  256       0.000961       256     266386      6        260       3       2        0        1        1        0        0        0  21.9%
500000  4096      0.003127      4096    1309805     21        740      14       2        0        1        1        0        0        0  54.0%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000639         1       1564      3        164       2       0        0        1        1        0        0        0  13.9%
0       32        0.000626        32      51154      3        164       2       0        0        1        1        0        0        0  15.3%
0       256       0.000716       256     357560      4        168       2       0        0        1        1        0        0        0  23.1%
0       4096      0.003681      4096    1112743     16        680       8       1        0        1        1        0        0        0  38.5%
500000  1         0.000966         1       1035      4        424       3       2        0        1        1        0        0        0  12.4%
500000  32        0.000911        32      35121      5        296       3       1        0        1        1        0        0        0  13.1%
500000  256       0.000978       256     261645      5        296       3       1        0        1        1        0        0        0  19.1%
500000  4096      0.003155      4096    1298139     11        744       6       1        0        1        1        0        0        0  44.5%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000756         1       1323      4        484       2       0        0        1        1        0        0        0  11.3%
0       32        0.000625        32      51174      3        164       2       0        0        1        1        0        0        0  15.5%
0       256       0.000705       256     363337      4        196       2       0        0        1        1        0        0        0  24.3%
0       4096      0.003603      4096    1136829     16        900       8       1        0        1        1        0        0        0  44.4%
500000  1         0.000880         1       1136      5        228       3       3        0        1        1        0        0        0  12.6%
500000  32        0.000882        32      36268      5        228       3       1        0        1        1        0        0        0  14.0%
500000  256       0.000965       256     265178      6        260       3       1        0        1        1        0        0        0  20.8%
500000  4096      0.003098      4096    1322024     21        740      14       2        0        1        1        0        0        0  54.6%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000631         1       1585      3        164       2       2        0        1        1        0        0        0  15.2%
500000  2         0.000873         2       2291      5        228       3       2        0        1        1        0        0        0  13.2%
250000  4         0.001404         4       2850      9        356       5       4        0        1        1        0        0        0  11.9%
125000  8         0.002878         8       2779     21        740      13       8        0        1        1        0        0        0  15.5%
62500   16        0.005184        16       3087     41       1380      25      16        0        1        1        0        0        0  19.3%
2       500000    0.948899    500000     526926   1040     127056      39       0        0        1        1        0        0        0  99.9%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.001813         2       1103     11       1380       3       8        0        1        1        0        0        0  18.5%
no        0.000922         2       2170      5        228       3       1        0        1        1        0        0        0  14.1%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.023396   1000000     977139   1104     139668      12       0        0        2        2        0        0        0  99.7%
-> 1       1         2.176794    500000     229696   6200     177660    5109       0        0     5108     7679        0        0        0  69.9%
-> 1       8         1.130179    111112      98314   6200     177660    5109       0        0     5108     9647        0        0        0  41.5%
-> 1       16        0.972022     58824      60517   6200     177660    5109       0        0     5108     9913        0        0        0  32.0%
-> 1       32        0.880783     30304      34406   6201     177664    5110       0        0     5108    10057        0        0        0  25.2%
-> 1       64        0.829019     15385      18558   6199     177656    5108       0        0     5107    10135        0        0        0  20.4%
-> 1       256       2.248487      3892       1731   5028     168948    3937       0        0     3936     7801        0        0        0   4.6%
-> 1       1024      0.342806       976       2847   2076     146948     985     105        0      984     1955        0        0        0   9.3%
-> 1       4096      0.088605       245       2765    739      18152     492     246        0      247      490        0        0        0  11.1%
-> 64      1         1.796715    984616     548009   6274     177660    5120       0        0     5108     5187        0        0        0  63.1%
-> 64      8         1.688994    888896     526287   6200     177660    5109       0        0     5108     5674        0        0        0  61.2%
-> 64      16        1.593196    800000     502135   6200     177660    5109       0        0     5108     6143        0        0        0  58.7%
-> 64      32        1.438651    666688     463412   6200     177660    5109       0        0     5108     6807        0        0        0  56.5%
-> 64      64        1.290205    500032     387560   6200     177660    5109       0        0     5108     7660        0        0        0  49.2%
-> 64      256       2.136466    200000      93613   5252     170616    4161       0        0     4160     6267        0        0        0  13.8%
-> 64      1024      0.388871     58880     151413   2317     148784    1226     107        0     1225     1844        0        0        0  23.4%
-> 64      4096      0.107253     15424     143809    807      19100     562     244        0      321      482        0        0        0  24.2%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.002773         1        361      3         68       2       0        0        1        1        0        0        0  10.5%
0       32        0.002905        32      11015      3         68       2       0        0        1        1        0        0        0  11.6%
0       256       0.003170       256      80764      4        104       2       0        0        1        1        0        0        0  17.8%
0       4096      0.008125      4096     504095     20        616      11       1        0        1        1        0        0        0  54.1%
500000  1         0.002914         1        343      3         72       2       0        0        1        2        0        0        0  10.7%
500000  32        0.002967        32      10786      3         72       2       0        0        1        2        0        0        0  12.6%
500000  256       0.003338       256      76685      5        112       3       0        0        2        2        0        0        0  17.4%
500000  4096      0.008495      4096     482141     21        624      12       1        0        2        2        0        0        0  52.3%

========================================
With the patchset
========================================

running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.340110   1000000    2940229   1000     126956      42       0        0        0        0        0        0        0  97.5%
1       1         1.401352    500000     356798    993     127056       2       0        0        1        1        0        0        0  99.9%
1       8         0.463124    111112     239918    993     127056       2       0        0        1        1        0        0        0  99.8%
1       16        0.330050     58824     178228    993     127056      11       0        0        1        1        0        0        0  99.7%
1       32        0.255981     30304     118384    993     127056       8       0        0        1        1        0        0        0  99.7%
1       64        0.215160     15385      71505    997     127072     263       0        0        3        3        0        0        0  99.4%
1       256       0.159702      3892      24370    993     127056     239       0        0        1        1        0        0        0  95.6%
1       1024      0.094403       976      10339    993     127056     298       0        0        1        1        0        0        0  58.9%
1       4096      0.082501       245       2970    649      22452     391     252        0        1        1        0        0        0  20.1%
64      1         0.415227    984616    2371272   1059     127056      52       1        0        1        1        0        0        0  99.3%
64      8         0.391556    888896    2270166    993     127056       2       0        0        1        1        0        0        0  99.8%
64      16        0.372075    800000    2150102    993     127056       4       0        0        1        1        0        0        0  99.7%
64      32        0.337454    666688    1975641    993     127056      15       0        0        1        1        0        0        0  99.7%
64      64        0.296345    500032    1687333    997     127072      21       0        0        3        3        0        0        0  99.7%
64      256       0.199221    200000    1003911    993     127056     204       0        0        1        1        0        0        0  99.4%
64      1024      0.118224     58880     498037    993     127056     275       0        0        1        1        0        0        0  61.8%
64      4096      0.095098     15424     162191    687      23684     417     248        0        1        1        0        0        0  23.7%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000585         1       1709      3        164       2       1        0        1        1        0        0        0  10.7%
0       32        0.000589        32      54353      3        164       2       1        0        1        1        0        0        0  10.0%
0       256       0.000688       256     372293      4        196       2       1        0        1        1        0        0        0  20.7%
0       4096      0.004336      4096     944562     19        676      10       1        0        1        1        0        0        0  36.9%
500000  1         0.000877         1       1140      5        228       3       2        0        1        1        0        0        0  13.6%
500000  32        0.000883        32      36222      5        228       3       2        0        1        1        0        0        0  14.4%
500000  256       0.000963       256     265804      6        260       3       2        0        1        1        0        0        0  22.0%
500000  4096      0.003008      4096    1361779     21        740      17       2        0        1        1        0        0        0  56.7%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000623         1       1604      3        164       2       0        0        1        1        0        0        0  13.9%
0       32        0.000624        32      51261      3        164       2       0        0        1        1        0        0        0  14.7%
0       256       0.000714       256     358484      4        168       2       0        0        1        1        0        0        0  22.6%
0       4096      0.003687      4096    1110990     16        680       8       1        0        1        1        0        0        0  38.6%
500000  1         0.000973         1       1028      4        424       3       2        0        1        1        0        0        0  12.1%
500000  32        0.000914        32      35022      5        296       3       1        0        1        1        0        0        0  12.8%
500000  256       0.000986       256     259646      5        296       3       1        0        1        1        0        0        0  19.7%
500000  4096      0.003155      4096    1298122     11        744       6       1        0        1        1        0        0        0  44.5%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000766         1       1305      4        484       2       0        0        1        1        0        0        0  12.2%
0       32        0.000626        32      51111      3        164       2       0        0        1        1        0        0        0  15.2%
0       256       0.000710       256     360563      4        196       2       0        0        1        1        0        0        0  25.2%
0       4096      0.003963      4096    1033440     16        900       8       1        0        1        1        0        0        0  40.2%
500000  1         0.000877         1       1141      5        228       3       1        0        1        1        0        0        0  12.7%
500000  32        0.000882        32      36272      5        228       3       1        0        1        1        0        0        0  14.2%
500000  256       0.000959       256     266937      6        260       3       1        0        1        1        0        0        0  21.1%
500000  4096      0.003103      4096    1319992     21        740      14       2        0        1        1        0        0        0  53.9%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000631         1       1586      3        164       2       2        0        1        1        0        0        0  13.8%
500000  2         0.000872         2       2295      5        228       3       2        0        1        1        0        0        0  13.4%
250000  4         0.001483         4       2698      9        356       5       4        0        1        1        0        0        0  11.2%
125000  8         0.002894         8       2764     21        740      13       8        0        1        1        0        0        0  15.6%
62500   16        0.005182        16       3087     41       1380      25      16        0        1        1        0        0        0  19.5%
2       500000    0.942943    500000     530255   1040     127056      38       0        0        1        1        0        0        0  99.9%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.001807         2       1107     11       1380       3       8        0        1        1        0        0        0  18.9%
no        0.000924         2       2165      5        228       3       1        0        1        1        0        0        0  14.1%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.009953   1000000     990145   1104     139668      11       0        0        2        2        0        0        0  99.7%
-> 1       1         2.213846    500000     225851   6200     177660    5109       0        0     5108     7679        0        0        0  70.3%
-> 1       8         1.150029    111112      96617   6200     177660    5109       0        0     5108     9647        0        0        0  42.3%
-> 1       16        0.989438     58824      59452   6200     177660    5109       0        0     5108     9913        0        0        0  33.2%
-> 1       32        0.891590     30304      33989   6201     177664    5110       0        0     5108    10057        0        0        0  26.4%
-> 1       64        0.840952     15385      18295   6199     177656    5108       0        0     5107    10135        0        0        0  21.6%
-> 1       256       2.247875      3892       1731   5028     168948    3937       0        0     3936     7801        0        0        0   5.0%
-> 1       1024      0.345917       976       2821   2076     146948     985     105        0      984     1955        0        0        0  10.0%
-> 1       4096      0.088806       245       2759    739      18152     492     246        0      247      490        0        0        0  11.6%
-> 64      1         1.821995    984616     540406   6274     177660    5119       0        0     5108     5187        0        0        0  63.9%
-> 64      8         1.715052    888896     518291   6200     177660    5109       0        0     5108     5674        0        0        0  61.9%
-> 64      16        1.620385    800000     493710   6200     177660    5109       0        0     5108     6143        0        0        0  59.4%
-> 64      32        1.464497    666688     455233   6200     177660    5109       0        0     5108     6807        0        0        0  56.9%
-> 64      64        1.311386    500032     381300   6200     177660    5109       0        0     5108     7660        0        0        0  50.0%
-> 64      256       2.153954    200000      92853   5252     170616    4161       0        0     4160     6267        0        0        0  14.3%
-> 64      1024      0.350275     58880     168097   2317     148784    1226     107        0     1225     1844        0        0        0  27.5%
-> 64      4096      0.107498     15424     143482    807      19100     562     244        0      321      482        0        0        0  24.5%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.002872         1        348      3         68       2       0        0        1        1        0        0        0  10.2%
0       32        0.002833        32      11297      3         68       2       0        0        1        1        0        0        0  12.1%
0       256       0.003145       256      81404      4        104       2       0        0        1        1        0        0        0  17.9%
0       4096      0.008110      4096     505079     20        616      12       1        0        1        1        0        0        0  54.4%
500000  1         0.002934         1        341      3         72       2       1        0        1        2        0        0        0  10.6%
500000  32        0.002871        32      11145      3         72       2       0        0        1        2        0        0        0  12.0%
500000  256       0.003216       256      79598      5        112       3       0        0        2        2        0        0        0  18.3%
500000  4096      0.008557      4096     478692     21        624      12       1        0        2        2        0        0        0  51.9%
"

* 'projects/sstables-30/index_reader_cleanup/v3' of https://github.com/argenet/scylla:
  sstables: Remove "lower_" from index_reader public methods.
  sstables: Make index_reader::advance_upper_past() method private.
  sstables: Stop using index_reader::advance_upper_past() outside the class.
  sstables: Move promoted_index_block from types.hh to index_entry.hh.
  sstables: Factor out promoted index into a separate class.
  sstables: Use std::optional instead of std::experimental optional in index_reader.
2018-07-01 12:30:29 +03:00
Botond Dénes
da53ea7a13 tests.py: add --jobs command line parameter
Allow setting the number of jobs used to run the tests.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d58d6393c6271bffc37ab3b5edc37b00ef485d9c.1529433590.git.bdenes@scylladb.com>
2018-07-01 12:26:41 +03:00
Avi Kivity
db2c029f7a dht: add i_partitioner::sharding_ignore_msb()
While the sharding algorithm is exposed (as cpu_sharding_algorithm_name()),
the ignore_msb parameter is not. Add a function to do that.
2018-07-01 12:17:35 +03:00
Vladimir Krivopalov
b24eb5c11d sstables: Remove "lower_" from index_reader public methods.
The index_reader class public interface has been amended so that it no
longer deals with the upper bound cursor and only advances the lower bound.
Since the class users can only explicitly operate with the lower bound
cursor (take data file position, advance to the next partition, etc), it
no longer makes sense to specify that the method operates on the lower
bound cursor in its name.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:48:33 -07:00
Vladimir Krivopalov
30109a693b sstables: Make index_reader::advance_upper_past() method private.
No changes made to the code except that it is moved around.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:47:48 -07:00
Vladimir Krivopalov
80d1d5017f sstables: Stop using index_reader::advance_upper_past() outside the class.
The only case when it needs to be called is when an index_reader is
advanced to a specific partition as part of sstable_reader
initialisation.

Instead, we're passing an optional upper_bound parameter that is used to
call advance_upper_past() internally if the partition is found.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-29 11:47:20 -07:00
Duarte Nunes
0db5419ec5 Merge 'Avoid copies when unfreezing frozen_mutation' from Paweł
"
When a frozen mutation gets deserialised, the current implementation copies
its value three times: from the IDL buffer to a bytes object, from the bytes
object to an atomic_cell, and then the atomic_cell is copied again. Moreover,
the value gets linearised, which may cause a large allocation.

All of that is very wasteful. This patch devirtualises and reworks IDL
reading code so that when used with partition_builder the cell value is
copied only once and without linearisation: from the IDL buffer to the
final atomic_cell.

perf_simple_query -c4, medians of 30 results:
        ./perf_before  ./perf_after  diff
 read       310576.54     316273.90  1.8%
 write      359913.15     375579.44  4.4%

microbenchmark, perf_idl:

BEFORE
test                                      iterations      median         mad         min         max
frozen_mutation.freeze_one_small_row         2142435   462.431ns     0.125ns   462.306ns   467.659ns
frozen_mutation.unfreeze_one_small_row       1640949   601.422ns     0.082ns   601.340ns   605.279ns
frozen_mutation.apply_one_small_row          1538969   645.993ns     0.405ns   645.588ns   656.510ns

AFTER
test                                      iterations      median         mad         min         max
frozen_mutation.freeze_one_small_row         2139548   455.525ns     0.631ns   454.894ns   456.707ns
frozen_mutation.unfreeze_one_small_row       1760139   566.157ns     0.003ns   566.153ns   584.339ns
frozen_mutation.apply_one_small_row          1582050   610.951ns     0.060ns   610.891ns   613.044ns

Tests: unit(release)
"

* tag 'avoid-copy-unfreeze/v2' of https://github.com/pdziepak/scylla:
  mutation_partition_view: use column_mapping_entry::is_atomic()
  schema: column_mapping_entry: cache abstract_type::is_atomic()
  schema: column_mapping_entry: reduce logic duplication
  mutation_partition_view: do not linearise or copy cell value
  atomic_cell: allow passing value via ser::buffer_view
  mutation_partition_view: pass cell by value to visitor
  mutation_partition_view: devirtualise accept()
  storage_proxy: use mutation_partition_view::{first, last}_row_key()
  mutation_partition_view: add last_row_key() and first_row_key() getters
2018-06-28 22:55:20 +01:00
Paweł Dziepak
c45e291084 mutation_partition_view: use column_mapping_entry::is_atomic() 2018-06-28 22:16:42 +01:00
Paweł Dziepak
6c54a97320 schema: column_mapping_entry: cache abstract_type::is_atomic()
IDL deserialisation code calls is_atomic() for each cell. An additional
indirection and a virtual call can be avoided by caching that value in
column_mapping_entry. There is already a very similar optimisation done
for column_definitions.
2018-06-28 22:16:42 +01:00
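The caching idea the commit describes can be sketched as follows. This is a minimal illustration, not the actual Scylla classes; the `abstract_type`/`int_type` stand-ins and member names are hypothetical:

```cpp
#include <cassert>
#include <memory>

// Illustrative stand-ins for the type hierarchy; not Scylla's real code.
struct abstract_type {
    virtual ~abstract_type() = default;
    virtual bool is_atomic() const = 0;
};

struct int_type final : abstract_type {
    bool is_atomic() const override { return true; }
};

// The entry caches the result of the virtual call at construction time, so
// per-cell checks in the deserialisation hot path cost a plain bool load
// instead of a pointer indirection plus virtual dispatch.
class column_mapping_entry {
    std::shared_ptr<const abstract_type> _type;
    bool _is_atomic; // cached from _type->is_atomic()
public:
    explicit column_mapping_entry(std::shared_ptr<const abstract_type> t)
        : _type(std::move(t))
        , _is_atomic(_type->is_atomic()) {}
    bool is_atomic() const { return _is_atomic; } // non-virtual, inlinable
};
```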
Paweł Dziepak
2bfdc2d781 schema: column_mapping_entry: reduce logic duplication
User-defined constructors often make it more likely that a careless
developer will forget to update one of them when adding a new member to
a structure. The risk of that happening can be reduced by reducing code
duplication with delegating constructors.
2018-06-28 22:16:42 +01:00
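The delegating-constructor pattern the commit relies on can be sketched like this; `entry` and its members are hypothetical, chosen only to show the shape:

```cpp
#include <cassert>
#include <string>

// Secondary constructors delegate to one primary constructor, so a newly
// added member only needs its initialisation written in a single place.
struct entry {
    std::string name;
    int id;
    bool atomic; // imagine this member was added later

    entry(std::string n, int i, bool a)
        : name(std::move(n)), id(i), atomic(a) {}

    // Delegating constructors: no member-init lists to keep in sync.
    entry(std::string n, int i) : entry(std::move(n), i, true) {}
    explicit entry(std::string n) : entry(std::move(n), 0) {}
};
```

Without delegation, adding `atomic` would have required touching every constructor's member-initialiser list.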
Paweł Dziepak
199f9196e9 mutation_partition_view: do not linearise or copy cell value 2018-06-28 22:11:19 +01:00
Paweł Dziepak
92700c6758 atomic_cell: allow passing value via ser::buffer_view 2018-06-28 22:11:19 +01:00
Paweł Dziepak
bf330a99f0 mutation_partition_view: pass cell by value to visitor
mutation_partition_view needs to create an atomic_cell from
IDL-serialised data. That cell is then passed to the visitor. However,
because the generic mutation_partition_visitor interface was used, the cell
was passed by constant reference, which forced the visitor to needlessly
copy it.

This patch takes advantage of the fact that mutation_partition_view is
devirtualised now and adjusts the interfaces of its visitors so that the
cell can be passed without copying.
2018-06-28 22:11:19 +01:00
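The pass-by-value idea can be sketched as below. The types are illustrative stand-ins (with a copy counter added for demonstration), not the real atomic_cell: a visitor that takes the cell by value can adopt it with a move, whereas a const-reference interface would force a copy.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical atomic_cell stand-in that counts copy-constructions.
struct atomic_cell {
    std::string value;
    static inline int copies = 0;
    explicit atomic_cell(std::string v) : value(std::move(v)) {}
    atomic_cell(const atomic_cell& o) : value(o.value) { ++copies; }
    atomic_cell(atomic_cell&&) = default;
};

// Receiving the cell by value lets the visitor take ownership via a move.
struct builder_visitor {
    std::vector<atomic_cell> cells;
    void accept_cell(atomic_cell c) { cells.push_back(std::move(c)); }
};

// The parser constructs the cell and hands it straight to the visitor;
// with C++17 guaranteed elision, no copy is made along the way.
inline void parse_one_cell(builder_visitor& v) {
    v.accept_cell(atomic_cell("payload"));
}
```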
Paweł Dziepak
569176aad1 mutation_partition_view: devirtualise accept()
There are only two types of visitors used, and only one of them appears
in the hot path. They can be devirtualised without too much effort,
which also enables future custom interface specialisations specific to
mutation_partition_view and its users, not necessarily in the scope of
the more general mutation_partition_visitor.
2018-06-28 22:11:19 +01:00
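A common way to devirtualise such an interface is to template accept() on the concrete visitor type, so the calls are resolved (and inlinable) at compile time instead of going through a vtable. A minimal sketch with hypothetical names:

```cpp
#include <cassert>

// A concrete visitor type; no virtual base class involved.
struct counting_visitor {
    int rows = 0;
    void on_row() { ++rows; }
};

// accept() is parameterised on the visitor type, so v.on_row() is a
// direct, inlinable call rather than virtual dispatch per element.
template <typename Visitor>
void accept(Visitor& v, int row_count) {
    for (int i = 0; i < row_count; ++i) {
        v.on_row();
    }
}
```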
Paweł Dziepak
6bd71015e7 storage_proxy: use mutation_partition_view::{first, last}_row_key() 2018-06-28 22:11:19 +01:00
Paweł Dziepak
2259eee97c mutation_partition_view: add last_row_key() and first_row_key() getters
Some users (e.g. reconciliation code) need only to know the clustering
key of the first or the last row in the partition. This was done with a
full visitor visiting every single cell of the partition, which is very
wasteful. This patch adds direct getters for the needed information.
2018-06-28 22:11:19 +01:00
Vladimir Krivopalov
a497edcbda sstables: Move promoted_index_block from types.hh to index_entry.hh.
It is only being used by index_reader internally and never exposed so
should not be listed in commonly used types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-28 12:28:59 -07:00
Vladimir Krivopalov
81fba73e9d sstables: Factor out promoted index into a separate class.
An index entry may or may not have a promoted index. All the optional
fields are better scoped under the same class, to avoid lots of separate
optional fields and to give a better representation.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-28 12:28:59 -07:00
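The grouping described above can be sketched as follows; the field names are illustrative, not the actual sstables types. Instead of several independently-optional fields on the entry, the promoted-index data lives in one class, and the entry holds a single std::optional of it, so the fields are present or absent as a unit:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical promoted-index payload: fields that only exist together.
struct promoted_index {
    uint32_t num_blocks;
    uint64_t data_length;
};

// The entry carries one optional object instead of many optional fields.
struct index_entry {
    uint64_t position;
    std::optional<promoted_index> promoted;
    bool has_promoted_index() const { return promoted.has_value(); }
};
```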
Asias He
bb4d361cf6 storage_service: Limit the number of times the REPLICATION_FINISHED verb is retried
In the removenode operation, if the messaging service is stopped, e.g., due
to disk I/O error isolation, the node can keep retrying the
REPLICATION_FINISHED verb indefinitely.

A Scylla log full of such messages was observed:

[shard 0] storage_service - Fail to send REPLICATION_FINISHED to $IP:0:
seastar::rpc::closed_error (connection is closed)

To fix, limit the number of retries.

Tests: update_cluster_layout_tests.py

Fixes #3542

Message-Id: <638d392d6b39cc2dd2b175d7f000e7fb1d474f87.1529927816.git.asias@scylladb.com>
2018-06-28 19:54:01 +01:00
Paweł Dziepak
e9dffc753c tests/mutation: test external_memory_usage() 2018-06-28 19:20:23 +01:00
Paweł Dziepak
8153df7684 atomic_cell: add external chunks and overheads to external_memory_usage() 2018-06-28 19:20:23 +01:00
Paweł Dziepak
2dc78a6ca2 data::cell: expose size overhead of external chunks 2018-06-28 18:01:17 +01:00
Paweł Dziepak
6adc78d690 imr::utils::object: expose size overhead 2018-06-28 18:01:17 +01:00
Paweł Dziepak
e69f2c361c tests/mutation: properly mark atomic_cells that are collection members 2018-06-28 18:00:39 +01:00
Takuya ASADA
972ce88601 dist/common/scripts/scylla_setup: allow input multiple disk paths on RAID disk prompt
Allow "/dev/sda1,/dev/sdb1" style input on RAID disk prompt.
2018-06-29 01:37:19 +09:00
Takuya ASADA
a83c66b402 dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk specified
When only one disk is specified, create XFS directly on the disk instead
of creating a RAID0 volume on it.
2018-06-29 01:37:19 +09:00
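The single-disk shortcut above can be reduced to a tiny decision helper. This is an illustrative sketch, not the real scylla_raid_setup code; the function name and the `/dev/md0` device path are assumptions.

```python
def format_target(disks):
    """Pick the device to run mkfs.xfs on.

    With a single disk, format it directly instead of assembling a
    one-member RAID0 array; with several disks, a RAID0 md device would
    be assembled first and formatted instead.
    (Hypothetical helper illustrating the commit's decision logic.)
    """
    if len(disks) == 1:
        return disks[0]      # mkfs.xfs directly on the lone disk
    return "/dev/md0"        # mkfs.xfs on the assembled RAID0 volume
```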
Takuya ASADA
99fb754221 dist/common/scripts/scylla_raid_setup: fix module import
The sys module was missing; import it.

Fixes #3548
2018-06-29 01:37:19 +09:00
Takuya ASADA
f2132c61bd dist/common/scripts/scylla_setup: check disk is used in MDRAID
Check whether the disk is used in MDRAID via /proc/mdstat.
2018-06-29 01:37:19 +09:00
Takuya ASADA
daccc10a06 dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer on scylla_fstrim_setup
Currently, enabling scylla-fstrim.timer is part of 'enable-service', so it
is enabled even when --no-fstrim-setup is specified (or 'No' is entered at the interactive setup prompt).

To honour --no-fstrim-setup we need to enable scylla-fstrim.timer in
scylla_fstrim_setup instead of in the enable-service part of scylla_setup.

Fixes #3248
2018-06-29 01:37:19 +09:00
Takuya ASADA
fa6db21fea dist/common/scripts/scylla_setup: use print() instead of logging.error()
Align with the other scripts; use print().
2018-06-29 01:37:19 +09:00
Takuya ASADA
2401115e14 dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
Implement Gentoo Linux support on scylla_setup.
2018-06-29 01:37:19 +09:00
Takuya ASADA
9d537cb449 dist/common/scripts/scylla_coredump_setup: run os.remove() when deleting directory is symlink
Since shutil.rmtree() raises an exception when run on a symlink, we need
to check whether the path is a symlink and run os.remove() if it is.

Fixes #3544
2018-06-29 01:37:19 +09:00
Takuya ASADA
5b4da4d4bd dist/common/scripts/scylla_setup: don't include the disk on unused list when it contains partitions
In the current implementation, we check whether the partition is mounted,
but a disk that contains such a partition is still marked as unused.
To avoid the problem, we should skip any disk which contains partitions.

Fixes #3545
2018-06-29 01:37:19 +09:00
Takuya ASADA
83bc72b0ab dist/common/scripts/scylla_setup: skip running rest of the check when the disk detected as used
There is no need to run further checks once the disk is already detected as used.
2018-06-29 01:37:19 +09:00
Takuya ASADA
1650d37dae dist/common/scripts/scylla_setup: add a disk to selected list correctly
When a disk path is typed at the RAID setup prompt, the script mistakenly
splits the input into individual characters,
like ['/', 'd', 'e', 'v', '/', 's', 'd', 'b'].

To fix the issue we need to use selected.append() instead of
selected +=.

See #3545
2018-06-29 01:37:19 +09:00
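The pitfall behind this fix is general Python behaviour: `+=` on a list iterates over the right-hand operand, so a string is spread into its characters, while `append()` adds it as a single element. A standalone illustration:

```python
# Buggy form: '+=' treats the string as an iterable of characters.
selected = []
selected += "/dev/sdb"
broken = selected              # ['/', 'd', 'e', 'v', '/', 's', 'd', 'b']

# Fixed form: append() adds the whole path as a single element.
selected = []
selected.append("/dev/sdb")
fixed = selected               # ['/dev/sdb']
```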
Takuya ASADA
4b5826ff5a dist/common/scripts/scylla_setup: fix wrong indent
list_block_devices() should return 'devices' whether or not re.match()
matches.
2018-06-29 01:37:19 +09:00
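The indentation bug is the classic one where a return sits inside the loop's match branch. A minimal sketch (the real list_block_devices() parses block-device listings; the regex and input format here are assumptions):

```python
import re

def list_block_devices(lines):
    """Collect device names from the given lines.

    The 'return' must sit at function level, outside the loop, so that
    'devices' is returned whether or not any line matched; in the buggy
    version it was indented too deep, and the no-match path returned None.
    """
    devices = []
    for line in lines:
        if re.match(r"/dev/\S+", line):
            devices.append(line.split()[0])
    return devices
```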
Takuya ASADA
f828c5c4f3 dist/common/scripts: sync instance type list for detect NIC type to latest one
The current instance type list is outdated; sync with the latest table from:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html#enabling_enhanced_networking

Fixes #3536
2018-06-29 01:37:19 +09:00
Takuya ASADA
6cffb164d6 dist/common/scripts: verify systemd unit existence using 'systemctl cat'
Verify unit existence by running 'systemctl cat {}' silently; raise an
exception if the unit doesn't exist.
2018-06-29 01:37:19 +09:00
Vladimir Krivopalov
82f76b0947 Use std::reference_wrapper instead of a plain reference in bound_view.
The presence of a plain reference prohibits the bound_view class from
being copyable. The trick employed to work around that was to use
'placement new' for copy-assigning bound_view objects, but this approach
is ill-formed and causes undefined behaviour for classes that have const
and/or reference members.

The solution is to use a std::reference_wrapper instead.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <a0c951649c7aef2f66612fc006c44f8a33713931.1530113273.git.vladimir@scylladb.com>
2018-06-28 11:24:06 +01:00
Avi Kivity
c87a961667 Merge "Add multishard_writer support" from Asias
"
We need a multishard_writer which gets mutation fragments from a producer
(e.g., from the network using the rpc streaming) and consumes the mutation
fragments with a consumer (e.g., write to sstable).

The multishard_writer will take care of mutation fragments that do not
belong to the current shard.

This multishard_writer will be used in the new scylla streaming.
"

* 'asias/multishard_writer_v10.1' of github.com:scylladb/seastar-dev:
  tests: Add multishard_writer_test to test.py
  tests: Add test for multishard_writer
  multishard_writer: Introduce multishard_writer
  tests: Allow random_mutation_generator to generate mutations belonging to a remote shard
2018-06-28 12:36:55 +03:00
Asias He
fd8b7efb99 tests: Add multishard_writer_test to test.py
For multishard_writer class testing.
2018-06-28 17:20:29 +08:00
Asias He
4050a4b24e tests: Add test for multishard_writer 2018-06-28 17:20:29 +08:00
Asias He
f4b406cce1 multishard_writer: Introduce multishard_writer
The multishard_writer class gets mutation_fragments generated from a
flat_mutation_reader and consumes them with
multishard_writer::_consumer. If a mutation_fragment does not belong to the
shard multishard_writer is running on, it forwards the mutation_fragment to
the correct shard. The future returned by multishard_writer() becomes ready
when all the mutation_fragments are consumed.

Tests: tests/multishard_writer_test.cc
Tests: dtest update_cluster_layout_tests.py

Fixes #3497
2018-06-28 17:20:28 +08:00
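The routing idea can be reduced to a few lines: each fragment is handed to the consumer that owns its shard. A single-threaded Python sketch under assumed names; Scylla's real implementation uses token-based sharding and forwards across cores rather than appending to lists:

```python
# Route each (key, value) fragment to the consumer owning its shard.
def multishard_write(fragments, consumers, shard_of):
    for key, value in fragments:
        consumers[shard_of(key)].append((key, value))  # forward to owner

shards = [[], []]                      # one consumer per "shard"
multishard_write([("a", 1), ("b", 2)], shards, lambda k: ord(k) % 2)
```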
Asias He
8eccff1723 tests: Allow random_mutation_generator to generate mutations belonging to a remote shard
- make_local_keys returns keys of current shard
- make_keys returns keys of current or remote shard
2018-06-28 17:20:28 +08:00
Asias He
27cb41ddeb range_streamer: Use float for the time taken to stream
It is useful when the total time to stream is small, e.g., 2.0 seconds
or 2.9 seconds. Showing the time as an integer number of seconds is not
accurate in such cases.

Message-Id: <d801b57279981c72acb907ad4b0190ba4d938a3d.1530175052.git.asias@scylladb.com>
2018-06-28 11:39:14 +03:00
Vladimir Krivopalov
fc629b9ca6 sstables: Use std::optional instead of std::experimental optional in index_reader.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-27 16:47:53 -07:00
Tomasz Grabiec
0a1aec2bd6 tests: perf_row_cache_update: Test with an active reader surviving memtable flush
Exposes latency issues caused by mutation_cleaner lifetime issues,
fixed by earlier commits.
2018-06-27 21:51:04 +02:00
Tomasz Grabiec
074be4d4e8 memtable, cache: Run mutation_cleaner worker in its own scheduling group
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".

We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
2018-06-27 21:51:04 +02:00
Tomasz Grabiec
6c6ffaee71 mutation_cleaner: Make merge() redirect old instance to the new one
If a memtable snapshot goes away after the memtable started merging into
cache, it would enqueue the snapshots for cleaning on the memtable's
cleaner, which will have to clean without deferring when the memtable
is destroyed. That may stall the reactor. To avoid this, make merge()
cause the old instance of the cleaner to redirect to the new instance
(owned by cache), like we do for regions. This way the snapshots
mentioned earlier can be cleaned gracefully after the memtable is
destroyed.
2018-06-27 21:51:04 +02:00
Tomasz Grabiec
450985dfee mvcc: Use RAII to ensure that partition versions are merged
Before this patch, maybe_merge_versions() had to be manually called
before partition snapshot goes away. That is error prone and makes
client code more complicated. Delegate that task to a new
partition_snapshot_ptr object, through which all snapshots are
published now.
2018-06-27 21:51:04 +02:00
Avi Kivity
e1efda8b0c Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate the cache, which will appear as a partition deletion or as
writes to the static row being lost, until node restart or eviction of
the partition entry.

There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.

Disable until a more elaborate fix is implemented.

Fixes #3552
Fixes #3553
"

* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
  tests: Add test for slicing a mutation source with date tiered compaction strategy
  tests: Check that database conforms to mutation source
  database: Disable sstable filtering based on min/max clustering key components
2018-06-27 14:28:27 +03:00
Calle Wilund
054514a47a sstables::compress: Ensure unqualified compressor name if possible
Fixes #3546

Both older Origin and Scylla write "known" compressor names (i.e. those
in the Origin namespace) unqualified (e.g. LZ4Compressor).

This behaviour was not preserved in the virtualization change, but it
probably should be.

Message-Id: <20180627110930.1619-1-calle@scylladb.com>
2018-06-27 14:16:50 +03:00
Tomasz Grabiec
d1e8c32b2e gdb: Add pretty printer for managed_vector 2018-06-27 13:07:28 +02:00
Tomasz Grabiec
b0e8547569 gdb: Add pretty printer for rows 2018-06-27 13:07:28 +02:00
Tomasz Grabiec
da19508317 gdb: Add mutation_partition pretty printer 2018-06-27 13:07:28 +02:00
Tomasz Grabiec
d485e1c1d8 gdb: Add pretty printer for partition_entry 2018-06-27 13:07:28 +02:00
Tomasz Grabiec
b51c70ef69 gdb: Add pretty printer for managed_bytes 2018-06-27 13:07:28 +02:00
Tomasz Grabiec
d76cfa77b1 gdb: Add iteration wrapper for intrusive_set_external_comparator 2018-06-27 13:07:24 +02:00
Tomasz Grabiec
aa0b41f0b2 gdb: Add iteration wrapper for boost intrusive set 2018-06-27 13:04:47 +02:00
Tomasz Grabiec
c26a304fbb mvcc: Merge partition version versions gradually in the background
When snapshots go away, typically when the last reader is destroyed,
we used to merge adjacent versions atomically. This could induce
reactor stalls if partitions were large. This is especially true for
versions created on cache update from memtables.

The solution is to allow this process to be preempted and move to the
background. mutation_cleaner keeps a linked list of such unmerged
snapshots and has a worker fiber which merges them incrementally and
asynchronously with regards to reads.

This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from 23ms to 2ms.
2018-06-27 12:48:30 +02:00
Tomasz Grabiec
4d3cc2867a mutation_partition: Make merging preemptable 2018-06-27 12:48:30 +02:00
Tomasz Grabiec
4995a8c568 tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
Preparation for switching to background merging.
2018-06-27 12:48:30 +02:00
Piotr Sarna
03753cc431 database: make drop_column_family wait on reads in progress
drop_column_family now waits for both writes and reads in progress.
It solves possible liveness issues with the row cache, where column_family
could be dropped prematurely, before the read request finished.

Phaser operation is passed inside database::query() call.
There are other places where reading logic is applied (e.g. view
replicas), but these are guarded with different synchronization
mechanisms, while _pending_reads_phaser applies to regular reads only.

Fixes #3357

Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <d58a5ee10596d0d62c765ee2114ac171b6f087d2.1529928323.git.sarna@scylladb.com>
2018-06-27 10:02:56 +01:00
Piotr Sarna
e1a867cbe3 database: add phaser for reads
Currently drop_column_family waits on write_in_progress phaser,
but there's no such mechanism for reads. This commit adds
a corresponding reads phaser.

Refs #3357

Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <70b5fdd44efbc24df61585baef024b809cabe527.1529928323.git.sarna@scylladb.com>
2018-06-27 10:02:56 +01:00
Tomasz Grabiec
b4879206fb tests: Add test for slicing a mutation source with date tiered compaction strategy
Reproducer for https://github.com/scylladb/scylla/issues/3552
2018-06-26 18:54:44 +02:00
Tomasz Grabiec
826a237c2e tests: Check that database conforms to mutation source 2018-06-26 18:54:44 +02:00
Tomasz Grabiec
19b76bf75b database: Disable sstable filtering based on min/max clustering key components
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate the cache, which will appear as a partition deletion or as
writes to the static row being lost, until node restart or eviction of
the partition entry.

There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.

Disable until a more elaborate fix is implemented.

Fixes #3552
Fixes #3553
2018-06-26 18:54:44 +02:00
Avi Kivity
9a7ecdb3b9 Merge "Deglobalise cache_tracker" from Paweł
"
Cache tracker is a thread-local global object that indirectly depends on
the lifetimes of other objects. In particular, a member of
cache_tracker: mutation_cleaner may extend the lifetime of a
mutation_partition until the cleaner is destroyed. The
mutation_partition itself depends on LSA migrators, which are
thread-local objects. Since there is no direct dependency between
LSA migrators and cache_tracker, it is not guaranteed that the former
won't be destroyed before the latter. The easiest (barring some unit
tests that repeat the same code several billion times) solution is to
stop using globals.

This series also improves the part of LSA sanitiser that deals with
migrators.

Fixes #3526.

Tests: unit(release)
"

* tag 'deglobalise-cache-tracker/v1-rebased' of https://github.com/pdziepak/scylla:
  mutation_cleaner: add disclaimer about mutation_partition lifetime
  lsa: enhance sanitizer for migrators
  lsa: formalise migrator id requirements
  row_cache: deglobalise row cache tracker
2018-06-26 16:38:12 +01:00
Asias He
c3b5a2ecd5 gossip: Fix tokens assignment in assassinate_endpoint
The tokens vector is defined a few lines above and is needed outside the
if block.

Do not redefine it again in the if block, otherwise the tokens will be empty.

Found by code inspection.

Fixes #3551.

Message-Id: <c7a06375c65c950e94236571127f533e5a60cbfd.1530002177.git.asias@scylladb.com>
2018-06-26 16:38:12 +01:00
Tomasz Grabiec
6d6b93d1e7 flat_mutation_reader: Move field initialization to initializer list
This works around a problem of std::terminate() being called in debug
mode build if initialization of _current throws.

Backtrace:

Thread 2 "row_cache_test_" received signal SIGABRT, Aborted.
0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
(gdb) bt
  #0  0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
  #1  0x00007ffff17d077d in abort () from /lib64/libc.so.6
  #2  0x00007ffff5773025 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
  #3  0x00007ffff5770c16 in ?? () from /lib64/libstdc++.so.6
  #4  0x00007ffff576fb19 in ?? () from /lib64/libstdc++.so.6
  #5  0x00007ffff5770508 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
  #6  0x00007ffff3ce4ee3 in ?? () from /lib64/libgcc_s.so.1
  #7  0x00007ffff3ce570e in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #8  0x0000000003633602 in reader::reader (this=0x60e0001160c0, r=...) at flat_mutation_reader.cc:214
  #9  0x0000000003655864 in std::make_unique<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (__args#0=...)
    at /usr/include/c++/7/bits/unique_ptr.h:825
  #10 0x0000000003649a63 in make_flat_mutation_reader<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (args#0=...)
    at flat_mutation_reader.hh:440
  #11 0x000000000363565d in make_forwardable (m=...) at flat_mutation_reader.cc:270
  #12 0x000000000303f962 in memtable::make_flat_reader (this=0x61300001d540, s=..., range=..., slice=..., pc=..., trace_state_ptr=..., fwd=..., fwd_mr=...)
    at memtable.cc:592

Message-Id: <1528792447-13336-1-git-send-email-tgrabiec@scylladb.com>
2018-06-25 20:03:23 +03:00
Avi Kivity
31eeae0126 Merge "Avoid buffer linearisation in read path" from Paweł
"
The read path on coordinator involves a lot of passing around buffers
and some occasional processing. We start with query::result obtained
from the storage_proxy which is then transformed into a
cql3::result_set, which is then used to write a response. Buffers are
copied and linearised quite excessively.

This series attempts to remedy that by using view of fragmented buffers
as much as possible. The first part deals with reading from
query::result. ser::buffer_view is introduced which enables the IDL
infrastructure to read a buffer without copying or linearising it.
The second part is switching native protocol layer to use bytes_ostream
instead of std::vector<char> to hold the generated response to the
client. The last part introduces cql3::result_generator which is an
alternative to cql3::result_set that passes buffer views without copying
or linearising anything from query::result to the native protocl layer
(or Thrift). It is only used in simple cases, when no processing at the
CQL layer is required, except for paged queries which require some
simple interpretation of the results and are supported by the result
generator.

Tests: unit(release), dtests(paging_test.py paging_additional_test.py
  cql_additional_tests.py cql_tracing_test.py cql_prepared_test.py
  cql_cast_test.py cql_tests.py)
"

* tag 'buffer-views-query-result/v2' of https://github.com/pdziepak/scylla: (34 commits)
  cql3: select_statement: use fetch_page_generator() if possible
  pager: add fetch_page_generator()
  pager: make the visitor handle_result() accepts a template parameter
  pager: make query_result_visitor base class a template parameter
  pager: make myvistor a member class of query_pager
  pager: make shared pointers to selection constant
  pager: merge query_pager and query_pagers::impl
  cql3: select_statement: use result_generator if possible
  cql3: selection: add is_trivial()
  cql3: result: support result_generator
  cql3: add lazy result_generator
  cql3: add result class
  cql3::result_set: fix encapsulation
  thrift: use cql3::result_set visiting interface
  transport: use cql3::result_set visiting interface
  cql3::result_set: add visit()
  transport: response: add write_int_placeholder()
  transport: steal response buffers and make send zero-copy
  transport: use reusable_buffer for compression
  transport: response: use bytes_ostream
  ...
2018-06-25 17:37:50 +03:00
Paweł Dziepak
bdc299cc38 mutation_cleaner: add disclaimer about mutation_partition lifetime
mutation_cleaner has already caused problems by extending the lifetime of
a mutation_partition past the lifetime of the LSA migrators that it uses
(due to the fact that both the cleaner and the migrators were thread-local
globals). Since the long-term goal is to make the mutation_partition
internal representation depend more and more on the schema, that lifetime
extension may again cause problems in the future, so let's add a
disclaimer that will hopefully help avoid them.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
55bf9d78a6 lsa: enhance sanitizer for migrators
Current LSA sanitizer performs only basic checks on the migrators use,
without doing any additional reporting in case an error is detected. This
patch enhances it so that when a problem is detected relevant stack
traces get printed.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
fcd9b1f821 lsa: formalise migrator id requirements
object_descriptor uses special encoding for migrator ids which assumes
that the valid ones are in a range smaller than uint32_t. Let's add some
static asserts that make this fact more visible.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
96b0577343 row_cache: deglobalise row cache tracker
Row cache tracker has numerous implicit dependencies on other objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both cache tracker and some of those dependencies are thread local
objects makes it hard to guarantee correct destruction order.

Let's deglobalise cache tracker and put it in the database class.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
2b1fcfe019 cql3: select_statement: use fetch_page_generator() if possible 2018-06-25 09:21:47 +01:00
Paweł Dziepak
1cf3cb285f pager: add fetch_page_generator()
fetch_page_generator() is an equivalent of fetch_page(), but instead of
building a cql3::result_set it returns a cql3::result_generator().
2018-06-25 09:21:47 +01:00
Paweł Dziepak
f6fe831d49 pager: make the visitor handle_result() accept a template parameter 2018-06-25 09:21:47 +01:00
Paweł Dziepak
fc87ca5926 pager: make query_result_visitor base class a template parameter
So far query_result_visitor was tied to result_set_builder. The goal is
to enable result_generator to work with paged queries as well so we need
to decouple them.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
dc9a65ea76 pager: make myvistor a member class of query_pager
It is going to become a class template.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
319b2cde7e pager: make shared pointers to selection constant
Shared pointers make code harder to reason about. It is not easy to get
rid of them in this piece of code, but we can restore at least a bit
of sanity by adding consts.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
327d3de51e pager: merge query_pager and query_pagers::impl
There is just a single implementation of query_pager and there is no
reason to make anything virtual. Devirtualising this code will allow
higher layers to pass visitors via templates.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
fa5dea91e7 cql3: select_statement: use result_generator if possible 2018-06-25 09:21:47 +01:00
Paweł Dziepak
3f1184d16d cql3: selection: add is_trivial()
cql3::result_generator supports only trivial selections.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
adad31ba6b cql3: result: support result_generator
cql3::result can now hold either a result_set or a result_generator.
Some code that is not performance critical expects to get result_set so
a way of converting the result_generator to a result_set is added.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
02443d10db cql3: add lazy result_generator
result_generator is a restricted alternative to result_set. It supports
only the simplest cases, but is much cheaper, as it passes data almost
directly from query::result to its visitor, bypassing much of the CQL
layer.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
dca68afce6 cql3: add result class
So far the only way of returning the result of a CQL query was to build
a result_set. An alternative lazy result generator is going to be
introduced for the simple cases when no transformations at the CQL layer
are needed. To do that we need to hide from the users the fact that
there will be multiple representations of CQL results.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
29cc4a4c0b cql3::result_set: fix encapsulation 2018-06-25 09:21:47 +01:00
Paweł Dziepak
8f26d9c03f thrift: use cql3::result_set visiting interface 2018-06-25 09:21:47 +01:00
Paweł Dziepak
54d5dc414d transport: use cql3::result_set visiting interface 2018-06-25 09:21:47 +01:00
Paweł Dziepak
2e4234ab63 cql3::result_set: add visit()
This visiting interface for result_set satisfies most of its users (at
least all of those which are in the hot path). It will allow having an
alternative to result_set (i.e. a lazy result generator) which would
provide exactly the same interface.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
c0e7160625 transport: response: add write_int_placeholder()
This allows the response writer to defer writing integers until a later
time. It will be used by the lazy response generator, which will know the
number of rows in the response only after they are all written.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
88aff8eda8 transport: steal response buffers and make send zero-copy
Each response is sent only once, so we can safely steal its buffers and
pass them to the output_stream using the zero-copy interface.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
821e6683e3 transport: use reusable_buffer for compression
Compression algorithms require us to linearise bytes_ostream. This may
cause an excessive number of large allocations. Using reusable_buffers
can avoid that.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
a7c4d407ce transport: response: use bytes_ostream
std::vector<char> is not a very good container for incrementally
building a response. It may cause excessive copies and allocations. If
the response is large it will put more pressure on the memory allocator
by requiring the buffer to be contiguous.

We already have bytes_ostream which avoids all of these problems, so
let's use it.
2018-06-25 09:22:43 +01:00
Paweł Dziepak
c04d38b76b transport: drop response::make_message() 2018-06-25 09:22:35 +01:00
Paweł Dziepak
444acf49af transport: use std::unique_ptr for the response
So far cql_server::response was passed around using shared pointers.
They have the very big cost of making it hard to reason about the code.
All of that is unnecessary, and we can easily switch to using the much
more sensible std::unique_ptr.
2018-06-25 09:22:24 +01:00
Paweł Dziepak
12f89299b2 transport: move response to a separate header
There are some other translation units which right now are satisfied
with the response being an incomplete type. This means that
std::unique_ptr can't be used for it. Let's move the class declaration
to a header that can be included where needed.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
3b9ba30497 tests: add test for reusable buffers 2018-06-25 09:21:47 +01:00
Paweł Dziepak
b4c5e1a6d4 utils: add reusable_buffer
This commit adds a helper class reusable_buffer which can be used to
avoid excessive memory allocations of large buffers when bytes_ostream
needs to be linearised. The idea is that reusable_buffer in most cases
is going to be thread local so that multiple continuation chains can
reuse the same large buffer.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
8feab33cf4 query::result: use std::optional instead of experimental version 2018-06-25 09:21:47 +01:00
Paweł Dziepak
9d140488bd tests/perf: add performance test for IDL 2018-06-25 09:21:47 +01:00
Paweł Dziepak
4704c4efab query::result: avoid copying and linearising cell value
query::result_view already operates on views of a serialised
query::result. However, until now the value of a cell was always
linearised and copied. This patch makes use of ser::buffer_view to avoid
that.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
982f71a804 query::result_view: add concept 2018-06-25 09:21:47 +01:00
Paweł Dziepak
2914f64b2d serializer: use buffer_view in bytes deserialiser 2018-06-25 09:21:47 +01:00
Paweł Dziepak
19caf709e1 serializer: add view of a fragmented stream
ser::buffer_view is a view of a fragmented buffer in a stream of
IDL-serialised data. It can be used to deserialise IDL objects without
needless copying and linearisation of large blobs.
2018-06-25 09:21:47 +01:00
Paweł Dziepak
fe8dc1fa5c bytes_ostream: add remove_suffix() 2018-06-25 09:21:47 +01:00
Paweł Dziepak
969219d5bc tests/random-utils: add missing include 2018-06-25 09:21:47 +01:00
Paweł Dziepak
a85197a7b5 bytes_ostream: make fragment_iterator default constructible 2018-06-25 09:21:47 +01:00
Piotr Sarna
828497ad19 hints: amend a comment in device limits
To make the comment less confusing, 'group of managers'
is used instead of 'device'.

Refs #3516

Reported-by: Vlad Zolotarov <vladz@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <60c9ab6b47195570f7ce7dff9556e3739b7ae00f.1529862547.git.sarna@scylladb.com>
2018-06-24 19:14:59 +01:00
Avi Kivity
48dc875e49 Merge "convert setup scripts to python3" from Takuya
"
Converted all setup scripts from bash to python3.
"

* 'scripts_python_conversion_v1' of https://github.com/syuu1228/scylla:
  dist/common/scripts: convert scylla_kernel_check to python3
  dist/common/scripts: convert scylla_ec2_check to python3
  dist/common/scripts: convert scylla_sysconfig_setup to python3
  dist/common/scripts: convert scylla_setup to python3
  dist/common/scripts: convert scylla_selinux_setup to python3
  dist/common/scripts: convert scylla_raid_setup to python3
  dist/common/scripts: convert scylla_ntp_setup to python3
  dist/common/scripts: convert scylla_fstrim_setup to python3
  dist/common/scripts: convert scylla_dev_mode_setup to python3
  dist/common/scripts: convert scylla_cpuset_setup to python3
  dist/common/scripts: convert scylla_cpuscaling_setup to python3
  dist/common/scripts: convert scylla_coredump_setup to python3
  dist/common/scripts: convert scylla_bootparam_setup to python3
  dist/common/scripts: extend scylla_util.py to convert setup scripts to python3
  dist/common/scripts: convert scylla_io_setup and scylla_util.py to python3
2018-06-24 15:02:08 +03:00
Avi Kivity
40dbdae24e Update seastar submodule
> Merge "Allow creating views from simple streams" from Paweł
  > IOTune: allow duration to be configurable and change its defaults
2018-06-24 14:54:46 +03:00
Avi Kivity
cb549c767a database: rename column_family to table
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".

An alias is kept to avoid huge code churn.

To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.

Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
2018-06-24 14:54:46 +03:00
Tomasz Grabiec
2d4177355a Merge "Support for writing range tombstones to SSTables 3.x" from Vladimir
This patchset brings support for writing range tombstones to SSTables
3.x. ('mc' format).

In SSTables 3.x, range tombstones are represented by so-called range
tombstone markers (hereafter RT markers) that denote range tombstone
start and end bounds. So each range tombstone is represented in data
file by two ordered RT markers.
There are also markers that both close the previous range tombstone and
open the new one in case two range tombstones are adjacent. This is
done to consume less disk space on such occasions.
Range tombstones written as RT markers are naturally non-overlapping.

* github.com:argenet/scylla projects/sstables-30/write-range-tombstones/v6
range_tombstone_stream: Remove an unused boolean flag.
Revert "Add missing enum values to bound_kind."
sstables: Move to_deletion_time helper up and make it static.
sstables: Write end-of-partition byte before flushing the last index
block.
sstables: Add support for writing range tombstones in SSTables 3.x
format.
tests: Add unit test covering simple range tombstone.
tests: Add unit test covering adjacent range tombstones.
tests: Add test to cover non-adjacent RTs.
tests: Add test covering mixed rows and range tombstones.
tests: Add test covering SSTables 3.x with many RTs.
tests: Add unit test covering overlapping RTs and rows.
tests: Add tests writing a range tombstone and a row overlapping with
its start.
tests: Add tests writing a range tombstone and a row overlapping with
its end.
tests: Add function that writes from multiple memtables into SSTables.
tests: Add test where 2nd range tombstone covers the remainder of the
1st one.
tests: Add test writing two non-adjacent range tombstones with same
clustering key prefix at their bounds.
tests: Add test covering overlapped range tombstones.
2018-06-22 15:47:18 +02:00
Tomasz Grabiec
f09fff090a Merge 'Enhance space watchdog' from Piotr Sarna
"
This series addresses issue #3516 and enhances space watchdog to make it
device-aware. It's needed because since last MV-related changes, space
watchdog can be responsible for multiple hints managers, which means
multiple directories, which may mean multiple devices.
Hence, having a single static space size limit is not enough anymore
and the watchdog should take into account that different managers
may work on different disks, while other managers can share
the same device.

Tests: unit (release)
"

* 'enhance_space_watchdog_4' of https://github.com/psarna/scylla:
hints: reserve more space for dedicated storage
hints: add is_mountpoint function
hints: make space_watchdog device-aware
hints: add device_id to manager
hints: add get_device_id function
2018-06-22 15:45:47 +02:00
Piotr Sarna
8b43ac3a57 hints: reserve more space for dedicated storage
Reserving 10% of space for hints managers makes sense if the device
is shared with other components (like /data or /commitlog).
But if the hints directory is mounted on dedicated storage, it makes
sense to reserve much more - 90% was chosen as a sane limit.
Whether storage is 'dedicated' or not is based on a simple check of
whether the given hints directory is a mount point.

Fixes #3516

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:27:00 +02:00
Piotr Sarna
32f86ca61e hints: add is_mountpoint function
A helper function that checks whether a path is also a mount point
is added.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:52 +02:00
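A mount-point check of this kind is commonly done by comparing the device of a directory with that of its parent; a minimal sketch of the idea (illustrative only, not necessarily Scylla's actual helper):

```python
import os

def is_mountpoint(path):
    """Return True if `path` is the root of a mounted filesystem.

    A directory is a mount point when it lives on a different device
    than its parent directory (or when it is the filesystem root).
    """
    path = os.path.realpath(path)
    parent = os.path.dirname(path)
    if parent == path:  # reached "/"
        return True
    return os.stat(path).st_dev != os.stat(parent).st_dev
```

This is equivalent in spirit to Python's own os.path.ismount().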
Piotr Sarna
b6c1b8c5ef hints: make space_watchdog device-aware
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.

References #3516

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:45 +02:00
Piotr Sarna
d22668de04 hints: add device_id to manager
In order to make space_watchdog device-aware, a device_id field
is added to the hints manager. It is the equivalent of stat.st_dev
and it identifies the disk that contains the manager's root directory.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:37 +02:00
Piotr Sarna
91b5e33c6a hints: add get_device_id function
In order to distinguish which directories reside on which devices,
a get_device_id function is added to the resource manager.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:25:47 +02:00
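In Python terms, the device identity this series keys on is just `st_dev`; a hypothetical sketch of grouping hint directories by device (function names here are illustrative, not Scylla's API):

```python
import os
from collections import defaultdict

def get_device_id(path):
    # st_dev identifies the block device holding `path`
    return os.stat(path).st_dev

def group_dirs_by_device(hint_dirs):
    """Group hints directories so managers sharing a disk can share
    one space limit, as the device-aware watchdog does."""
    groups = defaultdict(list)
    for d in hint_dirs:
        groups[get_device_id(d)].append(d)
    return dict(groups)
```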
Takuya ASADA
ca52407fd6 dist/common/scripts: convert scylla_kernel_check to python3
Convert bash script to python3.
2018-06-22 12:31:12 +09:00
Takuya ASADA
5efbb714ff dist/common/scripts: convert scylla_ec2_check to python3
Convert bash script to python3.
2018-06-22 12:30:59 +09:00
Takuya ASADA
d0b9464dc7 dist/common/scripts: convert scylla_sysconfig_setup to python3
Convert bash script to python3.
2018-06-22 12:30:49 +09:00
Takuya ASADA
d3a3d0f8de dist/common/scripts: convert scylla_setup to python3
Convert bash script to python3.
2018-06-22 12:30:37 +09:00
Takuya ASADA
8030e89725 dist/common/scripts: convert scylla_selinux_setup to python3
Convert bash script to python3.
2018-06-22 12:30:29 +09:00
Takuya ASADA
8cfc4f1c3d dist/common/scripts: convert scylla_raid_setup to python3
Convert bash script to python3.
2018-06-22 12:30:19 +09:00
Takuya ASADA
63a287b7d4 dist/common/scripts: convert scylla_ntp_setup to python3
Convert bash script to python3.
2018-06-22 12:30:10 +09:00
Takuya ASADA
01eea76a4e dist/common/scripts: convert scylla_fstrim_setup to python3
Convert bash script to python3.
2018-06-22 12:29:56 +09:00
Takuya ASADA
5e07567c60 dist/common/scripts: convert scylla_dev_mode_setup to python3
Convert bash script to python3.
2018-06-22 12:29:44 +09:00
Takuya ASADA
ccc6dbf6c7 dist/common/scripts: convert scylla_cpuset_setup to python3
Convert bash script to python3.
2018-06-22 12:29:25 +09:00
Takuya ASADA
7fd81510a4 dist/common/scripts: convert scylla_cpuscaling_setup to python3
Convert bash script to python3.
2018-06-22 12:29:04 +09:00
Takuya ASADA
e858674a79 dist/common/scripts: convert scylla_coredump_setup to python3
Convert bash script to python3.
2018-06-22 12:28:50 +09:00
Takuya ASADA
b3ee02dd1e dist/common/scripts: convert scylla_bootparam_setup to python3
Convert bash script to python3.
2018-06-22 12:27:56 +09:00
Takuya ASADA
2a4ba883c8 dist/common/scripts: extend scylla_util.py to convert setup scripts to python3
To port the setup scripts to python3, the following utility functions/classes
are introduced:
 - run(): execute command line, returns return code
 - out(): execute command line, returns stdout as string
 - is_debian_variant() / is_redhat_variant() / is_gentoo_variant()
 / is_ec2() / is_systemd(): detect specific environment
 - hex2list(): implement hex2list.py code as a function
 - makedirs(): same as os.makedirs() but does nothing when the dir already exists
 - dist_name() / dist_ver(): alias of platform.dist()
 - class systemd_unit: a utility to control a systemd unit using systemctl
 - class sysconfig_parser: reader/writer of /etc/sysconfig files
 - class concolor: ANSI color escape sequences list
2018-06-22 12:21:37 +09:00
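As an illustration of the hex2list() helper, the conversion turns a hex CPU mask into a list of set CPU ids; this sketch ignores the comma-separated mask groups that the real hex2list.py also handles:

```python
def hex2list(mask):
    """Convert a hex CPU mask such as '0x15' (or '15') into a
    comma-separated list of set CPU ids, lowest first."""
    value = int(mask, 16)  # int() accepts an optional '0x' prefix with base 16
    return ",".join(str(i) for i in range(value.bit_length())
                    if value >> i & 1)
```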
Takuya ASADA
b7c980ac56 dist/common/scripts: convert scylla_io_setup and scylla_util.py to python3
To share scylla_util.py with the python3-converted setup scripts, these
scripts need to be python3 too.
2018-06-22 12:11:27 +09:00
Glauber Costa
7f6b6fa129 github: direct users asking questions to our mailing list.
Very often people use the issue tracker to just ask questions. We have
been telling them to close the bug and move the discussion somewhere
else but it would be better if people were already directed to the right
place before they even get it wrong.

This would be easier for everybody.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180621135051.3254-1-glauber@scylladb.com>
2018-06-21 17:43:23 +03:00
Tomasz Grabiec
0f380f24c3 Update seastar submodule
* seastar 3c60b82...7aca670 (2):
  > Merge "Log stack trace during exception" from Gleb
  > shared_ptr: Introduce lw_shared_ptr::dispose() for convenience
2018-06-21 12:19:33 +02:00
Vladimir Krivopalov
ea09cf732d tests: Add test covering overlapped range tombstones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
5df3cd1787 tests: Add test writing two non-adjacent range tombstones with same clustering key prefix at their bounds.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
35b90b2d1e tests: Add test where 2nd range tombstone covers the remainder of the 1st one.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
2f277c29e8 tests: Add function that writes from multiple memtables into SSTables.
This comes in handy when we want to test overlapping range tombstones
because the memtable would otherwise de-overlap them internally.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
41d283fe83 tests: Add tests writing a range tombstone and a row overlapping with its end.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
ff53f601e4 tests: Add tests writing a range tombstone and a row overlapping with its start.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
f552f30d57 tests: Add unit test covering overlapping RTs and rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
27e053f933 tests: Add test covering SSTables 3.x with many RTs.
This test checks the validity of the promoted index generated for an
RT-only data file.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
aa4a011eb3 tests: Add test covering mixed rows and range tombstones.
Tests three cases:
 - a row lying inside a range tombstone
 - a row that has the same clustering key as range tombstone start
 - a row that has the same clustering key as range tombstone end

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
492a401855 tests: Add test to cover non-adjacent RTs.
These are two RTs where one's end bound has the same clustering as the
other's start bound, but both bounds are exclusive.

In this case those bounds should not (and cannot) be merged into a
single RT boundary when writing RT markers.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
3a96226492 tests: Add unit test covering adjacent range tombstones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
b3e7982fec tests: Add unit test covering simple range tombstone.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-20 18:08:36 -07:00
Vladimir Krivopalov
5559fc2121 sstables: Add support for writing range tombstones in SSTables 3.x format.
For SSTables 3.x. ('mc' format), range tombstones are represented by
their bounds that are written to the data file as so-called RT markers.
For adjacent range tombstones, an RT marker can be of a 'boundary' type
which means it closes the previous range tombstone and opens the new
one.

Internally, sstable_writer_m relies on range_tombstone_stream to both
de-overlap incoming range tombstones and order them so that when they
are drained they can be easily thought of as just pairs of their bounds.
2018-06-20 18:08:36 -07:00
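The marker scheme this commit describes can be modeled roughly as follows; this is a schematic of the format's idea only (real merging also depends on bound kinds, since exclusive bounds cannot collapse into a boundary marker):

```python
def tombstones_to_markers(tombstones):
    """Turn ordered, non-overlapping [start, end) range tombstones into
    RT markers; an end that coincides with the next start collapses
    into a single 'boundary' marker, as in the 'mc' format."""
    markers = []
    for start, end in tombstones:
        if markers and markers[-1] == ("close", start):
            markers[-1] = ("boundary", start)  # closes previous, opens next
        else:
            markers.append(("open", start))
        markers.append(("close", end))
    return markers
```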
Noam Hasson
6572917fda docker: added support for authenticator & authorizer command arguments
By default Scylla docker runs without the security features.
This patch adds support for supplying different values for the
authenticator and authorizer classes, allowing a secure Scylla setup in
Docker.
For example if you want to run a secure Scylla with password and authorization:
docker run --name some-scylla -d scylladb/scylla --authenticator
PasswordAuthenticator --authorizer CassandraAuthorizer

Update the Docker documentation with the new command line options.

Signed-off-by: Noam Hasson <noam@scylladb.com>
Message-Id: <20180620122340.30394-1-noam@scylladb.com>
2018-06-20 20:33:59 +03:00
Gleb Natapov
f53ae2d07f storage_service: avoid "ignored future" message during schema check failure
Message-Id: <20180620134402.GQ1918@scylladb.com>
2018-06-20 18:53:47 +03:00
Takuya ASADA
6acb2add4a dist/ami: show unsupported instance type message even when scylla_ami_setup is still running
With the current .bash_profile it prints "Constructing RAID volume..." while
scylla_ami_setup is still running, even when it is running on unsupported
instance types.

To avoid that we need to run the instance type check first, then we can
run the rest of the script.

Fixes #2739

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613111539.30517-1-syuu@scylladb.com>
2018-06-20 16:49:15 +03:00
Takuya ASADA
4151120752 dist/debian: change owner of build/debs/ to current user
Currently build/debs/ is owned by the root user since pbuilder needs to
run as root.
So chown it after the build finishes.

Fixes #3447

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613093213.28827-1-syuu@scylladb.com>
2018-06-20 16:48:17 +03:00
Tomasz Grabiec
4523706312 gdb: Adjust for removal of the 'fsu' field from cpu_mem
Message-Id: <1529497459-15287-1-git-send-email-tgrabiec@scylladb.com>
2018-06-20 15:27:35 +03:00
Avi Kivity
b97e1aeff5 Merge "Consume row marker correctly" from Piotr
"
Make sure we properly handle row marker and row tombstone
when reading a row.

Tests: unit {release}
"

* 'haaawk/sstables3/read-liveness-info-v4' of ssh://github.com/scylladb/seastar-dev:
  sstable: consume row marker in data_consume_rows_context_m
  sstable: Add consumer_m::consume_row_marker_and_tombstone
  sstable: add is_set and to_row_marker to liveness_info
2018-06-20 14:44:03 +03:00
Piotr Jastrzebski
75edaff7b6 sstable: consume row marker in data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-20 13:13:29 +02:00
Piotr Jastrzebski
cbfc741d70 sstable: Add consumer_m::consume_row_marker_and_tombstone
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-20 13:13:16 +02:00
Tomasz Grabiec
5548eb96f7 Merge "store prepared statements parameters values" from Vlad
* https://github.com/vladzcloudius/scylla.git tracing_prepared_parameters-v6:
  cql3::query_options: add get_names() method
  tracing::trace_state: hide the internals of params_values
  tracing: store queries statements for BATCH
  tracing: store the prepared statements parameters values
2018-06-19 19:12:26 +02:00
Avi Kivity
f912eefbe2 Merge "Fix numerous issues in AMI related scriptology" from Vlad
"
A few fixes in scripts that were found when debugging #3508.
This series fixed this issue.

"

Fixes #3508

* 'ami_scripts_fixes-v1' of https://github.com/vladzcloudius/scylla:
  scylla_io_setup: properly define the disk_properties YAML hierarchy
  scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
  scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
  scylla_io_setup: print the io_properties.yaml file name and not its handle info
  scylla_lib.sh: tolerate perftune.py errors
2018-06-19 19:31:23 +03:00
Avi Kivity
e0eb66af6b Merge "Do not allow compaction controller shares to grow indefinitely" from Glauber
"
We are seeing some workloads with large datasets where the compaction
controller ends up with a lot of shares. Regardless of whether or not
we'll change the algorithm, this patchset handles a more basic issue,
which is the fact that the current controller doesn't set a maximum
explicitly, so if the input is larger than the maximum it will keep
growing without bounds.

It also pushes the maximum input point of the compaction controller from
10 to 30, allowing us to err on the side of caution for the 2.2 release.
"

* 'tame-controller' of github.com:glommer/scylla:
  controller: do not increase shares of controllers for inputs higher than the maximum
  controller: adjust constants for compaction controller
2018-06-19 18:49:02 +03:00
Avi Kivity
b6b5647836 Merge "Fix querier-cache related issues" from Botond
"
This mini series fixes some querier-cache related issues discovered
while working on stateful range-scans.
1) A problem in the memory based cache eviction test that is as yet
   unexposed (#3529).
2) Possible usage of invalidated iterators in querier_cache (#3424).
3) lookup() possibly returning a querier with the wrong read range
   (#3530).

Tests: unit(release)
"

* 'fix-querier-cache-invalid-iterators-master' of https://github.com/denesb/scylla:
  querier: find_querier(): return end() when no querier matches the range
  querier_cache: restructure entries storage
  tests/querier_cache: fix memory based eviction test
2018-06-19 16:29:03 +03:00
Paweł Dziepak
e55034a33e cql3: batch_statement: use external_memory_usage() to get mutation size
batch_statement::verify_batch_size() verifies that the total size of
mutations generated by the batch statement is smaller than certain
configurable thresholds. This is done by a custom mutation_partition
visitor, which violates atomic_cell_view::value() preconditions by
calling it even for dead cells.

The simplest solution is to use
mutation_partition::external_memory_usage() instead.

Message-Id: <20180619131405.12601-1-pdziepak@scylladb.com>
2018-06-19 16:26:52 +03:00
Duarte Nunes
d3e24076b0 tests/cell_locker_test: Prevent timeout underflow
Timeout underflow causes the test to hang, due to a seastar bug
with negative time_points.


Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180619091635.34228-1-duarte@scylladb.com>
2018-06-19 16:26:52 +03:00
Duarte Nunes
ee4b3c4c2d database: Await pending writes before truncating CF on drop
When dropping a table, wait for the column family to quiesce so that
no pending writes compete with the truncate operation, possibly
allowing data to be left on disk.

Fixes #2562

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180618193134.31971-1-duarte@scylladb.com>
2018-06-19 16:26:52 +03:00
Botond Dénes
9490b8935c .gitignore: add resources directory
This directory is necessary when running dtests against a scylla
repository.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8d9ad8dae2b9d2ec3cc6c9c4d6527fba8ce91272.1529387008.git.bdenes@scylladb.com>
2018-06-19 16:26:51 +03:00
Piotr Sarna
61e3ee6c3c cql3: fix supernumerary column on view update
Patch f39891a999 fixed #3443,
but also introduced a regression in dtest - a new column
was unconditionally added to the view during ALTER TABLE ADD,
while it should only be the case for "include all columns" views.
This patch fixes the regression (spotted by query_new_column_test).

References #3443
Message-Id: <7410d965255a514d78cf0ce941a3236b9d8ddbbd.1529399135.git.sarna@scylladb.com>
2018-06-19 16:26:51 +03:00
Botond Dénes
2609a17a23 querier: find_querier(): return end() when no querier matches the range
When none of the queriers found for the lookup key match the lookup
range `_entries.end()` should be returned as the search failed. Instead
the iterator returned from the failed `std::find_if()` is returned
which, if the find failed, will be the end iterator returned by the
previous call to `_entries.equal_range()`. This is incorrect because as
long as `equal_range()`'s end iterator is not also `_entries.end()` the
search will always return an iterator to a querier regardless of whether
any of them actually matches the read range.
Fix by returning `_entries.end()` when it is detected that no queriers
match the range.

Fixes: #3530
2018-06-19 13:20:43 +03:00
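The pitfall here is generic: a failed search over a sub-range must be translated into the container-wide "not found" sentinel, not the sub-range's own end. A Python rendering of the corrected lookup (schematic only, not Scylla's types):

```python
def find_querier(entries, key, range_matches):
    """Return the first cached querier under `key` whose range matches,
    or None when no candidate matches (the explicit failure sentinel)."""
    candidates = [e for e in entries if e[0] == key]   # ~ equal_range()
    for e in candidates:                               # ~ find_if()
        if range_matches(e):
            return e
    return None  # do not leak a candidate that merely ended the scan
```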
Botond Dénes
7ce7f3f0cc querier_cache: restructure entries storage
Currently querier_cache uses a `std::unordered_map<utils::UUID, querier>`
to store cache entries and an `std::list<meta_entry>` to store meta
information about the querier entries, like insertion order, expiry
time, etc.

All cache eviction algorithms use the meta-entry list to evict entries
in reverse insertion order (LRU order). To make this possible
meta-entries keep an iterator into the entry map so that given a
meta-entry one can easily erase the querier entry. This however poses a
problem as std::unordered_map can possibly invalidate all its iterators
when new items are inserted. This is use-after-free waiting to happen.

Another disadvantage of the current solution is that it requires the
meta-entry to use a weak pointer to the querier entry so that in case
that is removed (as a result of a successful lookup) it doesn't try to
access it. This has an impact on all cache eviction algorithms as they
have to be prepared to deal with stale meta-entries. Stale meta-entries
also unnecessarily consume memory.

To solve these problems redesign how querier_cache stores entries
completely. Instead of storing the entries in an `std::unordered_map`
and storing the meta-entries in an `std::list`, store the entries in an
`std::list` and an intrusive-map (index) for lookups. This new design
has several advantages over the old one:
* The entries will now be in insert order, so eviction strategies can
  work on the entry list itself, no need to involve additional data
  structures for this.
* All data related to an entry is stored in one place, no data
  duplication.
* Removing an entry automatically removes it from the index as intrusive
  containers support auto unlink. This means there is no need to store
  iterators for long terms, risking use-after-free when the container
  invalidates its iterators.

Additional changes:
* Modify eviction strategies so that they work with the `entry`
  interface rather than the stored value directly.

Ref #3424
2018-06-19 13:20:40 +03:00
Botond Dénes
b9d51b4c08 tests/querier_cache: fix memory based eviction test
Do increment the key counter after inserting the first querier into the
cache. Otherwise two queriers with the same key will be inserted and
will fail the test. This problem is exposed by the changes the next
patches make to the querier-cache but will be fixed before to maintain
bisectability of the code.

Fixes: #3529
2018-06-19 13:20:13 +03:00
Vladimir Krivopalov
100eb03f29 sstables: Write end-of-partition byte before flushing the last index block.
This is to stay compliant with the Origin for SSTables 3.x.
It differs from SSTables 2.x (ka/la) as for those the last promoted
index block is pushed first and the end-of-partition byte is written
after.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:28:25 -07:00
Vladimir Krivopalov
ad0b911b03 sstables: Move to_deletion_time helper up and make it static.
It is used for writing end_open_marker for promoted index.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:25:13 -07:00
Vladimir Krivopalov
03cf20676c Revert "Add missing enum values to bound_kind."
This reverts commit 3ecc9e9ce4.

It also adds another enum to be used instead.
2018-06-18 14:22:12 -07:00
Vladimir Krivopalov
0cf42e7fd2 range_tombstone_stream: Remove an unused boolean flag.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-18 14:22:12 -07:00
Glauber Costa
e0b209b271 controller: do not increase shares of controllers for inputs higher than the maximum
Right now there is no limit to how much the shares of the controllers
can grow. That is not a big problem from the memtable flush controller,
since it has a natural maximum in the dirty limit.

But the compaction controller, the way it's written today, can grow
forever and end up with a very large value for shares. We'll cap that at
adjust() time by not allowing shares to grow indefinitely.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-18 15:16:39 -04:00
Glauber Costa
70c47eb045 controller: adjust constants for compaction controller
Right now the controller adjusts its shares based on how big the backlog
is in comparison to shard memory. We have seen in some tests that if the
dataset becomes too big, this may cause compactions to dominate.

While we may change the input altogether in future versions, I'd like to
propose a quick change for the time being: move the high point from 10x
memory size to 30x memory size. This will cause compactions to increase
in shares more slowly.

While this is as magic as the 10 before, it will allow us to err on
the side of caution, with compactions not becoming aggressive enough to
overly disrupt workloads.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-18 15:16:38 -04:00
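Taken together, the two controller patches above amount to a clamped transfer function from backlog to shares; an illustrative sketch with assumed constants (the real controller's maths differ):

```python
def compaction_shares(backlog, shard_memory, max_shares=1000, high_point=30):
    """Map the compaction backlog to scheduler shares.

    The backlog is normalized against `high_point` times shard memory
    (moved from 10x to 30x above) and shares are clamped at
    `max_shares` so oversized inputs cannot grow them indefinitely.
    `max_shares` and the linear shape are assumptions for illustration.
    """
    normalized = backlog / (high_point * shard_memory)
    return min(max_shares, max_shares * normalized)
```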
Piotr Jastrzebski
4c261d2e51 sstable: add is_set and to_row_marker to liveness_info
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-18 20:26:39 +02:00
Paweł Dziepak
71471bb322 Merge "Make front-end processing scheduling aware" from Avi
"
This patchset runs the protocol servers under the "statement" scheduling
group, and makes all execution_stages in that path scheduling aware.

I used inheriting_concrete_execution_stage instead of passing the
scheduling group to concrete_execution_stage's constructor for two
reasons:

 1. For cql statements, there is no easily accessible object that
    can host the concrete_execution_stage and be reached from both
    main.cc and the statements,
 2. In the future, we will want to assign users to different
    scheduling_groups, thus providing performance isolation for
    service-level agreements (SLAs). Using an inheriting
    execution_stage allows us to make the scheduling_group decision
    in one place.

Depends on two unmerged patches in seastar, one fixing
inheriting_concrete_execution_stage compilation with reference parameters,
and one making smp::submit_to() scheduling aware.
"

* tag 'cql-sched/v1' of https://github.com/avikivity/scylla:
  cql: make modification_statement execution_stage scheduling aware
  cql: make batch_statement execution_stage scheduling aware
  cql: make select_statement execution_stage scheduling aware
  transport: make native protocol request processing execution_stage scheduling aware
  main: start client protocol servers under the statement scheduling group
2018-06-18 16:38:30 +01:00
Avi Kivity
0cf4cf5981 cql: make modification_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
9479d3f345 cql: make batch_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
fdfc347595 cql: make select_statement execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
ec788d2a7a transport: make native protocol request processing execution_stage scheduling aware
Inherit scheduling from the caller, preventing a fall back into the main group.
2018-06-18 18:30:21 +03:00
Avi Kivity
ea39e3e9d4 main: start client protocol servers under the statement scheduling group
This will isolate client protocol and coordinator-side processing from
the rest of the system.
2018-06-18 18:30:21 +03:00
Paweł Dziepak
79fae49689 Merge seastar upstream
* seastar 6422ece...3c60b82 (5):
  > reactor: inherit scheduling_group in smp::submit_to()
  > execution_stage: fix inheriting_concrete_execution_stage with reference arguments
  > tests: shared_ptr: Add typename keyword to fix compilation
  > configure: Fix --static-stdc++ flag
  > scheduling: Move friends' definitions outside the class scope
2018-06-18 12:16:57 +01:00
Avi Kivity
782827cc1b Update seastar submodule
* seastar e7275e4...6422ece (7):
  > build: enable concepts whenever they are supported by compiler
  > shared_ptr: Enable releasing ownership of the object stored in lw_shared_ptr
  > reactor: change way of calculating task quota violations
  > Merge "Add metrics for steal time and task quota violations" from Glauber
  > bitops.hh/log2ceil(): add special case for n == 1
  > circular_buffer: add clear()
  > build: add core/execution_stage.{cc,hh} to core_files
2018-06-17 21:53:50 +03:00
Avi Kivity
f0fc888381 Merge "Try harder to move STCS towards zero-backlog" from Glauber
"
Tests: unit (release)

Before merging the LCS controller, we merged patches that would
guarantee that LCS would move towards zero backlog - otherwise the
backlog could get too high.

We didn't do the same for STCS, our first controlled strategy. So we may
end up with a situation where there are many SSTables inducing a large
backlog, but they are not yet meeting the minimum criteria for
compaction. The backlog, then, never goes down.

This patch changes the SSTable selection criteria so that if there is
nothing to do, we'll keep pushing towards reaching a state of zero
backlog. Very similar to what we did for LCS.
"

* 'stcs-min-threshold-v4' of github.com:glommer/scylla:
  STCS: bypass min_threshold unless configured to enforce strictly
  compaction_strategy: allow the user to tell us if min_threshold has to be strict
2018-06-17 18:07:23 +03:00
Glauber Costa
fd51ff3d9e STCS: bypass min_threshold unless configured to enforce strictly
If we fail to produce a SizeTiered compaction with the configured
min_threshold, we can try again to compact any two - unless there is a
global bypass telling us not to.

This will still privilege doing larger compactions in size buckets where
that is possible, but if we are idle it will try to compact any two.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 14:27:22 -04:00
Glauber Costa
290d553c3a compaction_strategy: allow the user to tell us if min_threshold has to be strict
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.

But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.

This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 13:42:43 -04:00
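The selection policy being described can be sketched as follows (illustrative pseudologic, not Scylla's compaction code):

```python
def pick_sstables(buckets, min_threshold, strict_min_threshold=False):
    """Choose sstables to compact from size-tiered buckets.

    Prefer the fullest bucket meeting `min_threshold`; when none does
    and strict enforcement is off, fall back to compacting any two
    from the fullest bucket so the backlog keeps shrinking.
    """
    eligible = [b for b in buckets if len(b) >= min_threshold]
    if eligible:
        return max(eligible, key=len)
    if strict_min_threshold:
        return []                      # honor min_threshold strictly
    pairs = [b for b in buckets if len(b) >= 2]
    return max(pairs, key=len)[:2] if pairs else []
```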
Avi Kivity
75b53c4170 Merge "sstables 3.x read counters v2 00/10] Support reading counters" from Piotr
"
Implement and test support for reading counters in SSTables 3.
"

* 'haaawk/sstables3/read-counters-v2' of ssh://github.com/scylladb/seastar-dev:
  sstable_3_x_test: add test for counters
  data_consume_rows_context_m: support reading counters
  Add consumer_m::consume_counter_column
  Extract make_counter_cell
  row.hh & mp_row_consumer.hh: Add required includes
  Use serialization_header::adjust in read_statistics
  sstables 3: add serialization_header::adjust
  data_consume_rows_context_m: add is_column_counter
  data_consume_rows_context_m: Remove unused CELL_PATH_SIZE state
  column_translation: add is_counter
2018-06-15 17:33:40 +03:00
Takuya ASADA
3f8719d67e dist/common/scripts/scylla_coredump_setup: fix typo
Correct function name is "is_debian_variant", not "is_debian_variants"

Fixed #3507

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612155353.28229-1-syuu@scylladb.com>
2018-06-15 12:11:52 +01:00
Glauber Costa
e1246a3a3a sstable_test: write to temporary directory
Currently the SSTable test is failing (at least for me and Raphael),
complaining about the file it tries to write already existing. We have
helpers now to generate temporary directories, so we should use them.

The test passes after that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180614210036.16662-1-glauber@scylladb.com>
2018-06-15 11:00:08 +02:00
Piotr Sarna
d7eb6e6c7f tests: fix a typo in idl_test.cc
Fixes #3520
Message-Id: <831ead669a30d1b136d9ae50c4a1ac7057cf3340.1529047397.git.sarna@scylladb.com>
2018-06-15 09:56:45 +01:00
Piotr Jastrzebski
346e559c1b sstable_3_x_test: add test for counters
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
2942f6eecc data_consume_rows_context_m: support reading counters
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
785e14dfb9 Add consumer_m::consume_counter_column
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
6f559445d0 Extract make_counter_cell
It will be used by both consumers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
88b66189b7 row.hh & mp_row_consumer.hh: Add required includes
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
369e4a4987 Use serialization_header::adjust in read_statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:11:09 +02:00
Piotr Jastrzebski
a3683d6e0f sstables 3: add serialization_header::adjust
In SSTables 3, min timestamp and min deletion time in serialization
header are not stored normally but instead the difference between
their value and the cassandra "epoch" is stored.
This is supposed to make SSTables smaller. As a consequence, we have
to add the "epoch" after reading the values to obtain the actual
values of min timestamp and min deletion time.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:10:48 +02:00
Tomasz Grabiec
78274276f5 row_cache: Use the memtable cleaner to create memtable snapshot during update
Memtable entries should be cleaned using the memtable cleaner, which,
unlike the cache's cleaner, is not associated with the cache
tracker. It's an error to clean a snapshot using a tracker which doesn't
own the entries. Doing so will corrupt the cache tracker's row counter.

Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.

Introduced in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
2018-06-14 18:03:02 +03:00
Piotr Sarna
6b3a97e34a hints: fix max_shard_disk_space_size initialization
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. But, it's likely that
hints_directory doesn't exist at all if hinted handoff is not enabled,
which results in Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.

Tests: unit (release)

Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>
2018-06-14 14:24:01 +01:00
Duarte Nunes
5a8b8afe19 Merge "Add support for datetime functions" from Piotr
"
This series adds the following datetime functions to CQL:
 - currentTimestamp
 - currentDate
 - currentTime
 - currentTimeUUID
 - timeUUIDToDate
 - timestampToDate
 - timeUUIDToTimestamp
 - dateToTimestamp
 - timeUUIDToUnixTimestamp
 - timestampToUnixTimestamp
 - dateToUnixTimestamp

It also comes with datetime conversions test added to cql_query_test.

Note: issue #2949 also mentioned queries like:
 $ SELECT * FROM myTable WHERE date >= currentDate() - 2d;
but it's a broader topic of supporting arithmetic operations in general,
so it's moved to #3499.

Tests: unit (release)
"

* 'support_datetime_functions_3' of https://github.com/psarna/scylla:
  tests: add datetime conversions to cql_query_tests
  cql3: add time conversion functions
  cql3: add current* time functions
  types: add time_native_type
2018-06-14 12:31:39 +01:00
Piotr Sarna
5900e7f55f tests: add datetime conversions to cql_query_tests
A test case related to the datetime conversion functions
is added to the cql_query_tests suite.
2018-06-14 11:49:11 +02:00
Piotr Sarna
695015a27e cql3: add time conversion functions
Following functions are added:
 - timeuuidtodate
 - timestamptodate
 - timeuuidtotimestamp
 - datetotimestamp
 - timeuuidtounixtimestamp
 - timestamptounixtimestamp
 - datetounixtimestamp

Fixes #2949
2018-06-14 11:49:11 +02:00
Piotr Sarna
087998b768 cql3: add current* time functions
Following date/time-related functions are added:
 - currentTimestamp
 - currentDate
 - currentTime
 - currentTimeUUID
2018-06-14 11:49:08 +02:00
Piotr Sarna
90d323a522 types: add time_native_type
CQL3's time_type didn't have any suitable native type,
so time_native_type is introduced to serve that purpose.
2018-06-14 11:11:41 +02:00
Takuya ASADA
9971576ecb dist: drop collectd support from package
Since scyllatop no longer needs collectd, we are now able to drop it.

resolves #3490

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1528961612-8528-1-git-send-email-syuu@scylladb.com>
2018-06-14 10:40:23 +03:00
Takuya ASADA
fcc1a9f6bb dist/redhat: Disables ambient capabilities when systemd/kernel doesn't support it
CentOS 7.4 supports using ambient capabilities in systemd unit
files, but some other RHEL7-compatible environments don't, which causes
Scylla startup failure.

To avoid the issue, move AmbientCapabilities line to
/etc/systemd/system/scylla.server.service.d/, install .conf only when
both systemd and kernel supported the feature.

Fixes #3486

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613232327.7839-1-syuu@scylladb.com>
2018-06-14 10:32:56 +03:00
Avi Kivity
aeffbb6732 database: stop using incremental selectors
There is a bug in incremental_selector for partitioned_sstable_set, so
until it is found, stop using it.

This degrades scan performance of Leveled Compaction Strategy tables.

Fixes #3513. (as a workaround)
Introduced: 2.1
Message-Id: <20180613131547.19084-1-avi@scylladb.com>
2018-06-13 17:57:57 +02:00
Paweł Dziepak
d5982569bc Merge "Fix fragmented serialization" from Piotr
"
After issue 3501 it turned out that IDL generates incorrect
serialization code for fragmented buffers. This series addresses
the problem by:
 * providing serialization code for FragmentRange
 * changing IDL generation rules for fragmented buffers, so they
   expect a lower layer to iterate over fragments
 * adding a test to cql_query_test suite that covers #3501
 * adding a test to idl_tests suite that covers fragmented serialization
"

* 'fix_fragmented_serialization_3' of https://github.com/psarna/scylla:
  tests: add fragmented serialization test to idl_tests
  tests: add long text value test
  idl: remove for_each from fragmented serialization
  serializer: add FragmentRange serialization
2018-06-13 14:11:16 +01:00
Gleb Natapov
98b7f6148b fix regression in perf_row_cache_update test
logalloc should be initialized explicitly by every test that uses it
now.

Message-Id: <20180613093657.GY11809@scylladb.com>
2018-06-13 15:21:20 +03:00
Avi Kivity
29976600b4 Update scylla-ami submodule
* dist/ami/files/scylla-ami 1f5329f...36e8511 (1):
  > don't try to add busy devices to the RAID.
2018-06-13 15:19:57 +03:00
Piotr Sarna
551e8f5d8c tests: add fragmented serialization test to idl_tests
The IDL tests now have an additional test that checks whether serializing
and deserializing of fragmented buffers is working properly.

References #3501
2018-06-13 13:54:12 +02:00
Piotr Sarna
cdd87af408 tests: add long text value test
A test adding a long (>8192) text/varchar value is added to the cql suite.

References #3501
2018-06-13 13:54:12 +02:00
Piotr Sarna
450e014558 idl: remove for_each from fragmented serialization
Previously fragmented buffers of bytes were serialized
with a for_each loop. Since serializing bytes involves writing
the size first and then the data, only the first fragment (and its size)
would be taken into account.
This commit changes fragmented code generation so it expects
that serialized range has a serialize(output, T) specification
and expects it to iterate over fragments on its own (just like
serializer for basic_value_view does).

Fixes #3501
2018-06-13 13:54:09 +02:00
Piotr Sarna
e525a0d51b serializer: add FragmentRange serialization
Serialization for FragmentRange classes is added to serialization
suite. It first serializes total length to a 32bit field and then
writes each fragment to output.

References #3501
2018-06-13 13:44:08 +02:00
Piotr Jastrzebski
42d2a162dd data_consume_rows_context_m: add is_column_counter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Piotr Jastrzebski
d4d3e6f8eb data_consume_rows_context_m: Remove unused CELL_PATH_SIZE state
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Piotr Jastrzebski
ca7ede7eaf column_translation: add is_counter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-13 09:27:58 +02:00
Vlad Zolotarov
0004c29aba scylla_io_setup: properly define the disk_properties YAML hierarchy
The disk_properties map should be an entry in the 'disk' list hierarchy.
Currently this list is going to contain a single element.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 19:14:22 -04:00
Vlad Zolotarov
038b2f3be2 scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 18:54:34 -04:00
Vlad Zolotarov
26277e5973 scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 15:51:10 -04:00
Vlad Zolotarov
77463ddc3b scylla_io_setup: print the io_properties.yaml file name and not its handle info
In order to get a file name from the given file() handle one should use
a file_handle.name property.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 15:25:55 -04:00
Vlad Zolotarov
aa3d9c38b5 scylla_lib.sh: tolerate perftune.py errors
When we check the currently configured tuning mode, perftune.py is allowed
to return errors. get_tune_mode() has to be able to tolerate them.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 14:21:26 -04:00
Vlad Zolotarov
818b5b75ba tracing: store the prepared statements parameters values
Store the prepared statement positional parameters values in the
corresponding system_traces.sessions entry in the 'parameters' column
(which has a map<text,text> type).

Parameters are stored as a pair of "param[X]" : "value", where X is
the index of the parameter starting from 0 and the "value" is the first
64 characters of the parameter's value string representation.

If parameters were given with their names attached (see the description
on bit 0x40 of QUERY flags in the CQL binary protocol specification) then
parameters are going to be stored in the "param[X](<bound variable name>)" : "value"
form.

If the value's string representation is longer than 64 characters then the "value" will
contain only the first 64 characters of it, followed by "...".

For a BATCH of prepared statements the parameter "name" will have a form of
param[Y][X] where Y is the index of the corresponding prepared statement
in the BATCH and X is the index of the parameter. Both X and Y start from
0.

Note:
Had to switch to boost::range::find() in sstables::big_sstable_set in order to
address the "ambiguous overload" compilation error.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
a1da285f9e tracing: store queries statements for BATCH
Similarly to a regular QUERY or EXECUTE, we want to see the actual
query statements that were part of the BATCH.

If a traced query has only a single statement to execute then its statement
will be stored in the form 'query':'<statement>'.

If there are two or more queries (BATCH) then the statement of each query in
the BATCH will be stored in the form 'query[X]':'<statement>', where X is the
index of the query in the BATCH, starting from 0.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
c0e51c4521 tracing::trace_state: hide the internals of params_values
Hide it inside the trace_state.cc in order to avoid future circular
dependencies with other .hh files.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Vlad Zolotarov
a469567605 cql3::query_options: add get_names() method
This method returns names of named prepared statement parameters.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-06-12 10:57:05 -04:00
Avi Kivity
d91891b6f0 Restore scylla-ami submodule
Commit b38ced0fcd ("Configure logalloc
memory size during initialization") updated the scylla-ami submodule
inadvertently.
2018-06-12 10:56:59 +03:00
Avi Kivity
74a3ab36e3 Restore seastar submodule
Commit b38ced0fcd ("Configure logalloc
memory size during initialization") updated the seastar submodule
inadvertently.
2018-06-12 10:37:35 +03:00
Avi Kivity
24a9a3c679 Merge "Push memory limits configuration up to main" from Gleb
"
Many components limit their internal memory pools/caches/queues depending
on the amount of memory present in the system. Each of them uses the seastar
memory interface to get the information about memory availability
which makes it harder to 1: test the components with various memory
configurations and 2: to see which components reserve memory and how
much each one reserves.

The patch changes all the components that rely on memory size to get this
information through configuration parameter during creation instead of
checking it directly with seastar, so only main interacts with seastar
allocator.
"

* 'gleb/memory-config-v2' of github.com:scylladb/seastar-dev:
  Provide available memory size to compaction_manager object during creation
  Configure authorized_prepared_statment_cache memory limit during object creation
  Configure logalloc memory size during initialization
  Provide cql max request limit to cql server object during creation
  Configure query result memory limiter size limit during object creation
  Configure querier_cache size limit during object creation
  Provide available memory size to messaging_service object during creation
  Provide available memory size to hinted handoff resource manager during creation
  Provide available memory size to storage_proxy object during creation
  Provide available memory size to commitlog during creation
  Provide available memory size to database object during creation
  Configure prepared_statements_cache memory limit from outside
2018-06-11 15:34:14 +03:00
Gleb Natapov
59da525e0d Provide available memory size to compaction_manager object during creation 2018-06-11 15:34:14 +03:00
Gleb Natapov
da20d86423 Configure authorized_prepared_statment_cache memory limit during object creation 2018-06-11 15:34:14 +03:00
Gleb Natapov
b38ced0fcd Configure logalloc memory size during initialization 2018-06-11 15:34:14 +03:00
Gleb Natapov
894673ac14 Provide cql max request limit to cql server object during creation 2018-06-11 15:34:14 +03:00
Gleb Natapov
7832266cd7 Configure query result memory limiter size limit during object creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
04727acee9 Configure querier_cache size limit during object creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
646e400918 Provide available memory size to messaging_service object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
cdf1289b43 Provide available memory size to hinted handoff resource manager during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
ac88935baa Provide available memory size to storage_proxy object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
cc47f6c69d Provide available memory size to commitlog during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
f41575a156 Provide available memory size to database object during creation 2018-06-11 15:34:13 +03:00
Gleb Natapov
461f20e7b1 Configure prepared_statements_cache memory limit from outside
Pass desirable memory limit during construction instead of querying
memory size explicitly.
2018-06-11 15:34:13 +03:00
Tomasz Grabiec
a91974af7a tests: row_cache: Reduce concurrency limit to avoid bad_alloc
The test uses random mutations. We saw it failing with bad_alloc from time to time.
Reduce concurrency to reduce memory footprint.

Message-Id: <20180611090304.16681-1-tgrabiec@scylladb.com>
2018-06-11 10:06:56 +01:00
Tomasz Grabiec
cd7c7ac40f mutation_partition: Make do_compact() respect range tombstone merging rules
It compares only timestamps, but it should use intrinsic ordering of
the tombstone, which takes deletion time into consideration as well.
If we have two range tombstones with the same timestamp but different
deletion time (odd case, but still), then the one with the higher
deletion time should win. That's what all other parts of the system
use to resolve merges, in particular range_tombstone_list and
compact_mutation_state (the fragment stream compactor).

Not respecting this ordering violates the following equality:

  do_compact(do_compact(m1) + m2) == do_compact(m1 + m2)

which may results in some clustered rows being missing in the
right-hand side, but not in the left-hand side, due to differences in
range tombstones.

This impacts only tests currently.
Message-Id: <1528705602-7218-1-git-send-email-tgrabiec@scylladb.com>
2018-06-11 10:05:52 +01:00
Nadav Har'El
41472e2618 legacy_schema_migrator: add comment
When I came across db/legacy_schema_migrator.cc, I had no idea what it
does and though I had obvious guesses (it somehow migrates old schemas,
right?) I didn't know what it really does. So after I figured this out,
I wrote this comment so the next person doesn't need to guess.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180605120225.25173-1-nyh@scylladb.com>
2018-06-10 19:39:06 +03:00
Avi Kivity
4b81feb344 Merge "switch to systemd-coredump on Debian 9" from Takuya
* 'systemd-coredump-debian9' of https://github.com/syuu1228/scylla:
  dist/debian: fix pystache package name on Debian / Ubuntu
  dist/debian: switch to systemd-coredump on Debian 9
  dist/debian: rename 99-scylla.conf to 99-scylla-coredump.conf
2018-06-10 19:38:25 +03:00
Asias He
059ec89ad1 gms: Add is_normal helper to endpoint_state
It is faster than gossiper::is_normal because it avoids searching in
the std::map<application_state, versioned_value>. It is useful for the
code in the fast path which needs to query if a node is in NORMAL
status.

Fixes #3500

Message-Id: <42db91fa4108f9f4fcf94fed3ec403ccf35d15e9.1528354644.git.asias@scylladb.com>
2018-06-10 19:21:03 +03:00
Vladimir Krivopalov
9c9c85cde5 tests: Add test writing UDT data to SSTables 3.x.
Original data and index files are generated using Cassandra 3.11.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <d0ea8146d6f2a76a5f661271500b35390962a9d4.1528420647.git.vladimir@scylladb.com>
2018-06-10 19:20:42 +03:00
Avi Kivity
74469ecc09 Merge "Support reading collections" from Piotr
"
Implement and test support for reading collections in SSTables 3.

Tests: unit {release}
"

* 'haaawk/sstables3/read-collections-v1' of ssh://github.com/scylladb/seastar-dev:
  sstables 3: Add tests for reading collections
  flat_mutation_reader_assertions: add more flexible asserts
  data_consume_rows_context_m: add support for collections
  mp_row_consumer_m: Add support for collections
  data_consume_rows_context_m: introduce cell_path
  Use column_translation::*_is_collection in reading
  column_translation: add *_column_is_collection()
  column_flags_m: add HAS_COMPLEX_DELETION
  Use read_unsigned_vint_length_bytes for COLUMN_VALUE
  Use read_unsigned_vint_length_bytes for CK_BLOCKS
  Implement read_unsigned_vint_length_bytes
2018-06-10 17:10:52 +03:00
Avi Kivity
2582f53b44 Merge "database and API: Add column_family::get_sstables_by_key" from Amnon
"
This is series is for nodetool getsstables.

This patch is based on:
8daaf9833a

With some minor adjustments because of the code change in sstables.

The idea is to allow searching for all the sstables that contains a
given key.

After this patch if there is a table t1 in keyspace k1 and it has a key
called aa.

curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"

Will return the list of sstables file names that contains that key.
"

* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
  Add the API implementation to get_sstables_by_key
  api: column_family.json make the get_sstables_for_key doc clearer
  column_family: Add the get_sstables_by_partition_key method
  sstable test: add has_partition_key test
  sstable: Add has_partition_key method
  keys_test: add a test for nodetool_style string
  keys: Add from_nodetool_style_string factory method
2018-06-10 16:53:56 +03:00
Amnon Heiman
8fbc6a22fb Add the API implementation to get_sstables_by_key
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-06-10 16:13:01 +03:00
Amnon Heiman
cc5601d000 api: column_family.json make the get_sstables_for_key doc clearer
This patch makes it clearer that the key that get_sstables_for_key
refers to, is a partition key.
2018-06-10 16:13:01 +03:00
Amnon Heiman
acb0a738eb column_family: Add the get_sstables_by_partition_key method
The get_sstables_by_partition_key method used by the API to return a set of
sstables names that holds a given partition key.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-06-10 16:13:01 +03:00
Amnon Heiman
b8e5029991 sstable test: add has_partition_key test
This patch adds a test for the has_partition_key method: it creates an
sstable with a partition key and then uses that key with the
has_partition_key method to verify that it is there.

It also creates a different key and uses it to verify that a non-existent
key returns false.
2018-06-10 16:12:12 +03:00
Avi Kivity
ba5d8717c8 tests: disable reactor stall notifier
In case it is interacting badly with ASAN and causing spurious test
failures.
2018-06-10 15:55:00 +03:00
Avi Kivity
95b00aae33 Revert scylla-ami update in "scylla_setup: fix conditional statement of silent mode"
This reverts part of commit 364c2551c8. I mistakenly
changed the scylla-ami submodule in addition to applying the patch. The revert
keeps the intended part of the patch and undoes the scylla-ami change.
2018-06-10 14:53:40 +03:00
Asias He
d23dafa7ac dht: Remove column_families parameter in add_rx_ranges and add_tx_ranges
In 4b1034b (storage_service: Remove the stream_hints), we removed the
only user of the API with the column_families parameter.

std::vector column_families = { db::system_keyspace::HINTS };
streamer->add_tx_ranges(keyspace, std::move(ranges_per_endpoint),
column_families);

We can simplify the range_streamer code a bit by removing it.

Fixes #3476

Tests: dtest update_cluster_layout_tests.py
Message-Id: <c81d79c5e6dbc8dd78c1242837de892e39d6abd2.1528356342.git.asias@scylladb.com>
2018-06-10 14:53:40 +03:00
Piotr Jastrzebski
7d3abb0668 sstables 3: Add tests for reading collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:40:10 +02:00
Piotr Jastrzebski
176305c2f2 flat_mutation_reader_assertions: add more flexible asserts
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:39:51 +02:00
Piotr Jastrzebski
f9c62b8188 data_consume_rows_context_m: add support for collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:39:07 +02:00
Piotr Jastrzebski
fd89f42b09 mp_row_consumer_m: Add support for collections
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:35:12 +02:00
Piotr Jastrzebski
ffb6b9ed24 data_consume_rows_context_m: introduce cell_path
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 23:30:40 +02:00
Piotr Jastrzebski
5e1dd89d4d Use column_translation::*_is_collection in reading
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:50:23 +02:00
Piotr Jastrzebski
7bb25a2dd9 column_translation: add *_column_is_collection()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:48:43 +02:00
Piotr Jastrzebski
2b8ff15f9f column_flags_m: add HAS_COMPLEX_DELETION
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:47:19 +02:00
Avi Kivity
f9d66f88bb transport: advertise the shard serving a connection
It is useful for the client driver to know which shard is serving a
particular connection, so it can only send requests through that connection
which will be served by the same shard, eliminating a hop.

Support that by advertising a "SCYLLA_SHARD" option, with a value
corresponding to the shard number.

Acked-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180606203437.1198-1-avi@scylladb.com>
2018-06-07 10:43:16 +03:00
Avi Kivity
4a90eeb326 Update seastar submodule
* seastar 12cffef...e7275e4 (9):
  > tests: execution_stage_test: capture sg by value
  > Merge "Add in-path parameter suport to the code generation" from Amnon
  > Merge "Add scheduling_group inheritance to execution_stage" from Avi
  > tutorial: explain how to find origin of exception
  > tls: Ensure handshake always drains output before return/throw
  > build: cmake: correct stdc++fs library name once more
  > perftune.py: make sure config file existing before write
  > Update travis-ci integration
  > build: fix compilation issues on cmake. missing stdc++-fs
2018-06-06 19:07:16 +03:00
Avi Kivity
6f23403137 Merge "Virtualize IndexInfo system table" from Duarte
"
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.

Fixes #3483
"

* 'built-indexes-virtual-reader/v2' of github.com:duarten/scylla:
  tests/virtual_reader_test: Add test for built indexes virtual reader
  db/system_keysace: Add virtual reader for IndexInfo table
  db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
  index/secondary_index_manager: Expose index_table_name()
  db/legacy_schema_migrator: Don't migrate indexes
2018-06-06 17:35:51 +03:00
Piotr Jastrzebski
f7a1d5a437 Use read_unsigned_vint_length_bytes for COLUMN_VALUE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:54:17 +02:00
Piotr Jastrzebski
3b8b165053 Use read_unsigned_vint_length_bytes for CK_BLOCKS
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:44:53 +02:00
Piotr Jastrzebski
21a0e95a06 Implement read_unsigned_vint_length_bytes
It's a common operation that's used in multiple
places so it's best to have it implemented once.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-06 15:44:06 +02:00
Piotr Sarna
0818eb42ae cql3: remove additional IN relation check
Commit 80fc1b1408 introduced additional checks to ensure
that an IN relation in a WHERE clause can only occur on the last restricted
column. This check is not present in current Cassandra code,
the restriction isn't mentioned anywhere in 'IN relation' documentation
and removing it fixes issue 2865.
Running cql_tests dtest suite doesn't show any regression after removing
this check.

Also at: https://github.com/psarna/scylla/tree/remove_additional_in_relation_check

Tests: dtest (cql_tests), unit (release)

Fixes #2865

Message-Id: <aa8c0b33618dd58cd153e83589ac016bc63f4343.1528288388.git.sarna@scylladb.com>
2018-06-06 16:01:54 +03:00
Tomasz Grabiec
9975135110 row_cache: Make sure reader makes forward progress after each fill_buffer()
If reader's buffer is small enough, or preemption happens often
enough, fill_buffer() may not make enough progress to advance
_lower_bound. If, additionally, iterators are constantly invalidated across
fill_buffer() calls, the reader will not be able to make progress.

See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation()
for an example scenario.

Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction

Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>
2018-06-06 16:01:52 +03:00
Vlad Zolotarov
12e3e4fb2a service::client_state::has_access(): make readable_system_resources an std::unordered_set
There is no reason to use an std::set for it since we don't care about
the ordering - only about the existence of a particular entry.
A hash table will be more efficient for this use case.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528220892-5784-2-git-send-email-vladz@scylladb.com>
2018-06-06 15:29:29 +03:00
Duarte Nunes
833d34e88a Merge 'Make rows in a secondary index ordered by token' from Piotr
"
As in #3423, ensuring token order on secondary index queries can be done
by adding an additional column to views that back secondary indexes.
This column is a first clustering column and contains token value,
computed on updates.
This series also updates tests and comments refering to issue 3423.

Tests: unit (release, debug)
"

* 'order_by_token_in_si_5' of https://github.com/psarna/scylla:
  cql3: update token order comments
  index, tests: add token column to secondary index schema
  view: add handling of a token column for secondary indexes
  view: add is_index method
2018-06-06 10:07:43 +01:00
Vlad Zolotarov
2dde372ae6 locator::ec2_multi_region_snitch: don't call for ec2_snitch::gossiper_starting()
ec2_snitch::gossiper_starting() calls the base class (default) method
that sets _gossip_started to TRUE and thereby prevents the following
reconnectable_snitch_helper registration.

Fixes #3454

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
2018-06-06 12:00:17 +03:00
Asias He
6496cdf0fb db: Get rid of the streaming memtable delayed flush
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plan.

However, most of the time there will be only one stream plan at a time; the
next stream plan won't start until the previous one is finished. So, the
current coalescing does not really work.

The delayed flush adds 2s of delay for each stream session. If we have lots
of tables to stream, we will waste a lot of time.

We stream a keyspace in around 10 stream plans, i.e., 10% of the ranges at a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 = 100,000 seconds, about 27.8 hours.

To stream a keyspace with 4 tables, each table has 1000 rows.

Before:

 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds

After:

 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds

Fixes #3436

Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
2018-06-06 10:16:02 +03:00
Piotr Sarna
70ba8c8317 cql3: update token order comments
Comments about token order were outdated with token column patches
and they are now up to date.

Fixes #3423
2018-06-06 09:02:37 +02:00
Piotr Sarna
4a9bf7ed5b index, tests: add token column to secondary index schema
Additional token column is now present in every view schema
that backs a secondary index. This column is always a first part
of the clustering key, so it forces token order on queries.
Column's name is ideally idx_token, but can be postfixed
with a number to ensure its uniqueness.

It also updates tests to make them acknowledge the new token order.

Fixes #3423
2018-06-06 09:02:33 +02:00
Takuya ASADA
899f7641b6 dist/debian: fix pystache package name on Debian / Ubuntu
It's python-pystache, not python2-pystache.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2018-06-06 15:55:52 +09:00
Takuya ASADA
db9074707a dist/debian: switch to systemd-coredump on Debian 9
Debian 9 has newer systemd that supports systemd-coredump, so enable it.
2018-06-06 15:04:31 +09:00
Takuya ASADA
30386ed215 dist/debian: rename 99-scylla.conf to 99-scylla-coredump.conf
Since 99-scylla.conf is only used for setting the coredump handler,
rename it to 99-scylla-coredump.conf.
2018-06-06 14:59:32 +09:00
Piotr Sarna
d5e7b5507b view: add handling of a token column for secondary indexes
In order to ensure token order on secondary index queries,
the first clustering column of each view that backs a secondary index
is going to store a token computed from the base's partition keys.
After this commit, if there exists a column that is not present
in the base schema, it will be filled with the computed token.
2018-06-05 18:59:25 +02:00
Tomasz Grabiec
f775fc2e4c mvcc: Fix partition_entry::open_version()
After 70c72773be it's possible that
open_version() is called with a phase which is smaller than the phase
of the latest version, because latest version belongs to the
in-progress cache update. In such case we must return the existing
non-latest snapshot and not create a new version on top of the
in-progress update. Not doing this violates several invariants, and
may lead to inconsistencies, including violation of write atomicity or
temporary loss of writes.

partition_entry::read() was already adjusted by the aforementioned
commit. Do a similar adjustment for open_version().

Fixes sporadic failures of row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528211847-22825-1-git-send-email-tgrabiec@scylladb.com>
2018-06-05 18:22:38 +03:00
Takuya ASADA
60844ae67b dist/common/scripts/scylla_coredump_setup: don't run sysctl on Ubuntu 18.04
Since 99-scylla.conf is not included on Ubuntu 18.04, skip running it.

Fixes #3494

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180605093619.9197-1-syuu@scylladb.com>
2018-06-05 12:47:46 +03:00
Takuya ASADA
222b8588ee dist/common/systemd/scylla-server.service.in: add local-fs.target as dependency
We mistakenly added only network-online.target, which doesn't promise
to wait for the /var/lib/scylla mount.
For that we need local-fs.target.

Fixes #3441

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180521083349.8970-1-syuu@scylladb.com>
2018-06-05 12:26:21 +03:00
Piotr Sarna
06eee0f525 view: add is_index method
The is_index method returns true if the view that owns it
backs a secondary index.
2018-06-05 11:10:24 +02:00
Piotr Sarna
6130a00597 dist: add scylla/hints directory to scripts
The /var/lib/scylla/hints directory was missing from dist-specific
scripts, which may cause package installations to fail.
Package building scripts and descriptions are updated.

Fixes #3495

Message-Id: <0f5596cb49500416820ece023b7f76a4e2427799.1528184949.git.sarna@scylladb.com>
2018-06-05 11:33:29 +03:00
Avi Kivity
4aaf7bbc1d Merge "Add test for compression" from Piotr
"
It turns out that compression just works for SSTables 3.x.
Thanks to the previous work done on the write path.
This series cleans up tests a bit and introduces test for compression
on the read path.
"

* 'haaawk/sstables3/read-compression-v1' of ssh://github.com/scylladb/seastar-dev:
  Add test for compression in sstables 3.x
  Extract test_partition_key_with_values_of_different_types_read
  sstable_3_x_test: use SEASTAR_THREAD_TEST_CASE
  Drop UNCOMPRESSD_ when code will be used for compressed too
2018-06-04 20:33:50 +03:00
Piotr Jastrzebski
25a7f03f7f Add test for compression in sstables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:41:10 +02:00
Piotr Jastrzebski
be9c7391aa Extract test_partition_key_with_values_of_different_types_read
It will be used also for testing compression.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:41:10 +02:00
Piotr Jastrzebski
1f324b7fc8 sstable_3_x_test: use SEASTAR_THREAD_TEST_CASE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:40:52 +02:00
Piotr Jastrzebski
3e3ccdb323 Drop UNCOMPRESSD_ when code will be used for compressed too
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-04 18:29:02 +02:00
Avi Kivity
6d6c355dc0 Merge "augment system.local with sharding information" from Glauber
"
This patch adds nr_shards, msb_ignore, and the actual sharding algorithm to the
system.local table. Drivers and other tools can then make use of this
information to talk to scylla in an optimal way
"

* 'system_tables-v3' of github.com:glommer/scylla:
  system_keyspace: add sharding information to local table
  partitioner: export the name of the algorithm used to do intra-node sharding
2018-06-04 18:50:28 +03:00
Glauber Costa
bdce561ada system_keyspace: add sharding information to local table
We would like the clients to be able to route work directly to the right
shards. To do that, they need to know the sharding algorithm and its
parameters.

The algorithm can be copied into the client, but the parameters need to
be exported somewhere. Let's use the local table for that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
v2: force msb to zero on non-murmur
2018-06-04 11:25:58 -04:00
Glauber Costa
250d9332dc partitioner: export the name of the algorithm used to do intra-node sharding
We will export this on system tables. To avoid hard-coding it in the system
table level, keep it at least in the dht layer where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-04 11:25:58 -04:00
Takuya ASADA
ad4ca1e166 dist: simplified build script templates
Currently, build_deb.sh looks very complicated because each distribution
requires different parameters, and we apply them with sed commands one by one.

This patch replaces them with Mustache, a template language with a
simple and easy syntax.
Both the .rpm distributions and the .deb distributions have pystache (a Python
implementation of Mustache), so we will use it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180604104026.22765-1-syuu@scylladb.com>
2018-06-04 14:38:52 +03:00
Paweł Dziepak
24764712b6 sstable: fix capture by reference of stack variable in continuation
Message-Id: <20180604102542.21799-1-pdziepak@scylladb.com>
2018-06-04 14:35:49 +03:00
Duarte Nunes
dfa779ebe7 Merge 'Separate hinted handoff manager for materialized views' from Piotr
"
This series introduces a separate hinted handoff manager for materialized views.

Steps:
 * decouple resource limits from hinted handoff, so multiple instances can share space
   and throughput limits in order to avoid internal fragmentation for every instance's
   reservations
 * add a subdirectory to data/, responsible for storing materialized view hints
 * decouple registering global metrics from hinted handoff constructor, now that there
   can be more than one instance - otherwise 'registering metrics twice' errors are going to occur
 * add a hints_for_views_manager to storage proxy and route failed view updates to use it
   instead of the original hints_manager
 * restore previous semantics for enabling/disabling hinted handoff - regular hinted handoff
   can be disabled or enabled just for specific datacenters without influencing materialized
   views flow
"

* 'separate_hh_for_mv_4' of https://github.com/psarna/scylla:
  storage_proxy: restore optional hinted handoff
  storage_proxy: add hints manager for views
  hints: decouple hints manager metrics from constructor
  db, config: add view_pending_updates directory
  hints: move space_watchdog to resource manager
  hints: move send limiter to resource manager
  hints: move constants to resource_manager
2018-06-04 12:03:59 +01:00
Duarte Nunes
01676a2cda tests/virtual_reader_test: Add test for built indexes virtual reader
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:31:29 +01:00
Duarte Nunes
3e39985c7a db/system_keysace: Add virtual reader for IndexInfo table
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.

Fixes #3483

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
65c4205334 db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
This patch adds the same comment that exists in Apache Cassandra,
explaining that the table_name column in the IndexInfo system table
actually refers to the keyspace name. Don't be fooled.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
bc4db67524 index/secondary_index_manager: Expose index_table_name()
Expose secondary_index::index_table_name() so knowledge of how to
build an index name can remain centralized.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Duarte Nunes
7187963bda db/legacy_schema_migrator: Don't migrate indexes
Previous versions contained no indexes, and Apache Cassandra indexes
cannot be migrated to Scylla.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-06-04 11:14:17 +01:00
Vlad Zolotarov
e759803f48 cql3::authorized_prepared_statements_cache: properly set the expiration timeout
Because authorized_prepared_statements_cache caches information that comes from
the permissions cache and from the prepared statements cache, it should have its
entries' expiration period set to the minimum of the expiration periods of these caches.

The same goes for the entry refresh period, but since the prepared statements cache
does not have a refresh period of its own, authorized_prepared_statements_cache's
entry refresh period is simply equal to that of the permissions cache.

Fixes #3473

Tests: dtest{release} auth_test.py

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1527789716-6206-1-git-send-email-vladz@scylladb.com>
2018-06-04 10:34:54 +02:00
Piotr Jastrzebski
0b72594c1f data_consume_rows_context_m: Use find_first and find_next
Those methods of boost::dynamic_bitset allow much more
efficient implementation of skip_absent_columns and
move_to_next_column.

Also fix some indentation and variable naming.

Test: unit {release}

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <8a4dea51060c5a02bb774eac43e9eb67d316049a.1528100153.git.piotr@scylladb.com>
2018-06-04 11:18:03 +03:00
Piotr Sarna
f12fdcffdb storage_proxy: restore optional hinted handoff
Since hinted handoff for materialized views is now a separate entity,
regular hinted handoff can go back to being optional.
2018-06-04 09:46:06 +02:00
Piotr Sarna
a6aae369da storage_proxy: add hints manager for views
This commit adds a separate hints manager that serves
only failed materialized view updates.
2018-06-04 09:46:06 +02:00
Piotr Sarna
204bc17bd7 hints: decouple hints manager metrics from constructor
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
2018-06-04 09:46:06 +02:00
Piotr Sarna
a791dce0ae db, config: add view_pending_updates directory
Hints for materialized view updates need to be kept somewhere,
because their dedicated hints manager has to have a root directory.
view_pending_updates directory resides in /data and is used
for that purpose.
2018-06-04 09:46:06 +02:00
Piotr Sarna
f345efc79a hints: move space_watchdog to resource manager
Space watchdog is decoupled from hints manager and moved to resource
manager, so it can be shared among different hints manager instances.
2018-06-04 09:46:01 +02:00
Piotr Sarna
ef40f7e628 hints: move send limiter to resource manager
Send limiting semaphore is moved from hints manager to resource manager.
In consequence, hints manager now keeps a reference to its resource
manager.
2018-06-04 09:35:58 +02:00
Piotr Sarna
2315937854 hints: move constants to resource_manager
Constants related to managing resources are moved to newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
2018-06-04 09:35:58 +02:00
Avi Kivity
9b21fbc055 Merge "LCS: enable compaction controller" from Glauber
"

In preparation, we change LCS so that it tries harder to push data
to the last level, where the backlog is supposed to be zero.

The backlog is defined as:

backlog_of_stcs_in_l0 + Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out

where:
 * the fan_out is the amount of SSTables we usually compact with the
   next level (usually 10).
 * max_levels is the number of levels currently populated
 * sizeof(L) is the total amount of data in a particular level.

Tests: unit (release)
"

* 'lcs-backlog-v2' of github.com:glommer/scylla:
  LCS: implement backlog tracker for compaction controller
  LCS: don't construct property in the body of constructor
  LCS: try harder to move SSTables to highest levels.
  leveled manifest: turn 10 into a constant
  backlog: add level to write progress monitor
2018-06-04 10:29:56 +03:00
Amos Kong
364c2551c8 scylla_setup: fix conditional statement of silent mode
Commit 300af65555 introduced a problem in a conditional statement: the
script will always abort in silent mode, regardless of the return value.

Fixes #3485

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <1c12ab04651352964a176368f8ee28f19ae43c68.1528077114.git.amos@scylladb.com>
2018-06-04 10:14:06 +03:00
Glauber Costa
6317bd45d7 LCS: implement backlog tracker for compaction controller
This is the last missing tracker among the major strategies. After
this, only DTCS is left.

To calculate the backlog, we will define the point of zero-backlog
as having all data in the last level. The backlog is then:

Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out,

where:
 * the fan_out is the amount of SSTables we usually compact with the
   next level (usually 10).
 * max_levels is the number of levels currently populated
 * sizeof(L) is the total amount of data in a particular level.

Care is taken for the backlog not to jump when a new level has been just
recently created.

Aside from that, SSTables that accumulate in L0 can be subject to STCS.
We will then add a STCS backlog in those SSTables to represent that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:14:09 -04:00
Glauber Costa
04546df55c LCS: don't construct property in the body of constructor
Right now we are constructing the _max_sstable_size_in_mb property in
the body of the constructor, which makes it hard for us to use it from
other properties.

We are doing that because we'd like to test the bounds of that value, so
a cleaner way is to have a helper function for it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:14:09 -04:00
Glauber Costa
28382cb25c LCS: try harder to move SSTables to highest levels.
Our current implementation of LCS can end up with situations in which
just a bit of data is in the highest levels, with the majority in the
lowest levels. That happens because we will only promote things to
highest levels if the amount of data in the current level is higher than
the maximum.

This is a pre-existing problem in itself, but became even clearer when
we started trying to define what is the backlog for LCS.

We have discussed ways to fix this by redefining the criteria on when
to move data to the next levels. That would require us to change the way
things are today considerably, allowing parallel compactions, etc. There
is a significant risk that we'll increase write amplification and we would
need to carefully validate that.

For now I will propose a simpler change that essentially solves the
"inverted pyramid" problem of current LCS without major disruption:
keep selecting compaction candidates with the same criteria that we do
today, which helps make sure we are not compacting high levels for no
reason; but if there is nothing to do, use the idle time to push data to
higher levels. As an added benefit, old data that is in the higher levels
can also be compacted away faster.

With this patch we see that in an idle, post-load system all data is
eventually pushed to the last level. Systems under constant writes keep
behaving the same way they did before.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 18:12:19 -04:00
Glauber Costa
e64b471e3d leveled manifest: turn 10 into a constant
We increase levels in powers of 10 but that is a parameter
of the algorithm. At least make it into a constant so that we can
reuse it somewhere else.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-03 16:55:58 -04:00
Avi Kivity
6f2d3b7f9f Merge "Fix previous row size calculation for SSTables 3.x" from Vladimir
"
SSTables 3.x format ('m') stores the size of the previous row or RT marker
inside each row/marker. That potentially allows traversing rows/markers
in reverse order.

The previous code calculating those sizes appeared to produce invalid
values for all rows except the first one. The problem with detecting
this bug was that neither Cassandra itself nor the sstabledump tool use
those values; they are simply skipped on reading.
From UnfilteredSerializer.deserializeRowBody() method,
https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L562
:

            if (header.isForSSTable())
            {
                in.readUnsignedVInt(); // Skip row size
                in.readUnsignedVInt(); // previous unfiltered size
            }

So while the previous test files were technically correct in that they
contained valid data readable by Cassandra/sstabledump, they didn't
follow the format specification.

This patchset fixes the code to produce correct values and replaces
incorrect data files with correct ones. The newly generated data files
have been validated to be identical to files generated with Cassandra
using same data and timestamps as unit tests.

Tests: Unit {release}
"

* 'projects/sstables-30/fix-prev-row_size/v1' of https://github.com/argenet/scylla:
  tests: Fix test files to use correct previous row sizes.
  sstables: Fix calculation of previous row size for SSTables 3.x
  sstables: Factor out code building promoted index blocks into separate helpers.
2018-06-03 11:38:22 +03:00
Avi Kivity
a43b3e22fc Merge "Fix clustering blocks serialization for SSTables 3.x" from Vladimir
"
This patchset contains two fixes to the clustering key prefixes
serialization logic for SSTables 3.x.

First, it fixes a vexing typo: a bitwise-and (&) has been used instead
of a remainder operator (%) for truncating the shift value.
This did not show up in existing tests because they all had non-empty
clustering columns values.
Added tests to cover empty clustering columns values.

Second, it fixes the logic of serialization to write values up to the
prefix length, not the length of the clustering key as defined by
schema. This matches the way it is done by the Origin.

There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
"

* 'projects/sstables-30/fix-clustering-blocks/v1' of https://github.com/argenet/scylla:
  tests: Add test covering compact table with non-full clustering key.
  sstables: Improve clustering blocks writing, use logical clustering prefix size.
  tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
  tests: Add unit test covering empty values in clustering key.
  sstables: Fix typo in clustering blocks write helper.
2018-06-03 11:35:49 +03:00
Avi Kivity
1071e481ed Merge "Implement support for missing columns in SSTable 3.0" from Piotr
"
Add handling for missing columns and tests for it.

There are 3 cases:
1. The number of columns in a table is smaller than 64
2. The number of columns in a table is greater than 64
2a. and less than half of all possible columns are present in the sstable
2b. and at least half of all possible columns are present in the sstable

Case 1 is implemented using a bit mask; a column is present if mask & (1 << <column number>) == 0
Case 2a is implemented by storing the list of column numbers of the present columns
Case 2b is implemented by storing the list of column numbers of the absent columns
"

* 'haaawk/sstables3/read-missing-columns-v3' of ssh://github.com/scylladb/seastar-dev:
  sstables 3: add test for reading big dense subset of columns
  sstables 3: support reading big dense subsets of columns
  sstables 3: add test for reading big sparse subset of columns
  sstables 3: support reading big sparse subsets of columns
  sstables 3: add test for reading small subset of columns
  sstables 3: support reading small subsets of columns
2018-06-03 10:42:00 +03:00
Avi Kivity
78182a704b partition_snapshot_row_cursor: initialize _dummy and _continuous
Debug mode view_schema_test sometimes complains that a bool member
doesn't contain in-range values, apparently in the move constructor.

Initialize them for its benefit to avoid false-positive test
failures.
Message-Id: <20180602184934.31258-1-avi@scylladb.com>
2018-06-02 19:51:36 +01:00
Avi Kivity
187ebdbe46 auth: fix possible use of disengaged optional in has_salted_hash()
untyped_result_set_row's cell data type is bytes_opt, and the
get_blob() accessor accesses the value assuming it's engaged
(relying on the caller to call has()).

has_salted_hash() calls get_blob() without calling has() beforehand,
potentially triggering undefined behavior.

Fix by using get_or() instead, which also simplifies the caller.

I observed failures in Jenkins in this area. It's hard to be sure
this is the root cause, since the failures triggered an internal
consistency assertion in asan rather than an asan report. However,
the error is hard to reproduce and the fix makes sense even if it
doesn't prevent the error.

See #3480 for the asan error.

Fixes #3480 (hopefully).
Message-Id: <20180602181919.29204-1-avi@scylladb.com>
2018-06-02 19:46:32 +01:00
Piotr Jastrzebski
2fd0566eb7 sstables 3: add test for reading big dense subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:41:18 +02:00
Piotr Jastrzebski
829f0c5f80 sstables 3: support reading big dense subsets of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:41:18 +02:00
Piotr Jastrzebski
4e4972ffea sstables 3: add test for reading big sparse subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-02 10:40:56 +02:00
Piotr Jastrzebski
e5fb499736 sstables 3: support reading big sparse subsets of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:35:28 +02:00
Piotr Jastrzebski
24e9ab4ab6 sstables 3: add test for reading small subset of columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:34:03 +02:00
Piotr Jastrzebski
63d45c4f24 sstables 3: support reading small subsets of columns
A small subset contains no more than 63 elements.
Support for large subsets will come in the following
patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-01 21:33:50 +02:00
Glauber Costa
7e3093709a backlog: add level to write progress monitor
For SSTables being written, we don't know their level yet. Add that
information to the write monitor. New SSTables will always be at L0.
Compacted SSTables will have their level determined by the compaction
process.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-31 21:09:38 -04:00
Vladimir Krivopalov
b6511d1b07 tests: Add test covering compact table with non-full clustering key.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
47a7e78bc8 sstables: Improve clustering blocks writing, use logical clustering prefix size.
In the Origin, the size of the clustering key prefix used during
serialization is the actual length of the prefix and not the full size
as defined in schema. So the code is fixed to align with that logic.
This, in particular, is needed to write clustering blocks for RT
markers.

There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
3f404f19dc tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
487796de85 tests: Add unit test covering empty values in clustering key.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 17:30:36 -07:00
Vladimir Krivopalov
0dadd4fdf3 sstables: Fix typo in clustering blocks write helper.
What was supposed to be an operation taking a remainder turned out to be a
bitwise 'and'. This didn't show up in existing tests only because they
all had non-empty clustering values.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-31 15:12:40 -07:00
Avi Kivity
aab6b0ee27 Merge "Introduce new in-memory representation for cells" from Paweł
"
This is the first part of the first step of switching Scylla to the new
in-memory representation. It covers converting cells to the new
serialisation format. The actual structure of the cells doesn't differ
much from the original one, with the notable exception that large values
are now fragmented and linearisation needs to be explicit. Counters and
collections still partially rely on their old, custom serialisation code
and their handling is not optimal (although not significantly worse than
it used to be).

The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep, in each instance of an IMR type, all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming, with the hope of making it much easier to
modify the serialisation format than it would be in the case of
open-coded serialisation functions.

Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors, hence a different name). This makes it (relatively)
easy to ensure that there is an upper bound on the size of all allocations.

For now the only thing that is converted to the IMR are atomic_cells
and collections which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.

The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to used the new IMR-based cell implementation.

The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".

Refs #2031.
Refs #2409.

perf_simple_query -c4, medians of 30 results:

        ./perf_base  ./perf_imr   diff
 read     308790.08   309775.35   0.3%
 write    402127.32   417729.18   3.9%

The same with 1 byte values:
        ./perf_base1  ./perf_imr1   diff
 read      314107.26    314648.96   0.2%
 write     463801.40    433255.96  -6.6%

The memory footprint is reduced, but that is partially due to removal of
the small buffer optimisation (whether it will be restored depends on the
exact measurements of the performance impact). Generally, this series was
not expected to make a huge difference, as that would require converting
whole rows to the IMR.

Memory footprint:
Before:
mutation footprint:
 - in cache: 1264
 - in memtable: 986

After:
mutation footprint:
 - in cache: 1104
 - in memtable: 866

Tests: unit (release, debug)
"

* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
  tests/mutation: add test for changing column type
  atomic_cell: switch to new IMR-based cell representation
  atomic_cell: explicitly state when atomic_cell is a collection member
  treewide: require type for creating collection_mutation_view
  treewide: require type for comparing cells
  atomic_cell: introduce fragmented buffer value interface
  treewide: require type to compute cell memory usage
  treewide: require type to copy atomic_cell
  treewide: require type info for copying atomic_cell_or_collection
  treewide: require type for creating atomic_cell
  atomic_cell: require column_definition for creating atomic_cell views
  tests: test imr representation of cells
  types: provide information for IMR
  data: introduce cell
  data: introduce type_info
  imr/utils: add imr object holder
  imr: introduce concepts
  imr: add helper for allocating objects
  imr: allow creating lsa migrators for IMR objects
  imr: introduce placeholders
  ...
2018-05-31 19:21:15 +03:00
Amnon Heiman
bc7503feee Scyllatop to use prometheus by default
Scylla now exposes the Prometheus API by default. This patch changes
scyllatop to use the Prometheus API; the collectd API is still available.

The main changes in the patch:
* Move collectd specific logic inside collectd.
* Add support for help information.
* Add command-line options to configure the prometheus endpoint and to enable
collectd.

* Add a prometheus class that collect information from prometheus.

Fixes: #1541
Message-Id: <20180531124156.26336-1-amnon@scylladb.com>
2018-05-31 18:00:22 +03:00
Tomasz Grabiec
b5e42bc6a0 tests: row_cache: Do not hang when only one of the readers throws
Message-Id: <20180531122729.3314-1-tgrabiec@scylladb.com>
2018-05-31 18:00:22 +03:00
Piotr Sarna
360326fdc5 cql3: add compatibility with libjsoncpp < 1.6.0
Only libjsoncpp >= 1.6.0 offers a safe name() method for value
iterators. For older versions, deprecated memberName() is used
instead. Note that memberName() was deprecated because of its
inability to deal with embedded null characters.

Fixes #3471

Message-Id: <e64a62bfc24ef06daee238d79d557fe6ec8979d3.1527758708.git.sarna@scylladb.com>
2018-05-31 18:00:22 +03:00
Paweł Dziepak
131a47dea3 tests/mutation: add test for changing column type
With the introduction of the new in-memory representation, changing
column type has become a more complex operation, since it needs to handle
the switch from fixed-size to variable-size types. This commit adds an
explicit test for such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
a040d37cd5 atomic_cell: switch to new IMR-based cell representation
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
0ea6d14cf5 atomic_cell: explicitly state when atomic_cell is a collection member
Collections are not going to be fully converted to the IMR just yet and
still use the old serialisation format. This means that they still don't
support fragmented values very well. This patch passes the information
when an atomic_cell is created as a member of a collection so that later
we can avoid fragmenting the value in such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
e34ff8b4bf treewide: require type for creating collection_mutation_view 2018-05-31 15:51:11 +01:00
Paweł Dziepak
9bb1f10bb6 treewide: require type for comparing cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
aa25f0844f atomic_cell: introduce fragmented buffer value interface
As a preparation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
ec9d166a4f treewide: require type to compute cell memory usage 2018-05-31 15:51:11 +01:00
Paweł Dziepak
418c159057 treewide: require type to copy atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Paweł Dziepak
e9d6fc48ac treewide: require type for creating atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
93130e80fb atomic_cell: require column_definition for creating atomic_cell views 2018-05-31 15:51:11 +01:00
Paweł Dziepak
b25cc61a13 tests: test imr representation of cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
43b216b43d types: provide information for IMR 2018-05-31 15:51:11 +01:00
Paweł Dziepak
eec33fda14 data: introduce cell
This commit introduces cell serializers and views based on the in-memory
representation infrastructure. The code doesn't assume anything about
how the cells are stored, they can be either a part of another IMR
object (once the rows are converted to the IMR) or a separate objects
(just like current atomic_cell).
2018-05-31 15:51:11 +01:00
Duarte Nunes
f8626c7c93 tests/view_schema_test: Test view correctness under base schema changes
Reproducer for #3443.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-2-duarte@scylladb.com>
2018-05-31 12:10:50 +03:00
Duarte Nunes
c4f267bdfe database: Refresh view dependent fields when altering base
A view schema's view_info contains the id of the base regular column
that the view includes in its primary key. Since the column id of a
particular column can potentially change with a new schema version, we
need to refresh the stored column id. We weren't doing that when
unselected base columns are added, and this patch fixes it by
triggering an update of the view schema when base columns are added
and the view contains a base regular column in its PK.

Fixes #3443

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-1-duarte@scylladb.com>
2018-05-31 12:10:49 +03:00
Paweł Dziepak
544b3c9a34 data: introduce type_info
This patch introduces type_info class which contains all type
information needed by IMR deserialisation contexts.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
4929c1f39a imr/utils: add imr object holder
imr::object<> is an owning pointer to an IMR object. It is LSA-aware.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
fd47858755 imr: introduce concepts
This commit adds type traits and concepts for sizers, serializers and
writers that help explicitly specify requirements of various interfaces.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
28ea36a686 imr: add helper for allocating objects
IMR objects may own memory. object_allocator takes care of allocating
memory for all owned objects during the serialisation of their owner.

In practice a writer of the parent object would accept a helper object
created by object_allocator. That helper object would either compute
the size of the buffers that have to be allocated or perform the actual
serialisation, in the same two-phase manner as is done for the parent
IMR object.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
79941f2fc7 imr: allow creating lsa migrators for IMR objects
This patch introduces helpers for creating LSA migrators from IMR
deserialisation contexts and context factories.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
5ddb118c78 imr: introduce placeholders
In some cases the actual value of an IMR object is not known at
serialisation time. If the type is fixed-size we can use a placeholder
to defer writing it to a more convenient moment.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
8c38f09fbc tests/imr: add tests for destructor and mover methods 2018-05-31 10:09:01 +01:00
Paweł Dziepak
fa7b080443 imr: introduce destructor and mover methods
This patch introduces destructors and movers for IMR objects, which
enable them to own memory. Custom destructor and mover methods can be
defined by specialising the appropriate classes.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
c02bfb942d imr/compound: introduce tagged_type<Tag, T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
a29a88c9d9 tests/imr/compound: add tests for structure<...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
4f51901dfe imr/compound: introduce structure<...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
466d91f652 tests/imr/compound: add tests for variant<Ts...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
8e4c8ce2c4 imr/compound: introduce variant<Ts...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
7c28c9eda8 tests/imr: add test for optional<T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
6d7b205d1a imr: introduce optional<T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
eb2479fa9a tests: add test for new in memory representation 2018-05-31 10:09:01 +01:00
Paweł Dziepak
a995fb337c imr: introduce fundamental types
This patch introduces fundamental IMR types: a set of flags, a POD type
and a buffer.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
5f960beca1 imr: add IMR documentation 2018-05-31 10:09:01 +01:00
Paweł Dziepak
0092076167 tests: add helpers for generating random data 2018-05-31 10:09:01 +01:00
Paweł Dziepak
cc76480174 tests: introduce tests for metaprogramming helpers 2018-05-31 10:09:01 +01:00
Paweł Dziepak
ba5e64383a utils: add metaprogramming helper functions 2018-05-31 10:09:01 +01:00
Paweł Dziepak
5845d52632 idl: allow fragmented bytes_view in serialisation
This patch adds a new way of serialising bytes and sstring objects in the
IDL. Using write_fragmented_<field-name>() the caller can pass a range
of fragments that would be serialised without linearising the buffer.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
c41b9fc7ec utils: add fragment range
This patch introduces a FragmentRange concept which is the minimal interface all
classes representing a fragmented buffer should satisfy.
2018-05-31 10:09:01 +01:00
Vladimir Krivopalov
0886c189bf tests: Fix test files to use correct previous row sizes.
Since sstabledump and Cassandra do not use row size values, the new
files have been validated to be identical to files generated by
Cassandra with the same data inserted at the same timestamps.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 18:18:35 -07:00
Vladimir Krivopalov
2d86fcc8ab sstables: Fix calculation of previous row size for SSTables 3.x
The previous code incorrectly calculated sizes of previous rows while
writing SSTables in 3.x ('m') format.
The problem with detecting this issue was that neither sstabledump nor
Cassandra 3.x itself uses those values; as of today, they are simply
ignored when data is read from files.

Still, we want to be compatible and write correct values as they may be
of use in the future.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 18:14:12 -07:00
Vladimir Krivopalov
71f7f45d64 sstables: Factor out code building promoted index blocks into separate helpers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-30 12:40:18 -07:00
Nadav Har'El
a1cbeeffcd tests/view_complex_test.cc: fix and enable buggy test
tests/view_complex_test.cc contained an #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it is not: after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.

So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
2018-05-30 15:39:25 +01:00
Avi Kivity
9999e0e6bc Merge "Implement support for static rows in SSTable 3.0" from Piotr
"
Add handling for static rows and tests for it.
"

* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
  sstable_3_x_test: Add test_uncompressed_compound_static_row_read
  sstable_3_x_test: add test_uncompressed_static_row_read
  flat_mutation_reader_assertions: improve static row assertions
  data_consume_rows_context_m: Implement support for static rows
  mp_row_consumer_m: Implement support for static rows
  mp_row_consumer_m: Extract fill_cells
2018-05-30 17:17:17 +03:00
Paweł Dziepak
62d0639fe9 Merge "Avoid reactor stalls in cache with large partitions" from Tomasz
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:

  (1) dropping partition entries from cache or memtables does not defer

  (2) dropping partition versions abandoned by detached snapshots does not defer

  (3) merging of partition versions when snapshots go away does not defer

  (4) cache update from memtable processes partition entries without deferring (#2578)

  (5) partition entries are upgraded to new schema atomically

This series fixes problems (1), (2) and (4), but not (3) and (5).

(1) and (2) are fixed by introducing mutation_cleaner objects, which are
containers for garbage partition versions that delay the actual freeing.
Freeing happens from memory reclaimers and is incremental.

(3) and (5) are not solved yet.

(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of partition. In order to preserve update
atomicity on partition level as perceived by reads, when update starts we
create a snapshot to the current version of partition and process memtable
entry by inserting data into a separate partition version. This way if upgrade
defers in the middle of partition reads can still go to the old version and
not see partial writes. Snapshots are marked with phase numbers, and reads
will use the previous phase until whole partition is upgraded. When partition
is finally merged, the snapshots go away and the new version will eventually
be merged to the old version. Due to (3) however, this merging may still add
latency to the upgrade path.

Remaining work:

  - Solving problem (3). I think the approach to take here would be to
    move the task of merging versions to the background, maybe into mutation_cleaner.

  - Merging range tombstones incrementally.

Performance
===========

Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures time it takes to update cache from memtable for various workloads
and schemas.

For large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of the remaining latency is
problem (3) stated above. The run time is reduced by 70%.

For small partition case without clustering columns we see no degradation.

For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.

For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.

Below you can see full statistics for cache update run time:

=== Small partitions, no overwrites:

Before:

  avg = 433.965155
  stdev = 35.958024
  min = 340.093201
  max = 468.564514

After:

  avg = 436.929447 (+1%)
  stdev = 37.130237
  min = 349.410339
  max = 489.953400

=== Small partition with a few rows:

Before:

  avg = 315.379316
  stdev = 30.059120
  min = 240.340561
  max = 342.408295

After:

  avg = 407.232691 (+30%)
  stdev = 53.918717
  min = 269.514648
  max = 444.846649

=== Large partition, lots of small rows:

Before:

  avg = 412.870689
  stdev = 227.411317
  min = 286.990631
  max = 1263.417847

After:

  avg = 124.351705 (-70%)
  stdev = 4.705762
  min = 110.063255
  max = 129.643387

=== Large partition, lots of range tombstones:

Before:

  avg = 601.172644
  stdev = 121.376866
  min = 223.502136
  max = 874.111572

After:

  avg = 695.627588 (+15%)
  stdev = 135.057004
  min = 337.173950
  max = 784.838745
"

* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
  mvcc: Use small_vector<> in partition_snapshot_row_cursor
  utils: Extract small_vector.hh
  mvcc: Erase rows gradually in apply_to_incomplete()
  mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
  cache: real_dirty_memory_accounter: Move unpinning out of the hot path
  mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
  mutation_partition: Reduce row lookups in apply_monotonically()
  cache: Release dirty memory with row granularity
  cache: Defer during partition merging
  mvcc: partition_snapshot_row_cursor: Introduce consume_row()
  mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
  mvcc: Make apply_to_incomplete() work with attached versions
  cache: Propagate phase to apply_to_incomplete()
  cache: Prepare for incremental apply_to_incomplete()
  Introduce a coroutine wrapper
  tests: mvcc: Encapsulate memory management details
  tests: cache: Take into account that update() may defer
  cache: real_dirty_memory_accounter: Allow construction without memtable
  cache: Extract real_dirty_memory_accounter
  mvcc: Destroy memtable partition versions gently
  memtable: Destroy partitions incrementally from clear_gently()
  mvcc: Remove rows from tracker gently
  cache: Destroy partition versions incrementally
  Introduce mutation_cleaner
  mvcc: Introduce partition_version_list
  mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
  database: Add API for incremental clearing of partition entries
  cache: Define trivial methods inline
  tests: Improve perf_row_cache_update
  mutation_reader: Make empty mutation source advertise no partitions
2018-05-30 14:12:29 +01:00
Tomasz Grabiec
4561e97efe mvcc: Use small_vector<> in partition_snapshot_row_cursor
I measured 8% improvement in cache update throughput for small
partitions.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
db36ff0643 utils: Extract small_vector.hh 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
5b59df3761 mvcc: Erase rows gradually in apply_to_incomplete()
So that we avoid double-buffering partitions.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
b7fdf4309f mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
8d66f6da58 cache: real_dirty_memory_accounter: Move unpinning out of the hot path
Instead of calling into the real dirty memory manager per row, call it per
deferring point.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
60000b98a4 mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
Leverage the fact that it is called with monotonically increasing
positions, and avoid lookups in case the current target entry is the
successor of desired position. Reduces cache update latency by 40%
for large partition in a time-series workload.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
82e8217ba0 mutation_partition: Reduce row lookups in apply_monotonically()
This change speeds up merging of partition versions with many rows in
case the merged version has many rows which fall between existing rows
in the target version. This is often the case for time-series
workloads, which insert rows at the front. Lookup can be avoided for
all but the first row in the stride because we already have a
reference to the successor in the target tree, we only need to check
that the current entry in the target tree is still the successor.

This change greatly reduces the number of lookups per row during version
merging of large partitions in time-series workloads.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
5bc201df10 cache: Release dirty memory with row granularity 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
70c72773be cache: Defer during partition merging 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
051bb74583 mvcc: partition_snapshot_row_cursor: Introduce consume_row() 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
518fd7083f mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
A version of maybe_refresh() optimized for snapshots which are
no longer populated. Will be used to implement cache update from
memtable.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c653137b2b mvcc: Make apply_to_incomplete() work with attached versions
Needed before making it preemptible. We cannot steal the entry since
we may need to resume merging later.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
1792be3697 cache: Propagate phase to apply_to_incomplete()
It will be needed to create snapshots with appropriate phase markers.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
494cb3f3da cache: Prepare for incremental apply_to_incomplete()
Incremental merging will be implemented by means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
a19c5cbc16 Introduce a coroutine wrapper
Represents a deferring operation which defers cooperatively with the caller.

The operation is started and resumed by calling run(), which returns
with stop_iteration::no whenever the operation defers and is not
completed yet. When the operation is finally complete, run() returns
with stop_iteration::yes.

This allows the caller to:

 1) execute some post-defer and pre-resume actions atomically

 2) have control over when the operation is resumed and in which context,
    in particular the caller can cancel the operation at deferring points.

It will be used to implement deferring partition_version::apply_to_incomplete().
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
6bd1a04c10 tests: mvcc: Encapsulate memory management details
Currently tests have a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself because this requires the region to be unlocked at
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky and would add a lot of noise to tests. This patch will
make things easier by abstracting LSA management, among other things,
inside mvcc_container and mvcc_partition classes.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
f6e21accc7 tests: cache: Take into account that update() may defer
The test incorrectly assumed that once update() is started the
cache will return only versions from last_generation. This will not
hold once we start to defer during partition merging.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c10d9e1607 cache: real_dirty_memory_accounter: Allow construction without memtable 2018-05-30 14:41:40 +02:00
Tomasz Grabiec
6ecda1ccd7 cache: Extract real_dirty_memory_accounter 2018-05-30 14:41:40 +02:00
Tomasz Grabiec
3f19f76c67 mvcc: Destroy memtable partition versions gently
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.

Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by the cache's region, because memtable
versions must be cleared without a cache_tracker.

Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.

Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
c2d702622e memtable: Destroy partitions incrementally from clear_gently()
Destroying large partitions may stall the reactor for a long
time. Avoid this by clearing incrementally.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
81d231f35b mvcc: Remove rows from tracker gently
Some partitions may have a lot of rows. Better to iterate over them
incrementally as part of clear_gently() to avoid stalls.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
f0c1edd672 cache: Destroy partition versions incrementally
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.

Large deletions could happen when a large partition gets invalidated,
upgraded to a new schema, or when it's abandoned by a detached snapshot.

Refs #3289.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
e0803ff71e Introduce mutation_cleaner
Used for collecting unused partition_version objects and freeing them
incrementally. Will be used for both cache and memtables.
2018-05-30 14:41:39 +02:00
Tomasz Grabiec
e5aa02efeb mvcc: Introduce partition_version_list 2018-05-30 12:18:56 +02:00
Tomasz Grabiec
ca1ee93577 mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
We didn't rely on that yet, it seems, but will.

(cherry picked from commit 21a744337de01f699d5c5c340483ad23cabab2ee)
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
40cc766cf2 database: Add API for incremental clearing of partition entries
Partitions can get very large. Destroying them all at once can stall
the reactor for a significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:

  stop_iteration clear_gently() noexcept;

It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:

  return repeat([this] { return clear_gently(); });

The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
2f75212ca4 cache: Define trivial methods inline
They have users in a different compilation unit, in partition_version.cc
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
25b3641d9e tests: Improve perf_row_cache_update
We now test more kinds of workloads:
 - small partitions with no clustering key
 - large partition with lots of small rows
 - large partition with lots of range tombstones

We also collect statistics about scheduling latency induced by cache
update.

Example output:

Small partitions, no overwrites:
update: 356.809113 [ms], stall: {ticks: 396, min: 0.006867 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 257/257 [MB] LSA: 257/257 [MB] std free: 83 [MB]
update: 337.542999 [ms], stall: {ticks: 373, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 514/514 [MB] LSA: 514/514 [MB] std free: 83 [MB]
update: 383.485291 [ms], stall: {ticks: 425, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 771/788 [MB] LSA: 771/788 [MB] std free: 83 [MB]
update: 574.968811 [ms], stall: {ticks: 634, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.629722 [ms], max: 1.955666 [ms]}, cache: 879/917 [MB] LSA: 879/917 [MB] std free: 83 [MB]
update: 411.541138 [ms], stall: {ticks: 455, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.358102 [ms]}, cache: 787/835 [MB] LSA: 787/835 [MB] std free: 83 [MB]
update: 368.491211 [ms], stall: {ticks: 408, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 750/790 [MB] LSA: 750/790 [MB] std free: 83 [MB]
update: 343.671967 [ms], stall: {ticks: 380, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 734/769 [MB] LSA: 734/769 [MB] std free: 83 [MB]
update: 320.277283 [ms], stall: {ticks: 357, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 724/753 [MB] LSA: 724/753 [MB] std free: 83 [MB]
update: 310.583282 [ms], stall: {ticks: 344, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 714/740 [MB] LSA: 714/740 [MB] std free: 83 [MB]
update: 303.627106 [ms], stall: {ticks: 338, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.955666 [ms]}, cache: 707/731 [MB] LSA: 707/731 [MB] std free: 83 [MB]
update: 296.742523 [ms], stall: {ticks: 330, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 701/724 [MB] LSA: 701/724 [MB] std free: 83 [MB]
update: 286.598541 [ms], stall: {ticks: 319, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 697/719 [MB] LSA: 697/719 [MB] std free: 83 [MB]
update: 288.649323 [ms], stall: {ticks: 321, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 694/715 [MB] LSA: 694/715 [MB] std free: 83 [MB]
update: 282.069916 [ms], stall: {ticks: 314, min: 0.001598 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 692/712 [MB] LSA: 692/712 [MB] std free: 83 [MB]
update: 292.462036 [ms], stall: {ticks: 325, min: 0.001917 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 689/708 [MB] LSA: 689/708 [MB] std free: 83 [MB]
update: 274.390442 [ms], stall: {ticks: 305, min: 0.001332 [ms], 50%: 1.131752 [ms], 90%: 1.131752 [ms], 99%: 1.131752 [ms], max: 1.131752 [ms]}, cache: 687/705 [MB] LSA: 687/705 [MB] std free: 83 [MB]
invalidation: 172.617508 [ms]
Large partition, lots of small rows:
update: 262.132721 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 268.650944 [ms]}, cache: 187/188 [MB] LSA: 187/188 [MB] std free: 82 [MB]
update: 281.359467 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 322.381152 [ms]}, cache: 375/376 [MB] LSA: 375/376 [MB] std free: 82 [MB]
update: 287.229065 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 322.381152 [ms]}, cache: 563/564 [MB] LSA: 563/564 [MB] std free: 82 [MB]
update: 1294.816284 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 1386.179840 [ms]}, cache: 586/625 [MB] LSA: 586/625 [MB] std free: 82 [MB]
update: 845.022461 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 962.624896 [ms]}, cache: 439/475 [MB] LSA: 439/475 [MB] std free: 82 [MB]
update: 380.335938 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 386.857376 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 477.234680 [ms], stall: {ticks: 4, min: 0.002760 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 525.955017 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 548.003784 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.006866 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 528.697937 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 609.292603 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.005722 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 575.762451 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 668.489536 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 530.801392 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 535.948364 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 527.143555 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.020501 [ms], 99%: 0.020501 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
update: 521.869202 [ms], stall: {ticks: 4, min: 0.002760 [ms], 50%: 0.004768 [ms], 90%: 0.017084 [ms], 99%: 0.017084 [ms], max: 557.074624 [ms]}, cache: 599/600 [MB] LSA: 599/600 [MB] std free: 82 [MB]
invalidation: 173.069733 [ms]
Large partition, lots of range tombstones:
update: 224.003220 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 268.650944 [ms]}, cache: 52/52 [MB] LSA: 52/52 [MB] std free: 82 [MB]
update: 570.882874 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 105/105 [MB] LSA: 105/105 [MB] std free: 82 [MB]
update: 577.249878 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 158/158 [MB] LSA: 158/158 [MB] std free: 82 [MB]
update: 580.239624 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 211/211 [MB] LSA: 211/211 [MB] std free: 82 [MB]
update: 614.187134 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.004768 [ms], 90%: 0.011864 [ms], 99%: 0.011864 [ms], max: 668.489536 [ms]}, cache: 264/264 [MB] LSA: 264/264 [MB] std free: 82 [MB]
update: 618.709229 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 317/317 [MB] LSA: 317/317 [MB] std free: 82 [MB]
update: 626.943359 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 369/370 [MB] LSA: 369/370 [MB] std free: 82 [MB]
update: 602.873474 [ms], stall: {ticks: 4, min: 0.001917 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 422/423 [MB] LSA: 422/423 [MB] std free: 82 [MB]
update: 617.522583 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 475/475 [MB] LSA: 475/475 [MB] std free: 82 [MB]
update: 627.291138 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.004768 [ms], 90%: 0.011864 [ms], 99%: 0.011864 [ms], max: 668.489536 [ms]}, cache: 528/528 [MB] LSA: 528/528 [MB] std free: 82 [MB]
update: 623.720886 [ms], stall: {ticks: 4, min: 0.001598 [ms], 50%: 0.003973 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 581/581 [MB] LSA: 581/581 [MB] std free: 82 [MB]
update: 630.735596 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 668.489536 [ms]}, cache: 634/634 [MB] LSA: 634/634 [MB] std free: 82 [MB]
update: 2776.525635 [ms], stall: {ticks: 4, min: 0.002300 [ms], 50%: 0.004768 [ms], 90%: 0.014237 [ms], 99%: 0.014237 [ms], max: 2874.382592 [ms]}, cache: 687/687 [MB] LSA: 687/687 [MB] std free: 82 [MB]
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
bb96518cc5 mutation_reader: Make empty mutation source advertise no partitions
So that perf_row_cache_update will always populate cache.
2018-05-30 12:18:56 +02:00
Avi Kivity
dd26cf1490 Merge "db/view: Clarifications to range movement scenarios" from Duarte
"
This series provides reasoning and clarification for the current
structure of mutate_MV(), and how we handle some scenarios related to
range movements.
"

* 'materialized-views/clarifications/v3' of github.com:duarten/scylla:
  db/view: Remove ifdef'd Java code
  db/view: Ignore scenario where base replica hasn't joined the ring
  db/view: Handle case when base has no paired view replica
2018-05-29 18:51:06 +03:00
Avi Kivity
928af7701c Merge "Implement reading clustering columns from SSTables 3.x" from Piotr
"
Add handling for clustering columns and tests for it.
"

* 'haaawk/sstables3/read-ck-v3' of ssh://github.com/scylladb/seastar-dev:
  Add test_uncompressed_compound_ck_read for SSTables 3.x
  Add test_uncompressed_simple_read for SSTables 3.x
  Implement reading clustering key from SSTables 3.x
  column_translation: cache fixed value lengths for ck
  data_consume_rows_context_m: use cached fixed column value lengths
  column_translation: store fixed lengths of column values
  consume_row_start: change type of clustering key
  Rename ROW_BODY state to CLUSTERING_ROW
2018-05-29 18:49:26 +03:00
Piotr Jastrzebski
d2300bc5a9 sstable_3_x_test: Add test_uncompressed_compound_static_row_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:55:36 +02:00
Piotr Jastrzebski
6639ef8769 sstable_3_x_test: add test_uncompressed_static_row_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:55:11 +02:00
Piotr Jastrzebski
18cced2edc flat_mutation_reader_assertions: improve static row assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:52:55 +02:00
Piotr Jastrzebski
6ab660880d data_consume_rows_context_m: Implement support for static rows
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:52:14 +02:00
Piotr Jastrzebski
c9c2fc8e4b mp_row_consumer_m: Implement support for static rows
Add consumer_m::consume_static_row_start

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:50:15 +02:00
Piotr Jastrzebski
f018e5dfed mp_row_consumer_m: Extract fill_cells
This lambda will be used not only for regular columns
but also for static columns.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-29 14:46:02 +02:00
Laura Novich
e053da6f51 scylla_setup: adjust language
Edited the text for the scylla setup, improving readability for the prompts
with regards to grammar and usage.

Signed-off by: Laura Novich <laura@scylladb.com>
Message-Id: <CAGcEH3Xa6TFy=_rdz_=NP0b23vEDZmfRQzAdxV-f04C1p+AzTw@mail.gmail.com>
2018-05-29 09:56:41 +03:00
Piotr Sarna
ffe52681ea storage_proxy: add mv stats to write handler
The previous patch for issue 3416 did not cover passing write stats
to the write response handler, which resulted in some write stats
being incorrectly counted as user write stats even though they belong
to materialized views.
This one fixes that by passing the correct write stats reference
to the write response handler constructor.

Also at: https://github.com/psarna/scylla/commits/fix_3416_again

Closes #3416
Message-Id: <53ef3cc96ccadfdad8992d92ed6a41473419eb0a.1527510473.git.sarna@scylladb.com>
2018-05-28 17:50:49 +01:00
Piotr Jastrzebski
a7a152b27f Add test_uncompressed_compound_ck_read for SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
5c0f9f17ba Add test_uncompressed_simple_read for SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
c89b485871 Implement reading clustering key from SSTables 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
101e38f19b column_translation: cache fixed value lengths for ck
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
b7149d349c data_consume_rows_context_m: use cached fixed column value lengths
Take them from column_translation instead of parsing the type every
time.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
9d41c2299d column_translation: store fixed lengths of column values
We don't need to parse the type every time.
It's better to cache the fixed lengths of column values
for each sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:28:14 +02:00
Piotr Jastrzebski
351c9e5d65 consume_row_start: change type of clustering key
The clustering key in the 3.x format is stored differently,
so it's easier to create a vector of temporary buffers
instead of a single block of concatenated bytes.

Each temporary buffer stores a value of a single
clustering column.

This is because the way clustering key is stored on disk
in SSTables 3.x is not the same as the way we store it
internally.

This means that we have to first read a value of every
clustering column into temporary_buffer and only then
we can create clustering key using a vector of those
temporary buffers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-28 18:27:56 +02:00
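The scheme described in the commit above — reading each clustering column into its own temporary buffer and only then assembling the internal key — can be sketched as follows. This is a hypothetical Python illustration, not the actual C++ code; the 1-byte length prefix stands in for the real SSTables 3.x on-disk encoding, and the function names are invented.

```python
import io
import struct

def read_column_values(stream, n_columns):
    """Read each clustering column value into its own temporary buffer.

    For illustration we assume a 1-byte length prefix per value; the real
    SSTables 3.x layout uses vints and fixed-length shortcuts.
    """
    buffers = []
    for _ in range(n_columns):
        (length,) = struct.unpack("B", stream.read(1))
        buffers.append(stream.read(length))
    return buffers

def make_clustering_key(buffers):
    """Assemble the internal key representation from the buffers.

    Internally the key is a single compound value, so it can only be
    built after every column value has been read.
    """
    return tuple(buffers)

raw = io.BytesIO(b"\x03foo\x02hi")
key = make_clustering_key(read_column_values(raw, 2))
print(key)  # (b'foo', b'hi')
```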
Amnon Heiman
1f28e97458 sstable: Add has_partition_key method
This patch adds a helper function to sstable to check if it has a given
partition key.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-05-28 18:12:17 +03:00
Amnon Heiman
cd1f4ccb89 keys_test: add a test for nodetool_style string
This patch adds a test for single and compound partition keys that are
created from a nodetool-style string.
2018-05-28 18:12:12 +03:00
Amnon Heiman
c517ee8353 keys: Add from_nodetool_style_string factory method
Based on:
8daaf9833a

This patch adds a from_nodetool_style_string factory method to partition_key.
The string format follows the nodetool format, in which the columns of the
partition key are separated by ':'.
For example, if a partition key has two columns, col1 and col2, the
partition key with col1 = val1 and col2 = val2 is written as:

val1:val2
2018-05-28 18:09:51 +03:00
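The format described above can be illustrated with a small sketch. This is hypothetical Python, not the actual factory method, which lives in C++ and also has to convert each component to the corresponding column's type.

```python
def partition_key_from_nodetool_style_string(s, n_columns):
    """Split a nodetool-style key string into its column values.

    'val1:val2' with two key columns yields ['val1', 'val2'].
    """
    parts = s.split(":")
    if len(parts) != n_columns:
        raise ValueError(f"expected {n_columns} components, got {len(parts)}")
    return parts

print(partition_key_from_nodetool_style_string("val1:val2", 2))  # ['val1', 'val2']
```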
Tomasz Grabiec
aefb5e0fbd Merge "Get rid of cql_statement::execute_internal" from Avi
execute_internal() duplicates several code paths, especially in
the select path, for no good reason.  It boils down to timeout and
consistency level selection which can be done based on
client_state::is_internal().

This patchset eliminates the duplication and execute_internal(),
simplifying the code.

* github.com:avikivity/scylla cql-no-execute_internal/v2:
  cql: schema_altering_statement: make execute() and execute_internal()
    equivalent
  cql: select_statement: make execute() and execute_internal()
    equivalent
  cql: query_processor: don't call cql_statement::execute_internal() any
    more
  cql: cql_statement: remove execute_internal()
2018-05-28 13:01:43 +02:00
Avi Kivity
8033785b36 Update scylla-ami submodule
* dist/ami/files/scylla-ami 025644d...1f5329f (1):
  > scylla_install_ami: Update CentOS to latest version
2018-05-28 13:59:57 +03:00
Avi Kivity
ff3e86888a tests: report tests as they are completed
As each test completes, report it. This prevents a long-running
test at the beginning of the list from stalling output.
Message-Id: <20180526173517.23078-1-avi@scylladb.com>
2018-05-28 13:58:01 +03:00
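The idea in the commit above — print each result as soon as its test finishes, rather than in submission order — can be sketched with the standard library. This is a generic Python illustration, not the project's actual test runner; the test names and durations are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def run_test(name, duration):
    # Stand-in for actually executing a test binary.
    time.sleep(duration)
    return name, "PASS"

tests = [("slow_test", 0.2), ("fast_test", 0.01), ("medium_test", 0.1)]

with ThreadPoolExecutor() as pool:
    futures = {pool.submit(run_test, n, d): n for n, d in tests}
    # as_completed yields futures in finish order, so a slow test at the
    # head of the list no longer stalls the output of the others.
    for fut in as_completed(futures):
        name, result = fut.result()
        print(f"{name}: {result}")
```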
Avi Kivity
3a4d11d374 Merge "Introduce frozen_mutation_fragment" from Paweł
"
This series introduces frozen_mutation_fragment which can be used to
send mutation_fragments over the wire to a remote node. The main
intended user is going to be the new streaming implementation.

The first part of the series fixes some IDL issues related to empty
structures and variant being the first member of a structure. Both these
problems make the generated code fail to build and they do not, in any
way, affect the existing on-wire protocol.

Logic responsible for freezing and unfreezing of mutation_fragments is
heavily based on the existing code for freezing mutations and shares the
same drawbacks (for example, unnecessary copy during unfreezing). These
preexisting performance problems can be fixed incrementally.

Another performance problem (which affects frozen_mutations as well, but
to a lesser extent) is that since the batching is done at a different
layer each frozen mutation fragment is a separate bytes_ostream object
owning at least one memory buffer. If the mutation fragments are small
this will cause an excessive number of allocations. This could be solved
either by freezing fragments in batches (though it goes against the RPC
layer doing its own batching) or using bytes_ostream or an equivalent
object with a buffer allocation policy more suitable for such use cases.
This also is something that probably could be an incremental fix.

Tests: unit (release)
"

* tag 'frozen_mutation_fragment/v1-rebased' of https://github.com/pdziepak/scylla:
  idl: add idl description of frozen_mutation_fragments
  tests: add test for frozen_mutation_fragments
  frozen_mutation: introduce frozen_mutation_fragment
  tests/idl: test variant being the first member of a structure
  idl: create variant state in root node
  tests/idl: test serialising and deserialising empty structures
  idl-compiler: avoid unused variable in empty struct deserialisers
  tests/mutation_reader: disambiguate freeze() overload
2018-05-28 13:54:01 +03:00
Takuya ASADA
55d6be9254 Revert "dist/ami: update CentOS base image to latest version"
This reverts commit 69d226625a.
Since ami-4bf3d731 is a Marketplace AMI, it is not possible to publish a public AMI based on it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523112414.27307-1-syuu@scylladb.com>
2018-05-28 13:52:34 +03:00
Duarte Nunes
99d678d079 db/view: Remove ifdef'd Java code
It provides no useful information, so just get rid of it.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:23 +01:00
Duarte Nunes
ad18d535e9 db/view: Ignore scenario where base replica hasn't joined the ring
Apache Cassandra handles a case where the node hasn't joined the ring
and may consequently have an outdated view of it. Following the same
reasoning as with the previous patch, we ignore this scenario. It
happens when there are range movements, and this node is bootstrapping,
but there are already other mechanisms in the cluster, such as hinted
handoff and dual-writing to replicas during range movements, that
contribute to this update eventually making its way to the view.

This patch doesn't change any behavior, but it provides the reasoning
why we won't use the batchlog as Cassandra does, or the hinted handoff
log as we will, to later send the update when the node is joined (note
that Cassandra just sends the mutations "later", and doesn't check
again for any condition or change).

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:23 +01:00
Duarte Nunes
be45e6a1b7 db/view: Handle case when base has no paired view replica
If no view replica is paired with the current base replica, it means
there's a range movement going on (decommission or move), such that
this base replica is gaining new token ranges. The current node is
thus a pending_endpoint from the POV of the coordinator that sent the
request.

Sending view updates to the view replica this base will eventually be
paired with only makes a difference when the base update didn't make
it to the node which is currently being decommissioned or moved-from.

The update will, however, make it to that node if HH is enabled at the
coordinator, before the range movement finishes, or later to this node
when it becomes a natural endpoint for the token.

We still ensure we send to any pending view endpoints though, at least
until we handle that case more optimally.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-28 11:51:18 +01:00
Avi Kivity
b70febe246 cql: cql_statement: remove execute_internal()
With no callers, it can be safely removed.
2018-05-27 12:40:27 +03:00
Avi Kivity
c8a66efb6a cql: query_processor: don't call cql_statement::execute_internal() any more
All cql_statement::execute_internal() overrides now either throw or
call execute().  Since we shouldn't be calling the throwing overrides
internally, we can safely call execute() instead.  This allows us to
get rid of execute_internal().
2018-05-27 12:37:37 +03:00
Avi Kivity
eb19798f99 cql: select_statement: make execute() and execute_internal() equivalent
execute_internal(), for some code paths, differs from execute by the
following:
 1. it uses CL_ONE unconditionally
 2. it has no query timeout
 3. it doesn't use execution stages

for other code paths, it just calls execute.

As preparation for getting rid of execute_internal(), unify the two
code paths.

Commit 4859b759b9 caused the consistency level and timeouts
to be provided by the caller, so using the caller provided parameters
instead of overriding them does not change behavior.
2018-05-27 12:36:02 +03:00
Avi Kivity
d998f06633 cql: schema_altering_statement: make execute() and execute_internal() equivalent
To get rid of execute_internal(), make the normal execute() equivalent and call
it instead of having two different paths.
2018-05-27 11:08:55 +03:00
Duarte Nunes
4859b759b9 Merge 'Make all timeouts explicit' from Avi
"
This patchset makes all users of query_processor specify their timeouts
explicitly, in preparation for the removal of
cql_statement::execute_internal() (whose main function was to override
timeouts).
"

* tag 'cql-explicit-timeouts/v1' of https://github.com/avikivity/scylla:
  query_processor: require clients to specify timeout configuration
  query_processor: un-default consistency level in make_internal_options
2018-05-26 16:10:58 +02:00
Avi Kivity
6e97609049 Merge "Improve support for data types handling in SSTables 3.x" from Vladimir
"
Firstly, this patchset removes the is_fixed_length() function of
abstract_type in favour of value_length_if_fixed().

Secondly, it fixed the byte_type to be compatible with Cassandra which
erroneously treats it as a variable-length data type.

Lastly, it adds a unit test covering all non-composite CQL data types
for writing.

Tests: unit {release}
"

* 'projects/sstables-30/different-data-types/v1' of https://github.com/argenet/scylla:
  tests: Add a unit test for writing different data types to SSTables 3.x format.
  types: Treat byte_type as a variable-length type for compatibility reasons.
  types: Remove is_value_fixed() and use value_length_if_fixed() instead.
2018-05-26 10:24:35 +03:00
Vladimir Krivopalov
0951153292 tests: Add a unit test for writing different data types to SSTables 3.x format.
This test covers all non-composite CQL data types.
The resulting files are dumped using sstabledump as follows:

[
  {
    "partition" : {
      "key" : [ "key" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 174,
        "liveness_info" : { "tstamp" : "1525385507816568" },
        "cells" : [
          { "name" : "asciival", "value" : "hello" },
          { "name" : "bigintval", "value" : 9223372036854775807 },
          { "name" : "blobval", "value" : "0x6772656174" },
          { "name" : "boolval", "value" : true },
          { "name" : "dateval", "value" : "2017-05-05" },
          { "name" : "decimalval", "value" : 5.45 },
          { "name" : "doubleval", "value" : 36.6 },
          { "name" : "durationval", "value" : 1h4m48s20ms },
          { "name" : "floatval", "value" : 7.62 },
          { "name" : "inetval", "value" : "192.168.0.110" },
          { "name" : "intval", "value" : -2147483648 },
          { "name" : "smallintval", "value" : 32767 },
          { "name" : "timeuuidval", "value" : "50554d6e-29bb-11e5-b345-feff819cdc9f" },
          { "name" : "timeval", "value" : "19:45:05.090000000" },
          { "name" : "tinyintval", "value" : 127 },
          { "name" : "tsval", "value" : "2015-05-01 09:30:54.234Z" },
          { "name" : "uuidval", "value" : "01234567-0123-0123-0123-0123456789ab" },
          { "name" : "varcharval", "value" : "привет" },
          { "name" : "varintval", "value" : 123 }
        ]
      }
    ]
  }
]

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Vladimir Krivopalov
3981dd6dd6 types: Treat byte_type as a variable-length type for compatibility reasons.
Although values of the byte_type, which corresponds to the CQL TINYINT type,
always occupy only a single byte, Cassandra treats it as a
variable-length type for SSTables 3.0 reading and writing.

While it is clearly a mistake on Cassandra's side, we have to stay
compatible.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Vladimir Krivopalov
24cb062834 types: Remove is_value_fixed() and use value_length_if_fixed() instead.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-25 21:41:23 -07:00
Paweł Dziepak
ed12555192 idl: add idl description of frozen_mutation_fragments 2018-05-25 10:15:10 +01:00
Paweł Dziepak
0bac487426 tests: add test for frozen_mutation_fragments 2018-05-25 10:15:10 +01:00
Paweł Dziepak
aa4e589ace frozen_mutation: introduce frozen_mutation_fragment
This patch introduces IDL definition as well as serialisers and
deserialisers for freezing mutation_fragment so that they can be
transferred between nodes in a cluster.
2018-05-25 10:15:10 +01:00
Paweł Dziepak
b2e9491728 tests/idl: test variant being the first member of a structure 2018-05-25 10:15:10 +01:00
Paweł Dziepak
a5731ded98 idl: create variant state in root node
Each non-final IDL object is preceded by a frame containing its size.
In case of boost::variant there is a frame for the variant itself, an
integer determining the active alternative of the variant and a frame of
that active alternative.

However, if a variant was the first member of a writable stub object, the
IDL would generate code that would not write the frame for the variant.
This is not a very severe issue, since there are no such cases right now:
the C++ type system would not allow such generated code to compile.
2018-05-25 10:15:10 +01:00
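The framing described in the commit above — a size frame for the variant itself, then the index of the active alternative, then the framed alternative — can be sketched as follows. This is hypothetical Python; the real IDL compiler generates C++, and the 4-byte little-endian size field is an assumption made for the illustration.

```python
import struct

def write_frame(payload: bytes) -> bytes:
    # Each non-final object is preceded by a frame containing its size.
    # Here the size is a 4-byte little-endian integer covering the frame
    # header itself plus the payload.
    return struct.pack("<I", 4 + len(payload)) + payload

def write_variant(alternative_index: int, alternative_payload: bytes) -> bytes:
    # The variant gets its own frame, wrapping the index of the active
    # alternative followed by the framed alternative.
    inner = struct.pack("<I", alternative_index) + write_frame(alternative_payload)
    return write_frame(inner)

buf = write_variant(1, b"abc")
# outer size = 4 + (4 + (4 + 3)) = 15
print(struct.unpack_from("<I", buf, 0)[0])  # 15
```

The bug fixed by the commit corresponds to emitting `inner` without the outer `write_frame` when the variant happened to be the first member of a stub.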
Paweł Dziepak
d731cf427d tests/idl: test serialising and deserialising empty structures 2018-05-25 10:15:10 +01:00
Paweł Dziepak
f719516be8 idl-compiler: avoid unused variable in empty struct deserialisers
Deserialisers generated by the IDL compiler first create a substream
covering the deserialised structure and then skip and read the appropriate
members. If there are no members, the substream will be unused, prompting
the compiler to emit a warning.
2018-05-25 10:15:10 +01:00
Paweł Dziepak
fde9e1d55f tests/mutation_reader: disambiguate freeze() overload
freeze() is about to get overloaded so make sure we don't get any
ambiguities.
2018-05-25 10:15:10 +01:00
Duarte Nunes
4db0b4af58 Merge 'secondary index: Fixes for tables with multiple clustering columns' from Nadav
"
This patch series fixes #3405: secondary-index search only provided
correct results in certain cases, where entire partitions or contiguous
partition slices matched the query. When this was not the case, and
individual clustering rows match or do not match the query, the wrong
results were returned.

To fix this bug, we need to fix the two stages of secondary-index search:

1. In the first stage, we read from the index MV a list of row keys
   (i.e., primary keys) matching the query. We can no longer remember
   just the partition keys, and need to keep the list of full primary keys.

2. In the second stage, we have a list of rows (not partitions) and need
   to read their selected contents to return to the user. Since CQL queries
   do not have a syntax to select an arbitrary list of rows, we have to
   add new code to do such a selection.

Because we provide an ad-hoc, inefficient, implementation for the row
selection described in stage 2, these patches leave two paths in the code:
The old path, efficiently selecting entire partitions, and the new path,
selecting individual rows. The old path is still used when it is applicable,
which is when a partition key column or the first clustering key column
is searched.
"

* 'si-fix-v4' of http://github.com/nyh/scylla:
  secondary index: test multiple clustering column
  secondary index: fix wrong results returned in certain cases
  secondary index: method for fetching list of rows from base table
  secondary index: method for fetching list of rows from index
  select_statement.cc: refactor find_index_partition_ranges()
  select_statement.cc: fix variable lifetime errors
2018-05-24 21:36:18 +01:00
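The two stages described in the merge message can be sketched over in-memory tables. This is hypothetical Python; the real code issues CQL reads against the index materialized view and the base table, and the table contents below are invented.

```python
# Stage 1: the index maps an indexed value to full primary keys
# (partition key, clustering key), not just partition keys.
index = {
    "red": [("p1", "c1"), ("p1", "c3"), ("p2", "c2")],
    "blue": [("p1", "c2")],
}

# Base table keyed by full primary key.
base = {
    ("p1", "c1"): {"color": "red", "v": 1},
    ("p1", "c2"): {"color": "blue", "v": 2},
    ("p1", "c3"): {"color": "red", "v": 3},
    ("p2", "c2"): {"color": "red", "v": 4},
}

def search(value):
    # Stage 2: fetch exactly the matching rows, in clustering-row
    # granularity, rather than reading back whole partitions (which is
    # what produced the wrong results in issue #3405).
    return [base[pk] for pk in index.get(value, [])]

print(search("red"))
```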
Nadav Har'El
a6d9ea2fb5 secondary index: test multiple clustering column
This patch adds a test for secondary indexes on a table which has many
columns - two partition key column, two clustering key columns, and two
regular columns. We add a bunch of data in various rows and partitions,
index all columns and search on this data and verify the results.

This test exposed various bugs in secondary index search, including
issue #3405. After we fixed those bugs, the test now passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:56:57 +03:00
Nadav Har'El
1b29dd44f7 secondary index: fix wrong results returned in certain cases
The current secondary-index search code, in
indexed_table_select_statement::do_execute(), begins by fetching a list
of partitions, and then the content of these partitions from the base
table. However, in some cases, when the table has clustering columns and
not searching on the first one of them, doing this work in partition
granularity is wrong, and yields wrong results as demonstrated in
issue #3405.

So in this patch, we recognize the cases where we need to work in
clustering row granularity, and in those cases use the new functions
introduced in the previous patches - find_index_clustering_rows() and
the execute() variant taking a list of primary-keys of rows.

Fixes #3405.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:56:03 +03:00
Nadav Har'El
adf6d742be secondary index: method for fetching list of rows from base table
We add a new variant of select_statement::execute() which allows selecting
an arbitrary list of clustering rows. The existing execute() variant can't
do that - it can only take a list of *partitions*, and read the same
clustering rows from all of them.

The new select variant is not needed for regular CQL queries (which do
not have a syntax allowing reading a list of rows with arbitrary primary
keys), but we will need it for secondary index search, for solving
issue #3405.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:54:36 +03:00
Nadav Har'El
a096a82adc secondary index: method for fetching list of rows from index
We already have a method find_index_partition_ranges(), to fetch a list
of partition keys from the secondary index. However, as we shall see in
the following patches (and see also issue #3405), getting a list of entire
partitions is not always enough - the secondary index actually holds a list
of primary keys, which includes clustering keys, and in some queries we
can't just ignore them.

So this patch provides a new method find_index_clustering_rows(), to
query the secondary index and get a list of matching clustering keys.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:53:29 +03:00
Nadav Har'El
083b2ae573 select_statement.cc: refactor find_index_partition_ranges()
The function find_index_partition_ranges() is used in secondary index
searches for fetching a list of matching partitions. In a following patch,
we want to add a similar function for getting a list of *rows*. To avoid
duplicate code, in this patch we split parts of find_index_partition_ranges()
into two new functions:

1. get_index_schema() returns a pointer to the index view's schema.

2. read_posting_list() reads from this view the posting list (i.e., list
   of keys) for the current searched value.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:50:45 +03:00
Nadav Har'El
7dc9b77682 select_statement.cc: fix variable lifetime errors
do_with() provides code a *reference* to an object which will be kept
alive. It is a mistake to make a copy of this object or of parts of it,
because then the lifetime of this copy will have to be maintained as well.

In particular, it is a mistake to do do_with(..., [] (auto x) { ... }) -
note how "auto x" appears instead of the correct "auto& x". This causes
the object to be copied, and its lifetime not maintained.

This patch fixes several cases where this rule was broken in
select_statement.cc. I could not reproduce actual crashes caused by
these mistakes, but in theory they could have happened.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-05-24 15:46:12 +03:00
Piotr Jastrzebski
3b6e80a180 Rename ROW_BODY state to CLUSTERING_ROW
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-24 12:48:33 +02:00
Avi Kivity
0b8d06ebf9 Merge seastar upstream
* seastar a48fe69...12cffef (5):
  > variant_utils: don't pass variant by rref to boost::apply_visitor
  > Revert "build: fix compilation issues on cmake. missing stdc++-fs"
  > reactor: prevent expected overflow from triggering ubsan warning
  > cmake: Add cmake option to disable testing altogether
  > build: fix compilation issues on cmake. missing stdc++-fs
2018-05-24 12:17:56 +03:00
Avi Kivity
f893dc61f0 Merge "Implement reading columns from SSTable 3 format" from Piotr
"
This patchset implements reading row columns from SSTable 3 format data file.

Tests: units (release)
"

* 'haaawk/sstables3/read-columns-v4' of ssh://github.com/scylladb/seastar-dev: (21 commits)
  Add test for reading column values of different types.
  Support all fixed size column types from SSTable 3.x
  Add abstract_type::value_length_if_fixed
  Add test for simple table with value
  flat_reader_assertions: Add produces_row taking column values
  Implement reading rows and columns in data_consume_rows_context_m
  Introduce column_flags_m
  Add column_translation to data_consume_rows_context_m
  Pass schema to data_consume_context
  Add column_translation.hh
  consumer_m: Add consume methods for consuming rows and columns
  Extract make_atomic_cell from mp_row_consumer_k_l
  Rename NON_STATIC_ROW_* states to ROW_BODY_*
  Add liveness_info and use it in reading sstables
  Add helper methods for parsing simple types.
  Add unfiltered_flags_m::has_all_columns
  data_consume_context: use make_unique instead of new
  Pass serialization_header to data_consume_rows_context*
  Use disk_string_vint_size for bytes_array_vint_size
  Introduce disk_string_vint_size type
  ...
2018-05-24 10:11:25 +03:00
Takuya ASADA
e0d49aae37 dist/debian: fix missing --configfile parameter on pdebuild
We need to specify --configfile on pdebuild too; otherwise we will
always fail to build .deb on a newly created build environment.
The only reason we are still able to build .deb is that we already copied
.pbuilderrc to the home directory on existing build environments.

Fixes #3456

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180523204112.24669-1-syuu@scylladb.com>
2018-05-24 10:10:27 +03:00
Piotr Jastrzebski
7869bd98b1 Add test for reading column values of different types.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
a572d126e4 Support all fixed size column types from SSTable 3.x
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
7a25819e5a Add abstract_type::value_length_if_fixed
This info is used by SSTable 3.x format to read column values
without reading their lengths.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
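The effect of `value_length_if_fixed` can be sketched as follows: when the type has a fixed length, the value is read directly; otherwise a length must be read first. This is hypothetical Python, with a 1-byte length prefix standing in for the real vint encoding used on disk.

```python
import io
import struct

def read_cell_value(stream, fixed_length=None):
    """Read one column value from the data file.

    fixed_length plays the role of value_length_if_fixed(): when the
    type has a fixed length, no length needs to be read from disk.
    """
    if fixed_length is not None:
        return stream.read(fixed_length)
    (length,) = struct.unpack("B", stream.read(1))
    return stream.read(length)

# An int (fixed, 4 bytes) followed by a variable-length blob.
buf = io.BytesIO(struct.pack(">i", 42) + b"\x05hello")
print(read_cell_value(buf, fixed_length=4))  # b'\x00\x00\x00*'
print(read_cell_value(buf))                  # b'hello'
```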
Piotr Jastrzebski
f58f10d708 Add test for simple table with value
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
0a5d06b2f3 flat_reader_assertions: Add produces_row taking column values
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
9348006092 Implement reading rows and columns in data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
f6e1c38486 Introduce column_flags_m
This will be used for reading columns from data file.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
609854e21a Add column_translation to data_consume_rows_context_m
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
7fd222e639 Pass schema to data_consume_context
It will be needed to obtain column_translation that will
be added to data_consume_context in the next patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
d3f3cd36dd Add column_translation.hh
It contains a class that manages mapping between sstable
columns and schema column definitions.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
25b8cf9e4c consumer_m: Add consume methods for consuming rows and columns
Also implement them in mp_row_consumer_m.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:53:29 +02:00
Piotr Jastrzebski
94e3138dc5 Extract make_atomic_cell from mp_row_consumer_k_l
It will be used in both mp_row_consumer_k_l and
mp_row_consumer_m.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
c6d5ebc274 Rename NON_STATIC_ROW_* states to ROW_BODY_*
New name describes the states in a better way as those states
will be used both for static and non-static rows.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
10c669d2b5 Add liveness_info and use it in reading sstables
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
b2f9841dd4 Add helper methods for parsing simple types.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
d8cd8e04ed Add unfiltered_flags_m::has_all_columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
51d079e17c data_consume_context: use make_unique instead of new
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
54ef775501 Pass serialization_header to data_consume_rows_context*
This header is needed to parse data in the SSTable 3.0 format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
b849eefc8c Use disk_string_vint_size for bytes_array_vint_size
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
76f0f2693d Introduce disk_string_vint_size type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:30:03 +02:00
Piotr Jastrzebski
5ca4bfd69a disk_array_vint_size: Remove unused Size template parameter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:15:44 +02:00
Duarte Nunes
4eb47d136b Merge 'Introduce authorized_prepared_statements_cache' from Vlad
"
This series introduces a cache of already authenticated prepared statements which
is meant to optimize the prepared statement lookup when authentication is enabled.

This cache allows performing a single cache lookup per EXECUTE operation, as opposed
to at least two lookups: one in the prepared statements cache and one in the
authentication cache.

Tests:
   - cql_query_test {debug, release}.
   - cassandra-stress with authentication enabled and with short eviction timeout.
   - Manual (with printouts) checks:
      - Tested the eviction due to eviction in the prepared_statements_cache:
         - Artificially decreased the prepared_statements_cache size and ran c-s with different keyspaces.
         - Verified that the corresponding authorized_prepared_statements_cache entry is evicted and re-populated.
      - Tested the BATCH of prepared statements (with dtest infrastructure):
         - Verified that for each prepared statement authorized_prepared_statements_cache is updated only once:
            - The batch contained a few entries of the same prepared statement.
"

* 'authorized_prepared_statements_cache-v3' of https://github.com/vladzcloudius/scylla:
  cql3: use authorized_prepared_statements_cache in the BATCH processing
  cql3::statements::batch_statement: introduce a single_statement class
  cql3: introduce the authorized_prepared_statements_cache class
  loading_shared_values: introduce the templated find() overload
  tests: loading_cache_test: add a tests for a loading_cache::remove(key)/remove(iterator)
  utils::loading_cache: add remove(key)/remove(iterator) methods
  cql3::query_processor: properly stop() prepared_statements_cache object
2018-05-23 14:40:09 +01:00
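The single-lookup idea in the merge above can be sketched roughly as follows (hypothetical names; a plain std::unordered_map stands in for Scylla's actual loading_cache machinery): key the cache by (user, statement key) so an EXECUTE needs one lookup instead of one in the prepared-statements cache plus one in the permissions cache.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Composite key: (authenticated user, prepared-statement cache key).
struct auth_key {
    std::string user;
    std::string stmt_key;
    bool operator==(const auth_key& o) const {
        return user == o.user && stmt_key == o.stmt_key;
    }
};

struct auth_key_hash {
    std::size_t operator()(const auth_key& k) const {
        // Combine the two sub-hashes; real code would use a better mix.
        return std::hash<std::string>{}(k.user) ^
               (std::hash<std::string>{}(k.stmt_key) << 1);
    }
};

// Maps (user, statement) to an already-authorized statement handle
// (an int stands in for the real checked weak pointer).
using authorized_cache = std::unordered_map<auth_key, int, auth_key_hash>;
```

A hit in this single map means both the statement and the authorization are already known for that user.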
Avi Kivity
3dd2f68712 dist: drop libunwind dependency
Since Seastar no longer (1f005fb434) requires libunwind, we can
drop it from our dependency list.  This helps the power build, for
which no libunwind is available.

Fixes #3453.
Message-Id: <20180523114750.10753-1-avi@scylladb.com>
2018-05-23 13:53:29 +02:00
Avi Kivity
1f005fb434 Merge seastar upstream
* seastar 5da5d4e...a48fe69 (1):
  > backtrace: drop libwind in favor of libc backtrace()
2018-05-23 14:42:14 +03:00
Duarte Nunes
eed09dfdf9 mutation_partition: Throw std::out_of_range with backtrace on cell_at
Makes it easier to investigate bugs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180521133753.16375-1-duarte@scylladb.com>
2018-05-23 13:51:54 +03:00
Avi Kivity
701e6f2cff Merge "Implement backlog controller for TWCS" from Glauber
"
This series implements the backlog tracker for TWCS, allowing it to
be controlled. The backlog for a TWCS colum family is just the sum of
the SizeTiered backlogs for all the windows that we know about.

A possible optimization for this is to stop tracking windows after
they become old enough and revert to zero backlog. I reverted that
last minute, though, since this will probably cause the backlog to
completely misrepresent reality if we import SSTables into old buckets
with things like repairs or nodetool refresh.
"

* 'twcs-backlog-v4.1' of github.com:glommer/scylla:
  backlog: implement backlog tracker for the TWCS
  STCS_backlog: allow users to query for the total bytes managed
  backlog: keep track of maximum timestamp in write monitor
  memtable: also keep track of max timestamp
2018-05-23 13:37:49 +03:00
Glauber Costa
44a89d654b backlog: implement backlog tracker for the TWCS
The TWCS backlog is relatively simple: we just need to keep track of
which SSTable belong to which time window (and actually as usual,
just their sizes). That is an easy thing to do since we can statically
calculate the time bound from the timestamp.

Once we do that we can just sum the backlogs for each individual window.
Time windows that are well enough into the past can be at some point
discarded when their backlogs become zero.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-23 06:20:21 -04:00
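The aggregation step described above can be sketched as follows (assumed names; raw byte counts stand in for the real per-window SizeTiered backlogs): bucket each sstable by its statically computed time window and sum the per-window backlogs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using window_id = int64_t;

// Static mapping from a timestamp to its time window.
window_id window_for(int64_t timestamp, int64_t window_size) {
    return timestamp / window_size;
}

struct twcs_backlog_tracker {
    int64_t window_size;
    std::map<window_id, uint64_t> per_window_bytes;

    void add_sstable(int64_t max_timestamp, uint64_t bytes) {
        per_window_bytes[window_for(max_timestamp, window_size)] += bytes;
    }

    // The TWCS backlog is just the sum of the per-window backlogs.
    uint64_t backlog() const {
        uint64_t total = 0;
        for (const auto& kv : per_window_bytes) {
            total += kv.second;
        }
        return total;
    }
};
```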
Nadav Har'El
433fc6c36e keys.hh: simplify empty clustering-key check
The exploded_clustering_prefix type has a convenient is_empty() method
and an even more convenient "operator bool" shortcut. Unfortunately,
the other clustering prefix types (clustering_key_prefix,
clustering_key_prefix_view) have, for historic reasons, an is_empty
method which takes a schema parameter. That also means they can't
have an "operator bool" shortcut.

But checking if a prefix is empty doesn't really need the schema - all we need to
check is whether the byte representation is empty. The result is simpler
and more efficient code, and easier to use. It is also more consistent -
all clustering-key-related types will have an "operator bool" instead of
just some of them.

To avoid massive code changes, we leave an is_empty(schema) variant, which
simply calls is_empty(). There's already precedent for that - various
methods which have a variant taking schema (and ignoring it) and one
taking nothing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180521174220.13262-1-nyh@scylladb.com>
2018-05-23 11:46:23 +02:00
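The shape of the change can be illustrated with a toy prefix type (assumed names; the real clustering_key_prefix types are far richer): emptiness is just emptiness of the byte representation, so no schema parameter is needed, and an "operator bool" becomes possible.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct schema {};  // stand-in for the real schema class

struct clustering_prefix {
    std::vector<uint8_t> bytes;  // serialized representation

    bool is_empty() const { return bytes.empty(); }
    explicit operator bool() const { return !is_empty(); }

    // Compatibility variant: takes a schema but ignores it, so existing
    // call sites keep compiling.
    bool is_empty(const schema&) const { return is_empty(); }
};
```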
Takuya ASADA
300af65555 dist/common/scripts/scylla_setup: abort running script when one of setup failed in silent mode
The current script silently continues even if one of the setup steps fails; it
needs to abort.

Fixes #3433

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180522180355.1648-1-syuu@scylladb.com>
2018-05-23 11:05:33 +03:00
Vlad Zolotarov
82f7d1d006 cql3: use authorized_prepared_statements_cache in the BATCH processing
Like with the EXECUTE command avoid authorizing the same prepared
statement twice - this time in the context of processing the BATCH
command.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
Vlad Zolotarov
9723988926 cql3::statements::batch_statement: introduce a single_statement class
This is a helper class needed to control the handling process of a single
statement in the current batch. In particular it has the boolean defining
if the authorization is needed for this statement.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
Vlad Zolotarov
a138c59991 cql3: introduce the authorized_prepared_statements_cache class
Add a cache that stores a checked weak pointer to already authorized prepared statements,
whose key is a tuple of an authenticated_user and the key of the prepared_statements_cache.

The entries will be held as long as the corresponding prepared statement is valid (cached)
and will be discarded with the period equal to the refresh period of the permissions cache.

Entries are also going to be discarded after 60 minutes if not used.

The purpose of this new cache is to save the lookup in the permissions cache for an already
authenticated resource (whatever needs to be authenticated for the particular prepared statement).

This is meant to improve the cache coherency as well (since we are going to look in a single cache
instead of two).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:03 -04:00
Vlad Zolotarov
3114cef42c loading_shared_values: introduce the templated find() overload
This overload allows searching the elements by an arbitrary key, as long as it is "hashable"
to the same values as the default key and there is a comparator for the
new key.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:15:00 -04:00
Vlad Zolotarov
ab251a1fc3 tests: loading_cache_test: add tests for loading_cache::remove(key)/remove(iterator)
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:05:01 -04:00
Vlad Zolotarov
34620deee4 utils::loading_cache: add remove(key)/remove(iterator) methods
remove(key): removes the entry with the given key if it exists, otherwise does nothing.
remove(iterator): removes an entry by a given iterator (returned from loading_cache::find()).

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 20:05:00 -04:00
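The new removal API can be sketched minimally like this (assumed shape; a bare std::unordered_map stands in for the real utils::loading_cache internals):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

template <typename K, typename V>
struct loading_cache {
    std::unordered_map<K, V> entries;
    using iterator = typename std::unordered_map<K, V>::iterator;

    iterator find(const K& k) { return entries.find(k); }

    // remove(key): removes the entry if it exists, otherwise does nothing.
    void remove(const K& k) { entries.erase(k); }

    // remove(iterator): removes the entry a find() result points at.
    void remove(iterator it) { entries.erase(it); }
};
```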
Piotr Sarna
b7ac2da238 main: initialize hints manager unconditionally
This commit makes sure that the hints manager is always initialized,
including creating the hints directories and starting it.
This needs to be fixed because the hints manager is internally used
to store failed materialized view replica updates.

Fixes #3451
Message-Id: <44532fd3704e20cabeb9c4985dace5650fd22d2c.1527018865.git.sarna@scylladb.com>
2018-05-22 22:21:50 +01:00
Duarte Nunes
ed2a1518f8 Merge 'Allow dropping tables with active secondary indexes' from Piotr
"
This series addresses issue #3202 about dropping a table with secondary
indexes present. Previously dropping such tables was impossible due to
materialized view restrictions (which is an implementation detail
of Scylla's secondary indexes).

Implemented:
 * fixing 'DROP KEYSPACE' with active materialized views
 * adapting schema_builder to make it easy to drop indexes
 * dropping all dependent SI before dropping a table
 * a test case for dropping a table with secondary indexes
"

* 'drop_si_before_drop_table_3' of https://github.com/psarna/scylla:
  tests: add test for dropping a table with secondary indexes
  migration_manager: allow dropping table with secondary indexes
  schema: add clearing indexes to schema builder
  database: do not truncate already removed views
2018-05-22 22:20:35 +01:00
Vlad Zolotarov
5bde36f29e cql3::query_processor: properly stop() prepared_statements_cache object
prepared_statements_cache has a timer that evicts old entries - it needs to be properly stopped.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-22 16:33:52 -04:00
Piotr Sarna
76848fb577 tests: add test for dropping a table with secondary indexes
This commit adds a test case for dropping a table with dependent
secondary indexes. Dependent materialized views prohibit the table
from being dropped, but dropping a table with dependent SI is legal.

References #3202
2018-05-22 21:10:51 +02:00
Piotr Sarna
7e4813a466 migration_manager: allow dropping table with secondary indexes
Previously dropping a table with secondary indexes failed, because
SI are internally backed by materialized views.
This commit triggers dropping dependent secondary indexes before
dropping a table.

Fixes #3202
2018-05-22 21:10:51 +02:00
Piotr Sarna
0513dc17a1 schema: add clearing indexes to schema builder
This commit adds 'without_indexes()' method to builder,
used to clear all previous index declarations from schema definition.
2018-05-22 21:10:51 +02:00
Piotr Sarna
f8237dd664 database: do not truncate already removed views
This commit clears table's views before truncating it
in drop_column_family function. The only case when
views are not empty during drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change will prevent trying to truncate views that were
already dropped, which used to result in no_such_column_family error.

References #3202
2018-05-22 21:10:51 +02:00
Duarte Nunes
a3bbd52e2e Merge 'Add materialized view metrics' from Piotr
"
This series introduces materialized view statistics, as stated in issue #3385:
 - updates pushed
 - updates failed
 - row lock stats

It also addresses issue #3416 by decoupling user write stats from view
update stats.
"

* 'materialized_view_metrics_9' of https://github.com/psarna/scylla:
  view: adapt view_stats to act as write stats
  storage_proxy: decouple write_stats from stats
  db: add row locking metrics
  view: add view metrics
2018-05-22 18:41:51 +01:00
Glauber Costa
be39736293 STCS_backlog: allow users to query for the total bytes managed
We would like to know whether there is still backlog at rest in a
particular STCS object. This is useful, for instance, in the TWCS
backlog, that uses STCS so it can delete old windows that are no longer
used.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 13:40:15 -04:00
Glauber Costa
b573a2ff61 backlog: keep track of maximum timestamp in write monitor
For sealed SSTables we can get the maximum timestamp from the statistics
component.  But for partially written SSTables, the metadata is not yet
available.

One way to solve this would be to make the SSTable statistics available
earlier. But we would end up with a maximum timestamp that potentially
changes all the time as we write more cells.

A better approach is to take note of what's the maximum timestamp in a
memtable before we start to flush, and when time comes for us to flush
we will use the progress manager to inform the consumers about the
maximum timestamp.

For SSTables being compacted, we can't know for sure what is the maximum
timestamp as some entries could be TTLd already. But the maximum of all
SSTables present in the compaction is a good enough estimation for this
purpose.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 12:55:58 -04:00
Glauber Costa
68d1c64e7a memtable: also keep track of max timestamp
We are now keeping track of the minimum timestamp in a memtable. Also
keep track of the max timestamp so we can know what it is before we
finish flushing the entire memtable to an SSTable. Will be used by
partially written SSTables undergoing TWCS.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 12:55:58 -04:00
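The bookkeeping described in the two commits above can be sketched like this (assumed names): track both bounds as writes are applied, so the maximum timestamp is known before the memtable finishes flushing to an sstable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

struct timestamp_tracker {
    int64_t min_ts = std::numeric_limits<int64_t>::max();
    int64_t max_ts = std::numeric_limits<int64_t>::min();

    void update(int64_t write_timestamp) {
        min_ts = std::min(min_ts, write_timestamp);
        max_ts = std::max(max_ts, write_timestamp);
    }
};
```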
Avi Kivity
49892a06b9 Merge "exception safety and minimum work for compaction controller" from Glauber
"
This was sent before as two separate patchsets. It is now unified
because it has a lot of common infrastructure.

In this patchset I am aiming at two goals:

1) Provide a minimum amount of shares for user-initiated operations like
nodetool compact and nodetool cleanup

2) Be more robust with exceptions in the backlog tracker

For the first, the main difference is that I now made the compaction
controller a part of the compaction manager. It then becomes easy to
consult with the compaction controller for the correct amount of shares
those operations should have.

In compaction_strategy.cc, the major_compaction_strategy object was
actually already unused before. So instead of making use of it, which
would require some form of information flow downwards about the backlog
we need to export, I am creating a user-initiated backlog type inside
the compaction manager.

With the two changes described above everything is very well
self-contained within the compaction manager and the implementation
becomes trivial.

For the second, I am now handling exceptions in two places:

1) the backlog computation. Those are const functions so if we just have
a transient exception when compacting the backlog, all we need to do is
return some fixed amount of shares and try again in the next adjustment
window.

2) the process of adding / removing SSTables. Those are harder, since if
we fail to manipulate the list we'll be left in an inconsistent state.
The best approach is then to disable the backlog tracker and return a
fixed amount of shares globally.

Tests: unit (release)
"

* 'backlog-improvements-v3' of github.com:glommer/scylla:
  compaction_manager: disable backlog tracker if we see an exception
  backlog tracker: protect against exceptions in backlog calculation.
  STCS_backlog: protect against negative backlog
  STCS_backlog: remove unused attribute
  compaction strategy: move size tiered backlog to a header
  compaction_strategy: delete major_compaction_strategy class
  compaction: make sure that user-initiated compactions always have a minimum priority
  backlog_controller: add constants to represent a globally disabled controller
  backlog_controller: move compaction controller to the compaction manager
  backlog_controller: allow users to compute inverse function of shares
2018-05-22 18:35:42 +03:00
Piotr Sarna
3792bed3ed view: adapt view_stats to act as write stats
This commit adapts view_stats structure so it can be passed
to storage_proxy as write stats. Thanks to that, mv replica updates
will not interfere with user write metrics. As a side effect it also
provides more stats to replica view updates.

Closes #3385
Closes #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
1d590b3ca4 storage_proxy: decouple write_stats from stats
This commit extracts metrics related to writes from stats structure,
so it can be easily replaced later, e.g. for materialized view metrics.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
9246bb36bc db: add row locking metrics
This commit adds statistics to row_locker class. Metrics are
independently counted for all lock types: row<->partition and
exclusive<->shared.

Metrics gathered:
 - total acquisitions
 - operations that wait on the lock
 - histogram of the time spent on waiting on this type of lock

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
49bebcfa25 view: add view metrics
This commit introduces view statistics:
 - updates pushed to local/remote replicas
 - updates failed to be pushed to local/remote replicas

Metrics are kept on per-table basis, i.e. updates_pushed_remote
shows the number of total updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Tomasz Grabiec
e554a39fbb tests: memtable_snapshot_source: Fix compact()
Compactor collects all currently active memtables and later replaces
them with the merged result. The problem is that the active memtable
belongs to the input set during compaction, and as a result mutations
applied concurrently with compaction could be lost once compaction
replaces the memtables. The fix is to open a new active memtable when
compaction starts.

Caused sporadic failures of row_cache_test.cc:test_continuity_is_populated_when_read_overlaps_with_older_version()
Message-Id: <1526997724-13037-1-git-send-email-tgrabiec@scylladb.com>
2018-05-22 15:08:07 +01:00
Glauber Costa
d4e7783188 compaction_manager: disable backlog tracker if we see an exception
If we see an exception when adding or removing SSTables from the backlog
tracker, the backlog tracker can be inconsistent forever. It would be
best if we act before that happens and disable the backlog tracker. Once
the backlog tracker is disabled it will default to returning a fixed
number of shares.

We can either disable the backlog tracker or remove it. But if we remove
it we can end up with a backlog of zero if that's the only tracker with
a backlog. We then keep it registered but mark it as disabled. This also
leaves room for recovery in some situations: we can recover the backlog
by doing a schema change in the column family that had the backlog
disabled, for instance.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:32 -04:00
Glauber Costa
fde26ec633 backlog tracker: protect against exceptions in backlog calculation.
Backlog calculations should be exception free, but there are cases in
which I can see them happening. One example is if some backlog tracker
that uses temporary objects fails an allocation.

Memory shortages can be especially pernicious: if we leave the
responsibility of catching those to the individual backlog tracker, we
will keep trying to make more allocations in the other backlog trackers
if we have many column families. By handling it here we can stop that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
3e08bd17f0 STCS_backlog: protect against negative backlog
A negative backlog can be interpreted as a very large backlog.
Part of that is because we keep the total_size as an unsigned type,
which is what we expect. But if there is an issue, like an
exception that causes some SSTable not to be tracked, then this size
can become negative. Returning a zero backlog is better than allowing
it to be interpreted as a giant number.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
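The guard described above can be sketched as follows (assumed signature): sizes are unsigned, so if an untracked sstable makes the "compacted" figure exceed the total, subtracting would wrap around to a huge value; return a zero backlog instead.

```cpp
#include <cassert>
#include <cstdint>

double stcs_backlog(uint64_t total_bytes, uint64_t compacted_bytes) {
    if (compacted_bytes >= total_bytes) {
        // Better than letting unsigned wrap-around be read as a
        // giant backlog.
        return 0.0;
    }
    return static_cast<double>(total_bytes - compacted_bytes);
}
```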
Glauber Costa
4b4e9f6c8c STCS_backlog: remove unused attribute
This attribute ended up being unused in the final version.
Spotted now while reading the code for other purposes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
10046593be compaction strategy: move size tiered backlog to a header
It's very common for other strategies to include a SizeTiered
step somewhere inside their algorithms: LCS will do SizeTiered on
L0, TWCS will do SizeTiered within a window, etc.

To make it easier for those strategies to consume the SizeTiered
backlog tracker, we will move that to its own file.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:36:22 -04:00
Glauber Costa
36ccb1dd7c compaction_strategy: delete major_compaction_strategy class
It was already unused before this series. In an earlier version I have
used it to provide an ad-hoc backlog for major compactions. But now that
this is done by the compaction manager, this class really isn't being
used.

And it is likely it won't be: major compaction is not a compaction
strategy a user can choose, unlike the others that need to be built
through make_compaction_strategy.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:33:59 -04:00
Glauber Costa
9320d6f17f compaction: make sure that user-initiated compactions always have a minimum priority
We have observed the following behavior with user initiated compactions,
like major compactions:

- if there are no writes, the backlog doesn't increase.
- as compaction progresses the backlog decreases.
- at some point, the backlog is so low that compaction barely makes any
  progress.

Going forward, we should allow one to read from the generated partial
SSTables, in which case this doesn't matter that much. But for
user-initiated compactions we would like to guarantee a minimum baseline.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:33:25 -04:00
Glauber Costa
c55ab93178 backlog_controller: add constants to represent a globally disabled controller
There are situations in which we want the controllers to stop working
altogether. Usually that's when we have an unimplemented controller or
some exception.

We want to return fixed shares in this case, but this is a very
different situation from when we want fixed shares for *one* backlog
tracker: we want to return fixed shares, yes, but if we disable 200
backlog trackers (because they all failed, for instance), we don't want
that fixed number x 200 to be our backlog.

So the mechanism to globally disable the controller is still granted,
and infinity is a good way to represent that. It's a float that the
controller can easily test against. But actually using infinity in the
code is confusing. People reading it may interpret it the other way
around from what it means, as just "a very large backlog".

Let's turn that into a constant instead. It will help us convey meaning.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:25:23 -04:00
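The constant described above can be sketched like this (assumed name): give the "controller globally disabled" sentinel a name, so a reader does not mistake a bare infinity literal for "a very large backlog".

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Named sentinel: this backlog value means "controller disabled",
// not "huge backlog".
constexpr double disable_backlog = std::numeric_limits<double>::infinity();

inline bool controller_disabled(double backlog) {
    return std::isinf(backlog);
}
```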
Glauber Costa
d758a416f8 backlog_controller: move compaction controller to the compaction manager
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager, since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.

Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.

Once we do that, all the aforementioned problems go away. So let's move
there where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:24:19 -04:00
Calle Wilund
62c3b4c429 commitlog: Ensure file objects are closed before object free
Fixes #3446

Previously, only shutdown-synced objects were actually closed,
which is wrong.

This introduces yet another queue, processed together with the
deletion objects, which ensures we explicitly close all objects
that have been discarded.

Message-Id: <20180521140456.32100-1-calle@scylladb.com>
2018-05-22 14:52:06 +03:00
Duarte Nunes
4b2fd8d6f2 Merge 'Use hinted handoff to replay missed updates from base to view' from Piotr
"This series leverages hinted handoff for failed view replica
updates."

* 'materialized_view_updates_with_hh_5' of https://github.com/psarna/scylla:
  storage_proxy: enable hinted handoff for materialized views
  storage_proxy: make view updates use consistency_level::ANY
2018-05-22 11:24:37 +01:00
Paweł Dziepak
05c94bc98d mutation_partition: do not dereference null in find_cell()
row::find_cell() may be called for cells that do not exist in that row.
In such case nullptr shall be returned, this patch makes sure that
it is not dereferenced.
Message-Id: <20180522091726.24396-1-pdziepak@scylladb.com>
2018-05-22 10:31:09 +01:00
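The guard described above can be sketched with a simplified row type (assumed shape; int stands in for the real cell type): find_cell() may legitimately return nullptr, so callers must check before dereferencing instead of crashing on a missing cell.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>

struct row {
    std::map<uint32_t, int> cells;

    const int* find_cell(uint32_t id) const {
        auto it = cells.find(id);
        return it == cells.end() ? nullptr : &it->second;
    }

    const int& cell_at(uint32_t id) const {
        if (const int* c = find_cell(id)) {
            return *c;
        }
        // Throw instead of dereferencing a null pointer.
        throw std::out_of_range("no such cell");
    }
};
```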
Glauber Costa
d3f985ef46 backlog_controller: allow users to compute inverse function of shares
There are some situations in which we want to force a specific amount of
shares and don't have a backlog. We can provide a function to get that
from the controller.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-21 19:35:07 -04:00
Avi Kivity
51f5599c75 Merge seastar upstream
* seastar a6cb005...5da5d4e (6):
  > append_challenged_posix_file_impl: Ensure continuation uses non-stale object
  > utils: make make_visitor() public
  > tcp: Adjust receive window
  > tcp: Fix allowed sending size calculation in can_send
  > tcp: Fix assert in tcp::tcb::output_one
  > be more descriptive with failed syscalls for filesystem operations

Contains alternative fix for #3446 (will also be fixed directly).
2018-05-21 20:35:30 +03:00
Piotr Sarna
f5d6326ced storage_proxy: enable hinted handoff for materialized views
This commit initializes and enables hinted handoff for materialized
views, even if HH is not explicitly turned on in config.

User writes still use hinted handoff only if it is explicitly enabled,
while materialized views are allowed to use it unconditionally
in order to store failed replica updates somewhere.

Fixes #3383
2018-05-21 17:09:27 +02:00
Piotr Sarna
da0d458f5f storage_proxy: make view updates use consistency_level::ANY
This commit makes view replica updates internally use consistency
level ANY, so in case an update fails it will fall back to hinted
handoff.

References #3383
2018-05-21 17:09:27 +02:00
Piotr Sarna
ba9e8a4f2c tests: initialize hints directory for cql env
This commit initializes hints_directory config value for cql_test_env.
It's needed now because materialized views support force-enables
hinted handoff.

Message-Id: <2aadf35eee329c1f89977c4a55660f330bd9d591.1526914827.git.sarna@scylladb.com>
2018-05-21 18:06:01 +03:00
Botond Dénes
204f6fd478 test.py: print test args when listing failed tests
This can be very helpful when a test only fails when run with some
particular arguments.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <dac1f7e23afa904156e65c3bb3c8fd52b7e999ff.1526906955.git.bdenes@scylladb.com>
2018-05-21 17:28:18 +03:00
Avi Kivity
f9c2ff1f9c install: prepare /etc directory
install(1) creates missing directories on recent Fedora, but not
on CentOS 7. This causes the RPM build (which installs to a pristine
tree, without an existing /etc) to fail.

Fix by setting up /etc.

Tests: rpm (Fedora, CentOS)
Message-Id: <20180520124937.20466-1-avi@scylladb.com>
2018-05-21 09:51:46 +02:00
Asias He
db8c3a7059 streaming: Do not use dht::split_ranges_to_shards
There is no need to call dht::split_ranges_to_shards to split the token
range into <shard> : <a lot of small ranges> mapping and create a flat
mutation reader with a lot of small ranges.

Because:

1) The flat mutation reader on each shard only returns data that belongs to
the local shard, so there is no correctness issue if we do not split and
feed only the sub-ranges that belong to the local shard.

2) With murmur3_partitioner_ignore_msb_bits = 12, it is almost certain
that given a token range, all the shards will have data for the range
anyway. Even if we ask all the shards to work on the token range and
some of the shards have no data for it, that is fine. We simply send no
data from those shards.

Tests: update_cluster_layout_tests.py

Message-Id: <ac00cd21d6156c47b74451dd415d627481e48212.1526864222.git.asias@scylladb.com>
2018-05-21 10:42:45 +03:00
Takuya ASADA
5407c34c73 dist/debian: depends to coreutils instead of realpath on Ubuntu 18.04
On Ubuntu 18.04 the realpath package was dropped; it became part of coreutils.

Fixes #3445

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180521031954.30815-1-syuu@scylladb.com>
2018-05-21 10:42:05 +03:00
Asias He
0c54c6e16f storage_service: Add node has left the cluster log
Removing a node from the cluster is a major operation and deserves a log
message. Add a log message when a node is removed from the cluster by
`nodetool decommission` or `nodetool removenode`.

Message-Id: <b6adf34492c8138296911f2b37b39e9dd8ed10a2.1523347916.git.asias@scylladb.com>
2018-05-19 21:47:05 +03:00
Asias He
e20038eb84 streaming: Handle stream_mutation rpc handler on all shards
In streaming, the sender sends the mutations on all the local shards in
parallel, so it is possible that the receiver handles more than one such
connection on the same shard. This is determined by where the tcp
connection goes. The current rpc ignores the destination shard id when
sending the rpc message.

For instance, say node1 has 2 shards, node2 has 2 shards. Currently, we
can end up with like this:

   Node 1 shard 0 -> Node 2 shard 1
   Node 1 shard 1 -> Node 2 shard 1

It is better if we do:

   Node 1 shard 0 -> Node 2 shard 0
   Node 1 shard 1 -> Node 2 shard 1

This patch solves this problem by letting the handler always run on
shard = src_cpu_id % smp::count.

If the sender and receiver have the same shard config, the work is
distributed completely evenly.

If the sender and receiver do not have the same shard config, it is
unavoidable that some shards will do more work than others.

Tests: dtest update_cluster_layout_tests.py

Message-Id: <911827bcf67459a07ec92623a9ed4c4fbba195ca.1524622375.git.asias@scylladb.com>
2018-05-19 21:08:25 +03:00
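The mapping described above is a one-liner (assumed parameter names): the receiver handles a streaming connection on shard src_cpu_id % smp::count, so equal shard counts pair shards one-to-one.

```cpp
#include <cassert>
#include <cstdint>

// Pick the handling shard from the sender's shard id and the local
// shard count.
uint32_t handler_shard(uint32_t src_cpu_id, uint32_t smp_count) {
    return src_cpu_id % smp_count;
}
```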
Calle Wilund
f69a52c475 storage_service: Add more error info to "isolate_on_error" shutdown
Fixes #2793

Prints error handle class (commitlog or "other/disk") + exception
type and message. While not exhaustive, at least gives a correlation
point to (hopefully) other log printouts.

Message-Id: <20180509081040.7676-1-calle@scylladb.com>
2018-05-19 21:06:03 +03:00
Piotr Jastrzebski
1520ffe7f5 sstables: check buffer size when reading vints
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ecbedae818fbef1f67a4472aba4ce443b9df0ee.1525888830.git.piotr@scylladb.com>
2018-05-19 21:01:45 +03:00
Avi Kivity
46a0109608 Merge "Support compression when writing SSTables 3.x." from Vladimir
"
For compression, SSTables 3.x format uses CRC32 for checksumming
compressed chunks as well as for calculating the full file checksum.
Also, while for older formats "full checksum" of a compressed data file
means a combination of checksums of its compressed chunks, in SSTables
3.x this now reads literally and assumes the checksum of all bytes
written, including per-chunk digests.

Tests: unit {debug, release}
"

* 'projects/sstables-30/write-compression/v3' of https://github.com/argenet/scylla:
  tests: Add unit tests for writing compressed SSTables 3.x.
  tests: Validate Digest32.crc for SSTables 3.x write tests.
  tests: Fix invalid Digest file for write_counter_table test.
  sstables: Support writing compressed SSTables 3.0.
  sstables: Make compressed streams customizable on checksumming.
  sstables: Move checksum calculation logic to compressed_output_stream.
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
d588a7e743 tests: Add unit tests for writing compressed SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
e5ab271863 tests: Validate Digest32.crc for SSTables 3.x write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:08 +03:00
Vladimir Krivopalov
fcc7bad777 tests: Fix invalid Digest file for write_counter_table test.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
dd00d90a05 sstables: Support writing compressed SSTables 3.0.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
cc62ad3b69 sstables: Make compressed streams customizable on checksumming.
Use either Adler32 or CRC32 while writing to or reading from a
compressed stream.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Vladimir Krivopalov
5183294676 sstables: Move checksum calculation logic to compressed_output_stream.
Previously, compressed_output_stream used to calculate checksum of the
supplied chunk and pass it to the 'compression' object to combine with
the full checksum calculated on prior writes.
Now, all the checksum calculation happens inside
compressed_output_stream and 'compression' only stores the result.

This is done to loosen ties between two classes and simplify
compressed_output_stream customisation with various checksum algorithms.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-19 20:52:07 +03:00
Glauber Costa
596a525950 commitlog: don't move pointer to segment
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.

The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
2018-05-18 17:25:18 +02:00
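The commitlog bug above can be reproduced in miniature (illustrative types, not the Scylla code): once the pointer is moved into the continuation, the enclosing error path dereferences a moved-from (null) pointer; the fix is to capture a copy.

```cpp
#include <cassert>
#include <memory>

struct segment { int cycles = 0; };   // stand-in for the commitlog segment

int handle_cycle(std::shared_ptr<segment> seg) {
    // BUG variant: capturing with `[s = std::move(seg)]` empties `seg`,
    // so the catch block below would dereference a null shared_ptr.
    auto cont = [s = seg] {           // fix: copy, both owners stay valid
        return ++s->cycles;
    };
    try {
        return cont();
    } catch (...) {
        return seg->cycles;           // safe only because seg was not moved
    }
}
```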
Avi Kivity
684bb2042d Merge "Fixes and improvements for gdb LSA commands" from Tomasz
* tag 'tgrabiec/fixes-and-improvements-for-gdb-scripts-v1' of github.com:tgrabiec/scylla:
  gdb: Print live object size from 'scylla lsa-segment'
  gdb: Extend 'scylla segment-descs' output with full occupancy info
  gdb: Print allocated object's type name instead of full LSA migrator
  gdb: Fix LSA migrator discovery
  gdb: Drop code related to LSA zones
  gdb: Fix uses of removed segment_descriptor::_lsa_managed
  lsa: Add use for debug::static_migrators
2018-05-17 15:54:21 +03:00
Tomasz Grabiec
d4a2d22812 gdb: Print live object size from 'scylla lsa-segment' 2018-05-17 14:22:20 +02:00
Tomasz Grabiec
08026a64c5 gdb: Extend 'scylla segment-descs' output with full occupancy info
After:

 0x600007220000: lsa free=24800  used=106272  81.08% region=0x600000403210
 0x600007240000: lsa free=13     used=131059  99.99% region=0x600000403210
 0x600007260000: lsa free=23072  used=108000  82.40% region=0x600000403210
 0x600007280000: lsa free=16772  used=114300  87.20% region=0x600000403210
 0x6000072a0000: lsa free=23996  used=107076  81.69% region=0x600000401410
 0x6000072c0000: lsa free=15552  used=115520  88.13% region=0x600000403210
2018-05-17 14:22:20 +02:00
Tomasz Grabiec
abd667d924 gdb: Print allocated object's type name instead of full LSA migrator
Before:

  0x6000302604e0: live {_vptr.migrate_fn_type = 0x3797a00 <vtable for standard_migrator<cache_entry>+16>, _migrators = std::any containing seastar::lw_shared_ptr<(anonymous namespace)::migrators> = {[contained value] = {_p = 0x600000080a80}}, _align = 8, _index = 0} @ 0x6000302604e8

After:

  0x6000302604e0: live cache_entry @ 0x6000302604e8
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
653fcc10bb gdb: Fix LSA migrator discovery
Fixes 'scylla lsa-segment' which broke after recent changes, probably
commit b3699f286d.
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
bb8f82f43f gdb: Drop code related to LSA zones
LSA zones have been removed.
2018-05-17 14:22:14 +02:00
Tomasz Grabiec
84a7961c23 gdb: Fix uses of removed segment_descriptor::_lsa_managed 2018-05-17 14:22:14 +02:00
Tomasz Grabiec
498a4132c5 lsa: Add use for debug::static_migrators
Otherwise GDB complains about it being optimized out, breaking our
debug scripts.
2018-05-17 14:22:14 +02:00
Avi Kivity
d9c80cac26 dist: move Red Hat installation from .spec %install to new install.sh
Move code to a traditional install.sh script (more traditional would be
a "make install", but this is close enough).

This allows testing installation independently of packaging. In addition,
non-Red Hat packaging can share much of the code in install.sh.

Ref #3243.

Tests: build+install rpm
Message-Id: <20180517114147.30863-1-avi@scylladb.com>
2018-05-17 13:46:27 +02:00
Avi Kivity
98967da94f Merge seastar upstream
* seastar 0a1a327...a6cb005 (1):
  > Merge " misc fixes for iotune" from Glauber
2018-05-17 12:42:46 +03:00
Avi Kivity
3b8118d4e5 dist: redhat: get rid of raid0.devices_discard_performance
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed on 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it cause the module load to fail, with the result that the
storage is not available.

Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.

Fixes #3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>
2018-05-16 15:38:29 +01:00
Avi Kivity
20271b3890 Update scylla-ami submodule
* dist/ami/files/scylla-ami e0b35dc...025644d (1):
  > Merge "AMI build fix" from Takuya
2018-05-16 12:33:45 +03:00
Avi Kivity
05cec4a265 Merge "Reduce LSA memory reclamation overhead" from Tomasz
"
Main optimization is in the patch titled "lsa: Reduce amount of segment compactions".

I measured a 50% reduction of cache update run time in a steady state for an
append-only workload with a large partition, in the perf_row_cache_update version from:

  c3f9e6ce1f/tests/perf_row_cache_update.cc

Other workloads and other allocation sites could probably also see the
improvement.
"

* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
  lsa: Expose counters for allocation and compaction throughput
  lsa: Reduce amount of segment compactions
  lsa: Avoid the call to segment_pool::descriptor() in compact()
  lsa: Make reclamation on reserve refill more efficient
2018-05-16 10:24:20 +03:00
Tomasz Grabiec
534068a0f7 Update seastar submodule
Fixes #3339

* seastar 840002c...0a1a327 (7):
  > Merge "fix perftune.py issues with cpu-masks on big machines" from Vlad
  > Merge 'Handle Intel's NICs in a special way'  from Vlad
  > reactor: fix calculation of idle ticks
  > log: streamline logging internals a little
  > Merge "CMake imrovements and compatibility" from Jesse
  > iotune: fix typo in property name
  > cmake: do not find_package(Boost ...) if Boost is a target
2018-05-16 09:11:22 +02:00
Avi Kivity
832e8fb1e0 Merge "Support writing counters in SSTables 3.x format." from Vladimir
"
This patchset adds support for writing counter cells in SSTables 3.x
format ('m'). The logic of writing counters is almost identical to that
used for the old 2.x format ('k'/'l') with the only difference that the
data length preceding serialised shards is written as a vint.

Tests: unit {release}.

Generated SSTables are verified to be processed fine by sstabledump
(note that sstabledump only outputs the binary data for counters, not
their actual values, same as sstable2json).

Verified with Cassandra 3.11 to get the expected values from the
counters table:
cqlsh> SELECT * from sst3.counter_table;

 pk  | ck  | rc1 | rc2
-----+-----+-----+-----
 key | ck1 |  10 |   1

(1 rows)

Verified that the deleted counter can no longer be updated:
cqlsh> use sst3 ;
cqlsh:sst3> UPDATE counter_table SET rc1 = rc1 + 2 WHERE pk = 'key' AND ck = 'ck2';
cqlsh:sst3> SELECT * from sst3.counter_table;

 pk  | ck  | rc1 | rc2
-----+-----+-----+-----
 key | ck1 |  10 |   1

(1 rows)
"

* 'projects/sstables-30/write_counters/v1' of https://github.com/argenet/scylla:
  tests: Unit tests to cover writing counters in SSTables 3.x format.
  sstables: Support writing counters for SSTables 3.x.
  sstables: Move code writing counter value into a separate helper.
2018-05-16 08:46:15 +03:00
Raphael S. Carvalho
59c57861ae tests/sstable_test: switch to dynamic temporary dir creation
sstable test fails when running concurrently (for example, release and debug
mode) because it uses a static temporary dir in lots of tests.
Let's fix it by switching to dynamic temporary dir, which is created using
mkdtemp(). Also the sstable tests will now run in /tmp, which makes
them much faster.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180516042044.15336-1-raphaelsc@scylladb.com>
2018-05-16 08:00:29 +03:00
Tomasz Grabiec
4fdd61f1b0 lsa: Expose counters for allocation and compaction throughput
Allow observing amplification induced by segment compaction.
2018-05-15 21:49:01 +02:00
Tomasz Grabiec
3775a9ecec lsa: Reduce amount of segment compactions
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).

This patch reduces the amount of segment compactions in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, and just eviction.

In the perf_row_cache_update test case for a large partition with lots of
rows, which simulates an appending workload, I measured that, before the
patch, 2 objects need to be migrated for each new object allocated. After
the patch, only 0.003 objects are migrated. This reduces the run time of
the cache update part by 50%.
2018-05-15 21:49:01 +02:00
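The amplification figures quoted above follow from simple arithmetic: at occupancy u, each segment is (1 - u) free, so reclaiming one whole segment needs the smallest n with n * (1 - u) >= 1, migrating n * u segments' worth of live data. A sketch:

```cpp
#include <cassert>
#include <cmath>

// Segments that must be compacted to free one whole segment at the
// given live-data occupancy: smallest n with n * (1 - occupancy) >= 1.
int segments_to_compact(double occupancy) {
    return static_cast<int>(std::ceil(1.0 / (1.0 - occupancy)));
}

// Live data migrated in the process, measured in segments' worth.
double segments_migrated(double occupancy) {
    return segments_to_compact(occupancy) * occupancy;
}
```

At 85% occupancy this yields 7 segments compacted and roughly 6 segments of data migrated, matching the numbers in the commit message.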
Vladimir Krivopalov
a16b8d5d77 tests: Unit tests to cover writing counters in SSTables 3.x format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Vladimir Krivopalov
ffd8886da9 sstables: Support writing counters for SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Vladimir Krivopalov
28c3c21c73 sstables: Move code writing counter value into a separate helper.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-15 11:44:44 -07:00
Avi Kivity
5f3a5c436e Merge "chunked vector memory estimation" from Glauber
"
The memory estimations we have when using the chunked vector
are usually slightly wrong. We can make them more accurate by
exporting the memory usage directly as a chunked_vector API.
"

* 'chunked_memory-v2' of github.com:glommer/scylla:
  large_bitset: be more accurate with memory usage
  chunked_vector: exports its current memory usage
2018-05-15 19:00:36 +03:00
Glauber Costa
2ba08178ca large_bitset: be more accurate with memory usage
We are slightly underestimating the amount of memory we use. Now that
the chunked vector can export its internal memory usage, we can use that
directly.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-15 11:22:21 -04:00
Glauber Costa
7190bb4f95 chunked_vector: exports its current memory usage
There are times in which we would like to estimate how much memory
a chunked_vector is using. We have two strategies to do it:

1) multiply the size by the size of the elements. That is wrong, because
the chunked_vector can allocate larger chunks in anticipation of more
elements to come.

2) multiply the number of chunks by 128kB. That is also wrong, because
the chunk_vector will not always allocate the entire chunk if there are
only a few elements in it.

The best way to deal with it is to allow the chunked_vector to export
its current memory usage.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-15 11:22:21 -04:00
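The third option the commit argues for can be sketched like this (a toy model, not the real chunked_vector API): summing the capacity each chunk actually allocated sidesteps both wrong estimates.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy chunked vector: full chunks hold max_chunk_elems elements, while the
// last chunk grows geometrically. Its real footprint is therefore neither
// `size * sizeof(T)` nor `n_chunks * max_chunk_bytes`.
template <typename T, size_t max_chunk_elems = 16>
struct toy_chunked_vector {
    std::vector<std::vector<T>> chunks;

    void push_back(T v) {
        if (chunks.empty() || chunks.back().size() == max_chunk_elems) {
            chunks.emplace_back();
        }
        chunks.back().push_back(std::move(v));
    }

    // What the patch exports: the memory the container really allocated.
    size_t memory_usage() const {
        size_t bytes = 0;
        for (const auto& c : chunks) {
            bytes += c.capacity() * sizeof(T);
        }
        return bytes;
    }
};
```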
Raphael S. Carvalho
83e64192d3 tests/perf: fix compaction and write mode of perf_sstable
storage_service_for_tests must be instantiated only once at a global
scope.

Fixes #3369.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180510042200.2548-1-raphaelsc@scylladb.com>
2018-05-15 18:00:18 +03:00
Avi Kivity
e0ef39705f dist: redhat: properly package scylla_blocktune.py
Commit 9eb8ea8b11 installed
scylla_blocktune.py as part of preparing the rpm, but forgot
to add it to the installed file list, breaking the rpm build.

Fix by listing the file in the %files section.
Message-Id: <20180506202807.5719-1-avi@scylladb.com>
2018-05-15 18:00:05 +03:00
Piotr Sarna
40bf5d671b cql: add secondary index metrics
This commit adds basic secondary index metrics to cql_stats:
 * total number of indexes created
 * total number of indexes dropped
 * total number of reads from a secondary index
 * total number of rows read from a secondary index

References #3384
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <d5eda7a343cee547c921dd4d289ecb1ac1c2bf24.1526374243.git.sarna@scylladb.com>
2018-05-15 17:59:53 +03:00
Avi Kivity
4f81e1f55a Merge "Use CRC32 to calculate checksums for SSTables 3.0." from Vladimir
"
SSTables 3.x (format 'm') use CRC32 instead of Adler32 for calculating
checksums. This patchset introduces support for CRC32 along with Adler32
in checksummed_file_writer to be used for SSTables written in 'mc'
format.

Structures and helpers introduced for CRC32 will be later used for
calculating checksums for compressed files as well (not a part of this
patchset).

Tests: unit {release}
"

* 'projects/sstables-30/write-digest-crc/v3' of https://github.com/argenet/scylla:
  tests: Add test covering checksumming SSTables 3.0 with CRC32.
  sstables: Support CRC32 checksum for SSTables 3.x.
  sstables: Move adler32 routines under the scope of a class.
  sstables: Move checksum utils into separate header.
  sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
2018-05-15 10:18:14 +03:00
Duarte Nunes
3a7d655d01 Merge 'transport: reduce unneeded continuations' from Avi
"
The native protocol server generates mant reactor tasks that
can be easily eliminated. I measured a read workload with 100%
cache hit rate, seeing the number of tasks per request drop
from ~31 to ~27, and an increase of 3% in throughput.
"

* tag 'transport-optimize-1/v1' of https://github.com/avikivity/scylla:
  transport: remove unused capture of flags variable
  transport: merge response write and error handling continuations
  transport: make write_repsonse() return void
  transport: de-template a lambda
  transport: merge memory-management and logging continuations
  transport: remove gate continuation
  transport: merge two response processing continuations
  transport: simplify response processing continuation
  transport: remove gratuitous continuation from process_request_one()
2018-05-14 10:12:07 +01:00
Avi Kivity
a99e820bb9 query_processor: require clients to specify timeout configuration
Remove implicit timeouts and replace with caller-specified timeouts.
This allows removing the ambiguity about what timeout a statement is
executed with, and allows removing cql_statement::execute_internal(),
which mostly overrode timeouts and consistency levels.

Timeout selection is now as follows:

  query_processor::*_internal: infinite timeout, CL=ONE
  query_processor::process(), execute(): user-specified consistency level and timeout

All callers were adjusted to specify an infinite timeout. This can be
further adjusted later to use the "other" timeout for DCL and the
read or write timeout (as needed) for authentication in the normal
query path.

Note that infinite timeouts don't mean that the query will hang; as
soon as the failure detector decides that the node is down, RPC
responses will terminate with a failure and the query will fail.
2018-05-14 09:41:06 +03:00
Avi Kivity
4500baaaf4 transport: remove unused capture of flags variable 2018-05-14 09:41:06 +03:00
Avi Kivity
2a1f231f82 query_processor: un-default consistency level in make_internal_options
Make the consistency level explicit in the caller in order to clarify
what is going on.

An "internal" query used to mean that it was accessing local tables,
so infinite timeouts and a consistency level of ONE were indicated,
but authentication accesses non-local tables so explicit consistency
level and timeouts are needed.
2018-05-14 09:41:06 +03:00
Avi Kivity
88f8fe3168 transport: merge response write and error handling continuations
The response write continuation does not defer, so traditional try/catch
works well and saves a continuation.
2018-05-14 09:41:06 +03:00
Avi Kivity
3e8d1c8fd7 transport: make write_response() return void
It just schedules the response, and returns immediately.

(I thought about calling it schedule_response(), but usually it will
write the response immediately, since waiting for network writes is
rare in a local network).
2018-05-14 09:41:06 +03:00
Avi Kivity
b26f36c2ec transport: de-template a lambda
Generic templates = annoying.
2018-05-14 09:41:06 +03:00
Avi Kivity
7a9b73f166 transport: merge memory-management and logging continuations
Merge a continuation that just keeps things alive with another that
just logs things.
2018-05-14 09:41:06 +03:00
Avi Kivity
f0887a55e4 transport: remove gate continuation
with_gate() generates a continuation if the protected function defers.
Avoid that by merging a gate::leave() call with another, preexisting,
continuation.
2018-05-14 09:41:06 +03:00
Avi Kivity
876837a5da transport: merge two response processing continuations
We have one continuation transforming the result, and another shutting
down tracing. Since the first cannot defer, we can merge the two, reducing
the number of tasks processed by the reactor.
2018-05-14 09:41:06 +03:00
Avi Kivity
38619138be transport: simplify response processing continuation
A continuation in the response processing path is only doing
transformation on the output. Make that clear by returning a value,
not a future.
2018-05-14 09:41:06 +03:00
Avi Kivity
f0a1478b6c transport: remove gratuitous continuation from process_request_one()
No need to call then() just to convert exceptions to futures,
futurize_apply() does this with less ado.
2018-05-14 09:41:06 +03:00
Vladimir Krivopalov
1da6144f90 tests: Add test covering checksumming SSTables 3.0 with CRC32.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
e6dfa008d8 sstables: Support CRC32 checksum for SSTables 3.x.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
adb43959d1 sstables: Move adler32 routines under the scope of a class.
This is a step towards making digest algorithm customizable at compile
time.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Vladimir Krivopalov
4e4030676f sstables: Move checksum utils into separate header.
Checksummed writer doesn't need to include all compression stuff.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-13 12:38:25 -07:00
Nadav Har'El
f5536d607e secondary index: fix multiple appearance of rows
This patch fixes a bug where queries using a secondary index would, in
some cases, produce the same rows multiple times.

The problem was that the code begins by finding a list of primary keys
that match the search, and then work on the partitions containing them.
If multiple rows matched in the same partition, the partition was considered
multiple times, and the same rows were output multiple times.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180510203141.17157-1-nyh@scylladb.com>
2018-05-13 20:08:14 +02:00
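The fix above amounts to collapsing index hits by partition before reading the base table. A minimal sketch with hypothetical types (the real code works on index-table partition keys, not strings):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One index hit per matching row: (base partition key, clustering key).
using index_hit = std::pair<std::string, std::string>;

// Read each matched partition once, not once per matching row: collapse
// consecutive hits that share a partition key (hits from an index scan
// arrive grouped by partition).
std::vector<std::string> partitions_to_read(const std::vector<index_hit>& hits) {
    std::vector<std::string> pks;
    for (const auto& hit : hits) {
        if (pks.empty() || pks.back() != hit.first) {
            pks.push_back(hit.first);
        }
    }
    return pks;
}
```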
Avi Kivity
7d29addb1f mutation_reader: optimize make_combined_reader for the single-reader case
If we're given a single reader (can be common in a low-write-rate table,
where most of the data will be in a single large sstable, or in leveled
tables) then we can avoid the overhead of the combining reader by returning
the single input.

Tests: unit (release)
Message-Id: <20180513130333.15424-1-avi@scylladb.com>
2018-05-13 20:07:10 +02:00
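The fast path above is worth spelling out (simplified interface, not the actual flat_mutation_reader API): with a single input there is nothing to merge, so the merging layer is bypassed entirely.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct reader {                       // stand-in for flat_mutation_reader
    virtual ~reader() = default;
};

struct combined_reader : reader {     // the merging reader we want to avoid
    std::vector<std::unique_ptr<reader>> inputs;
    explicit combined_reader(std::vector<std::unique_ptr<reader>> rs)
        : inputs(std::move(rs)) {}
};

std::unique_ptr<reader> make_combined_reader(std::vector<std::unique_ptr<reader>> rs) {
    if (rs.size() == 1) {
        return std::move(rs[0]);      // single input: return it as-is
    }
    return std::make_unique<combined_reader>(std::move(rs));
}
```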
Duarte Nunes
a23bda3393 Merge 'Implement separate timeout for range queries' from Avi
"
This patchset implements separate timeouts for range queries, and lays
the foundations for separate timeouts for other query types.

While the feature in itself is worthy, the real motivation is to have
the timeouts decided by the caller, instead of storage_proxy. This in
turn is required to disentangle each layer behaving differently
depending on whether the query is internal or not; instead, the goal
is to have each caller declare its needs in terms of consistency level
and timeouts, and have the lower layers implement its requirements
instead of making their own decisions.

Fixes #3013.

Tests: unit (release)
"

* tag '3013/v1.1' of https://github.com/avikivity/scylla:
  storage_proxy: remove default_query_timeout()
  storage_proxy: don't use default timeouts
  query_options: augment with timeout_config
  thrift: configure thrift transport and handler with a timeout_config
  transport: configure native transport with a timeout_config
  cql3: define and populate timeout_config_selector
  timeout_config: introduce timeout configuration
2018-05-13 20:05:50 +02:00
Glauber Costa
3d2c4c1cf8 main: change I/O scheduler verification code
Before we accept running while not in developer mode, we verify that
the I/O Scheduler is properly configured. Up until now, that meant
verifying that --max-io-requests is properly set and that the number
of I/O Queues is enough to leave at least 4 requests per I/O Queue.

Systems that move to newer versions of Scylla may continue doing that,
so we need to be backwards compatible and keep testing for that.
However, newer systems will not set that option, but pass a YAML
property file (or string) instead. So we need to make sure that
either one of those is set.

If the property file is set, I am deciding here not to test for
number of I/O queues. scylla_io_setup will usually configure that
anyway, plus we plan on soon moving to all-shards-dispatch making
that less important.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180509163737.5907-1-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Glauber Costa
2e0c673432 database: release flush permits earlier
There is an ongoing discussion in issue 2678 about the right time to
release permits. Right now we are releasing the permit only after we write
all data for the memtable plus the SSTable's accompanying components,
then flush them, close them, etc.

During all that time, we are increasing virtual dirty by adding more
data to the buffers but we are not able to decrease it: until we
release the permit we can't start flushing the next memtable. This is
much more of a concern than I/O overlapping as described in the issue.

We have a hook in the SSTable write process that is (should be) called
as soon as data is written. We should move the permit release there.

We aren't, though, calling that as early as we could. The call to the
data-written hook currently happens only after the Index is closed, the
summary is sealed, etc.

This patch fixes that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-2-glauber@scylladb.com>
2018-05-13 19:22:54 +03:00
Tomasz Grabiec
8faafdaae5 lsa: Avoid the call to segment_pool::descriptor() in compact() 2018-05-11 19:07:23 +02:00
Tomasz Grabiec
19edf3970e lsa: Make reclamation on reserve refill more efficient
Currently reserve refill allocates segments repeatedly until the
reserve threshold is met. If a single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we try to allocate. In particular, it would not attempt to
compact any segment until it has evicted the total amount of memory first,
which may reduce the total amount of segment compactions during
refill.

This patch changes refill to increase reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
2018-05-11 19:07:23 +02:00
Takuya ASADA
6fa3c4dcad dist/redhat: replace scylla-libgcc72/scylla-libstdc++72 with scylla-2.2 metapackage
We have conflict between scylla-libgcc72/scylla-libstdc++72 and
scylla-libgcc73/scylla-libstdc++73, need to replace *72 package with
scylla-2.2 metapackage to prevent it.

Fixes #3373

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180510081246.17928-1-syuu@scylladb.com>
2018-05-11 09:41:57 +03:00
Vladimir Krivopalov
f443e85476 sstables: Remove unused 'checksum_file' flag from checksummed_file_writer.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 11:11:06 -07:00
Paweł Dziepak
863a96db48 Merge "Fix partition tombstones for SSTables 3.x" from Vladimir
"Previously, the partition tombstone was not written for partitions with no
rows, causing corrupted data files.

This is now fixed and covered with tests.

In addition, we now track partition tombstones while collecting encoding
statistics."

* 'projects/sstables-30/fix-partition-tombstone/v3' of https://github.com/argenet/scylla:
  tests: Don't use deprecated schema constructor.
  tests: Add tests to cover partitions consisting only of partition keys.
  sstables: Make sure partition level tombstone is written for partitions with no rows.
  memtable: Collect statistics from partition-level tombstone.
2018-05-10 16:27:20 +01:00
Vladimir Krivopalov
d7177d9013 tests: Don't use deprecated schema constructor.
Rely entirely on schema_builder facilities while preparing schema for
unit tests.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 08:13:29 -07:00
Vladimir Krivopalov
64cdb30379 tests: Add tests to cover partitions consisting only of partition keys.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 08:12:58 -07:00
Vladimir Krivopalov
97079208db sstables: Make sure partition level tombstone is written for partitions with no rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 07:28:54 -07:00
Vladimir Krivopalov
ffc3a1ffeb memtable: Collect statistics from partition-level tombstone.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-10 07:28:50 -07:00
Duarte Nunes
21ccf173a1 Merge 'Preparatory cleanup for stateful range-scans' from Botond
"
This is a preparatory cleanup series with fixes/cleanup of miscellaneous
issues that I discovered while working on the stateful range-scans.
Since the stateful range-scans series, even without these patches, is a
20+ patches strong series I'd like to fast-track this, to ease reviewing
the former.
Most of the changes here are related to code-hygiene and effectiveness
and there is a patch that is correctness-related ("querier: check only
the end bound of ranges when matching them") and one that is related to
ease-of-use ("range: clean the deduced transformed type").
Note that although these changes were made in the context of working on
the stateful range-scans they make sense on their own as well.

Tests: unit(release, debug)
"

* '1865/pre-range-scans-cleanup/v1' of https://github.com/denesb/scylla:
  multishard_combining_reader: use optimized optional for the shard reader
  Use dht::token_range alias for last/preferred replicas
  storage_proxy::coordinator_query_result: merge constructors into one w/ default params
  querier: check only the end bound of ranges when matching them
  querier: take range and slice by value
  querier: remove const params from make_compaction_state()
  querier: make _range and _slice const
  flat_multi_range_mutation_reader: optimize for non-plural range vectors
  range: clean the deduced transformed type
2018-05-10 11:09:44 +01:00
Botond Dénes
7a3eab90c8 multishard_combining_reader: use optimized optional for the shard reader
Use flat_mutation_reader_opt instead of
std::optional<flat_mutation_reader>.
2018-05-10 13:06:47 +03:00
Duarte Nunes
d49348b0e1 Merge 'Include OPTIONS with LIST ROLES' from Jesse
"
Fixes #3420.

Tests: dtest (`auth_test.py`), unit (release)
"

* 'jhk/fix_3420/v2' of https://github.com/hakuch/scylla:
  cql3: Include custom options in LIST ROLES
  auth: Query custom options from the `authenticator`
  auth: Add type alias for custom auth. options
2018-05-10 11:03:29 +01:00
Vladimir Krivopalov
e5477c6c6c utils: Use dedicated enum for Bloom filter format instead of a boolean.
It better reflects the purpose of the parameter and provides better type-safety.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <10a4fc16dafa0fb3234969041f68f9e7bfc61312.1525899669.git.vladimir@scylladb.com>
2018-05-10 09:47:41 +03:00
Avi Kivity
76c64e1f26 Merge "Prepare for the new in-memory representation" from Paweł
"
These patches were extracted from a much larger series that introduces a new
in-memory representation of cells. They contain various enhancements and
fixes that to a varying degree make sense on their own. Sending them
separately will hopefully ease the review and merging process of the whole
IMR effort.

Tests: unit(release).
"

* tag 'pre-imr/v1' of https://github.com/pdziepak/scylla:
  tests/perf: add microbenchmarks for basic row operations
  tests: simple_schema: add make_row_from_serialized_value()
  row: add clear_hash()
  types: move compare_unsigned() to bytes.hh
  lsa: provide migrator with the object size
  lsa: add free() that does not require object size
  db/view/build_progress: avoid copying mutation fragment
  mutation_partition: enable ADL for cell swap
  types: make some collection_type_impl functions non-static
  counters: drop revertability of apply()
  mutable_view: add default constructor and const_iterator
  tests/mutation_reader: do not apply mutations created on another shard
  sstables: do not call atomic_cell::value() for dead cells
  lsa: sanitize use of migrators
  lsa: reuse registered migrator ids
  lsa: make migrators table thread-local
2018-05-10 09:41:49 +03:00
Botond Dénes
ddd70dc113 Use dht::token_range alias for last/preferred replicas
Use the pre-existing type alias instead of fully spelling out the type
everywhere.
2018-05-10 06:22:39 +03:00
Botond Dénes
52affa2a61 storage_proxy::coordinator_query_result: merge constructors into one w/ default params 2018-05-10 06:22:39 +03:00
Botond Dénes
3b6f4e4901 querier: check only the end bound of ranges when matching them
The querier provides a `matches(const nonwrapping_range&)` member to
allow for checking whether a range matches that with which the querier
was originally created. The check for match is more lax than a strict
equality check, as ranges are shrunk as the query progresses.
Because of this the above member only checked that one of the bounds of
the examined ranges matches. This is adequate for this purpose
because, in the context of a single query, it is guaranteed that no
two read requests to the same replica will have overlapping ranges.
However Avi pointed out in a recent, related review, that this check can
be made a little more strict by requiring that the end-bounds of the
two ranges *always* match, instead of allowing any of the bounds to
match.
2018-05-10 06:22:39 +03:00
Botond Dénes
eba90d0208 querier: take range and slice by value
It needs to copy these anyway so give callers the opportunity to move
these in.
2018-05-10 06:22:39 +03:00
Botond Dénes
546a0e292e querier: remove const params from make_compaction_state() 2018-05-10 06:22:39 +03:00
Botond Dénes
bc01833cad querier: make _range and _slice const
Since we are storing them on the heap we can make them const and still
be movable. We get the cake and can eat it too.
2018-05-10 06:22:39 +03:00
Botond Dénes
f5b012c952 flat_multi_range_mutation_reader: optimize for non-plural range vectors
Don't create a flat_multi_range_mutation_reader when the range vector
has 0 or 1 element. In the former case create an empty reader and in the
latter just create a reader from the mutation source with the only range
in the vector.
2018-05-10 06:22:39 +03:00
Botond Dénes
16319c2036 range: clean the deduced transformed type
wrapping_range and nonwrapping_range offer a transform() member function
which allows creating a new range by applying a transformer function to
the bounds of the current range. The type of bounds of the new range is
deduced from the return type for this transformer function. However the
return type is used as-is, with any CV or reference attached to it.
Since it doesn't make sense to create a range of references or a type
with CV qualifiers strip these off the deduced type.
2018-05-10 06:22:39 +03:00
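The deduction fix above boils down to stripping CV and reference qualifiers from the transformer's return type. A minimal sketch with a hypothetical range type (the real transform() lives on wrapping_range/nonwrapping_range):

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

template <typename T>
struct range {
    T start, end;
};

// Deduce the new bound type from the transformer's return type, but strip
// any reference and CV qualifiers first: a range of references or of
// const bounds makes no sense.
template <typename T, typename Transformer,
          typename U = std::remove_cv_t<std::remove_reference_t<
              std::invoke_result_t<Transformer, T&&>>>>
range<U> transform(range<T> r, Transformer&& f) {
    return range<U>{f(std::move(r.start)), f(std::move(r.end))};
}
```

A transformer returning `const long&` then yields a clean `range<long>` rather than a range of const references.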
Jesse Haber-Kucharsky
4ffb4c6788 cql3: Include custom options in LIST ROLES
An implementation of `authenticator` can support custom options for
each role.

If, to make up an example, the authenticator supported the `region` key,
then a role would be created as follows:

CREATE ROLE jsmith WITH OPTIONS = { 'region': 'north_america' }
                    AND PASSWORD = 'super_secure';

LIST ROLES will now print this custom option map as an additional column
with the heading "options".

However, none of the implementations of `authenticator` in Scylla
currently support OPTIONS, so in practice, for now, LIST ROLES will
print an empty map:

 role      | super | login | options
-----------+-------+-------+---------
 cassandra |  True |  True |        {}
2018-05-09 21:17:14 -04:00
Jesse Haber-Kucharsky
cd0553ca6a auth: Query custom options from the authenticator
None of the `authenticator` implementations we have support custom
options, but we should support this operation to support the relevant
CQL statements.
2018-05-09 21:12:50 -04:00
Jesse Haber-Kucharsky
e149e48609 auth: Add type alias for custom auth. options 2018-05-09 21:12:47 -04:00
Paweł Dziepak
0b8a85b15f tests/perf: add microbenchmarks for basic row operations 2018-05-09 16:52:26 +01:00
Paweł Dziepak
e949061126 tests: simple_schema: add make_row_from_serialized_value()
simple_schema::make_row() is not very well suited for performance tests
of row and cell creation since it serialises the value. This patch
introduces a new function that performs only minimal actions.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
33dffd5fb6 row: add clear_hash()
Needed to measure the performance of hashing a cell.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
f9940f620a types: move compare_unsigned() to bytes.hh
compare_unsigned() is a general utility function that compares two
bytes_view byte-by-byte. There is no need to include whole type.hh in
order to make it available.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
c6c5accd19 lsa: provide migrator with the object size
While the migration function should have enough information to obtain
the object size itself, the LSA logic needs to compute it as well.
IMR is going to make calculating object sizes more expensive, so by
providing the information to the migrator we can avoid some needless
operations.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
884888dc11 lsa: add free() that does not require object size
It is non-trivial to get the size of an IMR object. However, the
standard allocator doesn't really need it and LSA can compute it itself
by asking the migrator.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
75b8b521d9 db/view/build_progress: avoid copying mutation fragment 2018-05-09 16:52:26 +01:00
Paweł Dziepak
00509913fc mutation_partition: enable ADL for cell swap
Calling fully qualified std::swap() prohibits the cell objects from
using their own swap implementations. This patch invokes std::swap in
the usual ADL-friendly way.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
0b4c6b8938 types: make some collection_type_impl functions non-static
The switch to the new in-memory representation will require larger
parts of the logic to be aware of the type of the values they are dealing
with. In most cases it is not a significant burden for the users.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
a2b5779714 counters: drop revertability of apply()
Since 4cfcd8055e 'Merge "Drop reversible
apply() from mutation_partition" from Tomasz' it is no longer required
for apply() to be revertible.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
f7438a8b96 mutable_view: add default constructor and const_iterator
Makes the interface more consistent with bytes_view.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
7c5c77369a tests/mutation_reader: do not apply mutations created on another shard
Scylla uses shared-nothing architecture and communication between the
shards is supposed to be very restricted. Applying mutations created
on another shard to a memtable is far too complex an operation to be
allowed. Using frozen mutations is a much safer option.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
55d1d7adfb sstables: do not call atomic_cell::value() for dead cells
The preconditions of value() require the cell to be live.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
b1bec336b3 lsa: sanitize use of migrators
Having migrators dynamically registered and deregistered opens a new
class of bugs. This patch adds some additional checks in the debug mode
with the hopes of catching any misuse early.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
cca9f8c944 lsa: reuse registered migrator ids
With the introduction of the new in-memory representation we will get
type- and schema-dependent migrators. Since there is no bound on how many
times they can be created and destroyed, it is better to be safe and
reuse registered migrator ids.
2018-05-09 16:52:20 +01:00
Paweł Dziepak
b3699f286d lsa: make migrators table thread-local
Migrators can be registered and deregistered at any time. If the table
is not thread-local we risk race conditions.
2018-05-09 16:10:46 +01:00
Avi Kivity
8d09820472 Merge "Load serialization header for SSTables in 3.0 format" from Piotr
"
SSTable 3.0 format introduces a serialization header which is used when reading SSTables in that format.
This patchset implements loading of this new component of Statistics.db.

Tests: units (release)
"

* 'haaawk/sstables3/load_serialization_header_v2' of ssh://github.com/scylladb/seastar-dev:
  Load serialization_header from statistics
  Add parse for disk_array_vint_size
  Add helpers to read/parse vints
  Add signed_vint::serialized_size_from_first_byte
  Add sstable::get_serialization_header
  Move random_access_reader to separate header
2018-05-09 17:48:48 +03:00
Glauber Costa
94f686f946 memtable controller: reduce adjustment period to 50ms
250ms is too long a period for the memtable controller. Memtable
flushes are relatively efficient, especially in comparison to
compactions, so if the shares are high we can flush a lot of data
within one period - and then in the next adjustment period our shares will
be minuscule and we won't flush much at all.

This leads to oscillating behavior that is mitigated by adjusting
faster.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180508182746.28310-3-glauber@scylladb.com>
2018-05-09 17:40:46 +03:00
Paweł Dziepak
920131b2f7 Merge "mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged" from Tomasz
"Fixes a bug in partition_snapshot::merge_partition_versions(), which would not
attempt merging if the snapshot is attached to the latest version (in which
case _version is nullptr and _entry is != nullptr). This would cause
partition_version objects to accumulate if there was an older snapshot and it
went away before the latest snapshot. Versions will be removed when the whole
entry goes away (flush or eviction).

May cause performance problems.

Fixes #3402."

* 'tgrabiec/fix-merge_partition_versions' of github.com:tgrabiec/scylla:
  mvcc: Test version merging when snapshots go away
  anchorless_list: Make ranges conform to SinglePassRange
  anchorless_list: Drop deprecated use of std::iterator
  mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
2018-05-09 15:10:56 +01:00
Piotr Jastrzebski
70a204cdd0 Load serialization_header from statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:59 +02:00
Piotr Jastrzebski
3e4bc923a8 Add parse for disk_array_vint_size
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:59 +02:00
Piotr Jastrzebski
6b4df2d424 Add helpers to read/parse vints
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 15:46:46 +02:00
Glauber Costa
aadc709068 scylla_io_setup: run new iotune.
The newer version of iotune, recently merged to Seastar, accepts
a new parameter that tells us where should we store the properties
about the disk.

We are already generating that properties file for the AMI case.
Let's also pass that parameter when calling iotune.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180507175757.9144-1-glauber@scylladb.com>
2018-05-09 16:32:43 +03:00
Amnon Heiman
6bf759128b scylla-housekeeping: support new 2018.1 path variation
Starting from 2018.1 and 2.2 there was a change in the repository path.
It was made to support multiple products (like manager) and to place the
enterprise in a different path.

As a result, the regular expression that looks for the repository fails.

This patch changes the way the path is searched: both rpm and debian
variations are combined and both options of the repository path are
unified.

See scylladb/scylla-enterprise#527

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180429151926.20431-1-amnon@scylladb.com>
2018-05-09 15:22:30 +03:00
Botond Dénes
777f3c7dc2 mutation_reader_test: don't lock up with smp=1
test_foreign_reader_destroyed_with_pending_read_ahead locks up completely
when run with SMP=1. As a solution, skip the test case when SMP < 2.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <815585c40a65a66f3b03e6393b46fbd6849c8ef5.1525866777.git.bdenes@scylladb.com>
2018-05-09 15:10:18 +03:00
Piotr Jastrzebski
b602dea726 Add signed_vint::serialized_size_from_first_byte
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:41:00 +02:00
Piotr Jastrzebski
589463165c Add sstable::get_serialization_header
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:40:59 +02:00
Piotr Jastrzebski
aa126639c0 Move random_access_reader to separate header
It will be used not only in sstables.cc but also
in helpers for reading sstables in M format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-09 11:40:59 +02:00
Avi Kivity
911c2e7953 Merge "Support Bloom filter format for SSTables 3.x." from Vladimir
"
In SSTables 3.0, the base and increment fields have been swapped in
Bloom filters to reduce collisions (see CASSANDRA-8413). This affects
the resulting values written to Filter.db.

This patchset adds support for reading/writing Filter.db in the format
corresponding to the version of SSTables.

Tests: unit {release}

Filter.db files have been generated using Cassandra 3.11 with same data
as in unit tests and are validated to match those generated by Scylla.
"

* 'projects/sstables-30/write-filter/v1-2' of https://github.com/argenet/scylla:
  Fix mistakes and typos in comments (minor clean-up)
  Check Filter.db in SSTables 3.x write tests.
  Support Bloom filter format used in SSTables 3.0.
  Remove unused overload of i_filter::get_filter().
2018-05-09 11:16:09 +03:00
Vladimir Krivopalov
51c8ea74d6 sstables: generate non-empty summaries for m format
Add summary entries as needed. Also removes the duplicate line that
assigned summary byte cost.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <0d387c68523bae0c121cb15ad1e651ee9a8e4b4a.1525732404.git.vladimir@scylladb.com>
2018-05-09 11:15:02 +03:00
Vladimir Krivopalov
b59549cd16 Fix mistakes and typos in comments (minor clean-up)
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:43 -07:00
Vladimir Krivopalov
e739bb3280 Check Filter.db in SSTables 3.x write tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:35 -07:00
Vladimir Krivopalov
0f37c0e684 Support Bloom filter format used in SSTables 3.0.
The two hash values, base and increment, used to produce indices for
setting bits in the filter, have been swapped in SSTables 3.0.
See CASSANDRA-8413 for details.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:27 -07:00
Vladimir Krivopalov
fe2358e8bd Remove unused overload of i_filter::get_filter().
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-08 15:28:18 -07:00
Calle Wilund
b2b1a1f7e1 database: Fix assert in truncate
Fixes crash in cql_tests.StorageProxyCQLTester.table_test
"avoid race condition when deleting sstable on behalf..." changed
discard_sstables behaviour to only return rp:s for sstables owned
and submitted for deletion (not all matching time stamp),
which can in some cases cause zero rp returned.
Message-Id: <20180508070003.1110-1-calle@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
48c96d09d6 db::hints::manager: drain hints when the node is decommissioned/removed
When a node is decommissioned/removed it will drain all its hints, and all
remote nodes that have hints for it will drain their hints to this node.

What does "drain" mean? The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move on to the next one till there are
no more segments left.

After all hints are drained the corresponding hints directory is removed.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
ec76f8a27d db::hints::manager: add a few more trace messages
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
6ede32156f db::hints::manager::end_point_hints_manager::sender: add set_stopping()/stopping() methods
It's nicer to have access methods instead of working directly with enum_set methods and values.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
94da744f37 db::hints::manager::end_point_hints_manager::stop(): log the last exception instead of forwarding it
Returning a future with an exception from end_point_manager::stop()
is practically useless because the best the caller can do is to log
it and continue as if it didn't happen because it has other things
to shut down.

Therefore in order to simplify the caller we will log the exception
if it happens and will always return a non-exceptional future.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
8aedbf9d18 db::hints: manager.hh: cleanup: fix the comments
Fix the comments that went out of sync with the current implementation.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
5463b58faa db::hints::manager: rework end_point_hints_manager::stop() to use seastar::async()
This makes the code simpler to read and extend.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Botond Dénes
6f7d919470 database: when dropping a table evict all relevant queriers
Queriers shouldn't outlive the table they read from as that could lead
to use-after-free problems when they are destroyed.

Fixes: #3414

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <3d7172cef79bb52b7097596e1d4ebba3a6ff757e.1525716986.git.bdenes@scylladb.com>
2018-05-07 21:20:25 +03:00
Duarte Nunes
c053275a48 db/view/row_locking: Add timeout when waiting for the lock
This ensures we respect the write timeout set by the client when
applying base writes, in case a write takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Duarte Nunes
113294074d Merge seastar upstream
* seastar ac02df7...840002c (20):
  > dpdk: protect against missing statistics
  > alien: make visible in documentation
  > Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
  > app_template: Correct outdated comment
  > apps, tests: Catch polymorphic exceptions by reference
  > configure.py: Enhance detection for gcc -fvisibility=hidden bug
  > reactor: add rudimentary task histogram reporting
  > Revert "Merge "rewrite iotune to conform to the new ioscheduler" from Glauber"
  > Merge "rewrite iotune to conform to the new ioscheduler" from Glauber
  > build: Use the same warning name for Clang and GCC
  > core/rwlock: Add support for timeouts
  > fs qualification: protect against EINTR
  > Docker: Fix failing build due to missing GNU make
  > reactor: move optional to experimental so we compile with c++14
  > future: remove allocation from future::get() thread context switch
  > Merge "rpc streaming" from Gleb
  > reactor: put mountpoint_params in seastar namespace
  > Tutorial: in PDF version of tutorial, better backtick typesetting
  > tutorial: support, and start using, links to other sections
  > tutorial: improve second half of semaphores section

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Tomasz Grabiec
58fe331c7e mvcc: Test version merging when snapshots go away 2018-05-07 13:54:30 +02:00
Avi Kivity
368e15a8e2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 8a6e4dd...e0b35dc (1):
  > change default roles for EBS / ephemeral
2018-05-07 12:34:04 +03:00
Duarte Nunes
4b3562c3f5 db/view: Limit number of pending view updates
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit the number
of outstanding view updates to 100. The limit is global per shard, not
per destination view replica, and it is static.

Refs #2538

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
2018-05-07 11:25:27 +03:00
Duarte Nunes
2be75bdfc9 db/timeout_clock: Properly scope type names
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-1-duarte@scylladb.com>
2018-05-07 11:24:41 +03:00
Nadav Har'El
c93b56034d tests: improve usability of cql_assertions.hh error messages
The functions in cql_assertions.hh are very convenient, but have one
frustrating drawback: When you have many of those assertions in one
test, it's very hard to know *which* of the similar assertions failed.

The problem is that an error often looks like this:

unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/cql_assertions.cc(131): last checkpoint

Which of the many similar checks in "test_many_columns" failed? Note the
unhelpful "unknown location" and also the "last checkpoint" points to code
in cql_assertions.cc, not in the actual test, so it is useless.

The root cause of these problems is that the Boost macros use the C
preprocessor's __FILE__ and __LINE__, which, in actual C++ functions like
is_rows(), record that function's location instead of the caller's. Fixing
this will not be simple. But this patch has a much simpler solution - fixing
the "last checkpoint". What ruins the last checkpoint is the use of
BOOST_REQUIRE inside is_rows() in cql_assertions.cc - when that succeeds, it records
the location inside cql_assertions.cc (!) as the last success.

If we just replace BOOST_REQUIRE by our own test (just like in the rest of
the cql_assertions.cc code), this code will not override the last checkpoint.
The user can see the last real successful BOOST_REQUIRE, or use
BOOST_TEST_PASSPOINT() to set their own checkpoints between different parts of
the same test.

After this patch, and with adding BOOST_TEST_PASSPOINT() calls between
different parts of my test, the failure above now looks like:

unknown location(0): fatal error: in "test_many_columns":
std::runtime_error: Expected 2 row(s) but got 0
tests/secondary_index_test.cc(299): last checkpoint

The "last checkpoint" now shows me exactly where my failing check was.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501152638.26238-1-nyh@scylladb.com>
2018-05-07 09:19:45 +01:00
Duarte Nunes
eabe471ce8 tests/secondary_index_test: Don't catch polymorphic exceptions by value
Don't slice exception by catching them by value. Instead of catching
by reference, use assert_that_failed().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180506153745.4512-1-duarte@scylladb.com>
2018-05-06 18:53:40 +03:00
Duarte Nunes
ab5a45b00c Merge 'Improve debuggability of result_message' from Avi
"This patchset adds ostream operators to result_message and uses them
in cql_assertions."

* tag 'result_message-print/v1.1' of https://github.com/avikivity/scylla:
  tests: cql_assersions: improve error message when a row is not found
  transport: add ostream support to result_message
  transport: const correctness for result_message::accept()
2018-05-06 14:52:56 +01:00
Avi Kivity
6d3fb69827 tests: cql_assersions: improve error message when a row is not found
Display the row and the result set.
2018-05-06 16:28:37 +03:00
Avi Kivity
07d69ebce2 transport: add ostream support to result_message
Allow printing result_message:s for debugging.
2018-05-06 16:28:35 +03:00
Avi Kivity
50d4d01cb7 tests: fix view_schema_test cql_assertion types
Use utf8_type where warranted.

Fixes view_schema_test failure where the rows did not match. I don't
understand exactly why the failure happened (using the wrong type
should not cause a failure here), but the change fixes the problem.

Tests: view_schema_test (release)
Message-Id: <20180506130015.7450-1-avi@scylladb.com>
2018-05-06 14:25:22 +01:00
Avi Kivity
31f2b3ce15 transport: const correctness for result_message::accept()
The visitor does not alter the result_message it is visiting (and
its signature indicates that) so accept() should be const-qualified
to indicate that and to allow visiting const result_message:s.
2018-05-06 15:51:48 +03:00
Avi Kivity
cc900c23a6 Merge "Write Statistics.db in SSTables 3.x format." from Vladimir
"
This patchset adds support for writing Statistics.db in the SSTables
'mc' (3.x) format. This file is essential for reading data stored in
Data.db as it contains base values used for delta encoding and types of
columns.

This patchset also fixes several bugs found in writing data and index
files as well as bugs in a statistics-related structure definition.

Tests: unit {debug, release}

All SSTables files for write unit tests are validated to be processed by
sstabledump and output is verified to show the expected data.
"

* 'projects/sstables-30/write-statistics/v1' of https://github.com/argenet/scylla:
  Add test covering the composite partition key case.
  Add Statistics.db files to write tests for SSTables 3.0.
  Do not check rows and cells for expiration when writing them to the data file.
  Fix promoted index serialization.
  Fix the order of items in stats_metadata.
  Fix timestamp_epoch value which was truncated on exceeding int32_t type limit.
  Write serialization header to Statistics.db for SSTables 3.x.
  Do not pass schema to metadata_collector::update(column_stats)
  Collect metadata statistics when writing SSTables 3.0.
  Call get_metadata_collector() instead of referencing sstable::_collector directly.
  Fix logic of writing TTLed cells in SSTable 3.0 format.
  Separate statistics for count of cells, columns and rows in column_stats.
  Deserialize collection in a way that doesn't incur shared_ptr counter increment and is generally shorter.
  Track both min & max values for timestamp, TTL and local deletion time in metadata_collector.
  Add class for tracking both extremum values (min and max) on updates.
2018-05-05 16:53:08 +03:00
Vladimir Krivopalov
4ecb3a5e2a Add test covering the composite partition key case.
Mainly to check that the composite type is properly serialized when
writing serialization header to Statistics.db.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
1b3989adcd Add Statistics.db files to write tests for SSTables 3.0.
For these tests to work, all time-related values are now fixed as these
are stored in Statistics.db files.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
293ee6ae3f Do not check rows and cells for expiration when writing them to the data file.
Although this logic may be seen as a useful optimization, it hinders
unit tests that write SSTables 3.0, as those need to have fixed time-related
values to produce Statistics.db files with the same content on each run.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:11 -07:00
Vladimir Krivopalov
44bc0f1493 Fix promoted index serialization.
There is a new field introduced in the SSTables 3.0 index file format
named 'partition_header_length' that can be used to skip over to the
first clustering row in a wide partition. This one has not been
previously written and caused malformed indices.

Updated the corresponding test to include a static row and write
multiple wide partitions.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
56ac941a2e Fix the order of items in stats_metadata.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
926cdc6d70 Fix timestamp_epoch value which was truncated on exceeding int32_t type limit.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
5db6002720 Write serialization header to Statistics.db for SSTables 3.x.
Serialization header is a new component in Statistics.db introduced in
the SSTables 3.0 ('ma') format. It is essential for reading the data file as it
contains the base values used for delta-encoded values (timestamps,
TTLs, local deletion times) and description of column types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:43:17 -07:00
Vladimir Krivopalov
6e4601d177 Do not pass schema to metadata_collector::update(column_stats)
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:22:32 -07:00
Vladimir Krivopalov
a10ad6b623 Collect metadata statistics when writing SSTables 3.0.
Track min/max timestamps, TTLs, local deletion times and count of cells,
columns and rows.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:22:30 -07:00
Raphael S. Carvalho
abcfc19fe9 db: make compaction slightly faster by not using filtering reader on unshared sstable
After reboot, all existing sstables are considered shared. That's a safe default.
Reader used by compaction decides to use filtering reader (filters out data that
doesn't belong to this shard) if sstable is considered shared even though it may
actually be unshared.
By avoiding filtering reader we're avoiding an extra check for each key, and that
may be meaningful for compaction of tons of small partitions and even range
reads of such. We do so by fixing sstable::_shared, which is now set properly for
existing sstables at start.

quick check using microbenchmark which extends perf_sstable with compaction mode:
before: 69407.61 +- 37.03 partitions / sec (30 runs, 1 concurrent ops)
after: 70161.09 +- 40.35 partitions / sec (30 runs, 1 concurrent ops)

Fixes #3042.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180504182158.21130-1-raphaelsc@scylladb.com>
2018-05-04 19:34:09 +01:00
Raphael S. Carvalho
b65bc511fe sstables/compaction_manager: log user initiated compaction
Sometimes it's hard to figure out from the log whether the user ran a major
compaction.

Fixes #1303.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180504181047.20277-1-raphaelsc@scylladb.com>
2018-05-04 19:15:58 +01:00
Duarte Nunes
7916368df8 Merge "Introduce system.large_partitions table" from Piotr
"
This series introduces a system.large_partitions table,
used to gather information on the largest partitions in the cluster.

Schema below allows easy extraction of most offending keys and removal
by sstable name, which happens when a table is compacted away.

Schema: (
  keyspace_name text,
  table_name text,
  sstable_name text,
  partition_size bigint,
  key text,
  compaction_time timestamp,
  PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);
"

Closes #3292.

* 'large_partition_table_3' of https://github.com/psarna/scylla:
  database, sstables, tests: add large_partition_handler
  db: add large_partition_handler interface with implementations
  docs: init system_keyspace entry with system.large_partitions
  db: add system.large_partitions table
2018-05-04 18:18:50 +01:00
Piotr Sarna
bc019205b3 schema: fix typos in a comment
Message-Id: <2b2a169e8a511fa9e0e1556ac7559ce9bef896e1.1525431353.git.sarna@scylladb.com>
2018-05-04 15:26:51 +01:00
Piotr Sarna
fe02c3d0e2 database, sstables, tests: add large_partition_handler
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
The proper large_partition_handler is retrievable from config information
and is based on the existing compaction_large_partition_warning_threshold_mb
entry. Right now the CQL TABLE variant of large_partition_handler is used
in the database.

Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
2018-05-04 14:38:13 +02:00
Piotr Sarna
14b3c7e7e7 db: add large_partition_handler interface with implementations
This commit introduces large_partition_handler class, which can be used
to take additional action when large partitions are written.

It comes with two implementations:
 * NOP, used in tests, which does nothing on large partition
   update/delete
 * CQL TABLE, which inserts/deletes information on particular sstable
   to system.large_partitions table, in order to be retrievable from
   cqlsh later.

References #3292
2018-05-04 12:46:31 +02:00
Piotr Sarna
3c82a8a2ff docs: init system_keyspace entry with system.large_partitions
This commit is a first step towards documenting system.* tables.
It contains information about system.large_partitions table.

References #3292
2018-05-04 12:45:40 +02:00
Piotr Sarna
02822efbc8 db: add system.large_partitions table
This commit adds a system.large_partitions table, which can be used
to track the largest partitions of a cluster.
Schema: (
  keyspace_name text,
  table_name text,
  sstable_name text,
  partition_size bigint,
  key text,
  compaction_time timestamp,
  PRIMARY KEY((keyspace_name, table_name), sstable_name, partition_size, key)
) WITH CLUSTERING ORDER BY (partition_size DESC);

References #3292
2018-05-04 12:45:40 +02:00
Raphael S. Carvalho
ce689a0807 database: avoid race condition when deleting sstable on behalf of cf truncate
After removal of the deletion manager, the caller is now responsible for
properly submitting the deletion of a shared sstable. That's because the
deletion manager was responsible for holding the deletion until all owners
agreed on it. Resharding, for example, was changed to delete the shared
sstables at the end, but truncate wasn't changed, so a race condition could
happen when deleting the same sstable on more than one shard in parallel.
Change the operation to submit a shared sstable for deletion on only one owner.

Fixes dtest migration_test.TestMigration.migrate_sstable_with_schema_change_test

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180503193427.24049-1-raphaelsc@scylladb.com>
2018-05-04 11:42:56 +01:00
Vladimir Krivopalov
8342073758 Call get_metadata_collector() instead of referencing sstable::_collector directly.
A step to untie classes sstable_writer_m and sstable so that eventually
we could stop them being friends.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
f1816d77cc Fix logic of writing TTLed cells in SSTable 3.0 format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
3e471116b4 Separate statistics for count of cells, columns and rows in column_stats.
SSTables 3.0 format makes a distinction between count of cells and count
of columns. In that sense, a column of a collection type counts as one
column but every atomic cell in it counts as a separate cell.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
fdfe79e899 Deserialize collection in a way that doesn't incur shared_ptr counter
increment and is generally shorter.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
7039dee12b Track both min & max values for timestamp, TTL and local deletion time
in metadata_collector.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Vladimir Krivopalov
8b8c9a5d10 Add class for tracking both extremum values (min and max) on updates.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Tomasz Grabiec
5e985192b2 db: Log table id and schema version on boot
Message-Id: <1524585689-12458-1-git-send-email-tgrabiec@scylladb.com>
2018-05-03 10:50:31 +03:00
Botond Dénes
5d5bc0e1ab mutation_reader_test: fix multishard-reader test with smp > 3
test_multishard_combining_reader_destroyed_with_pending_create_reader
was failing because it relied on smp == 3 and thus the shard on which
the reader creation is blocked being shard-2. Since the test requires to
be run with smp >= 3 we can hardcode this shard to be 2 because if the
test runs at all we are guaranteed to have at least smp >= 3.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <38883a1f4c18ca0cd065aa13826a4f1858353289.1525328233.git.bdenes@scylladb.com>
2018-05-03 10:30:21 +03:00
Botond Dénes
efa08f623a mutation_reader_test: add description to multishard-tests
These tests are quite complicated and require intimate knowledge of how
foreign_reader and multishard_combining_reader operates. Knowing these
two objects is still required to understand the tests, but we make that
much easier by explaining how they were designed to test what they test.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8de580131a8652924de920c2bc68a98e579398ee.1525328226.git.bdenes@scylladb.com>
2018-05-03 10:30:20 +03:00
Paweł Dziepak
bfc017daa8 tests/mutation_reader: do not capture on-stack variable by reference
'shard' is a short-lived on-stack variable that gets captured by
reference by continuation that gets executed on another shard.

Fixes a race condition that leads to an heap-use-after-free.

Message-Id: <20180502150507.2776-1-pdziepak@scylladb.com>
2018-05-02 18:07:37 +03:00
Botond Dénes
d80e586ccb mutation_reader_test: remove leftover comments
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <580dcf664fc4fc84f3a29137fba5c982f57d7601.1525269726.git.bdenes@scylladb.com>
2018-05-02 17:03:50 +03:00
Botond Dénes
e14b0ca13e mutation_reader_test: fix possible use-after-free
The test_foreign_reader_destroyed_with_pending_read_ahead test currently
doesn't ensure that the objects in it's scope are destroyed in the
correct order. This is necessary as there are severeal foreign pointers
to objects that live on remote shards and use each other. Since
foreign pointers destory their managed object in the background we
cannot rely on the to reliably destroy objects in order, nor can we be
sure when the object they manage is actually destroy.
So to work around that ensure that the puppet_reader is destroyed before
the remote_control it references even has a chance of being destroyed.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <232eaa899878b03fb2a765c2916e4f05841472a3.1525269726.git.bdenes@scylladb.com>
2018-05-02 17:03:49 +03:00
Nadav Har'El
68b5eafcc6 secondary index: test index naming
Test for Scylla's default choice of secondary index name (we found one
small problem, see issue #3403, and left it commented out). Also test
the ability to give indices non-default names.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501153439.26619-1-nyh@scylladb.com>
2018-05-02 08:12:14 +03:00
Nadav Har'El
311b25948c secondary index: test indexing of partition-key column
Add a test that adding a secondary-index for an only partition key column
is not allowed (it would be redundant), but indexing one of several partition
key columns *is* allowed. This reproduced issue #3404, and verifies that
it was fixed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-2-nyh@scylladb.com>
2018-05-02 08:11:04 +03:00
Nadav Har'El
79c6bb642f secondary index: fix indexing of partition-key column
Indexing an only partition key component is not allowed (because it would
be redundant), but it should be allowed to index one of several partition
key components. We had a bug in that case: the underlying materialized view
we created had the same column as both a partition key and a clustering
key, which resulted in an assertion failure. This patch fixes that.

Fixes #3404.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501121544.22869-1-nyh@scylladb.com>
2018-05-02 08:06:38 +03:00
Nadav Har'El
21d7507b74 secondary index: move stuff out of db/index directory
The db/index directory contains just a few lines of code that exists
there for historical reasons. It's confusing that we have both db/index
and index/ directory related to secondary-indexing.

This patch moves what little is still in db/index/ to index/. In the
future we should probably get rid of the "secondary_index" class we had
there, but for now, let's at least not have a whole new directory for it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180501101246.21143-1-nyh@scylladb.com>
2018-05-01 13:21:24 +03:00
Tomasz Grabiec
0455a19ce0 anchorless_list: Make ranges conform to SinglePassRange
They were missing const versions of iterators as well as iterator and
const_iterator member type aliases.
2018-04-30 18:45:32 +02:00
Tomasz Grabiec
9b7e49ef35 anchorless_list: Drop deprecated use of std::iterator 2018-04-30 18:45:32 +02:00
Tomasz Grabiec
aa1458377c mvcc: Fix partition_snapshot::merge_partition_versions() to not leave latest versions unmerged
Fixes a bug in partition_snapshot::merge_partition_versions(), which
would not attempt merging if the snapshot is attached to the latest
version (in which case _version is nullptr and _entry is !=
nullptr). This would cause partition_version objects to accumulate if
there was an older snapshot and it went away before the latest
snapshot. Versions will be removed when the whole entry goes away
(flush or eviction).

May have caused performance problems.

Fixes #3402.
2018-04-30 18:45:32 +02:00
Avi Kivity
25545590a4 Merge "Read-ahead related fixes for multishard readers" from Botond
"
Both multishard_combining_reader and foreign_reader use read-ahead in the
background to avoid blocking consumers. These read-aheads can be still
pending when the reader is destroyed and hence extra attention is needed
to avoid memory errors. Recent manual testing, done in the context of
testing code that is using the multishard reader, proved that these
cases were not handled correctly in the initial series introducing it
(2d126a79b).
This series introduces fixes and comprehensive tests for all problematic
scenarios:
1) multishard_combining_reader is destroyed with pending reader creation
on a remote shard.
2) foreign_reader is destroyed with pending read-ahead.
3) multishard_combining_reader is destroyed with pending read-ahead.
"

* 'multishard-reader-read-ahead-fixes/v2' of https://github.com/denesb/scylla:
  test.py: add custom seastar flags for mutation_reader_test
  test.py: move custom seastar flags for tests declarative
  mutation_reader_test: add read-ahead related multishard reader tests
  tests/mutation_reader_test: change recommended smp to 3
  mutation_reader_test: fix name of existing multishard reader tests
  simple_schema: add global_simple_schema
  simple_schema.hh: remove unused include
  multishard_combining_reader: prepare for read-ahead outliving the reader
  foreign_reader: prepare for read-ahead outliving the reader
  multishard_combining_reader: avoid creating the shard reader twice
  multishard_combining_reader: read_ahead: don't assume reader is created
  multishard_combining_reader: move read-ahead related methods
  multishard_combining_reader: avoid looking up the shard reader twice
  multishard_combining_reader: use optional for maybe created reader
2018-04-30 17:41:50 +03:00
Botond Dénes
f96084d38e test.py: add custom seastar flags for mutation_reader_test
Use -c3 if possible (if the machines has at least 3 cores).
2018-04-30 17:17:45 +03:00
Botond Dénes
52f0bb0481 test.py: move custom seastar flags for tests declarative 2018-04-30 17:17:45 +03:00
Botond Dénes
79684eff8e mutation_reader_test: add read-ahead related multishard reader tests
Add tests for foreign_reader and multishard_combining_reader that check
that readers destroyed while there is a pending read-ahead will not result
in use-after-free.
Specifically check that:
* multishard_combining_reader destroyed with pending reader creation
* foreign_reader destroyed with pending read-ahead
* multishard_combining_reader destroyed with pending read-ahead
does not result in use-after-free or SEGFAULT.

These tests try to do their best to check for correct behaviour with
various BOOST_REQUIRE* checks but they still heavily rely on ASAN to
detect any use-after-free, SEGFAULT or similar errors.
2018-04-30 17:17:45 +03:00
Botond Dénes
cb25afa8bf tests/mutation_reader_test: change recommended smp to 3
Of the test_multishard_combining_reader_reading_empty_table test.
Running this test with smp=3 instead of smp=2 helps detect additional
read-ahead related memory problems.
2018-04-30 17:17:45 +03:00
Botond Dénes
78266f11c4 mutation_reader_test: fix name of existing multishard reader tests
s/multishard_combined_reader/multishard_combining_reader/
2018-04-30 17:17:44 +03:00
Botond Dénes
783f0f09bf simple_schema: add global_simple_schema
Which allows a simple_schema instance to be transferred to another
shard. In fact a new simple_schema instance will be created on the
remote shard, but it will use the same schema instance as the original
one.
2018-04-30 17:17:44 +03:00
Botond Dénes
ed7bde99bc simple_schema.hh: remove unused include 2018-04-30 17:17:44 +03:00
Botond Dénes
04643fb223 multishard_combining_reader: prepare for read-ahead outliving the reader
When the multishard reader is destroyed there might be several pending
read-aheads running in the background. These read-aheads need their
associated reader to stay alive until after the read-ahead completes.
To solve this move the flat_mutation_reader into a struct and manage
this struct's lifetime through a shared pointer. Fibers associated with
read-aheads that might outlive the multishard reader will hold on to a
copy of the shared pointer, keeping the underlying reader alive until
they complete. To avoid doing any extra work a flag is added to this state
which is set when the multishard reader is destroyed. When this flag is
set, pending continuations will return early. All this is encapsulated
in multishard_combining_reader::shard_reader; the multishard reader code
itself need not be changed.
2018-04-30 17:16:21 +03:00
Botond Dénes
a05d398be7 foreign_reader: prepare for read-ahead outliving the reader
The foreign reader keeps track of ongoing read-aheads via a
foreign_ptr to the read-ahead's future on the remote shard. This pointer
is overwritten after each "remote call" to the remote reader with a
pointer to the new read-ahead's future.
There are several problems with the current implementation:
1) There is a new read-ahead launched after each "remote call"
  unconditionally, even if the remote reader is at EOS. This will start
  an unnecessary read-ahead when the reader is already finished and may be
  soon destroyed (legally) by the client.
2) The pointer to the remote read-ahead future is not set to nullptr
  when a remote call is issued. Thus in the destructor, where we
  attach a continuation to the read-ahead's future to extend the
  reader's lifetime until after the read-ahead finishes, we might attach
  a continuation to a future that already has one and run into a failed
  assert().

To fix these issues, reset the read-ahead pointer to nullptr each time a
remote call is issued and don't start a new read-ahead if the remote
reader is at EOS. This way we can ensure that when the reader is
destroyed we either have a valid and non-stale read-ahead future or none
at all and can reliably make a decision about whether we need to extend
the lifetime of the remote reader or not.
2018-04-30 14:34:43 +03:00
Botond Dénes
704d3d8421 multishard_combining_reader: avoid creating the shard reader twice
The multishard reader creates its shard readers on demand when they are
first attempted to be used. However at this time the reader might already
be in the process of being created, initiated by a previous read-ahead.
To avoid creating the shard reader twice, before creating the reader
check whether there are any read-aheads in progress. If there is one, it
has already created (is creating or will create) the reader, and hence we
synchronise with the read-ahead. Synchronisation happens via a promise:
the read ahead creates a promise which will be fulfilled when the reader
is created. A concurrent create_reader() call will wait on this promise
instead of attempting to create a new reader.
2018-04-30 14:34:43 +03:00
Botond Dénes
f9464cfcd7 multishard_combining_reader: read_ahead: don't assume reader is created
Currently it is assumed that when read_ahead is called the reader is
already created. Under most circumstances this will not be true. It was
blind (bad) luck that we didn't hit this before (during testing).
2018-04-30 14:34:43 +03:00
Botond Dénes
d9fceb398a multishard_combining_reader: move read-ahead related methods
To the group of methods that do not assume the reader is already
created. A patch will follow that will update read_ahead() to not assume
that the reader is created.
2018-04-30 14:34:43 +03:00
Botond Dénes
5dcfaa68f6 multishard_combining_reader: avoid looking up the shard reader twice 2018-04-30 14:34:43 +03:00
Botond Dénes
79504a7d28 multishard_combining_reader: use optional for maybe created reader
After a little "research" [1] it turns out my initial fears were
completely groundless: std::optional::operator->() and
std::optional::operator*() don't involve an unnecessary branch, and
thus there is no need to hand-roll an optional with a separate bool.

[1] http://en.cppreference.com/w/cpp/utility/optional/operator*
2018-04-30 14:34:37 +03:00
Avi Kivity
c8a6fe3044 storage_proxy: remove default_query_timeout()
No longer used.
2018-04-30 13:19:53 +03:00
Avi Kivity
d8dd7e05a7 storage_proxy: don't use default timeouts
Require all callers to supply timeouts instead of relying on defaults.

Since all callers now have the timeouts set up, they can easily supply
them.
2018-04-30 13:19:53 +03:00
Avi Kivity
7b5db486a0 query_options: augment with timeout_config
Add a timeout_config member to query_options. This lets the query
processor know what timeouts the user of this query want to apply.
2018-04-30 13:19:53 +03:00
Avi Kivity
fcea3ed722 thrift: configure thrift transport and handler with a timeout_config
Let the thrift transport server and request handler know about the
per-request-type timeouts, in preparation for actually using them.
2018-04-30 13:19:53 +03:00
Avi Kivity
f9370ab7e6 transport: configure native transport with a timeout_config
Let the native transport server know about the per-request-type
timeouts, in preparation for actually using them.
2018-04-30 13:19:53 +03:00
Avi Kivity
49fdf01b5d cql3: define and populate timeout_config_selector
Determine which timeout we need to apply at prepare time. We
don't know the numerical value (since it depends on whoever is
executing the query, not just the statement type), but we know
which member of timeout_config we need, so determine and remember
that.
2018-04-30 13:19:49 +03:00
Tomasz Grabiec
423712f1fe storage_proxy: Request schema from the coordinator in the original DC
The mutation forwarding intermediary (src_addr) may not always know
about the schema which was used by the original coordinator. I think
this may be the cause of the "Schema version ... not found" error seen
in one of the clusters which entered some pathological state:

  storage_proxy - Failed to apply mutation from 1.1.1.1#5: std::_Nested_exception<schema_version_loading_failed> (Failed to load schema version 32893223-a911-3a01-ad70-df1eb2a15db1): std::runtime_error (Schema version 32893223-a911-3a01-ad70-df1eb2a15db1 not found)


Fixes #3393.

Message-Id: <1524639030-1696-1-git-send-email-tgrabiec@scylladb.com>
2018-04-30 12:51:09 +03:00
Nadav Har'El
1bbf7ba78c secondary index: add tests for IF NOT EXISTS, IF EXISTS
Confirm that issue #2991 is indeed fixed - creating a secondary index
with IF NOT EXISTS ignores an already existing index, and dropping with
IF EXISTS ignores a non-existant index.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180430071714.10154-1-nyh@scylladb.com>
2018-04-30 10:36:50 +02:00
Nadav Har'El
6e3a53fab0 secondary index: improve testing of case-sensitive column names
The existing test_secondary_index_case_sensitive only tested the
case-sensitive case of the column being indexed, and only in some
scenarios. Further testing exposed more bugs - issue #3388, issue #3391,
issue #3401. This patch adds tests which reproduced those bugs, and now
verify their fix.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-9-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
a556b2b367 materialized views: fix test_case_sensitivity test
test_case_sensitivity from tests/view_schema_test.cc was well-intentioned,
aiming to test from different angles the issue of non-lowercase (quoted)
column names and their interaction with materialized views.

But unfortunately, it didn't test anything! This is because the quotation
marks were forgotten, so all the identifiers in this test were folded to
lowercase, and the test didn't test non-lowercase identifiers like it
intended.

So this patch adds the missing quotes, to make this test great again.

After the patches for issues #3388 and #3391 which I sent earlier, the
test *passes* (before those patches, the fixed test did not pass -
the unfixed test trivially passed).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-8-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
46d4f6f352 secondary index: fix yet another case sensitivity bug
When the secondary index code builds a "%s IS NOT NULL" clause for a
CQL statement, it needs to quote the column name when necessary (when it
contains anything other than lowercase letters, digits and _).

Fixes #3401.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-7-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
8012f231ca materialized views: fix another case-sensitivity bug
We had another case-sensitivity bug in materialized views, where if
a case-sensitive (quoted) column name was listed explicitly on "SELECT"
(instead of implicitly, e.g., in "SELECT *") the column name was
incorrectly folded to lower-case and inserts would fail.

This patch fixes the code, where a "SELECT" statement was built using
the desired column names, but column names that needed quoting were
not being quoted. The bug was in a helper function build_select_statement()
which took column name strings and failed to quote them. We clean up this
function to take column definitions instead of strings - and take care
of the quoting itself. It also needs to quote the table's name in the
select statement being built.

Fixes #3391.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-6-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
e2b2506cb1 materialized views - fix case-sensitive IS NOT NULL
Before this patch, if a materialized view is defined with the restriction
IS NOT NULL on a case-sensitive (quoted) column name, inserts fail with
a "restriction 'foobar IS NOT null' unknown column foobar" error, where
foobar is the lowercased version of the case-sensitive column name.

The problem is that the code uses single_column_relation::to_string()
to convert the relation into a CQL where clause. And indeed, this method
generates a CQL expression; But it calls column_identifier::raw::to_string()
to print identifiers. This is the wrong function - it doesn't quote
identifiers that need quoting because they are not lowercase.

So this patch uses column_identifier::raw::to_cql_string() (a method we
added in the previous patch) to generate the properly quoted CQL relation.

Fixes #3388

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-5-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
b8ee50e6b9 Implement column_identifier::raw::to_cql_string()
Implement a method column_identifier::raw::to_cql_string(). Exactly like
the one without "raw", this method quotes the identifier name as needed
for CQL. We'll need this method in a later patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-4-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
993c4441e5 column_identifier::to_cql_string() using maybe_quote()
There is no reason for to_cql_string() and maybe_quote() to both
implement the same quoting algorithm. Use the latter to implement the
former.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-3-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
f4178f9582 Fix cql3::util::maybe_quote()
The utility function maybe_quote() is supposed to quote identifier names
(name of keyspace, table, or column) according to CQL rules, e.g., if the
name has any uppercase or non-alphanumeric characters, it needs to be
quoted. Unfortunately, it didn't quite do the right thing, so this patch
fixes that. This patch also adds a comment explaining what maybe_quote()
is supposed to do (until now, users could only guess).

Fixes #3400.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-2-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Nadav Har'El
ecc85297a4 secondary index: clean up dead unquoting code
In commit d674b6f672, I fixed a case-
sensitive column name bug by avoiding CQL quoting of a column name
in create_index_statement.cc when building a "targets" option string.
However, there is also matching code in target_parser.hh to unquote
that option string. So this unquoting code is no longer necessary, and
should be dropped.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429221857.6248-1-nyh@scylladb.com>
2018-04-30 00:27:23 +02:00
Avi Kivity
b6d74b1c19 timeout_config: introduce timeout configuration
Different request types have different timeouts (for example,
read requests have shorter timeouts than truncate requests), and
also different request sources have different timeouts (for example,
an internal local query wants infinite timeout while a user query
has a user-defined timeout).

To allow for this, define two types: timeout_config represents the
timeout configuration for a source (e.g. user), while
timeout_config_selector represents the request type, and is used
to select a timeout within a timeout configuration. The latter is
implemented as a pointer-to-member.

Also introduce an infinite timeout configuration for internal
queries.
2018-04-29 19:52:40 +03:00
Nadav Har'El
a0bc0d2d11 secondary index: fix support for compound partition key
In the current code, if the base table has a compound partition key (i.e.,
multiple partition-key columns) searching its secondary indexes didn't work.
There is no real reason for this; it was just a bug in preparing the
second query:

Every SI query is converted to two queries. The first queries the associated
materialized view, to find a list of primary keys. Those we need to use in a
second query, of the base table. The second query needs to list, as
restrictions, the keys found above. When a partition key is compound, its
components build one key and one restriction. But in the buggy code, we
incorrectly used each component as a separate (improperly formatted) key
and restriction, and obviously this didn't work.

This patch also adds a test that reproduces this problem and confirms its fix.

In the fixed code I also found another incorrect use of to_cql_string() (which
could break case-sensitive primary key column names) and changed it to
to_string().

Fixes #3210.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180429124138.24406-1-nyh@scylladb.com>
2018-04-29 14:40:13 +01:00
Duarte Nunes
b1dd1876e5 gms/gossiper: Prevent duplicate processing of EchoMessage reply
We make multiple attempts to mark a node as alive. We do that by
sending an EchoMessage, and marking the node as alive upon receiving a
successful answer. In case there's a network partition and the nodes
can't reach each other, multiple messages may be delivered and
processed.

We can avoid processing duplicate EchoMessage replies by checking
whether we had already marked the node as alive.

Fixes #1184

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180428191942.31990-1-duarte@scylladb.com>
2018-04-29 14:20:01 +03:00
Avi Kivity
51b235aa7e compress: adjust HAVE_LZ4_COMPRESS_DEFAULT macro for new name
Seastar changed the name of this macro.
2018-04-29 12:57:27 +03:00
Avi Kivity
0530653da9 Merge "adapt scylla_io_setup to recent I/O Scheduler changes" from Glauber
"
Recently many changes have landed in seastar for the I/O Scheduler. We
can now describe the I/O storage of a machine by its visible properties
like throughput and bandwidth instead of relying in an indirect
calculation.

For the instances we support, we can just measure that and start using
them right away.

A version of iotune that computes those properties is not yet ready, but
in its making I have noticed that we aren't really setting the nomerges
and scheduler properties of the disks under testing. We definitely
should, since that can influence the results. So this patchset also
starts doing that.

The commandline for iotunev2 shouldn't change much. When it is ready we
will just adjust this script once more.
"

* 'scylla_io_setup' of github.com:glommer/scylla:
  scylla_io_setup: preconfigure i3 and i2 instances with new I/O scheduler properties
  scylla_lib: drop support for m3 and c3 AWS instance types
  io_setup: call blocktune before tuning I/O
  blocktune: allow it to be called as a library.
  scripts: move scylla-blocktune to scripts location
2018-04-29 11:44:06 +03:00
Avi Kivity
7161244130 Merge seastar upstream
* seastar 70aecca...ac02df7 (5):
  > Merge "Prefix preprocessor definitions" from Jesse
  > cmake: Do not enable warnings transitively
  > posix: prevent unused variable warning
  > build: Adjust DPDK options to fix compilation
  > io_scheduler: adjust property names

DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references prefixed with SEASTAR_. Some may need to become
Scylla macros.
2018-04-29 11:03:21 +03:00
Raphael S. Carvalho
043fadb15b sstables/twcs: fix setting of timestamp resolution
The iterator was incorrectly dereferenced when the timestamp resolution
was not explicitly specified.

following dtests are fixed:
compaction_additional_test.CompactionAdditionalStrategyTests_with_TimeWindowCompactionStrategy.compaction_is_started_on_boot_test
compaction_additional_test.CompactionAdditionalTest.compact_data_by_time_window_test
compaction_additional_test.CompactionAdditionalTest.compaction_removes_ttld_data_by_time_windows_test
compaction_test.TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180427192545.17440-1-raphaelsc@scylladb.com>
2018-04-29 10:44:44 +03:00
Glauber Costa
0c29289c22 scylla_io_setup: preconfigure i3 and i2 instances with new I/O scheduler properties
We can use iotunev2 (or any other I/O generator) to test for the limits
of the disks for the i2 and i3 instance classes. The values I got here
are the values I got from ~5 invocations of the (yet to be upstreamed)
iotune v2, with the IOPS numbers rounded for convenience of reading.

During the execution, I verified that the disks were saturated so we
can trust these numbers even if iotunev2 is merged in a different form.
The numbers are very consistent, unlike what we usually saw with the
first version of iotune.

Previously, we were just multiplying the concurrency number by the
number of disks. Now that we have better infrastructure, we will
manually test i3.large and i3.xlarge, since their disks are smaller
and slower.

For the other i3, and all instances in the i2 family storage scales up
by adding more disks. So we can keep multiplying the characteristics of
one known disk by the number of disks and assuming perfect scaling.

Example for i3, obtained with i3.2xlarge:

read_iops = 411k
read_bandwidth = 1.9GB/s

So for i3.16xlarge, we would have read_iops = 3.28M and 15GB/s - very
close to the numbers advertised by AWS.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
c85fbd16cb scylla_lib: drop support for m3 and c3 AWS instance types
m3 has 80GB SSDs in its largest form and I doubt anybody has ever
used it with Scylla.

I am also not aware of any c3 deployments. Since it is past generation,
it doesn't even show up in the default instance selector anymore.

I propose we drop AMI support for it. In practice, what that means is
that we won't auto-tune its I/O properties and people that want to use
it will have to run scylla_io_setup - like they do today with the EBS
instances.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
685a7c9ae6 io_setup: call blocktune before tuning I/O
We are not configuring the disks the way we want them with respect to
scheduler and nomerges. This is an oversight that became clear now that
I started rewriting iotune, since I will explicitly test for that. But
since this can affect the results, it should have been here all along.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
9eb8ea8b11 blocktune: allow it to be called as a library.
This patch makes the functions in scylla-blocktune available as a
library for other scripts - namely scylla_io_setup.

The filename, scylla-blocktune, is not the most convenient thing to call
from python so instead of just wrapping it in the usual test for
__main__ I am just splitting the file into two.

Another option would be to patch all callers to call
scylla_blocktune.py, but because we are usually not using extensions in
scripts that are meant to be called directly I decided for the split.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Glauber Costa
f837d5b1f1 scripts: move scylla-blocktune to scripts location
scylla-blocktune currently lives in the top level but this is mostly
historical. When time comes for us to install it, the packaging systems
will copy it to /usr/lib/scylla with the others.

So for consistency let's make sure that it also lives in the scripts
directory.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-04-28 09:50:07 -04:00
Vladimir Krivopalov
b3572acd6e A few improvements to encoding_stats structure.
- Use the same default epoch as Origin
- Use default value for the encoding_stats parameter in sstable::write_components()

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <846c6d2cbb97d2dd25968cb00b8557c86ff5e35c.1524854727.git.vladimir@scylladb.com>
2018-04-27 22:03:38 +03:00
Avi Kivity
2fb1bcfd13 Update scylla-ami submodule
* dist/ami/files/scylla-ami 02b1853...8a6e4dd (1):
  > ds2_configure.py: always use Ec2Snitch for single region case

Fixes #1800.
2018-04-27 21:02:27 +03:00
Vladimir Krivopalov
36fe06fd3e Make abstract_type::is_fixed_length() non-virtual.
This method is called aggressively throughout SSTable 3.0 read/write, so
we want to reasonably optimise it to not incur extra indirect calls.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <2d00ddecd112af867a30d3d6930c10165dd5af34.1524851530.git.vladimir@scylladb.com>
2018-04-27 20:57:46 +03:00
Tomasz Grabiec
b1465291cf db: schema_tables: Treat drop of scylla_tables.version as an alter
After upgrade from 1.7 to 2.0, nodes will record a per-table schema
version which matches that on 1.7 to support the rolling upgrade. Any
later schema change (after the upgrade is done) will drop this record
from affected tables so that the per-table schema version is
recalculated. If nodes perform a schema pull (they detect schema
mismatch), then the merge will affect all tables and will wipe the
per-table schema version record from all tables, even if their schema
did not change. If then only some nodes get restarted, the restarted
nodes will load tables with the new (recalculated) per-table schema
version, while not restarted nodes will still use the 1.7 per-table
schema version. Until all nodes are restarted, writes or reads between
nodes from different groups will involve a needless exchange of schema
definition.

This will manifest in logs with repeated messages indicating schema
merge with no effect, triggered by writes:

  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f
  database - Schema version changed to 85ab46cd-771d-36c9-bc37-db6d61bfa31f

The sync will be performed if the receiving shard forgets the foreign
version, which happens if it doesn't process any request referencing
it for more than 1 second.

This may impact latency of writes and reads.

The fix is to treat schema changes which drop the 1.7 per-table schema
version marker as an alter, which will switch in-memory data
structures to use the new per-table schema version immediately,
without the need for a restart.

Fixes #3394

Tests:
    - dtest: schema_test.py, schema_management_test.py
    - reproduced and validated the fix with run_upgrade_tests.sh from git@github.com:tgrabiec/scylla-dtest.git
    - unit (release)

Message-Id: <1524764211-12868-1-git-send-email-tgrabiec@scylladb.com>
2018-04-27 17:12:33 +03:00
Avi Kivity
6154ea734d Merge "Support for writing SSTables 3.0 - rows only" from Vladimir
"
This patch series introduces initial support for writing SSTables in
'mc' format (aka SSTables 3.0).

Currently, the following components are written in 3.0 format:
  - Data.db
  - Index.db
  - Summary.db
(there were no changes to summary files format compared to ka/la)
Other SSTables components are written in the old format for now as they
still need to exist to satisfy post-flush processing.

For now, only rows are written to the data file and indexed. Range
tombstones are not supported.

Writing rows is supported in full with the only exception being counter
cells. All the other features (TTLed data, row/cell level tombstones,
collections, etc) are supported.

Unit tests rely on producing files and binary-comparing them with
'golden' copies that are produced using Cassandra 3.11. This is done to
not block until reading SSTables 3.0 format is implemented.

=======================================
Implementation notes
=======================================

Internally, sstable_writer has been refactored to support multiple
implementations that are instantiated in its constructor based on the
sstable version. Little to no code is shared among sstable_writer_v2 and
sstable_writer_v3 as we only intend to support sstable_writer_v2
alongside sstable_writer_v3 for a single release (to be able to do
rollback on rolling upgrade failure) and then plan to get rid of it
entirely and switch to always writing SSTables in the new format.

The design of sstable_writer_v3 mostly follows that of its precursors
sstable_writer(_v2) and components_writer. Some refactoring and further
code rearrangements are expected in the future but the main code is
there.
"

* 'projects/sstables-30/write-rows/v2' of https://github.com/argenet/scylla:
  Add tests for writing data and index files in SSTables 3.0 ('mc') format.
  Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
  Add missing enum values to bound_kind.
  Add building blocks for writing data in SSTables 3.0 format.
  Refactor sstable_writer to support various internal implementations.
  Add is_fixed_length() to data types.
  Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
2018-04-27 17:10:31 +03:00
Piotr Jastrzebski
d839a945b4 Use goto instead of break in data_consume_rows_context_m::process_state
This way the code will be better predicted by the CPU's branch predictor.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <271333caa723e8f3ed1db4fbe6b014ebde2b5d3a.1524818584.git.piotr@scylladb.com>
2018-04-27 11:56:13 +03:00
Vladimir Krivopalov
77fdfa3e7a Add tests for writing data and index files in SSTables 3.0 ('mc') format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
15ef4ca73c Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only.
This fix adds functionality for writing data in 'mc' format to Data.db
file according to the SSTables 3.0 data format as described at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Data-File-Format
and Index.db file according to the specification at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Index-File-Format

The following cases are not supported yet:
  - writing counter cells
  - range tombstones

In Index.db, end open markers are not written since range tombstones are not
supported for data files yet.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
3ecc9e9ce4 Add missing enum values to bound_kind.
bound_kind::clustering, bound_kind::excl_end_incl_start and
bound_kind::incl_end_excl_start are used during SSTables 3.0 writing.

bound_kind::static_clustering is not used yet but added for completeness
and parity with the Origin.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
a95664be08 Add building blocks for writing data in SSTables 3.0 format.
For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
bb2bea928a Refactor sstable_writer to support various internal implementations.
This is preparatory work for supporting writing SSTables in multiple
formats.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
54bd74fda0 Add is_fixed_length() to data types.
For any given CQL data type, this member returns whether its values are
of fixed or variable length. This is used by SSTables 3.0 format to only
store the length value for variable-length cells.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 14:34:20 -07:00
Vladimir Krivopalov
ed62b9a667 Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 13:27:42 -07:00
Piotr Jastrzebski
a8154e2825 Fix use-after-free in summary parsing
Buffer received from read_exactly is referenced by
a pointer used in do_until loop but is not kept around
and is destroyed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5edd6d08ec4466fe6abd0e83b4bfb24f1f5c80fa.1524747108.git.piotr@scylladb.com>
2018-04-26 15:54:41 +03:00
Avi Kivity
5119c1e9c1 Merge "Implement reading simple table from sstable 3.x" from Piotr
"
This patchset prepares everything for supporting both the 2.x and 3.x formats, and implements reading a
very simple table, with just partition keys, from sstable 3.x.

Tests: units (release)
"

* 'haaawk/sstables3/read_only_partitions_v4' of ssh://github.com/scylladb/seastar-dev: (22 commits)
  Test for reading sstable in MC format with no columns
  Use new mp_row_consumer_m and data_consume_rows_context_m
  Introduce mp_row_consumer_m
  Rename mp_row_consumer to mp_row_consumer_k_l
  Introduce consumer_m and data_consume_rows_context_m
  Use read_short_length_bytes in RANGE_TOMBSTONE
  Use read_short_length_bytes in ATOM_START
  Use read_short_length_bytes in ROW_START
  Add continuous_data_consumer::read_short_length_bytes
  Reduce duplication with continuous_data_consumer::read_partial_int
  Add test for a simple table with just partition key
  Add test for reading index
  Extract mp_row_consumer to separate header
  Make sstable_mutation_reader independent from mp_row_consumer
  Make sstable_mutation_reader a template
  Make data_consume_context a template
  Move data_consume_rows_context from row.cc to row.hh
  Decouple sstable.hh and row.hh
  Reduce visibility of sstable::data_consume_*
  Move data_consume_context to separate header
  ...
2018-04-26 14:35:42 +03:00
Botond Dénes
b2d71ed872 install_dependencies.sh: centos: add systemd-devel
This optional dependency is needed to properly integrate with systemd.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bacd07958531e6541d5b1a4ea885f01491002a7b.1524740540.git.bdenes@scylladb.com>
2018-04-26 14:32:36 +03:00
Piotr Jastrzebski
5c223c13d6 Test for reading sstable in MC format with no columns
Just a simple table with only partition key.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
6dd7ce2582 Use new mp_row_consumer_m and data_consume_rows_context_m
When an SSTable is in MC format, use these new classes
to read it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
9ba64f65e1 Introduce mp_row_consumer_m
This is a version of mp_row_consumer that can
handle SSTables in MC format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
4aec023927 Rename mp_row_consumer to mp_row_consumer_k_l
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
2ee3d8b87b Introduce consumer_m and data_consume_rows_context_m
Those classes can handle SSTables in MC format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
b343212073 Use read_short_length_bytes in RANGE_TOMBSTONE
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
90bb7802cc Use read_short_length_bytes in ATOM_START
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
6a81a755ee Use read_short_length_bytes in ROW_START
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
06ceea9c3e Add continuous_data_consumer::read_short_length_bytes
This is a common operation so it's better to have it
implemented in a single place.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e664360730 Reduce duplication with continuous_data_consumer::read_partial_int
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9a3f93a42b Add test for a simple table with just partition key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
c6d4f49abb Add test for reading index
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
63f0b57365 Extract mp_row_consumer to separate header
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e5145b87b0 Make sstable_mutation_reader independent from mp_row_consumer
Take consumer as template parameter instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9c93f9f5f4 Make sstable_mutation_reader a template
Take DataConsumeRowsContext type as parameter.
This will allow us to implement different context
for reading 3.x files.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
9fad5831df Make data_consume_context a template
Parametrize it with the type of data consume rows context.

There will be different implementations used for different
sstable file formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
e2b393df13 Move data_consume_rows_context from row.cc to row.hh
It will be used as a template parameter for sstable_mutation_reader
once it's turned into a template. This means the definition has
to be accessible.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
0e405719e8 Decouple sstable.hh and row.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
bcf5717753 Reduce visibility of sstable::data_consume_*
They are used just in partition.cc, row.cc and sstables_test.cc,
so it is useful to reduce their scope by moving them
to data_consume_context.hh.

This will make it much easier to turn data_consume_context into
a template.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
578aa6826f Move data_consume_context to separate header
It's used only in row.cc, partition.cc and sstables_test.cc
so it's better to reduce the dependency just to those files.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
a55cec544e mp_row_consumer: stop depending on sstable_mutation_reader
Introduce mp_row_consumer_reader to cut
a cyclic dependency between them.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:37 +02:00
Piotr Jastrzebski
0efcc6b33f Fix use-after-free in estimated_histogram parsing
A pointer to buf was used in do_until but buf wasn't
kept around and was destroyed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:48:02 +02:00
Takuya ASADA
782ebcece4 dist/debian: add --jobs <njobs> option just like build_rpm.sh
On some build environments we may want to limit the number of parallel jobs:
ninja-build runs ncpus jobs by default, which may be too many since g++
consumes a lot of memory.
So support --jobs <njobs>, just like the rpm build script does.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180425205439.30053-1-syuu@scylladb.com>
2018-04-26 12:44:06 +03:00
Duarte Nunes
6f9bc28edf Merge 'Collect statistics on updates to memtables' from Vladimir
"
This patchset brings in a statistics collector that tracks minimal
values for timestamps, TTLs and local deletion times for all the updates
made to a given memtable.

These statistics are later used when flushing memtables into SSTables
using 3.x ('mc') format to delta-encode corresponding values using
collected minimums as bases (that is why it is called encoding
statistics).

This patchset is sent out apart from other changes that introduce
writing SSTables 3.x to facilitate read path implementation that also
needs the encoding_stats structure.

The tests for write path implicitly cover this functionality as any rows
written to a SSTable 3.0 file make use of delta-encoding.
"

* 'projects/sstables-30/collect-encoding-statistics-v4' of https://github.com/argenet/scylla:
  Collect encoding statistics for memtable updates.
  Factor out min_tracker and max_tracker as common helpers.
  Always pass mutation_partitions to partition_entry::apply()
2018-04-26 00:39:15 +01:00
Vladimir Krivopalov
948c4d79d3 Collect encoding statistics for memtable updates.
We keep track of all updates and store the minimal values of timestamps,
TTLs and local deletion times across all the inserted data.
These values are written as a part of serialization_header for
Statistics.db and used for delta-encoding values when writing Data.db
file in SSTables 3.0 (mc) format.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 15:39:14 -07:00
Vladimir Krivopalov
f6f99919da Factor out min_tracker and max_tracker as common helpers.
They will be re-used for collecting encoding statistics which is needed
to write SSTables 3.0.
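The idea of such a tracker can be sketched generically (a Python illustration of the concept, not the actual C++ helper; the sample timestamps are arbitrary):

```python
class MinTracker:
    """Tracks the minimum of all values it has seen so far."""
    def __init__(self):
        self.value = None

    def update(self, v):
        if self.value is None or v < self.value:
            self.value = v

# Collecting encoding statistics amounts to feeding every update's
# timestamp/TTL/deletion time into trackers like this one.
min_timestamp = MinTracker()
for ts in (1524851530, 1524747108, 1524818584):
    min_timestamp.update(ts)
```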

Part of #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 14:58:47 -07:00
Vladimir Krivopalov
e1ee833861 Always pass mutation_partitions to partition_entry::apply()
Previously it was also possible to pass a frozen_mutation to it.
Now we de-serialize frozen mutations at the calling side.

This is a pre-requisite for collecting memtable statistics needed for
writing into the SSTables 3.0 format.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 14:58:47 -07:00
Moreno Garcia
8dde91d03c docker: Create data_dir if it does not exist
When provisioning a Scylla docker image with --developer-mode 0 (disabled)
scylla_raid_setup is not invoked. As a consequence the "data" directory is not
created and scylla_io_setup fails (steps to reproduce and error message provided
at the end).

This patch adds the same verifications present in scylla_io_setup to docker's
scyllasetup.py and creates the data directory in case it is not present.
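The directory check the patch describes can be sketched like this (a hypothetical simplification, not the actual scyllasetup.py change):

```python
import os
import tempfile

def ensure_data_dir(path):
    # Create the data directory if it does not already exist,
    # mirroring the verification scylla_io_setup performs.
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

# Demonstrate on a throwaway location rather than /var/lib/scylla.
demo_dir = ensure_data_dir(os.path.join(tempfile.mkdtemp(), "data"))
```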

--

Steps to reproduce on AWS i3.2xlarge with Ubuntu 16.04:

sudo -s
apt update && apt upgrade -y && apt-get install docker.io -y

mdadm --create --verbose --force --run /dev/md0 --level=0 -c1024 --raid-devices=1 /dev/nvme0n1
mkfs.xfs /dev/md0 -f -K
mkdir /var/lib/scylla
mount -t xfs /dev/md0 /var/lib/scylla

docker run --name some-scylla \
  --volume /var/lib/scylla:/var/lib/scylla \
  -p 9042:9042 -p 7000:7000 -p 7001:7001 -p 7199:7199 \
  -p 9160:9160 -p 9180:9180 -p 10000:10000 \
  -d scylladb/scylla --overprovisioned 1 --developer-mode 0

docker logs some-scylla
  running: (['/usr/lib/scylla/scylla_dev_mode_setup', '--developer-mode', '0'],)
  running: (['/usr/lib/scylla/scylla_io_setup'],)
  terminate called after throwing an instance of 'std::system_error'
    what():  open: No such file or directory
  ERROR:root:/var/lib/scylla/data did not pass validation tests, it may not be on XFS and/or has limited disk space.
  This is a non-supported setup, and performance is expected to be very bad.
  For better performance, placing your data on XFS-formatted directories is required.
  To override this error, enable developer mode as follow:
  sudo /usr/lib/scylla/scylla_dev_mode_setup --developer-mode 1
  failed!
  Traceback (most recent call last):
    File "/docker-entrypoint.py", line 15, in <module>
      setup.io()
    File "/scyllasetup.py", line 34, in io
      self._run(['/usr/lib/scylla/scylla_io_setup'])
    File "/scyllasetup.py", line 23, in _run
      subprocess.check_call(*args, **kwargs)
    File "/usr/lib64/python3.4/subprocess.py", line 558, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['/usr/lib/scylla/scylla_io_setup']' returned non-zero exit status 1

ls -latr /var/lib/scylla
  total 4
  drwxr-xr-x 44 root root 4096 Abr 24 13:02 ..
  drwxr-xr-x  2 root root    6 Abr 24 13:10 .

Signed-off-by: Moreno Garcia <moreno@scylladb.com>
Message-Id: <20180424173729.22151-1-moreno@scylladb.com>
2018-04-25 17:48:34 +03:00
Calle Wilund
b1edf75c8b types: Make seastar::inet_address the "native" type for CQL inet.
Fixes #3187

Requires seastar "inet_address: Add constructor and conversion function
from/to IPv4"

Implements IPv6 support for CQL inet data. The actual data stored will
now vary between 4 and 16 bytes. gms::inet_address has been augmented
to interoperate with seastar::inet_address, though of course actually trying
to use an IPv6 address there or in any of its tables will throw badly.
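The 4- vs 16-byte variation can be illustrated with Python's standard ipaddress module (an analogy for the serialized widths, not Scylla code):

```python
import ipaddress

# An IPv4 inet value serializes to 4 bytes, an IPv6 one to 16 bytes.
v4 = ipaddress.ip_address("192.168.0.1").packed
v6 = ipaddress.ip_address("2001:db8::1").packed
```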

Tests assuming IPv4 were changed. Storing an ipv4_address should be
transparent, as it now "widens". However, since every IPv4 address is an
inet_address, but not vice versa, there is no implicit overloading on
the read paths; i.e. tests and system_keyspace (where we read IP
addresses from tables explicitly) are modified to use the proper type.
Message-Id: <20180424161817.26316-1-calle@scylladb.com>
2018-04-24 23:12:07 +01:00
Duarte Nunes
9111c6e49a Merge seastar upstream
* seastar 1bb44ac...70aecca (12):
  > Experimental CMake-based build system
  > inet_address: Add constructor and conversion function from/to IPv4
  > tls: Add missing includes and forward declarations to header
  > install_dependencies.sh: fix remaining centos issues
  > rpc: Add missing return when closing client socket
  > install-dependencies.sh: install g++7.3 for centos, instead of g++7.2
  > reactor: fix race beween alien queue construction and start
  > Merge "enhance the I/O Scheduler with bandwidth and throughput limits" from Glauber
  > reactor: gracefully exit if exception happens during initialization
  > build: really add alien_test
  > Merge "reactor: add alien::submit_to()" from Kefu
  > queue: do not consume from aborted queue

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-24 23:07:13 +01:00
Duarte Nunes
f5eeafe1bf tests/secondary_index_test: Add test for dropping index-backing MV
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180424140745.7144-2-duarte@scylladb.com>
2018-04-24 17:02:59 +01:00
Duarte Nunes
9146de3118 service/migration_manager: Don't drop index-backing MV
Unless dropped by the index itself, forbid dropping an index-backing
MV using `drop materialized view`.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180424140745.7144-1-duarte@scylladb.com>
2018-04-24 17:01:59 +01:00
Nadav Har'El
d674b6f672 secondary index: fix bug in indexing case-sensitive column names
CQL normally folds identifiers such as column names to lowercase. However,
if the column name is quoted, case-sensitive column names and other strange
characters can be used. We had a bug where such columns could be indexed,
but then, when trying to use the index in a SELECT statement, it was not
found.

The existing code remembered the index's column after converting it to CQL
format (adding quotes). But such conversion was unnecessary, and wrong,
because the rest of the code works with bare strings and does not involve
actual CQL statements. So the fix avoids this mistaken conversion.

This patch also includes a test to reproduce this problem.

Fixes #3154.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424154920.15924-1-nyh@scylladb.com>
2018-04-24 16:57:17 +01:00
Piotr Sarna
d323b5cddc tests: add missing case-sensitive JSON tests
This commit complements cql_query_test with case-sensitivity cases
for both SELECT JSON and INSERT JSON statements.
Message-Id: <20bc7df2ec644618727183e09f2352ca5546a9b9.1524576066.git.sarna@scylladb.com>
2018-04-24 16:30:56 +03:00
Piotr Sarna
000ce24306 cql3: solve JSON case-sensitivity issues
This commit fixes two closely related issues with handling
case-sensitive column names in JSON:
 * according to doc, case-sensitive names should be wrapped with
   additional pair of double quotes during JSON SELECT
 * logic error in parse_json() prevented INSERT JSON from working
   properly on case-sensitive column names
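The quoting rule described above can be sketched like this (a hypothetical helper illustrating the documented behaviour, not the actual cql3 code):

```python
def json_key(column_name):
    # Case-sensitive (i.e. not all-lowercase) column names get an
    # additional pair of double quotes in SELECT JSON output.
    if column_name != column_name.lower():
        return '"%s"' % column_name
    return column_name
```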

This commit is followed by updated cql_query_test, which checks
case-sensitive cases as well.
Message-Id: <82d9d5e193a656e99bc86b297c00662a6fb808a0.1524576066.git.sarna@scylladb.com>
2018-04-24 16:30:55 +03:00
Avi Kivity
13ea1a89b5 Merge "Implement loading sstables in 3.x format" from Piotr
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now serialization header from 3.x format is ignored.

Tests: units (release)
"

* 'haaawk/sstables3/loading_v4' of ssh://github.com/scylladb/seastar-dev:
  Add test for loading the whole sstable
  Add test for loading statistics
  Add support for 3_x stats metadata
  Pass sstable version to describe_type
  Pass sstable version to write methods
  metadata_type: add Serialization type
  Pass sstable_version_types to parse methods
  Add test for reading filter
  Add test for read_summary
  sstables 3.x: Add test for reading TOC
  sstable: Make component_map version dependent
  sstable::component_type: add operator<<
  Extract sstable::component_type to separate header
  Remove unused sstable::get_shared_components
  sstable_version_types: add mc version
2018-04-24 12:49:41 +03:00
Piotr Jastrzebski
6310fc5f1c Add test for loading the whole sstable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
9e78b6d4c6 Add test for loading statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
df457166b0 Add support for 3_x stats metadata
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
e1e23ec555 Pass sstable version to describe_type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
1cc1f9af5f Pass sstable version to write methods
This will allow writing different versions differently

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
08da518dae metadata_type: add Serialization type
Ignore it while reading sstable 3_x and throw
if it's present when reading 2_x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
cb84ca8abb Pass sstable_version_types to parse methods
Parsing will depend on the sstable version when
we have support for both 2_x and 3_x formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
444b468d46 Add test for reading filter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
ff06d2153c Add test for read_summary
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
10f9b06145 sstables 3.x: Add test for reading TOC
Make sure DigestCRC32 is handled correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
561ca34ec2 sstable: Make component_map version dependent
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
7aef74c55f sstable::component_type: add operator<<
Make it possible to print out component_type.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
d492e92b15 Extract sstable::component_type to separate header
It will be used in other places which won't depend on
sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:29:57 +02:00
Nadav Har'El
4af2604e76 secondary index: update test.py
I forgot that I also need to update test.py for the new test.

It's unfortunate that this script doesn't pick up the list of
tests automatically (perhaps with a black-list of tests we don't
want to run). I wonder if there are additional tests we are
forgetting to run.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424085911.29732-1-nyh@scylladb.com>
2018-04-24 12:11:38 +03:00
Nadav Har'El
9605059a2b secondary index: move tests to separate source file
Move the two tests we have for the secondary indexing feature from the
huge tests/cql_query_test.cc to a new file, secondary_index_test.cc.

Having these tests in a separate file will make it easier and faster to
write more tests for this feature, and to run these tests together.

This patch doesn't change anything in the tests' code - it's just a code
move.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180424084700.28816-1-nyh@scylladb.com>
2018-04-24 11:49:57 +03:00
Takuya ASADA
4a8ed4cc6f dist/common/scripts/scylla_raid_setup: prevent 'device or resource busy' on creating mdraid device
According to this web site, there is a possibility of a race condition between
mdraid creation and udev:
http://dev.bizo.com/2012/07/mdadm-device-or-resource-busy.html
And it looks like it can happen on our AMI, too (see #2784).

To initialize RAID safely, we should wait for udev events to finish before and
after mdadm is executed.
Fixes #2784

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1505898196-28389-1-git-send-email-syuu@scylladb.com>
2018-04-24 11:48:40 +03:00
Vladimir Krivopalov
fc644a8778 Fix Scylla to compile with older versions of JsonCpp (<= 1.7.0).
Old versions of JsonCpp declare the following typedefs for internally
used aliases:
    typedef long long int Int64;
    typedef unsigned long long int UInt64;

In newer versions (1.8.x), those are declared as:
    typedef int64_t Int64;
    typedef uint64_t UInt64;

Those base types are not identical so in cases when a type has
constructors overloaded only for specific integral types (such as
Json::Value in JsonCpp or data_value in Scylla), an attempt to
pack/unpack an integer from/to a JSON object causes ambiguous calls.

Fixes #3208

Tests: unit {release}.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e9fff9f41e0f34b15afc90b5439be03e4295623e.1524556258.git.vladimir@scylladb.com>
2018-04-24 10:58:38 +03:00
Piotr Jastrzebski
279b426ee8 Remove unused sstable::get_shared_components
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 09:45:55 +02:00
Piotr Jastrzebski
7248752698 sstable_version_types: add mc version
This is the latest version of 3.x SSTable format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 09:45:55 +02:00
Raphael S. Carvalho
11940ca39e sstables: Fix bloom filter size after resharding by properly estimating partition count
We were feeding the total estimated partition count of an input shared
sstable to the output unshared ones.

So the sstable writer thinks, *from the estimation*, that each sstable created
by resharding will have the same amount of data as the shared sstable it
is being created from. That's a problem because the estimation is fed to
bloom filter creation, which directly influences its size.
So if we're resharding all sstables that belong to all shards, the
disk usage taken by filter components will be multiplied by the number
of shards. That becomes more of a problem with #3302.

Partition count estimation for a shard S will now be done as follows:
    //
    // TE, the total estimated partition count for a shard S, is defined as
    // TE = Sum(i = 0...N) { Ei / Si }.
    //
    // where i is an input sstable that belongs to shard S,
    //       Ei is the estimated partition count for sstable i,
    //       Si is the total number of shards that own sstable i.
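Numerically, the formula above can be sketched as follows (a Python transcription of the comment, with hypothetical counts):

```python
def estimated_partitions_for_shard(input_sstables):
    # TE = sum over input sstables i of Ei / Si, where Ei is the
    # estimated partition count of sstable i and Si is the number
    # of shards that own it.
    return sum(e_i / s_i for e_i, s_i in input_sstables)

# Two sstables shared among 4 shards each, plus one unshared sstable.
te = estimated_partitions_for_shard([(8000, 4), (4000, 4), (1000, 1)])
```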

Fixes #2672.
Refs #3302.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>
2018-04-23 18:11:20 +03:00
Avi Kivity
8a8f688dbf Merge "Materialized views: Fixes to update generation" from Duarte
"
Fixes to several issues around view update generation, pertaining to
timestamp and TTL management.

Fixes #3361
Fixes #3360
Fixes #3140
Refs #3362

Tests: unit(release, debug), dtest(materialized_views.py)
"

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

* 'materialized-views/fixes-galore/v2' of http://github.com/duarten/scylla:
  mutation_partition: Clarify comment about emptiness
  tests: Add view_complex_test
  tests/view_schema_test: Complete test
  db/view: Move cells instead of copying in add_cells_to_view()
  db/view: Handle unselected base columns and corner cases
  mutation_partition: Regular base column in view determines row liveness
  db/view: Don't avoid read-before-write when view PK matches base
  db/view: Process base updates to column unselected by its views
  db/view: Consider partition tombstone when generating updates
  tests/view_schema_test: Remove unneeded test
  mutation_fragment: Allow querying if row is live
  view_info: Add view_column() overload
  view_info: Explicitly initialize base-dependent fields
  cql3/alter_table_statement: Forbid dropping columns of MV base tables
2018-04-23 16:49:29 +03:00
Nadav Har'El
1ec5688b0b Materialized Views: fix incorrect limitations on row filtering
This patch fixes several cases where it was disallowed to create
a materialized view with a filter ("where ..."), for no good reason.
After this patch, these cases will be allowed. Fixes #2367.

In ordinary SELECT queries, certain types of filtering which is known to
be deceptively inefficient is now allowed. For example, trying to query
a range of partition keys cannot be done without reading the entire
database (because the murmur3 tokenizer randomizes the order of partitions).
Restricting two partition key components also cannot be done without
reading excessive amount of the entire partition. So Scylla, following
Cassandra, chooses to disallow such SELECT queries, and give an error
message.

However, the same SELECT statements *should* be allowed when defining a
materialized view. In this case, the filter is just used to check an
individual row - not to search for one - so there is no performance
concern.

Unfortunately the existing code did these validations while building the
SELECT statement's "restrictions", in code shared by both uses of SELECT
(query and MV definition). It was easy to move one of the validations
to later code which runs after the restriction has already been built (and
knows if it is working for query or MV), but because of the way the
"restrictions" objects (translated from Cassandra 2's code) hide what they
contain, many of the checks are harder to perform after having built the
restrictions object. So instead, we add in strategic places in the
restriction-handling code a new "allow_filtering" flag. If restrictions
are built with allow_filtering=true, the extra performance-oriented tests
on the filtering restrictions are not done. Materialized view definitions
set allow_filtering=true.

The allow_filtering flag will also be useful later when we want to support
the "ALLOW FILTERING" query option which is currently not supported properly
(we have several open issues on that). However, note that this patch doesn't
complete that support: I left a FIXME in the spot where we set
allow_filtering in the Materialized Views case, but in the future we also need
to set it if the user specified "ALLOW FILTERING" in the query.

This patch also enables several unit tests written by Duarte which used to
fail because of this bug, and now pass. These tests verify that the
restrictions are now allowed and filter the view as desired; But I also
added test code to verify that the same restrictions are still forbidden,
as before, when used in ordinary SELECT queries.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Message-Id: <20180423124343.17591-1-nyh@scylladb.com>
2018-04-23 14:08:04 +01:00
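The allow_filtering flag described above can be sketched roughly as follows. The function name, parameters, and error message are hypothetical, not Scylla's actual interface:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical sketch: the expensive-restriction checks run only when
// allow_filtering is false (an ordinary SELECT); a materialized-view
// definition passes true and skips them.
void validate_partition_restrictions(bool restricts_key_range, bool allow_filtering) {
    if (restricts_key_range && !allow_filtering) {
        throw std::invalid_argument(
            "Only EQ and IN relations are supported on the partition key");
    }
}
```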
Avi Kivity
ff055a291a Merge "Improve "out-of-the-box" build experience on centos" from Botond
"
Make sure install_dependencies.sh installs all the right dependencies
and that the example `configure.py` invokation can just be copy-pasted
into the terminal and will "just work".

Ref: #3208
"

* 'fix_centos_compile/v2' of https://github.com/denesb/scylla:
  install_dependencies.sh: update centos package list and example
  configure.py: add --with-ragel option
  configure.py: add --with-antlr3
  configure.py: check compiler version first
2018-04-23 15:49:27 +03:00
Botond Dénes
bfe741c03d install_dependencies.sh: update centos package list and example
Add missing packages to `yum install` list:
* scylla-boost163-static
* scylla-python34-pyparsing20

Update the configure.py example so that it just works:
* Change g++ to 7.3
* Add --with-antlr3 pointing to antlr3 installed from scylla 3rdparty
2018-04-23 15:46:43 +03:00
Botond Dénes
1efcf215b6 configure.py: add --with-ragel option
To allow the user to select the exact ragel executable they wish to
use.
2018-04-23 15:46:43 +03:00
Botond Dénes
784be9cc43 configure.py: add --with-antlr3
To allow the user to select the exact antlr3 executable they wish to
use.
2018-04-23 15:46:43 +03:00
Botond Dénes
ea8d8f9fbf configure.py: check compiler version first
Before checking anything else (presence of boost, its version, etc.)
check that the compiler is present and can compile and link a simple c++
program.
Previously, if the compiler was not set up correctly, configure.py would fail
at one of the other try_compile checks, whichever came first (usually
the one checking for boost). This led the user into chasing some
false-positive error when in fact the compiler wasn't working.
2018-04-23 15:46:43 +03:00
Takuya ASADA
7b92c3fd3f dist: Drop AmbientCapabilities from scylla-server.service for Debian 8
Debian 8 reports "Invalid argument" when AmbientCapabilities is used in the
systemd unit file, so drop the line when we build the .deb package for Debian 8.
For other distributions, keep using the feature.

Fixes #3344

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180423102041.2138-1-syuu@scylladb.com>
2018-04-23 13:27:14 +03:00
Avi Kivity
269207fdf6 Merge "Introducing INSERT JSON and fromJson to CQL3" from Piotr
"
This series complements JSON support with INSERT JSON and fromJson
cql function.

INSERT JSON implementation tries hard to interfere as little as possible
with regular INSERT path. So, after being parsed, insertJsonStatement
exists as a separate statement and is handled in a special way.
Overridden add_update_for_key extracts values from JSON map and applies
them to columns.

Converting from insert_json_statement to insert_statement uses auxiliary
from_json_object methods to convert JSON-encoded types to bytes.
Then, terms are matched to appropriate column names and cells are
updated.

fromJson CQL function uses the same from_json_object helper methods,
but applies them to single arguments, not whole rows.

Existing json handling functions from json.hh and libjsoncpp were used
where possible.

Things implemented:
 * expanding CQL grammar to accept INSERT JSON
 * converting JSON representation of cql values to cql terms
 * serving 'INSERT INTO xxx JSON yyy' clause
 * tests for INSERT JSON and fromJson()
"

* 'json_ops_2' of https://github.com/psarna/scylla:
  tests: add cql unit tests for INSERT JSON
  cql3: add fromJson() function
  cql3: add INSERT JSON parsing to CQL grammar
  cql3: add support for INSERT JSON clause
  cql3: decouple execute from term binding in setters
  cql3: change operation::make_* functions to static
  cql3: add from_json_object function to types
  cql3: Make literals::NULL_VALUE public
2018-04-23 13:19:54 +03:00
Piotr Sarna
97e89f2efb tests: add cql unit tests for INSERT JSON
This commit adds tests for INSERT JSON clause, which is expected
to accept JSON strings and insert appropriate values to columns
defined there.
The tests also cover fromJson function calls and inserting prepared
batch statements with INSERT JSON inside.

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
cd76a01747 cql3: add fromJson() function
This function extends JSON support with fromJson() function,
which can be used in UPDATE clause to transform JSON value
into a value with proper CQL type.

fromJson() accepts strings and may return any type, so its instances,
like toJson(), are generated during calls.

This commit also extends functions::get() with an additional
'receiver' parameter. This parameter is used to extract the receiver type
information needed to generate a proper fromJson instance.
The receiver is known only during insert/update, so functions::get() also
accepts a nullptr if receiver is not known (e.g. during selection).

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
9dd34bf34d cql3: add INSERT JSON parsing to CQL grammar
This commit makes it possible to parse INSERT JSON statement
in CQL grammar, so it's available via cqlsh.

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
cdcbf654a8 cql3: add support for INSERT JSON clause
This commit adds the implementation of INSERT JSON clause
which accepts JSON object as parameter and inserts appropriate
values into appropriate columns, as defined in given JSON.

Example:
INSERT INTO testme JSON '{
  "id" : 77,
  "name" : "Jones",
  "ranking" : 8.5
}'

References #2058
2018-04-23 12:00:57 +02:00
Piotr Sarna
bfe3c20035 cql3: decouple execute from term binding in setters
This commit makes it possible to pass values to setters,
instead of having to pass cql3::term instances.
Thanks to that previously prepared terminals can be directly
used in a setter execution.

References #2058
2018-04-23 12:00:56 +02:00
Piotr Sarna
2b729a10bc cql3: change operation::make_* functions to static
This commit makes operation::make* functions static, because they
don't access any instance-specific data anyway. It is later needed
to decouple setter execution from binding a cql3::term.
2018-04-23 12:00:56 +02:00
Piotr Sarna
1d40d2186e cql3: add from_json_object function to types
This commit adds a 'from_json_object' method which will be used
for converting JSON representation of a value to raw bytes representing
the same value. This functionality will be needed by 'INSERT JSON'
clause implementation, which can turn these raw bytes into cql3::term.

References #2058
2018-04-23 12:00:56 +02:00
Piotr Sarna
e3dfa2193b cql3: Make literals::NULL_VALUE public
This commit makes NULL_VALUE public for future use in JSON translation.

References #2058
2018-04-23 12:00:56 +02:00
Botond Dénes
c34b69f4b2 Add PULL_REQUEST_TEMPLATE.md
Hopefully it will guide people wanting to contribute to the mailing
list.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <73c5d9c9884d8595b466412486494d6aa45d1d55.1524476490.git.bdenes@scylladb.com>
2018-04-23 10:45:25 +01:00
Avi Kivity
1a6b891ce2 Update scylla-ami submodule
* dist/ami/files/scylla-ami 9b4be70...02b1853 (1):
  > scylla_install_ami: remove the host id file after scylla_setup
2018-04-23 12:43:56 +03:00
Avi Kivity
b7b3d2bfec tests: continuous_data_consumer_test: increase coverage
Cover also values in the ranges 0 to 1 and 2^63 to 2^64 - 1.
Message-Id: <20180422150938.29143-2-avi@scylladb.com>
2018-04-23 11:39:06 +03:00
Avi Kivity
732177d2b0 tests: continuous_data_consumer_test: reduce runtime
continuous_data_consumer_test takes an unreasonable amount of
time to run, especially in debug mode.  Reduce the run time by
reducing the number of loops.
Message-Id: <20180422150938.29143-1-avi@scylladb.com>
2018-04-23 11:39:06 +03:00
Duarte Nunes
c8baba4e3a mutation_partition: Clarify comment about emptiness
empty() doesn't distinguish between live and dead data, so clarify
that in its comment.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
cc6c96bc92 tests: Add view_complex_test
This patch introduces view_complex_test and adds more test coverage
for materialized views.

A new file was introduced to avoid making view_schema_test slower.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
7ba1291731 tests/view_schema_test: Complete test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
844e0b41d1 db/view: Move cells instead of copying in add_cells_to_view()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
4b4d1dbd1f db/view: Handle unselected base columns and corner cases
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.

This patch ensures that unselected columns are considered as much as
possible, even though some limitations will still exist. In
particular, we need to represent multiple timestamps (from all the
unselected columns), but only have a mechanism to record a single
timestamp.

We also have some issues when dealing with selected columns, and the
way we currently delete them. Consider the following:

create table cf (p int, c int, a int, b int, primary key (p, c))
create materialized view vcf as select a, b
from cf where p is not null and c is not null
primary key (p, c)

1) update cf using timestamp 10 set a = 1 where p = 1 and c = 1
2) delete a from cf using timestamp 11 where p = 1 and c = 1
3) update cf using timestamp 1 set a = 2 where p = 1 and c = 1

After 1), the MV should include a row with row marker @ ts10,
p = 1, c = 1, a = 1. After 2), this row should be removed.

At 3), we should add a row with row marker @ ts1, p = 1, c = 1, a = 2,
with a lower timestamp. This means that the delete should not
insert a row tombstone with timestamp @ 11, as we do now, but should
just delete the view's row marker (which exists) with ts1.

Refs #3362
Fixes #3140
Fixes #3361

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
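The single-timestamp limitation described above might be sketched like this. This is purely illustrative; Scylla's actual representation of view row markers differs:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: the view row marker can record only one timestamp,
// so the timestamps of all unselected base columns are collapsed into a
// single value (here, the maximum), losing per-column information.
int64_t view_marker_timestamp(int64_t base_marker_ts,
                              const std::vector<int64_t>& unselected_cell_ts) {
    int64_t ts = base_marker_ts;
    for (int64_t t : unselected_cell_ts) {
        ts = std::max(ts, t);
    }
    return ts;
}
```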
Duarte Nunes
67dac67c46 mutation_partition: Regular base column in view determines row liveness
When views contain a primary key column that is not part of the base
table primary key, that column determines whether the row is live or
not. We need to ensure that when that cell is dead, and thus the
derived row marker, either by normal deletion of by TTL, so is the
rest of the row.

This patch introduces the idea of a shadowing row marker. We map the
status of the regular base column in the view's PK to the view row's
marker. If this marker is dead, so is that cell in the base table, and
so should the view row become. To enforce that, a view row's dead
marker shadows the whole row if that view includes a base regular
column in its PK.

Fixes #3360

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
4dfce4d369 db/view: Don't avoid read-before-write when view PK matches base
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. When calculating the view's row marker we need
to access those unselected columns, so we can't avoid the
read-before-write as we were doing.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
bd3cedd240 db/view: Process base updates to column unselected by its views
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns. So, process base updates to columns unselected by
any of its views.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
ac9b93eb89 db/view: Consider partition tombstone when generating updates
Not adding the partition tombstone to the current list of tombstones
may cause updates to be incorrectly generated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
e6467f46b7 tests/view_schema_test: Remove unneeded test
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
b0cb5480d5 mutation_fragment: Allow querying if row is live
For clustering_row and static_row, allow querying whether they are
live or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
164f043768 view_info: Add view_column() overload
For when we already have the base's column_definition.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
31370fd7b1 view_info: Explicitly initialize base-dependent fields
Instead of lazily-initializing the regular base column in the view's
PK field, explicitly initialize it. This will be used by future
patches that don't have access to the schema when wanting to obtain
that column.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
b77b71436d cql3/alter_table_statement: Forbid dropping columns of MV base tables
When a view's PK only contains the columns that form the base's PK,
then the liveness of a particular view row is determined not only by
the base row's marker, but also by the selected and, more importantly,
unselected columns.

The fact that unselected columns can keep a view row alive also
requires that users cannot drop columns of base tables with
materialized views, which this patch implements.

Refs #3362

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Avi Kivity
28be4ff5da Revert "Merge "Implement loading sstables in 3.x format" from Piotr"
This reverts commit 513479f624, reversing
changes made to 01c36556bf. It breaks
booting.

Fixes #3376.
2018-04-23 06:47:00 +03:00
Avi Kivity
513479f624 Merge "Implement loading sstables in 3.x format" from Piotr
"
Pass sstable version to parse, write and describe_type methods to make it possible to handle different versions.
For now, the serialization header from the 3.x format is ignored.

Tests: units (release)
"

* 'haaawk/sstables3/loading_v3' of ssh://github.com/scylladb/seastar-dev:
  Add test for loading the whole sstable
  Add test for loading statistics
  Add support for 3_x stats metadata
  Pass sstable version to describe_type
  Pass sstable version to write methods
  metadata_type: add Serialization type
  Pass sstable_version_types to parse methods
  Add test for reading filter
  Add test for read_summary
  sstables 3.x: Add test for reading TOC
  sstable: Make component_map version dependent
  sstable::component_type: add operator<<
  Extract sstable::component_type to separete header
  Remove unused sstable::get_shared_components
  sstable_version_types: add mc version
2018-04-22 16:18:39 +03:00
Piotr Jastrzebski
0288121c0a Add test for loading the whole sstable
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:07:03 +02:00
Piotr Jastrzebski
fbe9ee72d6 Add test for loading statistics
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:07:03 +02:00
Piotr Jastrzebski
b683870644 Add support for 3_x stats metadata
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:06:51 +02:00
Takuya ASADA
01c36556bf dist/debian: use --configfile to specify pbuilderrc
Use --configfile to specify pbuilderrc, instead of copying it to home directory.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180420024624.9661-1-syuu@scylladb.com>
2018-04-22 16:06:42 +03:00
Piotr Jastrzebski
26ab3056ae Pass sstable version to describe_type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:41:11 +02:00
Piotr Jastrzebski
0022c309ee Pass sstable version to write methods
This will allow writing different versions differently

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:41:10 +02:00
Piotr Jastrzebski
65fe564cd2 metadata_type: add Serialization type
Ignore it while reading sstable 3_x and throw
if it's present when reading 2_x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:40:04 +02:00
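The ignore-vs-throw behavior described above could look roughly like this. The enum values and function name are illustrative, not Scylla's actual ones:

```cpp
#include <stdexcept>

// Hypothetical sketch: a Serialization metadata entry is skipped when
// reading a 3.x ("mc") sstable and is an error when reading a 2.x one.
enum class sstable_version { ka, la, mc };

void handle_serialization_metadata(sstable_version v) {
    if (v == sstable_version::mc) {
        return; // present in 3.x, but ignored for now
    }
    throw std::runtime_error("Serialization metadata not expected in 2.x sstables");
}
```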
Piotr Jastrzebski
d68f3b328f Pass sstable_version_types to parse methods
Parsing will depend on the sstable version when
we have support for both 2_x and 3_x formats.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
9b448b9082 Add test for reading filter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
6bb5468ba0 Add test for read_summary
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
6c2cf40ce8 sstables 3.x: Add test for reading TOC
Make sure DigestCRC32 is handled correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
00756582ca sstable: Make component_map version dependent
Introduce sstable_version_constants that will be a proxy
serving correct constants depending on the format version.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
94fbec788e sstable::component_type: add operator<<
Make it possible to print out component_type.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:46:12 +02:00
Piotr Jastrzebski
82d483a1d3 Extract sstable::component_type to separate header
It will be used in other places which won't depend on
sstable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 13:45:29 +02:00
Avi Kivity
70220d8f85 tests: sstable_datafile_test: peel off redundant parentheses around compression_parameters initializer
The compression_parameter constructor is called with an extra level of
parentheses. Presumably this caused a temporary object to be constructed
and then moved into the argument being initialized, but gcc 8 complains
about ambiguity.

Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121854.12314-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Avi Kivity
7a141c0240 tests: network_topology_strategy_test: peel off redundant parentheses around token initializer
The token constructor is called with an extra level of parentheses. Presumably
this caused a temporary object to be constructed and then moved into the
variable being initialized, but gcc 8 complains about ambiguity.

Make it happy by stripping off the redundant parentheses.
Message-Id: <20180421121736.12136-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Avi Kivity
7c54e8559c mutation_fragment: fix concept for mutation_fragment::consume()
The parameters to the MutationFragmentConsumer concept must be concrete
types, not decltype(auto).

Reported by gcc 8.
Message-Id: <20180421110738.7574-1-avi@scylladb.com>
2018-04-21 13:53:29 +01:00
Duarte Nunes
6eeb6514f1 Merge 'Introduce "scylla active-sstables" command' from Tomasz
"
Prints info about sstables used by readers

Example:

  (gdb) scylla active-sstables
  sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
  sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
  sstable_count=2, total_index_lists_size=0
"

* 'tgrabiec/gdb-scylla-active-sstables' of github.com:tgrabiec/scylla:
  gdb: Introduce "scylla active-sstables" command
  gdb: Make list_unordered_map() more general
  gdb: Improve compatibility with python2.7
2018-04-19 19:04:59 +01:00
Tomasz Grabiec
fb126abdc5 gdb: Introduce "scylla active-sstables" command
Prints info about sstables used by readers

Example:

  (gdb) scylla active-sstables
  sstable "keyspace1"."standard1"#5, readers=3 data_file_size=39393952
  sstable "keyspace1"."standard1"#6, readers=3 data_file_size=127513304
  sstable_count=2, total_index_lists_size=0
2018-04-19 19:45:52 +02:00
Tomasz Grabiec
68dd61a0e7 gdb: Make list_unordered_map() more general
1) vt.name returns None for some types, use str() instead
 2) some unordered_maps use 'false' as the second Hash_node template parameter
 3) some consumers will prefer a reference to the value instead of its address
2018-04-19 19:06:00 +02:00
Tomasz Grabiec
309257ddda gdb: Improve compatibility with python2.7
Which is still used in some builds of GDB
2018-04-19 19:04:26 +02:00
Piotr Jastrzebski
0c96573807 Remove unused sstable::get_shared_components
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-18 10:24:57 +02:00
Piotr Jastrzebski
4f1528192f sstable_version_types: add mc version
This is the latest version of 3.x SSTable format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-18 10:24:57 +02:00
Duarte Nunes
1db6d7d6e2 cql3/functions: Add some missing functions
Fixes #3368

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417170638.12625-1-duarte@scylladb.com>
2018-04-17 21:15:14 +03:00
Duarte Nunes
17917e12ce db/view: Wait for schema agreement in background upon view building
Waiting for schema agreement in the foreground may cause the node to
not boot in useful time.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417125915.11262-1-duarte@scylladb.com>
2018-04-17 18:03:43 +03:00
Duarte Nunes
b5e7d5fa2c column_family: Make reader without going through mutation source
When doing the read before write for a materialized view update, call
make_reader directly.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180417091918.10043-1-duarte@scylladb.com>
2018-04-17 12:22:36 +03:00
Takuya ASADA
e99f43ef43 dist/debian: call lsb_release after command existence check
Fixes #3364

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523907917-13188-1-git-send-email-syuu@scylladb.com>
2018-04-17 10:54:39 +03:00
Avi Kivity
2c2175ab34 Merge "Add support for reading variant integers from SSTables" from Piotr
"
Enhance continuous_data_consumer to use the existing vint serialization for reading
variable-length integers from SSTables.

Also available at:
https://github.com/scylladb/seastar-dev/commits/haaawk/sstables3/unsigned-vint-v6

Tests: units (release)
"

* 'haaawk/sstables3/unsigned-vint-v6' of ssh://github.com/scylladb/seastar-dev:
  sstables: add test for continuous_data_consumer::read_unsigned_vint
  buffer_input_stream: make it possible to specify chunk size
  Add tests for make_limiting_data_source
  Introduce make_limiting_data_source
  sstables: add continuous_data_consumer::read_unsigned_vint
  Cover serialized_size_from_first_byte in tests
  core: add unsigned_vint::serialized_size_from_first_byte
  sstables: add all dependant headers to consumer.hh
  sstables: add all dependant headers to exceptions.hh
  core: add #pragma once to vint-serialization.hh
2018-04-17 10:09:38 +03:00
Takuya ASADA
ace44784e8 dist/debian: use ~root as HOME to place .pbuilderrc
When 'always_set_home' is specified in /etc/sudoers, pbuilder won't read
.pbuilderrc from the current user's home directory, and we have no way to
change that behavior from a sudo command parameter.

So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo is
executed; this works both in environments where always_set_home is
specified and in those where it isn't.

Fixes #3366

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
2018-04-17 09:37:16 +03:00
Takuya ASADA
5a71d4f814 dist/debian: use apt-get instead of apt
To suppress following warning, use apt-get instead of apt:
"WARNING: apt does not have a stable CLI interface. Use with caution in scripts."

Fixes #3365

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523909727-13343-1-git-send-email-syuu@scylladb.com>
2018-04-17 09:29:16 +03:00
Piotr Jastrzebski
c5dda1c0c9 sstables: add test for continuous_data_consumer::read_unsigned_vint
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:14:34 +02:00
Piotr Jastrzebski
fdad8eba97 buffer_input_stream: make it possible to specify chunk size
This will allow forcing the input stream to return its data
in chunks of a specified size.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:11:13 +02:00
Piotr Jastrzebski
4406d11095 Add tests for make_limiting_data_source
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 21:00:35 +02:00
Piotr Jastrzebski
cc6e619aa9 Introduce make_limiting_data_source
This method takes a data_source and returns another data_source
that returns data from the input source but in chunks of limited
size.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:56:30 +02:00
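The chunking behavior described above can be modeled without seastar. The real make_limiting_data_source wraps a seastar data_source; this sketch, with hypothetical names, models only the re-chunking logic:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: re-chunk an input byte sequence so that no output
// chunk exceeds `limit` bytes, preserving the data and its order.
std::vector<std::string> limit_chunks(const std::vector<std::string>& input, size_t limit) {
    std::vector<std::string> out;
    for (const auto& chunk : input) {
        for (size_t pos = 0; pos < chunk.size(); pos += limit) {
            out.push_back(chunk.substr(pos, std::min(limit, chunk.size() - pos)));
        }
    }
    return out;
}
```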
Piotr Jastrzebski
b68d1fa5bd sstables: add continuous_data_consumer::read_unsigned_vint
This allows reading unsigned variable-length integers from
SSTable format 3.x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:30:10 +02:00
Piotr Jastrzebski
4431c1bbe7 Cover serialized_size_from_first_byte in tests
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:26:44 +02:00
Piotr Jastrzebski
e423529077 core: add unsigned_vint::serialized_size_from_first_byte
This method takes the first byte and determines how many bytes
are used to represent an unsigned variable-length integer.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 20:12:03 +02:00
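The idea can be sketched under the assumption that, as in Cassandra's vint encoding, the count of leading 1 bits in the first byte gives the number of extra bytes that follow; the real signature and name may differ:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: a first byte of 0xxxxxxx means a 1-byte value; each additional
// leading 1 bit adds one more byte, up to 9 bytes total for 0xFF.
unsigned vint_size_from_first_byte(uint8_t first_byte) {
    unsigned extra = 0;
    while (extra < 8 && (first_byte & (0x80u >> extra))) {
        ++extra;
    }
    return extra + 1; // total length including the first byte
}
```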
Botond Dénes
07fb2e9c4d make_foreign_reader: don't wrap local readers
If the to-be-wrapped reader is local (lives on the same shard where
make_foreign_reader() is called) there is no need to wrap it with
foreign_reader. Just return it as is.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <886ed883b707f163603a40b56b8823f2bb6c47c6.1523873224.git.bdenes@scylladb.com>
2018-04-16 15:11:20 +03:00
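The shortcut described above amounts to a shard check before wrapping; a toy sketch with stand-in types (not the seastar/Scylla API):

```cpp
#include <cassert>

// Hypothetical sketch: only wrap the reader when it lives on a different
// shard; a local reader is returned as-is, avoiding foreign_reader overhead.
struct reader {
    unsigned owner_shard;
    bool wrapped = false;
};

reader make_foreign_reader_sketch(reader r, unsigned this_shard) {
    if (r.owner_shard == this_shard) {
        return r; // local: no foreign_reader indirection needed
    }
    r.wrapped = true; // remote: would be wrapped in a foreign_reader
    return r;
}
```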
Piotr Jastrzebski
20705c4536 sstables: add all dependant headers to consumer.hh
Previously it depended on byteorder.hh, which just happened
to be included in all compilation units that were using consumer.hh.
This change makes the header compile when used in new compilation units.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 11:02:49 +02:00
Piotr Jastrzebski
9288074d02 sstables: add all dependant headers to exceptions.hh
Previously it depended on print.hh, which just happened
to be included in all compilation units that were using
exceptions.hh. This change makes the header compile
when used in new compilation units.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-16 11:02:33 +02:00
Avi Kivity
7c01e66d53 cql3: query_processor: store and use just local shard reference of storage_proxy
Since storage_proxy provides access to the entire cluster, a local shard
reference is sufficient.  Adjust query_processor to store a reference to
just the local shard, rather than a seastar::sharded<storage_proxy> and
adjust callers.

This simplifies the code a little.
Message-Id: <20180415142656.25370-3-avi@scylladb.com>
2018-04-16 10:20:50 +02:00
Avi Kivity
f7b102238a cql3: change cql_statement methods to accept a local storage_proxy
The storage_proxy represents the entire cluster, so there's never a need
to access it on a remote shard; the local shard instance will contact
remote shard or remote nodes as needed.

Simplify the API by passing storage_proxy references instead of
seastar::sharded<storage_proxy> references. query_processor and
other callers are adjusted to call seastar::sharded::local() first.
Message-Id: <20180415142656.25370-2-avi@scylladb.com>
2018-04-16 10:18:28 +02:00
Avi Kivity
52882d1bd9 dist: debian: try harder to set the target distribution
build_deb.sh relies on pbuilder picking up a ~/.pbuilderrc which we
copy from the script. According to the pbuilder manual, "~" will refer
to the root directory (since pbuilder is run via sudo). In practice
we've observed this working with "~" referring to the current user's
home directory, but also sometimes failing, while complaining
about /root/.pbuilderrc failing. When it fails, it fails to set
the correct distribution.

To be extra sure, also copy .pbuilderrc to root's home directory. This
way, whatever behavior pbuilder chooses to follow, it will have a
configuration file to read.
Message-Id: <20180410134508.9415-1-avi@scylladb.com>
2018-04-16 10:10:47 +02:00
Avi Kivity
e0545cd2ad Merge seastar upstream
* seastar 2da7d46...1bb44ac (7):
  > doc: exclude non-API paths and symbols
  > docs: move detailed descriptions to top of page
  > doc: add default layout file
  > Merge "Misc fixes for io_tester" from Glauber
  > Merge RPC template cleanup from Gleb
  > Revert "Merge rpc template cleanup from Gleb"
  > Merge rpc template cleanup from Gleb
2018-04-15 15:48:50 +03:00
Daniel Fiala
a3533a62ba Allow /upload to be at the end of a path for sstable file
The patch fixes a bug introduced by commit
089b54f2d2.

When sstable files are stored in an .../upload directory
and a refresh is initiated with `nodetool`, it fails
because Scylla doesn't expect .../upload to be part of the path.

Fixes #3334.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180413132019.17779-1-daniel@scylladb.com>
2018-04-14 15:25:55 +03:00
Piotr Sarna
5a6fcebed6 cql3: add toJson function
This commit extends JSON support with toJson() function,
which can be used in SELECT clause to transform a single argument
to JSON form.

toJson() accepts any type including nested collection types,
so instead of being declared with concrete types,
proper toJson() instances are generated during calls.

This commit also supplements JSON CQL query tests with toJson calls.

Finally, it refactors JSON tests so they use do_with_cql_env_thread.

References #2058

Message-Id: <a7833650428e9ef590765a14e91c4d42532588f4.1523528698.git.sarna@scylladb.com>
2018-04-14 15:23:47 +03:00
Gleb Natapov
1a9aaece3e cql_server: fix a race between closing of a connection and notifier registration
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete, a stale pointer to the connection will remain in the
notification list, since the attempt to unregister the connection happens
too early. The fix is to move notifier unregistration to after the
connection's gate is closed, which ensures that there is no outstanding
registration request. But this means that a connection with a closed gate
can now be in the notifier list, so with_gate() may throw and abort the
notifier loop. Fix that by replacing with_gate() with a call to
is_closed().

Fixes: #3355
Tests: unit(release)

Message-Id: <20180412134744.GB22593@scylladb.com>
2018-04-12 16:56:50 +03:00
Raphael S. Carvalho
0c72781939 sstables/twcs: add support to millisecond timestamp resolution
That's blocking KairosDB users because it uses TWCS with millisecond
timestamp resolution.

Also older drivers use millisecond instead of the default microsecond.

Fixes #3152.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180411171244.19958-1-raphaelsc@scylladb.com>
2018-04-12 12:46:52 +03:00
Glauber Costa
98d784aba7 sstables: correctly calculate number of bits in filter
In my well intentioned attempt to use fewer magic numbers in the loading
code I replaced "64" with something calculated automatically from the
type being used.

Except I did it wrong, because sizeof(uint64_t) is 8, not 64.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411155903.27665-1-glauber@scylladb.com>
2018-04-11 19:03:30 +03:00
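The off-by-8x described above comes from conflating bytes with bits; a minimal sketch of the corrected computation (names hypothetical):

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// sizeof() yields bytes, so sizeof(uint64_t) is 8 -- the bug above used it
// directly as a bit count. Multiplying by CHAR_BIT gives the intended 64.
constexpr std::size_t bytes_per_word = sizeof(uint64_t);            // 8
constexpr std::size_t bits_per_word  = sizeof(uint64_t) * CHAR_BIT; // 64

// Hypothetical helper: number of 64-bit words needed to hold nr_bits bits.
constexpr std::size_t words_for_bits(std::size_t nr_bits) {
    return (nr_bits + bits_per_word - 1) / bits_per_word;
}
```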
Avi Kivity
dc0c458c12 Merge "First series on JSON support in CQL" from Piotr
"
This series introduces 'SELECT JSON' clause support for CQL.
Things implemented:
 * expanding CQL grammar with JSON keyword
 * converting values to JSON format
 * serving 'SELECT JSON *' clauses
 * tests for 'SELECT JSON'
"

* 'json_ops' of https://github.com/psarna/scylla:
  tests: add cql unit tests for SELECT JSON
  cql3: Add JSON token to CQL grammar
  cql3: add support for SELECT JSON clause
  cql3: add to_json_string function to types
2018-04-11 18:26:53 +03:00
Piotr Sarna
fa66e64c24 tests: add cql unit tests for SELECT JSON
This commit adds tests for SELECT JSON clause,
which is expected to return rows in JSON format.

References #2058
2018-04-11 17:12:21 +02:00
Piotr Sarna
1b6e3ccd2b cql3: Add JSON token to CQL grammar
This commit adds JSON keyword to CQL grammar and allows parsing
'SELECT JSON' command in CQL. Additionally, it will be useful
in implementing 'INSERT JSON(...)'.

References #2058
2018-04-11 17:12:21 +02:00
Piotr Sarna
15545da572 cql3: add support for SELECT JSON clause
This commit adds the implementation of SELECT JSON clause
which returns rows in JSON format. Each returned row has a single
'[json]' column.

References #2058
2018-04-11 17:12:02 +02:00
Avi Kivity
2d126a79b5 Merge "Multishard combined reader" from Botond
"
The multishard combined reader provides a convenient
flat_mutation_reader implementation that takes care of efficiently
reading a range from all shards that own data belonging to the range.
All this happens transparently, the user of the reader need only pass a
factory function to the multishard reader which it uses to create
remote readers when needed. These remote readers will then be managed
through foreign reader which abstracts away the fact that the reader is
located on a remote shard.
Sub readers are created for the entire read range, meaning they are free
to cross shard-range limits to fill their buffer. The output of these
sub readers is merged in a round-robin manner, the same way data is
distributed among shards. The multishard reader will move to the next
shard's reader whenever it encounters a partition whose token is after
the delimiter token.
To improve throughput and latency, two levels of read-ahead are employed.
One in foreign_reader, which will try to fill the remote shard reader's
buffer in the background, in parallel to processing the results on the
local shard. And one in the multishard reader itself which will
exponentially increase concurrency whenever a sub-reader's buffer
becomes empty. But only if this happened after crossing a shard
boundary. This is important because there is no point in increasing
concurrency if a single sub reader can fill the multishard readers'
buffer.
"

* 'multishard-reader/v3' of https://github.com/denesb/scylla:
  Add unit tests for multishard_combined_reader
  Add multishard_combined_reader
  flat_mutation_reader: add peek_buffer()
  Add unit tests for foreign_reader
  forwardable reader: implement fast_forward_to(position_in_partition)
  Add foreign_reader
  flat_mutation_reader: add detach_buffer()
2018-04-11 18:03:35 +03:00
Glauber Costa
c93bc6b853 sstables: don't rely on parameter evaluation order
Asias reported in issue #3351 that a floating point exception was seen
while loading SSTables. Looking at the trace, that seems to be because
we tried to issue a modulo operation with something that was likely 0.

That field comes from the nr_bits attribute in the large bitset, and our
current code should set it to whatever we read from the Filter file -
something that has been working for ages.

The difference is that after the patch that Asias identified as culprit,
we are moving the array from which we compute the size in the same
parameter list where we are computing the size.

This works for me and passed all my tests - likely because my compiler
was doing left-to-right evaluation as I would expect it to do. But the
standard doesn't guarantee that at all, and it reads:

"Order of evaluation of the operands of almost all C++ operators
(including the order of evaluation of function arguments in a
function-call expression and the order of evaluation of the
subexpressions within any expression) is unspecified. The compiler can
evaluate operands in any order, and may choose another order when the
same expression is evaluated again."

This likely fixes the bug, but even if it doesn't we should patch it,
since we currently have something that is technically an UB.

Fixes #3351.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180411144036.24748-1-glauber@scylladb.com>
2018-04-11 18:01:06 +03:00
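The class of bug described above can be sketched as follows (types and names are hypothetical, not Scylla's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct bitset_like {
    std::vector<uint64_t> words;
    std::size_t nr_bits;
};

// Buggy shape: make_bitset(std::move(v), v.size() * 64) -- when both
// expressions appear in the same function-call argument list, the standard
// leaves their relative evaluation order unspecified, so v.size() may run
// after v has already been moved from.
//
// Safe shape: compute the dependent value before moving.
bitset_like make_bitset(std::vector<uint64_t> v) {
    std::size_t bits = v.size() * 64; // evaluated before the move below
    return bitset_like{std::move(v), bits};
}
```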
Daniel Fiala
202bff0b18 database: Remember versions and formats of all temporary TOC files.
The patch fixes a bug introduced by commit 089b54f2d2.
This bug manifested when master was deployed in an attempt to populate
materialized views: the nodes restarted in the middle and were not able
to come back.

The fix is to remember formats and versions of sstables for every generation.

Fixes: #3324.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180410083114.17315-1-daniel@scylladb.com>
2018-04-11 16:47:33 +03:00
Piotr Sarna
399ab1d455 cql3: add to_json_string function to types
This commit adds a 'to_json_string' method which will be used
for converting values to JSON strings. In several cases it's not
sufficient to use 'to_string', e.g. actual strings need to be
surrounded with double quotes.

References #2058
2018-04-11 13:27:56 +02:00
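The quoting requirement above is the reason 'to_string' is not enough; a minimal sketch of the idea (hypothetical helpers, with only minimal escaping):

```cpp
#include <string>

// Hypothetical sketch: JSON text values must be quoted (and special
// characters escaped), while numbers pass through unquoted.
std::string to_json_string_text(const std::string& s) {
    std::string out = "\"";
    for (char c : s) {
        if (c == '"' || c == '\\') out += '\\'; // minimal escaping only
        out += c;
    }
    return out + "\"";
}

std::string to_json_string_int(long v) {
    return std::to_string(v); // numbers need no quoting
}
```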
Avi Kivity
4c588de70f tests: apply overprovisioned flag to all tests
Some tests escaped the --overprovisioned flag, causing them to
compete over cpu 0. Add the flag to all tests.
Message-Id: <20180410181606.8341-1-avi@scylladb.com>
2018-04-11 10:48:52 +02:00
Botond Dénes
f931b45dfa test_resources_based_cache_eviction: s/assert/BOOST_REQUIRE_*/
After moving this test into a SEASTAR_THREAD_TEST_CASE we can use the
BOOST_REQUIRE_* macros which have much better diagnostics than simple
assert()s.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d2faa5db2bc352e6a2dcf09287faed42284c3248.1523432699.git.bdenes@scylladb.com>
2018-04-11 10:55:21 +03:00
Botond Dénes
49128d12cf Move querier_cache_resource_based_eviction test into querier_cache.cc
Turns out do_with_cql_env can be used from within SEASTAR test cases so
no reason to have a separate file for a single test case.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <028a28b7d90a3bc5ed4719ce273da05880133c0e.1523432699.git.bdenes@scylladb.com>
2018-04-11 10:55:19 +03:00
Botond Dénes
ff3982a817 Add unit tests for multishard_combined_reader 2018-04-11 10:03:50 +03:00
Botond Dénes
3a6f397fd0 Add multishard_combined_reader
Takes care of reading a range from all shards that own a subrange in the
range. The read happens sequentially, reading from one shard at a time.
Under the scenes it uses combined_mutation_reader and foreign_reader,
the former providing the merging logic and the latter taking care of
transferring the output of the remote readers to the local shard.
Readers are created on-demand by a reader-selector implementation that
creates readers for yet unvisited shards as the read progresses.
The read starts with a concurrency of one, that is, the reader reads from
a single shard at a time. The concurrency is increased exponentially (to
a maximum of the number of shards) when a reader's buffer is empty after
moving to the next shard. This condition is important as we only want to
increase concurrency for sparse tables that have little data, where the
reader has to move between shards often. When concurrency is > 1, the
reader issues background read-aheads to the next shards so that by the
time it needs to move to them they have the data ready.
For dense tables (where we rarely cross shards) we rely on the
foreign_reader to issue sufficient read-aheads on its own to avoid
blocking.
2018-04-11 10:03:47 +03:00
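The ramp-up rule described above can be sketched as a small policy object (hypothetical names, not Scylla's actual code): start at a concurrency of one and double it each time a sub-reader's buffer turns out empty right after crossing a shard boundary, capped at the shard count.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch of the read-ahead concurrency policy described above.
struct readahead_policy {
    std::size_t shard_count;
    std::size_t concurrency = 1;

    // Called when a sub-reader's buffer is empty after moving to the next
    // shard -- the signal that the table is sparse and we cross shards often.
    void on_empty_buffer_after_shard_crossing() {
        concurrency = std::min(concurrency * 2, shard_count);
    }
};
```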
Botond Dénes
94140258d0 flat_mutation_reader: add peek_buffer()
Allows peeking at the next mutation fragment in the buffer. As opposed
to the existing `peek()` it assumes there's at least one fragment in the
buffer. Useful for code that already ensured that the buffer is not
empty and doesn't want to introduce a continuation (via `peek()`).
2018-04-11 09:22:49 +03:00
Botond Dénes
de4a3c8bdb Add unit tests for foreign_reader 2018-04-11 09:22:49 +03:00
Botond Dénes
50b67232e5 forwardable reader: implement fast_forward_to(position_in_partition)
Instead of throwing std::bad_function_call. Needed by the foreign_reader
unit test. Not sure how other tests didn't hit this before as the test
is using `run_mutation_source_tests()`.
2018-04-11 09:22:49 +03:00
Botond Dénes
2c0f8d0586 Add foreign_reader
Local representative of a reader located on a remote shard. Manages the
lifecycle and takes care of seamlessly transferring fragments produced
by the remote reader. Fragments are *copied* between the shards in
batches, a bufferful at a time.
To maximize throughput read-ahead is used. After each fill_buffer() or
fast_forward_to() a read-ahead (a fill_buffer() on the remote reader) is
issued. This read-ahead runs in the background and is brought back to the
foreground on the next fill_buffer() or fast_forward_to() call.
2018-04-11 09:22:45 +03:00
Botond Dénes
334efb4d70 flat_mutation_reader: add detach_buffer()
Allows detaching the internal buffer of the reader. Enables
conveniently transferring buffered fragments in a single batch, but
forces the reader to reallocate its buffer on the next
fill_buffer() call.
Introduced for foreign_reader which favours quick transferring of the
fragments between shards in a single batch, over minimizing allocations,
which can be amortized by background read-aheads.
2018-04-11 09:08:51 +03:00
Piotr Jastrzebski
190cdd27f0 core: add #pragma once to vint-serialization.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-10 20:09:40 +02:00
Raphael S. Carvalho
638a647b7d sstables/compaction_manager: do not break lcs invariant by not allowing parallel compaction for it
After change to serialize compaction on compaction weight (eff62bc61e),
LCS invariant may break because parallel compaction can start, and it's
not currently supported for LCS.

The condition is that the weight is deregistered right before the last sstable
for a leveled compaction is sealed, so it may happen that a new compaction
starts for the same column family meanwhile that will promote a sstable to
an overlapping token range.

That leads to strategy restoring invariant when it finds the overlapping,
and that means wasted resources.
The fix removes a fast-path check which is now incorrect because we
release the weight early, and also fixes a check for ongoing compaction
which prevented compaction from starting for LCS whenever the weight
tracker was not empty.

Fixes #3279.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180410034538.30486-1-raphaelsc@scylladb.com>
2018-04-10 20:02:08 +03:00
Avi Kivity
fc488adc72 logalloc: remove segment_descriptor::_lsa_managed
_lsa_managed is always 1:1 with _region, so we can remove it, saving
some space in the segment descriptor vector.

Tests: unit (release), logalloc_test (debug)
Message-Id: <20180410122606.10671-1-avi@scylladb.com>
2018-04-10 13:54:38 +01:00
Asias He
d71a94a08b gossip: Add tokens and host_id in add_saved_endpoint
Problem:

   Start node 1 2 3
   Shutdown node2
   Shutdown node1 node3
   Start node1 node3
   Try to replace_address for node 2
   The replace operation fails with the error:
     seastar - Exiting on unhandled exception: std::runtime_error
     (Cannot replace_address node2 because it doesn't exist in gossip)

This is because after all nodes shutdown, the other nodes do not have the
tokens and host_id info of node2 until node2 boots up and talks to the cluster.

If node2 cannot boot up for whatever reason, currently the only way to
recover node2 is to `nodetool removenode` and bootstrap node2 again. This will
change tokens in the cluster and cause more data movement than just replacing
node2.

To fix, we add the tokens and host_id gossip application state in add_saved_endpoint
during boot up.

This is pretty safe because the generation for application state added by
add_saved_endpoint is zero, if node2 actually boots, other nodes will update
with node2's version.

Before:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool

    {
        "addrs": "127.0.0.2",
        "generation": 0,
        "is_alive": false,
        "update_time": 1523344828953,
        "version": 0
    }

Node 2 can not be replaced.

After:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool

    {
        "addrs": "127.0.0.2",
        "application_state": [
            {
                "application_state": 12,
                "value": "31284090-2557-4036-9367-7bb4ef49c35a",
                "version": 2
            },
            {
                "application_state": 13,
                "value": "... a lot of tokens ...",
                "version": 1
            }
        ],
        "generation": 0,
        "is_alive": false,
        "update_time": 1523344828953,
        "version": 0
    }

Node 2 can be replaced.

Tests: dtest/replace_address_test.py
Fixes: #3347
Message-Id: <117fd6649939e0505847335791be8d7a96e7d273.1523346805.git.asias@scylladb.com>
2018-04-10 13:14:31 +02:00
Piotr Jastrzebski
5cd48407ad test: logalloc_test: Fix build for boost 1.63
Due to https://svn.boost.org/trac10/ticket/12778?replyto=3
BOOST_REQUIRE_NE does not work with nullptr.

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <e7158e8a235356fad99560f6fcbecb57615cefe6.1523298193.git.piotr@scylladb.com>
2018-04-10 12:50:22 +03:00
Piotr Jastrzebski
3565820526 sstables: Remove unused mp_row_consumer::skip_partition
The method is never called so we can remove it and
mp_row_consumer::_skip_partition which is set only
by mp_row_consumer::skip_partition

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <cae8b09032c58361b7cfb9d02a792cb31b186f5c.1523298605.git.piotr@scylladb.com>
2018-04-10 12:50:05 +03:00
Glauber Costa
b2f9958071 large_bitset: use a chunked_vector internally and simplify API
save and load functions for the large_bitset were introduced by Avi with
d590e327c0.

In that commit, Avi says:

"... providing iterator-based load() and save() methods.  The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks."

The only user of this interface is SSTables. And turns out we don't really
split the access like that. What we do instead is to create a chunked vector
and then pass its begin() method with position = 0 and let it write everything.

The problem here is that this requires the chunked vector to be fully
initialized, not just reserved. If the bitmap is large enough, that in itself
can take a long time without yielding (up to 16ms seen in my setup).

We can simplify things considerably by moving the large_bitset to use a
chunked vector internally: it already uses a poor man's version of it
by allocating chunks internally (it predates the chunked_vector).

By doing that, we can turn save() into a simple copy operation, and do
away with load altogether by adding a new constructor that will just
copy an existing chunked_vector.

Fixes #3341
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180409234726.28219-1-glauber@scylladb.com>
2018-04-10 10:25:06 +03:00
Paweł Dziepak
252c5dfa52 Merge "logalloc: replace zones with segment-at-a-time alloc/free" from Avi
"This patchset removes zones and replaces them with a simpler system. LSA tries
to allocate segments at higher addresses, so that we'll end up with the standard
allocator using lower addresses and LSA using higher addresses, allowing for easier
allocation from std."

* tag 'lsa-no-zones/v6' of https://github.com/avikivity/scylla:
  tests: add logalloc_test for large contiguous allocations in a challenging environment
  logalloc: limit std segment allocations in debug mode
  logalloc: introduce prime_segment_pool()
  logalloc: limit non-contiguous reclaims
  logalloc: pre-allocate all memory as lsa on startup
  tests: add random test for dynamic_bitset
  dynamic_bitset: optimize for large sets
  dynamic_bitset: get rid of resize()
  dynamic_bitset: remove find_*_clear() variants
  logalloc: reduce segment size to 128k
  logalloc: get rid of the emergency reserve stack
  logalloc: replace zones with segment-at-a-time alloc/free
2018-04-09 10:30:11 +02:00
Avi Kivity
80651e6dcc database: reduce idle memtable flush cpu shares to 1%
Commit 1671d9c433 (not on any release branch)
accidentally bumped the idle memtable flush cpu shares to 100 (representing
10%), causing flushes to be too aggressive even when they shouldn't consume
much cpu.

Fixes #3243.
Message-Id: <20180408104601.9607-1-avi@scylladb.com>
2018-04-08 17:12:14 +01:00
Avi Kivity
53d97b1da3 Merge seastar upstream
* seastar 33d8f74...2da7d46 (4):
  > http routes: Add parameters to path when adding alias
  > future: compile-time optimize futurize<void>::apply()
  > memory: remove unneeded union 'pla'
  > queue: not_empty()/not_full() should throw when called after abort
2018-04-08 16:36:45 +03:00
Piotr Jastrzebski
9ad00b8207 data_consume_rows_context: Mark RANGE_TOMBSTONE_5 as nonconsuming
This state does not read any data and is used only to perform
action when finishing to read a primitive type.

According to comment on continuous_data_consumer::non_consuming
such states should be marked as non_consuming.

Tests: units (release)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <55a5c9b76268b50312ecd044291f28dcd8179a22.1523005293.git.piotr@scylladb.com>
2018-04-08 15:16:13 +03:00
Alexys Jacob
d3d736cd87 dist: gentoo: rename prometheus node exporter package
net-analyzer/prometheus-node_exporter got moved to app-metrics/node_exporter
and the service name changed on Gentoo Linux

Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180405135605.26146-1-ultrabug@gentoo.org>
2018-04-08 14:11:38 +03:00
Avi Kivity
c3a2471c9e tests: add logalloc_test for large contiguous allocations in a challenging environment
Test large std allocations in an environment that has seen many persistent
std allocations interspersed with lsa allocations, causing memory fragmentation.
2018-04-07 21:04:10 +03:00
Avi Kivity
2c670f6161 logalloc: limit std segment allocations in debug mode
Address Sanitizer has a global limit on the number of allocations
(note: not number of allocations less number of frees, but cumulative
number of allocations). Running some tests in debug mode on a machine
with sufficient memory can break that limit.

Work around that limit by restricting the amount of memory the
debug mode segment_pool can allocate. It's also nicer for running
the test on a workstation.
2018-04-07 21:04:10 +03:00
Avi Kivity
2baa16b371 logalloc: introduce prime_segment_pool()
To segregate std and lsa allocations, we prime the segment pool
during initialization so that lsa will release lower-addressed
memory to std, rather than lsa and std competing for memory at
random addresses.

However, tests often evict all of lsa memory for their own
purposes, which defeats this priming.

Extract the functionality into a new prime_segment_pool()
function for use in tests that rely on allocation segregation.
2018-04-07 14:52:58 +03:00
Avi Kivity
ff6325ee7e logalloc: limit non-contiguous reclaims
We may fail to reclaim because a region has reclaim disabled (usually because
it is in an allocating_section). Failed reclaims can cause high CPU usage
if all of the lower addresses happen to be in a reclaim-disabled region (this
is somewhat mitigated by the fact that checking for reclaim disabled is very
cheap), but worse, failing a segment reclaim can lead to reclaimed memory
being fragmented.  This results in the original allocation continuing to fail.

To combat that, we limit the number of failed reclaims. If we reach the limit,
we fail the reclaim.  The surrounding allocating_section will release the
reclaim_lock, and increase reserves, which will result in reclaim being
retried with all regions being reclaimable, and succeed in allocating
contiguous memory.
2018-04-07 14:52:58 +03:00
Avi Kivity
c6c659ce7a logalloc: pre-allocate all memory as lsa on startup
Since lsa tries to keep some non-lsa memory as reserve, we end up
with three blocks of memory: at low addresses, non-lsa memory that was
allocated during startup or subsequently freed by lsa; at middle addresses,
lsa; and at the top addresses, memory that lsa left alone during initial
cache population due to the reserve.

After time passes, both std and lsa will allocate from the top section,
causing a mix of lsa and non-lsa memory. Since lsa tries to free from
lower addresses, this mix will stay there forever, increasing fragmentation.

Fix that by disabling the reserve during startup and allocating all of memory
for lsa. Any further allocation will then have to be satisfied by lsa first
freeing memory from the low addresses, so we will now have just two sections
of memory: low addresses for std, and top addresses for lsa.

Note that this startup allocation does not page in lsa segments, since the
segment constructor does not touch memory.
2018-04-07 14:52:58 +03:00
Avi Kivity
413bf34fbd tests: add random test for dynamic_bitset
Compare against vector<bool> as a reference.
2018-04-07 14:52:58 +03:00
Avi Kivity
ff52767ec9 dynamic_bitset: optimize for large sets
Add 1:64 summary bitmaps so that searching for set bits is O(log n)
instead of O(n).
2018-04-07 14:52:58 +03:00
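The 1:64 summary idea above can be sketched in a few lines (hypothetical types, limited to 64 words with a single summary word; `__builtin_ctzll` is a GCC/Clang builtin): one summary bit per 64-bit word says "this word has at least one set bit", so a search scans the 64x smaller summary first instead of every word.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical two-level bitset sketch of the summary-bitmap optimization.
struct two_level_bitset {
    std::vector<uint64_t> words;
    std::vector<uint64_t> summary; // bit i set <=> words[i] != 0

    explicit two_level_bitset(std::size_t nr_words)
        : words(nr_words), summary(1) {} // sketch: at most 64 words

    void set(std::size_t bit) {
        words[bit / 64] |= uint64_t(1) << (bit % 64);
        summary[0]      |= uint64_t(1) << (bit / 64);
    }

    // Returns the index of the first set bit, or SIZE_MAX when empty.
    std::size_t find_first_set() const {
        if (!summary[0]) {
            return SIZE_MAX;
        }
        std::size_t w = __builtin_ctzll(summary[0]); // first non-empty word
        return w * 64 + __builtin_ctzll(words[w]);   // first set bit in it
    }
};
```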
Avi Kivity
14510ae986 dynamic_bitset: get rid of resize()
Makes it easier to modify later on. Maybe "dynamic" is not so justified now.
2018-04-07 14:52:58 +03:00
Avi Kivity
f219ae1275 dynamic_bitset: remove find_*_clear() variants
They are no longer used, and cannot be efficiently implemented
for large bitsets using a summary vector approach without slowing
down the find_*_set() variants, which are used.

Also remove find_previous_set() for the same reason.
2018-04-07 14:52:58 +03:00
Avi Kivity
54db0f3d30 logalloc: reduce segment size to 128k
Reducing the segment size reduces the time needed to compact segments,
and increases the number of segments that can be compacted (and so
the probability of finding low-occupancy segments).

128k is the size of I/O buffers and of thread stacks, so we can't
go lower than that without more significant changes.
2018-04-07 14:52:58 +03:00
Avi Kivity
3f17dbfcbc logalloc: get rid of the emergency reserve stack
Instead of keeping specific segments in the emergency reserve,
just keep the number of segments in the reserve. This simplifies the
code considerably.
2018-04-07 14:52:55 +03:00
Avi Kivity
fa73d844e9 logalloc: replace zones with segment-at-a-time alloc/free
This patch replaces the zones mechanism with something simpler: a
single segment is moved from the standard allocator to lsa and vice
versa, at a time. Fragmentation resistance is (hopefully) achieved
by having lsa prefer high addresses for lsa data, and return segments
at low address to the standard allocator. Over time, the two will move
apart.

Moving just one segment at a time reduces the latency costs of
transferring memory between free and std.
2018-04-07 13:48:40 +03:00
Piotr Sarna
a5b6047ffa cql3: add row-wise read statistics
Database read metrics are now extended with the total number of rows read,
exported through the cql_rows_read field.

Closes #3146
Message-Id: <02f0816c509f3d7fea06da22869eea61548284e2.1522919708.git.sarna@scylladb.com>
2018-04-05 13:39:08 +03:00
Paweł Dziepak
67aaaefde7 Merge "api: type-erase more of the column_family API" from Avi
"Together with the already merged patch, we reduce the object file
from 114MB to 81MB."

* tag 'api-diet-1/v1' of https://github.com/avikivity/scylla:
  api: type-erase all-column_family map_reduce variant
  api: simplify 6-argument map_reduce_cf() variant
2018-04-05 11:07:17 +02:00
Botond Dénes
3c078d2554 forwardable reader: pass down timeout in fast_forward_to()
The `const dht::partition_range&` overload to be more precise. The
timeout wasn't passed to the underlying reader. Spotted during test
debugging.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <39c02a55196d923bd0af8e6be6f0baa578cba070.1522915463.git.bdenes@scylladb.com>
2018-04-05 11:43:21 +03:00
Avi Kivity
1fa8682412 Merge seastar upstream
* seastar 7328d17...33d8f74 (3):
  > memory: switch to buddy allocation
  > tls: Ensure we always pass through semaphores on shutdown
  > memory: replace placement-new in unions with member construction

See scylladb/seastar#426.
2018-04-05 11:12:30 +03:00
Raphael S. Carvalho
30b6c9b4cd database: make sure sstable is also forwarded to shard responsible for its generation
After f59f423f3c, sstable is loaded only at shards
that own it so as to reduce the sstable load overhead.

The problem is that a sstable may no longer be forwarded to a shard that needs to
be aware of its existence which would result in that sstable generation being
reallocated for a write request.
That would result in a failure as follow:
"SSTable write failed due to existence of TOC file for generation..."

This can be fixed by forwarding any sstable at load to all its owner shards
*and* the shard responsible for its generation, which is determined as follow:
s = generation % smp::count

Fixes #3273.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180405035245.30194-1-raphaelsc@scylladb.com>
2018-04-05 10:58:05 +03:00
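The rule quoted above is a simple modulo mapping; a sketch (here `smp_count` stands in for Scylla's `smp::count`):

```cpp
#include <cstdint>

// Sketch of the rule from the commit message: the shard responsible for an
// sstable generation is generation % smp::count.
constexpr unsigned shard_for_generation(int64_t generation, unsigned smp_count) {
    return static_cast<unsigned>(generation % smp_count);
}
```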
Tzach Livyatan
58e47fa0b3 docs/docker: Fix and add links to Scylla docs
- Fix link for reporting a Scylla problem
- Add a link to Best Practices for Running Scylla on Docker

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20180404065129.16776-1-tzach@scylladb.com>
2018-04-04 10:52:04 +03:00
Piotr Sarna
ae3265f905 cql_server: use handle_exception for failed accepts
Follows up "cql_server: replace recursion in do_accepts with repeat".
Failed accepts are now handled with handle_exception routine
instead of generic then_wrapped.
Message-Id: <db820a674100ae57f3acc7b49ebae57d0c2bdbb8.1522785444.git.sarna@scylladb.com>
2018-04-03 21:34:46 +01:00
Piotr Sarna
b298bb2f7a cql_server: replace recursion in do_accepts with repeat
Recursion in do_accepts function is now replaced with
repeat utility.

Fixes #2467

Message-Id: <07d6da60726fc3ecc06139309b9716180e8accf7.1522777060.git.sarna@scylladb.com>
2018-04-03 21:23:11 +03:00
Avi Kivity
9cef37e643 Merge "db/view: View building fixes" from Duarte
"
Fixes to the view building process, discovered from field experience.

Tests: dtest(materialized_view_tests.py, smp=2)
"

* 'views/view-build-fixes/v1' of https://github.com/duarten/scylla:
  db/view: Start view building after schema agreement
  db/system_keyspace: scylla_views_builds_in_progress writes are user mem
  db/view: Require configuration option to enable view building
2018-04-03 17:42:21 +03:00
Duarte Nunes
b84bbfc51d tests/view_schema_test: Test empty partition key entries are rejected
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-2-duarte@scylladb.com>
2018-04-03 15:25:53 +03:00
Duarte Nunes
ec8960df45 db/view: Reject view entries with non-composite, empty partition key
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is: a base table regular column, a
base table clustering key column, or a base table partition key column,
part of a composite key.

Fixes #3262
Refs CASSANDRA-14345

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
2018-04-03 15:25:52 +03:00
Duarte Nunes
d4db043f03 db/view: Start view building after schema agreement
If a base table or view has been dropped in one node, but another
one hasn't yet learned about it, it starts the view build process
immediately on boot, possibly calculating unneeded view updates and
causing errors at the view replica, if that replica has already
processed the schema changes. We should thus wait for schema
agreement, even if the node is a seed.

Fixes #3328

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Duarte Nunes
75bb66a50d db/system_keyspace: scylla_views_builds_in_progress writes are user mem
Treat writes to scylla_views_builds_in_progress as user memory, as the
number of writes is dependent on the amount of user data on views
(times the number of views, divided by the view building batch size).

Fixes #3325

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Duarte Nunes
bf5045c7eb db/view: Require configuration option to enable view building
View building, enabled by default, can contain or expose issues that
prevent the node from starting. In those cases, it is necessary to
disable view building such that the node can be submitted to
maintenance operations.

Fixes #3329

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-03 13:16:28 +01:00
Avi Kivity
6c35db2c44 api: type-erase all-column_family map_reduce variant
Encapsulate the map_reduce parameters in type-erased
std::function, as well as the iterator-on-all-column-families
logic. Reduces binary size by 18%.
2018-04-03 13:08:22 +03:00
Avi Kivity
0ade558999 api: simplify 6-argument map_reduce_cf() variant
The 6-argument map_reduce_cf function is identical to the 5-argument
version, except that it performs an extra cast (by calling
the 6th argument's operator=()).

Simplify the code by calling the 5-argument version from the 6-argument
version.

Reduces binary size by ~10%.
2018-04-03 12:22:14 +03:00
Duarte Nunes
11ece46f14 db/view: Remove leftover debug statement
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180402175238.5528-1-duarte@scylladb.com>
2018-04-03 09:41:33 +01:00
Avi Kivity
cadd983856 api: type-erase map_reduce_cf()
map_reduce_cf() is called with varying template parameters which each
have to be compiled separately. Unifying the internals to use types based
on std::any reduced the object size by 15% (115MB->99MB) with presumably
a commensurate decrease in compile time.

A version that used "I" instead of "std::any" (and thus merged the
internals only for callers that used the same result type) delivered
a 10% decrease in object size.  While std::any is less safe, in this
case it is completely encapsulated.
Message-Id: <20180402213732.432-1-avi@scylladb.com>
2018-04-03 09:31:04 +01:00
Avi Kivity
ffcdcd6d16 tests: logalloc_test: relax test_large_allocation
test_large_allocation attempts to allocate almost half of memory.
With a buddy allocator, even if more than half of memory is free,
and even if it is contiguous, it is unlikely to be available as a
single allocation because the allocator inserts boundaries at powers-
of-two addresses.

Relax the test by allocating smaller chunks (but still the same amount,
and still with challenging sizes); allocating half of memory contiguously
is not a goal.

Also use a vector instead of a deque, and reserve it, so we don't get
intervening non-lsa allocations. I'm not sure there's a problem there
but let's not depend on the allocation patterns.
Message-Id: <20180401150828.13921-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
7ab52947dc conf: define named_value<log_level> externally
While building with -O1, I saw that the linker could not find
the vtable for named_value<log_level>. Rather than fixing up the
includes (and likely lengthening build time), fix by defining
the class as an extern template, preventing it from being
instantiated at the call site.
Message-Id: <20180401150235.13451-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
3964fd0be2 client_state: initialize _remote_addr for internal queries
-O1 complains that client_state::_remote_addr is not initialized
(and it is right). The call site is tracing, which likely won't be
invoked for internal queries, but still.
Message-Id: <20180401150410.13651-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Avi Kivity
2edf36f863 bytes: don't allocate NUL terminator
Since bytes is used to encapsulate blobs, not strings, there's no
need for a NUL terminator. It will never be passed to a function
that expects a C string.
Message-Id: <20180401151009.14108-1-avi@scylladb.com>
2018-04-02 19:23:06 +01:00
Duarte Nunes
abe8bbe7b5 Merge seastar upstream
* seastar a66cc34...7328d17 (5):
  > sstring: add support for non-nul-terminated sstrings
  > core/sharded: Make async_sharded_service dtor virtual
  > reactor: pass naked pointer to submit_io
  > Merge http: "Add alias support to the API" from Amnon
  > systemwide_memory_barrier: use madvise(MADV_DONTNEED) instead of mprotect()

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-02 19:23:06 +01:00
Glauber Costa
ef84780c27 docker: default docker to overprovisioned mode.
Currently, overprovisioned mode is not enabled on docker unless it is
explicitly set. I have come to believe that this is a mistake.

If the user is running alone in the machine, and there are no other
processes pinned anywhere - including interrupts - not running
overprovisioned is the best choice.

But everywhere else, it is not: even if a user runs 2 docker containers
in the same machine and statically partitions CPUs with --smp (but
without cpuset) the docker containers will pin themselves to the same
sets of CPU, as they are totally unaware of each other.

It is also very common, especially in some virtualized environments, for
interrupts not to be properly distributed - they are particularly keen on
being delivered to CPU0, a CPU which Scylla will pin by default.

Lastly, environments like Kubernetes simply don't support pinning at the
moment.

This patch enables the overprovisioned flag if it is explicitly set -
like we did before - but also by default unless --cpuset is set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180331142131.842-1-glauber@scylladb.com>
2018-04-01 09:17:20 +03:00
Takuya ASADA
95129c4b12 dist/ami: point to wiki page for variables.json
Since there's no documentation for build_ami.sh in this repo, point to the wiki page.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521710239-9687-1-git-send-email-syuu@scylladb.com>
2018-03-29 18:54:42 +03:00
Glauber Costa
a9ef72537f parse and ignore background writer controller
Unused options are not exposed as command line options and will prevent
Scylla from booting when present, although they can still be passed via
YAML, for Cassandra compatibility.

That has never been a problem, but we have been adding options to i3
(and others) that are now deprecated, but were previously marked as
Used. Systems with those options may have issues upgrading.

While this problem is common to all Unused options, the likelihood for
any other unused option to appear in the command line is near zero,
except for those two - since we put them there ourselves.

There are two ways to handle this issue:

1) Mark them as Used, and just ignore them.
2) Add them explicitly to boost program options, and then ignore them.

The second option is preferred here, because we can add them as hidden
options in program_options, meaning they won't show up in the help. We
can then just print a discreet message saying that those options are,
from now on, ignored.

v2: mark set as const (Botond)
v3: rebase on top of master, indentation suggested by Duarte.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180329145517.8462-1-glauber@scylladb.com>
2018-03-29 17:57:30 +03:00
Avi Kivity
c9aa9f0d86 Revert "logalloc: capture current scheduling group for deferring function"
This reverts commit 3b53f922a3. It's broken
in two ways:

 1. concrete_allocating_function::allocate()'s caller,
    region_group::start_releaser() loop, will delete the object
    as soon as it returns; however we scheduled some work depending
    on `this` in a separate continuation (via with_scheduling_group())
 2. the calling loop's termination condition depends on the work being
    done immediately, not later.
2018-03-29 16:08:12 +03:00
Vladimir Krivopalov
3a9cb54c76 Merge the pair of index_readers into just one tracking a range.
Historically, we had two index_readers per sstable_mutation_reader,
one for the lower bound and one for the upper bound. Most of the public
members of the index_reader class were only called on one of the two.
With the changes introduced in #2981, two readers are even more tied
together as they now have a shared-per-pair list of index pages that
needs proper cleanup and was protruding woefully into the caller code.

This fix re-structures index_reader so that it now keeps track of both
lower and upper bounds. The shared_index_lists structure is encapsulated
within index_reader and becomes an internal detail rather than a
liability.

Fixes #3220.

Tests: unit (debug, release)
+
Tested using cassandra-stress commands from #3189.

perf_fast_forward results indicate there is no performance degradation
caused by this fix.

=========================== Baseline ===================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.494458   1000000    2022418   1018     126960      27       0        0        0        0        0        0        0  97.6%
1       1         1.754717    500000     284946    997     127064       6       0        0        3        3        0        0        0  99.9%
1       8         0.551664    111112     201413    997     127064       6       0        0        3        3        0        0        0  99.7%
1       16        0.383888     58824     153232   1001     127080      10       0        0        5        5        0        0        0  99.5%
1       32        0.289073     30304     104832    997     127064      28       0        0        3        3        0        0        0  99.3%
1       64        0.236963     15385      64926    997     127064     122       0        0        3        3        0        0        0  99.2%
1       256       0.172901      3892      22510    997     127064     217       0        0        3        3        0        0        0  95.5%
1       1024      0.117570       976       8301    997     127064     235       0        0        3        3        0        0        0  49.0%
1       4096      0.085811       245       2855    664      27172     375     274        0        3        3        0        0        0  21.4%
64      1         0.512781    984616    1920149   1142     127064     139       0        0        3        3        0        0        0  98.7%
64      8         0.479232    888896    1854833   1001     127080      10       0        0        5        5        0        0        0  99.6%
64      16        0.451193    800000    1773078    997     127064       6       0        0        3        3        0        0        0  99.6%
64      32        0.408684    666688    1631305    997     127064       6       0        0        3        3        0        0        0  99.5%
64      64        0.351906    500032    1420924    997     127064      14       0        0        3        3        0        0        0  99.5%
64      256       0.227008    200000     881026    997     127064     211       0        0        3        3        0        0        0  99.1%
64      1024      0.125803     58880     468032    997     127064     290       0        0        3        3        0        0        0  65.1%
64      4096      0.098155     15424     157139    703      27856     401     267        0        3        3        0        0        0  25.8%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000701         1       1427      9        296       6       4        0        3        3        0        0        0  12.4%
0       32        0.000698        32      45827      9        296       6       3        0        3        3        0        0        0  13.9%
0       256       0.000808       256     316920     10        328       6       3        0        3        3        0        0        0  24.9%
0       4096      0.004368      4096     937697     25        808      14       3        0        3        3        0        0        0  45.9%
500000  1         0.001196         1        836     13        412       9       4        0        3        3        0        0        0  22.7%
500000  32        0.001200        32      26664     13        412       9       4        0        3        3        0        0        0  22.2%
500000  256       0.001503       256     170338     14        444      10       4        0        3        3        0        0        0  25.3%
500000  4096      0.004351      4096     941465     30        956      20       4        0        3        3        0        0        0  50.7%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000625         1       1601      7        176       6       0        0        3        3        0        0        0  23.2%
0       32        0.000604        32      53016      7        176       6       0        0        3        3        0        0        0  24.7%
0       256       0.000695       256     368498      8        180       6       0        0        3        3        0        0        0  36.4%
0       4096      0.004083      4096    1003106     20        692      12       1        0        3        3        0        0        0  47.0%
500000  1         0.001198         1        835     12        516       9       3        0        3        3        0        0        0  22.8%
500000  32        0.000981        32      32631     12        388       9       3        0        3        3        0        0        0  29.2%
500000  256       0.001320       256     194011     13        384      10       3        0        3        3        0        0        0  29.0%
500000  4096      0.003944      4096    1038567     25        840      17       2        0        3        3        0        0        0  52.2%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000849         1       1178      9        488       6       0        0        3        3        0        0        0  16.5%
0       32        0.000661        32      48415      9        296       6       0        0        3        3        0        0        0  22.2%
0       256       0.000756       256     338648     10        328       6       0        0        3        3        0        0        0  33.3%
0       4096      0.004147      4096     987610     22        840      12       1        0        3        3        0        0        0  47.9%
500000  1         0.001041         1        960     13        476       9       3        0        3        3        0        0        0  25.9%
500000  32        0.001020        32      31375     13        412       9       3        0        3        3        0        0        0  29.1%
500000  256       0.001265       256     202373     14        444      10       3        0        3        3        0        0        0  32.0%
500000  4096      0.004121      4096     994014     30        988      18       3        0        3        3        0        0        0  52.7%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000668         1       1498      9        296       6       4        0        3        3        0        0        0  19.8%
500000  2         0.000976         2       2048     13        412       9       4        0        3        3        0        0        0  29.0%
250000  4         0.001408         4       2842     18        572      12       6        0        3        3        0        0        0  28.8%
125000  8         0.002004         8       3993     29        912      19      10        0        3        3        0        0        0  34.0%
62500   16        0.002883        16       5551     50       1584      32      18        0        3        3        0        0        0  41.9%
2       500000    1.053215    500000     474737   1138     127080     120       0        0        5        5        0        0        0  99.7%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.002717         2        736     24       2684       8      16        0        3        3        0        0        0  19.7%
no        0.001004         2       1992     13        412       8       2        0        3        3        0        0        0  30.2%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.466523   1000000     681885   1369     139732      33       1        0        0        0        0        0        0  99.7%
-> 1       1        12.792183    500000      39086   6235     177736    5155       0        0     5123     7663        0        0        0  96.4%
-> 1       8         3.451431    111112      32193   6235     177736    5155       0        0     5123     9673        0        0        0  84.8%
-> 1       16        2.223815     58824      26452   6234     177704    5154       0        0     5122     9965        0        0        0  75.0%
-> 1       32        1.512511     30304      20036   6233     177680    5155       1        0     5123    10090        0        0        0  61.8%
-> 1       64        1.129465     15385      13621   6227     177464    5154       0        0     5122    10159        0        0        0  49.5%
-> 1       256       0.733282      3892       5308   6211     175464    5178      24        0     5122    10220        0        0        0  33.8%
-> 1       1024      0.397302       976       2457   5946     142152    5369     217        0     5120    10235        0        0        0  32.1%
-> 1       4096      0.187746       245       1305   5499      81992    5296     142        0     5122    10240        0        0        0  46.8%
-> 64      1         2.428488    984616     405444   7332     177736    5155      25        0     5123     5208        0        0        0  79.9%
-> 64      8         2.262876    888896     392817   6235     177736    5155       0        0     5123     5654        0        0        0  78.1%
-> 64      16        2.137544    800000     374261   6234     177732    5154       0        0     5122     6110        0        0        0  77.1%
-> 64      32        1.862466    666688     357960   6235     177736    5155       0        0     5123     6844        0        0        0  73.7%
-> 64      64        1.547757    500032     323069   6234     177728    5155       0        0     5123     7651        0        0        0  68.7%
-> 64      256       0.914612    200000     218672   6233     177704    5154       0        0     5122     9202        0        0        0  55.5%
-> 64      1024      0.475472     58880     123835   6229     177492    5154       5        0     5122     9930        0        0        0  45.4%
-> 64      4096      0.271239     15424      56865   6158     169480    5257     114        0     5115    10142        0        0        0  44.1%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.003209         1        312      3        260       2       7        0        1        1        0        0        0  15.5%
0       32        0.004205        32       7610     16       1428      10       0        0        5        5        0        0        0  15.7%
0       256       0.009830       256      26042     97       8572      62       0        0       31       31        0        0        0  18.7%
0       4096      0.015471      4096     264748    100       8704      64       0        0       32       32        0        0        0  48.4%
500000  1         0.003654         1        274     34        492      33       0        0       32       64        0        0        0  28.7%
500000  32        0.004287        32       7464     40       1260      36       0        0       32       64        0        0        0  26.0%
500000  256       0.009598       256      26673    100       8748      64       4        0       32       64        0        0        0  20.6%
500000  4096      0.014151      4096     289449    119       7892      85       0        0       53       64        0        0        0  54.1%

========================  With the patch ================================
running: large-partition-skips
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1       0         0.468887   1000000    2132711   1018     126960      29       0        0        0        0        0        0        0  98.4%
1       1         1.735113    500000     288166   1001     127080      10       0        0        5        5        0        0        0  99.9%
1       8         0.535616    111112     207447    997     127064       6       0        0        3        3        0        0        0  99.6%
1       16        0.365487     58824     160947   1001     127080      15       0        0        5        5        0        0        0  99.5%
1       32        0.272208     30304     111326    997     127064      21       0        0        3        3        0        0        0  99.3%
1       64        0.224049     15385      68668    997     127064     208       0        0        3        3        0        0        0  99.1%
1       256       0.159247      3892      24440    997     127064     250       0        0        3        3        0        0        0  94.7%
1       1024      0.102107       976       9559    997     127064     292       0        0        3        3        0        0        0  53.6%
1       4096      0.084310       245       2906    664      27172     371     273        0        3        3        0        0        0  20.2%
64      1         0.508340    984616    1936923   1142     127064     129       0        0        3        3        0        0        0  98.1%
64      8         0.470369    888896    1889786    997     127064       6       0        0        3        3        0        0        0  99.6%
64      16        0.439917    800000    1818526   1001     127080      10       0        0        5        5        0        0        0  99.6%
64      32        0.397938    666688    1675358    997     127064       6       0        0        3        3        0        0        0  99.5%
64      64        0.344144    500032    1452972    997     127064      18       0        0        3        3        0        0        0  99.4%
64      256       0.219996    200000     909107    997     127064     251       0        0        3        3        0        0        0  99.1%
64      1024      0.124294     58880     473715    997     127064     284       1        0        3        3        0        0        0  62.2%
64      4096      0.097580     15424     158065    703      27856     400     267        0        3        3        0        0        0  25.3%

running: large-partition-slicing
Testing slicing of large partition:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000733         1       1365      9        296       6       4        0        3        3        0        0        0  19.3%
0       32        0.000705        32      45417      9        296       6       3        0        3        3        0        0        0  15.3%
0       256       0.000830       256     308364     10        328       6       3        0        3        3        0        0        0  26.7%
0       4096      0.004631      4096     884529     25        808      14       3        0        3        3        0        0        0  48.1%
500000  1         0.001184         1        845     13        412       9       4        0        3        3        0        0        0  23.7%
500000  32        0.001199        32      26690     13        412       9       4        0        3        3        0        0        0  21.9%
500000  256       0.001530       256     167296     14        444      10       4        0        3        3        0        0        0  26.8%
500000  4096      0.004379      4096     935474     30        956      19       4        0        3        3        0        0        0  51.5%

running: large-partition-slicing-clustering-keys
Testing slicing of large partition using clustering keys:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000620         1       1614      7        176       6       0        0        3        3        0        0        0  27.4%
0       32        0.000625        32      51218      7        176       6       0        0        3        3        0        0        0  27.0%
0       256       0.000701       256     365148      8        180       6       0        0        3        3        0        0        0  35.2%
0       4096      0.004063      4096    1008130     20        692      12       1        0        3        3        0        0        0  47.6%
500000  1         0.001208         1        827     12        516       9       3        0        3        3        0        0        0  24.3%
500000  32        0.000973        32      32876     12        388       9       3        0        3        3        0        0        0  28.7%
500000  256       0.001315       256     194612     13        384      10       3        0        3        3        0        0        0  29.0%
500000  4096      0.003950      4096    1037068     25        840      17       2        0        3        3        0        0        0  52.7%

running: large-partition-slicing-single-key-reader
Testing slicing of large partition, single-partition reader:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.000844         1       1185      9        488       6       0        0        3        3        0        0        0  16.5%
0       32        0.000656        32      48753      9        296       6       0        0        3        3        0        0        0  23.1%
0       256       0.000751       256     341011     10        328       6       0        0        3        3        0        0        0  34.0%
0       4096      0.004173      4096     981632     22        840      12       1        0        3        3        0        0        0  47.0%
500000  1         0.001036         1        966     13        476       9       3        0        3        3        0        0        0  25.4%
500000  32        0.001014        32      31573     13        412       9       3        0        3        3        0        0        0  27.4%
500000  256       0.001280       256     200044     14        444      10       3        0        3        3        0        0        0  31.8%
500000  4096      0.004081      4096    1003746     30        988      18       3        0        3        3        0        0        0  51.6%

running: large-partition-select-few-rows
Testing selecting few rows from a large partition:
stride  rows      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
1000000 1         0.000668         1       1498      9        296       6       3        0        3        3        0        0        0  21.7%
500000  2         0.000958         2       2088     13        412       9       4        0        3        3        0        0        0  27.7%
250000  4         0.001495         4       2676     18        572      12       6        0        3        3        0        0        0  25.8%
125000  8         0.002069         8       3866     29        912      19      10        0        3        3        0        0        0  30.8%
62500   16        0.002856        16       5603     50       1584      32      18        0        3        3        0        0        0  41.7%
2       500000    1.063129    500000     470310   1138     127080     120       0        0        5        5        0        0        0  99.7%

running: large-partition-forwarding
Testing forwarding with clustering restriction in a large partition:
pk-scan   time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
yes       0.002567         2        779     24       2684       8      16        0        3        3        0        0        0  21.5%
no        0.001013         2       1975     13        412       8       2        0        3        3        0        0        0  28.9%

running: small-partition-skips
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
-> 1       0         1.349959   1000000     740763   1369     139732      33       1        0        0        0        0        0        0  99.7%
-> 1       1        12.640751    500000      39555   8144     191168    7064       0        0     7032    11481        0        0        0  96.2%
-> 1       8         3.404269    111112      32639   6651     180660    5571       0        0     5539    10505        0        0        0  84.5%
-> 1       16        2.175424     58824      27040   6434     179116    5354       0        0     5322    10365        0        0        0  74.3%
-> 1       32        1.493365     30304      20292   6335     178404    5257       0        0     5225    10294        0        0        0  61.1%
-> 1       64        1.112168     15385      13833   6256     177672    5183       0        0     5151    10217        0        0        0  48.7%
-> 1       256       0.719282      3892       5411   6211     175464    5178      24        0     5122    10220        0        0        0  33.3%
-> 1       1024      0.393236       976       2482   5946     142152    5369     217        0     5120    10235        0        0        0  30.7%
-> 1       4096      0.185284       245       1322   5499      81992    5296     142        0     5122    10240        0        0        0  44.7%
-> 64      1         2.356711    984616     417792   7361     177944    5184      21        0     5152     5266        0        0        0  79.1%
-> 64      8         2.192331    888896     405457   6253     177868    5173       0        0     5141     5690        0        0        0  77.2%
-> 64      16        2.029835    800000     394121   6245     177812    5165       0        0     5133     6132        0        0        0  75.7%
-> 64      32        1.806448    666688     369060   6245     177808    5165       0        0     5133     6864        0        0        0  72.6%
-> 64      64        1.508492    500032     331478   6242     177788    5163       0        0     5131     7667        0        0        0  67.7%
-> 64      256       0.892881    200000     223994   6233     177704    5154       0        0     5122     9202        0        0        0  54.2%
-> 64      1024      0.465715     58880     126429   6229     177492    5154       0        0     5122     9930        0        0        0  44.0%
-> 64      4096      0.266582     15424      57858   6158     169480    5257     114        0     5115    10142        0        0        0  42.3%

running: small-partition-slicing
Testing slicing small partitions:
offset  read      time (s)     frags     frag/s    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
0       1         0.003113         1        321      3        260       2       0        0        1        1        0        0        0  13.4%
0       32        0.004166        32       7682     16       1428      10       0        0        5        5        0        0        0  14.9%
0       256       0.009813       256      26088     97       8572      62       0        0       31       31        0        0        0  18.4%
0       4096      0.014798      4096     276794    100       8704      64       0        0       32       32        0        0        0  46.3%
500000  1         0.003700         1        270     34        492      33       0        0       32       64        0        0        0  28.4%
500000  32        0.004030        32       7940     40       1260      36       0        0       32       64        0        0        0  27.8%
500000  256       0.009514       256      26908    100       8748      64       0        0       32       64        0        0        0  20.2%
500000  4096      0.013368      4096     306413    119       7892      85       0        0       53       64        0        0        0  53.6%

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <a72818f79ca4081a606424545b0053fa581d49e7.1522173144.git.vladimir@scylladb.com>
2018-03-29 15:23:31 +03:00
Asias He
f539e993d3 gossip: Relax generation max difference check
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node

I saw that node2 could not bootstrap; it was stuck waiting for schema information to complete forever:

On node1, node3

    [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704

On node2

    [shard 0] storage_service - JOINING: waiting for schema information to complete

This is because during the nodetool removenode operation, the generation of node2 was increased from 0 to 2.

   gossiper::advertise_removing() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
   gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();

Each force_newer_generation_unsafe increases the generation by 1.

Here is an example,

Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
       "addrs": "127.0.0.2",
       "generation": 0,
       "is_alive": false,
       "update_time": 1521778757334,
       "version": 0
   },
```

After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
       "addrs": "127.0.0.2",
       "application_state": [
           {
               "application_state": 0,
               "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
               "version": 214
           },
           {
               "application_state": 6,
               "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
               "version": 153
           }
       ],
       "generation": 2,
       "is_alive": false,
       "update_time": 1521779276246,
       "version": 0
   },
```

In gossiper::apply_state_locally, we have this check:

```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
    logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}", ep, local_generation, remote_generation);
}
```
to skip the gossip update.

To fix, we relax generation max difference check to allow the generation
of a removed node.

After this patch, the removed node bootstraps successfully.
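The relaxed check can be sketched as a standalone function. This is a simplified model of the condition described above, not the actual Scylla code: the constant's value and the exact shape of the relaxation (tolerating a small, post-removal local generation) are assumptions based on the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Simplified sketch of the gossip generation sanity check. The constant
// and the relaxation below are assumptions based on the commit message,
// not the actual Scylla implementation.
constexpr int64_t MAX_GENERATION_DIFFERENCE = 86400LL * 365; // ~1 year in seconds

bool accept_remote_generation(int64_t local_generation, int64_t remote_generation) {
    // A node removed via `nodetool removenode` is left with a tiny local
    // generation (e.g. 2), while its replacement advertises an
    // epoch-seconds generation. Only treat a huge jump as corruption when
    // the local generation itself looks like a real epoch-based value.
    if (local_generation > MAX_GENERATION_DIFFERENCE &&
        remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
        return false; // unbelievable generation: skip the gossip update
    }
    return true;
}
```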

Tests: dtest:update_cluster_layout_tests.py
Fixes #3331

Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
2018-03-29 12:09:49 +03:00
Glauber Costa
b092234f2b sstables: print informative message earlier
Just saw this today during a crash when creating Materialized Views.
It is still unclear why this happened. But the message says:

Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: scylla: sstables/sstables.cc:2973: sstables::sstable::remove_sstable_with_temp_toc(seastar::sstring, seastar::sstring, seastar::sstring, int64_t, sstables::sstable::version_types, sstables::sstable::format_types)::<lambda()>: Assertion `tmptoc == true' failed.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Aborting on shard 0.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Backtrace:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4b4c
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4df5
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4ea3
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libpthread.so.0+0x000000000000f0ff
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x00000000000355f6
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x0000000000036ce7
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e565
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e611
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000015969d0
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x0000000001596f7a
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x000000000051ca8d

I can't even guess which table caused the problem, let alone which SSTable.
That's because those asserts are the very first thing we do. We can discuss
whether or not assert is the right behaviour (usually we can't guarantee the
state is sane if that is missing, so I don't see a problem).

But it would be nice to see which SSTable we are processing before we assert.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180328160856.10717-1-glauber@scylladb.com>
2018-03-28 19:55:04 +03:00
Avi Kivity
4419e60207 Merge "Add a configuration API" from Amnon
"
The configuration API is part of scylla v2 configuration.
It uses the new definition capabilities of the API to dynamically create
the swagger definition for the configuration.
This means that the swagger will contain an entry with a description and
type for each of the config values.

To get the v2 of the swagger file:
http://localhost:10000/v2

If using with swagger ui, change http://localhost:10000/api-doc to http://localhost:10000/v2
It takes longer to load because the file is much bigger now.
"

* 'amnon/config_api_v5' of github.com:scylladb/seastar-dev:
  Explanation about the API V2
  API: add the config API as part of the v2 API.
  Defining the config api
2018-03-28 12:45:17 +03:00
Amnon Heiman
71a04b5d26 Explanation about the API V2
Currently it holds a general explanation about the V2 and specific entry
about the config.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:42:04 +03:00
Amnon Heiman
94c2d82942 API: add the config API as part of the v2 API.
After this patch, the API v2 will contain a config section with all the
configuration parameters.

get http://localhost:10000/v2

Will contain the config section.

An example for getting a configuration parameter:
curl http://localhost:10000/v2/config/listen_address

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:42:04 +03:00
Amnon Heiman
6d907e43e0 Defining the config api
The config API is created dynamically from the config. This means that
the swagger definition file will contain the description and types based on the
configuration.

The config.json file is used by the code generator to define a path that is
used to register the handler function.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-03-28 12:41:55 +03:00
Vladimir Krivopalov
b268ea951a tests: perf_fast_forward: Sanitize JSON file names.
Substitute various brackets and parentheses with alnum strings, remove
whitespaces, strip single-range values off curly braces.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <206adea8d05a1e64ce2627df1e4da3a845454906.1522171869.git.vladimir@scylladb.com>
2018-03-28 12:29:07 +03:00
Tomasz Grabiec
52c61df930 Relax includes
To avoid unnecessary recompilations.
Message-Id: <1522168295-994-1-git-send-email-tgrabiec@scylladb.com>
2018-03-28 10:49:07 +03:00
Avi Kivity
4c3e82bd67 Merge "db/view: Populate views with existing base table data" from Duarte
"
This series introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.

The view_builder uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.

We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the view building process.

We employ a flat_mutation_reader for each base table for which we're
building views.

We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, we restart the reader to the beginning of the current
token, but not to the beginning of the token range, when the view is
added. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.

We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.

Interaction with the system tables:
  - When we start building a view, we add an entry to the
    scylla_views_builds_in_progress system table. If the node restarts
    at this point, we'll consider these newly inserted views as having
    made no progress, and we'll treat them as new views;
  - When we finish a build step, we update the progress of the views
    that we built during this step by writing the next token to the
    scylla_views_builds_in_progress table. If the node restarts here,
    we'll start building the views at the token in the next_token
    column.
  - When we finish building a view, we mark it as completed in the
    built views system table, and remove it from the in-progress system
    table. Under failure, the following can happen:
        * When we fail to mark the view as built, we'll redo the last
          step upon node reboot;
        * When we fail to delete the in-progress record, upon reboot
          we'll remove this record.
    A view is marked as completed only when all shards have finished
    their share of the work, that is, if a view is not built, then all
    shards will still have an entry in the in-progress system table;
  - A view that a shard finished building, but not all other shards,
    remains in the in-progress system table, with first_token ==
    next_token.
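The restart behaviour described above (no recorded progress means a new view; otherwise resume at the persisted next_token) can be sketched with a tiny helper. The names `progress_row` and `resume_token` are illustrative, not the actual view_builder API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// One row of the scylla_views_builds_in_progress bookkeeping, reduced to
// the two tokens relevant to restart (illustrative type, not Scylla's).
struct progress_row {
    int64_t first_token;
    int64_t next_token; // first_token == next_token once this shard is done
};

// Decide where a shard resumes building a view after a restart.
int64_t resume_token(const std::optional<progress_row>& row, int64_t range_start) {
    if (!row) {
        return range_start; // newly inserted view: treat it as having made no progress
    }
    return row->next_token; // resume where the last completed build step left off
}
```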

Interaction with the distributed system tables:
  - When we start building a view, we mark the view build as being
    in-progress;
  - When we finish building a view, we mark the view as being built.
    Upon failure, we ensure that if the view is in the in-progress
    system table, then it may not have been written to this table. We
    don't load the built views from this table when starting. When
    starting, the following happens:
         * If the view is in the system.built_views table and not the
           in-progress system table, then it will be in this one;
         * If the view is in the system.built_views table and not in
           this one, it will still be in the in-progress system table -
           we detect this and mark it as built in this table too,
           keeping the invariant;
         * If the view is in this table but not in system.built_views,
           then it will also be in the in-progress system table - we
           don't detect this and will redo the missing step, for
           simplicity.

View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.

When building view updates, we consider that everything is new and
nothing pre-existing is there (which means no tombstones will be sent
out to the paired view replicas).

Tests:
  unit (debug)
  dtest (materialized_view_test.py(smp=1, smp=2))
"

* 'view-building/v4' of https://github.com/duarten/scylla: (22 commits)
  tests/view_build_test: Add tests for view building
  tests/cql_test_env: Move eventually() to this file
  tests/cql_assertions: Assert result set is not empty
  tests/cql_test_env: Start the view_builder
  db/view/view_builder: Allow synchronizing with the end of a build
  db/view/view_builder: Actually build views
  flat_mutation_reader: Make reader from mutation fragments
  db/view/view_builder: React to schema changes
  service/migration_listener: Add class for view notifications
  db/view: Introduce view_builder
  column_family: Add function to populate views
  column_family: Allow synchronizing with in-progress writes
  database: Compare view id instead of name in find_views()
  database: Add get_views() function
  db/view: Return a future when sending view updates
  service/storage_service: Allow querying the view build status
  db: Introduce system_distributed_keyspace
  tests: Add unit test for build_progress_virtual_reader
  db/system_keyspace: Add API for MV-related system tables
  db/system_keyspace: Add virtual reader for MV in-progress build status
  ...
2018-03-27 15:41:28 +03:00
Daniel Fiala
051ed12ad2 cql3/functions: Print function declaration with cql3 types, not with internal types.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
Message-Id: <20180327084953.20313-3-daniel@scylladb.com>
2018-03-27 13:33:29 +03:00
Duarte Nunes
9f5cfa76f7 tests/view_build_test: Add tests for view building
This is a separate file from view_schema_test because that one is
already becoming too long to run; also, having multiple test files
means they can be executed in parallel.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
e5031f70ef tests/cql_test_env: Move eventually() to this file
Move eventually() from view_schema_test to cql_test_env.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
8528584056 tests/cql_assertions: Assert result set is not empty
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a2c94e7925 tests/cql_test_env: Start the view_builder
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a45fa8eaa2 db/view/view_builder: Allow synchronizing with the end of a build
Intended for use by unit tests, this patch allows synchronizing with
the end of a build for a particular view.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
5f822e3928 db/view/view_builder: Actually build views
This patch adds the missing view building code to the eponymous class.

We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, we restart the reader to the beginning of the current
token, but not to the beginning of the token range, when the view is
added. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.

We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
1f3e3d3813 flat_mutation_reader: Make reader from mutation fragments
Builds a reader from a set of ordered mutation fragments. This is
useful for building a reader out of a subset of fragments returned by a
different reader. It is equivalent to building a mutation out of the
set of mutation fragments and calling
make_flat_mutation_reader_from_mutations, except that it does not yet
support fast-forwarding.
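A minimal standalone model of such a reader over pre-ordered fragments might look like this; the types are illustrative stand-ins, since the real flat_mutation_reader deals in mutation fragments and futures:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Illustrative fragment type; the real reader consumes mutation fragments.
struct fragment {
    std::string key;
};

// Sketch of a reader backed by an already-ordered set of fragments.
class fragment_reader {
    std::vector<fragment> _fragments;
    std::size_t _pos = 0;
public:
    explicit fragment_reader(std::vector<fragment> fragments)
        : _fragments(std::move(fragments)) {}

    // Returns the next fragment, or nullopt at end-of-stream.
    std::optional<fragment> next() {
        if (_pos == _fragments.size()) {
            return std::nullopt;
        }
        return _fragments[_pos++];
    }
};
```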

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
a21efeffa0 db/view/view_builder: React to schema changes
The view_builder now uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.

We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the upcoming view building
code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
3ffa3b6b54 service/migration_listener: Add class for view notifications
Add a convenience base class for view notifications, which provides
a default implementation for all other types of notifications.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:11 +01:00
Duarte Nunes
901faabaa2 db/view: Introduce view_builder
This patch introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.

This patch introduces only the bootstrap functionality, which is
responsible for loading the data stored in the system tables and
filling the in-memory data structures with the relevant information,
to be used in subsequent patches for the actual view building. The
interaction with the system tables is as follows.

Interaction with the tables in system_keyspace:
  - When we start building a view, we add an entry to the
    scylla_views_builds_in_progress system table. If the node restarts
    at this point, we'll consider these newly inserted views as having
    made no progress, and we'll treat them as new views;
  - When we finish a build step, we update the progress of the views
    that we built during this step by writing the next token to the
    scylla_views_builds_in_progress table. If the node restarts here,
    we'll start building the views at the token in the next_token
    column.
  - When we finish building a view, we mark it as completed in the
    built views system table, and remove it from the in-progress system
    table. Under failure, the following can happen:
        * When we fail to mark the view as built, we'll redo the last
          step upon node reboot;
        * When we fail to delete the in-progress record, upon reboot
          we'll remove this record.
    A view is marked as completed only when all shards have finished
    their share of the work, that is, if a view is not built, then all
    shards will still have an entry in the in-progress system table;
  - A view that a shard finished building, but not all other shards,
    remains in the in-progress system table, with first_token ==
    next_token.

Interaction with the distributed system table (view_build_status):
  - When we start building a view, we mark the view build as being
    in-progress;
  - When we finish building a view, we mark the view as being built.
    Upon failure, we ensure that if the view is in the in-progress
    system table, then it may not have been written to this table. We
    don't load the built views from this table when starting. When
    starting, the following happens:
         * If the view is in the system.built_views table and not the
           in-progress system table, then it will be in view_build_status;
         * If the view is in the system.built_views table and not in
           this one, it will still be in the in-progress system table -
           we detect this and mark it as built in this table too,
           keeping the invariant;
         * If the view is in this table but not in system.built_views,
           then it will also be in the in-progress system table - we
           don't detect this and will redo the missing step, for
           simplicity.

View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
f298f57137 column_family: Add function to populate views
The populate_views() function takes a set of views to update, a
token to select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism to exist,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
67dd3e6e5d column_family: Allow synchronizing with in-progress writes
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.

In particular, this will be used by the view building code, which needs
to wait until in-progress writes, which may have missed that there is
now a view, become observable to it.
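The mechanism can be sketched as a small write barrier: writes register on entry, and a waiter runs once in-flight writes drain. A simplified, single-threaded illustration only; the real column_family API is future-based and a true phased barrier is more precise about which writes a waiter observes.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Illustrative barrier for synchronizing with in-progress writes; this
// is not the actual column_family mechanism.
class write_barrier {
    int _in_flight = 0;
    std::queue<std::function<void()>> _waiters;
public:
    void enter() { ++_in_flight; }   // a write begins
    void leave() {                   // a write completes
        if (--_in_flight == 0) {
            while (!_waiters.empty()) {
                _waiters.front()();
                _waiters.pop();
            }
        }
    }
    // Runs fn once the writes currently in flight have completed.
    void await_writes(std::function<void()> fn) {
        if (_in_flight == 0) {
            fn();
        } else {
            _waiters.push(std::move(fn));
        }
    }
};
```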

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
9640205f11 database: Compare view id instead of name in find_views()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
9b9ba525f7 database: Add get_views() function
Returns all the schemas that are views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
dc44a08370 db/view: Return a future when sending view updates
While we now send view mutations asynchronously in the normal view
write path, other processes interested in sending view updates, such
as streaming or view building, may wish to do it synchronously.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
ff15068a41 service/storage_service: Allow querying the view build status
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.

A view can be absent from the result, successfully built, or
currently being built.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
78b232d98f db: Introduce system_distributed_keyspace
This patch introduces a distributed system keyspace, used to hold
system tables that need to be replicated across a set of replicas
(that is, can't use the LocalStrategy).

In following patches, we will use this keyspace to hold a table
containing view building status updates for each node, used to support
range movements and a new nodetool command.

Fixes #3237

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
412f081db9 tests: Add unit test for build_progress_virtual_reader
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
4227641a3d db/system_keyspace: Add API for MV-related system tables
This patch implements an API to access the MV-related system tables,
which pertain to the view building process.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
b2cae7ea09 db/system_keyspace: Add virtual reader for MV in-progress build status
Provide a virtual reader so users can query the in-progress view table
in a way compatible with Apache Cassandra.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
7811474697 db/system_keyspace: Add Scylla-specific MV system table
When building a materialized view, we divide our work by shard, so we
need to register which shard did what work in the in-progress system
table. We also add the token we started at, which will enable some
optimizations in the view building code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
38831888d2 db/system_keyspace: Include MV system tables in all_tables()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Avi Kivity
16a7650873 Merge "More extensions: commitlog + system tables" from Calle
"
Additional extension points.

* Allows wrapping commitlog file io (including hinted handoff).
* Allows system schema modification on boot, allowing extensions
  to inject extensions into hardcoded schemas.

Note: to make commitlog file extensions work, we need to both
enforce we can be notified on segment delete, and thus need to
fix the old issue of hard ::unlink call in segment destructor.
Segment delete is therefore moved to a batch routine, run at
intervals/flush. Replay segments and hints are also deleted via
the commitlog object, ensuring an extension is notified (metadata).

Configurable listeners are now allowed to inject configuration
object into the main config. I.e. a local object can, either
by becoming a "configurable" or manually, add references to
self-describing values that will be parsed from the scylla.yaml
file, effectively extending it.

All these wonderful abstractions courtesy of encryption of course.
But super generalized!
"

* 'calle/commitlog_ext' of github.com:scylladb/seastar-dev:
  db::extensions: Allow extensions to modify (system) schemas
  db::commitlog: Add commitlog/hints file io extension
  db::commitlog: Do segment delete async + force replay delete go via CL
  main/init: Change configurable callbacks and calls to allow adding opts
  util::config_file: Add "add" config item overload
2018-03-26 16:18:22 +03:00
Calle Wilund
ff41f47a08 db::extensions: Allow extensions to modify (system) schemas
Allows extensions/config listeners to potentially augment
(system) schemas at boot time. This is only useful for schemas
that do not pass through system_schema tables.
2018-03-26 11:58:28 +00:00
Calle Wilund
bb1a2c6c2e db::commitlog: Add commitlog/hints file io extension
To allow on-disk data to be augmented.
2018-03-26 11:58:27 +00:00
Calle Wilund
2bc98aebaf db::commitlog: Do segment delete async + force replay delete go via CL
Refs #2858

Push segment files to be deleted to a pending list, and process at
intervals or flush-requests (or shutdown). Note that we do _not_
indiscriminately do deletes in non-anchored tasks, because we need
to guarantee that finished segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.

Also make sure we delete segments replayed via commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
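The deferred-delete scheme can be sketched as a pending list drained in batches. Names here are illustrative; the real code unlinks files asynchronously and notifies registered extensions before removal.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative sketch of batched segment deletion (not the commitlog API).
class segment_deleter {
    std::vector<std::string> _pending;
    std::vector<std::string> _deleted; // stands in for actual unlink calls
public:
    // Called where the segment destructor used to hard-unlink the file.
    void defer_delete(std::string path) {
        _pending.push_back(std::move(path));
    }

    // Run at intervals, on flush requests, and on shutdown, so finished
    // segments are guaranteed gone before they could be mistaken for
    // replayable ones. Extension notification would happen here.
    void process_pending() {
        for (auto& p : _pending) {
            _deleted.push_back(std::move(p));
        }
        _pending.clear();
    }

    std::size_t pending_count() const { return _pending.size(); }
    const std::vector<std::string>& deleted() const { return _deleted; }
};
```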
2018-03-26 11:58:27 +00:00
Duarte Nunes
a985ea0fcb column_family: Don't retry flushing memtable if shutdown is requested
Since we just keep retrying, this can cause Scylla to not shutdown for
a while.

The data will be safe in the commit log.

Note that this patch doesn't fix the issue when shutdown goes through
storage_service::drain_on_shutdown - more work is required to handle
that case.

Ref #3318.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-3-duarte@scylladb.com>
2018-03-26 14:36:40 +03:00
Duarte Nunes
50ad37d39b column_family: Increase scope of exception handling when flushing a memtable
In column_family::try_flush_memtable_to_sstable, the handle_exception()
block is on the inside of the continuations to
write_memtable_to_sstable(), which, if it fails, will leave the
sstable in the compaction_backlog_tracker::_ongoing_writes map, which
will waste disk space, and that sstable will map to a dangling pointer
to a destroyed database_sstable_write_monitor, which causes a seg
fault when accessed (for example, through the backlog_controller,
which accounts the _ongoing_writes when calculating the backlog).

Fix this by increasing the scope of handle_exception().

Fixes #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-2-duarte@scylladb.com>
2018-03-26 14:36:16 +03:00
Duarte Nunes
b7bd9b8058 backlog_controller: Stop update timer
On database shutdown, this timer can cause use-after-free errors if
not stopped.

Refs #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-1-duarte@scylladb.com>
2018-03-26 14:36:16 +03:00
Botond Dénes
0e6aa91269 Fix test.py output and error handling
* Don't dump output of failed tests immediately, print the output
for failed tests in the end instead.
* Fix exception printing in run_test(): don't assume passed in error
object is a `bytes` (or bytes-like) object, call the object's str
operator instead and let callers encode bytes objects instead.
* Don't assume Exception object has an `out` member, use operator str
instead to convert it to string.
* Don't print progress in run_test() directly because it results in
incomprehensible output as the executors race to print to stdout. Leave
progress report to the caller who can serialize progress prints.
* Automatically detect non-tty stdout and don't try to edit already
printed text.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7bb7e0003ded9b28710250bff851ea849bb99f7d.1522062795.git.bdenes@scylladb.com>
2018-03-26 14:26:45 +03:00
Avi Kivity
999df41a49 Merge "Bug fixes for access-control, and finalizing roles" from Jesse
"
This series does not add or change any features of access-control and
roles, but addresses some bugs and finalizes the switch to roles.

"auth: Wait for schema agreement" and the patch prior help avoid false
negatives for integration tests and error messages in logs.

"auth: Remove ordering dependence" fixes an important bug in `auth` that
could leave the default superuser in a corrupted state when it is first
created.

Since roles are feature-complete (to the best of the author's knowledge
as of this writing), the final patch in the series removes any warnings
about them being unimplemented.

Tests: unit (release), dtest (PENDING)
"

* 'jhk/auth_fixes/v1' of https://github.com/hakuch/scylla:
  Roles are implemented
  auth: Increase delay before background tasks start
  auth: Remove ordering dependence
  auth: Don't warn on rescheduled task
  auth: Wait for schema agreement
  Single-node clusters can agree on schema
2018-03-26 09:29:41 +03:00
Jesse Haber-Kucharsky
849cf49b8d Roles are implemented
Fixes #1941.
2018-03-26 00:52:59 -04:00
Jesse Haber-Kucharsky
af24637565 auth: Increase delay before background tasks start
I've observed failures due to "missing" the peer nodes by about 1
second. Adding 5 seconds to the existing delay should cover most false
negative test results.

Fixes #3320.
2018-03-26 00:52:55 -04:00
Jesse Haber-Kucharsky
00f7bc676d auth: Remove ordering dependence
If `auth::password_authenticator` also creates `system_auth.roles` and
we fix the existence check for the default superuser in
`auth::standard_role_manager` to only search for the columns that it
owns (instead of the column itself), then both modules' initialization
are independent of one another.

Fixes #3319.
2018-03-25 22:38:11 -04:00
Jesse Haber-Kucharsky
968c61c296 auth: Don't warn on rescheduled task
Apache Cassandra also prints at the `info` level. This change prevents
tasks which we expect to be rescheduled from failing tests and scaring
users.

A good example of this importance of this change is when queries with a
quorum consistency level (for the default superuser) fail because a
quorum is not available. We will try again in this case, and this should
not cause integration tests to fail.
2018-03-25 22:38:11 -04:00
Jesse Haber-Kucharsky
881656cea4 auth: Wait for schema agreement
Some modules of `auth` create a default superuser if it does not already
exist.

The existence check is through a SELECT query with quorum consistency
level. If the schema for the applicable tables has not yet propagated to
a peer node at the time that it processes this query, then the
`storage_proxy` will print an error message to the log and the query
will be retried.

Eventually, the schema will propagate and the default superuser will be
created. However, the error message in the log causes integration tests
to fail (and is somewhat annoying).

Now, prior to querying for existing data, we wait for all gossip peers
to have the same schema version as we do.

Fixes #2852.
2018-03-25 22:38:08 -04:00
Jesse Haber-Kucharsky
3e415e28bc Single-node clusters can agree on schema
At some points while bootstrapping [1], new non-seed Scylla nodes wait
for schema agreement among all known endpoints in the cluster.

The check for schema agreement was in
`service::migration_manager::is_ready_for_bootstrap`. This function
would return `true` if, at the time of its invocation, the node was
aware of at least one `UP` peer (not itself) and that all `UP` peers had
the same schema version as the node.

We wish to re-use this check in the `auth` sub-system to ensure that
the schema for internal system tables used for access-control have
propagated to the entire cluster.

Unlike in `service/storage_service.cc`, where `is_ready_for_bootstrap`
was only invoked for seed nodes, we wish to wait for schema agreement
for all nodes regardless of whether or not they are seeds.

For a single-node cluster with itself as a seed,
`is_ready_for_bootstrap` would always return `false`.

We therefore change the conditions for schema agreement. Schema
agreement is now reached when there are no known peers (so the endpoint
map of the gossiper consists only of ourselves), or when there is at
least one `UP` peer and all `UP` peers have the same schema version as
us.

This change should not impact any bootstrap behavior in
`storage_service` because seed nodes do not invoke the function and
non-seed nodes wait for peer visibility before checking for schema
agreement.

Since this function is no longer checking for schema agreement only in
the context of bootstrapping non-seed nodes, we rename it to reflect its
generality.

[1] http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
2018-03-25 22:08:42 -04:00
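The revised agreement condition above can be sketched as a small standalone predicate (a minimal sketch with hypothetical names; the real check lives in `service::migration_manager` and consults the gossiper's endpoint map):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch, not the actual Scylla API: agreement holds when
// the gossiper's endpoint map contains only ourselves, or when at least
// one peer is UP and every UP peer reports our schema version.
bool schema_agreement(std::size_t known_peers,
                      const std::vector<std::string>& up_peer_versions,
                      const std::string& our_version) {
    if (known_peers == 0) {
        return true;  // single-node cluster: nobody else to agree with
    }
    if (up_peer_versions.empty()) {
        return false; // peers exist but none are UP yet
    }
    for (const auto& v : up_peer_versions) {
        if (v != our_version) {
            return false; // an UP peer still has a stale schema
        }
    }
    return true;
}
```

Under the old condition the first branch returned `false`, so a single-node seed cluster could never reach agreement.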
Duarte Nunes
aed28c667c db/view: Pass pending endpoints to storage_proxy::send_to_endpoint
This minimizes the number of mutation copies by just doing a single
call to send_to_endpoint().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180325121412.76844-2-duarte@scylladb.com>
2018-03-25 15:45:22 +03:00
Duarte Nunes
fb54c09e0b service/storage_proxy: Pass pending endpoints to send_to_endpoint()
This will allow us to minimize the number of mutation copies in
mutate_MV().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180325121412.76844-1-duarte@scylladb.com>
2018-03-25 15:45:21 +03:00
Avi Kivity
389fb54a42 tests: sstable_test: fix for_each_sstable_version concept (again)
I see the following error:

seastar/core/future-util.hh:597:10: note:   constraints not satisfied
seastar/core/future-util.hh:597:10: note:     with ‘sstables::sstable_version_types* c’
seastar/core/future-util.hh:597:10: note:     with ‘sub_partitions_read::run_test_case()::<lambda(sstables::sstable::version_types)> aa’
seastar/core/future-util.hh:597:10: note: the required expression ‘seastar::futurize_apply(aa, (* c.begin()))’ would be ill-formed
seastar/core/future-util.hh:597:10: note: ‘seastar::futurize_apply(aa, (* c.begin()))’ is not implicitly convertible to ‘seastar::future<>’

The C array all_sstable_versions decayed to a pointer (see second gcc note)
and of course doesn't support std::begin().

Fix by replacing the C array with an std::array<>, which supports std::begin().

Not clear what made this break again, or why it worked before.
Message-Id: <20180325095239.12407-1-avi@scylladb.com>
2018-03-25 13:02:57 +01:00
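The decay problem and its fix can be shown in isolation (illustrative stand-in types; the real code iterates `sstables::sstable_version_types` in `for_each_sstable_version`):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <iterator>

// Illustrative sketch: when a C array is captured or passed by value it
// decays to a pointer, which has no extent and cannot be used with
// std::begin()/std::end(). std::array keeps its extent, so iteration works.
enum class version { ka, la };

inline constexpr std::array<version, 2> all_sstable_versions{version::ka, version::la};

template <typename Range>
std::size_t count_versions(const Range& r) {
    std::size_t n = 0;
    for (auto it = std::begin(r); it != std::end(r); ++it) {
        ++n;
    }
    return n;
}
```

With a raw `version[2]` that had decayed to `version*`, the `std::begin(r)` expression above would be ill-formed, which is exactly the constraint failure in the gcc notes.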
Duarte Nunes
44996fa6ae Merge 'Reduce link dependencies in tests' from Avi
"
This patchset removes unneeded object files from the test link,
reducing unnecessary links and reducing link time and executable
size.

Tests: build (release)
"

* tag 'test-link/v1' of https://github.com/avikivity/scylla:
  build: link release.o into scylla and perf_fast_forward binaries only
  build: don't link api/ into tests
2018-03-24 20:54:49 +00:00
Avi Kivity
09453ca0db build: link release.o into scylla and perf_fast_forward binaries only
release.o depends on the release date and git hash, and therefore changes
every time ./configure.py is executed.  In turn, this causes all tests to
relink.

Improve the situation by only linking release.o into binaries that require
it.

This helps continuous integration scripts, which call configure.py
unconditionally. Developers usually won't, so they will not see significant
savings.

Tests: build (release)
2018-03-24 22:55:03 +03:00
Avi Kivity
e78cea4121 build: don't link api/ into tests
They don't need it.
2018-03-24 22:55:02 +03:00
Duarte Nunes
f298e3e6f8 database: Log exception which caused flush to fail
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180322204419.12961-1-duarte@scylladb.com>
2018-03-23 10:57:35 +00:00
Takuya ASADA
81fbcbf6bc dist/redhat: don't redefine __debug_install_post on Fedora27 or later
Redefining _debug_install_post does not work on Fedora27 or later,
seemingly because the debuginfo generation process has changed:
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/ITJHJTUO2WFEAYIHANSM6AMAB5SIFASI/

To prevent the build error, move scylla-gdb.py to scylla-server package on
Fedora 27 or later.

Fixes #3313

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521735371-29408-1-git-send-email-syuu@scylladb.com>
2018-03-22 19:39:14 +02:00
Takuya ASADA
879c9f1bf8 dist/redhat: don't use yaml-cpp-static on Fedora
Since Fedora still does not have a separate yaml-cpp-static package, don't
depend on it.

Fixes #3183

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521735663-29516-1-git-send-email-syuu@scylladb.com>
2018-03-22 18:24:28 +02:00
Avi Kivity
054854839a Merge "Fix abort during counter table read-on-delete" from Tomasz
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to an attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.

Fixes #3304.
"

* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test reads with no clustering ranges and no static columns
  tests: simple_schema: Allow creating schema with no static column
  clustering_ranges_walker: Stop after static row in case no clustering ranges
2018-03-22 17:36:20 +02:00
Tomasz Grabiec
604166143c tests: mutation_source_test: Test reads with no clustering ranges and no static columns
Reproduces issue #3304.
2018-03-22 15:00:48 +01:00
Tomasz Grabiec
3a974d1776 tests: simple_schema: Allow creating schema with no static column 2018-03-22 14:44:54 +01:00
Tomasz Grabiec
d1cb6bbf95 clustering_ranges_walker: Stop after static row in case no clustering ranges
When there are no clustering ranges, stop at position which is right
after the static row instead of position which is after all clustered
rows.

This fixes an abort in sstable reader when querying a partition with
no clustering ranges (happens with counter tables) which also doesn't
have any static columns. In such case, the sstable_mutation_reader
will setup the data_consume_context such that it only covers the
static row of the partition, knowing that there is no need to read
any clustering row. See partition.cc::advance_to_upper_bound(). Later
when we're done with reading the static row (which is absent), we will
try to skip to the first clustering range, which in this case is
missing. If clustering_ranges_walker tells us to skip to
after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to an attempt to skip
past the original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine, because we end up
at the same data file position.

Fixes #3304.
2018-03-22 14:44:48 +01:00
Botond Dénes
a65b063ab2 incremental_reader_selector: remove unused members
Since 3d725d6823 the incremental_reader_selector creates readers via
a factory function so these members, used previously for creating the
readers, are not needed anymore.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <64b5cef93c1f9a2e544ccfd89e293627e99dd4cd.1521724155.git.bdenes@scylladb.com>
2018-03-22 13:14:03 +00:00
Takuya ASADA
bef08087e1 scripts/scylla_install_pkg: follow redirection of specified repo URL
We should follow redirection on curl, just like normal web browser does.
Fixes #3312

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521712056-301-1-git-send-email-syuu@scylladb.com>
2018-03-22 12:55:43 +02:00
Vladimir Krivopalov
3010b637c9 perf_fast_forward: fix error in date formatting
Instead of 'month', 'minutes' has been used.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <1e005ecaa992d8205ca44ea4eebbca4621ad9886.1521659341.git.vladimir@scylladb.com>
2018-03-22 09:57:15 +00:00
Avi Kivity
a7d86410b5 Merge "Split more tasks out of the generic scheduling group" from Glauber
"
There are a lot of things that we should be grouping in scheduling
groups that we aren't yet. The write path is not tagged at all,
mutation_query isn't either. Some, like streaming, are used - but not in
all places where they are needed.

Tests: unit (release)
"

* 'split-scheduling-groups-v2' of github.com:glommer/scylla:
  database: group statements in their own scheduling group
  database: apply streaming mutations with streaming priority
  logalloc: capture current scheduling group for deferring function
2018-03-21 15:02:50 +02:00
Nadav Har'El
e5de66d0c4 Materialized Views: unit test for missing view key columns
Add a unit test for reproducing issue #2720 (and verifying its fix)
If a user tries to create a view whose primary key is missing any of the
base table's primary key columns, the creation should fail.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-3-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
c809dd2e66 Materialized Views: change order of view creation verification
Changed the order to check a couple of error conditions *after* checking
for too many or missing primary key columns. This order (showing the
too many or missing key columns first) is more useful, and is the order
in Cassandra's code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-2-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
871cecfd3b Materialized Views: fix checking that view key includes base key
A view's primary key must include all the columns of the base's primary
key. If we don't check this and fail the table's creation, we can discover
problems later on when using the table, as demonstrated in issue #2720.

We had such checking code (translated from the same code in Java) but it
had an extra "else" which caused nothing to be put in "missing_pk_columns"
so the error was never recognized.

Also, when the error does happen, we should print the column's name_as_text(),
not name() which is (surprisingly) just a number.

Fixes #2720.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180320161121.13392-1-nyh@scylladb.com>
2018-03-21 09:47:41 +00:00
Nadav Har'El
06aaace5a4 Materialized View: fix one of the unit tests
One of the tests created a base table with 5 primary key columns, but
put only 4 of them in the view. This is not allowed, but prior to fixing
issue #2720 this error was silently ignored. Let's fix the error instead
of relying on this silence.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180321094352.22329-1-nyh@scylladb.com>
2018-03-21 09:46:55 +00:00
Duarte Nunes
0d74442252 tests/sstable_test: Fix concept for for_each_sstable_version
Un-break the build.

Fixes #3307

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180320182011.11068-1-duarte@scylladb.com>
2018-03-20 22:26:06 +00:00
Glauber Costa
9188059427 database: group statements in their own scheduling group
When we introduced the CPU scheduler, we also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.

To accommodate that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:36 -04:00
Glauber Costa
c8e169f6d8 database: apply streaming mutations with streaming priority
We are flushing the streaming memtables with streaming priority, but
applying the mutations themselves is still done with normal priorities.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:35 -04:00
Glauber Costa
3b53f922a3 logalloc: capture current scheduling group for deferring function
When we call run_when_memory_available, it is entirely possible that
the caller is doing that inside a scheduling_group. If we don't defer
we will execute correctly. But if we do defer, the current code will
execute - in the future - with the default scheduling group.

This patch fixes that by capturing the caller scheduling group and
making sure the function is executed later using it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:35 -04:00
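The capture-and-restore behavior this patch describes can be sketched with a toy scheduler (hypothetical names and a hand-rolled queue; Seastar's real primitives are `seastar::current_scheduling_group()` and `with_scheduling_group()`):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Illustrative sketch: capture the caller's scheduling group when the
// work item is queued, and re-enter that group when the deferred
// function finally runs, instead of running under whatever group the
// scheduler happens to be in at drain time.
struct fake_scheduler {
    int current_group = 0; // 0 = default group
    std::queue<std::function<void()>> deferred;

    void run_when_memory_available(std::function<void()> fn) {
        int captured = current_group; // capture at submission time
        deferred.push([this, captured, fn = std::move(fn)] {
            int saved = current_group;
            current_group = captured; // run under the caller's group...
            fn();
            current_group = saved;    // ...then restore
        });
    }

    void drain() {
        while (!deferred.empty()) {
            auto f = std::move(deferred.front());
            deferred.pop();
            f();
        }
    }
};
```

Without the `captured` copy, a function deferred from inside a non-default group would run under the default group once memory became available — the bug this patch fixes.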
Duarte Nunes
237184324e Merge 'Make the read repair decision per-query instead of per-page' from Botond
"
Since f8613a8415 we have reader-caching
on replicas for single-partition queries. This caching works best when
all pages of a query are sent to the same replicas consistently and thus
they can reuse the cached readers there.
The probability-based nature of read-repair works against this, as on any
given page a read-repair will be attempted or not based on probability.
This will cause high drop rates on the replicas used for read-repair, as
the cached reader will not be reusable if the replica was skipped for
one or more pages.
To fix this make the repair-decision once, on the first page of the
query and store the decision in the paging-state. On all remaining
pages of the query use this stored decision.

Tests: unit-tests(release, debug), dtest(paging_advanced_tests.py)

Refs: #1865
"

* 'per_query_repair_decision/v2' of https://github.com/denesb/scylla:
  Make the read-repair decision only once
  storage_proxy: add coordinator_query_options and coordinator_query_result
  Add query_read_repair_decision to paging-state
2018-03-20 11:59:41 +00:00
Takuya ASADA
2045891cc2 dist/debian: use rebuilt libyaml-cpp on Debian9
On Debian9, the distribution-provided libyaml-cpp cannot be linked against
scylla, so use the rebuilt one from our 3rdparty repo.

fixes #3221

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521496288-12856-1-git-send-email-syuu@scylladb.com>
2018-03-20 12:30:47 +02:00
Nadav Har'El
07f88aef51 Materialized Views: test verification of only one new key column
For several reasons that I cannot fit in the margin, when a view is
created, at most ONE regular column from the base table may be added
to the view's key.
This small new test verifies that if we try to add two columns, the
view creation fails.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235453.1613-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Nadav Har'El
1d4ceaa237 Materialized Views: Fix IS NOT NULL unit test
We had a unit test, test_primary_key_is_not_null, for testing that
we correctly complain - or don't complain - on missing "IS NOT NULL"
restrictions, as expected.

However, this test missed the actual bug we had regarding IS NOT NULL
checking - see issue #2628 - because it thought a silly syntax error
which caused an exception, was the exception we expected to see :-)

So in this patch, I rewrote this test. It fixes the test's bug and
demonstrates issue #2628 (and verifies its fix), and also tests a few
more corner cases.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319235000.1399-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Nadav Har'El
da110d612e Materialized Views: Fix "IS NOT NULL" checking
When creating a materialized view, the user must provide a "IS NOT NULL"
restriction for each of the created view's primary columns. If such a
restriction is missing, the view creation should fail. In #2628 we noticed
that sometimes it wasn't failing, but later updates to such table would fail,
which is a bug.

There is actually one special case where "IS NOT NULL" is optional:
It is optional on the base's partition key column (when there is just
one of these) because it is already assumed that the partition key in
its entirety can never be null.

Our "IS NOT NULL" test, validate_primary_key(), had two logic errors
which caused it to miss some cases of missing "IS NOT NULL":

1. Instead of checking whether a certain column is the base's only
   partition-key column, and avoid testing IS NOT NULL just for that
   specific column, the code tested whether the schema *has* such a
   column, and if it did, the test was skipped for all columns.

2. When the code found the one new column in the view's primary key, it
   was so happy to find it that it immediately returned, and forgot to
   test the IS NOT NULL on that column :-)

Both errors are fixed by this patch.
See the next patch for a unit test.

Fixes #2628.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180319233657.522-1-nyh@scylladb.com>
2018-03-20 00:30:18 +00:00
Glauber Costa
f80d4a28d7 flat_mutation_reader: explicitly yield at every partition
Right now we have yield points between partition processing guaranteed
by the fact that there are .get()s around the code, and those include
an yield point.

We have been discussing removing the implicit yield point from get and
pushing that to the caller. In that spirit, let's yield explicitly here
if needed.

It should be the responsibility of the loop that it doesn't hurt
latency, either by the fact that it is bounded by a small number of
iterations or yields. In other words, that loop should have a yield
point on every iteration (like the non-thread variant does).

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180319173051.8918-1-glauber@scylladb.com>
2018-03-19 19:39:01 +02:00
Avi Kivity
03c22ad524 Merge "Support for Cassandra 2.2 (LA) SSTable formats" from Daniel
"
These patches add support for C* 2.2 file(name) format.

Namely:
  * It forces Scylla to write files in la format.
  * Adds storage-service feature for them.
  * cf and ks are determined from directory, not from file-name (for 2.2 format).
  * Adds some other fixes to make dtest happy.
  * Unit tests work with la format or with both formats.
"

* 'danfiala/filename-format-2.2-v4' of https://github.com/hagrid-the-developer/scylla:
  tests/sstables: Tests use la format or iterate over both formats.
  tests/sstables: Helper functions support 2.2 format directory structure.
  sstables: Use 2.2 (la) format as the default format to store sstables if it is enabled by feature-bits.
  storage_service: Support la sstable storage format as a feature.
  sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
  sstables: Throw a more detailed exception for unknown item in reverse_map.
  sstables/compaction: Suppress NaN in a report of a throughput.
2018-03-19 17:49:44 +02:00
Botond Dénes
eee9bda85b Make the read-repair decision only once
Make the read-repair decision on the first page of a paged-query and use
it for all the remaining pages. This helps querier-cache hit-rates as
reads to nodes will be sent consistently through the query.
2018-03-19 16:29:43 +02:00
Avi Kivity
fe4049f074 Merge "gossip: Fixes to shadow round" from Duarte
"
Fixes to gossip pertaining to the shadow round.

In particular, an issue preventing a node from being marked as alive is
fixed: After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them marks
them as dead.

If the shadow round failed, after doing the feature checking against the
system tables, we were not clearing the state map and re-adding the
endpoints. This leaves the alive marker set, and prevents
real_mark_alive() from eventually being called.

Fixes #3301
"

* 'gossip/shadow-round-fixes/v3' of https://github.com/duarten/scylla:
  gms/gossiper: Remove superfluous check
  service/storage_service: Always re-add loaded endpoints
  gms/gossiper: Check for shadow round completion before throwing
2018-03-19 15:22:35 +02:00
Botond Dénes
2e2abf6edb storage_proxy: add coordinator_query_options and coordinator_query_result
As yet more parameters and return-values are about to be added to all
storage_proxy::query_* methods we need a way that scales better than
changing the signatures every time. To this end we aggregate all
non-mandatory query parameters into `coordinator_query_options` and all
return values into `coordinator_query_result`.
This way new fields can be simply added to the respective structs while
the signatures of the methods themselves and their client code can
remain unchanged.
2018-03-19 15:17:35 +02:00
Botond Dénes
b55dcc2ce5 Add query_read_repair_decision to paging-state
This new field will store the repair-decision made on the first page of
the query. This decision will be sticky to all pages of the query.
In mixed clusters the decision might not happen on the first page and it
might even change during the query, as old coordinators will neither store
nor respect the decision.
2018-03-19 15:17:31 +02:00
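The sticky decision described above can be sketched as follows (illustrative names; the actual paging-state field and decision logic live in `storage_proxy` and the paging-state serialization):

```cpp
#include <cassert>

// Illustrative sketch: decide by probability only on the first page,
// then carry the decision in the paging state so every later page of
// the same query reuses it and hits the same replicas.
enum class read_repair_decision { none, global };

struct paging_state {
    bool has_decision = false;
    read_repair_decision decision = read_repair_decision::none;
};

read_repair_decision decide(paging_state& ps, double chance, double roll) {
    if (!ps.has_decision) { // first page: make the decision once
        ps.decision = (roll < chance) ? read_repair_decision::global
                                      : read_repair_decision::none;
        ps.has_decision = true;
    }
    return ps.decision;     // later pages reuse the stored decision
}
```

A per-page roll would let the replica set change between pages, invalidating cached readers on the replicas that were skipped.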
Daniel Fiala
4d703f9c6a tests/sstables: Tests use la format or iterate over both formats.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:10 +01:00
Daniel Fiala
386cae4ad2 tests/sstables: Helper functions support 2.2 format directory structure.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:09 +01:00
Daniel Fiala
089b54f2d2 sstables: Use 2.2 (la) format as the default format to store sstables if it is enabled by feature-bits.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:12:01 +01:00
Daniel Fiala
802be72ca6 storage_service: Support la sstable storage format as a feature.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-19 14:10:31 +01:00
Duarte Nunes
9cadfb27f1 gms/gossiper: Remove superfluous check
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Duarte Nunes
2c7b77b6d2 service/storage_service: Always re-add loaded endpoints
After the shadow round and the feature checking, we remove any
endpoints from the state - namely, those that contacted us -, before
re-adding them again. This is because those nodes that replied would
have been marked as alive in the endpoint state map (but not fully,
they'd be absent from the live endpoints list), and re-adding them
marks them as dead.

If the shadow round failed, after doing the feature checking against
the system tables, we were not clearing the state map and re-adding
the endpoints. This left the alive marker set, and prevented
real_mark_alive() from eventually being called.

Fixes #3301

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Duarte Nunes
69b28a4f2b gms/gossiper: Check for shadow round completion before throwing
For values of `shadow_round_ms` lower than 1 second, this was assuming
failure without checking.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Avi Kivity
601d8f7cff test: switch boost.test from --log_sink to --logger
Upstream fix works only for --logger according to

  https://github.com/boostorg/test/pull/124
Message-Id: <20180319121520.11110-1-avi@scylladb.com>
2018-03-19 13:26:28 +01:00
Calle Wilund
eb10d32ff9 main/init: Change configurable callbacks and calls to allow adding opts
Refs #2526

Allows sub-configs to dynamically add yaml/command line options to
the main config object, i.e. extend the scylla.yaml
2018-03-19 12:24:04 +00:00
Calle Wilund
fc97e39782 util::config_file: Add "add" config item overload 2018-03-19 12:24:04 +00:00
Duarte Nunes
71fddad376 Merge 'Reduce unit test runtime' from Avi
This patchset reduces the time required to run the tests, mostly by
running them in parallel.

I measured a reduction of 3.5X on a 1s4c4t desktop (release mode).

Tests: unit (release)

* tag 'faster-tests/v2' of https://github.com/avikivity/scylla:
  tests: run tests in parallel
  tests: simplify timeout handling
  tests: don't require crash integrity
  tests: allow sharing the machine with other tests
  tests: extract seastar options to a separate variable
  tests: reduce memory for tests
  tests: add "--" unconditionally for boost tests
  tests: start cql_test_env without binding to messaging port
  storage_service: allow starting gossiper without binding to messaging port
  gms: allow gossiper to start_gossiping() without binding to the port
  tests: close file correctly in loading_file_test
2018-03-19 10:24:55 +00:00
Avi Kivity
31b86a46a0 tests: run tests in parallel
Launch tests in a concurrent executor with worker count determined
by available memory.
2018-03-19 12:17:10 +02:00
Avi Kivity
638611a350 tests: simplify timeout handling
The subprocess module can handle timeouts itself, so use this
to simplify the module code.
2018-03-19 12:16:58 +02:00
Avi Kivity
95abed020b tests: don't require crash integrity
We don't resume tests after crashes, so no need to spend time waiting
for the disk to fsync.
2018-03-19 12:16:58 +02:00
Avi Kivity
b3d8dadf0c tests: allow sharing the machine with other tests
By using the overprovisioned flag, we reduce polling and pinning, so
less CPU time is wasted and the scheduler has more options to schedule
reactor threads.
2018-03-19 12:16:58 +02:00
Avi Kivity
3d84c8945d tests: extract seastar options to a separate variable 2018-03-19 12:16:58 +02:00
Avi Kivity
8b1cff90ce tests: reduce memory for tests
If we reduce memory for an individual test, we can run more
in parallel.
2018-03-19 12:16:58 +02:00
Avi Kivity
c3750176d8 tests: add "--" unconditionally for boost tests
Now that we have a minimum boost version, we don't need to check whether
boost requires "--" before test-specific command line arguments. Removing
the check speeds up the test a little.
2018-03-19 12:16:58 +02:00
Avi Kivity
9a04def202 tests: start cql_test_env without binding to messaging port
Allows running tests in parallel.
2018-03-19 12:16:52 +02:00
Avi Kivity
ee68bfa49d storage_service: allow starting gossiper without binding to messaging port 2018-03-19 12:16:11 +02:00
Avi Kivity
02ce0c4cde gms: allow gossiper to start_gossiping() without binding to the port
This is useful in tests, which don't communicate. Binding to a port can
fail if the system is running something else.

It would be better to prevent even more of the gossiper from starting up,
but that is more difficult.
2018-03-19 12:16:11 +02:00
Avi Kivity
f2dd31ee76 tests: close file correctly in loading_file_test
Otherwise, we crash with --overprovisioned on a use-after-free.
2018-03-19 12:16:11 +02:00
Duarte Nunes
810db425a5 gms/gossiper: Synchronize endpoint state destruction
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.

This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).

Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.

Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
2018-03-18 14:38:04 +02:00
Avi Kivity
38e1eb5e42 Update scylla-ami submodule
* dist/ami/files/scylla-ami 5170011...9b4be70 (1):
  > do not special case i3 for controller code
2018-03-18 11:37:00 +02:00
Takuya ASADA
378bf7cec0 dist/debian: switch Debian9 to boost-1.65
We switched Debian8/Ubuntu14/Ubuntu16 to boost-1.65 to fix #3090, but Debian9
still uses the distribution-provided boost-1.62, which causes the same build error.
So switch it to our boost-1.65, too.

See c636f552e0

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521250590-16510-1-git-send-email-syuu@scylladb.com>
2018-03-18 10:23:43 +02:00
Daniel Fiala
10db711259 sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format.
Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 06:09:47 +01:00
Daniel Fiala
abdf22f5cd sstables: Throw a more detailed exception for unknown item in reverse_map.
* This can help with debugging.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 05:54:15 +01:00
Daniel Fiala
c5eca593fc sstables/compaction: Suppress NaN in a report of a throughput.
* It causes failures in dtest.

Signed-off-by: Daniel Fiala <daniel@scylladb.com>
2018-03-18 05:46:32 +01:00
Glauber Costa
f5c32423b8 summary: don't go through all entries when computing memory size.
Summary has a function, memory_size(), that estimates the amount of
memory the summary takes. It is my understanding that this is called
to serve information to tooling.

First, this function is inaccurate because it doesn't take into account
the tokens per each entry, just the keys. But more importantly, it has
to iterate over all keys which can be pretty expensive if the entries
list is long. We are now keeping that in a memory area, with just
pointers in the entry. So instead of iterating through the entries, we
can iterate through the memory areas, which is much cheaper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180316120915.16809-1-glauber@scylladb.com>
2018-03-16 12:57:19 +00:00
Duarte Nunes
fef9d4fa72 service/storage_service: Avoid superfluous seastar::thread
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180315212202.12176-1-duarte@scylladb.com>
2018-03-16 12:52:15 +00:00
Nadav Har'El
e9702aa126 Materialized Views: don't lose updates while cluster is changing
When the cluster is changed (nodes added or removed), ranges of tokens
are moved between nodes. Scylla initiates a streaming process between an
old and a new owner of the range, which can take a long time. During
that streaming time, the new owner of the range is known as a "pending node"
for this range, and all updates must go to both the old owner (in case the
movement fails!) and the pending node (in case the movement succeeds).

For materialized views, because they are ordinary tables, streaming moves
all the view's data that existed before the streaming started. But we did
not send updates done to the view *during* the streaming. A dtest
demonstrates that the new node will miss some of the view updates, and will
require a repair of the view tables immediately after the cluster change
ends, which is not good. To fix that, we need to send every new update
that happens during the streaming also to the "pending node". We already
did this properly for base-table updates, but not to the view updates:
Each base table replica wrote to only one paired view table replica,
and nobody wrote to the new pending node (in case where there is one,
for the particular view token involved).

In this patch, we make sure that all view updates go also to the "pending
nodes" when there are any. We do the same thing that Cassandra does, which
is - *all* base replicas write the update to the pending node(s).
Arguably, it is inefficient that all replicas send the update to the same
node. In most cases it is enough to send it from just one base replica -
the one who is slated to be the new node's pair.  I opened
https://issues.apache.org/jira/browse/CASSANDRA-14262 about this idea.
But that is an optimization. The patch as-is already fixes the bug.

Fixes #3211

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180313171853.17283-1-nyh@scylladb.com>
2018-03-16 12:00:29 +00:00
Duarte Nunes
934d805b4b Merge 'Grant default permissions' from Jesse
The functional change in this series is in the last patch
("auth: Grant all permissions to object creator").

The first patch addresses `const` correctness in `auth`. This change
allowed the new code added in the last patch to be written with the
correct `const` specifiers, and also some code to be removed.

The second-to-last patch addresses error-handling in the authorizer for
unsupported operations and is a prerequisite for the last patch (since
we now always grant permissions for new database objects).

Tests: unit (release)

* 'jhk/default_permissions/v3' of https://github.com/hakuch/scylla:
  auth: Grant all permissions to object creator
  auth: Unify handling for unsupported errors
  auth: Fix life-time issue with parameter
  auth: Fix `const` correctness
2018-03-16 09:43:36 +01:00
Avi Kivity
9eb7c0c65b Merge "Remove (some) reactor stalls in the SSTable code" from Glauber
"
This is an improvement on my latest series. Instead of just
dealing with the problem of destroying the Summary that I have
identified in a previous test, I have tried to find other sources
of stalls.

Some of them are on readers and would affect early processes and
operations like nodetool refresh.

Others are on writers, which can affect any SSTable being written.

Two of those stalls (on large filter, on summary read), I saw in a
synthetic benchmark where I used very small values + nodetool compact
to generate one SSTable with many keys. They were 80ms and 20ms
respectively, and now they are totally gone.

For others, I just tried to be safe (for instance, if we know
reading/writing large vectors can be costly, just always insert
preemption points in them).

With all of these patches applied, I no longer see stalls coming from
the SSTable code in those tests (although given enough time, I am sure I
can find more).

Tests: unit (release)
Fixes: #3282, Fixes #3281, Fixes #3269
"

* 'sstables-stalls-v3-updated' of github.com:glommer/scylla:
  large_bitset/bloom filter: add preemption points in loops
  sstables: read filter in a thread
  abstract summary entry version of the token with a token view
  add a token_view
  sstables: rework summary entries reading
  sstables: avoid calls to resize for vectors
  sstables: replace potentially large for loop with do_until
  summary_entry: do not store key bytes in each summary entry
  tests: change tests to make summary non-copyable
  chunked_vector: do not iterate to destruct trivially destructible types
2018-03-16 09:43:36 +01:00
Glauber Costa
7fd31088f2 large_bitset/bloom filter: add preemption points in loops
SSTables that contain many keys - a common case with small partitions in
long lived nodes - can generate filters that are quite large.

I have seen stalls over 80ms when reading a filter that was the result
of a 6h write load of very small keys after nodetool compact (filter was
in the 100s of MB)

Similar care should be taken when creating the filter, as if the
estimated number of partitions is big, the resulting large_bitset can be
quite big as well.

If we treat the i_filter.hh and large_bitset.hh interfaces as truly
generic, then maybe we should have an in_thread version along with a
common version. But the bloom filter is the only user for both and even
if that changes in the future, it is still a good idea to run something
with a massive loop in a thread.

So for simplicity, I am just asserting that we are on a thread to avoid
surprises, and inserting preemption points in the loops.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
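The preemption-point pattern the commit describes can be sketched outside Seastar with stubbed primitives (in real Scylla code the loop would call Seastar's `seastar::thread::maybe_yield()` instead of the stand-ins below):

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for Seastar's need_preempt()/yield, so the sketch is
// self-contained. The stub "preempts" every 1024 iterations.
static std::size_t g_yields = 0;
static bool need_preempt_stub(std::size_t i) { return i % 1024 == 0; }
static void yield_stub() { ++g_yields; }

// Fill a large bitset, yielding periodically so one loop cannot
// monopolize the reactor for tens of milliseconds.
void fill_bitset(std::vector<bool>& bits) {
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (need_preempt_stub(i)) {
            yield_stub();            // preemption point inside the loop
        }
        bits[i] = true;
    }
}
```

The check is cheap relative to the loop body, so sprinkling it even where a stall is unlikely costs little.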
Glauber Costa
c424ba01df sstables: read filter in a thread
Constructing filter objects can be quite expensive. We will insert some
yield points around, and that is made a lot easier if we are calling
things from a thread.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
e680c7c8cc abstract summary entry version of the token with a token view
dht::token doesn't have a trivial destructor, so destroying an array
full of those can be quite expensive. If we use the same trick as we
used for the summary - storing the token data in a stable memory
location - we can leave the entries with a trivial destructor and destroy
the chunks themselves. Those being larger, they will be more efficient
to delete.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
dddc7e1676 add a token_view
Ideally we would like tokens to be trivially destructible, so that we
can easily dispose of giant vectors holding them. While that is hard to
do with our current infrastructure, we can introduce a token_view, which
holds a bytes_view elements instead of the real data - making it
trivially destructible.

The comparators are then changed to take a token_view, and an implicit
conversion function is provided from tokens so they get compared.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:09 -04:00
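The token/token_view split above can be illustrated with standard types (`std::string` standing in for the owning token data, `std::string_view` for the view; the real types are Scylla's `bytes`/`bytes_view`):

```cpp
#include <string>
#include <string_view>
#include <type_traits>

// The owning type has a non-trivial destructor, so destroying a giant
// vector of them must visit every element. The view type only holds a
// pointer and a length, so it is trivially destructible and a chunk of
// views can be freed without iterating.
struct owning_token { std::string data; };       // non-trivial dtor
struct token_view   { std::string_view data; };  // trivial dtor

static_assert(!std::is_trivially_destructible_v<owning_token>);
static_assert(std::is_trivially_destructible_v<token_view>);

// Comparators take the view; in the real code an implicit conversion
// from token lets owners participate in comparisons.
bool token_less(token_view a, token_view b) { return a.data < b.data; }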
Duarte Nunes
9da2b66cff cql3/untyped_result_set: Conform to boost::range concept
Enable some of that boost::copy_range goodness.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180315121801.2808-1-duarte@scylladb.com>
2018-03-15 13:34:44 +01:00
Takuya ASADA
69d226625a dist/ami: update CentOS base image to latest version
Since we require an updated version of systemd, we need to update the CentOS
base image.

Fixes #3184

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1518118694-23770-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:47:37 +02:00
Takuya ASADA
945e6ec4f6 dist/debian: use 3rdparty ppa on Ubuntu 18.04
Currently Ubuntu 18.04 uses the distribution-provided g++ and boost, but it's
easier to maintain the Scylla package when it builds with the same
toolchain/library versions, so switch to the 3rdparty ppa.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521075576-12064-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:41:05 +02:00
Takuya ASADA
1bb3531b90 dist/redhat: build only scylla, iotune
Since we don't package tests, we don't need to build them.
It reduces package building time.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521066363-4859-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:40:35 +02:00
Takuya ASADA
856dc0a636 dist/redhat: switch to gcc-7.3
We have hit the following bug on debug-mode binaries:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82560
Since it's fixed in gcc-7.3, we need to upgrade our gcc package.

See: https://groups.google.com/d/topic/scylladb-dev/RIdIpqMeTog/discussion
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521064473-17906-1-git-send-email-syuu@scylladb.com>
2018-03-15 10:39:25 +02:00
Vladimir Krivopalov
5c3b32a9bf Remove to_boost_visitor helper.
The minimum Boost version required for Scylla is now 1.58, so this
helper is no longer needed.
Replaced it with the more generic visitation utils from Seastar.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <e589ace7ac411d3d55dead475a8a2271f51642f1.1520976010.git.vladimir@scylladb.com>
2018-03-14 23:49:07 +00:00
Avi Kivity
bb4b1f0e91 Merge "Ubuntu/Debian build error fixes" from Takuya
* 'debian-ubuntu-build-fixes-v2' of https://github.com/syuu1228/scylla:
  dist/debian: build only scylla, iotune
  dist/debian: switch to boost-1.65
  dist/debian: switch to gcc-7.3
2018-03-14 22:50:40 +02:00
Takuya ASADA
7f891e7a48 dist/debian: build only scylla, iotune
Since we don't package tests, we don't need to build them.
It reduces package building time.
2018-03-15 04:33:11 +09:00
Glauber Costa
89b28a4bea sstables: rework summary entries reading
Like we did for generic arrays, let's move away from resize() in trying
to read summary entries and move to a reserve/push pattern.

I have tested this patch by reading a summary file that is a couple of MB big.
Stalls up to 20ms were seen. After applying this patch, no such stalls
are present.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 13:35:15 -04:00
Glauber Costa
a33f0d6f92 sstables: avoid calls to resize for vectors
resize is considered harmful, since it will attempt to allocate memory
and initialize each element of the vector. This can cause reactor stalls
that correlate with latency peaks.

A better idiom is reserve first - so we know we will have enough memory
and won't have to move contents - and push_back/emplace_back each
individual member.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 13:32:36 -04:00
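The reserve-then-push idiom from the commit message, as a small sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// reserve() performs one allocation without constructing elements, so the
// subsequent push_backs never reallocate and each one can sit next to a
// yield point. resize(n) would instead value-initialize all n elements in
// one uninterruptible pass.
std::vector<std::uint64_t> read_entries(std::size_t n) {
    std::vector<std::uint64_t> v;
    v.reserve(n);                 // one allocation, no per-element init
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(i * 2);       // in real code: parse one entry, maybe yield
    }
    return v;
}
```

The values pushed here are placeholders; the point is the allocation pattern, not the content.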
Glauber Costa
0d9488eae6 sstables: replace potentially large for loop with do_until
We are pushing ints here, so it shouldn't be that bad in practice.
But a potentially gigantic for loop is just asking for a stall, since we
never call need_preempt() in it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 11:58:03 -04:00
Glauber Costa
091b0f9d41 summary_entry: do not store key bytes in each summary entry
If we store a bytes_view instead of bytes, that has a trivial destructor
and then we don't need to destroy each element individually. To do that,
we allocate the data in a couple of large arrays which can be disposed of
easily and point to it.

We still can't destroy trivially because of the token.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 10:46:20 -04:00
Glauber Costa
d15bfbe548 tests: change tests to make summary non-copyable
Right now the summary can be copied, but in real life there is no reason
for this to be a requirement. Tests want it so that we can destroy a summary,
load another, and compare the two. We can achieve this by allowing the first
summary to be moved, and then we can still have a reference to the second.

I am about to make a change that will make the summary not copyable as a
requirement, so we need to do this first.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 10:46:20 -04:00
Glauber Costa
00d04b49a0 chunked_vector: do not iterate to destruct trivially destructible types
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 09:16:54 -04:00
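The chunked_vector optimization in the one-line commit above can be sketched as a compile-time branch on the element type (a simplified stand-in, not the real chunked_vector):

```cpp
#include <cstddef>
#include <new>
#include <type_traits>

// When T is trivially destructible there is nothing to run per element,
// so destruction skips the loop entirely and just frees the chunk.
inline int g_dtors_run = 0;
struct noisy { ~noisy() { ++g_dtors_run; } };   // non-trivial, for the test

template <typename T>
void destroy_chunk(T* data, std::size_t n) {
    if constexpr (!std::is_trivially_destructible_v<T>) {
        for (std::size_t i = 0; i < n; ++i) {
            data[i].~T();          // only paid for non-trivial types
        }
    }
    ::operator delete(static_cast<void*>(data));
}
```

For a chunk of, say, ints, the `if constexpr` branch compiles away and freeing the chunk is a single deallocation.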
Takuya ASADA
c636f552e0 dist/debian: switch to boost-1.65
We get the following compile error on Debian/Ubuntu with boost-1.63:

/opt/scylladb/include/boost/intrusive/pointer_plus_bits.hpp:76:48: error: '*((void*)& __tmp +136)' is used uninitialized in this function [-Werror=uninitialized]
       n = pointer(uintptr_t(p) | (uintptr_t(n) & Mask));
                                         ~~~~~~~~~~~~~~^~~~~~~

This is a known issue (https://github.com/boostorg/intrusive/issues/29), fixed
in boost-1.65.

Switch to boost-1.65 to fix the issue.

Fixes #3090
2018-03-14 22:13:24 +09:00
Avi Kivity
a0bc126ae2 Merge seastar upstream
* seastar bcfbe0c...a66cc34 (3):
  > reactor: fix sleep mode
  > cpu scheduler: don't penalize first group to run
  > Simple shellscript to find out which logical CPU's shards are mapped to
2018-03-14 14:14:21 +02:00
Jesse Haber-Kucharsky
6a360c2d17 auth: Grant all permissions to object creator
When a table, keyspace, or role is created, the creator now is
automatically granted all applicable permissions on the object.

This behavior is consistent with Apache Cassandra.

Fixes #3216.
2018-03-14 01:54:31 -04:00
Jesse Haber-Kucharsky
c502fe24ce auth: Unify handling for unsupported errors
Instead of some functions in `allow_all_authorizer` throwing exceptions
and others being silently pass-through, we consistently return exception
futures with `auth::unsupported_authorization_operation`. These errors
are converted to `invalid_request_exception` in the CQL error and
ignored where appropriate in the auth subsystem.
2018-03-14 01:54:28 -04:00
Jesse Haber-Kucharsky
97235445d3 auth: Fix life-time issue with parameter 2018-03-14 01:32:53 -04:00
Jesse Haber-Kucharsky
9117a689cf auth: Fix const correctness
This patch came about because of an important (and obvious, in
hindsight) realization: instances of the authorizer, role manager, and
authenticator are clients for access-control state and not the state
itself. This is reflected directly in Scylla: `auth::service` is
sharded across cores and this is possible because each instance queries
and modifies the same global state.

To give more examples, the value of an instance of `std::vector<int>` is
the structure of the container and its contents. The value of `int
file_descriptor` is an identifier for state maintained elsewhere.

Having watched an excellent talk by Herb Sutter [1] and having read an
informative blog post [2], it's clear that a member function marked
`const` communicates that the observable state of the instance is not
modified.

Thus, the member functions of the role-manager, authenticator, and
authorizer clients should be left non-`const` only if the state of the
client itself is observably changed. By this principle, member functions
which do not change the state of the client, but which mutate the global
state the client is associated with (for example, by creating a role)
are marked `const`.

The `start` (and `stop`) functions of the client have the dual role of
initializing (finalizing) both the local client state and the
external state; they are not marked `const`.

[1] https://herbsutter.com/2013/01/01/video-you-dont-know-const-and-mutable/

[2] http://talesofcpp.fusionfenix.com/post-2/episode-one-to-be-or-not-to-be-const
2018-03-14 01:32:43 -04:00
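The principle from the commit message — a `const` member function may mutate external shared state, as long as the client's own observable state is unchanged — can be sketched like this (all names invented):

```cpp
#include <set>
#include <string>

// External, shared access-control state; the client below is merely a
// handle to it, like auth::service instances sharded across cores.
struct role_store { std::set<std::string> roles; };

class role_manager_client {
    role_store& _store;   // reference to state maintained elsewhere
public:
    explicit role_manager_client(role_store& s) : _store(s) {}

    // const: the client itself is observably unchanged; only the
    // external store is mutated (analogous to creating a role).
    void create_role(const std::string& name) const {
        _store.roles.insert(name);
    }
    bool has_role(const std::string& name) const {
        return _store.roles.count(name) != 0;
    }
};
```

This mirrors the `int file_descriptor` analogy: the client's value is an identifier for state held elsewhere, so mutating that state does not change the client.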
Takuya ASADA
c3b2e2580a dist/debian: switch to gcc-7.3
We have hit the following bug on debug-mode binaries:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82560
Since it's fixed in gcc-7.3, we need to upgrade our gcc package.

See: https://groups.google.com/d/topic/scylladb-dev/RIdIpqMeTog/discussion
2018-03-08 00:06:32 +09:00
1438 changed files with 65590 additions and 19812 deletions

View File

@@ -1,3 +1,9 @@
This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
- [] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.
*Installation details*
Scylla version (or git commit hash):
Cluster size:

4
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file
View File

@@ -0,0 +1,4 @@
Scylla doesn't use pull-requests, please send a patch to the [mailing list](mailto:scylladb-dev@googlegroups.com) instead.
See our [contributing guidelines](../CONTRIBUTING.md) and our [Scylla development guidelines](../HACKING.md) for more information.
If you have any questions please don't hesitate to send a mail to the [dev list](mailto:scylladb-dev@googlegroups.com).

1
.gitignore vendored
View File

@@ -18,3 +18,4 @@ CMakeLists.txt.user
*.egg-info
__pycache__CMakeLists.txt.user
.gdbinit
resources

6
.gitmodules vendored
View File

@@ -6,9 +6,9 @@
path = swagger-ui
url = ../scylla-swagger-ui
ignore = dirty
[submodule "dist/ami/files/scylla-ami"]
path = dist/ami/files/scylla-ami
url = ../scylla-ami
[submodule "xxHash"]
path = xxHash
url = ../xxHash
[submodule "libdeflate"]
path = libdeflate
url = ../libdeflate

View File

@@ -138,4 +138,5 @@ target_include_directories(scylla PUBLIC
${SEASTAR_INCLUDE_DIRS}
${Boost_INCLUDE_DIRS}
xxhash
libdeflate
build/release/gen)

View File

@@ -20,7 +20,7 @@ $ git submodule update --init --recursive
Scylla depends on the system package manager for its development dependencies.
Running `./install_dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
Running `./install-dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
### Build system

View File

@@ -50,12 +50,12 @@ Then, to build an RPM, run:
./dist/redhat/build_rpm.sh
```
The built RPM is stored in ``/var/lib/mock/<configuration>/result`` directory.
The built RPM is stored in the ``build/mock/<configuration>/result`` directory.
For example, on Fedora 21 mock reports the following:
```
INFO: Done(scylla-server-0.00-1.fc21.src.rpm) Config(default) 20 minutes 7 seconds
INFO: Results and/or logs in: /var/lib/mock/fedora-21-x86_64/result
INFO: Results and/or logs in: build/mock/fedora-21-x86_64/result
```
## Building Fedora-based Docker image

View File

@@ -1,6 +1,6 @@
#!/bin/sh
VERSION=2.2.2
VERSION=3.0.11
if test -f version
then

View File

@@ -455,7 +455,7 @@
"operations":[
{
"method":"GET",
"summary":"Returns a list of filenames that contain the given key on this node",
"summary":"Returns a list of sstable filenames that contain the given partition key on this node",
"type":"array",
"items":{
"type":"string"
@@ -475,7 +475,7 @@
},
{
"name":"key",
"description":"The key",
"description":"The partition key. In a composite-key scenario, use ':' to separate the columns in the key.",
"required":true,
"allowMultiple":false,
"type":"string",

30
api/api-doc/config.json Normal file
View File

@@ -0,0 +1,30 @@
"/v2/config/{id}": {
"get": {
"description": "Return a config value",
"operationId": "find_config_id",
"produces": [
"application/json"
],
"tags": ["config"],
"parameters": [
{
"name": "id",
"in": "path",
"description": "ID of config to return",
"required": true,
"type": "string"
}
],
"responses": {
"200": {
"description": "Config value"
},
"default": {
"description": "unexpected error",
"schema": {
"$ref": "#/definitions/ErrorModel"
}
}
}
}
}

View File

@@ -2129,6 +2129,41 @@
]
}
]
},
{
"path":"/storage_service/view_build_statuses/{keyspace}/{view}",
"operations":[
{
"method":"GET",
"summary":"Gets the progress of a materialized view build",
"type":"array",
"items":{
"type":"mapper"
},
"nickname":"view_build_statuses",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"view",
"description":"View name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
],
"models":{

View File

@@ -39,6 +39,7 @@
#include "http/exception.hh"
#include "stream_manager.hh"
#include "system.hh"
#include "api/config.hh"
namespace api {
@@ -65,6 +66,7 @@ future<> set_server_init(http_context& ctx) {
rb->set_api_doc(r);
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
set_config(rb02, ctx, r);
rb->register_function(r, "system",
"The system related API");
set_system(ctx, r);

View File

@@ -429,7 +429,7 @@ void set_column_family(http_context& ctx, routes& r) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
res.merge(i->get_stats_metadata().estimated_column_count);
res.merge(i->get_stats_metadata().estimated_cells_count);
}
return res;
},
@@ -905,5 +905,20 @@ void set_column_family(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(res);
});
});
cf::get_sstables_for_key.set(r, [&ctx](std::unique_ptr<request> req) {
auto key = req->get_query_param("key");
auto uuid = get_uuid(req->param["name"], ctx.db.local());
return ctx.db.map_reduce0([key, uuid] (database& db) {
return db.find_column_family(uuid).get_sstables_by_partition_key(key);
}, std::unordered_set<sstring>(),
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
a.insert(b.begin(),b.end());
return a;
}).then([](const std::unordered_set<sstring>& res) {
return make_ready_future<json::json_return_type>(container_to_vec(res));
});
});
}
}

View File

@@ -24,6 +24,7 @@
#include "api.hh"
#include "api/api-doc/column_family.json.hh"
#include "database.hh"
#include <any>
namespace api {
@@ -37,9 +38,15 @@ template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([mapper, uuid](database& db) {
return mapper(db.find_column_family(uuid));
}, init, reducer);
using mapper_type = std::function<std::any (database&)>;
using reducer_type = std::function<std::any (std::any, std::any)>;
return ctx.db.map_reduce0(mapper_type([mapper, uuid](database& db) {
return I(mapper(db.find_column_family(uuid)));
}), std::any(std::move(init)), reducer_type([reducer = std::move(reducer)] (std::any a, std::any b) mutable {
return I(reducer(std::any_cast<I>(std::move(a)), std::any_cast<I>(std::move(b))));
})).then([] (std::any r) {
return std::any_cast<I>(std::move(r));
});
}
@@ -51,35 +58,42 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& n
});
}
template<class Mapper, class I, class Reducer, class Result>
future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer, Result result) {
auto uuid = get_uuid(name, ctx.db.local());
return ctx.db.map_reduce0([mapper, uuid](database& db) {
return mapper(db.find_column_family(uuid));
}, init, reducer);
}
template<class Mapper, class I, class Reducer, class Result>
future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,
Mapper mapper, Reducer reducer, Result result) {
return map_reduce_cf_raw(ctx, name, init, mapper, reducer, result).then([result](const I& res) mutable {
return map_reduce_cf_raw(ctx, name, init, mapper, reducer).then([result](const I& res) mutable {
result = res;
return make_ready_future<json::json_return_type>(result);
});
}
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, I init,
Mapper mapper, Reducer reducer) {
return ctx.db.map_reduce0([mapper, init, reducer](database& db) {
struct map_reduce_column_families_locally {
std::any init;
std::function<std::any (column_family&)> mapper;
std::function<std::any (std::any, std::any)> reducer;
std::any operator()(database& db) const {
auto res = init;
for (auto i : db.get_column_families()) {
res = reducer(res, mapper(*i.second.get()));
}
return res;
}, init, reducer);
}
};
template<class Mapper, class I, class Reducer>
future<I> map_reduce_cf_raw(http_context& ctx, I init,
Mapper mapper, Reducer reducer) {
using mapper_type = std::function<std::any (column_family&)>;
using reducer_type = std::function<std::any (std::any, std::any)>;
auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (column_family& cf) mutable {
return I(mapper(cf));
});
auto wrapped_reducer = reducer_type([reducer = std::move(reducer)] (std::any a, std::any b) mutable {
return I(reducer(std::any_cast<I>(std::move(a)), std::any_cast<I>(std::move(b))));
});
return ctx.db.map_reduce0(map_reduce_column_families_locally{init, std::move(wrapped_mapper), wrapped_reducer}, std::any(init), wrapped_reducer).then([] (std::any res) {
return std::any_cast<I>(std::move(res));
});
}
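The type-erasure trick used in the diff above — wrapping a typed reducer into a `std::function` over `std::any` so the heavyweight map_reduce machinery is instantiated once rather than once per result type — can be reduced to this sketch:

```cpp
#include <any>
#include <functional>
#include <utility>

// The erased signature map_reduce would see: one instantiation for all
// result types. Typed values are recovered with std::any_cast at the edges.
using erased_reducer = std::function<std::any(std::any, std::any)>;

template <typename I, typename Reducer>
erased_reducer erase_reducer(Reducer reducer) {
    return [reducer](std::any a, std::any b) {
        return std::any(I(reducer(std::any_cast<I>(std::move(a)),
                                  std::any_cast<I>(std::move(b)))));
    };
}
```

The cost is one `std::any` box per reduction step, traded for much less template code generated across the many per-type callers.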

112
api/config.cc Normal file
View File

@@ -0,0 +1,112 @@
/*
* Copyright 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "api/config.hh"
#include "api/api-doc/config.json.hh"
#include "db/config.hh"
#include <sstream>
#include <boost/algorithm/string/replace.hpp>
namespace api {
template<class T>
json::json_return_type get_json_return_type(const T& val) {
return json::json_return_type(val);
}
/*
 * As commented, db::seed_provider_type is not used
 * and probably never will be.
*
* Just in case, we will return its name
*/
template<>
json::json_return_type get_json_return_type(const db::seed_provider_type& val) {
return json::json_return_type(val.class_name);
}
std::string format_type(const std::string& type) {
if (type == "int") {
return "integer";
}
return type;
}
future<> get_config_swagger_entry(const std::string& name, const std::string& description, const std::string& type, bool& first, output_stream<char>& os) {
std::stringstream ss;
if (first) {
first=false;
} else {
ss <<',';
};
ss << "\"/config/" << name <<"\": {"
"\"get\": {"
"\"description\": \"" << boost::replace_all_copy(boost::replace_all_copy(boost::replace_all_copy(description,"\n","\\n"),"\"", "''"), "\t", " ") <<"\","
"\"operationId\": \"find_config_"<< name <<"\","
"\"produces\": ["
"\"application/json\""
"],"
"\"tags\": [\"config\"],"
"\"parameters\": ["
"],"
"\"responses\": {"
"\"200\": {"
"\"description\": \"Config value\","
"\"schema\": {"
"\"type\": \"" << format_type(type) << "\""
"}"
"},"
"\"default\": {"
"\"description\": \"unexpected error\","
"\"schema\": {"
"\"$ref\": \"#/definitions/ErrorModel\""
"}"
"}"
"}"
"}"
"}";
return os.write(ss.str());
}
namespace cs = httpd::config_json;
#define _get_config_value(name, type, deflt, status, desc, ...) if (id == #name) {return get_json_return_type(ctx.db.local().get_config().name());}
#define _get_config_description(name, type, deflt, status, desc, ...) f = f.then([&os, &first] {return get_config_swagger_entry(#name, desc, #type, first, os);});
void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r) {
rb->register_function(r, [] (output_stream<char>& os) {
return do_with(true, [&os] (bool& first) {
auto f = make_ready_future();
_make_config_values(_get_config_description)
return f;
});
});
cs::find_config_id.set(r, [&ctx] (const_req r) {
auto id = r.param["id"];
_make_config_values(_get_config_value)
throw bad_param_exception(sstring("No such config entry: ") + id);
});
}
}

30
api/config.hh Normal file
View File

@@ -0,0 +1,30 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "api.hh"
#include <seastar/http/api_docs.hh>
namespace api {
void set_config(std::shared_ptr<api_registry_builder20> rb, http_context& ctx, routes& r);
}

View File

@@ -78,15 +78,17 @@ void set_storage_service(http_context& ctx, routes& r) {
});
});
ss::get_tokens.set(r, [] (const_req req) {
auto tokens = service::get_local_storage_service().get_token_metadata().sorted_tokens();
return container_to_vec(tokens);
ss::get_tokens.set(r, [] (std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_metadata().sorted_tokens(), [](const dht::token& i) {
return boost::lexical_cast<std::string>(i);
}));
});
ss::get_node_tokens.set(r, [] (const_req req) {
gms::inet_address addr(req.param["endpoint"]);
auto tokens = service::get_local_storage_service().get_token_metadata().get_tokens(addr);
return container_to_vec(tokens);
ss::get_node_tokens.set(r, [] (std::unique_ptr<request> req) {
gms::inet_address addr(req->param["endpoint"]);
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_token_metadata().get_tokens(addr), [](const dht::token& i) {
return boost::lexical_cast<std::string>(i);
}));
});
ss::get_commitlog.set(r, [&ctx](const_req req) {
@@ -107,11 +109,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_moving_nodes.set(r, [](const_req req) {
auto points = service::get_local_storage_service().get_token_metadata().get_moving_endpoints();
std::unordered_set<sstring> addr;
for (auto i: points) {
addr.insert(boost::lexical_cast<std::string>(i.second));
}
return container_to_vec(addr);
});
@@ -852,6 +850,15 @@ void set_storage_service(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
});
ss::view_build_statuses.set(r, [&ctx] (std::unique_ptr<request> req) {
auto keyspace = validate_keyspace(ctx, req->param);
auto view = req->param["view"];
return service::get_local_storage_service().view_build_statuses(std::move(keyspace), std::move(view)).then([] (std::unordered_map<sstring, sstring> status) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));
});
});
}
}

258
atomic_cell.cc Normal file
View File

@@ -0,0 +1,258 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "atomic_cell.hh"
#include "atomic_cell_or_collection.hh"
#include "types.hh"
/// LSA migrator for cells whose type is irrelevant
///
const data::type_imr_descriptor& no_type_imr_descriptor() {
static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());
return state;
}
atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())
);
}
atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {
auto& imr_data = type.imr_state(); // use the cell's own type, not the typeless descriptor
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())
);
}
static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)
{
using imr_object_type = imr::utils::object<data::cell::structure>;
// If the cell doesn't own any memory it is trivial and can be copied with
// memcpy.
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
if (!f.template get<data::cell::tags::external_data>()) {
data::cell::context ctx(f, imr_data.type_info());
// XXX: We may be better off storing the total cell size in memory. Measure!
auto size = data::cell::structure::serialized_object_size(ptr, ctx);
return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {
std::copy_n(ptr, size, dst);
}, &imr_data.lsa_migrator());
}
return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());
}
atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
: atomic_cell(type.imr_state().type_info(),
copy_cell(type.imr_state(), other._view.raw_pointer()))
{ }
atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
if (!_data.get()) {
return atomic_cell_or_collection();
}
auto& imr_data = type.imr_state();
return atomic_cell_or_collection(
copy_cell(imr_data, _data.get())
);
}
atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)
: _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))
{
}
static collection_mutation_view get_collection_mutation_view(const uint8_t* ptr)
{
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
auto ti = data::type_info::make_collection();
data::cell::context ctx(f, ti);
auto view = data::cell::structure::get_member<data::cell::tags::cell>(ptr).as<data::cell::tags::collection>(ctx);
auto dv = data::cell::variable_value::make_view(view, f.get<data::cell::tags::external_data>());
return collection_mutation_view { dv };
}
collection_mutation_view atomic_cell_or_collection::as_collection_mutation() const {
return get_collection_mutation_view(_data.get());
}
collection_mutation::collection_mutation(const collection_type_impl& type, collection_mutation_view v)
: _data(imr_object_type::make(data::cell::make_collection(v.data), &type.imr_state().lsa_migrator()))
{
}
collection_mutation::collection_mutation(const collection_type_impl& type, bytes_view v)
: _data(imr_object_type::make(data::cell::make_collection(v), &type.imr_state().lsa_migrator()))
{
}
collection_mutation::operator collection_mutation_view() const
{
return get_collection_mutation_view(_data.get());
}
bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const
{
auto ptr_a = _data.get();
auto ptr_b = other._data.get();
if (!ptr_a || !ptr_b) {
return !ptr_a && !ptr_b;
}
if (type.is_atomic()) {
auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);
auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);
if (a.timestamp() != b.timestamp()) {
return false;
}
if (a.is_live() != b.is_live()) {
return false;
}
if (a.is_live()) {
if (a.is_counter_update() != b.is_counter_update()) {
return false;
}
if (a.is_counter_update()) {
return a.counter_update_value() == b.counter_update_value();
}
if (a.is_live_and_has_ttl() != b.is_live_and_has_ttl()) {
return false;
}
if (a.is_live_and_has_ttl()) {
if (a.ttl() != b.ttl() || a.expiry() != b.expiry()) {
return false;
}
}
return a.value() == b.value();
}
return a.deletion_time() == b.deletion_time();
} else {
return as_collection_mutation().data == other.as_collection_mutation().data;
}
}
size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const
{
if (!_data.get()) {
return 0;
}
auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());
auto view = data::cell::structure::make_view(_data.get(), ctx);
auto flags = view.get<data::cell::tags::flags>();
size_t external_value_size = 0;
if (flags.get<data::cell::tags::external_data>()) {
if (flags.get<data::cell::tags::collection>()) {
external_value_size = get_collection_mutation_view(_data.get()).data.size_bytes();
} else {
auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);
external_value_size = cell_view.value_size();
}
// Add overhead of chunk headers. The last one is a special case.
external_value_size += (external_value_size - 1) / data::cell::maximum_external_chunk_length * data::cell::external_chunk_overhead;
external_value_size += data::cell::external_last_chunk_overhead;
}
return data::cell::structure::serialized_object_size(_data.get(), ctx)
+ imr_object_type::size_overhead + external_value_size;
}
std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection& c) {
if (!c._data.get()) {
return os << "{ null atomic_cell_or_collection }";
}
using dc = data::cell;
os << "{ ";
if (dc::structure::get_member<dc::tags::flags>(c._data.get()).get<dc::tags::collection>()) {
os << "collection";
} else {
os << "atomic cell";
}
return os << " @" << static_cast<const void*>(c._data.get()) << " }";
}


@@ -30,189 +30,51 @@
#include <cstdint>
#include <iosfwd>
#include <seastar/util/gcc6-concepts.hh>
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"
#include "utils/fragmented_temporary_buffer.hh"
template<typename T, typename Input>
static inline
void set_field(Input& v, unsigned offset, T val) {
reinterpret_cast<net::packed<T>*>(v.begin() + offset)->raw = net::hton(val);
}
#include "serializer.hh"
template<typename T>
static inline
T get_field(const bytes_view& v, unsigned offset) {
return net::ntoh(*reinterpret_cast<const net::packed<T>*>(v.begin() + offset));
}
class abstract_type;
class collection_type_impl;
class atomic_cell_or_collection;
using atomic_cell_value_view = data::value_view;
using atomic_cell_value_mutable_view = data::value_mutable_view;
/*
* Represents atomic cell layout. Works on serialized form.
*
* Layout:
*
* <live> := <int8_t:flags><int64_t:timestamp>(<int32_t:expiry><int32_t:ttl>)?<value>
* <dead> := <int8_t: 0><int64_t:timestamp><int32_t:deletion_time>
*/
class atomic_cell_type final {
private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr int8_t COUNTER_IN_PLACE_REVERT = 0x10;
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;
static constexpr unsigned expiry_size = 4;
static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;
static constexpr unsigned deletion_time_size = 4;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(bytes_view cell) {
return cell[0] & COUNTER_UPDATE_FLAG;
}
static bool is_counter_in_place_revert_set(bytes_view cell) {
return cell[0] & COUNTER_IN_PLACE_REVERT;
}
template<typename BytesContainer>
static void set_counter_in_place_revert(BytesContainer& cell, bool flag) {
cell[0] = (cell[0] & ~COUNTER_IN_PLACE_REVERT) | (flag * COUNTER_IN_PLACE_REVERT);
}
static bool is_live(const bytes_view& cell) {
return cell[0] & LIVE_FLAG;
}
static bool is_live_and_has_ttl(const bytes_view& cell) {
return cell[0] & EXPIRY_FLAG;
}
static bool is_dead(const bytes_view& cell) {
return !is_live(cell);
}
// Can be called on live and dead cells
static api::timestamp_type timestamp(const bytes_view& cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
template<typename BytesContainer>
static void set_timestamp(BytesContainer& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
private:
template<typename BytesView>
static BytesView do_get_value(BytesView cell) {
auto expiry_field_size = bool(cell[0] & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static bytes_view value(bytes_view cell) {
return do_get_value(cell);
}
static bytes_mutable_view value(bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(bytes_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(const bytes_view& cell) {
assert(is_dead(cell));
return gc_clock::time_point(gc_clock::duration(
get_field<int32_t>(cell, deletion_time_offset)));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::time_point expiry(const bytes_view& cell) {
assert(is_live_and_has_ttl(cell));
auto expiry = get_field<int32_t>(cell, expiry_offset);
return gc_clock::time_point(gc_clock::duration(expiry));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::duration ttl(const bytes_view& cell) {
assert(is_live_and_has_ttl(cell));
return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, deletion_time.time_since_epoch().count());
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, value_offset, value);
return b;
}
static managed_bytes make_live(api::timestamp_type timestamp, bytes_view value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size());
b[0] = EXPIRY_FLAG | LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, expiry_offset, expiry.time_since_epoch().count());
set_field(b, ttl_offset, ttl.count());
std::copy_n(value.begin(), value.size(), b.begin() + value_offset);
return b;
}
// make_live_from_serializer() is intended for users that need to serialise
// some object or objects to the format used in atomic_cell::value().
// With just make_live() the pattern would look as follows:
// 1. allocate a buffer and write to it serialised objects
// 2. pass that buffer to make_live()
// 3. make_live() needs to prepend some metadata to the cell value so it
// allocates a new buffer and copies the content of the original one
//
// The allocation and copy of a buffer can be avoided.
// make_live_from_serializer() allows the user code to specify the timestamp
// and size of the cell value as well as provide the serialiser function
// object, which would write the serialised value of the cell to the buffer
// given to it by make_live_from_serializer().
template<typename Serializer>
GCC6_CONCEPT(requires requires(Serializer serializer, bytes::iterator it) {
serializer(it);
})
static managed_bytes make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
serializer(b.begin() + value_offset);
return b;
}
template<typename ByteContainer>
friend class atomic_cell_base;
/// View of an atomic cell
template<mutable_view is_mutable>
class basic_atomic_cell_view {
protected:
data::cell::basic_atomic_cell_view<is_mutable> _view;
friend class atomic_cell;
};
public:
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;
protected:
explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)
: _view(std::move(v)) { }
basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)
: _view(data::cell::make_atomic_cell_view(ti, ptr))
{ }
template<typename ByteContainer>
class atomic_cell_base {
protected:
ByteContainer _data;
protected:
atomic_cell_base(ByteContainer&& data) : _data(std::forward<ByteContainer>(data)) { }
friend class atomic_cell_or_collection;
public:
bool is_counter_update() const {
return atomic_cell_type::is_counter_update(_data);
operator basic_atomic_cell_view<mutable_view::no>() const noexcept {
return basic_atomic_cell_view<mutable_view::no>(_view);
}
bool is_counter_in_place_revert_set() const {
return atomic_cell_type::is_counter_in_place_revert_set(_data);
void swap(basic_atomic_cell_view& other) noexcept {
using std::swap;
swap(_view, other._view);
}
bool is_counter_update() const {
return _view.is_counter_update();
}
bool is_live() const {
return atomic_cell_type::is_live(_data);
return _view.is_live();
}
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
@@ -221,122 +83,140 @@ public:
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return atomic_cell_type::is_live_and_has_ttl(_data);
return _view.is_expiring();
}
bool is_dead(gc_clock::time_point now) const {
return atomic_cell_type::is_dead(_data) || has_expired(now);
return !is_live() || has_expired(now);
}
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
return atomic_cell_type::timestamp(_data);
return _view.timestamp();
}
void set_timestamp(api::timestamp_type ts) {
atomic_cell_type::set_timestamp(_data, ts);
_view.set_timestamp(ts);
}
// Can be called on live cells only
auto value() const {
return atomic_cell_type::value(_data);
data::basic_value_view<is_mutable> value() const {
return _view.value();
}
// Can be called on live cells only
size_t value_size() const {
return _view.value_size();
}
bool is_value_fragmented() const {
return _view.is_value_fragmented();
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return atomic_cell_type::counter_update_value(_data);
return _view.counter_update_value();
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? atomic_cell_type::deletion_time(_data) : expiry() - ttl();
return !is_live() ? _view.deletion_time() : expiry() - ttl();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::time_point expiry() const {
return atomic_cell_type::expiry(_data);
return _view.expiry();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::duration ttl() const {
return atomic_cell_type::ttl(_data);
return _view.ttl();
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() <= now;
}
bytes_view serialize() const {
return _data;
}
void set_counter_in_place_revert(bool flag) {
atomic_cell_type::set_counter_in_place_revert(_data, flag);
return _view.serialize();
}
};
class atomic_cell_view final : public atomic_cell_base<bytes_view> {
atomic_cell_view(bytes_view data) : atomic_cell_base(std::move(data)) {}
public:
static atomic_cell_view from_bytes(bytes_view data) { return atomic_cell_view(data); }
class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {
atomic_cell_view(const data::type_info& ti, const uint8_t* data)
: basic_atomic_cell_view<mutable_view::no>(ti, data) {}
template<mutable_view is_mutable>
atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) { }
friend class atomic_cell;
public:
static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {
return atomic_cell_view(ti, data.get());
}
static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {
return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));
}
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
};
class atomic_cell_mutable_view final : public atomic_cell_base<bytes_mutable_view> {
atomic_cell_mutable_view(bytes_mutable_view data) : atomic_cell_base(std::move(data)) {}
class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data) {}
public:
static atomic_cell_mutable_view from_bytes(bytes_mutable_view data) { return atomic_cell_mutable_view(data); }
static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {
return atomic_cell_mutable_view(ti, data.get());
}
friend class atomic_cell;
};
class atomic_cell_ref final : public atomic_cell_base<managed_bytes&> {
public:
atomic_cell_ref(managed_bytes& buf) : atomic_cell_base(buf) {}
};
using atomic_cell_ref = atomic_cell_mutable_view;
class atomic_cell final : public atomic_cell_base<managed_bytes> {
atomic_cell(managed_bytes b) : atomic_cell_base(std::move(b)) {}
class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}
public:
atomic_cell(const atomic_cell&) = default;
class collection_member_tag;
using collection_member = bool_class<collection_member_tag>;
atomic_cell(atomic_cell&&) = default;
atomic_cell& operator=(const atomic_cell&) = default;
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&&) = default;
static atomic_cell from_bytes(managed_bytes b) {
return atomic_cell(std::move(b));
void swap(atomic_cell& other) noexcept {
basic_atomic_cell_view<mutable_view::yes>::swap(other);
_data.swap(other._data);
}
atomic_cell(atomic_cell_view other) : atomic_cell_base(managed_bytes{other._data}) {}
operator atomic_cell_view() const {
return atomic_cell_view(_data);
operator atomic_cell_view() const { return atomic_cell_view(_view); }
atomic_cell(const abstract_type& t, atomic_cell_view other);
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const bytes& value,
collection_member cm = collection_member::no) {
return make_live(type, timestamp, bytes_view(value), cm);
}
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
return atomic_cell_type::make_dead(timestamp, deletion_time);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value) {
return atomic_cell_type::make_live(timestamp, value);
}
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value) {
return make_live(timestamp, bytes_view(value));
}
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl)
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const bytes& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm = collection_member::no)
{
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
return make_live(type, timestamp, bytes_view(value), expiry, ttl, cm);
}
static atomic_cell make_live(api::timestamp_type timestamp, const bytes& value,
gc_clock::time_point expiry, gc_clock::duration ttl)
{
return make_live(timestamp, bytes_view(value), expiry, ttl);
}
static atomic_cell make_live(api::timestamp_type timestamp, bytes_view value, ttl_opt ttl) {
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, ttl_opt ttl, collection_member cm = collection_member::no) {
if (!ttl) {
return atomic_cell_type::make_live(timestamp, value);
return make_live(type, timestamp, value, cm);
} else {
return atomic_cell_type::make_live(timestamp, value, gc_clock::now() + *ttl, *ttl);
return make_live(type, timestamp, value, gc_clock::now() + *ttl, *ttl, cm);
}
}
template<typename Serializer>
static atomic_cell make_live_from_serializer(api::timestamp_type timestamp, size_t size, Serializer&& serializer) {
return atomic_cell_type::make_live_from_serializer(timestamp, size, std::forward<Serializer>(serializer));
}
static atomic_cell make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size);
friend class atomic_cell_or_collection;
friend std::ostream& operator<<(std::ostream& os, const atomic_cell& ac);
};
@@ -350,33 +230,24 @@ class collection_mutation_view;
// list: tbd, probably ugly
class collection_mutation {
public:
managed_bytes data;
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
collection_mutation() {}
collection_mutation(managed_bytes b) : data(std::move(b)) {}
collection_mutation(collection_mutation_view v);
collection_mutation(const collection_type_impl&, collection_mutation_view v);
collection_mutation(const collection_type_impl&, bytes_view bv);
operator collection_mutation_view() const;
};
class collection_mutation_view {
public:
bytes_view data;
bytes_view serialize() const { return data; }
static collection_mutation_view from_bytes(bytes_view v) { return { v }; }
atomic_cell_value_view data;
};
inline
collection_mutation::collection_mutation(collection_mutation_view v)
: data(v.data) {
}
inline
collection_mutation::operator collection_mutation_view() const {
return { data };
}
class column_definition;
int compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right);
void merge_column(const column_definition& def,
void merge_column(const abstract_type& def,
atomic_cell_or_collection& old,
const atomic_cell_or_collection& neww);


@@ -33,12 +33,15 @@ template<>
struct appending_hash<collection_mutation_view> {
template<typename Hasher>
void operator()(Hasher& h, collection_mutation_view cell, const column_definition& cdef) const {
auto m_view = collection_type_impl::deserialize_mutation_form(cell);
cell.data.with_linearized([&] (bytes_view cell_bv) {
auto ctype = static_pointer_cast<const collection_type_impl>(cdef.type);
auto m_view = ctype->deserialize_mutation_form(cell_bv);
::feed_hash(h, m_view.tomb);
for (auto&& key_and_value : m_view.cells) {
::feed_hash(h, key_and_value.first);
::feed_hash(h, key_and_value.second, cdef);
}
});
}
};
@@ -50,7 +53,9 @@ struct appending_hash<atomic_cell_view> {
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
::feed_hash(h, counter_cell_view(cell));
counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
::feed_hash(h, ccv);
});
return;
}
if (cell.is_live_and_has_ttl()) {
@@ -85,9 +90,9 @@ struct appending_hash<atomic_cell_or_collection> {
template<typename Hasher>
void operator()(Hasher& h, const atomic_cell_or_collection& c, const column_definition& cdef) const {
if (cdef.is_atomic()) {
feed_hash(h, c.as_atomic_cell(), cdef);
feed_hash(h, c.as_atomic_cell(cdef), cdef);
} else {
feed_hash(h, c.as_collection_mutation(), cdef);
}
}
};
};


@@ -25,42 +25,56 @@
#include "schema.hh"
#include "hashing.hh"
#include "imr/utils.hh"
// A variant type that can hold either an atomic_cell, or a serialized collection.
// Which type is stored is determined by the schema.
// Has an "empty" state.
// Objects moved-from are left in an empty state.
class atomic_cell_or_collection final {
managed_bytes _data;
// FIXME: This has made us lose small-buffer optimisation. Unfortunately,
// due to the changed cell format it would be less effective now, anyway.
// Measure the actual impact because any attempts to fix this will become
// irrelevant once rows are converted to the IMR as well, so maybe we can
// live with this like that.
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
private:
atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}
public:
atomic_cell_or_collection() = default;
atomic_cell_or_collection(atomic_cell_or_collection&&) = default;
atomic_cell_or_collection(const atomic_cell_or_collection&) = delete;
atomic_cell_or_collection& operator=(atomic_cell_or_collection&&) = default;
atomic_cell_or_collection& operator=(const atomic_cell_or_collection&) = delete;
atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell() const { return atomic_cell_view::from_bytes(_data); }
atomic_cell_ref as_atomic_cell_ref() { return { _data }; }
atomic_cell_mutable_view as_mutable_atomic_cell() { return atomic_cell_mutable_view::from_bytes(_data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm.data)) {}
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }
atomic_cell_or_collection copy(const abstract_type&) const;
explicit operator bool() const {
return !_data.empty();
return bool(_data);
}
bool can_use_mutable_view() const {
return !_data.is_fragmented();
static constexpr bool can_use_mutable_view() {
return true;
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) {
return std::move(data.data);
}
collection_mutation_view as_collection_mutation() const {
return collection_mutation_view{_data};
}
bytes_view serialize() const {
return _data;
}
bool operator==(const atomic_cell_or_collection& other) const {
return _data == other._data;
}
size_t external_memory_usage() const {
return _data.external_memory_usage();
void swap(atomic_cell_or_collection& other) noexcept {
_data.swap(other._data);
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }
collection_mutation_view as_collection_mutation() const;
bytes_view serialize() const;
bool equals(const abstract_type& type, const atomic_cell_or_collection& other) const;
size_t external_memory_usage(const abstract_type&) const;
friend std::ostream& operator<<(std::ostream&, const atomic_cell_or_collection&);
};
namespace std {
inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept
{
a.swap(b);
}
}


@@ -28,6 +28,7 @@
#include "database.hh"
#include "schema_builder.hh"
#include "service/migration_manager.hh"
#include "timeout_config.hh"
namespace auth {
@@ -86,12 +87,24 @@ future<> create_metadata_table_if_missing(
return mm.announce_new_column_family(b.build(), false);
}
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db) {
future<> wait_for_schema_agreement(::service::migration_manager& mm, const database& db, seastar::abort_source& as) {
static const auto pause = [] { return sleep(std::chrono::milliseconds(500)); };
return do_until([&db] { return db.get_version() != database::empty_version; }, pause).then([&mm] {
return do_until([&mm] { return mm.have_schema_agreement(); }, pause);
return do_until([&db, &as] {
as.check();
return db.get_version() != database::empty_version;
}, pause).then([&mm, &as] {
return do_until([&mm, &as] {
as.check();
return mm.have_schema_agreement();
}, pause);
});
}
const timeout_config& internal_distributed_timeout_config() noexcept {
static const auto t = 5s;
static const timeout_config tc{t, t, t, t, t, t, t};
return tc;
}
}
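The new `wait_for_schema_agreement` overload above threads an `abort_source` through both polling loops so shutdown can interrupt the wait instead of sleeping forever. A minimal stand-alone sketch of that pattern, with `std::atomic<bool>` standing in for `seastar::abort_source` and a blocking sleep in place of futures (all names here are illustrative, not Scylla's API):

```cpp
#include <atomic>
#include <chrono>
#include <stdexcept>
#include <thread>

// Poll a condition at a fixed interval, but bail out promptly when an
// abort has been requested -- the analogue of calling as.check() inside
// each do_until iteration above.
template <typename Cond>
void poll_until(Cond cond, const std::atomic<bool>& aborted,
                std::chrono::milliseconds pause = std::chrono::milliseconds(1)) {
    while (!cond()) {
        if (aborted.load()) {
            throw std::runtime_error("aborted"); // as.check() analogue
        }
        std::this_thread::sleep_for(pause);
    }
}
```

With this shape, the `stop()` paths above only need to flip the flag and swallow the resulting exception.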


@@ -38,6 +38,7 @@
using namespace std::chrono_literals;
class database;
class timeout_config;
namespace service {
class migration_manager;
@@ -80,6 +81,11 @@ future<> create_metadata_table_if_missing(
stdx::string_view cql,
::service::migration_manager&);
future<> wait_for_schema_agreement(::service::migration_manager&, const database&);
future<> wait_for_schema_agreement(::service::migration_manager&, const database&, seastar::abort_source&);
///
/// Time-outs for internal, non-local CQL queries.
///
const timeout_config& internal_distributed_timeout_config() noexcept;
}


@@ -103,6 +103,7 @@ future<bool> default_authorizer::any_granted() const {
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
@@ -115,7 +116,8 @@ future<> default_authorizer::migrate_legacy_metadata() const {
return _qp.process(
query,
db::consistency_level::LOCAL_ONE).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::LOCAL_ONE,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
@@ -158,7 +160,7 @@ future<> default_authorizer::start() {
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
wait_for_schema_agreement(_migration_manager, _qp.db().local(), _as).get0();
if (legacy_metadata_exists()) {
if (!any_granted().get0()) {
@@ -176,7 +178,7 @@ future<> default_authorizer::start() {
future<> default_authorizer::stop() {
_as.request_abort();
return _finished.handle_exception_type([](const sleep_aborted&) {});
return _finished.handle_exception_type([](const sleep_aborted&) {}).handle_exception_type([](const abort_requested_exception&) {});
}
future<permission_set>
@@ -196,6 +198,7 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
@@ -225,6 +228,7 @@ default_authorizer::modify(
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
@@ -250,6 +254,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
@@ -277,6 +282,7 @@ future<> default_authorizer::revoke_all(stdx::string_view role_name) const {
return _qp.process(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
@@ -297,6 +303,7 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
@@ -314,6 +321,7 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.process(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
[resource](auto ep) {
try {


@@ -41,11 +41,6 @@
#include "auth/password_authenticator.hh"
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
#include <algorithm>
#include <chrono>
#include <random>
@@ -55,6 +50,7 @@ extern "C" {
#include "auth/authenticated_user.hh"
#include "auth/common.hh"
#include "auth/passwords.hh"
#include "auth/roles-metadata.hh"
#include "cql3/untyped_result_set.hh"
#include "log.hh"
@@ -82,6 +78,8 @@ static const class_registrator<
cql3::query_processor&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
password_authenticator::~password_authenticator() {
}
@@ -91,80 +89,8 @@ password_authenticator::password_authenticator(cql3::query_processor& qp, ::serv
, _stopped(make_ready_future<>()) {
}
// TODO: blowfish
// Origin uses Java bcrypt library, i.e. blowfish salt
// generation and hashing, which is arguably a "better"
// password hash than sha/md5 versions usually available in
// crypt_r. Otoh, glibc 2.7+ uses a modified sha512 algo
// which should be the same order of safe, so the only
// real issue should be salted hash compatibility with
// origin if importing system tables from there.
//
// Since bcrypt/blowfish is _not_ (afaict) available
// as a dev package/lib on most linux distros, we'd have to
// copy and compile for example OWL crypto
// (http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/glibc/crypt_blowfish/)
// to be fully bit-compatible.
//
// Until we decide this is needed, let's just use crypt_r,
// and some old-fashioned random salt generation.
static constexpr size_t rand_bytes = 16;
static thread_local crypt_data tlcrypt = { 0, };
static sstring hashpw(const sstring& pass, const sstring& salt) {
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (res == nullptr) {
throw std::system_error(errno, std::system_category());
}
return res;
}
static bool checkpw(const sstring& pass, const sstring& salted_hash) {
auto tmp = hashpw(pass, salted_hash);
return tmp == salted_hash;
}
static sstring gensalt() {
static sstring prefix;
std::random_device rd;
std::default_random_engine e1(rd());
std::uniform_int_distribution<char> dist;
sstring valid_salt = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789./";
sstring input(rand_bytes, 0);
for (char&c : input) {
c = valid_salt[dist(e1) % valid_salt.size()];
}
sstring salt;
if (!prefix.empty()) {
return prefix + input;
}
// Try in order:
// blowfish 2011 fix, blowfish, sha512, sha256, md5
for (sstring pfx : { "$2y$", "$2a$", "$6$", "$5$", "$1$" }) {
salt = pfx + input;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
prefix = pfx;
return salt;
}
}
throw std::runtime_error("Could not initialize hashing algorithm");
}
static sstring hashpw(const sstring& pass) {
return hashpw(pass, gensalt());
}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
return utf8_type->deserialize(row.get_blob(SALTED_HASH)) != data_value::make_null(utf8_type);
return !row.get_or<sstring>(SALTED_HASH, "").empty();
}
static const sstring update_row_query = sprint(
@@ -185,7 +111,8 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.process(
query,
db::consistency_level::QUORUM).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
@@ -193,6 +120,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.process(
update_row_query,
consistency_for_user(username),
internal_distributed_timeout_config(),
{std::move(salted_hash), username}).discard_result();
}).finally([results] {});
}).then([] {
@@ -209,7 +137,8 @@ future<> password_authenticator::create_default_if_missing() const {
return _qp.process(
update_row_query,
db::consistency_level::QUORUM,
{hashpw(DEFAULT_USER_PASSWORD), DEFAULT_USER_NAME}).then([](auto&&) {
internal_distributed_timeout_config(),
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME}).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
}
@@ -220,8 +149,6 @@ future<> password_authenticator::create_default_if_missing() const {
future<> password_authenticator::start() {
return once_among_shards([this] {
gensalt(); // do this once to determine usable hashing
auto f = create_metadata_table_if_missing(
meta::roles_table::name,
_qp,
@@ -230,7 +157,7 @@ future<> password_authenticator::start() {
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
wait_for_schema_agreement(_migration_manager, _qp.db().local(), _as).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_salted_hash).get0()) {
if (legacy_metadata_exists()) {
@@ -255,7 +182,7 @@ future<> password_authenticator::start() {
future<> password_authenticator::stop() {
_as.request_abort();
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
return _stopped.handle_exception_type([] (const sleep_aborted&) { }).handle_exception_type([](const abort_requested_exception&) {});
}
db::consistency_level password_authenticator::consistency_for_user(stdx::string_view role_name) {
@@ -308,12 +235,17 @@ future<authenticated_user> password_authenticator::authenticate(
return _qp.process(
query,
consistency_for_user(username),
internal_distributed_timeout_config(),
{username},
true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
if (res->empty() || !checkpw(password, res->one().get_as<sstring>(SALTED_HASH))) {
auto salted_hash = std::experimental::optional<sstring>();
if (!res->empty()) {
salted_hash = res->one().get_opt<sstring>(SALTED_HASH);
}
if (!salted_hash || !passwords::check(password, *salted_hash)) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
return make_ready_future<authenticated_user>(username);
@@ -335,7 +267,8 @@ future<> password_authenticator::create(stdx::string_view role_name, const authe
return _qp.process(
update_row_query,
consistency_for_user(role_name),
{hashpw(*options.password), sstring(role_name)}).discard_result();
internal_distributed_timeout_config(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
future<> password_authenticator::alter(stdx::string_view role_name, const authentication_options& options) const {
@@ -352,7 +285,8 @@ future<> password_authenticator::alter(stdx::string_view role_name, const authen
return _qp.process(
query,
consistency_for_user(role_name),
{hashpw(*options.password), sstring(role_name)}).discard_result();
internal_distributed_timeout_config(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
future<> password_authenticator::drop(stdx::string_view name) const {
@@ -362,7 +296,10 @@ future<> password_authenticator::drop(stdx::string_view name) const {
meta::roles_table::qualified_name(),
meta::roles_table::role_col_name);
return _qp.process(query, consistency_for_user(name), {sstring(name)}).discard_result();
return _qp.process(
query, consistency_for_user(name),
internal_distributed_timeout_config(),
{sstring(name)}).discard_result();
}
future<custom_options> password_authenticator::query_custom_options(stdx::string_view role_name) const {

auth/passwords.cc Normal file

@@ -0,0 +1,84 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "auth/passwords.hh"
#include <cerrno>
#include <optional>
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
namespace auth::passwords {
static thread_local crypt_data tlcrypt = { 0, };
namespace detail {
scheme identify_best_supported_scheme() {
const auto all_schemes = { scheme::bcrypt_y, scheme::bcrypt_a, scheme::sha_512, scheme::sha_256, scheme::md5 };
// "Random", for testing schemes.
const sstring random_part_of_salt = "aaaabbbbccccdddd";
for (scheme c : all_schemes) {
const sstring salt = sstring(prefix_for_scheme(c)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
return c;
}
}
throw no_supported_schemes();
}
sstring hash_with_salt(const sstring& pass, const sstring& salt) {
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (!res || (res[0] == '*')) {
throw std::system_error(errno, std::system_category());
}
return res;
}
const char* prefix_for_scheme(scheme c) noexcept {
switch (c) {
case scheme::bcrypt_y: return "$2y$";
case scheme::bcrypt_a: return "$2a$";
case scheme::sha_512: return "$6$";
case scheme::sha_256: return "$5$";
case scheme::md5: return "$1$";
default: return nullptr;
}
}
} // namespace detail
no_supported_schemes::no_supported_schemes()
: std::runtime_error("No allowed hashing schemes are supported on this system") {
}
bool check(const sstring& pass, const sstring& salted_hash) {
return detail::hash_with_salt(pass, salted_hash) == salted_hash;
}
} // namespace auth::passwords

auth/passwords.hh Normal file

@@ -0,0 +1,125 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <random>
#include <stdexcept>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace auth::passwords {
class no_supported_schemes : public std::runtime_error {
public:
no_supported_schemes();
};
///
/// Apache Cassandra uses a library to provide the bcrypt scheme. Many Linux implementations do not support bcrypt, so
/// we support alternatives. The cost is loss of direct compatibility with Apache Cassandra system tables.
///
enum class scheme {
bcrypt_y,
bcrypt_a,
sha_512,
sha_256,
md5
};
namespace detail {
template <typename RandomNumberEngine>
sstring generate_random_salt_bytes(RandomNumberEngine& g) {
static const sstring valid_bytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789./";
static constexpr std::size_t num_bytes = 16;
std::uniform_int_distribution<std::size_t> dist(0, valid_bytes.size() - 1);
sstring result(num_bytes, 0);
for (char& c : result) {
c = valid_bytes[dist(g)];
}
return result;
}
///
/// Test each allowed hashing scheme and report the best supported one on the current system.
///
/// \throws \ref no_supported_schemes when none of the known schemes is supported.
///
scheme identify_best_supported_scheme();
const char* prefix_for_scheme(scheme) noexcept;
///
/// Generate an implementation-specific salt string for hashing passwords.
///
/// The `RandomNumberEngine` is used to generate the string, which has an implementation-specific length.
///
/// \throws \ref no_supported_schemes when no known hashing schemes are supported on the system.
///
template <typename RandomNumberEngine>
sstring generate_salt(RandomNumberEngine& g) {
static const scheme scheme = identify_best_supported_scheme();
static const sstring prefix = sstring(prefix_for_scheme(scheme));
return prefix + generate_random_salt_bytes(g);
}
///
/// Hash a password combined with an implementation-specific salt string.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
sstring hash_with_salt(const sstring& pass, const sstring& salt);
} // namespace detail
///
/// Run a one-way hashing function on cleartext to produce encrypted text.
///
/// Prior to applying the hashing function, random salt is appended to the cleartext. The random salt bytes are generated
/// according to the random number engine `g`.
///
/// The result is the encrypted ciphertext, together with the salt used, in an implementation-specific format.
///
/// \throws \ref std::system_error when the underlying implementation fails to hash the cleartext.
///
template <typename RandomNumberEngine>
sstring hash(const sstring& pass, RandomNumberEngine& g) {
return detail::hash_with_salt(pass, detail::generate_salt(g));
}
///
/// Check that cleartext matches previously hashed cleartext with salt.
///
/// \ref salted_hash is the result of invoking \ref hash, which is the implementation-specific combination of the hashed
/// password and the salt that was generated for it.
///
/// \returns `true` if the cleartext matches the salted hash.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
bool check(const sstring& pass, const sstring& salted_hash);
} // namespace auth::passwords
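The salt helper `generate_random_salt_bytes` above draws 16 characters from the crypt(3) salt alphabet using the caller's random engine. A self-contained sketch of the same idea, with `std::string` standing in for seastar's `sstring`:

```cpp
#include <random>
#include <string>

// Draw 16 characters uniformly from the crypt(3) salt alphabet
// [a-zA-Z0-9./], mirroring detail::generate_random_salt_bytes above.
std::string generate_random_salt_bytes(std::default_random_engine& g) {
    static const std::string valid_bytes =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789./";
    constexpr std::size_t num_bytes = 16;
    std::uniform_int_distribution<std::size_t> dist(0, valid_bytes.size() - 1);
    std::string result(num_bytes, '\0');
    for (char& c : result) {
        c = valid_bytes[dist(g)];
    }
    return result;
}
```

The real salt is then this string prefixed with a scheme tag such as `$6$`, chosen once by `identify_best_supported_scheme`.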


@@ -72,12 +72,14 @@ future<bool> default_role_row_satisfies(
return qp.process(
query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -101,7 +103,8 @@ future<bool> any_nondefault_role_row_satisfies(
return do_with(std::move(p), [&qp](const auto& p) {
return qp.process(
query,
db::consistency_level::QUORUM).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}


@@ -37,7 +37,7 @@
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/config.hh"
#include "db/consistency_level.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "service/migration_listener.hh"
@@ -184,7 +184,9 @@ future<> service::start() {
return once_among_shards([this] {
return create_keyspace_if_missing();
}).then([this] {
return when_all_succeed(_role_manager->start(), _authorizer->start(), _authenticator->start());
return _role_manager->start().then([this] {
return when_all_succeed(_authorizer->start(), _authenticator->start());
});
}).then([this] {
_permissions_cache = std::make_unique<permissions_cache>(_permissions_cache_config, *this, log);
}).then([this] {
@@ -196,6 +198,10 @@ future<> service::start() {
}
future<> service::stop() {
// Only one of the shards has the listener registered, but let's try to
// unregister on each one just to make sure.
_migration_manager.unregister_listener(_migration_listener.get());
return _permissions_cache->stop().then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop());
});
@@ -223,6 +229,7 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.process(
default_user_query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -232,6 +239,7 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.process(
default_user_query,
db::consistency_level::QUORUM,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -240,7 +248,8 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.process(
all_users_query,
db::consistency_level::QUORUM).then([](auto results) {
db::consistency_level::QUORUM,
infinite_timeout_config).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});


@@ -89,6 +89,7 @@ static future<stdx::optional<record>> find_record(cql3::query_processor& qp, std
return qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -173,6 +174,7 @@ future<> standard_role_manager::create_default_role_if_missing() const {
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
@@ -198,7 +200,8 @@ future<> standard_role_manager::migrate_legacy_metadata() const {
return _qp.process(
query,
db::consistency_level::QUORUM).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_as<bool>("super");
@@ -224,7 +227,7 @@ future<> standard_role_manager::start() {
return this->create_metadata_tables_if_missing().then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
wait_for_schema_agreement(_migration_manager, _qp.db().local()).get0();
wait_for_schema_agreement(_migration_manager, _qp.db().local(), _as).get0();
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get0()) {
if (this->legacy_metadata_exists()) {
@@ -248,7 +251,7 @@ future<> standard_role_manager::start() {
future<> standard_role_manager::stop() {
_as.request_abort();
return _stopped.handle_exception_type([] (const sleep_aborted&) { });
return _stopped.handle_exception_type([] (const sleep_aborted&) { }).handle_exception_type([](const abort_requested_exception&) {});
}
future<> standard_role_manager::create_or_replace(stdx::string_view role_name, const role_config& c) const {
@@ -260,6 +263,7 @@ future<> standard_role_manager::create_or_replace(stdx::string_view role_name, c
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
}
@@ -303,6 +307,7 @@ standard_role_manager::alter(stdx::string_view role_name, const role_config_upda
build_column_assignments(u),
meta::roles_table::role_col_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
});
}
@@ -322,6 +327,7 @@ future<> standard_role_manager::drop(stdx::string_view role_name) const {
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
@@ -361,6 +367,7 @@ future<> standard_role_manager::drop(stdx::string_view role_name) const {
return _qp.process(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name)}).discard_result();
};
@@ -387,6 +394,7 @@ standard_role_manager::modify_membership(
return _qp.process(
query,
consistency_for_role(grantee_name),
internal_distributed_timeout_config(),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
};
@@ -398,6 +406,7 @@ standard_role_manager::modify_membership(
"INSERT INTO %s (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
@@ -406,6 +415,7 @@ standard_role_manager::modify_membership(
"DELETE FROM %s WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name()),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
}
@@ -506,7 +516,10 @@ future<role_set> standard_role_manager::query_all() const {
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
return _qp.process(query, db::consistency_level::QUORUM).then([](::shared_ptr<cql3::untyped_result_set> results) {
return _qp.process(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(


@@ -77,7 +77,7 @@ protected:
, _io_priority(iop)
, _interval(interval)
, _update_timer([this] { adjust(); })
, _control_points({{0,0}})
, _control_points()
, _current_backlog(std::move(backlog))
, _inflight_update(make_ready_future<>())
{
@@ -96,6 +96,12 @@ protected:
}
virtual ~backlog_controller() {}
public:
backlog_controller(backlog_controller&&) = default;
float backlog_of_shares(float shares) const;
seastar::scheduling_group sg() {
return _scheduling_group;
}
};
// memtable flush CPU controller.
@@ -119,7 +125,7 @@ public:
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
flush_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, float soft_limit, std::function<float()> current_dirty)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{soft_limit, 100}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::vector<backlog_controller::control_point>({{0.0, 0.0}, {soft_limit, 10}, {soft_limit + (hard_dirty_limit - soft_limit) / 2, 200} , {hard_dirty_limit, 1000}}),
std::move(current_dirty)
)
{}
@@ -128,10 +134,12 @@ public:
class compaction_controller : public backlog_controller {
public:
static constexpr unsigned normalization_factor = 30;
static constexpr float disable_backlog = std::numeric_limits<double>::infinity();
static constexpr float backlog_disabled(float backlog) { return std::isinf(backlog); }
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, float static_shares) : backlog_controller(sg, iop, static_shares) {}
compaction_controller(seastar::scheduling_group sg, const ::io_priority_class& iop, std::chrono::milliseconds interval, std::function<float()> current_backlog)
: backlog_controller(sg, iop, std::move(interval),
std::vector<backlog_controller::control_point>({{0.5, 10}, {1.5, 100} , {normalization_factor, 1000}}),
std::vector<backlog_controller::control_point>({{0.0, 50}, {1.5, 100} , {normalization_factor, 1000}}),
std::move(current_backlog)
)
{}
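The `control_point` vectors above (e.g. `{{0.0, 0.0}, {soft_limit, 10}, ..., {hard_dirty_limit, 1000}}`) map a backlog measure to CPU shares; between neighbouring points the controller presumably interpolates linearly. A hypothetical sketch of that interpolation (`control_point` and `interpolate` here are illustrative, not the actual `backlog_controller` API):

```cpp
#include <algorithm>
#include <vector>

struct control_point { float input; float output; };

// Given control points sorted by input, clamp outside the range and
// interpolate linearly between the two surrounding points inside it.
float interpolate(const std::vector<control_point>& pts, float x) {
    if (x <= pts.front().input) return pts.front().output;
    if (x >= pts.back().input) return pts.back().output;
    auto hi = std::upper_bound(pts.begin(), pts.end(), x,
        [](float v, const control_point& p) { return v < p.input; });
    auto lo = hi - 1;
    float t = (x - lo->input) / (hi->input - lo->input);
    return lo->output + t * (hi->output - lo->output);
}
```

Under this reading, the patch's change from `{{0,0}}` to an empty vector, and the new `{0.0, 0.0}` starting point for the flush controller, only adjust where that curve begins.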


@@ -29,12 +29,16 @@
#include <functional>
#include "utils/mutable_view.hh"
using bytes = basic_sstring<int8_t, uint32_t, 31>;
using bytes = basic_sstring<int8_t, uint32_t, 31, false>;
using bytes_view = std::experimental::basic_string_view<int8_t>;
using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
using bytes_opt = std::experimental::optional<bytes>;
using sstring_view = std::experimental::string_view;
inline sstring_view to_sstring_view(bytes_view view) {
return {reinterpret_cast<const char*>(view.data()), view.size()};
}
namespace std {
template <>
@@ -78,3 +82,11 @@ struct appending_hash<bytes_view> {
h.update(reinterpret_cast<const char*>(v.begin()), v.size() * sizeof(bytes_view::value_type));
}
};
inline int32_t compare_unsigned(bytes_view v1, bytes_view v2) {
auto n = memcmp(v1.begin(), v2.begin(), std::min(v1.size(), v2.size()));
if (n) {
return n;
}
return (int32_t) (v1.size() - v2.size());
}


@@ -38,7 +38,7 @@ class bytes_ostream {
public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
static constexpr size_type max_chunk_size() { return 16 * 1024; }
static constexpr size_type max_chunk_size() { return 128 * 1024; }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
@@ -57,16 +57,17 @@ private:
value_type data[0];
void operator delete(void* ptr) { free(ptr); }
};
// FIXME: consider increasing chunk size as the buffer grows
static constexpr size_type chunk_size{512};
static constexpr size_type default_chunk_size{512};
private:
std::unique_ptr<chunk> _begin;
chunk* _current;
size_type _size;
size_type _initial_chunk_size = default_chunk_size;
public:
class fragment_iterator : public std::iterator<std::input_iterator_tag, bytes_view> {
chunk* _current;
chunk* _current = nullptr;
public:
fragment_iterator() = default;
fragment_iterator(chunk* current) : _current(current) {}
fragment_iterator(const fragment_iterator&) = default;
fragment_iterator& operator=(const fragment_iterator&) = default;
@@ -101,13 +102,13 @@ private:
}
// Figure out next chunk size.
// - must be enough for data_size
// - must be at least chunk_size
// - must be at least _initial_chunk_size
// - try to double each time to prevent too many allocations
// - do not exceed max_chunk_size
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: chunk_size;
: _initial_chunk_size;
next_size = std::min(next_size, max_chunk_size());
// FIXME: check for overflow?
return std::max<size_type>(next_size, data_size + sizeof(chunk));
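The growth policy in `next_alloc_size` can be sketched in isolation: double the previous chunk, start from the configurable initial size, cap at the maximum, and never return less than the requested payload plus the chunk header. The constants below are illustrative (`chunk_header` stands in for `sizeof(chunk)`):

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t max_chunk_size = 128 * 1024; // new cap from this patch
constexpr std::size_t chunk_header = 16;           // stand-in for sizeof(chunk)

// current_chunk_size is 0 when no chunk has been allocated yet.
std::size_t next_alloc_size(std::size_t current_chunk_size,
                            std::size_t initial_chunk_size,
                            std::size_t data_size) {
    std::size_t next = current_chunk_size ? current_chunk_size * 2
                                          : initial_chunk_size;
    next = std::min(next, max_chunk_size);
    return std::max(next, data_size + chunk_header);
}
```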
@@ -115,13 +116,19 @@ private:
// Makes room for a contiguous region of given size.
// The region is accounted for as already written.
// size must not be zero.
[[gnu::always_inline]]
value_type* alloc(size_type size) {
if (size <= current_space_left()) {
if (__builtin_expect(size <= current_space_left(), true)) {
auto ret = _current->data + _current->offset;
_current->offset += size;
_size += size;
return ret;
} else {
return alloc_new(size);
}
}
[[gnu::noinline]]
value_type* alloc_new(size_type size) {
auto alloc_size = next_alloc_size(size);
auto space = malloc(alloc_size);
if (!space) {
@@ -139,19 +146,22 @@ private:
}
_size += size;
return _current->data;
};
}
public:
bytes_ostream() noexcept
explicit bytes_ostream(size_t initial_chunk_size) noexcept
: _begin()
, _current(nullptr)
, _size(0)
, _initial_chunk_size(initial_chunk_size)
{ }
bytes_ostream() noexcept : bytes_ostream(default_chunk_size) {}
bytes_ostream(bytes_ostream&& o) noexcept
: _begin(std::move(o._begin))
, _current(o._current)
, _size(o._size)
, _initial_chunk_size(o._initial_chunk_size)
{
o._current = nullptr;
o._size = 0;
@@ -161,6 +171,7 @@ public:
: _begin()
, _current(nullptr)
, _size(0)
, _initial_chunk_size(o._initial_chunk_size)
{
append(o);
}
@@ -198,18 +209,20 @@ public:
return place_holder<T>{alloc(sizeof(T))};
}
[[gnu::always_inline]]
value_type* write_place_holder(size_type size) {
return alloc(size);
}
// Writes given sequence of bytes
[[gnu::always_inline]]
inline void write(bytes_view v) {
if (v.empty()) {
return;
}
auto this_size = std::min(v.size(), size_t(current_space_left()));
if (this_size) {
if (__builtin_expect(this_size, true)) {
memcpy(_current->data + _current->offset, v.begin(), this_size);
_current->offset += this_size;
_size += this_size;
@@ -218,11 +231,12 @@ public:
while (!v.empty()) {
auto this_size = std::min(v.size(), size_t(max_chunk_size()));
std::copy_n(v.begin(), this_size, alloc(this_size));
std::copy_n(v.begin(), this_size, alloc_new(this_size));
v.remove_prefix(this_size);
}
}
[[gnu::always_inline]]
void write(const char* ptr, size_t size) {
write(bytes_view(reinterpret_cast<const signed char*>(ptr), size));
}
@@ -289,6 +303,24 @@ public:
}
}
// Removes n bytes from the end of the bytes_ostream.
// Beware of O(n) algorithm.
void remove_suffix(size_t n) {
_size -= n;
auto left = _size;
auto current = _begin.get();
while (current) {
if (current->offset >= left) {
current->offset = left;
_current = current;
current->next.reset();
return;
}
left -= current->offset;
current = current->next.get();
}
}
// begin() and end() form an input range to bytes_view representing fragments.
// Any modification of this instance invalidates iterators.
fragment_iterator begin() const { return { _begin.get() }; }
@@ -374,6 +406,21 @@ public:
bool operator!=(const bytes_ostream& other) const {
return !(*this == other);
}
// Makes this instance empty.
//
// The first chunk is not deallocated, so callers may rely on the
// fact that if they write less than the initial chunk size between
// clear() calls then those writes will not involve any memory
// allocation, except for the first write made on this instance.
void clear() {
if (_begin) {
_begin->offset = 0;
_size = 0;
_current = _begin.get();
_begin->next.reset();
}
}
};
template<>

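The bytes_ostream hunks above parameterize the initial chunk size and make clear() keep the first chunk. A minimal standalone sketch of the same chunked-growth idea (a hypothetical `chunked_buffer` class, not the Scylla type):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the allocation policy in the patch: the first
// chunk has a configurable initial size, later chunks double up to a cap,
// and clear() keeps the first chunk so short reuse cycles allocate nothing.
class chunked_buffer {
    static constexpr size_t max_chunk_size = 128 * 1024;
    size_t _initial_chunk_size;
    std::vector<std::vector<char>> _chunks; // each inner vector is one chunk
    size_t _size = 0;

    // Mirrors next_alloc_size(): at least the initial size, doubling,
    // capped, and always big enough for the incoming data.
    size_t next_alloc_size(size_t data_size) const {
        size_t next = _chunks.empty() ? _initial_chunk_size
                                      : _chunks.back().capacity() * 2;
        next = std::min(next, max_chunk_size);
        return std::max(next, data_size);
    }
public:
    explicit chunked_buffer(size_t initial_chunk_size = 512)
        : _initial_chunk_size(initial_chunk_size) {}

    void write(const char* p, size_t n) {
        while (n) {
            if (_chunks.empty() || _chunks.back().size() == _chunks.back().capacity()) {
                size_t alloc = next_alloc_size(n);
                _chunks.emplace_back();
                _chunks.back().reserve(alloc);
            }
            auto& c = _chunks.back();
            size_t take = std::min(n, c.capacity() - c.size());
            c.insert(c.end(), p, p + take);
            p += take;
            n -= take;
            _size += take;
        }
    }
    size_t size() const { return _size; }

    // Like the new clear(): drop all but the first chunk and rewind it,
    // so writes below the initial chunk size need no fresh allocation.
    void clear() {
        if (!_chunks.empty()) {
            _chunks.resize(1);
            _chunks.front().clear();
        }
        _size = 0;
    }
};
```

Using std::vector as the chunk type trades the intrusive chunk header for simplicity; the growth policy is the part that mirrors the patch.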

@@ -61,11 +61,12 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// - _last_row points at a direct predecessor of the next row which is going to be read.
// Used for populating continuity.
// - _population_range_starts_before_all_rows is set accordingly
// - _underlying is engaged and fast-forwarded
reading_from_underlying,
end_of_stream
};
lw_shared_ptr<partition_snapshot> _snp;
partition_snapshot_ptr _snp;
position_in_partition::tri_compare _position_cmp;
query::clustering_key_filter_ranges _ck_ranges;
@@ -94,7 +95,18 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// Valid when _state == reading_from_underlying.
bool _population_range_starts_before_all_rows;
// Whether _lower_bound was changed within current fill_buffer().
// If it did not then we cannot break out of it (e.g. on preemption) because
// forward progress is not guaranteed in case iterators are getting constantly invalidated.
bool _lower_bound_changed = false;
// Points to the underlying reader conforming to _schema,
// either to *_underlying_holder or _read_context->underlying().underlying().
flat_mutation_reader* _underlying = nullptr;
std::optional<flat_mutation_reader> _underlying_holder;
future<> do_fill_buffer(db::timeout_clock::time_point);
future<> ensure_underlying(db::timeout_clock::time_point);
void copy_from_cache_to_buffer();
future<> process_static_row(db::timeout_clock::time_point);
void move_to_end();
@@ -132,7 +144,7 @@ public:
dht::decorated_key dk,
query::clustering_key_filter_ranges&& crr,
lw_shared_ptr<read_context> ctx,
lw_shared_ptr<partition_snapshot> snp,
partition_snapshot_ptr snp,
row_cache& cache)
: flat_mutation_reader::impl(std::move(s))
, _snp(std::move(snp))
@@ -152,9 +164,6 @@ public:
cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;
cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override;
virtual ~cache_flat_mutation_reader() {
maybe_merge_versions(_snp, _lsa_manager.region(), _lsa_manager.read_section());
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
@@ -184,23 +193,22 @@ future<> cache_flat_mutation_reader::process_static_row(db::timeout_clock::time_
return make_ready_future<>();
} else {
_read_context->cache().on_row_miss();
return _read_context->get_next_fragment(timeout).then([this] (mutation_fragment_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
push_mutation_fragment(std::move(*sr));
}
maybe_set_static_row_continuous();
return ensure_underlying(timeout).then([this, timeout] {
return (*_underlying)(timeout).then([this] (mutation_fragment_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
push_mutation_fragment(std::move(*sr));
}
maybe_set_static_row_continuous();
});
});
}
}
inline
void cache_flat_mutation_reader::touch_partition() {
if (_snp->at_latest_version()) {
rows_entry& last_dummy = *_snp->version()->partition().clustered_rows().rbegin();
_snp->tracker()->touch(last_dummy);
}
_snp->touch();
}
inline
@@ -230,14 +238,36 @@ future<> cache_flat_mutation_reader::fill_buffer(db::timeout_clock::time_point t
});
}
inline
future<> cache_flat_mutation_reader::ensure_underlying(db::timeout_clock::time_point timeout) {
if (_underlying) {
return make_ready_future<>();
}
return _read_context->ensure_underlying(timeout).then([this, timeout] {
flat_mutation_reader& ctx_underlying = _read_context->underlying().underlying();
if (ctx_underlying.schema() != _schema) {
_underlying_holder = make_delegating_reader(ctx_underlying);
_underlying_holder->upgrade_schema(_schema);
_underlying = &*_underlying_holder;
} else {
_underlying = &ctx_underlying;
}
});
}
inline
future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_point timeout) {
if (_state == state::move_to_underlying) {
if (!_underlying) {
return ensure_underlying(timeout).then([this, timeout] {
return do_fill_buffer(timeout);
});
}
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _read_context->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
return _underlying->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
return read_from_underlying(timeout);
});
}
@@ -262,9 +292,13 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
}
_next_row.maybe_refresh();
clogger.trace("csm {}: next={}, cont={}", this, _next_row.position(), _next_row.continuous());
while (!is_buffer_full() && _state == state::reading_from_cache) {
_lower_bound_changed = false;
while (_state == state::reading_from_cache) {
copy_from_cache_to_buffer();
if (need_preempt()) {
// We need to check _lower_bound_changed even if is_buffer_full() because
// we may have emitted only a range tombstone which overlapped with _lower_bound
// and thus didn't cause _lower_bound to change.
if ((need_preempt() || is_buffer_full()) && _lower_bound_changed) {
break;
}
}
@@ -274,7 +308,7 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
inline
future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::time_point timeout) {
return consume_mutation_fragments_until(_read_context->underlying().underlying(),
return consume_mutation_fragments_until(*_underlying,
[this] { return _state != state::reading_from_underlying || is_buffer_full(); },
[this] (mutation_fragment mf) {
_read_context->cache().on_row_miss();
@@ -355,7 +389,7 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
}
});
return make_ready_future<>();
});
}, timeout);
}
inline
@@ -374,7 +408,7 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
rows_entry::compare less(*_schema);
// FIXME: Avoid the copy by inserting an incomplete clustering row
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_last_row));
current_allocator().construct<rows_entry>(*_schema, *_last_row));
e->set_continuous(false);
auto insert_result = rows.insert_check(rows.end(), *e, less);
auto inserted = insert_result.second;
@@ -428,7 +462,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
cr.cells().prepare_hash(*_schema, column_kind::regular_column);
}
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(cr.key(), cr.tomb(), cr.marker(), cr.cells()));
current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.tomb(), cr.marker(), cr.cells()));
new_entry->set_continuous(false);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), less);
@@ -471,15 +505,19 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
_next_row.touch();
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto &&rts : _snp->range_tombstones(_lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
position_in_partition::less_compare less(*_schema);
// This guarantees that rts starts after any emitted clustering_row
// and not before any emitted range tombstone.
if (rts.trim_front(*_schema, _lower_bound)) {
if (!less(_lower_bound, rts.position())) {
rts.set_start(*_schema, _lower_bound);
} else {
_lower_bound = position_in_partition(rts.position());
_lower_bound_changed = true;
if (is_buffer_full()) {
return;
}
push_mutation_fragment(std::move(rts));
}
push_mutation_fragment(std::move(rts));
}
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
@@ -516,6 +554,7 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
_last_row = nullptr;
_lower_bound = std::move(lb);
_upper_bound = std::move(ub);
_lower_bound_changed = true;
_ck_ranges_curr = next_it;
auto adjacent = _next_row.advance_to(_lower_bound);
_next_row_in_range = !after_current_range(_next_row.position());
@@ -593,6 +632,7 @@ void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&
auto new_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(mf));
_lower_bound = std::move(new_lower_bound);
_lower_bound_changed = true;
}
inline
@@ -600,10 +640,16 @@ void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {
clogger.trace("csm {}: add_to_buffer({})", this, rt);
// This guarantees that rt starts after any emitted clustering_row
// and not before any emitted range tombstone.
if (!rt.trim_front(*_schema, _lower_bound)) {
position_in_partition::less_compare less(*_schema);
if (!less(_lower_bound, rt.end_position())) {
return;
}
_lower_bound = position_in_partition(rt.position());
if (!less(_lower_bound, rt.position())) {
rt.set_start(*_schema, _lower_bound);
} else {
_lower_bound = position_in_partition(rt.position());
_lower_bound_changed = true;
}
push_mutation_fragment(std::move(rt));
}
@@ -657,7 +703,7 @@ inline flat_mutation_reader make_cache_flat_mutation_reader(schema_ptr s,
query::clustering_key_filter_ranges crr,
row_cache& cache,
lw_shared_ptr<cache::read_context> ctx,
lw_shared_ptr<partition_snapshot> snp)
partition_snapshot_ptr snp)
{
return make_flat_mutation_reader<cache::cache_flat_mutation_reader>(
std::move(s), std::move(dk), std::move(crr), std::move(ctx), std::move(snp), cache);

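The new `_lower_bound_changed` flag encodes a forward-progress rule: fill_buffer() resumes from `_lower_bound`, so the loop may only stop, whether on preemption or a full buffer, once that bound has actually moved. A toy sketch of the rule (hypothetical `reader` struct, illustrative only):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the loop may only break once lower_bound advanced,
// because fill() always resumes from lower_bound and stopping without
// moving it would guarantee no progress across restarts.
struct reader {
    std::vector<int> buffer;
    size_t buffer_limit = 2;
    int lower_bound = 0;

    // Each step emits one fragment; fragments not past lower_bound
    // (like a range tombstone overlapping it) do not advance the bound.
    void fill(const std::vector<int>& fragments, bool (*need_preempt)()) {
        bool lower_bound_changed = false;
        for (int f : fragments) {
            buffer.push_back(f);
            if (f > lower_bound) {
                lower_bound = f;
                lower_bound_changed = true;
            }
            // Checked even when the buffer is full: without progress we
            // must keep going rather than yield.
            if ((need_preempt() || buffer.size() >= buffer_limit) && lower_bound_changed) {
                break;
            }
        }
    }
};
```

Note how a full buffer alone is not enough to stop; the buffer is allowed to overshoot its limit until the bound moves, which is exactly the comment's point about range tombstones overlapping `_lower_bound`.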

@@ -23,29 +23,15 @@
#include <boost/intrusive/unordered_set.hpp>
#if __has_include(<boost/container/small_vector.hpp>)
#include <boost/container/small_vector.hpp>
template <typename T, size_t N>
using small_vector = boost::container::small_vector<T, N>;
#else
#include <vector>
template <typename T, size_t N>
using small_vector = std::vector<T>;
#endif
#include "fnv1a_hasher.hh"
#include "utils/small_vector.hh"
#include "mutation_fragment.hh"
#include "mutation_partition.hh"
#include "xx_hasher.hh"
#include "db/timeout_clock.hh"
class cells_range {
using ids_vector_type = small_vector<column_id, 5>;
using ids_vector_type = utils::small_vector<column_id, 5>;
position_in_partition_view _position;
ids_vector_type _ids;
@@ -208,10 +194,10 @@ private:
explicit hasher(const schema& s) : _schema(&s) { }
size_t operator()(const cell_address& ca) const {
fnv1a_hasher hasher;
xx_hasher hasher;
ca.position.feed_hash(hasher, *_schema);
::feed_hash(hasher, ca.id);
return hasher.finalize();
return static_cast<size_t>(hasher.finalize_uint64());
}
size_t operator()(const cell_entry& ce) const {
return operator()(ce._address);


@@ -22,6 +22,7 @@
#pragma once
#include <functional>
#include "keys.hh"
#include "schema.hh"
#include "range.hh"
@@ -43,22 +44,20 @@ bound_kind invert_kind(bound_kind k);
int32_t weight(bound_kind k);
class bound_view {
const static thread_local clustering_key _empty_prefix;
std::reference_wrapper<const clustering_key_prefix> _prefix;
bound_kind _kind;
public:
const static thread_local clustering_key empty_prefix;
const clustering_key_prefix& prefix;
bound_kind kind;
bound_view(const clustering_key_prefix& prefix, bound_kind kind)
: prefix(prefix)
, kind(kind)
: _prefix(prefix)
, _kind(kind)
{ }
bound_view(const bound_view& other) noexcept = default;
bound_view& operator=(const bound_view& other) noexcept {
if (this != &other) {
this->~bound_view();
new (this) bound_view(other);
}
return *this;
}
bound_view& operator=(const bound_view& other) noexcept = default;
bound_kind kind() const { return _kind; }
const clustering_key_prefix& prefix() const { return _prefix; }
struct tri_compare {
// To make it assignable and to avoid taking a schema_ptr, we
// wrap the schema reference.
@@ -82,13 +81,13 @@ public:
return d1 < d2 ? w1 - (w1 <= 0) : -(w2 - (w2 <= 0));
}
int operator()(const bound_view b, const clustering_key_prefix& p) const {
return operator()(b.prefix, weight(b.kind), p, 0);
return operator()(b._prefix, weight(b._kind), p, 0);
}
int operator()(const clustering_key_prefix& p, const bound_view b) const {
return operator()(p, 0, b.prefix, weight(b.kind));
return operator()(p, 0, b._prefix, weight(b._kind));
}
int operator()(const bound_view b1, const bound_view b2) const {
return operator()(b1.prefix, weight(b1.kind), b2.prefix, weight(b2.kind));
return operator()(b1._prefix, weight(b1._kind), b2._prefix, weight(b2._kind));
}
};
struct compare {
@@ -101,26 +100,26 @@ public:
return _cmp(p1, w1, p2, w2) < 0;
}
bool operator()(const bound_view b, const clustering_key_prefix& p) const {
return operator()(b.prefix, weight(b.kind), p, 0);
return operator()(b._prefix, weight(b._kind), p, 0);
}
bool operator()(const clustering_key_prefix& p, const bound_view b) const {
return operator()(p, 0, b.prefix, weight(b.kind));
return operator()(p, 0, b._prefix, weight(b._kind));
}
bool operator()(const bound_view b1, const bound_view b2) const {
return operator()(b1.prefix, weight(b1.kind), b2.prefix, weight(b2.kind));
return operator()(b1._prefix, weight(b1._kind), b2._prefix, weight(b2._kind));
}
};
bool equal(const schema& s, const bound_view other) const {
return kind == other.kind && prefix.equal(s, other.prefix);
return _kind == other._kind && _prefix.get().equal(s, other._prefix.get());
}
bool adjacent(const schema& s, const bound_view other) const {
return invert_kind(other.kind) == kind && prefix.equal(s, other.prefix);
return invert_kind(other._kind) == _kind && _prefix.get().equal(s, other._prefix.get());
}
static bound_view bottom() {
return {empty_prefix, bound_kind::incl_start};
return {_empty_prefix, bound_kind::incl_start};
}
static bound_view top() {
return {empty_prefix, bound_kind::incl_end};
return {_empty_prefix, bound_kind::incl_end};
}
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )
@@ -144,13 +143,13 @@ public:
template<template<typename> typename R>
GCC6_CONCEPT( requires Range<R, clustering_key_prefix_view> )
static stdx::optional<typename R<clustering_key_prefix_view>::bound> to_range_bound(const bound_view& bv) {
if (&bv.prefix == &empty_prefix) {
if (&bv._prefix.get() == &_empty_prefix) {
return {};
}
bool inclusive = bv.kind != bound_kind::excl_end && bv.kind != bound_kind::excl_start;
return {typename R<clustering_key_prefix_view>::bound(bv.prefix.view(), inclusive)};
bool inclusive = bv._kind != bound_kind::excl_end && bv._kind != bound_kind::excl_start;
return {typename R<clustering_key_prefix_view>::bound(bv._prefix.get().view(), inclusive)};
}
friend std::ostream& operator<<(std::ostream& out, const bound_view& b) {
return out << "{bound: prefix=" << b.prefix << ", kind=" << b.kind << "}";
return out << "{bound: prefix=" << b._prefix.get() << ", kind=" << b._kind << "}";
}
};
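The bound_view hunks replace a plain reference member (and the destroy-plus-placement-new `operator=` workaround) with `std::reference_wrapper`, which is rebindable and so makes the defaulted assignment well-formed. A minimal illustration (hypothetical `view` struct):

```cpp
#include <cassert>
#include <functional>
#include <string>

// A class with a plain reference member cannot be copy-assigned; wrapping
// the reference in std::reference_wrapper restores a defaulted operator=,
// which is what the bound_view change does.
struct view {
    std::reference_wrapper<const std::string> prefix;
    int kind;

    view(const std::string& p, int k) : prefix(p), kind(k) {}
    view& operator=(const view&) = default; // now well-formed

    const std::string& get_prefix() const { return prefix.get(); }
};
```

Assignment rebinds the wrapped reference rather than copying through it, matching what the old placement-new trick achieved without invoking undefined-behavior-prone object replacement.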


@@ -30,7 +30,7 @@ namespace query {
class clustering_key_filter_ranges {
clustering_row_ranges _storage;
const clustering_row_ranges& _ref;
std::reference_wrapper<const clustering_row_ranges> _ref;
public:
clustering_key_filter_ranges(const clustering_row_ranges& ranges) : _ref(ranges) { }
struct reversed { };
@@ -39,21 +39,21 @@ public:
clustering_key_filter_ranges(clustering_key_filter_ranges&& other) noexcept
: _storage(std::move(other._storage))
, _ref(&other._ref == &other._storage ? _storage : other._ref)
, _ref(&other._ref.get() == &other._storage ? _storage : other._ref.get())
{ }
clustering_key_filter_ranges& operator=(clustering_key_filter_ranges&& other) noexcept {
if (this != &other) {
this->~clustering_key_filter_ranges();
new (this) clustering_key_filter_ranges(std::move(other));
_storage = std::move(other._storage);
_ref = (&other._ref.get() == &other._storage) ? _storage : other._ref.get();
}
return *this;
}
auto begin() const { return _ref.begin(); }
auto end() const { return _ref.end(); }
bool empty() const { return _ref.empty(); }
size_t size() const { return _ref.size(); }
auto begin() const { return _ref.get().begin(); }
auto end() const { return _ref.get().end(); }
bool empty() const { return _ref.get().empty(); }
size_t size() const { return _ref.get().size(); }
const clustering_row_ranges& ranges() const { return _ref; }
static clustering_key_filter_ranges get_ranges(const schema& schema, const query::partition_slice& slice, const partition_key& key) {


@@ -31,72 +31,61 @@
class clustering_ranges_walker {
const schema& _schema;
const query::clustering_row_ranges& _ranges;
query::clustering_row_ranges::const_iterator _current;
query::clustering_row_ranges::const_iterator _end;
boost::iterator_range<query::clustering_row_ranges::const_iterator> _current_range;
bool _in_current; // next position is known to be >= _current_start
bool _with_static_row;
position_in_partition_view _current_start;
position_in_partition_view _current_end;
stdx::optional<position_in_partition> _trim;
std::optional<position_in_partition> _trim;
size_t _change_counter = 1;
private:
bool advance_to_next_range() {
_in_current = false;
if (!_current_start.is_static_row()) {
if (_current == _end) {
if (!_current_range) {
return false;
}
++_current;
_current_range.advance_begin(1);
}
++_change_counter;
if (_current == _end) {
if (!_current_range) {
_current_end = _current_start = position_in_partition_view::after_all_clustered_rows();
return false;
}
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
_current_start = position_in_partition_view::for_range_start(_current_range.front());
_current_end = position_in_partition_view::for_range_end(_current_range.front());
return true;
}
public:
clustering_ranges_walker(const schema& s, const query::clustering_row_ranges& ranges, bool with_static_row = true)
: _schema(s)
, _ranges(ranges)
, _current(ranges.begin())
, _end(ranges.end())
, _in_current(with_static_row)
, _with_static_row(with_static_row)
, _current_start(position_in_partition_view::for_static_row())
, _current_end(position_in_partition_view::before_all_clustered_rows())
{
if (!with_static_row) {
if (_current == _end) {
void set_current_positions() {
if (!_with_static_row) {
if (!_current_range) {
_current_start = position_in_partition_view::before_all_clustered_rows();
} else {
_current_start = position_in_partition_view::for_range_start(*_current);
_current_end = position_in_partition_view::for_range_end(*_current);
_current_start = position_in_partition_view::for_range_start(_current_range.front());
_current_end = position_in_partition_view::for_range_end(_current_range.front());
}
}
}
clustering_ranges_walker(clustering_ranges_walker&& o) noexcept
: _schema(o._schema)
, _ranges(o._ranges)
, _current(o._current)
, _end(o._end)
, _in_current(o._in_current)
, _with_static_row(o._with_static_row)
, _current_start(o._current_start)
, _current_end(o._current_end)
, _trim(std::move(o._trim))
, _change_counter(o._change_counter)
{ }
clustering_ranges_walker& operator=(clustering_ranges_walker&& o) {
if (this != &o) {
this->~clustering_ranges_walker();
new (this) clustering_ranges_walker(std::move(o));
}
return *this;
public:
clustering_ranges_walker(const schema& s, const query::clustering_row_ranges& ranges, bool with_static_row = true)
: _schema(s)
, _ranges(ranges)
, _current_range(ranges)
, _in_current(with_static_row)
, _with_static_row(with_static_row)
, _current_start(position_in_partition_view::for_static_row())
, _current_end(position_in_partition_view::before_all_clustered_rows()) {
set_current_positions();
}
clustering_ranges_walker(const clustering_ranges_walker&) = delete;
clustering_ranges_walker(clustering_ranges_walker&&) = delete;
clustering_ranges_walker& operator=(const clustering_ranges_walker&) = delete;
clustering_ranges_walker& operator=(clustering_ranges_walker&&) = delete;
// Excludes positions smaller than pos from the ranges.
// pos should be monotonic.
// No constraints between pos and positions passed to advance_to().
@@ -173,17 +162,15 @@ public:
return false;
}
auto i = _current;
while (i != _end) {
auto range_start = position_in_partition_view::for_range_start(*i);
for (const auto& rng : _current_range) {
auto range_start = position_in_partition_view::for_range_start(rng);
if (!less(range_start, end)) {
return false;
}
auto range_end = position_in_partition_view::for_range_end(*i);
auto range_end = position_in_partition_view::for_range_end(rng);
if (less(start, range_end)) {
return true;
}
++i;
}
return false;
@@ -191,18 +178,20 @@ public:
// Returns true if advanced past all contained positions. Any later advance_to() until reset() will return false.
bool out_of_range() const {
return !_in_current && _current == _end;
return !_in_current && !_current_range;
}
// Resets the state of the walker so that advance_to() can be now called for new sequence of positions.
// Any range trimmings still hold after this.
void reset() {
auto trim = std::move(_trim);
auto ctr = _change_counter;
*this = clustering_ranges_walker(_schema, _ranges, _with_static_row);
_change_counter = ctr + 1;
if (trim) {
trim_front(std::move(*trim));
_current_range = _ranges;
_in_current = _with_static_row;
_current_start = position_in_partition_view::for_static_row();
_current_end = position_in_partition_view::before_all_clustered_rows();
set_current_positions();
++_change_counter;
if (_trim) {
trim_front(*std::exchange(_trim, {}));
}
}
@@ -211,6 +200,11 @@ public:
return _current_start;
}
// Returns the upper bound of the last range in provided ranges set
position_in_partition_view uppermost_bound() const {
return position_in_partition_view::for_range_end(_ranges.back());
}
// When lower_bound() changes, this also does
// Always > 0.
size_t lower_bound_change_counter() const {

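The walker now keeps a single `boost::iterator_range` instead of the `_current`/`_end` iterator pair, so the emptiness test becomes a bool conversion and advancing is one call. A tiny stand-in range type (hypothetical `iter_range`, avoiding the Boost dependency) shows the interface the code above relies on:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical minimal analogue of boost::iterator_range as used by
// clustering_ranges_walker: bool conversion for emptiness, advance_begin()
// to consume from the front, front() for the current element.
template <typename It>
struct iter_range {
    It begin_;
    It end_;
    explicit operator bool() const { return begin_ != end_; }
    void advance_begin(std::ptrdiff_t n) { std::advance(begin_, n); }
    decltype(auto) front() const { return *begin_; }
};
```

Collapsing two iterator members into one range object is also what lets the walker drop its hand-written move constructor and assignment in favor of deleted copies.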
View File

@@ -25,7 +25,8 @@
#include "exceptions/exceptions.hh"
#include "sstables/compaction_backlog_manager.hh"
class column_family;
class table;
using column_family = table;
class schema;
using schema_ptr = lw_shared_ptr<const schema>;


@@ -1,67 +0,0 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "query-request.hh"
#include <experimental/optional>
// Wraps ring_position so it is compatible with old-style C++: default constructor,
// stateless comparators, yada yada
class compatible_ring_position {
const schema* _schema = nullptr;
// optional to supply a default constructor, no more
std::experimental::optional<dht::ring_position> _rp;
public:
compatible_ring_position() noexcept = default;
compatible_ring_position(const schema& s, const dht::ring_position& rp)
: _schema(&s), _rp(rp) {
}
compatible_ring_position(const schema& s, dht::ring_position&& rp)
: _schema(&s), _rp(std::move(rp)) {
}
const dht::token& token() const {
return _rp->token();
}
friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
return x._rp->tri_compare(*x._schema, *y._rp);
}
friend bool operator<(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) < 0;
}
friend bool operator<=(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) <= 0;
}
friend bool operator>(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) > 0;
}
friend bool operator>=(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) >= 0;
}
friend bool operator==(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) == 0;
}
friend bool operator!=(const compatible_ring_position& x, const compatible_ring_position& y) {
return tri_compare(x, y) != 0;
}
};


@@ -0,0 +1,64 @@
/*
* Copyright (C) 2016 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "query-request.hh"
#include <optional>
// Wraps ring_position_view so it is compatible with old-style C++: default
// constructor, stateless comparators, yada yada.
class compatible_ring_position_view {
const schema* _schema = nullptr;
// Optional to supply a default constructor, no more.
std::optional<dht::ring_position_view> _rpv;
public:
constexpr compatible_ring_position_view() = default;
compatible_ring_position_view(const schema& s, dht::ring_position_view rpv)
: _schema(&s), _rpv(rpv) {
}
const dht::ring_position_view& position() const {
return *_rpv;
}
friend int tri_compare(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return dht::ring_position_tri_compare(*x._schema, *x._rpv, *y._rpv);
}
friend bool operator<(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) < 0;
}
friend bool operator<=(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) <= 0;
}
friend bool operator>(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) > 0;
}
friend bool operator>=(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) >= 0;
}
friend bool operator==(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) == 0;
}
friend bool operator!=(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return tri_compare(x, y) != 0;
}
};


@@ -25,6 +25,7 @@
#include <boost/range/adaptor/transformed.hpp>
#include "compound.hh"
#include "schema.hh"
#include "sstables/version.hh"
//
// This header provides adaptors between the representation used by our compound_type<>
@@ -302,7 +303,7 @@ private:
}
public:
template <typename Describer>
auto describe_type(Describer f) const {
auto describe_type(sstables::sstable_version_types v, Describer f) const {
return f(const_cast<bytes&>(_bytes));
}


@@ -112,7 +112,7 @@ const sstring compression_parameters::CHUNK_LENGTH_KB = "chunk_length_kb";
const sstring compression_parameters::CRC_CHECK_CHANCE = "crc_check_chance";
compression_parameters::compression_parameters()
: compression_parameters(nullptr)
: compression_parameters(compressor::lz4)
{}
compression_parameters::~compression_parameters()
@@ -241,7 +241,7 @@ size_t lz4_processor::compress(const char* input, size_t input_len,
output[1] = (input_len >> 8) & 0xFF;
output[2] = (input_len >> 16) & 0xFF;
output[3] = (input_len >> 24) & 0xFF;
#ifdef HAVE_LZ4_COMPRESS_DEFAULT
#ifdef SEASTAR_HAVE_LZ4_COMPRESS_DEFAULT
auto ret = LZ4_compress_default(input, output + 4, input_len, LZ4_compressBound(input_len));
#else
auto ret = LZ4_compress(input, output + 4, input_len);

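For reference, the four `output` bytes written above store the uncompressed length in little-endian order ahead of the LZ4 payload, and decompression reads it back the same way. A standalone sketch of just that framing (helper names are illustrative; no LZ4 calls):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for the 4-byte little-endian length header the
// compressor prefixes to the LZ4 payload.
void put_len(unsigned char* out, uint32_t len) {
    out[0] = len & 0xFF;
    out[1] = (len >> 8) & 0xFF;
    out[2] = (len >> 16) & 0xFF;
    out[3] = (len >> 24) & 0xFF;
}

uint32_t get_len(const unsigned char* in) {
    return uint32_t(in[0]) | (uint32_t(in[1]) << 8)
         | (uint32_t(in[2]) << 16) | (uint32_t(in[3]) << 24);
}
```

Shifting byte-by-byte keeps the encoding independent of host endianness, which is why the original code does it the same way instead of a raw memcpy of `input_len`.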

@@ -118,6 +118,10 @@ public:
std::map<sstring, sstring> get_options() const;
bool operator==(const compression_parameters& other) const;
bool operator!=(const compression_parameters& other) const;
static compression_parameters no_compression() {
return compression_parameters(nullptr);
}
private:
void validate_options(const std::map<sstring, sstring>&);
};


@@ -242,6 +242,9 @@ batch_size_fail_threshold_in_kb: 50
# The directory where hints files are stored if hinted handoff is enabled.
# hints_directory: /var/lib/scylla/hints
# The directory where hints files are stored for materialized-view updates
# view_hints_directory: /var/lib/scylla/view_hints
# See http://wiki.apache.org/cassandra/HintedHandoff
# May either be "true" or "false" to enable globally, or contain a list

File diff suppressed because it is too large


@@ -38,28 +38,45 @@ private:
static bool is_compatible(const column_definition& new_def, const data_type& old_type, column_kind kind) {
return ::is_compatible(new_def.kind, kind) && new_def.type->is_value_compatible_with(*old_type);
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
if (is_compatible(new_def, old_type, kind) && cell.timestamp() > new_def.dropped_at()) {
dst.apply(new_def, atomic_cell_or_collection(cell));
static atomic_cell upgrade_cell(const abstract_type& new_type, const abstract_type& old_type, atomic_cell_view cell,
atomic_cell::collection_member cm = atomic_cell::collection_member::no) {
if (cell.is_live() && !old_type.is_counter()) {
if (cell.is_live_and_has_ttl()) {
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cell.expiry(), cell.ttl(), cm);
}
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cm);
} else {
return atomic_cell(new_type, cell);
}
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, atomic_cell_view cell) {
if (!is_compatible(new_def, old_type, kind) || cell.timestamp() <= new_def.dropped_at()) {
return;
}
dst.apply(new_def, upgrade_cell(*new_def.type, *old_type, cell));
}
static void accept_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, collection_mutation_view cell) {
if (!is_compatible(new_def, old_type, kind)) {
return;
}
auto&& ctype = static_pointer_cast<const collection_type_impl>(old_type);
auto old_view = ctype->deserialize_mutation_form(cell);
cell.data.with_linearized([&] (bytes_view cell_bv) {
auto new_ctype = static_pointer_cast<const collection_type_impl>(new_def.type);
auto old_ctype = static_pointer_cast<const collection_type_impl>(old_type);
auto old_view = old_ctype->deserialize_mutation_form(cell_bv);
collection_type_impl::mutation_view new_view;
collection_type_impl::mutation new_view;
if (old_view.tomb.timestamp > new_def.dropped_at()) {
new_view.tomb = old_view.tomb;
}
for (auto& c : old_view.cells) {
if (c.second.timestamp() > new_def.dropped_at()) {
new_view.cells.emplace_back(std::move(c));
new_view.cells.emplace_back(c.first, upgrade_cell(*new_ctype->value_comparator(), *old_ctype->value_comparator(), c.second, atomic_cell::collection_member::yes));
}
}
dst.apply(new_def, ctype->serialize_mutation_form(std::move(new_view)));
if (new_view.tomb || !new_view.cells.empty()) {
dst.apply(new_def, new_ctype->serialize_mutation_form(std::move(new_view)));
}
});
}
public:
converting_mutation_partition_applier(
@@ -75,6 +92,10 @@ public:
_p.apply(t);
}
void accept_static_cell(column_id id, atomic_cell cell) {
return accept_static_cell(id, atomic_cell_view(cell));
}
virtual void accept_static_cell(column_id id, atomic_cell_view cell) override {
const column_mapping_entry& col = _visited_column_mapping.static_column_at(id);
const column_definition* def = _p_schema.get_column_definition(col.name());
@@ -102,6 +123,10 @@ public:
_current_row = &r;
}
void accept_row_cell(column_id id, atomic_cell cell) {
return accept_row_cell(id, atomic_cell_view(cell));
}
virtual void accept_row_cell(column_id id, atomic_cell_view cell) override {
const column_mapping_entry& col = _visited_column_mapping.regular_column_at(id);
const column_definition* def = _p_schema.get_column_definition(col.name());
@@ -120,11 +145,11 @@ public:
// Appends the cell to dst upgrading it to the new schema.
// Cells must have monotonic names.
static void append_cell(row& dst, column_kind kind, const column_definition& new_def, const data_type& old_type, const atomic_cell_or_collection& cell) {
static void append_cell(row& dst, column_kind kind, const column_definition& new_def, const column_definition& old_def, const atomic_cell_or_collection& cell) {
if (new_def.is_atomic()) {
accept_cell(dst, kind, new_def, old_type, cell.as_atomic_cell());
accept_cell(dst, kind, new_def, old_def.type, cell.as_atomic_cell(old_def));
} else {
accept_cell(dst, kind, new_def, old_type, cell.as_collection_mutation());
accept_cell(dst, kind, new_def, old_def.type, cell.as_collection_mutation());
}
}
};
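The `accept_cell()` overloads above apply one filtering rule when upgrading a mutation to a new schema: a cell is carried over only if the old and new column types are compatible and the cell is newer than the column's drop time. A minimal stand-alone sketch of that rule, with hypothetical simplified types in place of Scylla's `atomic_cell_view` and `column_definition`:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Simplified stand-ins: a cell is just a timestamp plus a payload.
struct cell { int64_t timestamp; int value; };

// Mirrors the check in accept_cell(): drop the cell if the types are
// incompatible or if it predates the point the column was dropped;
// otherwise re-create it under the new column definition.
std::optional<cell> upgrade(cell c, int64_t dropped_at, bool compatible) {
    if (!compatible || c.timestamp <= dropped_at) {
        return std::nullopt;
    }
    return c;
}
```

For collections the same timestamp test is applied per element, and (after this patch) an empty result is not applied at all, which is what the `new_view.tomb || !new_view.cells.empty()` guard above checks.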


@@ -78,10 +78,10 @@ std::vector<counter_shard> counter_cell_view::shards_compatible_with_1_7_4() con
return sorted_shards;
}
static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
static bool apply_in_place(const column_definition& cdef, atomic_cell_mutable_view dst, atomic_cell_mutable_view src)
{
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_ccmv = counter_cell_mutable_view(dst);
auto src_ccmv = counter_cell_mutable_view(src);
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
@@ -118,48 +118,19 @@ static bool apply_in_place(atomic_cell_or_collection& dst, atomic_cell_or_collec
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(std::max(dst_ts, src_ts));
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(true);
return true;
}
static void revert_in_place_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
assert(dst.can_use_mutable_view() && src.can_use_mutable_view());
auto dst_ccmv = counter_cell_mutable_view(dst.as_mutable_atomic_cell());
auto src_ccmv = counter_cell_mutable_view(src.as_mutable_atomic_cell());
auto dst_shards = dst_ccmv.shards();
auto src_shards = src_ccmv.shards();
auto dst_it = dst_shards.begin();
auto src_it = src_shards.begin();
while (src_it != src_shards.end()) {
while (dst_it != dst_shards.end() && dst_it->id() < src_it->id()) {
++dst_it;
}
assert(dst_it != dst_shards.end() && dst_it->id() == src_it->id());
dst_it->swap_value_and_clock(*src_it);
++src_it;
}
auto dst_ts = dst_ccmv.timestamp();
auto src_ts = src_ccmv.timestamp();
dst_ccmv.set_timestamp(src_ts);
src_ccmv.set_timestamp(dst_ts);
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
}
bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
auto dst_ac = dst.as_atomic_cell();
auto src_ac = src.as_atomic_cell();
auto dst_ac = dst.as_atomic_cell(cdef);
auto src_ac = src.as_atomic_cell(cdef);
if (!dst_ac.is_live() || !src_ac.is_live()) {
if (dst_ac.is_live() || (!src_ac.is_live() && compare_atomic_cell_for_merge(dst_ac, src_ac) < 0)) {
std::swap(dst, src);
return true;
}
return false;
return;
}
if (dst_ac.is_counter_update() && src_ac.is_counter_update()) {
@@ -167,22 +138,26 @@ bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_
auto dst_v = dst_ac.counter_update_value();
dst = atomic_cell::make_live_counter_update(std::max(dst_ac.timestamp(), src_ac.timestamp()),
src_v + dst_v);
return true;
return;
}
assert(!dst_ac.is_counter_update());
assert(!src_ac.is_counter_update());
with_linearized(dst_ac, [&] (counter_cell_view dst_ccv) {
with_linearized(src_ac, [&] (counter_cell_view src_ccv) {
if (counter_cell_view(dst_ac).shard_count() >= counter_cell_view(src_ac).shard_count()
&& dst.can_use_mutable_view() && src.can_use_mutable_view()) {
if (apply_in_place(dst, src)) {
return true;
if (dst_ccv.shard_count() >= src_ccv.shard_count()) {
auto dst_amc = dst.as_mutable_atomic_cell(cdef);
auto src_amc = src.as_mutable_atomic_cell(cdef);
if (!dst_amc.is_value_fragmented() && !src_amc.is_value_fragmented()) {
if (apply_in_place(cdef, dst_amc, src_amc)) {
return;
}
}
}
src.as_mutable_atomic_cell().set_counter_in_place_revert(false);
auto dst_shards = counter_cell_view(dst_ac).shards();
auto src_shards = counter_cell_view(src_ac).shards();
auto dst_shards = dst_ccv.shards();
auto src_shards = src_ccv.shards();
counter_cell_builder result;
combine(dst_shards.begin(), dst_shards.end(), src_shards.begin(), src_shards.end(),
@@ -191,22 +166,9 @@ bool counter_cell_view::apply_reversibly(atomic_cell_or_collection& dst, atomic_
});
auto cell = result.build(std::max(dst_ac.timestamp(), src_ac.timestamp()));
src = std::exchange(dst, atomic_cell_or_collection(cell));
return true;
}
void counter_cell_view::revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src)
{
if (dst.as_atomic_cell().is_counter_update()) {
auto src_v = src.as_atomic_cell().counter_update_value();
auto dst_v = dst.as_atomic_cell().counter_update_value();
dst = atomic_cell::make_live(dst.as_atomic_cell().timestamp(),
long_type->decompose(dst_v - src_v));
} else if (src.as_atomic_cell().is_counter_in_place_revert_set()) {
revert_in_place_apply(dst, src);
} else {
std::swap(dst, src);
}
src = std::exchange(dst, atomic_cell_or_collection(std::move(cell)));
});
});
}
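When the in-place path is not taken, `counter_cell_view::apply()` merges the two cells' shard lists with `combine()`. Both lists are sorted by shard id, and where a shard id appears in both cells one copy must win — sketched below assuming the copy with the higher logical clock is kept (simplified stand-in types, not Scylla's `counter_shard`):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct shard { int id; int64_t clock; int64_t value; };

// Linear merge of two id-sorted shard lists. For a shard present in both
// inputs, the entry with the greater logical clock is assumed to be the
// more recent one and is kept.
std::vector<shard> merge(const std::vector<shard>& a, const std::vector<shard>& b) {
    std::vector<shard> out;
    auto i = a.begin();
    auto j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (i->id < j->id) {
            out.push_back(*i++);
        } else if (j->id < i->id) {
            out.push_back(*j++);
        } else {
            out.push_back(i->clock >= j->clock ? *i : *j);
            ++i;
            ++j;
        }
    }
    out.insert(out.end(), i, a.end());
    out.insert(out.end(), j, b.end());
    return out;
}
```

The result is built into a fresh cell via `counter_cell_builder` with the maximum of the two timestamps, as the hunk above shows.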
stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, atomic_cell_view b)
@@ -216,13 +178,15 @@ stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, at
if (!b.is_live() || !a.is_live()) {
if (b.is_live() || (!a.is_live() && compare_atomic_cell_for_merge(b, a) < 0)) {
return atomic_cell(a);
return atomic_cell(*counter_type, a);
}
return { };
}
auto a_shards = counter_cell_view(a).shards();
auto b_shards = counter_cell_view(b).shards();
return with_linearized(a, [&] (counter_cell_view a_ccv) {
return with_linearized(b, [&] (counter_cell_view b_ccv) {
auto a_shards = a_ccv.shards();
auto b_shards = b_ccv.shards();
auto a_it = a_shards.begin();
auto a_end = a_shards.end();
@@ -244,18 +208,21 @@ stdx::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, at
if (!result.empty()) {
diff = result.build(std::max(a.timestamp(), b.timestamp()));
} else if (a.timestamp() > b.timestamp()) {
diff = atomic_cell::make_live(a.timestamp(), bytes_view());
diff = atomic_cell::make_live(*counter_type, a.timestamp(), bytes_view());
}
return diff;
});
});
}
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset) {
// FIXME: allow current_state to be frozen_mutation
auto transform_new_row_to_shards = [clock_offset] (auto& cells) {
cells.for_each_cell([clock_offset] (auto, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& cells) {
cells.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
@@ -266,32 +233,35 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
};
if (!current_state) {
transform_new_row_to_shards(m.partition().static_row());
transform_new_row_to_shards(column_kind::static_column, m.partition().static_row());
for (auto& cr : m.partition().clustered_rows()) {
transform_new_row_to_shards(cr.row().cells());
transform_new_row_to_shards(column_kind::regular_column, cr.row().cells());
}
return;
}
clustering_key::less_compare cmp(*m.schema());
auto transform_row_to_shards = [clock_offset] (auto& transformee, auto& state) {
auto transform_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& transformee, auto& state) {
std::deque<std::pair<column_id, counter_shard>> shards;
state.for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
counter_cell_view ccv(acv);
counter_cell_view::with_linearized(acv, [&] (counter_cell_view ccv) {
auto cs = ccv.local_shard();
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard(*cs)));
});
});
transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto acv = ac_o_c.as_atomic_cell();
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
@@ -313,7 +283,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
});
};
transform_row_to_shards(m.partition().static_row(), current_state->partition().static_row());
transform_row_to_shards(column_kind::static_column, m.partition().static_row(), current_state->partition().static_row());
auto& cstate = current_state->partition();
auto it = cstate.clustered_rows().begin();
@@ -323,10 +293,10 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
++it;
}
if (it == end || cmp(cr.key(), it->key())) {
transform_new_row_to_shards(cr.row().cells());
transform_new_row_to_shards(column_kind::regular_column, cr.row().cells());
continue;
}
transform_row_to_shards(cr.row().cells(), it->row().cells());
transform_row_to_shards(column_kind::regular_column, cr.row().cells(), it->row().cells());
}
}


@@ -79,7 +79,7 @@ static_assert(std::is_pod<counter_id>::value, "counter_id should be a POD type")
std::ostream& operator<<(std::ostream& os, const counter_id& id);
template<typename View>
template<mutable_view is_mutable>
class basic_counter_shard_view {
enum class offset : unsigned {
id = 0u,
@@ -88,7 +88,8 @@ class basic_counter_shard_view {
total_size = unsigned(logical_clock) + sizeof(int64_t),
};
private:
typename View::pointer _base;
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const signed char*, signed char*>;
pointer_type _base;
private:
template<typename T>
T read(offset off) const {
@@ -100,7 +101,7 @@ public:
static constexpr auto size = size_t(offset::total_size);
public:
basic_counter_shard_view() = default;
explicit basic_counter_shard_view(typename View::pointer ptr) noexcept
explicit basic_counter_shard_view(pointer_type ptr) noexcept
: _base(ptr) { }
counter_id id() const { return read<counter_id>(offset::id); }
@@ -111,7 +112,7 @@ public:
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
typename View::value_type tmp[size];
signed char tmp[size];
std::copy_n(_base + off, size, tmp);
std::copy_n(other._base + off, size, _base + off);
std::copy_n(tmp, size, other._base + off);
@@ -138,7 +139,7 @@ public:
};
};
using counter_shard_view = basic_counter_shard_view<bytes_view>;
using counter_shard_view = basic_counter_shard_view<mutable_view::no>;
std::ostream& operator<<(std::ostream& os, counter_shard_view csv);
@@ -198,7 +199,7 @@ public:
return do_apply(other);
}
static size_t serialized_size() {
static constexpr size_t serialized_size() {
return counter_shard_view::size;
}
void serialize(bytes::iterator& out) const {
@@ -252,15 +253,33 @@ public:
}
atomic_cell build(api::timestamp_type timestamp) const {
return atomic_cell::make_live_from_serializer(timestamp, serialized_size(), [this] (bytes::iterator out) {
serialize(out);
});
// If we can assume that the counter shards never cross fragment boundaries
// the serialisation code gets much simpler.
static_assert(data::cell::maximum_external_chunk_length % counter_shard::serialized_size() == 0);
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, serialized_size());
auto dst_it = ac.value().begin();
auto dst_current = *dst_it++;
for (auto&& cs : _shards) {
if (dst_current.empty()) {
dst_current = *dst_it++;
}
assert(!dst_current.empty());
auto value_dst = dst_current.data();
cs.serialize(value_dst);
dst_current.remove_prefix(counter_shard::serialized_size());
}
return ac;
}
static atomic_cell from_single_shard(api::timestamp_type timestamp, const counter_shard& cs) {
return atomic_cell::make_live_from_serializer(timestamp, counter_shard::serialized_size(), [&cs] (bytes::iterator out) {
cs.serialize(out);
});
// We don't really need to bother with fragmentation here.
static_assert(data::cell::maximum_external_chunk_length >= counter_shard::serialized_size());
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, counter_shard::serialized_size());
auto dst = ac.value().first_fragment().begin();
cs.serialize(dst);
return ac;
}
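The comment and `static_assert` in `build()` above encode the invariant that makes the serialization simple: if the external chunk length is a multiple of the fixed serialized shard size, no shard ever straddles a chunk boundary, so each one can be written with a plain contiguous copy. A self-contained sketch of that layout, with made-up sizes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative sizes only; Scylla's are data::cell::maximum_external_chunk_length
// and counter_shard::serialized_size().
constexpr size_t chunk_len = 32;
constexpr size_t record_size = 8;
static_assert(chunk_len % record_size == 0,
              "records must never cross a chunk boundary");

// Append n fixed-size records, opening a fresh chunk whenever the next
// record would not fit whole in the current one.
std::vector<std::vector<char>> write_records(size_t n) {
    std::vector<std::vector<char>> chunks;
    for (size_t i = 0; i < n; ++i) {
        if (chunks.empty() || chunks.back().size() + record_size > chunk_len) {
            chunks.emplace_back();
        }
        char record[record_size];
        std::memset(record, int('a' + i % 26), record_size);
        chunks.back().insert(chunks.back().end(), record, record + record_size);
    }
    return chunks;
}
```

Because of the divisibility invariant, a full chunk always holds an integral number of records, which is why the shard iterator in `basic_counter_cell_view` can step through a linearized value in fixed `counter_shard_view::size` increments.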
class inserter_iterator : public std::iterator<std::output_iterator_tag, counter_shard> {
@@ -287,28 +306,32 @@ public:
// <counter_id> := <int64_t><int64_t>
// <shard> := <counter_id><int64_t:value><int64_t:logical_clock>
// <counter_cell> := <shard>*
template<typename View>
template<mutable_view is_mutable>
class basic_counter_cell_view {
protected:
atomic_cell_base<View> _cell;
using linearized_value_view = std::conditional_t<is_mutable == mutable_view::no,
bytes_view, bytes_mutable_view>;
using pointer_type = typename linearized_value_view::pointer;
basic_atomic_cell_view<is_mutable> _cell;
linearized_value_view _value;
private:
class shard_iterator : public std::iterator<std::input_iterator_tag, basic_counter_shard_view<View>> {
typename View::pointer _current;
basic_counter_shard_view<View> _current_view;
class shard_iterator : public std::iterator<std::input_iterator_tag, basic_counter_shard_view<is_mutable>> {
pointer_type _current;
basic_counter_shard_view<is_mutable> _current_view;
public:
shard_iterator() = default;
shard_iterator(typename View::pointer ptr) noexcept
shard_iterator(pointer_type ptr) noexcept
: _current(ptr), _current_view(ptr) { }
basic_counter_shard_view<View>& operator*() noexcept {
basic_counter_shard_view<is_mutable>& operator*() noexcept {
return _current_view;
}
basic_counter_shard_view<View>* operator->() noexcept {
basic_counter_shard_view<is_mutable>* operator->() noexcept {
return &_current_view;
}
shard_iterator& operator++() noexcept {
_current += counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
_current_view = basic_counter_shard_view<is_mutable>(_current);
return *this;
}
shard_iterator operator++(int) noexcept {
@@ -318,7 +341,7 @@ private:
}
shard_iterator& operator--() noexcept {
_current -= counter_shard_view::size;
_current_view = basic_counter_shard_view<View>(_current);
_current_view = basic_counter_shard_view<is_mutable>(_current);
return *this;
}
shard_iterator operator--(int) noexcept {
@@ -335,22 +358,23 @@ private:
};
public:
boost::iterator_range<shard_iterator> shards() const {
auto bv = _cell.value();
auto begin = shard_iterator(bv.data());
auto end = shard_iterator(bv.data() + bv.size());
auto begin = shard_iterator(_value.data());
auto end = shard_iterator(_value.data() + _value.size());
return boost::make_iterator_range(begin, end);
}
size_t shard_count() const {
return _cell.value().size() / counter_shard_view::size;
return _cell.value().size_bytes() / counter_shard_view::size;
}
public:
protected:
// ac must be a live counter cell
explicit basic_counter_cell_view(atomic_cell_base<View> ac) noexcept : _cell(ac) {
explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac, linearized_value_view vv) noexcept
: _cell(ac), _value(vv)
{
assert(_cell.is_live());
assert(!_cell.is_counter_update());
}
public:
api::timestamp_type timestamp() const { return _cell.timestamp(); }
static data_type total_value_type() { return long_type; }
@@ -381,18 +405,22 @@ public:
}
};
struct counter_cell_view : basic_counter_cell_view<bytes_view> {
struct counter_cell_view : basic_counter_cell_view<mutable_view::no> {
using basic_counter_cell_view::basic_counter_cell_view;
template<typename Function>
static decltype(auto) with_linearized(basic_atomic_cell_view<mutable_view::no> ac, Function&& fn) {
return ac.value().with_linearized([&] (bytes_view value_view) {
counter_cell_view ccv(ac, value_view);
return fn(ccv);
});
}
// Returns counter shards in an order that is compatible with Scylla 1.7.4.
std::vector<counter_shard> shards_compatible_with_1_7_4() const;
// Reversibly applies two counter cells, at least one of them must be live.
// Returns true iff dst was modified.
static bool apply_reversibly(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Reverts apply performed by apply_reversible().
static void revert_apply(atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
static void apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
// Computes a counter cell containing minimal amount of data which, when
// applied to 'b' returns the same cell as 'a' and 'b' applied together.
@@ -401,9 +429,15 @@ struct counter_cell_view : basic_counter_cell_view<bytes_view> {
friend std::ostream& operator<<(std::ostream& os, counter_cell_view ccv);
};
struct counter_cell_mutable_view : basic_counter_cell_view<bytes_mutable_view> {
struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
using basic_counter_cell_view::basic_counter_cell_view;
explicit counter_cell_mutable_view(atomic_cell_mutable_view ac) noexcept
: basic_counter_cell_view<mutable_view::yes>(ac, ac.value().first_fragment())
{
assert(!ac.value().is_fragmented());
}
void set_timestamp(api::timestamp_type ts) { _cell.set_timestamp(ts); }
};


@@ -373,7 +373,7 @@ useStatement returns [::shared_ptr<raw::use_statement> stmt]
;
/**
* SELECT <expression>
* SELECT [JSON] <expression>
* FROM <CF>
* WHERE KEY = "key1" AND COL > 1 AND COL < 100
* LIMIT <NUMBER>;
@@ -384,9 +384,12 @@ selectStatement returns [shared_ptr<raw::select_statement> expr]
::shared_ptr<cql3::term::raw> limit;
raw::select_statement::parameters::orderings_type orderings;
bool allow_filtering = false;
bool is_json = false;
}
: K_SELECT ( ( K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
: K_SELECT (
( K_JSON { is_json = true; } )?
( K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
)
K_FROM cf=columnFamilyName
( K_WHERE wclause=whereClause )?
@@ -394,7 +397,7 @@ selectStatement returns [shared_ptr<raw::select_statement> expr]
( K_LIMIT rows=intValue { limit = rows; } )?
( K_ALLOW K_FILTERING { allow_filtering = true; } )?
{
auto params = ::make_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering);
auto params = ::make_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering, is_json);
$expr = ::make_shared<raw::select_statement>(std::move(cf), std::move(params),
std::move(sclause), std::move(wclause), std::move(limit));
}
@@ -448,33 +451,54 @@ orderByClause[raw::select_statement::parameters::orderings_type& orderings]
: c=cident (K_ASC | K_DESC { reversed = true; })? { orderings.emplace_back(c, reversed); }
;
jsonValue returns [::shared_ptr<cql3::term::raw> value]
:
| s=STRING_LITERAL { $value = cql3::constants::literal::string(sstring{$s.text}); }
| ':' id=ident { $value = new_bind_variables(id); }
| QMARK { $value = new_bind_variables(shared_ptr<cql3::column_identifier>{}); }
;
/**
* INSERT INTO <CF> (<column>, <column>, <column>, ...)
* VALUES (<value>, <value>, <value>, ...)
* USING TIMESTAMP <long>;
*
*/
insertStatement returns [::shared_ptr<raw::insert_statement> expr]
insertStatement returns [::shared_ptr<raw::modification_statement> expr]
@init {
auto attrs = ::make_shared<cql3::attributes::raw>();
std::vector<::shared_ptr<cql3::column_identifier::raw>> column_names;
std::vector<::shared_ptr<cql3::term::raw>> values;
bool if_not_exists = false;
bool default_unset = false;
::shared_ptr<cql3::term::raw> json_value;
}
: K_INSERT K_INTO cf=columnFamilyName
'(' c1=cident { column_names.push_back(c1); } ( ',' cn=cident { column_names.push_back(cn); } )* ')'
K_VALUES
'(' v1=term { values.push_back(v1); } ( ',' vn=term { values.push_back(vn); } )* ')'
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_statement>(std::move(cf),
std::move(attrs),
std::move(column_names),
std::move(values),
if_not_exists);
}
('(' c1=cident { column_names.push_back(c1); } ( ',' cn=cident { column_names.push_back(cn); } )* ')'
K_VALUES
'(' v1=term { values.push_back(v1); } ( ',' vn=term { values.push_back(vn); } )* ')'
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_statement>(std::move(cf),
std::move(attrs),
std::move(column_names),
std::move(values),
if_not_exists);
}
| K_JSON
json_token=jsonValue { json_value = $json_token.value; }
( K_DEFAULT K_UNSET { default_unset = true; } | K_DEFAULT K_NULL )?
( K_IF K_NOT K_EXISTS { if_not_exists = true; } )?
( usingClause[attrs] )?
{
$expr = ::make_shared<raw::insert_json_statement>(std::move(cf),
std::move(attrs),
std::move(json_value),
if_not_exists,
default_unset);
}
)
;
usingClause[::shared_ptr<cql3::attributes::raw> attrs]
@@ -1510,12 +1534,22 @@ inMarkerForTuple returns [shared_ptr<cql3::tuples::in_raw> marker]
| ':' name=ident { $marker = new_tuple_in_bind_variables(name); }
;
comparatorType returns [shared_ptr<cql3_type::raw> t]
: n=native_type { $t = cql3_type::raw::from(n); }
| c=collection_type { $t = c; }
| tt=tuple_type { $t = tt; }
// The comparator_type rule is used for users' queries (internal=false)
// and for internal calls from db::cql_type_parser::parse() (internal=true).
// The latter is used for reading schemas stored in the system tables, and
// may support additional column types that cannot be created through CQL,
// but only internally through code. Today the only such type is "empty":
// Scylla code internally creates columns with type "empty" or collections
// "empty" to represent unselected columns in materialized views.
// If a user (internal=false) tries to use "empty" as a type, it is treated -
// as do all unknown types - as an attempt to use a user-defined type, and
// we report this name is reserved (as for _reserved_type_names()).
comparator_type [bool internal] returns [shared_ptr<cql3_type::raw> t]
: n=native_or_internal_type[internal] { $t = cql3_type::raw::from(n); }
| c=collection_type[internal] { $t = c; }
| tt=tuple_type[internal] { $t = tt; }
| id=userTypeName { $t = cql3::cql3_type::raw::user_type(id); }
| K_FROZEN '<' f=comparatorType '>'
| K_FROZEN '<' f=comparator_type[internal] '>'
{
try {
$t = cql3::cql3_type::raw::frozen(f);
@@ -1537,6 +1571,22 @@ comparatorType returns [shared_ptr<cql3_type::raw> t]
#endif
;
native_or_internal_type [bool internal] returns [shared_ptr<cql3_type> t]
: n=native_type { $t = n; }
// The "internal" types, only supported when internal==true:
| K_EMPTY {
if (internal) {
$t = cql3_type::empty;
} else {
add_recognition_error("Invalid (reserved) user type name empty");
}
}
;
comparatorType returns [shared_ptr<cql3_type::raw> t]
: tt=comparator_type[false] { $t = tt; }
;
native_type returns [shared_ptr<cql3_type> t]
: K_ASCII { $t = cql3_type::ascii; }
| K_BIGINT { $t = cql3_type::bigint; }
@@ -1561,24 +1611,24 @@ native_type returns [shared_ptr<cql3_type> t]
| K_TIME { $t = cql3_type::time; }
;
collection_type returns [shared_ptr<cql3::cql3_type::raw> pt]
: K_MAP '<' t1=comparatorType ',' t2=comparatorType '>'
collection_type [bool internal] returns [shared_ptr<cql3::cql3_type::raw> pt]
: K_MAP '<' t1=comparator_type[internal] ',' t2=comparator_type[internal] '>'
{
// if we can't parse either t1 or t2, antlr will "recover" and we may have t1 or t2 null.
if (t1 && t2) {
$pt = cql3::cql3_type::raw::map(t1, t2);
}
}
| K_LIST '<' t=comparatorType '>'
| K_LIST '<' t=comparator_type[internal] '>'
{ if (t) { $pt = cql3::cql3_type::raw::list(t); } }
| K_SET '<' t=comparatorType '>'
| K_SET '<' t=comparator_type[internal] '>'
{ if (t) { $pt = cql3::cql3_type::raw::set(t); } }
;
tuple_type returns [shared_ptr<cql3::cql3_type::raw> t]
tuple_type [bool internal] returns [shared_ptr<cql3::cql3_type::raw> t]
@init{ std::vector<shared_ptr<cql3::cql3_type::raw>> types; }
: K_TUPLE '<'
t1=comparatorType { types.push_back(t1); } (',' tn=comparatorType { types.push_back(tn); })*
t1=comparator_type[internal] { types.push_back(t1); } (',' tn=comparator_type[internal] { types.push_back(tn); })*
'>' { $t = cql3::cql3_type::raw::tuple(std::move(types)); }
;
@@ -1604,7 +1654,7 @@ unreserved_keyword returns [sstring str]
unreserved_function_keyword returns [sstring str]
: u=basic_unreserved_keyword { $str = u; }
| t=native_type { $str = t->to_string(); }
| t=native_or_internal_type[true] { $str = t->to_string(); }
;
basic_unreserved_keyword returns [sstring str]
@@ -1650,6 +1700,7 @@ basic_unreserved_keyword returns [sstring str]
| K_LANGUAGE
| K_NON
| K_DETERMINISTIC
| K_JSON
) { $str = $k.text; }
;
@@ -1786,6 +1837,11 @@ K_NON: N O N;
K_OR: O R;
K_REPLACE: R E P L A C E;
K_DETERMINISTIC: D E T E R M I N I S T I C;
K_JSON: J S O N;
K_DEFAULT: D E F A U L T;
K_UNSET: U N S E T;
K_EMPTY: E M P T Y;
K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T;


@@ -77,12 +77,14 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
if (tval.is_unset_value()) {
return now;
}
return with_linearized(*tval, [] (bytes_view val) {
try {
data_type_for<int64_t>()->validate(*tval);
data_type_for<int64_t>()->validate(val);
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid timestamp value");
}
return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(*tval));
return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(val));
});
}
int32_t attributes::get_time_to_live(const query_options& options) {
@@ -96,14 +98,16 @@ int32_t attributes::get_time_to_live(const query_options& options) {
if (tval.is_unset_value()) {
return 0;
}
auto ttl = with_linearized(*tval, [] (bytes_view val) {
try {
data_type_for<int32_t>()->validate(*tval);
data_type_for<int32_t>()->validate(val);
}
catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid TTL value");
}
auto ttl = value_cast<int32_t>(data_type_for<int32_t>()->deserialize(*tval));
return value_cast<int32_t>(data_type_for<int32_t>()->deserialize(val));
});
if (ttl < 0) {
throw exceptions::invalid_request_exception("A TTL must be greater or equal to 0");
}


@@ -0,0 +1,187 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "cql3/prepared_statements_cache.hh"
namespace cql3 {
struct authorized_prepared_statements_cache_size {
size_t operator()(const statements::prepared_statement::checked_weak_ptr& val) {
// TODO: improve the size approximation - most of the entry is occupied by the key here.
return 100;
}
};
class authorized_prepared_statements_cache_key {
public:
using cache_key_type = std::pair<auth::authenticated_user, typename cql3::prepared_cache_key_type::cache_key_type>;
private:
cache_key_type _key;
public:
authorized_prepared_statements_cache_key(auth::authenticated_user user, cql3::prepared_cache_key_type prepared_cache_key)
: _key(std::move(user), std::move(prepared_cache_key.key())) {}
cache_key_type& key() { return _key; }
const cache_key_type& key() const { return _key; }
bool operator==(const authorized_prepared_statements_cache_key& other) const {
return _key == other._key;
}
bool operator!=(const authorized_prepared_statements_cache_key& other) const {
return !(*this == other);
}
static size_t hash(const auth::authenticated_user& user, const cql3::prepared_cache_key_type::cache_key_type& prep_cache_key) {
return utils::hash_combine(std::hash<auth::authenticated_user>()(user), utils::tuple_hash()(prep_cache_key));
}
};
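The `hash()` helper above mixes the user hash with the prepared-statement key hash via `utils::hash_combine`. For illustration, here is the well-known boost-style `hash_combine` recipe; Scylla's `utils::hash_combine` may differ in detail:

```cpp
#include <cstddef>

// Classic boost::hash_combine mixing step: the golden-ratio constant and
// the two shifts spread the seed's bits so that combining is order-sensitive
// and combining with a zero hash still perturbs the seed.
inline size_t hash_combine(size_t seed, size_t h) {
    return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}
```

Order sensitivity matters for this cache key: hashing (user, statement) must not collide trivially with (statement, user).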
/// \class authorized_prepared_statements_cache
/// \brief A cache of previously authorized statements.
///
/// Entries are inserted every time a new statement is authorized.
/// Entries are evicted in any of the following cases:
/// - When the corresponding prepared statement is not valid anymore.
/// - Periodically, with the same period as the permission cache is refreshed.
/// - If the corresponding entry hasn't been used for \ref entry_expiry.
class authorized_prepared_statements_cache {
public:
struct stats {
uint64_t authorized_prepared_statements_cache_evictions = 0;
};
static stats& shard_stats() {
static thread_local stats _stats;
return _stats;
}
struct authorized_prepared_statements_cache_stats_updater {
static void inc_hits() noexcept {}
static void inc_misses() noexcept {}
static void inc_blocks() noexcept {}
static void inc_evictions() noexcept {
++shard_stats().authorized_prepared_statements_cache_evictions;
}
};
private:
using cache_key_type = authorized_prepared_statements_cache_key;
using checked_weak_ptr = typename statements::prepared_statement::checked_weak_ptr;
using cache_type = utils::loading_cache<cache_key_type,
checked_weak_ptr,
utils::loading_cache_reload_enabled::yes,
authorized_prepared_statements_cache_size,
std::hash<cache_key_type>,
std::equal_to<cache_key_type>,
authorized_prepared_statements_cache_stats_updater>;
public:
using key_type = cache_key_type;
using value_type = checked_weak_ptr;
using entry_is_too_big = typename cache_type::entry_is_too_big;
using iterator = typename cache_type::iterator;
private:
cache_type _cache;
logging::logger& _logger;
public:
// Choose the memory budget such that it would allow ~4K entries when a shard gets 1GB of RAM
authorized_prepared_statements_cache(std::chrono::milliseconds entry_expiration, std::chrono::milliseconds entry_refresh, size_t cache_size, logging::logger& logger)
: _cache(cache_size, entry_expiration, entry_refresh, logger, [this] (const key_type& k) {
_cache.remove(k);
return make_ready_future<value_type>();
})
, _logger(logger)
{}
future<> insert(auth::authenticated_user user, cql3::prepared_cache_key_type prep_cache_key, value_type v) noexcept {
return _cache.get_ptr(key_type(std::move(user), std::move(prep_cache_key)), [v = std::move(v)] (const cache_key_type&) mutable {
return make_ready_future<value_type>(std::move(v));
}).discard_result();
}
iterator find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
struct key_view {
const auth::authenticated_user& user_ref;
const cql3::prepared_cache_key_type& prep_cache_key_ref;
};
struct hasher {
size_t operator()(const key_view& kv) {
return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
}
};
struct equal {
bool operator()(const key_type& k1, const key_view& k2) {
return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
}
bool operator()(const key_view& k2, const key_type& k1) {
return operator()(k1, k2);
}
};
return _cache.find(key_view{user, prep_cache_key}, hasher(), equal());
}
iterator end() {
return _cache.end();
}
void remove(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
iterator it = find(user, prep_cache_key);
_cache.remove(it);
}
size_t size() const {
return _cache.size();
}
size_t memory_footprint() const {
return _cache.memory_footprint();
}
future<> stop() {
return _cache.stop();
}
};
}
namespace std {
template <>
struct hash<cql3::authorized_prepared_statements_cache_key> final {
size_t operator()(const cql3::authorized_prepared_statements_cache_key& k) const {
return cql3::authorized_prepared_statements_cache_key::hash(k.key().first, k.key().second);
}
};
inline std::ostream& operator<<(std::ostream& out, const cql3::authorized_prepared_statements_cache_key& k) {
return out << "{ " << k.key().first << ", " << k.key().second << " }";
}
}

View File

@@ -22,6 +22,7 @@
#include "cql3/column_identifier.hh"
#include "exceptions/exceptions.hh"
#include "cql3/selection/simple_selector.hh"
#include "cql3/util.hh"
#include <regex>
@@ -62,14 +63,11 @@ sstring column_identifier::to_string() const {
}
sstring column_identifier::to_cql_string() const {
static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
if (std::regex_match(_text.begin(), _text.end(), unquoted_identifier_re)) {
return _text;
}
static const std::regex double_quote_re("\"");
std::string result = _text;
result = std::regex_replace(result, double_quote_re, "\"\"");
return '"' + result + '"';
return util::maybe_quote(_text);
}
sstring column_identifier::raw::to_cql_string() const {
return util::maybe_quote(_text);
}
column_identifier::raw::raw(sstring raw_text, bool keep_case)
@@ -129,7 +127,11 @@ column_identifier::new_selector_factory(database& db, schema_ptr schema, std::ve
if (!def) {
throw exceptions::invalid_request_exception(sprint("Undefined name %s in selection clause", _text));
}
// Do not allow explicitly selecting hidden columns. We also skip them on
// "SELECT *" (see selection::wildcard()).
if (def->is_view_virtual()) {
throw exceptions::invalid_request_exception(sprint("Undefined name %s in selection clause", _text));
}
return selection::simple_selector::new_factory(def->name_as_text(), add_and_get_index(*def, defs), def->type);
}

View File

@@ -123,6 +123,7 @@ public:
bool operator!=(const raw& other) const;
virtual sstring to_string() const;
sstring to_cql_string() const;
friend std::hash<column_identifier::raw>;
friend std::ostream& operator<<(std::ostream& out, const column_identifier::raw& id);

View File

@@ -85,8 +85,8 @@ public:
virtual ::shared_ptr<terminal> bind(const query_options& options) override { return {}; }
virtual sstring to_string() const override { return "null"; }
};
static thread_local const ::shared_ptr<terminal> NULL_VALUE;
public:
static thread_local const ::shared_ptr<terminal> NULL_VALUE;
virtual ::shared_ptr<term> prepare(database& db, const sstring& keyspace, ::shared_ptr<column_specification> receiver) override {
if (!is_assignable(test_assignment(db, keyspace, receiver))) {
throw exceptions::invalid_request_exception("Invalid null value for counter increment/decrement");
@@ -203,10 +203,14 @@ public:
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override {
auto value = _t->bind_and_get(params._options);
execute(m, prefix, params, column, std::move(value));
}
static void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, cql3::raw_value_view value) {
if (value.is_null()) {
m.set_cell(prefix, column, std::move(make_dead_cell(params)));
} else if (value.is_value()) {
m.set_cell(prefix, column, std::move(make_cell(*value, params)));
m.set_cell(prefix, column, std::move(make_cell(*column.type, *value, params)));
}
}
};
@@ -221,7 +225,9 @@ public:
} else if (value.is_unset_value()) {
return;
}
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
auto increment = with_linearized(*value, [] (bytes_view value_view) {
return value_cast<int64_t>(long_type->deserialize_value(value_view));
});
m.set_cell(prefix, column, make_counter_update_cell(increment, params));
}
};
@@ -236,7 +242,9 @@ public:
} else if (value.is_unset_value()) {
return;
}
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
auto increment = with_linearized(*value, [] (bytes_view value_view) {
return value_cast<int64_t>(long_type->deserialize_value(value_view));
});
if (increment == std::numeric_limits<int64_t>::min()) {
throw exceptions::invalid_request_exception(sprint("The negation of %d overflows supported counter precision (signed 8 bytes integer)", increment));
}

View File

@@ -395,18 +395,15 @@ operator<<(std::ostream& os, const cql3_type::raw& r) {
namespace util {
sstring maybe_quote(const sstring& s) {
static const std::regex unquoted("\\w*");
static const std::regex double_quote("\"");
if (std::regex_match(s.begin(), s.end(), unquoted)) {
return s;
sstring maybe_quote(const sstring& identifier) {
static const std::regex unquoted_identifier_re("[a-z][a-z0-9_]*");
if (std::regex_match(identifier.begin(), identifier.end(), unquoted_identifier_re)) {
return identifier;
}
std::ostringstream ss;
ss << "\"";
std::regex_replace(std::ostreambuf_iterator<char>(ss), s.begin(), s.end(), double_quote, "\"\"");
ss << "\"";
return ss.str();
static const std::regex double_quote_re("\"");
std::string result = identifier;
result = std::regex_replace(result, double_quote_re, "\"\"");
return '"' + result + '"';
}
}

View File

@@ -45,6 +45,7 @@
#include "service/query_state.hh"
#include "service/storage_proxy.hh"
#include "cql3/query_options.hh"
#include "timeout_config.hh"
namespace cql_transport {
@@ -62,10 +63,15 @@ class metadata;
shared_ptr<const metadata> make_empty_metadata();
class cql_statement {
timeout_config_selector _timeout_config_selector;
public:
explicit cql_statement(timeout_config_selector timeout_selector) : _timeout_config_selector(timeout_selector) {}
virtual ~cql_statement()
{ }
timeout_config_selector get_timeout_config_selector() const { return _timeout_config_selector; }
virtual uint32_t get_bound_terms() = 0;
/**
@@ -81,7 +87,7 @@ public:
*
* @param state the current client state
*/
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) = 0;
virtual void validate(service::storage_proxy& proxy, const service::client_state& state) = 0;
/**
* Execute the statement and return the resulting result or null if there is no result.
@@ -90,15 +96,7 @@ public:
* @param options options for this query (consistency, variables, pageSize, ...)
*/
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
/**
* Variant of execute used for internal queries against the system tables, and thus only queries the local node.
*
* @param state the current query state
*/
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute_internal(distributed<service::storage_proxy>& proxy, service::query_state& state, const query_options& options) = 0;
execute(service::storage_proxy& proxy, service::query_state& state, const query_options& options) = 0;
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const = 0;
@@ -111,6 +109,7 @@ public:
class cql_statement_no_metadata : public cql_statement {
public:
using cql_statement::cql_statement;
virtual shared_ptr<const metadata> get_result_metadata() const override {
return make_empty_metadata();
}

View File

@@ -42,6 +42,7 @@
#pragma once
#include "types.hh"
#include "cql3/cql3_type.hh"
#include <vector>
#include <iosfwd>
#include <boost/functional/hash.hpp>
@@ -105,9 +106,9 @@ abstract_function::print(std::ostream& os) const {
if (i > 0) {
os << ", ";
}
os << _arg_types[i]->name(); // FIXME: asCQL3Type()
os << _arg_types[i]->as_cql3_type()->to_string();
}
os << ") -> " << _return_type->name(); // FIXME: asCQL3Type()
os << ") -> " << _return_type->as_cql3_type()->to_string();
}
}

View File

@@ -20,6 +20,7 @@
*/
#include "functions.hh"
#include "function_call.hh"
#include "token_fct.hh"
#include "cql3/maps.hh"
@@ -41,11 +42,22 @@ functions::init() {
declare(time_uuid_fcts::make_min_timeuuid_fct());
declare(time_uuid_fcts::make_max_timeuuid_fct());
declare(time_uuid_fcts::make_date_of_fct());
declare(time_uuid_fcts::make_unix_timestamp_of_fcf());
declare(time_uuid_fcts::make_unix_timestamp_of_fct());
declare(time_uuid_fcts::make_currenttimestamp_fct());
declare(time_uuid_fcts::make_currentdate_fct());
declare(time_uuid_fcts::make_currenttime_fct());
declare(time_uuid_fcts::make_currenttimeuuid_fct());
declare(time_uuid_fcts::make_timeuuidtodate_fct());
declare(time_uuid_fcts::make_timestamptodate_fct());
declare(time_uuid_fcts::make_timeuuidtotimestamp_fct());
declare(time_uuid_fcts::make_datetotimestamp_fct());
declare(time_uuid_fcts::make_timeuuidtounixtimestamp_fct());
declare(time_uuid_fcts::make_timestamptounixtimestamp_fct());
declare(time_uuid_fcts::make_datetounixtimestamp_fct());
declare(make_uuid_fct());
for (auto&& type : cql3_type::values()) {
// Note: because text and varchar ends up being synonimous, our automatic makeToBlobFunction doesn't work
// Note: because text and varchar ends up being synonymous, our automatic makeToBlobFunction doesn't work
// for varchar, so we special case it below. We also skip blob for obvious reasons.
if (type == cql3_type::varchar || type == cql3_type::blob) {
continue;
@@ -95,15 +107,22 @@ functions::init() {
declare(aggregate_fcts::make_max_function<sstring>());
declare(aggregate_fcts::make_min_function<sstring>());
declare(aggregate_fcts::make_count_function<simple_date_native_type>());
declare(aggregate_fcts::make_max_function<simple_date_native_type>());
declare(aggregate_fcts::make_min_function<simple_date_native_type>());
declare(aggregate_fcts::make_count_function<timestamp_native_type>());
declare(aggregate_fcts::make_max_function<timestamp_native_type>());
declare(aggregate_fcts::make_min_function<timestamp_native_type>());
declare(aggregate_fcts::make_count_function<timeuuid_native_type>());
declare(aggregate_fcts::make_max_function<timeuuid_native_type>());
declare(aggregate_fcts::make_min_function<timeuuid_native_type>());
declare(aggregate_fcts::make_count_function<utils::UUID>());
declare(aggregate_fcts::make_max_function<utils::UUID>());
declare(aggregate_fcts::make_min_function<utils::UUID>());
//FIXME:
//declare(aggregate_fcts::make_count_function<bytes>());
//declare(aggregate_fcts::make_max_function<bytes>());
@@ -153,23 +172,73 @@ functions::get_overload_count(const function_name& name) {
return _declared.count(name);
}
inline
shared_ptr<function>
make_to_json_function(data_type t) {
return make_native_scalar_function<true>("tojson", utf8_type, {t},
[t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
return utf8_type->decompose(t->to_json_string(parameters[0]));
});
}
inline
shared_ptr<function>
make_from_json_function(database& db, const sstring& keyspace, data_type t) {
return make_native_scalar_function<true>("fromjson", t, {utf8_type},
[&db, &keyspace, t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
Json::Value json_value = json::to_json_value(utf8_type->to_string(parameters[0].value()));
bytes_opt parsed_json_value;
if (!json_value.isNull()) {
parsed_json_value.emplace(t->from_json_object(json_value, sf));
}
return std::move(parsed_json_value);
});
}
shared_ptr<function>
functions::get(database& db,
const sstring& keyspace,
const function_name& name,
const std::vector<shared_ptr<assignment_testable>>& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf) {
const sstring& receiver_cf,
shared_ptr<column_specification> receiver) {
static const function_name TOKEN_FUNCTION_NAME = function_name::native_function("token");
static const function_name TO_JSON_FUNCTION_NAME = function_name::native_function("tojson");
static const function_name FROM_JSON_FUNCTION_NAME = function_name::native_function("fromjson");
if (name.has_keyspace()
? name == TOKEN_FUNCTION_NAME
: name.name == TOKEN_FUNCTION_NAME.name)
{
? name == TOKEN_FUNCTION_NAME
: name.name == TOKEN_FUNCTION_NAME.name) {
return ::make_shared<token_fct>(db.find_schema(receiver_ks, receiver_cf));
}
if (name.has_keyspace()
? name == TO_JSON_FUNCTION_NAME
: name.name == TO_JSON_FUNCTION_NAME.name) {
if (provided_args.size() != 1) {
throw exceptions::invalid_request_exception("toJson() accepts 1 argument only");
}
selection::selector *sp = dynamic_cast<selection::selector *>(provided_args[0].get());
if (!sp) {
throw exceptions::invalid_request_exception("toJson() is only valid in SELECT clause");
}
return make_to_json_function(sp->get_type());
}
if (name.has_keyspace()
? name == FROM_JSON_FUNCTION_NAME
: name.name == FROM_JSON_FUNCTION_NAME.name) {
if (provided_args.size() != 1) {
throw exceptions::invalid_request_exception("fromJson() accepts 1 argument only");
}
if (!receiver) {
throw exceptions::invalid_request_exception("fromJson() can only be called if receiver type is known");
}
return make_from_json_function(db, keyspace, receiver->type);
}
std::vector<shared_ptr<function>> candidates;
auto&& add_declared = [&] (function_name fn) {
auto&& fns = _declared.equal_range(fn);
@@ -392,9 +461,9 @@ function_call::make_terminal(shared_ptr<function> fun, cql3::raw_value result, c
}
auto ctype = static_pointer_cast<const collection_type_impl>(fun->return_type());
bytes_view res;
fragmented_temporary_buffer::view res;
if (result) {
res = *result;
res = fragmented_temporary_buffer::view(bytes_view(*result));
}
if (&ctype->_kind == &collection_type_impl::kind::list) {
return make_shared(lists::value::from_serialized(std::move(res), static_pointer_cast<const list_type_impl>(ctype), sf));
@@ -414,7 +483,7 @@ function_call::raw::prepare(database& db, const sstring& keyspace, ::shared_ptr<
[] (auto&& x) -> shared_ptr<assignment_testable> {
return x;
});
auto&& fun = functions::functions::get(db, keyspace, _name, args, receiver->ks_name, receiver->cf_name);
auto&& fun = functions::functions::get(db, keyspace, _name, args, receiver->ks_name, receiver->cf_name, receiver);
if (!fun) {
throw exceptions::invalid_request_exception(sprint("Unknown function %s called", _name));
}
@@ -478,7 +547,7 @@ function_call::raw::test_assignment(database& db, const sstring& keyspace, share
// of another, existing, function. In that case, we return true here because we'll throw a proper exception
// later with a more helpful error message that if we were to return false here.
try {
auto&& fun = functions::get(db, keyspace, _name, _terms, receiver->ks_name, receiver->cf_name);
auto&& fun = functions::get(db, keyspace, _name, _terms, receiver->ks_name, receiver->cf_name, receiver);
if (fun && receiver->type->equals(fun->return_type())) {
return assignment_testable::test_result::EXACT_MATCH;
} else if (!fun || receiver->type->is_value_compatible_with(*fun->return_type())) {

View File

@@ -80,16 +80,18 @@ public:
const function_name& name,
const std::vector<shared_ptr<assignment_testable>>& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf);
const sstring& receiver_cf,
::shared_ptr<column_specification> receiver = nullptr);
template <typename AssignmentTestablePtrRange>
static shared_ptr<function> get(database& db,
const sstring& keyspace,
const function_name& name,
AssignmentTestablePtrRange&& provided_args,
const sstring& receiver_ks,
const sstring& receiver_cf) {
const sstring& receiver_cf,
::shared_ptr<column_specification> receiver = nullptr) {
const std::vector<shared_ptr<assignment_testable>> args(std::begin(provided_args), std::end(provided_args));
return get(db, keyspace, name, args, receiver_ks, receiver_cf);
return get(db, keyspace, name, args, receiver_ks, receiver_cf, receiver);
}
static std::vector<shared_ptr<function>> find(const function_name& name);
static shared_ptr<function> find(const function_name& name, const std::vector<data_type>& arg_types);

View File

@@ -117,7 +117,7 @@ make_date_of_fct() {
inline
shared_ptr<function>
make_unix_timestamp_of_fcf() {
make_unix_timestamp_of_fct() {
return make_native_scalar_function<true>("unixtimestampof", long_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
@@ -129,6 +129,163 @@ make_unix_timestamp_of_fcf() {
});
}
inline shared_ptr<function>
make_currenttimestamp_fct() {
return make_native_scalar_function<true>("currenttimestamp", timestamp_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
return {timestamp_type->decompose(timestamp_native_type{db_clock::now()})};
});
}
inline shared_ptr<function>
make_currenttime_fct() {
return make_native_scalar_function<true>("currenttime", time_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
constexpr int64_t milliseconds_in_day = 3600 * 24 * 1000;
int64_t milliseconds_since_epoch = std::chrono::duration_cast<std::chrono::milliseconds>(db_clock::now().time_since_epoch()).count();
int64_t nanoseconds_today = (milliseconds_since_epoch % milliseconds_in_day) * 1000 * 1000;
return {time_type->decompose(time_native_type{nanoseconds_today})};
});
}
inline shared_ptr<function>
make_currentdate_fct() {
return make_native_scalar_function<true>("currentdate", simple_date_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(timestamp_native_type{db_clock::now()}))};
});
}
inline
shared_ptr<function>
make_currenttimeuuid_fct() {
return make_native_scalar_function<true>("currenttimeuuid", timeuuid_type, {},
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
return {timeuuid_type->decompose(timeuuid_native_type{utils::UUID_gen::get_time_UUID()})};
});
}
inline
shared_ptr<function>
make_timeuuidtodate_fct() {
return make_native_scalar_function<true>("todate", simple_date_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts = db_clock::time_point(db_clock::duration(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb))));
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(ts))};
});
}
inline
shared_ptr<function>
make_timestamptodate_fct() {
return make_native_scalar_function<true>("todate", simple_date_type, { timestamp_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts_obj = timestamp_type->deserialize(*bb);
if (ts_obj.is_null()) {
return {};
}
auto to_simple_date = get_castas_fctn(simple_date_type, timestamp_type);
return {simple_date_type->decompose(to_simple_date(ts_obj))};
});
}
inline
shared_ptr<function>
make_timeuuidtotimestamp_fct() {
return make_native_scalar_function<true>("totimestamp", timestamp_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts = db_clock::time_point(db_clock::duration(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb))));
return {timestamp_type->decompose(ts)};
});
}
inline
shared_ptr<function>
make_datetotimestamp_fct() {
return make_native_scalar_function<true>("totimestamp", timestamp_type, { simple_date_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto simple_date_obj = simple_date_type->deserialize(*bb);
if (simple_date_obj.is_null()) {
return {};
}
auto from_simple_date = get_castas_fctn(timestamp_type, simple_date_type);
return {timestamp_type->decompose(from_simple_date(simple_date_obj))};
});
}
inline
shared_ptr<function>
make_timeuuidtounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { timeuuid_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
return {long_type->decompose(UUID_gen::unix_timestamp(UUID_gen::get_UUID(*bb)))};
});
}
inline
shared_ptr<function>
make_timestamptounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { timestamp_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto ts_obj = timestamp_type->deserialize(*bb);
if (ts_obj.is_null()) {
return {};
}
return {long_type->decompose(ts_obj)};
});
}
inline
shared_ptr<function>
make_datetounixtimestamp_fct() {
return make_native_scalar_function<true>("tounixtimestamp", long_type, { simple_date_type },
[] (cql_serialization_format sf, const std::vector<bytes_opt>& values) -> bytes_opt {
using namespace utils;
auto& bb = values[0];
if (!bb) {
return {};
}
auto simple_date_obj = simple_date_type->deserialize(*bb);
if (simple_date_obj.is_null()) {
return {};
}
auto from_simple_date = get_castas_fctn(timestamp_type, simple_date_type);
return {long_type->decompose(from_simple_date(simple_date_obj))};
});
}
}
}
}

View File

@@ -115,11 +115,12 @@ lists::literal::to_string() const {
}
lists::value
lists::value::from_serialized(bytes_view v, list_type type, cql_serialization_format sf) {
lists::value::from_serialized(const fragmented_temporary_buffer::view& val, list_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
// FIXME: deserializeForNativeProtocol()?!
return with_linearized(val, [&] (bytes_view v) {
auto l = value_cast<list_type_impl::native_type>(type->deserialize(v, sf));
std::vector<bytes_opt> elements;
elements.reserve(l.size());
@@ -128,6 +129,7 @@ lists::value::from_serialized(bytes_view v, list_type type, cql_serialization_fo
elements.push_back(element.is_null() ? bytes_opt() : bytes_opt(type->get_elements_type()->decompose(element)));
}
return value(std::move(elements));
});
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -237,7 +239,12 @@ lists::precision_time::get_next(db_clock::time_point millis) {
void
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
const auto& value = _t->bind(params._options);
auto value = _t->bind(params._options);
execute(m, prefix, params, column, std::move(value));
}
void
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value) {
if (value == constants::UNSET_VALUE) {
return;
}
@@ -280,7 +287,9 @@ lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix
return;
}
auto idx = net::ntoh(int32_t(*unaligned_cast<int32_t>(index->begin())));
auto idx = with_linearized(*index, [] (bytes_view v) {
return value_cast<int32_t>(data_type_for<int32_t>()->deserialize(v));
});
auto&& existing_list_opt = params.get_prefetched_list(m.key().view(), prefix.view(), column);
if (!existing_list_opt) {
throw exceptions::invalid_request_exception("Attempted to set an element on a list which is null");
@@ -299,7 +308,7 @@ lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix
if (!value) {
mut.cells.emplace_back(eidx, params.make_dead_cell());
} else {
mut.cells.emplace_back(eidx, params.make_cell(*value));
mut.cells.emplace_back(eidx, params.make_cell(*ltype->value_comparator(), *value, atomic_cell::collection_member::yes));
}
auto smut = ltype->serialize_mutation_form(mut);
m.set_cell(prefix, column, atomic_cell_or_collection::from_collection_mutation(std::move(smut)));
@@ -326,7 +335,7 @@ lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix,
list_type_impl::mutation mut;
mut.cells.reserve(1);
mut.cells.emplace_back(to_bytes(*index), params.make_cell(*value));
mut.cells.emplace_back(to_bytes(*index), params.make_cell(*ltype->value_comparator(), *value, atomic_cell::collection_member::yes));
auto smut = ltype->serialize_mutation_form(mut);
m.set_cell(prefix, column,
atomic_cell_or_collection::from_collection_mutation(
@@ -365,7 +374,7 @@ lists::do_append(shared_ptr<term> value,
auto uuid1 = utils::UUID_gen::get_time_UUID_bytes();
auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
// FIXME: can e be empty?
appended.cells.emplace_back(std::move(uuid), params.make_cell(*e));
appended.cells.emplace_back(std::move(uuid), params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
}
m.set_cell(prefix, column, ltype->serialize_mutation_form(appended));
} else {
@@ -374,7 +383,7 @@ lists::do_append(shared_ptr<term> value,
m.set_cell(prefix, column, params.make_dead_cell());
} else {
auto newv = list_value->get_with_protocol_version(cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(std::move(newv)));
m.set_cell(prefix, column, params.make_cell(*column.type, std::move(newv)));
}
}
}
@@ -395,14 +404,14 @@ lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, cons
mut.cells.reserve(lvalue->get_elements().size());
// We reverse the order of insertion, so that the last element gets the latest time
// (lists are sorted by time)
auto&& ltype = static_cast<const list_type_impl*>(column.type.get());
for (auto&& v : lvalue->_elements | boost::adaptors::reversed) {
auto&& pt = precision_time::get_next(time);
auto uuid = utils::UUID_gen::get_time_UUID_bytes(pt.millis.time_since_epoch().count(), pt.nanos);
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*v));
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
}
// now reverse again, to get the original order back
std::reverse(mut.cells.begin(), mut.cells.end());
auto&& ltype = static_cast<const list_type_impl*>(column.type.get());
m.set_cell(prefix, column, atomic_cell_or_collection::from_collection_mutation(ltype->serialize_mutation_form(std::move(mut))));
}

View File

@@ -79,7 +79,7 @@ public:
explicit value(std::vector<bytes_opt> elements)
: _elements(std::move(elements)) {
}
static value from_serialized(bytes_view v, list_type type, cql_serialization_format sf);
static value from_serialized(const fragmented_temporary_buffer::view& v, list_type type, cql_serialization_format sf);
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf) override;
bool equals(shared_ptr<list_type_impl> lt, const value& v);
@@ -147,6 +147,7 @@ public:
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) override;
static void execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value);
};
class setter_by_index : public operation {

View File

@@ -152,18 +152,20 @@ maps::literal::to_string() const {
}
maps::value
maps::value::from_serialized(bytes_view value, map_type type, cql_serialization_format sf) {
maps::value::from_serialized(const fragmented_temporary_buffer::view& fragmented_value, map_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
// FIXME: deserialize_for_native_protocol?!
return with_linearized(fragmented_value, [&] (bytes_view value) {
auto m = value_cast<map_type_impl::native_type>(type->deserialize(value, sf));
std::map<bytes, bytes, serialized_compare> map(type->get_keys_type()->as_less_comparator());
for (auto&& e : m) {
map.emplace(type->get_keys_type()->decompose(e.first),
type->get_values_type()->decompose(e.second));
}
return { std::move(map) };
return maps::value { std::move(map) };
});
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -233,10 +235,10 @@ maps::delayed_value::bind(const query_options& options) {
if (key_bytes.is_unset_value()) {
throw exceptions::invalid_request_exception("unset value is not supported inside collections");
}
if (key_bytes->size() > std::numeric_limits<uint16_t>::max()) {
if (key_bytes->size_bytes() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Map key is too long. Map keys are limited to %d bytes but %d bytes keys provided",
std::numeric_limits<uint16_t>::max(),
key_bytes->size()));
key_bytes->size_bytes()));
}
auto value_bytes = value->bind_and_get(options);
if (value_bytes.is_null()) {
@@ -266,6 +268,11 @@ maps::marker::bind(const query_options& options) {
void
maps::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
auto value = _t->bind(params._options);
execute(m, row_key, params, column, std::move(value));
}
void
maps::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value) {
if (value == constants::UNSET_VALUE) {
return;
}
@@ -295,10 +302,11 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
if (!key) {
throw invalid_request_exception("Invalid null map key");
}
auto avalue = value ? params.make_cell(*value) : params.make_dead_cell();
map_type_impl::mutation update = { {}, { { std::move(to_bytes(*key)), std::move(avalue) } } };
// should have been verified as map earlier?
auto ctype = static_pointer_cast<const map_type_impl>(column.type);
auto avalue = value ? params.make_cell(*ctype->get_values_type(), *value, atomic_cell::collection_member::yes) : params.make_dead_cell();
map_type_impl::mutation update;
update.cells.emplace_back(std::move(to_bytes(*key)), std::move(avalue));
// should have been verified as map earlier?
auto col_mut = ctype->serialize_mutation_form(std::move(update));
m.set_cell(prefix, column, std::move(col_mut));
}
@@ -323,10 +331,10 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
return;
}
for (auto&& e : map_value->map) {
mut.cells.emplace_back(e.first, params.make_cell(e.second));
}
auto ctype = static_pointer_cast<const map_type_impl>(column.type);
for (auto&& e : map_value->map) {
mut.cells.emplace_back(e.first, params.make_cell(*ctype->get_values_type(), fragmented_temporary_buffer::view(e.second), atomic_cell::collection_member::yes));
}
auto col_mut = ctype->serialize_mutation_form(std::move(mut));
m.set_cell(prefix, column, std::move(col_mut));
} else {
@@ -336,7 +344,7 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
} else {
auto v = map_type_impl::serialize_partially_deserialized_form({map_value->map.begin(), map_value->map.end()},
cql_serialization_format::internal());
m.set_cell(prefix, column, params.make_cell(std::move(v)));
m.set_cell(prefix, column, params.make_cell(*column.type, fragmented_temporary_buffer::view(std::move(v))));
}
}
}


@@ -81,7 +81,7 @@ public:
value(std::map<bytes, bytes, serialized_compare> map)
: map(std::move(map)) {
}
static value from_serialized(bytes_view value, map_type type, cql_serialization_format sf);
static value from_serialized(const fragmented_temporary_buffer::view& value, map_type type, cql_serialization_format sf);
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf);
bool equals(map_type mt, const value& v);
@@ -117,6 +117,7 @@ public:
}
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
static void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value);
};
class setter_by_key : public operation {


@@ -87,15 +87,19 @@ public:
virtual ~operation() {}
atomic_cell make_dead_cell(const update_parameters& params) const {
static atomic_cell make_dead_cell(const update_parameters& params) {
return params.make_dead_cell();
}
atomic_cell make_cell(bytes_view value, const update_parameters& params) const {
return params.make_cell(value);
static atomic_cell make_cell(const abstract_type& type, bytes_view value, const update_parameters& params) {
return params.make_cell(type, fragmented_temporary_buffer::view(value));
}
atomic_cell make_counter_update_cell(int64_t delta, const update_parameters& params) const {
static atomic_cell make_cell(const abstract_type& type, const fragmented_temporary_buffer::view& value, const update_parameters& params) {
return params.make_cell(type, value);
}
static atomic_cell make_counter_update_cell(int64_t delta, const update_parameters& params) {
return params.make_counter_update_cell(delta);
}


@@ -68,6 +68,14 @@ public:
static thrift_prepared_id_type thrift_id(const prepared_cache_key_type& key) {
return key.key().second;
}
bool operator==(const prepared_cache_key_type& other) const {
return _key == other._key;
}
bool operator!=(const prepared_cache_key_type& other) const {
return !(*this == other);
}
};
class prepared_statements_cache {
@@ -102,9 +110,9 @@ private:
}
};
public:
static const std::chrono::minutes entry_expiry;
public:
using key_type = prepared_cache_key_type;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
@@ -116,8 +124,8 @@ private:
value_extractor_fn _value_extractor_fn;
public:
prepared_statements_cache(logging::logger& logger)
: _cache(memory::stats().total_memory() / 256, entry_expiry, logger)
prepared_statements_cache(logging::logger& logger, size_t size)
: _cache(size, entry_expiry, logger)
{}
template <typename LoadFunc>
@@ -155,6 +163,10 @@ public:
size_t memory_footprint() const {
return _cache.memory_footprint();
}
future<> stop() {
return _cache.stop();
}
};
}
@@ -168,4 +180,11 @@ inline std::ostream& operator<<(std::ostream& os, const cql3::prepared_cache_key
os << p.key();
return os;
}
template<>
struct hash<cql3::prepared_cache_key_type> final {
size_t operator()(const cql3::prepared_cache_key_type& k) const {
return utils::tuple_hash()(k.key());
}
};
}


@@ -46,10 +46,11 @@ namespace cql3 {
thread_local const query_options::specific_options query_options::specific_options::DEFAULT{-1, {}, {}, api::missing_timestamp};
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, std::experimental::nullopt,
thread_local query_options query_options::DEFAULT{db::consistency_level::ONE, infinite_timeout_config, std::experimental::nullopt,
std::vector<cql3::raw_value_view>(), false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -57,6 +58,7 @@ query_options::query_options(db::consistency_level consistency,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views(value_views)
@@ -67,12 +69,14 @@ query_options::query_options(db::consistency_level consistency,
}
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views()
@@ -84,12 +88,14 @@ query_options::query_options(db::consistency_level consistency,
}
query_options::query_options(db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
specific_options options,
cql_serialization_format sf)
: _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values()
, _value_views(std::move(value_views))
@@ -99,9 +105,10 @@ query_options::query_options(db::consistency_level consistency,
{
}
query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values, specific_options options)
query_options::query_options(db::consistency_level cl, const ::timeout_config& timeout_config, std::vector<cql3::raw_value> values, specific_options options)
: query_options(
cl,
timeout_config,
{},
std::move(values),
false,
@@ -113,6 +120,7 @@ query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_val
query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<service::pager::paging_state> paging_state)
: query_options(qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
@@ -122,84 +130,49 @@ query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<ser
}
query_options::query_options(std::unique_ptr<query_options> qo, ::shared_ptr<service::pager::paging_state> paging_state, int32_t page_size)
: query_options(qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
qo->_skip_metadata,
std::move(query_options::specific_options{page_size, paging_state, qo->_options.serial_consistency, qo->_options.timestamp}),
qo->_cql_serialization_format) {
}
query_options::query_options(std::vector<cql3::raw_value> values)
: query_options(
db::consistency_level::ONE, std::move(values))
db::consistency_level::ONE, infinite_timeout_config, std::move(values))
{}
db::consistency_level query_options::get_consistency() const
{
return _consistency;
}
cql3::raw_value_view query_options::get_value_at(size_t idx) const
{
return _value_views.at(idx);
}
size_t query_options::get_values_count() const
{
return _value_views.size();
}
cql3::raw_value_view query_options::make_temporary(cql3::raw_value value) const
{
if (value) {
_temporaries.emplace_back(value->begin(), value->end());
auto& temporary = _temporaries.back();
return cql3::raw_value_view::make_value(bytes_view{temporary.data(), temporary.size()});
auto value_view = *value;
auto ptr = _temporaries.write_place_holder(value_view.size());
std::copy_n(value_view.data(), value_view.size(), ptr);
return cql3::raw_value_view::make_value(fragmented_temporary_buffer::view(bytes_view{ptr, value_view.size()}));
}
return cql3::raw_value_view::make_null();
}
bool query_options::skip_metadata() const
bytes_view query_options::linearize(fragmented_temporary_buffer::view view) const
{
return _skip_metadata;
}
int32_t query_options::get_page_size() const
{
return get_specific_options().page_size;
}
::shared_ptr<service::pager::paging_state> query_options::get_paging_state() const
{
return get_specific_options().state;
}
std::experimental::optional<db::consistency_level> query_options::get_serial_consistency() const
{
return get_specific_options().serial_consistency;
}
api::timestamp_type query_options::get_timestamp(service::query_state& state) const
{
auto tstamp = get_specific_options().timestamp;
return tstamp != api::missing_timestamp ? tstamp : state.get_timestamp();
}
int query_options::get_protocol_version() const
{
return _cql_serialization_format.protocol_version();
}
cql_serialization_format query_options::get_cql_serialization_format() const
{
return _cql_serialization_format;
}
const query_options::specific_options& query_options::get_specific_options() const
{
return _options;
}
const query_options& query_options::for_statement(size_t i) const
{
if (!_batch_options) {
// No per-statement options supplied, so use the "global" options
return *this;
if (view.empty()) {
return { };
} else if (std::next(view.begin()) == view.end()) {
return *view.begin();
} else {
auto ptr = _temporaries.write_place_holder(view.size_bytes());
auto dst = ptr;
using boost::range::for_each;
for_each(view, [&] (bytes_view bv) {
dst = std::copy(bv.begin(), bv.end(), dst);
});
return bytes_view(ptr, view.size_bytes());
}
return _batch_options->at(i);
}
void query_options::prepare(const std::vector<::shared_ptr<column_specification>>& specs)
@@ -226,11 +199,7 @@ void query_options::prepare(const std::vector<::shared_ptr<column_specification>
void query_options::fill_value_views()
{
for (auto&& value : _values) {
if (value) {
_value_views.emplace_back(cql3::raw_value_view::make_value(bytes_view{*value}));
} else {
_value_views.emplace_back(cql3::raw_value_view::make_null());
}
_value_views.emplace_back(value.to_view());
}
}


@@ -44,13 +44,14 @@
#include <seastar/util/gcc6-concepts.hh>
#include "timestamp.hh"
#include "bytes.hh"
#include "db/consistency_level.hh"
#include "db/consistency_level_type.hh"
#include "service/query_state.hh"
#include "service/pager/paging_state.hh"
#include "cql3/column_specification.hh"
#include "cql3/column_identifier.hh"
#include "cql3/values.hh"
#include "cql_serialization_format.hh"
#include "timeout_config.hh"
namespace cql3 {
@@ -70,10 +71,11 @@ public:
};
private:
const db::consistency_level _consistency;
const timeout_config& _timeout_config;
const std::experimental::optional<std::vector<sstring_view>> _names;
std::vector<cql3::raw_value> _values;
std::vector<cql3::raw_value_view> _value_views;
mutable std::vector<std::vector<int8_t>> _temporaries;
mutable bytes_ostream _temporaries;
const bool _skip_metadata;
const specific_options _options;
cql_serialization_format _cql_serialization_format;
@@ -100,15 +102,17 @@ private:
public:
query_options(query_options&&) = default;
query_options(const query_options&) = delete;
explicit query_options(const query_options&) = default;
explicit query_options(db::consistency_level consistency,
const timeout_config& timeouts,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
specific_options options,
cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
const timeout_config& timeouts,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -116,6 +120,7 @@ public:
specific_options options,
cql_serialization_format sf);
explicit query_options(db::consistency_level consistency,
const timeout_config& timeouts,
std::experimental::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
@@ -147,30 +152,81 @@ public:
// forInternalUse
explicit query_options(std::vector<cql3::raw_value> values);
explicit query_options(db::consistency_level, std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(db::consistency_level, const timeout_config& timeouts,
std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(std::unique_ptr<query_options>, ::shared_ptr<service::pager::paging_state> paging_state);
explicit query_options(std::unique_ptr<query_options>, ::shared_ptr<service::pager::paging_state> paging_state, int32_t page_size);
const timeout_config& get_timeout_config() const { return _timeout_config; }
db::consistency_level get_consistency() const {
return _consistency;
}
cql3::raw_value_view get_value_at(size_t idx) const {
return _value_views.at(idx);
}
size_t get_values_count() const {
return _value_views.size();
}
db::consistency_level get_consistency() const;
cql3::raw_value_view get_value_at(size_t idx) const;
cql3::raw_value_view make_temporary(cql3::raw_value value) const;
size_t get_values_count() const;
bool skip_metadata() const;
/** The pageSize for this query. Will be <= 0 if not relevant for the query. */
int32_t get_page_size() const;
bytes_view linearize(fragmented_temporary_buffer::view) const;
bool skip_metadata() const {
return _skip_metadata;
}
int32_t get_page_size() const {
return get_specific_options().page_size;
}
/** The paging state for this query, or null if not relevant. */
::shared_ptr<service::pager::paging_state> get_paging_state() const;
::shared_ptr<service::pager::paging_state> get_paging_state() const {
return get_specific_options().state;
}
/** Serial consistency for conditional updates. */
std::experimental::optional<db::consistency_level> get_serial_consistency() const;
api::timestamp_type get_timestamp(service::query_state& state) const;
std::experimental::optional<db::consistency_level> get_serial_consistency() const {
return get_specific_options().serial_consistency;
}
api::timestamp_type get_timestamp(service::query_state& state) const {
auto tstamp = get_specific_options().timestamp;
return tstamp != api::missing_timestamp ? tstamp : state.get_timestamp();
}
/**
* The protocol version for the query. Will be 3 if the object don't come from
* a native protocol request (i.e. it's been allocated locally or by CQL-over-thrift).
*/
int get_protocol_version() const;
cql_serialization_format get_cql_serialization_format() const;
int get_protocol_version() const {
return _cql_serialization_format.protocol_version();
}
cql_serialization_format get_cql_serialization_format() const {
return _cql_serialization_format;
}
const query_options::specific_options& get_specific_options() const {
return _options;
}
// Mainly for the sake of BatchQueryOptions
const specific_options& get_specific_options() const;
const query_options& for_statement(size_t i) const;
const query_options& for_statement(size_t i) const {
if (!_batch_options) {
// No per-statement options supplied, so use the "global" options
return *this;
}
return _batch_options->at(i);
}
const std::experimental::optional<std::vector<sstring_view>>& get_names() const noexcept {
return _names;
}
void prepare(const std::vector<::shared_ptr<column_specification>>& specs);
private:
void fill_value_views();
@@ -188,7 +244,7 @@ query_options::query_options(query_options&& o, std::vector<OneMutationDataRange
std::vector<query_options> tmp;
tmp.reserve(values_ranges.size());
std::transform(values_ranges.begin(), values_ranges.end(), std::back_inserter(tmp), [this](auto& values_range) {
return query_options(_consistency, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
return query_options(_consistency, _timeout_config, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
});
_batch_options = std::move(tmp);
}


@@ -58,6 +58,7 @@ using namespace cql_transport::messages;
logging::logger log("query_processor");
logging::logger prep_cache_log("prepared_statements_cache");
logging::logger authorized_prepared_statements_cache_log("authorized_prepared_statements_cache");
distributed<query_processor> _the_query_processor;
@@ -91,12 +92,16 @@ api::timestamp_type query_processor::next_timestamp() {
return _internal_state->next_timestamp();
}
query_processor::query_processor(distributed<service::storage_proxy>& proxy, distributed<database>& db)
query_processor::query_processor(service::storage_proxy& proxy, distributed<database>& db, query_processor::memory_config mcfg)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _internal_state(new internal_state())
, _prepared_cache(prep_cache_log) {
, _prepared_cache(prep_cache_log, mcfg.prepared_statment_cache_size)
, _authorized_prepared_cache(std::min(std::chrono::milliseconds(_db.local().get_config().permissions_validity_in_ms()),
std::chrono::duration_cast<std::chrono::milliseconds>(prepared_statements_cache::entry_expiry)),
std::chrono::milliseconds(_db.local().get_config().permissions_update_interval_in_ms()),
mcfg.authorized_prepared_cache_size, authorized_prepared_statements_cache_log) {
namespace sm = seastar::metrics;
_metrics.add_group(
@@ -159,6 +164,11 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy, dis
sm::description("Counts a total number of LOGGED batches that were executed as UNLOGGED "
"batches.")),
sm::make_derive(
"rows_read",
_cql_stats.rows_read,
sm::description("Counts a total number of rows read during CQL requests.")),
sm::make_derive(
"prepared_cache_evictions",
[] { return prepared_statements_cache::shard_stats().prepared_cache_evictions; },
@@ -172,7 +182,80 @@ query_processor::query_processor(distributed<service::storage_proxy>& proxy, dis
sm::make_gauge(
"prepared_cache_memory_footprint",
[this] { return _prepared_cache.memory_footprint(); },
sm::description("Size (in bytes) of the prepared statements cache."))});
sm::description("Size (in bytes) of the prepared statements cache.")),
sm::make_derive(
"secondary_index_creates",
_cql_stats.secondary_index_creates,
sm::description("Counts a total number of CQL CREATE INDEX requests.")),
sm::make_derive(
"secondary_index_drops",
_cql_stats.secondary_index_drops,
sm::description("Counts a total number of CQL DROP INDEX requests.")),
// secondary_index_reads total count is also included in all cql reads
sm::make_derive(
"secondary_index_reads",
_cql_stats.secondary_index_reads,
sm::description("Counts a total number of CQL read requests performed using secondary indexes.")),
// secondary_index_rows_read total count is also included in all cql rows read
sm::make_derive(
"secondary_index_rows_read",
_cql_stats.secondary_index_rows_read,
sm::description("Counts a total number of rows read during CQL requests performed using secondary indexes.")),
// read requests that required ALLOW FILTERING
sm::make_derive(
"filtered_read_requests",
_cql_stats.filtered_reads,
sm::description("Counts a total number of CQL read requests that required ALLOW FILTERING. See filtered_rows_read_total to compare how many rows needed to be filtered.")),
// rows read with filtering enabled (because ALLOW FILTERING was required)
sm::make_derive(
"filtered_rows_read_total",
_cql_stats.filtered_rows_read_total,
sm::description("Counts a total number of rows read during CQL requests that required ALLOW FILTERING. See filtered_rows_matched_total and filtered_rows_dropped_total for information how accurate filtering queries are.")),
// rows read with filtering enabled and accepted by the filter
sm::make_derive(
"filtered_rows_matched_total",
_cql_stats.filtered_rows_matched_total,
sm::description("Counts a number of rows read during CQL requests that required ALLOW FILTERING and accepted by the filter. Number similar to filtered_rows_read_total indicates that filtering is accurate.")),
// rows read with filtering enabled and rejected by the filter
sm::make_derive(
"filtered_rows_dropped_total",
[this]() {return _cql_stats.filtered_rows_read_total - _cql_stats.filtered_rows_matched_total;},
sm::description("Counts a number of rows read during CQL requests that required ALLOW FILTERING and dropped by the filter. Number similar to filtered_rows_read_total indicates that filtering is not accurate and might cause performance degradation.")),
sm::make_derive(
"authorized_prepared_statements_cache_evictions",
[] { return authorized_prepared_statements_cache::shard_stats().authorized_prepared_statements_cache_evictions; },
sm::description("Counts a number of authenticated prepared statements cache entries evictions.")),
sm::make_gauge(
"authorized_prepared_statements_cache_size",
[this] { return _authorized_prepared_cache.size(); },
sm::description("A number of entries in the authenticated prepared statements cache.")),
sm::make_gauge(
"user_prepared_auth_cache_footprint",
[this] { return _authorized_prepared_cache.memory_footprint(); },
sm::description("Size (in bytes) of the authenticated prepared statements cache.")),
sm::make_counter(
"reverse_queries",
_cql_stats.reverse_queries,
sm::description("Counts number of CQL SELECT requests with ORDER BY DESC.")),
sm::make_counter(
"unpaged_select_queries",
_cql_stats.unpaged_select_queries,
sm::description("Counts number of unpaged CQL SELECT requests.")),
});
service::get_local_migration_manager().register_listener(_migration_subscriber.get());
}
@@ -182,7 +265,7 @@ query_processor::~query_processor() {
future<> query_processor::stop() {
service::get_local_migration_manager().unregister_listener(_migration_subscriber.get());
return make_ready_future<>();
return _authorized_prepared_cache.stop().finally([this] { return _prepared_cache.stop(); });
}
future<::shared_ptr<result_message>>
@@ -190,11 +273,11 @@ query_processor::process(const sstring_view& query_string, service::query_state&
log.trace("process: \"{}\"", query_string);
tracing::trace(query_state.get_trace_state(), "Parsing a statement");
auto p = get_statement(query_string, query_state.get_client_state());
options.prepare(p->bound_names);
auto cql_statement = p->statement;
if (cql_statement->get_bound_terms() != options.get_values_count()) {
throw exceptions::invalid_request_exception("Invalid amount of bind variables");
}
options.prepare(p->bound_names);
warn(unimplemented::cause::METRICS);
#if 0
@@ -202,33 +285,55 @@ query_processor::process(const sstring_view& query_string, service::query_state&
metrics.regularStatementsExecuted.inc();
#endif
tracing::trace(query_state.get_trace_state(), "Processing a statement");
return process_statement(std::move(cql_statement), query_state, options);
return process_statement_unprepared(std::move(cql_statement), query_state, options);
}
future<::shared_ptr<result_message>>
query_processor::process_statement(
query_processor::process_statement_unprepared(
::shared_ptr<cql_statement> statement,
service::query_state& query_state,
const query_options& options) {
return statement->check_access(query_state.get_client_state()).then([this, statement, &query_state, &options]() {
auto& client_state = query_state.get_client_state();
return statement->check_access(query_state.get_client_state()).then([this, statement, &query_state, &options] () mutable {
return process_authorized_statement(std::move(statement), query_state, options);
});
}
statement->validate(_proxy, client_state);
future<::shared_ptr<result_message>>
query_processor::process_statement_prepared(
statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key,
service::query_state& query_state,
const query_options& options,
bool needs_authorization) {
auto fut = make_ready_future<::shared_ptr<cql_transport::messages::result_message>>();
if (client_state.is_internal()) {
fut = statement->execute_internal(_proxy, query_state, options);
} else {
fut = statement->execute(_proxy, query_state, options);
}
return fut.then([statement] (auto msg) {
if (msg) {
return make_ready_future<::shared_ptr<result_message>>(std::move(msg));
}
return make_ready_future<::shared_ptr<result_message>>(
::make_shared<result_message::void_message>());
::shared_ptr<cql_statement> statement = prepared->statement;
future<> fut = make_ready_future<>();
if (needs_authorization) {
fut = statement->check_access(query_state.get_client_state()).then([this, &query_state, prepared = std::move(prepared), cache_key = std::move(cache_key)] () mutable {
return _authorized_prepared_cache.insert(*query_state.get_client_state().user(), std::move(cache_key), std::move(prepared)).handle_exception([this] (auto eptr) {
log.error("failed to cache the entry", eptr);
});
});
}
return fut.then([this, statement = std::move(statement), &query_state, &options] () mutable {
return process_authorized_statement(std::move(statement), query_state, options);
});
}
future<::shared_ptr<result_message>>
query_processor::process_authorized_statement(const ::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options) {
auto& client_state = query_state.get_client_state();
statement->validate(_proxy, client_state);
auto fut = statement->execute(_proxy, query_state, options);
return fut.then([statement] (auto msg) {
if (msg) {
return make_ready_future<::shared_ptr<result_message>>(std::move(msg));
}
return make_ready_future<::shared_ptr<result_message>>(::make_shared<result_message::void_message>());
});
}
@@ -340,6 +445,7 @@ query_options query_processor::make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>& values,
db::consistency_level cl,
const timeout_config& timeout_config,
int32_t page_size) {
if (p->bound_names.size() != values.size()) {
throw std::invalid_argument(
@@ -363,10 +469,11 @@ query_options query_processor::make_internal_options(
api::timestamp_type ts = api::missing_timestamp;
return query_options(
cl,
timeout_config,
bound_values,
cql3::query_options::specific_options{page_size, std::move(paging_state), serial_consistency, ts});
}
return query_options(cl, bound_values);
return query_options(cl, timeout_config, bound_values);
}
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
@@ -397,7 +504,7 @@ struct internal_query_state {
::shared_ptr<internal_query_state> query_processor::create_paged_state(const sstring& query_string,
const std::initializer_list<data_value>& values, int32_t page_size) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, db::consistency_level::ONE, page_size);
auto opts = make_internal_options(p, values, db::consistency_level::ONE, infinite_timeout_config, page_size);
::shared_ptr<internal_query_state> res = ::make_shared<internal_query_state>(
internal_query_state{
query_string,
@@ -446,7 +553,7 @@ future<> query_processor::for_each_cql_result(
future<::shared_ptr<untyped_result_set>>
query_processor::execute_paged_internal(::shared_ptr<internal_query_state> state) {
return state->p->statement->execute_internal(_proxy, *_internal_state, *state->opts).then(
return state->p->statement->execute(_proxy, *_internal_state, *state->opts).then(
[state, this](::shared_ptr<cql_transport::messages::result_message> msg) mutable {
class visitor : public result_message::visitor_base {
::shared_ptr<internal_query_state> _state;
@@ -485,9 +592,9 @@ future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
statements::prepared_statement::checked_weak_ptr p,
const std::initializer_list<data_value>& values) {
query_options opts = make_internal_options(p, values);
query_options opts = make_internal_options(p, values, db::consistency_level::ONE, infinite_timeout_config);
return do_with(std::move(opts), [this, p = std::move(p)](auto& opts) {
return p->statement->execute_internal(
return p->statement->execute(
_proxy,
*_internal_state,
opts).then([&opts, stmt = p->statement](auto msg) {
@@ -500,15 +607,16 @@ future<::shared_ptr<untyped_result_set>>
query_processor::process(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
bool cache) {
if (cache) {
return process(prepare_internal(query_string), cl, values);
return process(prepare_internal(query_string), cl, timeout_config, values);
} else {
auto p = parse_statement(query_string)->prepare(_db.local(), _cql_stats);
p->statement->validate(_proxy, *_internal_state);
auto checked_weak_ptr = p->checked_weak_from_this();
return process(std::move(checked_weak_ptr), cl, values).finally([p = std::move(p)] {});
return process(std::move(checked_weak_ptr), cl, timeout_config, values).finally([p = std::move(p)] {});
}
}
@@ -516,8 +624,9 @@ future<::shared_ptr<untyped_result_set>>
query_processor::process(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values) {
auto opts = make_internal_options(p, values, cl);
auto opts = make_internal_options(p, values, cl, timeout_config);
return do_with(std::move(opts), [this, p = std::move(p)](auto & opts) {
return p->statement->execute(_proxy, *_internal_state, opts).then([](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
@@ -529,11 +638,18 @@ future<::shared_ptr<cql_transport::messages::result_message>>
query_processor::process_batch(
::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options) {
return batch->check_access(query_state.get_client_state()).then([this, &query_state, &options, batch] {
batch->validate();
batch->validate(_proxy, query_state.get_client_state());
return batch->execute(_proxy, query_state, options);
query_options& options,
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries) {
return batch->check_access(query_state.get_client_state()).then([this, &query_state, &options, batch, pending_authorization_entries = std::move(pending_authorization_entries)] () mutable {
return parallel_for_each(pending_authorization_entries, [this, &query_state] (auto& e) {
return _authorized_prepared_cache.insert(*query_state.get_client_state().user(), e.first, std::move(e.second)).handle_exception([this] (auto eptr) {
log.error("failed to cache the entry", eptr);
});
}).then([this, &query_state, &options, batch] {
batch->validate();
batch->validate(_proxy, query_state.get_client_state());
return batch->execute(_proxy, query_state, options);
});
});
}

View File

@@ -49,6 +49,7 @@
#include <seastar/core/shared_ptr.hh>
#include "cql3/prepared_statements_cache.hh"
#include "cql3/authorized_prepared_statements_cache.hh"
#include "cql3/query_options.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/statements/raw/parsed_statement.hh"
@@ -99,10 +100,14 @@ public:
class query_processor {
public:
class migration_subscriber;
struct memory_config {
size_t prepared_statment_cache_size = 0;
size_t authorized_prepared_cache_size = 0;
};
private:
std::unique_ptr<migration_subscriber> _migration_subscriber;
distributed<service::storage_proxy>& _proxy;
service::storage_proxy& _proxy;
distributed<database>& _db;
struct stats {
@@ -117,6 +122,7 @@ private:
std::unique_ptr<internal_state> _internal_state;
prepared_statements_cache _prepared_cache;
authorized_prepared_statements_cache _authorized_prepared_cache;
// A map for prepared statements used internally (which we don't want to mix with user statements; in particular, we
// don't bother with expiration on those).
@@ -135,7 +141,7 @@ public:
static ::shared_ptr<statements::raw::parsed_statement> parse_statement(const std::experimental::string_view& query);
query_processor(distributed<service::storage_proxy>& proxy, distributed<database>& db);
query_processor(service::storage_proxy& proxy, distributed<database>& db, memory_config mcfg);
~query_processor();
@@ -143,7 +149,7 @@ public:
return _db;
}
distributed<service::storage_proxy>& proxy() {
service::storage_proxy& proxy() {
return _proxy;
}
@@ -151,6 +157,21 @@ public:
return _cql_stats;
}
statements::prepared_statement::checked_weak_ptr get_prepared(const auth::authenticated_user* user_ptr, const prepared_cache_key_type& key) {
if (user_ptr) {
auto it = _authorized_prepared_cache.find(*user_ptr, key);
if (it != _authorized_prepared_cache.end()) {
try {
return it->get()->checked_weak_from_this();
} catch (seastar::checked_ptr_is_null_exception&) {
// If the prepared statement got invalidated - remove the corresponding authorized_prepared_statements_cache entry as well.
_authorized_prepared_cache.remove(*user_ptr, key);
}
}
}
return statements::prepared_statement::checked_weak_ptr();
}
statements::prepared_statement::checked_weak_ptr get_prepared(const prepared_cache_key_type& key) {
auto it = _prepared_cache.find(key);
if (it == _prepared_cache.end()) {
@@ -160,11 +181,19 @@ public:
}
future<::shared_ptr<cql_transport::messages::result_message>>
process_statement(
process_statement_unprepared(
::shared_ptr<cql_statement> statement,
service::query_state& query_state,
const query_options& options);
future<::shared_ptr<cql_transport::messages::result_message>>
process_statement_prepared(
statements::prepared_statement::checked_weak_ptr statement,
cql3::prepared_cache_key_type cache_key,
service::query_state& query_state,
const query_options& options,
bool needs_authorization);
future<::shared_ptr<cql_transport::messages::result_message>>
process(
const std::experimental::string_view& query_string,
@@ -215,12 +244,14 @@ public:
future<::shared_ptr<untyped_result_set>> process(
const sstring& query_string,
db::consistency_level,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& = { },
bool cache = false);
future<::shared_ptr<untyped_result_set>> process(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& = { });
/*
@@ -242,7 +273,11 @@ public:
future<> stop();
future<::shared_ptr<cql_transport::messages::result_message>>
process_batch(::shared_ptr<statements::batch_statement>, service::query_state& query_state, query_options& options);
process_batch(
::shared_ptr<statements::batch_statement>,
service::query_state& query_state,
query_options& options,
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries);
std::unique_ptr<statements::prepared_statement> get_statement(
const std::experimental::string_view& query,
@@ -254,9 +289,13 @@ private:
query_options make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>&,
db::consistency_level = db::consistency_level::ONE,
db::consistency_level,
const timeout_config& timeout_config,
int32_t page_size = -1);
future<::shared_ptr<cql_transport::messages::result_message>>
process_authorized_statement(const ::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options);
/*!
* \brief created a state object for paging
*

View File

@@ -45,12 +45,16 @@
#include "cql3/statements/request_validations.hh"
#include "cql3/restrictions/primary_key_restrictions.hh"
#include "cql3/statements/request_validations.hh"
#include "cql3/restrictions/single_column_primary_key_restrictions.hh"
namespace cql3 {
namespace restrictions {
class multi_column_restriction : public primary_key_restrictions<clustering_key_prefix> {
private:
bool _has_only_asc_columns;
bool _has_only_desc_columns;
protected:
schema_ptr _schema;
std::vector<const column_definition*> _column_defs;
@@ -58,7 +62,9 @@ public:
multi_column_restriction(schema_ptr schema, std::vector<const column_definition*>&& defs)
: _schema(schema)
, _column_defs(std::move(defs))
{ }
{
update_asc_desc_existence();
}
virtual bool is_multi_column() const override {
return true;
@@ -84,6 +90,7 @@ public:
"Mixing single column relations and multi column relations on clustering columns is not allowed");
auto as_pkr = static_pointer_cast<primary_key_restrictions<clustering_key_prefix>>(other);
do_merge_with(as_pkr);
update_asc_desc_existence();
}
bool is_satisfied_by(const schema& schema,
@@ -140,6 +147,40 @@ protected:
virtual bool is_supported_by(const secondary_index::index& index) const = 0;
/**
* @return true if the restriction contains at least one column of each
* ordering, false otherwise.
*/
bool is_mixed_order() const {
return !is_desc_order() && !is_asc_order();
}
/**
* @return true if all the restricted columns are ordered in descending
* order, false otherwise
*/
bool is_desc_order() const {
return _has_only_desc_columns;
}
/**
* @return true if all the restricted columns are ordered in ascending
* order, false otherwise
*/
bool is_asc_order() const {
return _has_only_asc_columns;
}
private:
/**
* Updates the _has_only_asc_columns and _has_only_desc_columns fields.
*/
void update_asc_desc_existence() {
std::size_t num_of_desc =
std::count_if(_column_defs.begin(), _column_defs.end(), [] (const column_definition* cd) { return cd->type->is_reversed(); });
_has_only_asc_columns = num_of_desc == 0;
_has_only_desc_columns = num_of_desc == _column_defs.size();
}
#if 0
/**
* Check if this type of restriction is supported for the specified column by the specified index.
@@ -385,6 +426,7 @@ protected:
};
class multi_column_restriction::slice final : public multi_column_restriction {
using restriction_shared_ptr = ::shared_ptr<primary_key_restrictions<clustering_key_prefix>>;
private:
term_slice _slice;
@@ -422,24 +464,11 @@ public:
}
virtual std::vector<bounds_range_type> bounds_ranges(const query_options& options) const override {
// FIXME: doesn't work properly with mixed CLUSTERING ORDER (CASSANDRA-7281)
auto read_bound = [&] (statements::bound b) -> std::experimental::optional<bounds_range_type::bound> {
if (!has_bound(b)) {
return {};
}
auto vals = component_bounds(b, options);
for (unsigned i = 0; i < vals.size(); i++) {
statements::request_validations::check_not_null(vals[i], "Invalid null value in condition for column %s", _column_defs.at(i)->name_as_text());
}
auto prefix = clustering_key_prefix::from_optional_exploded(*_schema, vals);
return bounds_range_type::bound(prefix, is_inclusive(b));
};
auto range = wrapping_range<clustering_key_prefix>(read_bound(statements::bound::START), read_bound(statements::bound::END));
auto bounds = bound_view::from_range(range);
if (bound_view::compare(*_schema)(bounds.second, bounds.first)) {
return { };
if (!is_mixed_order()) {
return bounds_ranges_unified_order(options);
} else {
return bounds_ranges_mixed_order(options);
}
return { bounds_range_type(std::move(range)) };
}
#if 0
@Override
@@ -514,6 +543,221 @@ private:
auto value = static_pointer_cast<tuples::value>(_slice.bound(b)->bind(options));
return value->get_elements();
}
std::vector<bytes_opt> read_bound_components(const query_options& options, statements::bound b) const {
if (!has_bound(b)) {
return {};
}
auto vals = component_bounds(b, options);
for (unsigned i = 0; i < vals.size(); i++) {
statements::request_validations::check_not_null(vals[i], "Invalid null value in condition for column %s", _column_defs.at(i)->name_as_text());
}
return vals;
}
/**
* Retrieve the bounds for the case that all clustering columns have the same order.
* Having the same order implies we can do a prefix search on the data.
* @param options the query options
* @return the vector of ranges for the restriction
*/
std::vector<bounds_range_type> bounds_ranges_unified_order(const query_options& options) const {
auto start_prefix = clustering_key_prefix::from_optional_exploded(*_schema, read_bound_components(options, statements::bound::START));
auto start_bound = bounds_range_type::bound(std::move(start_prefix), is_inclusive(statements::bound::START));
auto end_prefix = clustering_key_prefix::from_optional_exploded(*_schema, read_bound_components(options, statements::bound::END));
auto end_bound = bounds_range_type::bound(std::move(end_prefix), is_inclusive(statements::bound::END));
auto make_range = [&] () {
if (is_asc_order()) {
return bounds_range_type::make(start_bound, end_bound);
} else {
return bounds_range_type::make(end_bound, start_bound);
}
};
auto range = make_range();
auto bounds = bound_view::from_range(range);
if (bound_view::compare(*_schema)(bounds.second, bounds.first)) {
return { };
}
return { std::move(range) };
}
/**
* Retrieve the bounds when the clustering columns have mixed order
* (contain both ASC and DESC).
* Mixed order implies that a prefix search can't take place; instead,
* the bounds have to be broken down into separate prefix-searchable
* ranges such that their combination is equivalent to the original range.
* @param options the query options
* @return the vector of ranges for the restriction
*/
std::vector<bounds_range_type> bounds_ranges_mixed_order(const query_options& options) const {
std::vector<bounds_range_type> ret_ranges;
auto mixed_order_restrictions = build_mixed_order_restriction_set(options);
ret_ranges.reserve(mixed_order_restrictions.size());
for (auto r : mixed_order_restrictions) {
for (auto&& range : r->bounds_ranges(options)) {
ret_ranges.emplace_back(std::move(range));
}
}
return ret_ranges;
}
/**
* Returns the index of the first real inequality component: the first
* component in the tuple that will turn into a slice single-column
* restriction.
* For example: (a, b, c) > (0, 1, 2) and (a, b, c) < (0, 1, 5) will be
* broken into one single-column restriction set of the form:
* a = 0 and b = 1 and c > 2 and c < 5; c is the first element that has
* an inequality, so in this case the function will return 2.
* @param start_components - the components of the start tuple range.
* @param end_components - the components of the end tuple range.
* @return the index of the first component that will yield an inequality,
* or an empty value if there is none
*/
std::optional<std::size_t> find_first_neq_component(std::vector<bytes_opt>& start_components,
std::vector<bytes_opt>& end_components) const {
size_t common_components_count = std::min(start_components.size(), end_components.size());
for (size_t i = 0; i < common_components_count ; i++) {
if (start_components[i].value() != end_components[i].value()) {
return i;
}
}
size_t max_components_count = std::max(start_components.size(), end_components.size());
if (common_components_count < max_components_count) {
return common_components_count;
} else {
return std::nullopt;
}
}
/**
* Creates a single-column restriction which is either a slice or an equality.
* @param bound - if bound is empty this is an equality; if it is either START or END,
* this is the corresponding slice restriction.
* @param inclusive - whether the slice is inclusive (ignored for equality).
* @param column_pos - the column position to restrict
* @param value - the value to restrict the column with.
* @return a shared pointer to the newly created restriction.
*/
::shared_ptr<restriction> make_single_column_restriction(std::optional<cql3::statements::bound> bound, bool inclusive,
std::size_t column_pos,const bytes_opt& value) const {
::shared_ptr<cql3::term> term = ::make_shared<cql3::constants::value>(cql3::raw_value::make_value(value));
if (!bound){
return ::make_shared<cql3::restrictions::single_column_restriction::EQ>(*_column_defs[column_pos], term);
} else {
return ::make_shared<cql3::restrictions::single_column_restriction::slice>(*_column_defs[column_pos], bound.value(), inclusive, term);
}
}
/**
* A helper function to create a single-column restriction set from a tuple relation on
* clustering keys.
* E.g. (a,b,c) >= (0,1,2) will become:
* 1. a > 0
* 2. a = 0 and b > 1
* 3. a = 0 and b = 1 and c >= 2
* @param bound - determines whether the operator is '>' (START) or '<' (END)
* @param bound_inclusive - determines whether to append equality to the operator, i.e. whether > becomes >=
* @param bound_values - the tuple values for the restriction
* @param first_neq_component - the first component that will have an inequality.
* For the example above, if this parameter is 1, only restrictions 2 and 3 will be created.
* This parameter helps to handle the nuances of breaking down more complex relations,
* for example when a second condition limits the other side of the bound,
* e.g. (a,b,c) >= (0,1,2) and (a,b,c) < (5,6,7); this requires each bound to use the parameter.
* @return the single-column restriction set built according to the above parameters.
*/
std::vector<restriction_shared_ptr> make_single_bound_restrictions(statements::bound bound, bool bound_inclusive,
std::vector<bytes_opt>& bound_values,
std::size_t first_neq_component) const{
std::vector<restriction_shared_ptr> ret;
std::size_t num_of_restrictions = bound_values.size() - first_neq_component;
ret.reserve(num_of_restrictions);
for (std::size_t i = 0;i < num_of_restrictions ; i++) {
ret.emplace_back(::make_shared<cql3::restrictions::single_column_primary_key_restrictions<clustering_key>>(_schema, false));
std::size_t neq_component_idx = first_neq_component + i;
for (std::size_t j = 0;j < neq_component_idx; j++) {
ret[i]->merge_with(make_single_column_restriction(std::nullopt, false, j, bound_values[j]));
}
bool inclusive = (i == (num_of_restrictions-1)) && bound_inclusive;
ret[i]->merge_with(make_single_column_restriction(bound, inclusive, neq_component_idx, bound_values[neq_component_idx]));
}
return ret;
}
/**
* Builds and returns a set of restrictions such that the union of their ranges (the restrictions OR-ed together)
* is logically identical to this restriction, with the additional property that it can execute
* correctly when the clustering columns have "mixed order" - i.e. contain both ASC and DESC orderings.
* For more information: https://github.com/scylladb/scylla/issues/2050
* @param options - the query options
* @return a set of restrictions whose ranges' union is logically identical to this restriction.
*/
std::vector<::shared_ptr<primary_key_restrictions<clustering_key_prefix>>>
build_mixed_order_restriction_set(const query_options& options) const {
std::vector<restriction_shared_ptr> ret;
auto start_components = read_bound_components(options, statements::bound::START);
auto end_components = read_bound_components(options, statements::bound::END);
bool start_inclusive = is_inclusive(statements::bound::START);
bool end_inclusive = is_inclusive(statements::bound::END);
std::optional<std::size_t> first_neq_component = std::nullopt;
// find the index of the first component that is not equal between the tuples.
if (start_components.empty() || end_components.empty()) {
first_neq_component = 0;
} else {
auto tuple_mismatch = std::mismatch(start_components.begin(), start_components.end(),
end_components.begin(), end_components.end());
if ((tuple_mismatch.first != start_components.end()) ||
(tuple_mismatch.second != end_components.end())) {
first_neq_component = std::distance(start_components.begin(), tuple_mismatch.first);
}
}
// this is either a simple equality or a never fulfilled restriction
if (!first_neq_component && start_inclusive && end_inclusive) {
// This is a simple equality case
shared_ptr<cql3::term> term = ::make_shared<cql3::tuples::value>(start_components);
ret.emplace_back(::make_shared<cql3::restrictions::multi_column_restriction::EQ>(_schema, _column_defs, term));
return ret;
} else if (!first_neq_component) {
// This is a contradiction case
return {};
} else if ((*first_neq_component == end_components.size() && !end_inclusive ) ||
(*first_neq_component == start_components.size() && !start_inclusive )) {
// This is a case where one bound is a prefix of the other. If this prefix bound
// is not inclusive the result will be an empty set.
return {};
}
bool start_components_exists = (start_components.size() - first_neq_component.value()) > 0;
bool end_components_exists = (end_components.size() - first_neq_component.value()) > 0;
bool both_components_exists = start_components_exists && end_components_exists;
if (start_components_exists) {
auto restrictions =
make_single_bound_restrictions(statements::bound::START, start_inclusive, start_components, first_neq_component.value());
for (auto&& r : restrictions) {
ret.emplace_back(r);
}
}
if (end_components_exists) {
auto restrictions =
make_single_bound_restrictions(statements::bound::END, end_inclusive,
end_components, first_neq_component.value() + both_components_exists);
for (auto&& r : restrictions) {
ret.emplace_back(r);
}
}
if (both_components_exists) {
bool inclusive = end_inclusive && ((end_components.size() - first_neq_component.value()) == 1);
ret[0]->merge_with(make_single_column_restriction(statements::bound::END, inclusive, first_neq_component.value(),
end_components[first_neq_component.value()]));
}
return ret;
}
};
}

View File

@@ -88,6 +88,7 @@ public:
using restrictions::uses_function;
using restrictions::has_supporting_index;
using restrictions::values;
bool empty() const override {
return get_column_defs().empty();
@@ -95,7 +96,72 @@ public:
uint32_t size() const override {
return uint32_t(get_column_defs().size());
}
bool has_unrestricted_components(const schema& schema) const;
virtual bool needs_filtering(const schema& schema) const;
// How long a prefix of the restrictions could have resulted in
// needs_filtering() == false. These restrictions do not need to be
// applied during filtering.
// For example, if we have the filter "c1 < 3 and c2 > 3", c1 does
// not need filtering (just a read stopping at c1=3) but c2 does,
// so num_prefix_columns_that_need_not_be_filtered() will be 1.
virtual unsigned int num_prefix_columns_that_need_not_be_filtered() const {
return 0;
}
virtual bool is_all_eq() const {
return false;
}
virtual size_t prefix_size() const {
return 0;
}
size_t prefix_size(const schema_ptr schema) const {
return 0;
}
};
template<>
inline bool primary_key_restrictions<partition_key>::has_unrestricted_components(const schema& schema) const {
return size() < schema.partition_key_size();
}
template<>
inline bool primary_key_restrictions<clustering_key>::has_unrestricted_components(const schema& schema) const {
return size() < schema.clustering_key_size();
}
template<>
inline bool primary_key_restrictions<partition_key>::needs_filtering(const schema& schema) const {
return !empty() && !is_on_token() && (has_unrestricted_components(schema) || is_contains() || is_slice());
}
template<>
inline bool primary_key_restrictions<clustering_key>::needs_filtering(const schema& schema) const {
// Currently only overloaded single_column_primary_key_restrictions will require ALLOW FILTERING
return false;
}
template<>
inline size_t primary_key_restrictions<clustering_key>::prefix_size(const schema_ptr schema) const {
size_t count = 0;
if (schema->clustering_key_columns().empty()) {
return count;
}
auto column_defs = get_column_defs();
column_id expected_column_id = schema->clustering_key_columns().begin()->id;
for (auto&& cdef : column_defs) {
if (schema->position(*cdef) != expected_column_id) {
return count;
}
expected_column_id++;
count++;
}
return count;
}
}
}

View File

@@ -68,6 +68,10 @@ public:
virtual std::vector<bytes_opt> values(const query_options& options) const = 0;
virtual bytes_opt value_for(const column_definition& cdef, const query_options& options) const {
throw exceptions::invalid_request_exception("Single value can be obtained from single-column restrictions only");
}
/**
* Returns <code>true</code> if one of the restrictions use the specified function.
*

View File

@@ -49,6 +49,7 @@
#include <boost/algorithm/cxx11/all_of.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <boost/range/adaptor/map.hpp>
namespace cql3 {
@@ -62,21 +63,46 @@ class single_column_primary_key_restrictions : public primary_key_restrictions<V
using range_type = query::range<ValueType>;
using range_bound = typename range_type::bound;
using bounds_range_type = typename primary_key_restrictions<ValueType>::bounds_range_type;
template<typename OtherValueType>
friend class single_column_primary_key_restrictions;
private:
schema_ptr _schema;
bool _allow_filtering;
::shared_ptr<single_column_restrictions> _restrictions;
bool _slice;
bool _contains;
bool _in;
public:
single_column_primary_key_restrictions(schema_ptr schema)
single_column_primary_key_restrictions(schema_ptr schema, bool allow_filtering)
: _schema(schema)
, _allow_filtering(allow_filtering)
, _restrictions(::make_shared<single_column_restrictions>(schema))
, _slice(false)
, _contains(false)
, _in(false)
{ }
// Convert another primary key restrictions type into this type, possibly using different schema
template<typename OtherValueType>
explicit single_column_primary_key_restrictions(schema_ptr schema, const single_column_primary_key_restrictions<OtherValueType>& other)
: _schema(schema)
, _allow_filtering(other._allow_filtering)
, _restrictions(::make_shared<single_column_restrictions>(schema))
, _slice(other._slice)
, _contains(other._contains)
, _in(other._in)
{
for (const auto& entry : other._restrictions->restrictions()) {
const column_definition* other_cdef = entry.first;
const column_definition* this_cdef = _schema->get_column_definition(other_cdef->name());
if (!this_cdef) {
throw exceptions::invalid_request_exception(sprint("Base column %s not found in view index schema", other_cdef->name_as_text()));
}
::shared_ptr<single_column_restriction> restriction = entry.second;
_restrictions->add_restriction(restriction->apply_to(*this_cdef));
}
}
virtual bool is_on_token() const override {
return false;
}
@@ -97,6 +123,10 @@ public:
return _in;
}
virtual bool is_all_eq() const override {
return _restrictions->is_all_eq();
}
virtual bool has_bound(statements::bound b) const override {
return boost::algorithm::all_of(_restrictions->restrictions(), [b] (auto&& r) { return r.second->has_bound(b); });
}
@@ -110,7 +140,7 @@ public:
}
void do_merge_with(::shared_ptr<single_column_restriction> restriction) {
if (!_restrictions->empty()) {
if (!_restrictions->empty() && !_allow_filtering) {
auto last_column = *_restrictions->last_column();
auto new_column = restriction->get_column_def();
@@ -127,11 +157,6 @@ public:
last_column.name_as_text(), new_column.name_as_text()));
}
}
if (_in && _schema->position(new_column) > _schema->position(last_column)) {
throw exceptions::invalid_request_exception(sprint("Clustering column \"%s\" cannot be restricted by an IN relation",
new_column.name_as_text()));
}
}
_slice |= restriction->is_slice();
@@ -140,6 +165,25 @@ public:
_restrictions->add_restriction(restriction);
}
virtual size_t prefix_size() const override {
return primary_key_restrictions<ValueType>::prefix_size(_schema);
}
::shared_ptr<single_column_primary_key_restrictions<clustering_key>> get_longest_prefix_restrictions() {
static_assert(std::is_same_v<ValueType, clustering_key>, "Only clustering key can produce longest prefix restrictions");
size_t current_prefix_size = prefix_size();
if (current_prefix_size == _restrictions->restrictions().size()) {
return dynamic_pointer_cast<single_column_primary_key_restrictions<clustering_key>>(this->shared_from_this());
}
auto longest_prefix_restrictions = ::make_shared<single_column_primary_key_restrictions<clustering_key>>(_schema, _allow_filtering);
auto restriction_it = _restrictions->restrictions().begin();
for (size_t i = 0; i < current_prefix_size; ++i) {
longest_prefix_restrictions->merge_with((restriction_it++)->second);
}
return longest_prefix_restrictions;
}
virtual void merge_with(::shared_ptr<restriction> restriction) override {
if (restriction->is_multi_column()) {
throw exceptions::invalid_request_exception(
@@ -312,11 +356,20 @@ public:
}
return res;
}
virtual bytes_opt value_for(const column_definition& cdef, const query_options& options) const override {
return _restrictions->value_for(cdef, options);
}
std::vector<bytes_opt> bounds(statements::bound b, const query_options& options) const override {
// TODO: if this proved to be required.
fail(unimplemented::cause::LEGACY_COMPOSITE_KEYS); // not 100% correct...
}
const single_column_restrictions::restrictions_map& restrictions() const {
return _restrictions->restrictions();
}
virtual bool has_supporting_index(const secondary_index::secondary_index_manager& index_manager) const override {
return _restrictions->has_supporting_index(index_manager);
}
@@ -352,10 +405,13 @@ public:
_restrictions->restrictions() | boost::adaptors::map_values,
[&] (auto&& r) { return r->is_satisfied_by(schema, key, ckey, cells, options, now); });
}
virtual bool needs_filtering(const schema& schema) const override;
virtual unsigned int num_prefix_columns_that_need_not_be_filtered() const override;
};
template<>
dht::partition_range_vector
inline dht::partition_range_vector
single_column_primary_key_restrictions<partition_key>::bounds_ranges(const query_options& options) const {
dht::partition_range_vector ranges;
ranges.reserve(size());
@@ -373,7 +429,7 @@ single_column_primary_key_restrictions<partition_key>::bounds_ranges(const query
}
template<>
std::vector<query::clustering_range>
inline std::vector<query::clustering_range>
single_column_primary_key_restrictions<clustering_key_prefix>::bounds_ranges(const query_options& options) const {
auto wrapping_bounds = compute_bounds(options);
auto bounds = boost::copy_range<query::clustering_row_ranges>(wrapping_bounds
@@ -409,6 +465,62 @@ single_column_primary_key_restrictions<clustering_key_prefix>::bounds_ranges(con
return bounds;
}
template<>
inline bool single_column_primary_key_restrictions<partition_key>::needs_filtering(const schema& schema) const {
return primary_key_restrictions<partition_key>::needs_filtering(schema);
}
template<>
inline bool single_column_primary_key_restrictions<clustering_key>::needs_filtering(const schema& schema) const {
// Restrictions currently need filtering in three cases:
// 1. any of them is a CONTAINS restriction
// 2. restrictions do not form a contiguous prefix (i.e. there are gaps in it)
// 3. a SLICE restriction isn't on a last place
column_id position = 0;
for (const auto& restriction : _restrictions->restrictions() | boost::adaptors::map_values) {
if (restriction->is_contains() || position != restriction->get_column_def().id) {
return true;
}
if (!restriction->is_slice()) {
position = restriction->get_column_def().id + 1;
}
}
return false;
}
// How many of the restrictions (in column order) do not need filtering
// because they are implemented as a slice (potentially, a contiguous disk
// read). For example, if we have the filter "c1 < 3 and c2 > 3", c1 does not
// need filtering but c2 does so num_prefix_columns_that_need_not_be_filtered
// will be 1.
// The implementation of num_prefix_columns_that_need_not_be_filtered() is
// closely tied to that of needs_filtering() above - basically, if only the
// first num_prefix_columns_that_need_not_be_filtered() restrictions existed,
// then needs_filtering() would have returned false.
template<>
inline unsigned single_column_primary_key_restrictions<clustering_key>::num_prefix_columns_that_need_not_be_filtered() const {
column_id position = 0;
unsigned int count = 0;
for (const auto& restriction : _restrictions->restrictions() | boost::adaptors::map_values) {
if (restriction->is_contains() || position != restriction->get_column_def().id) {
return count;
}
if (!restriction->is_slice()) {
position = restriction->get_column_def().id + 1;
}
count++;
}
return count;
}
template<>
inline unsigned single_column_primary_key_restrictions<partition_key>::num_prefix_columns_that_need_not_be_filtered() const {
// num_prefix_columns_that_need_not_be_filtered() is currently called only
// for clustering key restrictions, so it doesn't matter what we return here.
return 0;
}
}
}

View File

@@ -93,6 +93,9 @@ public:
}
virtual bool is_supported_by(const secondary_index::index& index) const = 0;
using abstract_restriction::is_satisfied_by;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const = 0;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) = 0;
#if 0
/**
* Check if this type of restriction is supported by the specified index.
@@ -113,7 +116,7 @@ public:
class contains;
protected:
bytes_view_opt get_value(const schema& schema,
std::optional<atomic_cell_value_view> get_value(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
@@ -166,6 +169,10 @@ public:
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const override;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
return ::make_shared<EQ>(cdef, _value);
}
#if 0
@Override
@@ -201,6 +208,10 @@ public:
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const override;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
throw std::logic_error("IN superclass should never be cloned directly");
}
virtual std::vector<bytes_opt> values_raw(const query_options& options) const = 0;
@@ -243,6 +254,10 @@ public:
virtual sstring to_string() const override {
return sprint("IN(%s)", std::to_string(_values));
}
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
return ::make_shared<IN_with_values>(cdef, _values);
}
};
class single_column_restriction::IN_with_marker : public IN {
@@ -268,6 +283,10 @@ public:
virtual sstring to_string() const override {
return "IN ?";
}
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
return ::make_shared<IN_with_marker>(cdef, _marker);
}
};
class single_column_restriction::slice : public single_column_restriction {
@@ -279,6 +298,11 @@ public:
, _slice(term_slice::new_instance(bound, inclusive, std::move(term)))
{ }
slice(const column_definition& column_def, term_slice slice)
: single_column_restriction(column_def)
, _slice(slice)
{ }
virtual bool uses_function(const sstring& ks_name, const sstring& function_name) const override {
return (_slice.has_bound(statements::bound::START) && abstract_restriction::term_uses_function(_slice.bound(statements::bound::START), ks_name, function_name))
|| (_slice.has_bound(statements::bound::END) && abstract_restriction::term_uses_function(_slice.bound(statements::bound::END), ks_name, function_name));
@@ -364,6 +388,10 @@ public:
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const override;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
return ::make_shared<slice>(cdef, _slice);
}
};
// This holds CONTAINS, CONTAINS_KEY, and map[key] = value restrictions because we might want to have any combination of them.
@@ -485,6 +513,10 @@ public:
const row& cells,
const query_options& options,
gc_clock::time_point now) const override;
virtual bool is_satisfied_by(bytes_view data, const query_options& options) const override;
virtual ::shared_ptr<single_column_restriction> apply_to(const column_definition& cdef) override {
throw std::logic_error("Cloning 'contains' restriction is not implemented.");
}
#if 0
private List<ByteBuffer> keys(const query_options& options) {


@@ -111,6 +111,11 @@ public:
return r;
}
virtual bytes_opt value_for(const column_definition& cdef, const query_options& options) const override {
auto it = _restrictions.find(std::addressof(cdef));
return (it != _restrictions.end()) ? it->second->value(options) : bytes_opt{};
}
/**
* Returns the restriction associated to the specified column.
*


@@ -23,6 +23,7 @@
#include <boost/range/algorithm/transform.hpp>
#include <boost/range/algorithm.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/algorithm/cxx11/any_of.hpp>
#include "statement_restrictions.hh"
#include "single_column_primary_key_restrictions.hh"
@@ -36,19 +37,24 @@
namespace cql3 {
namespace restrictions {
static logging::logger rlogger("restrictions");
using boost::adaptors::filtered;
using boost::adaptors::transformed;
template<typename T>
class statement_restrictions::initial_key_restrictions : public primary_key_restrictions<T> {
bool _allow_filtering;
public:
initial_key_restrictions(bool allow_filtering)
: _allow_filtering(allow_filtering) {}
using bounds_range_type = typename primary_key_restrictions<T>::bounds_range_type;
::shared_ptr<primary_key_restrictions<T>> do_merge_to(schema_ptr schema, ::shared_ptr<restriction> restriction) const {
if (restriction->is_multi_column()) {
throw std::runtime_error(sprint("%s not implemented", __PRETTY_FUNCTION__));
}
return ::make_shared<single_column_primary_key_restrictions<T>>(schema)->merge_to(schema, restriction);
return ::make_shared<single_column_primary_key_restrictions<T>>(schema, _allow_filtering)->merge_to(schema, restriction);
}
::shared_ptr<primary_key_restrictions<T>> merge_to(schema_ptr schema, ::shared_ptr<restriction> restriction) override {
if (restriction->is_multi_column()) {
@@ -57,7 +63,7 @@ public:
if (restriction->is_on_token()) {
return static_pointer_cast<token_restriction>(restriction);
}
return ::make_shared<single_column_primary_key_restrictions<T>>(schema)->merge_to(restriction);
return ::make_shared<single_column_primary_key_restrictions<T>>(schema, _allow_filtering)->merge_to(restriction);
}
void merge_with(::shared_ptr<restriction> restriction) override {
throw exceptions::unsupported_operation_exception();
@@ -66,6 +72,9 @@ public:
// throw? should not reach?
return {};
}
bytes_opt value_for(const column_definition& cdef, const query_options& options) const override {
return {};
}
std::vector<T> values_as_keys(const query_options& options) const override {
// throw? should not reach?
return {};
@@ -122,9 +131,10 @@ statement_restrictions::initial_key_restrictions<clustering_key_prefix>::merge_t
}
template<typename T>
::shared_ptr<primary_key_restrictions<T>> statement_restrictions::get_initial_key_restrictions() {
static thread_local ::shared_ptr<primary_key_restrictions<T>> initial_kr = ::make_shared<initial_key_restrictions<T>>();
return initial_kr;
::shared_ptr<primary_key_restrictions<T>> statement_restrictions::get_initial_key_restrictions(bool allow_filtering) {
static thread_local ::shared_ptr<primary_key_restrictions<T>> initial_kr_true = ::make_shared<initial_key_restrictions<T>>(true);
static thread_local ::shared_ptr<primary_key_restrictions<T>> initial_kr_false = ::make_shared<initial_key_restrictions<T>>(false);
return allow_filtering ? initial_kr_true : initial_kr_false;
}
std::vector<::shared_ptr<column_identifier>>
@@ -141,10 +151,10 @@ statement_restrictions::get_partition_key_unrestricted_components() const {
return r;
}
statement_restrictions::statement_restrictions(schema_ptr schema)
statement_restrictions::statement_restrictions(schema_ptr schema, bool allow_filtering)
: _schema(schema)
, _partition_key_restrictions(get_initial_key_restrictions<partition_key>())
, _clustering_columns_restrictions(get_initial_key_restrictions<clustering_key_prefix>())
, _partition_key_restrictions(get_initial_key_restrictions<partition_key>(allow_filtering))
, _clustering_columns_restrictions(get_initial_key_restrictions<clustering_key_prefix>(allow_filtering))
, _nonprimary_key_restrictions(::make_shared<single_column_restrictions>(schema))
{ }
#if 0
@@ -162,8 +172,9 @@ statement_restrictions::statement_restrictions(database& db,
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection,
bool for_view)
: statement_restrictions(schema)
bool for_view,
bool allow_filtering)
: statement_restrictions(schema, allow_filtering)
{
/*
* WHERE clause. For a given entity, rules are: - EQ relation conflicts with anything else (including a 2nd EQ)
@@ -197,23 +208,22 @@ statement_restrictions::statement_restrictions(database& db,
throw exceptions::invalid_request_exception(sprint("restriction '%s' is only supported in materialized view creation", relation->to_string()));
}
} else {
add_restriction(relation->to_restriction(db, schema, bound_names));
add_restriction(relation->to_restriction(db, schema, bound_names), for_view, allow_filtering);
}
}
}
auto& cf = db.find_column_family(schema);
auto& sim = cf.get_index_manager();
bool has_queriable_clustering_column_index = _clustering_columns_restrictions->has_supporting_index(sim);
bool has_queriable_index = has_queriable_clustering_column_index
|| _partition_key_restrictions->has_supporting_index(sim)
|| _nonprimary_key_restrictions->has_supporting_index(sim);
const bool has_queriable_clustering_column_index = _clustering_columns_restrictions->has_supporting_index(sim);
const bool has_queriable_pk_index = _partition_key_restrictions->has_supporting_index(sim);
const bool has_queriable_regular_index = _nonprimary_key_restrictions->has_supporting_index(sim);
// At this point, the select statement is fully constructed, but we still have a few things to validate
process_partition_key_restrictions(has_queriable_index, for_view);
process_partition_key_restrictions(has_queriable_pk_index, for_view, allow_filtering);
// Some but not all of the partition key columns have been specified;
// hence we need turn these restrictions into index expressions.
if (_uses_secondary_indexing) {
if (_uses_secondary_indexing || _partition_key_restrictions->needs_filtering(*_schema)) {
_index_restrictions.push_back(_partition_key_restrictions);
}
@@ -229,13 +239,14 @@ statement_restrictions::statement_restrictions(database& db,
}
}
process_clustering_columns_restrictions(has_queriable_index, select_a_collection, for_view);
process_clustering_columns_restrictions(has_queriable_clustering_column_index, select_a_collection, for_view, allow_filtering);
// Covers indexes on the first clustering column (among others).
if (_is_key_range && has_queriable_clustering_column_index)
_uses_secondary_indexing = true;
if (_is_key_range && has_queriable_clustering_column_index) {
_uses_secondary_indexing = true;
}
if (_uses_secondary_indexing) {
if (_uses_secondary_indexing || _clustering_columns_restrictions->needs_filtering(*_schema)) {
_index_restrictions.push_back(_clustering_columns_restrictions);
} else if (_clustering_columns_restrictions->is_contains()) {
fail(unimplemented::cause::INDEXES);
@@ -264,31 +275,48 @@ statement_restrictions::statement_restrictions(database& db,
uses_secondary_indexing = true;
#endif
}
// Even if uses_secondary_indexing is false at this point, we'll still have to use one if
// there are restrictions not covered by the PK.
if (!_nonprimary_key_restrictions->empty()) {
_uses_secondary_indexing = true;
if (has_queriable_regular_index) {
_uses_secondary_indexing = true;
} else if (!allow_filtering) {
throw exceptions::invalid_request_exception("Cannot execute this query as it might involve data filtering and "
"thus may have unpredictable performance. If you want to execute "
"this query despite the performance unpredictability, use ALLOW FILTERING");
}
_index_restrictions.push_back(_nonprimary_key_restrictions);
}
if (_uses_secondary_indexing && !for_view) {
if (_uses_secondary_indexing && !(for_view || allow_filtering)) {
validate_secondary_index_selections(selects_only_static_columns);
}
}
void statement_restrictions::add_restriction(::shared_ptr<restriction> restriction) {
void statement_restrictions::add_restriction(::shared_ptr<restriction> restriction, bool for_view, bool allow_filtering) {
if (restriction->is_multi_column()) {
_clustering_columns_restrictions = _clustering_columns_restrictions->merge_to(_schema, restriction);
} else if (restriction->is_on_token()) {
_partition_key_restrictions = _partition_key_restrictions->merge_to(_schema, restriction);
} else {
add_single_column_restriction(::static_pointer_cast<single_column_restriction>(restriction));
add_single_column_restriction(::static_pointer_cast<single_column_restriction>(restriction), for_view, allow_filtering);
}
}
void statement_restrictions::add_single_column_restriction(::shared_ptr<single_column_restriction> restriction) {
void statement_restrictions::add_single_column_restriction(::shared_ptr<single_column_restriction> restriction, bool for_view, bool allow_filtering) {
auto& def = restriction->get_column_def();
if (def.is_partition_key()) {
// A SELECT query may not request a slice (range) of partition keys
// without using token(). This is because there is no way to do this
// query efficiently: murmur3 turns a contiguous range of partition
// keys into tokens all over the token space.
// However, in a SELECT statement used to define a materialized view,
// such a slice is fine - it is used to check whether individual
// partitions match, and does not present a performance problem.
assert(!restriction->is_on_token());
if (restriction->is_slice() && !for_view && !allow_filtering) {
throw exceptions::invalid_request_exception(
"Only EQ and IN relation are supported on the partition key (unless you use the token() function or allow filtering)");
}
_partition_key_restrictions = _partition_key_restrictions->merge_to(_schema, restriction);
} else if (def.is_clustering_key()) {
_clustering_columns_restrictions = _clustering_columns_restrictions->merge_to(_schema, restriction);
@@ -307,7 +335,54 @@ const std::vector<::shared_ptr<restrictions>>& statement_restrictions::index_res
return _index_restrictions;
}
void statement_restrictions::process_partition_key_restrictions(bool has_queriable_index, bool for_view) {
std::optional<secondary_index::index> statement_restrictions::find_idx(secondary_index::secondary_index_manager& sim) const {
for (::shared_ptr<cql3::restrictions::restrictions> restriction : index_restrictions()) {
for (const auto& cdef : restriction->get_column_defs()) {
for (auto index : sim.list_indexes()) {
if (index.depends_on(*cdef)) {
return std::make_optional<secondary_index::index>(std::move(index));
}
}
}
}
return std::nullopt;
}
std::vector<const column_definition*> statement_restrictions::get_column_defs_for_filtering(database& db) const {
std::vector<const column_definition*> column_defs_for_filtering;
if (need_filtering()) {
auto& sim = db.find_column_family(_schema).get_index_manager();
std::optional<secondary_index::index> opt_idx = find_idx(sim);
auto column_uses_indexing = [&opt_idx] (const column_definition* cdef) {
return opt_idx && opt_idx->depends_on(*cdef);
};
if (_partition_key_restrictions->needs_filtering(*_schema)) {
for (auto&& cdef : _partition_key_restrictions->get_column_defs()) {
if (!column_uses_indexing(cdef)) {
column_defs_for_filtering.emplace_back(cdef);
}
}
}
const bool pk_has_unrestricted_components = _partition_key_restrictions->has_unrestricted_components(*_schema);
if (pk_has_unrestricted_components || _clustering_columns_restrictions->needs_filtering(*_schema)) {
column_id first_filtering_id = pk_has_unrestricted_components ? 0 : _schema->clustering_key_columns().begin()->id +
_clustering_columns_restrictions->num_prefix_columns_that_need_not_be_filtered();
for (auto&& cdef : _clustering_columns_restrictions->get_column_defs()) {
if (cdef->id >= first_filtering_id && !column_uses_indexing(cdef)) {
column_defs_for_filtering.emplace_back(cdef);
}
}
}
for (auto&& cdef : _nonprimary_key_restrictions->get_column_defs()) {
if (!column_uses_indexing(cdef)) {
column_defs_for_filtering.emplace_back(cdef);
}
}
}
return column_defs_for_filtering;
}
void statement_restrictions::process_partition_key_restrictions(bool has_queriable_index, bool for_view, bool allow_filtering) {
// If there is a queriable index, no special conditions are required on the other restrictions.
// But we still need to know 2 things:
// - If we don't have a queriable index, is the query ok
@@ -316,28 +391,32 @@ void statement_restrictions::process_partition_key_restrictions(bool has_queriab
// components must have a EQ. Only the last partition key component can be in IN relation.
if (_partition_key_restrictions->is_on_token()) {
_is_key_range = true;
} else if (has_partition_key_unrestricted_components()) {
if (!_partition_key_restrictions->empty() && !for_view) {
if (!has_queriable_index) {
throw exceptions::invalid_request_exception(sprint("Partition key parts: %s must be restricted as other parts are",
join(", ", get_partition_key_unrestricted_components())));
}
}
} else if (_partition_key_restrictions->has_unrestricted_components(*_schema)) {
_is_key_range = true;
_uses_secondary_indexing = has_queriable_index;
}
if (_partition_key_restrictions->needs_filtering(*_schema)) {
if (!allow_filtering && !for_view && !has_queriable_index) {
throw exceptions::invalid_request_exception("Cannot execute this query as it might involve data filtering and "
"thus may have unpredictable performance. If you want to execute "
"this query despite the performance unpredictability, use ALLOW FILTERING");
}
_is_key_range = true;
_uses_secondary_indexing = has_queriable_index;
}
}
bool statement_restrictions::has_partition_key_unrestricted_components() const {
return _partition_key_restrictions->size() < _schema->partition_key_size();
return _partition_key_restrictions->has_unrestricted_components(*_schema);
}
bool statement_restrictions::has_unrestricted_clustering_columns() const {
return _clustering_columns_restrictions->size() < _schema->clustering_key_size();
return _clustering_columns_restrictions->has_unrestricted_components(*_schema);
}
void statement_restrictions::process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection, bool for_view) {
void statement_restrictions::process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection, bool for_view, bool allow_filtering) {
if (!has_clustering_columns_restriction()) {
return;
}
@@ -346,38 +425,36 @@ void statement_restrictions::process_clustering_columns_restrictions(bool has_qu
throw exceptions::invalid_request_exception(
"Cannot restrict clustering columns by IN relations when a collection is selected by the query");
}
if (_clustering_columns_restrictions->is_contains() && !has_queriable_index) {
if (_clustering_columns_restrictions->is_contains() && !has_queriable_index && !allow_filtering) {
throw exceptions::invalid_request_exception(
"Cannot restrict clustering columns by a CONTAINS relation without a secondary index");
"Cannot restrict clustering columns by a CONTAINS relation without a secondary index or filtering");
}
auto clustering_columns_iter = _schema->clustering_key_columns().begin();
for (auto&& restricted_column : _clustering_columns_restrictions->get_column_defs()) {
const column_definition* clustering_column = &(*clustering_columns_iter);
++clustering_columns_iter;
if (clustering_column != restricted_column && !for_view) {
if (!has_queriable_index) {
throw exceptions::invalid_request_exception(sprint(
"PRIMARY KEY column \"%s\" cannot be restricted as preceding column \"%s\" is not restricted",
restricted_column->name_as_text(), clustering_column->name_as_text()));
if (has_clustering_columns_restriction() && _clustering_columns_restrictions->needs_filtering(*_schema)) {
if (has_queriable_index) {
_uses_secondary_indexing = true;
} else if (!allow_filtering && !for_view) {
auto clustering_columns_iter = _schema->clustering_key_columns().begin();
for (auto&& restricted_column : _clustering_columns_restrictions->get_column_defs()) {
const column_definition* clustering_column = &(*clustering_columns_iter);
++clustering_columns_iter;
if (clustering_column != restricted_column) {
throw exceptions::invalid_request_exception(sprint(
"PRIMARY KEY column \"%s\" cannot be restricted as preceding column \"%s\" is not restricted",
restricted_column->name_as_text(), clustering_column->name_as_text()));
}
}
_uses_secondary_indexing = true; // handle gaps and non-keyrange cases.
break;
}
}
if (_clustering_columns_restrictions->is_contains()) {
_uses_secondary_indexing = true;
}
}
dht::partition_range_vector statement_restrictions::get_partition_key_ranges(const query_options& options) const {
if (_partition_key_restrictions->empty()) {
return {dht::partition_range::make_open_ended_both_sides()};
}
if (_partition_key_restrictions->needs_filtering(*_schema)) {
return {dht::partition_range::make_open_ended_both_sides()};
}
return _partition_key_restrictions->bounds_ranges(options);
}
@@ -385,18 +462,40 @@ std::vector<query::clustering_range> statement_restrictions::get_clustering_boun
if (_clustering_columns_restrictions->empty()) {
return {query::clustering_range::make_open_ended_both_sides()};
}
if (_clustering_columns_restrictions->needs_filtering(*_schema)) {
if (auto single_ck_restrictions = dynamic_pointer_cast<single_column_primary_key_restrictions<clustering_key>>(_clustering_columns_restrictions)) {
return single_ck_restrictions->get_longest_prefix_restrictions()->bounds_ranges(options);
}
return {query::clustering_range::make_open_ended_both_sides()};
}
return _clustering_columns_restrictions->bounds_ranges(options);
}
bool statement_restrictions::need_filtering() {
uint32_t number_of_restricted_columns = 0;
bool statement_restrictions::need_filtering() const {
uint32_t number_of_restricted_columns_for_indexing = 0;
for (auto&& restrictions : _index_restrictions) {
number_of_restricted_columns += restrictions->size();
number_of_restricted_columns_for_indexing += restrictions->size();
}
return number_of_restricted_columns > 1
|| (number_of_restricted_columns == 0 && has_clustering_columns_restriction())
|| (number_of_restricted_columns != 0 && _nonprimary_key_restrictions->has_multiple_contains());
int number_of_filtering_restrictions = _nonprimary_key_restrictions->size();
// If the whole partition key is restricted, it does not imply filtering
if (_partition_key_restrictions->has_unrestricted_components(*_schema) || !_partition_key_restrictions->is_all_eq()) {
number_of_filtering_restrictions += _partition_key_restrictions->size() + _clustering_columns_restrictions->size();
} else if (_clustering_columns_restrictions->has_unrestricted_components(*_schema)) {
number_of_filtering_restrictions += _clustering_columns_restrictions->size() - _clustering_columns_restrictions->prefix_size();
}
if (_partition_key_restrictions->is_multi_column() || _clustering_columns_restrictions->is_multi_column()) {
// TODO(sarna): Implement ALLOW FILTERING support for multi-column restrictions - return false for now
// in order to ensure backwards compatibility
return false;
}
return number_of_restricted_columns_for_indexing > 1
|| (number_of_restricted_columns_for_indexing == 0 && _partition_key_restrictions->empty() && !_clustering_columns_restrictions->empty())
|| (number_of_restricted_columns_for_indexing != 0 && _nonprimary_key_restrictions->has_multiple_contains())
|| (number_of_restricted_columns_for_indexing != 0 && !_uses_secondary_indexing)
|| (_uses_secondary_indexing && number_of_filtering_restrictions > 1);
}
void statement_restrictions::validate_secondary_index_selections(bool selects_only_static_columns) {
@@ -414,7 +513,34 @@ void statement_restrictions::validate_secondary_index_selections(bool selects_on
}
}
static bytes_view_opt do_get_value(const schema& schema,
const single_column_restrictions::restrictions_map& statement_restrictions::get_single_column_partition_key_restrictions() const {
static single_column_restrictions::restrictions_map empty;
auto single_restrictions = dynamic_pointer_cast<single_column_primary_key_restrictions<partition_key>>(_partition_key_restrictions);
if (!single_restrictions) {
if (dynamic_pointer_cast<initial_key_restrictions<partition_key>>(_partition_key_restrictions)) {
return empty;
}
throw std::runtime_error("statement restrictions for multi-column partition key restrictions are not implemented yet");
}
return single_restrictions->restrictions();
}
/**
* @return clustering key restrictions split into single column restrictions (e.g. for filtering support).
*/
const single_column_restrictions::restrictions_map& statement_restrictions::get_single_column_clustering_key_restrictions() const {
static single_column_restrictions::restrictions_map empty;
auto single_restrictions = dynamic_pointer_cast<single_column_primary_key_restrictions<clustering_key>>(_clustering_columns_restrictions);
if (!single_restrictions) {
if (dynamic_pointer_cast<initial_key_restrictions<clustering_key>>(_clustering_columns_restrictions)) {
return empty;
}
throw std::runtime_error("statement restrictions for multi-column partition key restrictions are not implemented yet");
}
return single_restrictions->restrictions();
}
static std::optional<atomic_cell_value_view> do_get_value(const schema& schema,
const column_definition& cdef,
const partition_key& key,
const clustering_key_prefix& ckey,
@@ -422,21 +548,21 @@ static bytes_view_opt do_get_value(const schema& schema,
gc_clock::time_point now) {
switch(cdef.kind) {
case column_kind::partition_key:
return key.get_component(schema, cdef.component_index());
return atomic_cell_value_view(key.get_component(schema, cdef.component_index()));
case column_kind::clustering_key:
return ckey.get_component(schema, cdef.component_index());
return atomic_cell_value_view(ckey.get_component(schema, cdef.component_index()));
default:
auto cell = cells.find_cell(cdef.id);
if (!cell) {
return stdx::nullopt;
return std::nullopt;
}
assert(cdef.is_atomic());
auto c = cell->as_atomic_cell();
return c.is_dead(now) ? stdx::nullopt : bytes_view_opt(c.value());
auto c = cell->as_atomic_cell(cdef);
return c.is_dead(now) ? std::nullopt : std::optional<atomic_cell_value_view>(c.value());
}
}
bytes_view_opt single_column_restriction::get_value(const schema& schema,
std::optional<atomic_cell_value_view> single_column_restriction::get_value(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
const row& cells,
@@ -456,11 +582,24 @@ bool single_column_restriction::EQ::is_satisfied_by(const schema& schema,
auto operand = value(options);
if (operand) {
auto cell_value = get_value(schema, key, ckey, cells, now);
return cell_value && _column_def.type->compare(*operand, *cell_value) == 0;
if (!cell_value) {
return false;
}
return cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return _column_def.type->compare(*operand, cell_value_bv) == 0;
});
}
return false;
}
bool single_column_restriction::EQ::is_satisfied_by(bytes_view data, const query_options& options) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto operand = value(options);
return operand && _column_def.type->compare(*operand, data) == 0;
}
bool single_column_restriction::IN::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
@@ -475,8 +614,20 @@ bool single_column_restriction::IN::is_satisfied_by(const schema& schema,
return false;
}
auto operands = values(options);
return cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return std::any_of(operands.begin(), operands.end(), [&] (auto&& operand) {
return operand && _column_def.type->compare(*operand, *cell_value) == 0;
return operand && _column_def.type->compare(*operand, cell_value_bv) == 0;
});
});
}
bool single_column_restriction::IN::is_satisfied_by(bytes_view data, const query_options& options) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
auto operands = values(options);
return boost::algorithm::any_of(operands, [this, &data] (const bytes_opt& operand) {
return operand && _column_def.type->compare(*operand, data) == 0;
});
}
@@ -490,7 +641,8 @@ static query::range<bytes_view> to_range(const term_slice& slice, const query_op
if (!value) {
return { };
}
return { range_type::bound(*value, slice.is_inclusive(bound)) };
auto value_view = options.linearize(*value);
return { range_type::bound(value_view, slice.is_inclusive(bound)) };
};
return range_type(
extract_bound(statements::bound::START),
@@ -510,7 +662,16 @@ bool single_column_restriction::slice::is_satisfied_by(const schema& schema,
if (!cell_value) {
return false;
}
return to_range(_slice, options).contains(*cell_value, _column_def.type->as_tri_comparator());
return cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return to_range(_slice, options).contains(cell_value_bv, _column_def.type->as_tri_comparator());
});
}
bool single_column_restriction::slice::is_satisfied_by(bytes_view data, const query_options& options) const {
if (_column_def.type->is_counter()) {
fail(unimplemented::cause::COUNTERS);
}
return to_range(_slice, options).contains(data, _column_def.type->underlying_type()->as_tri_comparator());
}
bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
@@ -536,7 +697,8 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
auto&& element_type = col_type->is_set() ? col_type->name_comparator() : col_type->value_comparator();
if (_column_def.type->is_multi_cell()) {
auto cell = cells.find_cell(_column_def.id);
auto&& elements = col_type->deserialize_mutation_form(cell->as_collection_mutation()).cells;
return cell->as_collection_mutation().data.with_linearized([&] (bytes_view collection_bv) {
auto&& elements = col_type->deserialize_mutation_form(collection_bv).cells;
auto end = std::remove_if(elements.begin(), elements.end(), [now] (auto&& element) {
return element.second.is_dead(now);
});
@@ -545,8 +707,12 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!val) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return element_type->compare(element.second.value(), *val) == 0;
auto found = with_linearized(*val, [&] (bytes_view bv) {
return std::find_if(elements.begin(), end, [&] (auto&& element) {
return element.second.value().with_linearized([&] (bytes_view value_bv) {
return element_type->compare(value_bv, bv) == 0;
});
});
});
if (found == end) {
return false;
@@ -557,8 +723,10 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!k) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, *k) == 0;
auto found = with_linearized(*k, [&] (bytes_view bv) {
return std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, bv) == 0;
});
});
if (found == end) {
return false;
@@ -570,27 +738,42 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!map_key || !map_value) {
continue;
}
auto found = std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, *map_key) == 0;
auto found = with_linearized(*map_key, [&] (bytes_view map_key_bv) {
return std::find_if(elements.begin(), end, [&] (auto&& element) {
return map_key_type->compare(element.first, map_key_bv) == 0;
});
});
if (found == end || element_type->compare(found->second.value(), *map_value) != 0) {
if (found == end) {
return false;
}
auto cmp = with_linearized(*map_value, [&] (bytes_view map_value_bv) {
return found->second.value().with_linearized([&] (bytes_view value_bv) {
return element_type->compare(value_bv, map_value_bv);
});
});
if (cmp != 0) {
return false;
}
}
return true;
});
} else {
auto cell_value = get_value(schema, key, ckey, cells, now);
if (!cell_value) {
return false;
}
auto deserialized = _column_def.type->deserialize(*cell_value);
auto deserialized = cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return _column_def.type->deserialize(cell_value_bv);
});
for (auto&& value : _values) {
auto val = value->bind_and_get(options);
if (!val) {
auto fragmented_val = value->bind_and_get(options);
if (!fragmented_val) {
continue;
}
return with_linearized(*fragmented_val, [&] (bytes_view val) {
auto exists_in = [&](auto&& range) {
auto found = std::find_if(range.begin(), range.end(), [&] (auto&& element) {
return element_type->compare(element.serialize(), *val) == 0;
return element_type->compare(element.serialize(), val) == 0;
});
return found != range.end();
};
@@ -608,6 +791,8 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
return false;
}
}
return true;
});
}
if (col_type->is_map()) {
auto& data_map = value_cast<map_type_impl::native_type>(deserialized);
@@ -616,8 +801,10 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!k) {
continue;
}
auto found = std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), *k) == 0;
auto found = with_linearized(*k, [&] (bytes_view k_bv) {
return std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), k_bv) == 0;
});
});
if (found == data_map.end()) {
return false;
@@ -629,10 +816,15 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
if (!map_key || !map_value) {
continue;
}
auto found = std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), *map_key) == 0;
auto found = with_linearized(*map_key, [&] (bytes_view map_key_bv) {
return std::find_if(data_map.begin(), data_map.end(), [&] (auto&& element) {
return map_key_type->compare(element.first.serialize(), map_key_bv) == 0;
});
});
if (found == data_map.end() || element_type->compare(found->second.serialize(), *map_value) != 0) {
if (found == data_map.end()
|| with_linearized(*map_value, [&] (bytes_view map_value_bv) {
return element_type->compare(found->second.serialize(), map_value_bv);
}) != 0) {
return false;
}
}
@@ -642,6 +834,11 @@ bool single_column_restriction::contains::is_satisfied_by(const schema& schema,
return true;
}
bool single_column_restriction::contains::is_satisfied_by(bytes_view data, const query_options& options) const {
//TODO(sarna): Deserialize & return. It would be nice to deduplicate, is_satisfied_by above is rather long
fail(unimplemented::cause::INDEXES);
}
bool token_restriction::EQ::is_satisfied_by(const schema& schema,
const partition_key& key,
const clustering_key_prefix& ckey,
@@ -653,7 +850,9 @@ bool token_restriction::EQ::is_satisfied_by(const schema& schema,
for (auto&& operand : values(options)) {
if (operand) {
auto cell_value = do_get_value(schema, **cdef, key, ckey, cells, now);
satisfied = cell_value && (*cdef)->type->compare(*operand, *cell_value) == 0;
satisfied = cell_value && cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return (*cdef)->type->compare(*operand, cell_value_bv) == 0;
});
}
if (!satisfied) {
break;
@@ -675,7 +874,9 @@ bool token_restriction::slice::is_satisfied_by(const schema& schema,
if (!cell_value) {
return false;
}
satisfied = range.contains(*cell_value, cdef->type->as_tri_comparator());
satisfied = cell_value->with_linearized([&] (bytes_view cell_value_bv) {
return range.contains(cell_value_bv, cdef->type->as_tri_comparator());
});
if (!satisfied) {
break;
}


@@ -67,7 +67,7 @@ private:
class initial_key_restrictions;
template<typename T>
static ::shared_ptr<primary_key_restrictions<T>> get_initial_key_restrictions();
static ::shared_ptr<primary_key_restrictions<T>> get_initial_key_restrictions(bool allow_filtering);
/**
* Restrictions on partitioning columns
@@ -108,7 +108,7 @@ public:
* @param cfm the column family meta data
* @return a new empty <code>StatementRestrictions</code>.
*/
statement_restrictions(schema_ptr schema);
statement_restrictions(schema_ptr schema, bool allow_filtering);
statement_restrictions(database& db,
schema_ptr schema,
@@ -117,10 +117,11 @@ public:
::shared_ptr<variable_specifications> bound_names,
bool selects_only_static_columns,
bool select_a_collection,
bool for_view = false);
bool for_view = false,
bool allow_filtering = false);
private:
void add_restriction(::shared_ptr<restriction> restriction);
void add_single_column_restriction(::shared_ptr<single_column_restriction> restriction);
void add_restriction(::shared_ptr<restriction> restriction, bool for_view, bool allow_filtering);
void add_single_column_restriction(::shared_ptr<single_column_restriction> restriction, bool for_view, bool allow_filtering);
public:
bool uses_function(const sstring& ks_name, const sstring& function_name) const;
@@ -162,6 +163,20 @@ public:
return _clustering_columns_restrictions;
}
/**
* Builds a possibly empty collection of column definitions that will be used for filtering
* @param db - the database context
* @return A list with the column definitions needed for filtering.
*/
std::vector<const column_definition*> get_column_defs_for_filtering(database& db) const;
/**
* Determines the index to be used with the restriction.
* @param db - the database context (for extracting index manager)
* @return If an index can be used, an optional containing this index, otherwise an empty optional.
*/
std::optional<secondary_index::index> find_idx(secondary_index::secondary_index_manager& sim) const;
/**
* Checks if the partition key has some unrestricted components.
* @return <code>true</code> if the partition key has some unrestricted components, <code>false</code> otherwise.
@@ -174,7 +189,7 @@ public:
*/
bool has_unrestricted_clustering_columns() const;
private:
void process_partition_key_restrictions(bool has_queriable_index, bool for_view);
void process_partition_key_restrictions(bool has_queriable_index, bool for_view, bool allow_filtering);
/**
* Returns the partition key components that are not restricted.
@@ -189,7 +204,7 @@ private:
* @param select_a_collection <code>true</code> if the query should return a collection column
* @throws InvalidRequestException if the request is invalid
*/
void process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection, bool for_view);
void process_clustering_columns_restrictions(bool has_queriable_index, bool select_a_collection, bool for_view, bool allow_filtering);
/**
* Returns the <code>Restrictions</code> for the specified type of columns.
@@ -357,7 +372,7 @@ public:
* Checks if the query need to use filtering.
* @return <code>true</code> if the query need to use filtering, <code>false</code> otherwise.
*/
bool need_filtering();
bool need_filtering() const;
void validate_secondary_index_selections(bool selects_only_static_columns);
@@ -380,6 +395,14 @@ public:
return !_nonprimary_key_restrictions->empty();
}
bool pk_restrictions_need_filtering() const {
return _partition_key_restrictions->needs_filtering(*_schema);
}
bool ck_restrictions_need_filtering() const {
return _partition_key_restrictions->has_unrestricted_components(*_schema) || _clustering_columns_restrictions->needs_filtering(*_schema);
}
/**
* @return true if column is restricted by some restriction, false otherwise
*/
@@ -398,6 +421,16 @@ public:
const single_column_restrictions::restrictions_map& get_non_pk_restriction() const {
return _nonprimary_key_restrictions->restrictions();
}
/**
* @return partition key restrictions split into single column restrictions (e.g. for filtering support).
*/
const single_column_restrictions::restrictions_map& get_single_column_partition_key_restrictions() const;
/**
* @return clustering key restrictions split into single column restrictions (e.g. for filtering support).
*/
const single_column_restrictions::restrictions_map& get_single_column_clustering_key_restrictions() const;
};
}

cql3/result_generator.hh (new file, 139 lines)

@@ -0,0 +1,139 @@
/*
* Copyright (C) 2018 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "selection/selection.hh"
#include "stats.hh"
namespace cql3 {
class result_generator {
schema_ptr _schema;
foreign_ptr<lw_shared_ptr<query::result>> _result;
lw_shared_ptr<const query::read_command> _command;
shared_ptr<const selection::selection> _selection;
cql_stats* _stats;
private:
template<typename Visitor>
class query_result_visitor {
const schema& _schema;
std::vector<bytes> _partition_key;
std::vector<bytes> _clustering_key;
uint32_t _partition_row_count = 0;
uint32_t _total_row_count = 0;
Visitor& _visitor;
const selection::selection& _selection;
private:
void accept_cell_value(const column_definition& def, query::result_row_view::iterator_type& i) {
if (def.is_multi_cell()) {
_visitor.accept_value(i.next_collection_cell());
} else {
auto cell = i.next_atomic_cell();
_visitor.accept_value(cell ? std::optional<query::result_bytes_view>(cell->value()) : std::optional<query::result_bytes_view>());
}
}
public:
query_result_visitor(const schema& s, Visitor& visitor, const selection::selection& select)
: _schema(s), _visitor(visitor), _selection(select) { }
void accept_new_partition(const partition_key& key, uint32_t row_count) {
_partition_key = key.explode(_schema);
accept_new_partition(row_count);
}
void accept_new_partition(uint32_t row_count) {
_partition_row_count = row_count;
_total_row_count += row_count;
}
void accept_new_row(const clustering_key& key, query::result_row_view static_row,
query::result_row_view row) {
_clustering_key = key.explode(_schema);
accept_new_row(static_row, row);
}
void accept_new_row(query::result_row_view static_row, query::result_row_view row) {
auto static_row_iterator = static_row.iterator();
auto row_iterator = row.iterator();
_visitor.start_row();
for (auto&& def : _selection.get_columns()) {
switch (def->kind) {
case column_kind::partition_key:
_visitor.accept_value(query::result_bytes_view(bytes_view(_partition_key[def->component_index()])));
break;
case column_kind::clustering_key:
if (_clustering_key.size() > def->component_index()) {
_visitor.accept_value(query::result_bytes_view(bytes_view(_clustering_key[def->component_index()])));
} else {
_visitor.accept_value({});
}
break;
case column_kind::regular_column:
accept_cell_value(*def, row_iterator);
break;
case column_kind::static_column:
accept_cell_value(*def, static_row_iterator);
break;
}
}
_visitor.end_row();
}
void accept_partition_end(const query::result_row_view& static_row) {
if (_partition_row_count == 0) {
_total_row_count++;
_visitor.start_row();
auto static_row_iterator = static_row.iterator();
for (auto&& def : _selection.get_columns()) {
if (def->is_partition_key()) {
_visitor.accept_value(query::result_bytes_view(bytes_view(_partition_key[def->component_index()])));
} else if (def->is_static()) {
accept_cell_value(*def, static_row_iterator);
} else {
_visitor.accept_value({});
}
}
_visitor.end_row();
}
}
uint32_t rows_read() const { return _total_row_count; }
};
public:
result_generator() = default;
result_generator(schema_ptr s, foreign_ptr<lw_shared_ptr<query::result>> result, lw_shared_ptr<const query::read_command> cmd,
::shared_ptr<const selection::selection> select, cql_stats& stats)
: _schema(std::move(s))
, _result(std::move(result))
, _command(std::move(cmd))
, _selection(std::move(select))
, _stats(&stats)
{ }
template<typename Visitor>
void visit(Visitor&& visitor) const {
query_result_visitor<Visitor> v(*_schema, visitor, *_selection);
query::result_view::consume(*_result, _command->slice, v);
_stats->rows_read += v.rows_read();
}
};
}


@@ -45,27 +45,25 @@ namespace cql3 {
metadata::metadata(std::vector<::shared_ptr<column_specification>> names_)
: _flags(flag_enum_set())
, names(std::move(names_)) {
_column_count = names.size();
}
, _column_info(make_lw_shared<column_info>(std::move(names_)))
{ }
metadata::metadata(flag_enum_set flags, std::vector<::shared_ptr<column_specification>> names_, uint32_t column_count,
::shared_ptr<const service::pager::paging_state> paging_state)
: _flags(flags)
, names(std::move(names_))
, _column_count(column_count)
, _column_info(make_lw_shared<column_info>(std::move(names_), column_count))
, _paging_state(std::move(paging_state))
{ }
// The maximum number of values that the ResultSet can hold. This can be bigger than columnCount due to CASSANDRA-4911
uint32_t metadata::value_count() const {
return _flags.contains<flag::NO_METADATA>() ? _column_count : names.size();
return _flags.contains<flag::NO_METADATA>() ? _column_info->_column_count : _column_info->_names.size();
}
void metadata::add_non_serialized_column(::shared_ptr<column_specification> name) {
// See comment above. Because columnCount doesn't account the newly added name, it
// won't be serialized.
names.emplace_back(std::move(name));
_column_info->_names.emplace_back(std::move(name));
}
bool metadata::all_in_same_cf() const {
@@ -73,18 +71,24 @@ bool metadata::all_in_same_cf() const {
return false;
}
return column_specification::all_in_same_table(names);
return column_specification::all_in_same_table(_column_info->_names);
}
void metadata::set_has_more_pages(::shared_ptr<const service::pager::paging_state> paging_state) {
if (!paging_state) {
return;
}
void metadata::set_paging_state(::shared_ptr<const service::pager::paging_state> paging_state) {
_flags.set<flag::HAS_MORE_PAGES>();
_paging_state = std::move(paging_state);
}
void metadata::maybe_set_paging_state(::shared_ptr<const service::pager::paging_state> paging_state) {
assert(paging_state);
if (paging_state->get_remaining() > 0) {
set_paging_state(std::move(paging_state));
} else {
_flags.remove<flag::HAS_MORE_PAGES>();
_paging_state = nullptr;
}
}
void metadata::set_skip_metadata() {
_flags.set<flag::NO_METADATA>();
}
@@ -93,18 +97,10 @@ metadata::flag_enum_set metadata::flags() const {
return _flags;
}
uint32_t metadata::column_count() const {
return _column_count;
}
::shared_ptr<const service::pager::paging_state> metadata::paging_state() const {
return _paging_state;
}
const std::vector<::shared_ptr<column_specification>>& metadata::get_names() const {
return names;
}
prepared_metadata::prepared_metadata(const std::vector<::shared_ptr<column_specification>>& names,
const std::vector<uint16_t>& partition_key_bind_indices)
: _names{names}


@@ -47,6 +47,12 @@
#include "service/pager/paging_state.hh"
#include "schema.hh"
#include "query-result-reader.hh"
#include "result_generator.hh"
#include <seastar/util/gcc6-concepts.hh>
namespace cql3 {
class metadata {
@@ -64,18 +70,29 @@ public:
using flag_enum_set = enum_set<flag_enum>;
private:
flag_enum_set _flags;
public:
struct column_info {
// Please note that columnCount can actually be smaller than names, even if names is not null. This is
// used to include columns in the resultSet that we need to do post-query re-orderings
// (SelectStatement.orderResults) but that shouldn't be sent to the user as they haven't been requested
// (CASSANDRA-4911). So the serialization code will exclude any columns in name whose index is >= columnCount.
std::vector<::shared_ptr<column_specification>> names;
std::vector<::shared_ptr<column_specification>> _names;
uint32_t _column_count;
column_info(std::vector<::shared_ptr<column_specification>> names, uint32_t column_count)
: _names(std::move(names))
, _column_count(column_count)
{ }
explicit column_info(std::vector<::shared_ptr<column_specification>> names)
: _names(std::move(names))
, _column_count(_names.size())
{ }
};
private:
flag_enum_set _flags;
private:
uint32_t _column_count;
lw_shared_ptr<column_info> _column_info;
::shared_ptr<const service::pager::paging_state> _paging_state;
public:
@@ -93,17 +110,20 @@ private:
bool all_in_same_cf() const;
public:
void set_has_more_pages(::shared_ptr<const service::pager::paging_state> paging_state);
void set_paging_state(::shared_ptr<const service::pager::paging_state> paging_state);
void maybe_set_paging_state(::shared_ptr<const service::pager::paging_state> paging_state);
void set_skip_metadata();
flag_enum_set flags() const;
uint32_t column_count() const;
uint32_t column_count() const { return _column_info->_column_count; }
::shared_ptr<const service::pager::paging_state> paging_state() const;
const std::vector<::shared_ptr<column_specification>>& get_names() const;
const std::vector<::shared_ptr<column_specification>>& get_names() const {
return _column_info->_names;
}
};
::shared_ptr<const cql3::metadata> make_empty_metadata();
@@ -131,10 +151,22 @@ public:
const std::vector<uint16_t>& partition_key_bind_indices() const;
};
GCC6_CONCEPT(
template<typename Visitor>
concept bool ResultVisitor = requires(Visitor& visitor) {
visitor.start_row();
visitor.accept_value(std::optional<query::result_bytes_view>());
visitor.end_row();
};
)
class result_set {
public:
::shared_ptr<metadata> _metadata;
std::deque<std::vector<bytes_opt>> _rows;
friend class result;
public:
result_set(std::vector<::shared_ptr<column_specification>> metadata_);
@@ -163,6 +195,80 @@ public:
// Returns a range of rows. A row is a range of bytes_opt.
const std::deque<std::vector<bytes_opt>>& rows() const;
template<typename Visitor>
GCC6_CONCEPT(requires ResultVisitor<Visitor>)
void visit(Visitor&& visitor) const {
auto column_count = get_metadata().column_count();
for (auto& row : _rows) {
visitor.start_row();
for (auto i = 0u; i < column_count; i++) {
auto& cell = row[i];
visitor.accept_value(cell ? std::optional<query::result_bytes_view>(*cell) : std::optional<query::result_bytes_view>());
}
visitor.end_row();
}
}
class builder;
};
class result_set::builder {
result_set _result;
std::vector<bytes_opt> _current_row;
public:
explicit builder(shared_ptr<metadata> mtd)
: _result(std::move(mtd)) { }
void start_row() { }
void accept_value(std::optional<query::result_bytes_view> value) {
if (!value) {
_current_row.emplace_back();
return;
}
_current_row.emplace_back(value->linearize());
}
void end_row() {
_result.add_row(std::exchange(_current_row, { }));
}
result_set get_result_set() && { return std::move(_result); }
};
class result {
std::unique_ptr<cql3::result_set> _result_set;
result_generator _result_generator;
shared_ptr<const cql3::metadata> _metadata;
public:
explicit result(std::unique_ptr<cql3::result_set> rs)
: _result_set(std::move(rs))
, _metadata(_result_set->_metadata)
{ }
explicit result(result_generator generator, shared_ptr<const metadata> m)
: _result_generator(std::move(generator))
, _metadata(std::move(m))
{ }
const cql3::metadata& get_metadata() const { return *_metadata; }
cql3::result_set result_set() const {
if (_result_set) {
return *_result_set;
} else {
auto builder = result_set::builder(make_shared<cql3::metadata>(*_metadata));
_result_generator.visit(builder);
return std::move(builder).get_result_set();
}
}
template<typename Visitor>
GCC6_CONCEPT(requires ResultVisitor<Visitor>)
void visit(Visitor&& visitor) const {
if (_result_set) {
_result_set->visit(std::forward<Visitor>(visitor));
} else {
_result_generator.visit(std::forward<Visitor>(visitor));
}
}
};
}


@@ -112,11 +112,37 @@ selectable::with_function::raw::make_count_rows_function() {
std::vector<shared_ptr<cql3::selection::selectable::raw>>());
}
shared_ptr<selector::factory>
selectable::with_anonymous_function::new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) {
auto&& factories = selector_factories::create_factories_and_collect_column_definitions(_args, db, s, defs);
return abstract_function_selector::new_factory(_function, std::move(factories));
}
sstring
selectable::with_anonymous_function::to_string() const {
return sprint("%s(%s)", _function->name().name, join(", ", _args));
}
shared_ptr<selectable>
selectable::with_anonymous_function::raw::prepare(schema_ptr s) {
std::vector<shared_ptr<selectable>> prepared_args;
prepared_args.reserve(_args.size());
for (auto&& arg : _args) {
prepared_args.push_back(arg->prepare(s));
}
return ::make_shared<with_anonymous_function>(_function, std::move(prepared_args));
}
bool
selectable::with_anonymous_function::raw::processes_selection() const {
return true;
}
shared_ptr<selector::factory>
selectable::with_field_selection::new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) {
auto&& factory = _selected->new_selector_factory(db, s, defs);
auto&& type = factory->new_instance()->get_type();
auto&& ut = dynamic_pointer_cast<const user_type_impl>(std::move(type));
auto&& ut = dynamic_pointer_cast<const user_type_impl>(type->underlying_type());
if (!ut) {
throw exceptions::invalid_request_exception(
sprint("Invalid field selection: %s of type %s is not a user type",


@@ -46,6 +46,7 @@
#include "core/shared_ptr.hh"
#include "cql3/selection/selector.hh"
#include "cql3/cql3_type.hh"
#include "cql3/functions/function.hh"
#include "cql3/functions/function_name.hh"
namespace cql3 {
@@ -82,6 +83,7 @@ public:
class writetime_or_ttl;
class with_function;
class with_anonymous_function;
class with_field_selection;
@@ -114,6 +116,28 @@ public:
};
};
class selectable::with_anonymous_function : public selectable {
shared_ptr<functions::function> _function;
std::vector<shared_ptr<selectable>> _args;
public:
with_anonymous_function(::shared_ptr<functions::function> f, std::vector<shared_ptr<selectable>> args)
: _function(f), _args(std::move(args)) {
}
virtual sstring to_string() const override;
virtual shared_ptr<selector::factory> new_selector_factory(database& db, schema_ptr s, std::vector<const column_definition*>& defs) override;
class raw : public selectable::raw {
shared_ptr<functions::function> _function;
std::vector<shared_ptr<selectable::raw>> _args;
public:
raw(shared_ptr<functions::function> f, std::vector<shared_ptr<selectable::raw>> args)
: _function(f), _args(std::move(args)) {
}
virtual shared_ptr<selectable> prepare(schema_ptr s) override;
virtual bool processes_selection() const override;
};
};
class selectable::with_cast : public selectable {
::shared_ptr<selectable> _arg;


@@ -40,6 +40,7 @@
*/
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include "cql3/selection/selection.hh"
#include "cql3/selection/selector_factories.hh"
@@ -53,13 +54,15 @@ selection::selection(schema_ptr schema,
std::vector<const column_definition*> columns,
std::vector<::shared_ptr<column_specification>> metadata_,
bool collect_timestamps,
bool collect_TTLs)
bool collect_TTLs,
trivial is_trivial)
: _schema(std::move(schema))
, _columns(std::move(columns))
, _metadata(::make_shared<metadata>(std::move(metadata_)))
, _collect_timestamps(collect_timestamps)
, _collect_TTLs(collect_TTLs)
, _contains_static_columns(std::any_of(_columns.begin(), _columns.end(), std::mem_fn(&column_definition::is_static)))
, _is_trivial(is_trivial)
{ }
query::partition_slice::option_set selection::get_query_options() {
@@ -100,7 +103,7 @@ public:
*/
simple_selection(schema_ptr schema, std::vector<const column_definition*> columns,
std::vector<::shared_ptr<column_specification>> metadata, bool is_wildcard)
: selection(schema, std::move(columns), std::move(metadata), false, false)
: selection(schema, std::move(columns), std::move(metadata), false, false, trivial::yes)
, _is_wildcard(is_wildcard)
{ }
@@ -153,9 +156,9 @@ public:
return _factories->uses_function(ks_name, function_name);
}
virtual uint32_t add_column_for_ordering(const column_definition& c) override {
uint32_t index = selection::add_column_for_ordering(c);
_factories->add_selector_for_ordering(c, index);
virtual uint32_t add_column_for_post_processing(const column_definition& c) override {
uint32_t index = selection::add_column_for_post_processing(c);
_factories->add_selector_for_post_processing(c, index);
return index;
}
@@ -206,9 +209,17 @@ protected:
::shared_ptr<selection> selection::wildcard(schema_ptr schema) {
auto columns = schema->all_columns_in_select_order();
auto cds = boost::copy_range<std::vector<const column_definition*>>(columns | boost::adaptors::transformed([](const column_definition& c) {
return &c;
}));
// filter out hidden columns, which should not be seen by the
// user when doing "SELECT *". We also disallow selecting them
// individually (see column_identifier::new_selector_factory()).
auto cds = boost::copy_range<std::vector<const column_definition*>>(
columns |
boost::adaptors::filtered([](const column_definition& c) {
return !c.is_view_virtual();
}) |
boost::adaptors::transformed([](const column_definition& c) {
return &c;
}));
return simple_selection::make(schema, std::move(cds), true);
}
@@ -216,7 +227,7 @@ protected:
return simple_selection::make(schema, std::move(columns), false);
}
uint32_t selection::add_column_for_ordering(const column_definition& c) {
uint32_t selection::add_column_for_post_processing(const column_definition& c) {
_columns.push_back(&c);
_metadata->add_non_serialized_column(c.column_specification);
return _columns.size() - 1;
@@ -328,93 +339,106 @@ std::unique_ptr<result_set> result_set_builder::build() {
return std::move(_result_set);
}
result_set_builder::visitor::visitor(
cql3::selection::result_set_builder& builder, const schema& s,
const selection& selection)
: _builder(builder), _schema(s), _selection(selection), _row_count(0) {
}
bool result_set_builder::restrictions_filter::do_filter(const selection& selection,
const std::vector<bytes>& partition_key,
const std::vector<bytes>& clustering_key,
const query::result_row_view& static_row,
const query::result_row_view& row) const {
static logging::logger rlogger("restrictions_filter");
void result_set_builder::visitor::add_value(const column_definition& def,
query::result_row_view::iterator_type& i) {
if (def.type->is_multi_cell()) {
auto cell = i.next_collection_cell();
if (!cell) {
_builder.add_empty();
return;
}
_builder.add_collection(def, *cell);
} else {
auto cell = i.next_atomic_cell();
if (!cell) {
_builder.add_empty();
return;
}
_builder.add(def, *cell);
if (_current_partition_key_does_not_match || _current_static_row_does_not_match || _remaining == 0) {
return false;
}
}
void result_set_builder::visitor::accept_new_partition(const partition_key& key,
uint32_t row_count) {
_partition_key = key.explode(_schema);
_row_count = row_count;
}
void result_set_builder::visitor::accept_new_partition(uint32_t row_count) {
_row_count = row_count;
}
void result_set_builder::visitor::accept_new_row(const clustering_key& key,
const query::result_row_view& static_row,
const query::result_row_view& row) {
_clustering_key = key.explode(_schema);
accept_new_row(static_row, row);
}
void result_set_builder::visitor::accept_new_row(
const query::result_row_view& static_row,
const query::result_row_view& row) {
auto static_row_iterator = static_row.iterator();
auto row_iterator = row.iterator();
_builder.new_row();
for (auto&& def : _selection.get_columns()) {
switch (def->kind) {
case column_kind::partition_key:
_builder.add(_partition_key[def->component_index()]);
break;
case column_kind::clustering_key:
if (_clustering_key.size() > def->component_index()) {
_builder.add(_clustering_key[def->component_index()]);
auto non_pk_restrictions_map = _restrictions->get_non_pk_restriction();
auto partition_key_restrictions_map = _restrictions->get_single_column_partition_key_restrictions();
auto clustering_key_restrictions_map = _restrictions->get_single_column_clustering_key_restrictions();
for (auto&& cdef : selection.get_columns()) {
switch (cdef->kind) {
case column_kind::static_column:
// fallthrough
case column_kind::regular_column: {
auto& cell_iterator = (cdef->kind == column_kind::static_column) ? static_row_iterator : row_iterator;
if (cdef->type->is_multi_cell()) {
cell_iterator.next_collection_cell();
auto restr_it = non_pk_restrictions_map.find(cdef);
if (restr_it == non_pk_restrictions_map.end()) {
continue;
}
throw exceptions::invalid_request_exception("Collection filtering is not supported yet");
} else {
_builder.add({});
auto cell = cell_iterator.next_atomic_cell();
auto restr_it = non_pk_restrictions_map.find(cdef);
if (restr_it == non_pk_restrictions_map.end()) {
continue;
}
restrictions::single_column_restriction& restriction = *restr_it->second;
bool regular_restriction_matches;
if (cell) {
regular_restriction_matches = cell->value().with_linearized([&restriction, this](bytes_view data) {
return restriction.is_satisfied_by(data, _options);
});
} else {
regular_restriction_matches = restriction.is_satisfied_by(bytes(), _options);
}
if (!regular_restriction_matches) {
_current_static_row_does_not_match = (cdef->kind == column_kind::static_column);
return false;
}
}
}
break;
case column_kind::regular_column:
add_value(*def, row_iterator);
case column_kind::partition_key: {
auto restr_it = partition_key_restrictions_map.find(cdef);
if (restr_it == partition_key_restrictions_map.end()) {
continue;
}
restrictions::single_column_restriction& restriction = *restr_it->second;
const bytes& value_to_check = partition_key[cdef->id];
bool pk_restriction_matches = restriction.is_satisfied_by(value_to_check, _options);
if (!pk_restriction_matches) {
_current_partition_key_does_not_match = true;
return false;
}
}
break;
case column_kind::static_column:
add_value(*def, static_row_iterator);
case column_kind::clustering_key: {
auto restr_it = clustering_key_restrictions_map.find(cdef);
if (restr_it == clustering_key_restrictions_map.end()) {
continue;
}
restrictions::single_column_restriction& restriction = *restr_it->second;
const bytes& value_to_check = clustering_key[cdef->id];
bool pk_restriction_matches = restriction.is_satisfied_by(value_to_check, _options);
if (!pk_restriction_matches) {
return false;
}
}
break;
default:
assert(0);
break;
}
}
return true;
}
void result_set_builder::visitor::accept_partition_end(
const query::result_row_view& static_row) {
if (_row_count == 0) {
_builder.new_row();
auto static_row_iterator = static_row.iterator();
for (auto&& def : _selection.get_columns()) {
if (def->is_partition_key()) {
_builder.add(_partition_key[def->component_index()]);
} else if (def->is_static()) {
add_value(*def, static_row_iterator);
} else {
_builder.add_empty();
}
}
bool result_set_builder::restrictions_filter::operator()(const selection& selection,
const std::vector<bytes>& partition_key,
const std::vector<bytes>& clustering_key,
const query::result_row_view& static_row,
const query::result_row_view& row) const {
const bool accepted = do_filter(selection, partition_key, clustering_key, static_row, row);
if (!accepted) {
++_rows_dropped;
} else if (_remaining > 0) {
--_remaining;
}
return accepted;
}
api::timestamp_type result_set_builder::timestamp_of(size_t idx) {
@@ -426,7 +450,7 @@ int32_t result_set_builder::ttl_of(size_t idx) {
}
bytes_opt result_set_builder::get_value(data_type t, query::result_atomic_cell_view c) {
return {to_bytes(c.value())};
return {c.value().linearize()};
}
}


@@ -48,6 +48,7 @@
#include "exceptions/exceptions.hh"
#include "cql3/selection/raw_selector.hh"
#include "cql3/selection/selector_factories.hh"
#include "cql3/restrictions/statement_restrictions.hh"
#include "unimplemented.hh"
namespace cql3 {
@@ -84,12 +85,15 @@ private:
const bool _collect_timestamps;
const bool _collect_TTLs;
const bool _contains_static_columns;
bool _is_trivial;
protected:
using trivial = bool_class<class trivial_tag>;
selection(schema_ptr schema,
std::vector<const column_definition*> columns,
std::vector<::shared_ptr<column_specification>> metadata_,
bool collect_timestamps,
bool collect_TTLs);
bool collect_TTLs, trivial is_trivial = trivial::no);
virtual ~selection() {}
public:
@@ -165,10 +169,14 @@ public:
return _metadata;
}
::shared_ptr<metadata> get_result_metadata() {
return _metadata;
}
static ::shared_ptr<selection> wildcard(schema_ptr schema);
static ::shared_ptr<selection> for_columns(schema_ptr schema, std::vector<const column_definition*> columns);
virtual uint32_t add_column_for_ordering(const column_definition& c);
virtual uint32_t add_column_for_post_processing(const column_definition& c);
virtual bool uses_function(const sstring &ks_name, const sstring& function_name) const {
return false;
@@ -223,6 +231,12 @@ public:
}
}
/**
* Returns true if the selection is trivial, i.e. there are no function
* selectors (including casts or aggregates).
*/
bool is_trivial() const { return _is_trivial; }
friend class result_set_builder;
};
@@ -238,6 +252,40 @@ private:
const gc_clock::time_point _now;
cql_serialization_format _cql_serialization_format;
public:
class nop_filter {
public:
inline bool operator()(const selection&, const std::vector<bytes>&, const std::vector<bytes>&, const query::result_row_view&, const query::result_row_view&) const {
return true;
}
void reset() {
}
uint32_t get_rows_dropped() const {
return 0;
}
};
class restrictions_filter {
::shared_ptr<restrictions::statement_restrictions> _restrictions;
const query_options& _options;
mutable bool _current_partition_key_does_not_match = false;
mutable bool _current_static_row_does_not_match = false;
mutable uint32_t _rows_dropped = 0;
mutable uint32_t _remaining = 0;
public:
restrictions_filter() = default;
explicit restrictions_filter(::shared_ptr<restrictions::statement_restrictions> restrictions, const query_options& options, uint32_t remaining) : _restrictions(restrictions), _options(options), _remaining(remaining) {}
bool operator()(const selection& selection, const std::vector<bytes>& pk, const std::vector<bytes>& ck, const query::result_row_view& static_row, const query::result_row_view& row) const;
void reset() {
_current_partition_key_does_not_match = false;
_current_static_row_does_not_match = false;
_rows_dropped = 0;
}
uint32_t get_rows_dropped() const {
return _rows_dropped;
}
private:
bool do_filter(const selection& selection, const std::vector<bytes>& pk, const std::vector<bytes>& ck, const query::result_row_view& static_row, const query::result_row_view& row) const;
};
result_set_builder(const selection& s, gc_clock::time_point now, cql_serialization_format sf);
void add_empty();
void add(bytes_opt value);
@@ -247,8 +295,9 @@ public:
std::unique_ptr<result_set> build();
api::timestamp_type timestamp_of(size_t idx);
int32_t ttl_of(size_t idx);
// Implements ResultVisitor concept from query.hh
template<typename Filter = nop_filter>
class visitor {
protected:
result_set_builder& _builder;
@@ -257,20 +306,101 @@ public:
uint32_t _row_count;
std::vector<bytes> _partition_key;
std::vector<bytes> _clustering_key;
Filter _filter;
public:
visitor(cql3::selection::result_set_builder& builder, const schema& s, const selection&);
visitor(cql3::selection::result_set_builder& builder, const schema& s,
const selection& selection, Filter filter = Filter())
: _builder(builder)
, _schema(s)
, _selection(selection)
, _row_count(0)
, _filter(filter)
{}
visitor(visitor&&) = default;
void add_value(const column_definition& def, query::result_row_view::iterator_type& i);
void accept_new_partition(const partition_key& key, uint32_t row_count);
void accept_new_partition(uint32_t row_count);
void accept_new_row(const clustering_key& key,
const query::result_row_view& static_row,
const query::result_row_view& row);
void accept_new_row(const query::result_row_view& static_row,
const query::result_row_view& row);
void accept_partition_end(const query::result_row_view& static_row);
void add_value(const column_definition& def, query::result_row_view::iterator_type& i) {
if (def.type->is_multi_cell()) {
auto cell = i.next_collection_cell();
if (!cell) {
_builder.add_empty();
return;
}
_builder.add_collection(def, cell->linearize());
} else {
auto cell = i.next_atomic_cell();
if (!cell) {
_builder.add_empty();
return;
}
_builder.add(def, *cell);
}
}
void accept_new_partition(const partition_key& key, uint32_t row_count) {
_partition_key = key.explode(_schema);
_row_count = row_count;
_filter.reset();
}
void accept_new_partition(uint32_t row_count) {
_row_count = row_count;
_filter.reset();
}
void accept_new_row(const clustering_key& key, const query::result_row_view& static_row, const query::result_row_view& row) {
_clustering_key = key.explode(_schema);
accept_new_row(static_row, row);
}
void accept_new_row(const query::result_row_view& static_row, const query::result_row_view& row) {
auto static_row_iterator = static_row.iterator();
auto row_iterator = row.iterator();
if (!_filter(_selection, _partition_key, _clustering_key, static_row, row)) {
return;
}
_builder.new_row();
for (auto&& def : _selection.get_columns()) {
switch (def->kind) {
case column_kind::partition_key:
_builder.add(_partition_key[def->component_index()]);
break;
case column_kind::clustering_key:
if (_clustering_key.size() > def->component_index()) {
_builder.add(_clustering_key[def->component_index()]);
} else {
_builder.add({});
}
break;
case column_kind::regular_column:
add_value(*def, row_iterator);
break;
case column_kind::static_column:
add_value(*def, static_row_iterator);
break;
default:
assert(0);
}
}
}
uint32_t accept_partition_end(const query::result_row_view& static_row) {
if (_row_count == 0) {
_builder.new_row();
auto static_row_iterator = static_row.iterator();
for (auto&& def : _selection.get_columns()) {
if (def->is_partition_key()) {
_builder.add(_partition_key[def->component_index()]);
} else if (def->is_static()) {
add_value(*def, static_row_iterator);
} else {
_builder.add_empty();
}
}
}
return _filter.get_rows_dropped();
}
};
private:
bytes_opt get_value(data_type t, query::result_atomic_cell_view c);
};
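The visitor class above is parameterized on a filter type so that the unfiltered path (nop_filter, an empty functor) compiles down to nothing, while restrictions_filter can drop rows and report how many via the reset()/get_rows_dropped() protocol. A minimal standalone sketch of that pattern — the names below are illustrative, not Scylla's actual types:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// No-op filter: accepts every row. Being stateless and empty, it can be
// inlined away entirely in the unfiltered fast path.
struct nop_filter {
    bool operator()(int /*row*/) const { return true; }
    void reset() {}
    uint32_t get_rows_dropped() const { return 0; }
};

// Counting filter: drops odd rows and remembers how many were dropped,
// mirroring restrictions_filter's reset()/get_rows_dropped() protocol.
struct odd_row_filter {
    uint32_t dropped = 0;
    bool operator()(int row) {
        if (row % 2 != 0) { ++dropped; return false; }
        return true;
    }
    void reset() { dropped = 0; }
    uint32_t get_rows_dropped() const { return dropped; }
};

// Visitor parameterized on the filter, as in result_set_builder::visitor:
// rows rejected by the filter never reach the builder, and partition
// boundaries reset the filter's per-partition state.
template <typename Filter = nop_filter>
struct visitor {
    Filter filter;
    std::vector<int> accepted;
    void accept_new_partition() { filter.reset(); }
    void accept_new_row(int row) {
        if (!filter(row)) { return; }
        accepted.push_back(row);
    }
    uint32_t accept_partition_end() const { return filter.get_rows_dropped(); }
};
```

Passing the filter as a template parameter rather than, say, a std::function keeps the common unfiltered query path free of any indirect-call overhead.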



@@ -53,6 +53,7 @@ selector_factories::selector_factories(std::vector<::shared_ptr<selectable>> sel
: _contains_write_time_factory(false)
, _contains_ttl_factory(false)
, _number_of_aggregate_factories(0)
, _number_of_factories_for_post_processing(0)
{
_factories.reserve(selectables.size());
@@ -76,8 +77,9 @@ bool selector_factories::uses_function(const sstring& ks_name, const sstring& fu
return false;
}
void selector_factories::add_selector_for_ordering(const column_definition& def, uint32_t index) {
void selector_factories::add_selector_for_post_processing(const column_definition& def, uint32_t index) {
_factories.emplace_back(simple_selector::new_factory(def.name_as_text(), index, def.type));
++_number_of_factories_for_post_processing;
}
std::vector<::shared_ptr<selector>> selector_factories::new_instances() const {


@@ -74,6 +74,11 @@ private:
*/
uint32_t _number_of_aggregate_factories;
/**
* The number of factories that are only for post processing.
*/
uint32_t _number_of_factories_for_post_processing;
public:
/**
* Creates a new <code>SelectorFactories</code> instance and collect the column definitions.
@@ -97,11 +102,12 @@ public:
bool uses_function(const sstring& ks_name, const sstring& function_name) const;
/**
* Adds a new <code>Selector.Factory</code> for a column that is needed only for ORDER BY purposes.
* Adds a new <code>Selector.Factory</code> for a column that is needed only for ORDER BY or post
* processing purposes.
* @param def the column that is needed for ordering
* @param index the index of the column definition in the Selection's list of columns
*/
void add_selector_for_ordering(const column_definition& def, uint32_t index);
void add_selector_for_post_processing(const column_definition& def, uint32_t index);
/**
* Checks if this <code>SelectorFactories</code> contains only factories for aggregates.
@@ -111,7 +117,7 @@ public:
*/
bool contains_only_aggregate_functions() const {
auto size = _factories.size();
return size != 0 && _number_of_aggregate_factories == size;
return size != 0 && _number_of_aggregate_factories == (size - _number_of_factories_for_post_processing);
}
/**
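The contains_only_aggregate_functions() change above stops counting factories that exist only for post processing (such as ORDER BY columns) against the "only aggregates" test: a selection like SELECT count(*) ... ORDER BY c should still be treated as aggregates-only even though a simple selector for c was appended internally. The arithmetic in isolation, as a hypothetical free function (not Scylla's API):

```cpp
#include <cassert>
#include <cstdint>

// True when every factory that produces user-visible output is an
// aggregate; factories added only for post processing are excluded
// from the total before comparing.
bool contains_only_aggregate_functions(uint32_t total_factories,
                                       uint32_t aggregate_factories,
                                       uint32_t post_processing_factories) {
    return total_factories != 0 &&
           aggregate_factories == total_factories - post_processing_factories;
}
```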


@@ -120,17 +120,19 @@ sets::literal::to_string() const {
}
sets::value
sets::value::from_serialized(bytes_view v, set_type type, cql_serialization_format sf) {
sets::value::from_serialized(const fragmented_temporary_buffer::view& val, set_type type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
// FIXME: deserializeForNativeProtocol?!
return with_linearized(val, [&] (bytes_view v) {
auto s = value_cast<set_type_impl::native_type>(type->deserialize(v, sf));
std::set<bytes, serialized_compare> elements(type->get_elements_type()->as_less_comparator());
for (auto&& element : s) {
elements.insert(elements.end(), type->get_elements_type()->decompose(element));
}
return value(std::move(elements));
});
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -198,10 +200,10 @@ sets::delayed_value::bind(const query_options& options) {
return constants::UNSET_VALUE;
}
// We don't support values > 64K because the serialization format encodes the length as an unsigned short.
if (b->size() > std::numeric_limits<uint16_t>::max()) {
if (b->size_bytes() > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(sprint("Set value is too long. Set values are limited to %d bytes but %d bytes value provided",
std::numeric_limits<uint16_t>::max(),
b->size()));
b->size_bytes()));
}
buffers.insert(buffers.end(), std::move(to_bytes(*b)));
@@ -225,7 +227,12 @@ sets::marker::bind(const query_options& options) {
void
sets::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
const auto& value = _t->bind(params._options);
auto value = _t->bind(params._options);
execute(m, row_key, params, column, std::move(value));
}
void
sets::setter::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value) {
if (value == constants::UNSET_VALUE) {
return;
}
@@ -264,7 +271,7 @@ sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const upd
}
for (auto&& e : set_value->_elements) {
mut.cells.emplace_back(e, params.make_cell({}));
mut.cells.emplace_back(e, params.make_cell(*set_type->value_comparator(), bytes_view(), atomic_cell::collection_member::yes));
}
auto smut = set_type->serialize_mutation_form(mut);
@@ -274,7 +281,7 @@ sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const upd
auto v = set_type->serialize_partially_deserialized_form(
{set_value->_elements.begin(), set_value->_elements.end()},
cql_serialization_format::internal());
m.set_cell(row_key, column, params.make_cell(std::move(v)));
m.set_cell(row_key, column, params.make_cell(*column.type, fragmented_temporary_buffer::view(v)));
} else {
m.set_cell(row_key, column, params.make_dead_cell());
}
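The b->size_bytes() check above enforces the serialization format's unsigned-short length prefix: any set value longer than 65535 bytes cannot be encoded and must be rejected. A standalone sketch of that bound, using a hypothetical helper name rather than Scylla's API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Collection values are length-prefixed with an unsigned short in the
// serialization format, so anything over 65535 bytes must be rejected
// before it reaches the encoder.
bool set_value_fits(size_t size_bytes) {
    return size_bytes <= std::numeric_limits<uint16_t>::max();
}
```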


@@ -78,7 +78,7 @@ public:
value(std::set<bytes, serialized_compare> elements)
: _elements(std::move(elements)) {
}
static value from_serialized(bytes_view v, set_type type, cql_serialization_format sf);
static value from_serialized(const fragmented_temporary_buffer::view& v, set_type type, cql_serialization_format sf);
virtual cql3::raw_value get(const query_options& options) override;
virtual bytes get_with_protocol_version(cql_serialization_format sf) override;
bool equals(set_type st, const value& v);
@@ -113,6 +113,7 @@ public:
: operation(column, std::move(t)) {
}
virtual void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) override;
static void execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params, const column_definition& column, ::shared_ptr<terminal> value);
};
class adder : public operation {


@@ -101,13 +101,6 @@ single_column_relation::to_receivers(schema_ptr schema, const column_definition&
}
if (is_IN()) {
// For partition keys we only support IN for the last name so far
if (column_def.is_partition_key() && !schema->is_last_partition_key(column_def)) {
throw exceptions::invalid_request_exception(sprint(
"Partition KEY part %s cannot be restricted by IN relation (only the last part of the partition key can)",
column_def.name_as_text()));
}
// We only allow IN on the row key and the clustering key so far, never on non-PK columns, and this even if
// there's an index
// Note: for backward compatibility reasons, we consider an IN of 1 value the same as an EQ, so we let that
@@ -116,18 +109,6 @@ single_column_relation::to_receivers(schema_ptr schema, const column_definition&
throw exceptions::invalid_request_exception(sprint(
"IN predicates on non-primary-key columns (%s) is not yet supported", column_def.name_as_text()));
}
} else if (is_slice()) {
// Non EQ relation is not supported without token(), even if we have a 2ndary index (since even those
// are ordered by partitioner).
// Note: In theory we could allow it for 2ndary index queries with ALLOW FILTERING, but that would
// probably require some special casing
// Note bis: This is also why we don't bother handling the 'tuple' notation of #4851 for keys. If we
// lift the limitation for 2ndary
// index with filtering, we'll need to handle it though.
if (column_def.is_partition_key()) {
throw exceptions::invalid_request_exception(
"Only EQ and IN relation are supported on the partition key (unless you use the token() function)");
}
}
if (is_contains() && !receiver->type->is_collection()) {


@@ -134,7 +134,7 @@ protected:
#endif
virtual sstring to_string() const override {
auto entity_as_string = _entity->to_string();
auto entity_as_string = _entity->to_cql_string();
if (_map_key) {
entity_as_string = sprint("%s[%s]", std::move(entity_as_string), _map_key->to_string());
}


@@ -42,7 +42,7 @@
#include "alter_keyspace_statement.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
#include "database.hh"
#include "db/system_keyspace.hh"
bool is_system_keyspace(const sstring& keyspace);
@@ -59,7 +59,7 @@ future<> cql3::statements::alter_keyspace_statement::check_access(const service:
return state.has_keyspace_access(_name, auth::permission::ALTER);
}
void cql3::statements::alter_keyspace_statement::validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) {
void cql3::statements::alter_keyspace_statement::validate(service::storage_proxy& proxy, const service::client_state& state) {
try {
service::get_local_storage_proxy().get_db().local().find_keyspace(_name); // throws on failure
auto tmp = _name;
@@ -90,7 +90,7 @@ void cql3::statements::alter_keyspace_statement::validate(distributed<service::s
}
}
future<shared_ptr<cql_transport::event::schema_change>> cql3::statements::alter_keyspace_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) {
future<shared_ptr<cql_transport::event::schema_change>> cql3::statements::alter_keyspace_statement::announce_migration(service::storage_proxy& proxy, bool is_local_only) {
auto old_ksm = service::get_local_storage_proxy().get_db().local().find_keyspace(_name).metadata();
return service::get_local_migration_manager().announce_keyspace_update(_attrs->as_ks_metadata_update(old_ksm), is_local_only).then([this] {
using namespace cql_transport;


@@ -60,8 +60,8 @@ public:
const sstring& keyspace() const override;
future<> check_access(const service::client_state& state) override;
void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
void validate(service::storage_proxy& proxy, const service::client_state& state) override;
future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy, bool is_local_only) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};


@@ -62,12 +62,12 @@ public:
, _options(std::move(options)) {
}
void validate(distributed<service::storage_proxy>&, const service::client_state&) override;
void validate(service::storage_proxy&, const service::client_state&) override;
virtual future<> check_access(const service::client_state&) override;
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(distributed<service::storage_proxy>&, service::query_state&, const query_options&) override;
execute(service::storage_proxy&, service::query_state&, const query_options&) override;
};
}


@@ -75,7 +75,7 @@ future<> alter_table_statement::check_access(const service::client_state& state)
return state.has_column_family_access(keyspace(), column_family(), auth::permission::ALTER);
}
void alter_table_statement::validate(distributed<service::storage_proxy>& proxy, const service::client_state& state)
void alter_table_statement::validate(service::storage_proxy& proxy, const service::client_state& state)
{
// validated in announce_migration()
}
@@ -165,9 +165,9 @@ static void validate_column_rename(database& db, const schema& schema, const col
}
}
future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::announce_migration(service::storage_proxy& proxy, bool is_local_only)
{
auto& db = proxy.local().get_db().local();
auto& db = proxy.get_db().local();
auto schema = validation::validate_column_family(db, keyspace(), column_family());
if (schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
@@ -246,15 +246,22 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
cfm.with_column(column_name->name(), type, _is_static ? column_kind::static_column : column_kind::regular_column);
// Adding a column to a table which has an include all view requires the column to be added to the view
// as well
// Adding a column to a base table always requires updating the view
// schemas: If the view includes all columns it should include the new
// column, but if it doesn't, it may need to include the new
// unselected column as a virtual column. The case when we
// shouldn't add a virtual column is when the view has in its PK one
// of the base's regular columns - but even in this case we need to
// rebuild the view schema, to update the column ID.
if (!_is_static) {
for (auto&& view : cf.views()) {
schema_builder builder(view);
if (view->view_info()->include_all_columns()) {
schema_builder builder(view);
builder.with_column(column_name->name(), type);
view_updates.push_back(view_ptr(builder.build()));
} else if (!view->view_info()->base_non_pk_column_in_view_pk()) {
db::view::create_virtual_column(builder, column_name->name(), type);
}
view_updates.push_back(view_ptr(builder.build()));
}
}
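The comment above distinguishes three outcomes when a new non-static column is added to a base table that has views: add it to the view as a real column, add it as a virtual (unselected) column, or merely rebuild the view schema. A hedged sketch of that decision, with illustrative names rather than Scylla's API:

```cpp
#include <cassert>

enum class view_column_action { add_real, add_virtual, rebuild_only };

// Mirrors the logic above: a view that includes all columns gets the new
// column for real; otherwise it gets a virtual column, unless the view's
// PK already contains a base regular column, in which case the schema is
// only rebuilt (to refresh column IDs).
view_column_action action_for_new_base_column(bool include_all_columns,
                                              bool base_non_pk_column_in_view_pk) {
    if (include_all_columns) {
        return view_column_action::add_real;
    }
    if (!base_non_pk_column_in_view_pk) {
        return view_column_action::add_virtual;
    }
    return view_column_action::rebuild_only;
}
```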
@@ -269,7 +276,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
auto type = validate_alter(schema, *def, *validator);
// In any case, we update the column definition
cfm.with_altered_column_type(column_name->name(), type);
cfm.alter_column_type(column_name->name(), type);
// We also have to validate the view types here. If we have a view which includes a column as part of
// the clustering key, we need to make sure that it is indeed compatible.
@@ -278,7 +285,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
if (view_def) {
schema_builder builder(view);
auto view_type = validate_alter(view, *view_def, *validator);
builder.with_altered_column_type(column_name->name(), std::move(view_type));
builder.alter_column_type(column_name->name(), std::move(view_type));
view_updates.push_back(view_ptr(builder.build()));
}
}
@@ -299,20 +306,16 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
} else {
for (auto&& column_def : boost::range::join(schema->static_columns(), schema->regular_columns())) { // find
if (column_def.name() == column_name->name()) {
cfm.without_column(column_name->name());
cfm.remove_column(column_name->name());
break;
}
}
}
// If a column is dropped which is included in a view, we don't allow the drop to take place.
auto view_names = ::join(", ", cf.views()
| boost::adaptors::filtered([&] (auto&& v) { return bool(v->get_column_definition(column_name->name())); })
| boost::adaptors::transformed([] (auto&& v) { return v->cf_name(); }));
if (!view_names.empty()) {
if (!cf.views().empty()) {
throw exceptions::invalid_request_exception(sprint(
"Cannot drop column %s, depended on by materialized views (%s.{%s})",
column_name, keyspace(), view_names));
"Cannot drop column %s on base table %s.%s with materialized views",
column_name, keyspace(), column_family()));
}
break;
}
@@ -346,9 +349,10 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
auto to = entry.second->prepare_column_identifier(schema);
validate_column_rename(db, *schema, *from, *to);
cfm.with_column_rename(from->name(), to->name());
cfm.rename_column(from->name(), to->name());
// If the view includes a renamed column, it must be renamed in the view table and the definition.
// If the view includes a renamed column, it must be renamed in
// the view table and the definition.
for (auto&& view : cf.views()) {
if (view->get_column_definition(from->name())) {
schema_builder builder(view);
@@ -356,7 +360,7 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_table_statement::a
auto view_from = entry.first->prepare_column_identifier(view);
auto view_to = entry.second->prepare_column_identifier(view);
validate_column_rename(db, *view, *view_from, *view_to);
builder.with_column_rename(view_from->name(), view_to->name());
builder.rename_column(view_from->name(), view_to->name());
auto new_where = util::rename_column_in_where_clause(
view->view_info()->where_clause(),


@@ -77,8 +77,8 @@ public:
bool is_static);
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual void validate(service::storage_proxy& proxy, const service::client_state& state) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy, bool is_local_only) override;
virtual std::unique_ptr<prepared> prepare(database& db, cql_stats& stats) override;
};


@@ -66,7 +66,7 @@ future<> alter_type_statement::check_access(const service::client_state& state)
return state.has_keyspace_access(keyspace(), auth::permission::ALTER);
}
void alter_type_statement::validate(distributed<service::storage_proxy>& proxy, const service::client_state& state)
void alter_type_statement::validate(service::storage_proxy& proxy, const service::client_state& state)
{
// Validation is left to announceMigration as it's easier to do it while constructing the updated type.
// It doesn't really change anything anyway.
@@ -110,7 +110,7 @@ void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, b
if (t_opt) {
modified = true;
// We need to update this column
cfm.with_altered_column_type(column.name(), *t_opt);
cfm.alter_column_type(column.name(), *t_opt);
}
}
if (modified) {
@@ -135,10 +135,10 @@ void alter_type_statement::do_announce_migration(database& db, ::keyspace& ks, b
}
}
future<shared_ptr<cql_transport::event::schema_change>> alter_type_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_type_statement::announce_migration(service::storage_proxy& proxy, bool is_local_only)
{
return seastar::async([this, &proxy, is_local_only] {
auto&& db = proxy.local().get_db().local();
auto&& db = proxy.get_db().local();
try {
auto&& ks = db.find_keyspace(keyspace());
do_announce_migration(db, ks, is_local_only);
@@ -165,7 +165,7 @@ alter_type_statement::add_or_alter::add_or_alter(const ut_name& name, bool is_ad
user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_update) const
{
if (get_idx_of_field(to_update, _field_name)) {
throw exceptions::invalid_request_exception(sprint("Cannot add new field %s to type %s: a field of the same name already exists", _field_name->name(), _name.to_string()));
throw exceptions::invalid_request_exception(sprint("Cannot add new field %s to type %s: a field of the same name already exists", _field_name->to_string(), _name.to_string()));
}
std::vector<bytes> new_names(to_update->field_names());
@@ -173,7 +173,7 @@ user_type alter_type_statement::add_or_alter::do_add(database& db, user_type to_
std::vector<data_type> new_types(to_update->field_types());
auto&& add_type = _field_type->prepare(db, keyspace())->get_type();
if (add_type->references_user_type(to_update->_keyspace, to_update->_name)) {
throw exceptions::invalid_request_exception(sprint("Cannot add new field %s of type %s to type %s as this would create a circular reference", _field_name->name(), _field_type->to_string(), _name.to_string()));
throw exceptions::invalid_request_exception(sprint("Cannot add new field %s of type %s to type %s as this would create a circular reference", _field_name->to_string(), _field_type->to_string(), _name.to_string()));
}
new_types.push_back(std::move(add_type));
return user_type_impl::get_instance(to_update->_keyspace, to_update->_name, std::move(new_names), std::move(new_types));
@@ -183,13 +183,13 @@ user_type alter_type_statement::add_or_alter::do_alter(database& db, user_type t
{
stdx::optional<uint32_t> idx = get_idx_of_field(to_update, _field_name);
if (!idx) {
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", _field_name->name(), _name.to_string()));
throw exceptions::invalid_request_exception(sprint("Unknown field %s in type %s", _field_name->to_string(), _name.to_string()));
}
auto previous = to_update->field_types()[*idx];
auto new_type = _field_type->prepare(db, keyspace())->get_type();
if (!new_type->is_compatible_with(*previous)) {
throw exceptions::invalid_request_exception(sprint("Type %s is incompatible with previous type %s of field %s in user type %s", _field_type->to_string(), previous->as_cql3_type()->to_string(), _field_name->name(), _name.to_string()));
throw exceptions::invalid_request_exception(sprint("Type %s is incompatible with previous type %s of field %s in user type %s", _field_type->to_string(), previous->as_cql3_type()->to_string(), _field_name->to_string(), _name.to_string()));
}
std::vector<data_type> new_types(to_update->field_types());


@@ -59,11 +59,11 @@ public:
virtual future<> check_access(const service::client_state& state) override;
virtual void validate(distributed<service::storage_proxy>& proxy, const service::client_state& state) override;
virtual void validate(service::storage_proxy& proxy, const service::client_state& state) override;
virtual const sstring& keyspace() const override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only) override;
virtual future<shared_ptr<cql_transport::event::schema_change>> announce_migration(service::storage_proxy& proxy, bool is_local_only) override;
class add_or_alter;
class renames;


@@ -69,14 +69,14 @@ future<> alter_view_statement::check_access(const service::client_state& state)
return make_ready_future<>();
}
void alter_view_statement::validate(distributed<service::storage_proxy>&, const service::client_state& state)
void alter_view_statement::validate(service::storage_proxy&, const service::client_state& state)
{
// validated in announce_migration()
}
future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::announce_migration(distributed<service::storage_proxy>& proxy, bool is_local_only)
future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::announce_migration(service::storage_proxy& proxy, bool is_local_only)
{
auto&& db = proxy.local().get_db().local();
auto&& db = proxy.get_db().local();
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
if (!schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER MATERIALIZED VIEW on Table");
@@ -86,10 +86,10 @@ future<shared_ptr<cql_transport::event::schema_change>> alter_view_statement::an
throw exceptions::invalid_request_exception("ALTER MATERIALIZED VIEW WITH invoked, but no parameters found");
}
_properties->validate(proxy.local().get_db().local().get_config().extensions());
_properties->validate(proxy.get_db().local().get_config().extensions());
auto builder = schema_builder(schema);
_properties->apply_to_builder(builder, proxy.local().get_db().local().get_config().extensions());
_properties->apply_to_builder(builder, proxy.get_db().local().get_config().extensions());
if (builder.get_gc_grace_seconds() == 0) {
throw exceptions::invalid_request_exception(

Some files were not shown because too many files have changed in this diff.