1370 Commits

Author SHA1 Message Date
Eliran Sinvani
0220786710 database: Fix view schemas in place when loading
On restart the view schemas are loaded and might contain old
views with an unmarked computed column. We already have code to
update the schema, but before we do it we load the view as is. This
is not desired since once registered, this view version can be used
for writes, which is forbidden since we will spot a non-computed
column which is in the view's primary key but not in the base table
at all. To solve this, in addition to altering the persistent schema,
we fix the view's loaded schema in place. This is safe since computed
column is just involved in generating a value for this column when
creating a view update so the effect of this manipulation stays
internal.
The second stage of the in place fixing is to persist the
changes made in the in place fixing so the view is ready for
the next node restart, in particular the `computed_columns` table.
2021-03-07 12:57:16 +02:00
Eliran Sinvani
39cd9dae4e materialized views: Extract fix legacy schema into its own logic
We extract the logic for fixing the view schema into its own
logic as we will need to use it in more places in the code.
This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant since
it becomes a two liner wrapper for this logic. We also
remove it here and replace the call to it with the equivalent code.
2021-03-07 12:50:42 +02:00
Tomasz Grabiec
761f89e55e api: Introduce system/drop_sstable_caches RESTful API
Evicts objects from caches which reflect sstable content, like the row
cache. In the future, it will also drop the page cache
and sstable index caches.

Unlike lsa/compact, doesn't cause reactor stalls.

The old lsa/compact call invokes memory reclamation, which is
non-preemptible. It also compacts LSA segments, so does more
work. Some use cases don't need to compact LSA segments, just want the
row cache to be wiped.

Message-Id: <20210301120211.36195-1-tgrabiec@scylladb.com>
2021-03-01 16:13:04 +02:00
Avi Kivity
78d1afeabd Merge "Use radix tree to store cells on a row" from Pavel E
"
Current storage of cells in a row is a union of vector and set. The
vector holds 5 cell_and_hash's inline, up to 32 ones in the external
storage and then it's switched to std::set. Once switched, the whole
union becomes a waste of space, as its size is

   sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes

and only 3 pointers from it are used (std::set header). Also the
overhead to keep cell_and_hash as a set entry is more than the size
of the structure itself.

Column ids are 32-bit integers that most likely come sequentially.
For this kind of a search key a radix tree (with some care for
non-sequential cases) can be beneficial.

This set introduces a compact radix tree, that uses 7-bit sub values
from the search key to index on each node and compacts the nodes
themselves for better memory usage. Then the row::_storage is replaced
with the new tree.

The most notable result is the memory footprint decrease, by up to
2x for wide rows. The performance of micro-benchmarks is a bit
lower for small rows and (!) higher for longer ones (8+ cells). The numbers
are in patch #12 (spoiler: they are better than for v2)

v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-nodes search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
  - changed functions not to return values via refs-arguments
  - fixed nested classes to properly use language constructors
  - renamed index_to to key_t to distinguish from node_index_t
  - improved recurring variadic templates not to use sentinel argument
  - use standard concepts

v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)

A note on item 1 from v2 changelog. The nodes are no longer packed
perfectly, each has grown 3 bytes. But it turned out that when used
as cells container most of this growth drowned in lsa alignments.

next todo:
- aarch64 version of 16-keys node search

tests: unit(dev), unit(debug for radix*), pref(dev)
"

* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
  test/memory_footpring: Print radix tree node sizes
  row: Remove old storages
  row: Prepare row::equal for switch
  row: Prepare row::difference for switch
  row: Introduce radix tree storage type
  row-equal: Re-declare the cells_equal lambda
  test: Add tests for radix tree
  utils: Compact radix tree
  array-search: Add helpers to search for a byte in array
  test/perf_collection: Add callback to check the speed of clone
  test/perf_mutation: Add option to run with more than 1 columns
  test/perf_mutation: Prepare to have several regular columns
  test/perf_mutation: Use builder to build schema
2021-02-18 21:19:14 +02:00
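To make the indexing scheme from the merge message above concrete, here is a minimal Python sketch of a radix tree that walks a 32-bit key in 7-bit pieces. It is only an illustration: the names `split_key` and `RadixTree` are hypothetical, nodes are plain dicts rather than the compact node layouts the series introduces, and nothing here is taken from Scylla's C++ code.

```python
RADIX_BITS = 7                        # sub-value width named in the message
RADIX_MASK = (1 << RADIX_BITS) - 1    # 0x7f
KEY_BITS = 32                         # column ids are 32-bit integers
LEVELS = -(-KEY_BITS // RADIX_BITS)   # ceil(32 / 7) == 5 levels


def split_key(key):
    """Split a 32-bit key into 7-bit indices, most significant first."""
    return [(key >> (RADIX_BITS * level)) & RADIX_MASK
            for level in reversed(range(LEVELS))]


class RadixTree:
    """Illustrative tree: each node is a dict indexed by a 7-bit sub-value."""

    def __init__(self):
        self._root = {}

    def insert(self, key, value):
        node = self._root
        *inner, leaf = split_key(key)
        for index in inner:
            # Create intermediate nodes on demand.
            node = node.setdefault(index, {})
        node[leaf] = value

    def get(self, key, default=None):
        node = self._root
        *inner, leaf = split_key(key)
        for index in inner:
            node = node.get(index)
            if node is None:
                return default
        return node.get(leaf, default)


tree = RadixTree()
for column_id in (0, 1, 2, 130):
    tree.insert(column_id, f"cell-{column_id}")
```

Sequential column ids share every index except the last one, so they land in the same leaf node, which is the property the message relies on when it calls a radix tree a good fit for this kind of key.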
Benny Halevy
92e0e84ee5 database: futurize remove
In preparation for futurizing the querier_cache api.

Coroutinize drop_column_family while at it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-61-bhalevy@scylladb.com>
2021-02-17 18:52:53 +02:00
Pavel Emelyanov
1bdfa355ea row: Remove old storages
Now that the 3rd storage type (radix tree) is all in, the old
storage can be safely removed.  The result is:

1. memory footprint

sizeof(class row):  112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes

the "in cache" value depends on the number of cells:

num of cells     master       patch
         1       752         656
         2       808         712
         3       864         768
         4       920         824
         5       968         936
         6      1136         992
         ...
         16     1840        1672
         17     1904        1992  (+88)
         18     1976        2048  (+72)
         19     2048        2104  (+56)
         20     2120        2160  (+40)
         21     2184        2208  (+24)
         22     2256        2264  ( +8)
         23     2328        2320
         ...
         32     2960        2808

After 32 cells the storage switches into rbtree with
24-bytes per-cell overhead and the radix tree improvement
rocketlaunches

         64     7872        6056
        128    15040        9512
        256    29376       18568

2. perf_mutation test is enhanced by this series and the
   results differ depending on the number of columns used

                    tps value
--column-count    master   patch
          1       59.9k    57.6k  (-3.8%)
          2       59.9k    57.5k
          4       59.8k    57.6k
          8       57.6k    57.7k  <- eq
         16       56.3k    57.6k
         32       53.2k    57.4k  (+7.9%)

A note on this. Last time 1-column test was ~5% worse which
was explained by inline storage of 5 cells that's present on
current implementation and was absent in radix tree.

An attempt to make inline storage for small radix trees
resulted in complete loss of memory footprint gain, but gave
fraction of percent to perf_mutation performance. So this
version doesn't have inline nodes.

The 1.2% improvement from v2 surprisingly came from the
tree::clone_from() which in v2 was work-around-ed by slow
walk+emplace sequence while this version has the optimized
API call for cloning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:35:06 +03:00
Gleb Natapov
d06d21bfae database: remove add_keyspace() function
It is no longer used.
Message-Id: <20210209175931.1796263-2-gleb@scylladb.com>
2021-02-10 00:36:02 +01:00
Gleb Natapov
d8345c67d9 Consolidate system and non system keyspace creation
The code that creates system keyspaces open-codes a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
2021-02-09 17:18:04 +01:00
Avi Kivity
4082f57edc Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund
Refs #6148

Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.

This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. The default is to ask for a flush when usage is over 50%. (We do not wait for results.)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments

Note: flush threshold is not exposed in scylla config (yet), because I am unsure of the wording, and even of whether it should be.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.

Closes #7879

* github.com:scylladb/scylla:
  commitlog: Force earlier cycle/flush iff segment reserve is empty
  commitlog: Make segment allocation wait iff disk usage > max
  commitlog: Do partial (memtable) flushing based on threshold
  commitlog: Make flush threshold configurable
  table: Add a flush RP mark to table, and shortcut if not above
2021-02-08 16:44:05 +02:00
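The "partial" flush arithmetic in the merge message above can be sketched as follows. This is one reading of the description, not the commitlog code: the function name `segments_to_flush`, its parameters, and the idea of counting whole segments oldest-first are assumptions made for illustration; only the 50% default threshold and the `used - thres/2` amount come from the message.

```python
def segments_to_flush(segment_sizes, max_disk, flush_fraction=0.5):
    """Return how many of the oldest segments a partial flush would target.

    segment_sizes: sizes of the current segments, oldest first (hypothetical input)
    max_disk: the commitlog disk limit
    flush_fraction: flush threshold as a fraction of max_disk (50% by default)
    """
    used = sum(segment_sizes)
    threshold = max_disk * flush_fraction
    if used <= threshold:
        return 0  # below the threshold: no flush is requested

    # Amount to clear, per the message: used - thres/2.
    target = used - threshold / 2

    freed = 0
    count = 0
    for size in segment_sizes:
        if freed >= target:
            break
        freed += size
        count += 1
    return count


# Six 32-unit segments against a limit of 256: 192 used, threshold 128,
# so the target is 192 - 64 = 128 units, i.e. the four oldest segments.
oldest_to_clear = segments_to_flush([32] * 6, max_disk=256)
```

Tables whose unflushed data lies entirely in newer segments would not need flushing under this scheme, which is the "lighter" flush the message describes.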
Pavel Emelyanov
a05adb8538 database: Remove global storage proxy reference
The db::update_keyspace() needs sharded<storage_proxy>
reference, but the only caller of it already has it and
can pass one as argument.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-3-xemul@scylladb.com>
2021-02-08 12:59:46 +01:00
Avi Kivity
913d970c64 Merge "Unify inactive readers" from Botond
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
of storing inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.

Tests: unit(release, debug:v1)
"

* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: store inactive readers directly
  querier_cache: store readers in the reader concurrency semaphore directly
  querier_cache: retire memory based cache eviction
  querier_cache: delegate expiry to the reader_concurrency_semaphore
  reader_concurrency_semaphore: introduce ttl for inactive reads
  querier_cache: use new eviction notify mechanism to maintain stats
  reader_concurrency_semaphore: add eviction notification facility
  reader_concurrency_semaphore: extract evict code into method evict()
2021-02-03 10:59:04 +02:00
Calle Wilund
c3d95811da table: Add a flush RP mark to table, and shortcut if not above
Adds a second RP to table, marking where we flushed last.
If a new flush request comes in that is below this mark, we
can skip a second flush.

This is to (in future) support incremental CL flush.
2021-01-05 18:16:09 +00:00
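The shortcut described in the commit above can be sketched in a few lines of Python. The class and attribute names (`Table`, `flushed_rp`, `highest_rp`) are hypothetical, and replay positions are modelled as plain integers for brevity; the real ones are richer objects.

```python
class Table:
    """Toy table that remembers the replay position (RP) of its last flush."""

    def __init__(self):
        self.highest_rp = 0   # highest RP applied to the memtable so far
        self.flushed_rp = 0   # the "flush RP mark": where we flushed last
        self.flush_count = 0

    def apply(self, rp):
        self.highest_rp = max(self.highest_rp, rp)

    def flush(self, requested_rp):
        # A request below the mark is already covered by an earlier flush.
        if requested_rp < self.flushed_rp:
            return False
        self.flush_count += 1
        self.flushed_rp = self.highest_rp
        return True


table = Table()
table.apply(10)
first = table.flush(10)    # performs a real flush and moves the mark to 10
second = table.flush(5)    # below the mark, so it is skipped
```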
Piotr Sarna
aba9772eff database: migrate find_keyspace to string views
... in order to avoid creating unnecessary sstring instances
just to compare strings.
2021-01-04 09:47:01 +01:00
Calle Wilund
71c5dc82df database: Verify iff we actually are writing memtables to disk in truncate
Fixes #7732

When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:

Fill memtables
Flush
Truncate with snapshot A
Fill memtables some more
Truncate
Move snapshot A to upload + refresh (load old tables)
Truncate

The last op will assert, because while we have sstables loaded, which
will be discarded now, we did not in fact generate any _new_ ones
(since memtables are empty), and the RP we get back from discard is
one from an earlier generation set.

(Any permutation of events that create the situation "empty memtable" +
"non-empty sstables with only old tables" will generate the same error).

Added a check that before flushing checks if we actually have any
data, and if not, does not uphold the RP relation assert.

Closes #7799
2020-12-15 16:24:36 +02:00
Piotr Sarna
cd1e351dc1 table: unify waiting for pending operations
In order to reduce code duplication which already caused a bug,
waiting for pending operations is now unified with a single helper
function.
2020-12-15 13:11:25 +01:00
Piotr Sarna
57d63ca036 database: add waiting for pending streams on table drop
We already wait for pending reads and writes, so for completeness
we should also wait for all pending stream operations to finish
before dropping the table to avoid inconsistencies.
2020-12-15 12:55:45 +01:00
Pavel Emelyanov
62214e2258 database: Have local id arg in transform_counter_updates_to_shards()
There are two places that call it -- database code itself and
tests. The former already has the local host id, so just pass
one.

The latter are a bit trickier. Currently they use the value from
storage_service created by storage_service_for_tests, but since
this version of service doesn't pass through prepare_to_join()
the local_host_id value there is default-initialized, so just
default-initialize the needed argument in place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 15:09:30 +03:00
Pavel Emelyanov
66dcc47571 system-keyspace: Rewrite force_blocking_flush
The method is called after query_processor::execute_internal
to flush the cf. Encapsulating this flush inside database and
getting the database from query_processor allows removing the
database reference from the global qctx object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Avi Kivity
f55b522c1b database: detect misconfigured unit tests that don't set available_memory
available_memory is used to seed many caches and controllers. Usually
it's detected from the environment, but unit tests configure it
on their own with fake values. If they forget, then the undefined
behavior sanitizer will kick in in random places (see 8aa842614a
("test: gossip_test: configure database memory allocation correctly")
for an example).

Prevent this early by asserting that available_memory is nonzero.

Closes #7612
2020-11-18 08:49:32 +02:00
Botond Dénes
34c213f9bb database: hook-in to the seastar OOM diagnostics report generation
Use the mechanism provided by seastar to add scylla specific information
to the memory diagnostics report. The information added is mostly the
same as that contained in the output of `scylla memory` from `scylla-gdb.py`,
with the exception of the coordinator-specific metrics. The report is
generated in the database layer, where the storage-proxy is not
available and it is not worth pulling it in just for this purpose.

An example report:

INFO  2020-11-10 12:02:44,182 [shard 0] testlog - Dumping seastar memory diagnostics
Used memory:  2029M
Free memory:  19M
Total memory: 2G

LSA
  allocated: 1770M
  used:      1766M
  free:      3M

Cache:
  total: 1770M
  used:  1716M
  free:  54M

Memtables:
 total: 0B
 Regular:
  real dirty: 0B
  virt dirty: 0B
 System:
  real dirty: 0B
  virt dirty: 0B

Replica:
  Read Concurrency Semaphores:
    user: 100/100, 33M/41M, queued: 477
    streaming: 0/10, 0B/41M, queued: 0
    system: 0/100, 0B/41M, queued: 0
    compaction: 0/∞, 0B/∞
  Execution Stages:
    data query stage:
      statement	987
         Total: 987
    mutation query stage:
         Total: 0
    apply stage:
         Total: 0
  Tables - Ongoing Operations:
    Pending writes (top 10):
      0 Total (all)
    Pending reads (top 10):
      1564 ks.test
      1564 Total (all)
    Pending streams (top 10):
      0 Total (all)

Small pools:
objsz	spansz	usedobj	memory	unused	wst%
8	4K	11k	88K	6K	6
10	4K	10	8K	8K	98
12	4K	2	8K	8K	99
14	4K	4	8K	8K	99
16	4K	15k	244K	5K	2
32	4K	2k	52K	3K	5
32	4K	20k	628K	2K	0
32	4K	528	20K	4K	17
32	4K	5k	144K	480B	0
48	4K	17k	780K	3K	0
48	4K	3k	140K	3K	2
64	4K	50k	3M	6K	0
64	4K	66k	4M	7K	0
80	4K	131k	10M	1K	0
96	4K	37k	3M	192B	0
112	4K	65k	7M	10K	0
128	4K	21k	3M	2K	0
160	4K	38k	6M	3K	0
192	4K	15k	3M	12K	0
224	4K	3k	720K	10K	1
256	4K	148	56K	19K	33
320	8K	13k	4M	14K	0
384	8K	3k	1M	20K	1
448	4K	11k	5M	5K	0
512	4K	2k	1M	39K	3
640	12K	163	144K	42K	29
768	12K	1k	832K	59K	7
896	8K	131	144K	29K	20
1024	4K	643	732K	89K	12
1280	20K	11k	13M	26K	0
1536	12K	12	128K	110K	85
1792	16K	12	144K	123K	85
2048	8K	601	1M	14K	1
2560	20K	70	224K	48K	21
3072	12K	13	240K	201K	83
3584	28K	6	288K	266K	92
4096	16K	10k	39M	88K	0
5120	20K	7	416K	380K	91
6144	24K	24	480K	336K	70
7168	28K	27	608K	413K	67
8192	32K	256	3M	736K	26
10240	40K	11k	105M	550K	0
12288	48K	21	960K	708K	73
14336	56K	59	1M	378K	31
16384	64K	8	1M	1M	89
Page spans:
index	size	free	used	spans
0	4K	48M	48M	12k
1	8K	6M	6M	822
2	16K	41M	41M	3k
3	32K	18M	18M	579
4	64K	108M	108M	2k
5	128K	1774M	2G	14k
6	256K	512K	0B	2
7	512K	2M	2M	4
8	1M	0B	0B	0
9	2M	2M	0B	1
10	4M	0B	0B	0
11	8M	0B	0B	0
12	16M	16M	0B	1
13	32M	32M	32M	1
14	64M	0B	0B	0
15	128M	0B	0B	0
16	256M	0B	0B	0
17	512M	0B	0B	0
18	1G	0B	0B	0
19	2G	0B	0B	0
20	4G	0B	0B	0
21	8G	0B	0B	0
22	16G	0B	0B	0
23	32G	0B	0B	0
24	64G	0B	0B	0
25	128G	0B	0B	0
26	256G	0B	0B	0
27	512G	0B	0B	0
28	1T	0B	0B	0
29	2T	0B	0B	0
30	4T	0B	0B	0
31	8T	0B	0B	0
2020-11-17 15:13:21 +02:00
Avi Kivity
5d45662804 database, streaming: remove remnants of memtable-base streaming
Commit e5be3352cf ("database, streaming, messaging: drop
streaming memtables") removed streaming memtables; this removes
the mechanisms to synchronize them: _streaming_flush_gate and
_streaming_flush_phaser. The memory manager for streaming is removed,
and its 10% reserve is evenly distributed between memtables and
general use (e.g. cache).

Note that _streaming_flush_phaser and _streaming_flush_gate are
no longer used to synchronize anything - the gate is only used
to protect the phaser, and the phaser isn't used for anything.

Closes #7454
2020-11-16 14:32:19 +01:00
Avi Kivity
6091dc9b79 Merge 'Add more overload-related metrics' from Piotr Sarna
This miniseries adds metrics which can help the users detect potential overloads:
 * due to having too many in-flight hints
 * due to exceeding the capacity of the read admission queue, on replica side

Closes #7584

* github.com:scylladb/scylla:
  reader_concurrency_semaphore: add metrics for shed reads
  storage_proxy: add metrics for too many in-flight hints failures
2020-11-12 12:27:31 +02:00
Piotr Sarna
3ce7848bdf reader_concurrency_semaphore: add metrics for shed reads
When the admission queue capacity reaches its limits, excessive
reads are shed in order to avoid overload. Each such operation
now bumps the metrics, which can help the user judge if a replica
is overloaded.
2020-11-11 19:01:38 +01:00
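The behaviour that the new metric counts can be pictured with a small Python model. The names `AdmissionQueue`, `max_active`, `max_queued` and `shed_reads` are hypothetical and chosen for this sketch; it does not reproduce the reader_concurrency_semaphore API.

```python
from collections import deque


class AdmissionQueue:
    """Toy admission control: run, queue, or shed an incoming read."""

    def __init__(self, max_active, max_queued):
        self.max_active = max_active
        self.max_queued = max_queued
        self.active = 0
        self.waiters = deque()
        self.shed_reads = 0   # the counter a metric would expose

    def admit(self, read_id):
        if self.active < self.max_active:
            self.active += 1
            return "admitted"
        if len(self.waiters) < self.max_queued:
            self.waiters.append(read_id)
            return "queued"
        # Queue is at capacity: shed the read and bump the counter.
        self.shed_reads += 1
        return "shed"


queue = AdmissionQueue(max_active=1, max_queued=1)
outcomes = [queue.admit(read_id) for read_id in range(3)]
```

A steadily growing `shed_reads` value is the signal the commit message refers to: the replica is receiving more reads than it can admit or queue.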
Benny Halevy
6d06853e6c abstract_replication_strategy: convert to shared_token_metadata
To facilitate that, keep a const shared_token_metadata& in class database
rather than a const token_metadata&

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
29ed59f8c4 main: start a shared_token_metadata
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Michał Chojnowski
1eb19976b9 database: make changes to durable_writes effective immediately
Users can change `durable_writes` anytime with ALTER KEYSPACE.
Cassandra reads the value of `durable_writes` every time when applying
a mutation, so changes to that setting take effect immediately. That is,
mutations are added to the commitlog only when `durable_writes` is `true`
at the moment of their application.
Scylla reads the value of `durable_writes` only at `keyspace` construction time,
so changes to that setting take effect only after Scylla is restarted.
This patch fixes the inconsistency.

Fixes #3034

Closes #7533
2020-11-06 17:53:22 +01:00
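The difference between reading `durable_writes` once and reading it on every write can be shown with a toy model. The classes below are hypothetical stand-ins written for this sketch, not Scylla's `keyspace` or `database` classes.

```python
class Keyspace:
    def __init__(self, durable_writes=True):
        self.durable_writes = durable_writes


class Database:
    """Toy write path that consults durable_writes at apply time."""

    def __init__(self, keyspace):
        self.keyspace = keyspace
        self.commitlog = []
        self.memtable = []

    def apply(self, mutation):
        # Read the setting on every apply, so an ALTER KEYSPACE
        # takes effect for the very next mutation.
        if self.keyspace.durable_writes:
            self.commitlog.append(mutation)
        self.memtable.append(mutation)


ks = Keyspace(durable_writes=True)
db = Database(ks)
db.apply("m1")
ks.durable_writes = False   # stands in for ALTER KEYSPACE ... durable_writes = false
db.apply("m2")
```

Had `Database` copied the flag in its constructor instead, the second mutation would still have been written to the commitlog until a restart, which is the inconsistency the commit describes.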
Tomasz Grabiec
f893516e55 Merge "lwt: store column_mapping's for each table schema version upon a DDL change" from Pavel Solodovnikov
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).

It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.

The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).

In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.

Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.

Such a situation may arise under the following circumstances:
 1. The previous LWT operation crashed on the "accept" stage,
    leaving behind a stale accepted proposal, which waits to be
    repaired.
 2. The table affected by LWT operation is being altered, so that
    schema version is now different. Stored proposal now references
    obsolete schema.
 3. LWT query is retried, so that Scylla tries to repair the
    unfinished Paxos round and apply the mutation in the learn stage.

When such a mismatch happened prior to this patch, the stored
`frozen_mutation` could be applied only if we were lucky enough
and the column_mapping in the mutation was "compatible" with the new
table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.

* git@github.com:ManManson/scylla.git feature/table_schema_history_v7:
  lwt: add column_mapping history persistence tests
  schema: add equality operator for `column_mapping` class
  lwt: store column_mapping's for each table schema version upon a DDL change
  schema_tables: extract `fill_column_info` helper
  frozen_mutation: introduce `unfreeze_upgrading` method
2020-10-15 20:48:29 +02:00
Pavel Solodovnikov
055fd3d8ad lwt: store column_mapping's for each table schema version upon a DDL change
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).

It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.

The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).

In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.

Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.

Such a situation may arise under the following circumstances:
 1. The previous LWT operation crashed on the "accept" stage,
    leaving behind a stale accepted proposal, which waits to be
    repaired.
 2. The table affected by LWT operation is being altered, so that
    schema version is now different. Stored proposal now references
    obsolete schema.
 3. LWT query is retried, so that Scylla tries to repair the
    unfinished Paxos round and apply the mutation in the learn stage.

When such a mismatch happened prior to this patch, the stored
`frozen_mutation` could be applied only if we were lucky enough
and the column_mapping in the mutation was "compatible" with the new
table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.

In case we don't find a column_mapping we just return an error
from the learn stage.

Tests: unit(dev, debug), dtests(paxos_tests.py:TestPaxos.schema_mismatch_*_test)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-10-15 19:24:30 +03:00
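The lookup-then-upgrade step described in the commit above can be sketched in Python. Everything here is a simplification made up for illustration: a mutation is a dict from column id to value, a column mapping is a list of column names indexed by id, and `upgrade_mutation` and `learn` are hypothetical helpers. Dropping cells whose column no longer exists is an assumption of this sketch, not a statement about the real upgrade rules.

```python
def upgrade_mutation(cells_by_id, old_mapping, new_schema):
    """Re-key cells written under an old schema to the new schema's column ids."""
    new_ids = {name: new_id for new_id, name in enumerate(new_schema)}
    upgraded = {}
    for old_id, value in cells_by_id.items():
        name = old_mapping[old_id]
        new_id = new_ids.get(name)
        if new_id is not None:   # assumption: cells of dropped columns are discarded
            upgraded[new_id] = value
    return upgraded


def learn(cells_by_id, version, history, current_version, current_schema):
    """Apply a stored proposal, upgrading it if its schema version is obsolete."""
    if version == current_version:
        return cells_by_id
    old_mapping = history.get(version)
    if old_mapping is None:
        # No column mapping found: report an error from the learn stage.
        raise LookupError(f"no column mapping for schema version {version}")
    return upgrade_mutation(cells_by_id, old_mapping, current_schema)


# Schema v1 had columns a, b, c; in v2 the columns were reordered and b was dropped.
history = {"v1": ["a", "b", "c"]}
current_schema = ["c", "a"]
applied = learn({0: 1, 1: 2, 2: 3}, "v1", history, "v2", current_schema)
```

Without the stored mapping, the ids 0, 1, 2 would be interpreted against the new column order, which is exactly the reordered-or-dropped-column case the message says could not work before.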
Botond Dénes
ff623e70b3 reader_concurrency_semaphore: name permits
Require a schema and an operation name to be given to each permit when
created. The schema is that of the table the read is executed against, and
the operation name is some name identifying the operation the
permit is part of. Ideally this should be different for each site the
permit is created at, to be able to discern not only different kinds of
reads, but also the different code paths the read took.

As not all reads can be associated with one schema, the schema is allowed
to be null.

The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
2020-10-13 12:32:13 +03:00
Botond Dénes
307cdf1e0d multishard_combining_reader: reader_lifecycle_policy: add permit param to create_reader()
Allow the evictable reader managing the underlying reader to pass its
own permit to it when creating it, making sure they share the same
permit. Note that the two parts can still end up using different
permits, when the underlying reader is kept alive between two pages of a
paged read and thus keeps using the permit received on the previous
page.

Also adjust the `reader_context` in multishard_mutation_query.cc to use
the passed-in permit instead of creating a new one when creating a new
reader.
2020-10-12 15:56:56 +03:00
Botond Dénes
e09ab09fff multishard_combining_reader: add permit parameter
Don't create its own permit, take one as a parameter, like all other
readers do, so the permit can be provided by the higher layer, making
sure all parts of the logical read use the same permit.
2020-10-12 15:56:56 +03:00
Benny Halevy
57cc5f6ae1 sstable_directory: use a external load_semaphore
Although each sstable_directory limits concurrency using
max_concurrent_for_each, there could be a large number
of calls to do_for_each_sstable running in parallel
(e.g. per keyspace X per table in the distributed_loader).

To cap parallelism across sstable_directory instances and
concurrent calls to do_for_each_sstable, start a sharded<semaphore>
and pass a shared semaphore& to the sstable_directory:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-08 11:57:06 +03:00
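The idea of capping parallelism with one semaphore shared by many directory objects can be illustrated with Python's asyncio. This is an analogy, not the Seastar code: the `SSTableDirectory` class, the `load` callback and the limit of 2 are all made up for the sketch.

```python
import asyncio


class SSTableDirectory:
    """Toy directory that takes an externally owned semaphore."""

    def __init__(self, name, sstables, load_semaphore):
        self.name = name
        self.sstables = list(sstables)
        self._sem = load_semaphore   # shared, not created per directory

    async def do_for_each_sstable(self, func):
        async def one(sst):
            async with self._sem:
                await func(self.name, sst)
        await asyncio.gather(*(one(sst) for sst in self.sstables))


async def main():
    shared = asyncio.Semaphore(2)   # one cap across all directories
    running = 0
    peak = 0

    async def load(dir_name, sst):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0)      # yield, as real I/O would
        running -= 1

    dirs = [SSTableDirectory(f"ks.t{i}", range(4), shared) for i in range(3)]
    await asyncio.gather(*(d.do_for_each_sstable(load) for d in dirs))
    return peak


peak = asyncio.run(main())
```

If each directory created its own semaphore instead, the three directories here could have up to six loads in flight at once; with the shared one, the observed peak stays at the configured limit.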
Nadav Har'El
a5369881b3 Merge 'sstables: make sstable_manager control the lifetime of the sstables it manages' from Avi Kivity
Currently, sstable_manager is used to create sstables, but it loses track
of them immediately afterwards. This series makes an sstable's life fully
contained within its sstable_manager.

The first practical impact (implemented in this series) is that file removal
stops being a background job; instead it is tracked by the sstable_manager,
so when the sstable_manager is stopped, you know that all of its sstable
activity is complete.

Later, we can make use of this to track the data size on disk, but this is not
implemented here.

Closes #7253

* github.com:scylladb/scylla:
  sstables: remove background_jobs(), await_background_jobs()
  sstables: make sstables_manager take charge of closing sstables
  test: test_env: hold sstables_manager with a unique_ptr
  test: drop test_sstable_manager
  test: sstables::test_env: take ownership of manager
  test: broken_sstable_test: prepare for asynchronously closed sstables_manager
  test: sstable_utils: close test_env after use
  test: sstable_test:  dont leak shared_sstable outside its test_env's lifetime
  test: sstables::test_env: close self in do_with helpers
  test: perf/perf_sstable.hh: prepare for asynchronously closed sstables_manager
  test: view_build_test: prepare for asynchronously closed sstables_manager
  test: sstable_resharding_test: prepare for asynchronously closed sstables_manager
  test: sstable_mutation_test: prepare for asynchronously closed sstables_manager
  test: sstable_directory_test: prepare for asynchronously closed sstables_manager
  test: sstable_datafile_test: prepare for asynchronously closed sstables_manager
  test: sstable_conforms_to_mutation_source_test: remove references to test_sstables_manager
  test: sstable_3_x_test: remove test_sstables_manager references
  test: schema_changes_test: drop use of test_sstables_manager
  mutation_test: adjust for column_family_test_config accepting an sstables_manager
  test: lib: sstable_utils: stop using test_sstables_manager
  test: sstables test_env: introduce manager() accessor
  test: sstables test_env: introduce do_with_async_sharded()
  test: sstables test_env: introduce  do_with_async_returning()
  test: lib: sstable test_env: prepare for life as a sharded<> service
  test: schema_changes_test: properly close sstables::test_env
  test: sstable_mutation_test: avoid constructing temporary sstables::test_env
  test: mutation_reader_test: avoid constructing temporary sstables::test_env
  test: sstable_3_x_test: avoid constructing temporary sstables::test_env
  test: lib: test_services: pass sstables_manager to column_family_test_config
  test: lib: sstables test_env: implement tests_env::manager()
  test: sstable_test: detemplate write_and_validate_sst()
  test: sstable_test_env: detemplate do_with_async()
  test: sstable_datafile_test: drop bad 'return'
  table: clear sstable set when stopping
  table: prevent table::stop() race with table::query()
  database: close sstable_manager:s
  sstables_manager: introduce a stub close()
  sstable_directory_test: fix threading confusion in make_sstable_directory_for*() functions
  test: sstable_datafile_test: reorder table stop in compaction_manager_test
  test: view_build_test: test_view_update_generator_register_semaphore_unit_leak: do not discard future in timer
  test: view_build_test: fix threading in test_view_update_generator_register_semaphore_unit_leak
  view: view_update_generator: drop references to sstables when stopping
2020-09-24 13:54:38 +03:00
Avi Kivity
9f886f303c database: close sstable_manager:s
The database class owns two sstable_manager:s - one for user sstables and
one for system sstables. Now that they have a close() method, call it.
2020-09-23 20:55:05 +03:00
Botond Dénes
d7e794e565 database: move total_reads* metrics to the concurrency semaphore 2020-09-23 14:10:24 +03:00
Botond Dénes
32ff524454 database: setup_metrics(): split the registering database metrics in two
Currently all "database" metrics are registered in a single call to
`metric_groups::add_group()`. As all the metrics to-be-registered are
passed in a single initializer list, this blows up the stack size, to
the point that adding a single new metric causes it to exceed the
currently configured max-stack-size of 13696 bytes. To reduce stack
usage, split the single call in two, roughly in the middle. While we
could try to come up with some logical grouping of metrics and do much
rearranging and code movement, I think we might as well just split into two
arbitrary groups containing roughly the same number of metrics.
2020-09-23 14:06:20 +03:00
Botond Dénes
c18756ce9a reader_concurrency_semaphore: s/inactive_read_stats/stats/
In preparation for non-inactive read stats being added to the semaphore,
rename its existing stats struct and member to a more generic name.
Fields whose names only made sense in the context of the old name are
adjusted accordingly.
2020-09-23 13:11:55 +03:00
Tomasz Grabiec
691009bc1e db, schema: Hide update_schema_version_and_announce() 2020-09-11 14:42:48 +02:00
Tomasz Grabiec
9f58dcc705 db, storage_service: Do not call into gossiper from the database layer
The storage service computes gossiper states before it starts the
gossiper. Among them is the node's schema version. There are two problems with that.

The first is that computing the schema version and publishing it is not
atomic, so it is not safe against concurrent schema changes or schema
version recalculations. It is not mutually exclusive with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.

The second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.

The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.

This also allows us to drop unsafe functions like update_schema_version().
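
The listener arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `versioned_database`, `fake_gossiper`, `on_version_change()` and `start_advertising()` are hypothetical names, and the version is a plain string for simplicity.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the database layer: it owns the schema version
// and notifies listeners on change, without knowing about the gossiper.
class versioned_database {
    std::string _version;
    std::vector<std::function<void(const std::string&)>> _listeners;
public:
    explicit versioned_database(std::string v) : _version(std::move(v)) {}

    const std::string& version() const { return _version; }

    // Register a callback invoked after every version change.
    void on_version_change(std::function<void(const std::string&)> fn) {
        _listeners.push_back(std::move(fn));
    }

    void set_version(std::string v) {
        _version = std::move(v);
        for (auto& fn : _listeners) {
            fn(_version);
        }
    }
};

// Hypothetical stand-in for the gossiper: records the advertised version.
struct fake_gossiper {
    std::string advertised;
};

// What a storage_service-like layer would do on startup: read the current
// version first, and only then install the listener that forwards updates.
void start_advertising(versioned_database& db, fake_gossiper& g) {
    g.advertised = db.version();
    db.on_version_change([&g](const std::string& v) { g.advertised = v; });
}
```

The dependency points in one direction only: the upper layer knows about the database, while the database only knows about opaque callbacks.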
2020-09-11 14:42:41 +02:00
Tomasz Grabiec
ad0b674b13 db: Make schema version observable 2020-09-11 14:42:41 +02:00
Avi Kivity
907b775523 Merge "Free compaction from storage service" from Pavel E
"
There is one last call to the global storage service left in the compaction code. It
comes from cleanup_compaction, which uses it to get local token ranges for filtering.

The call in question is a pure wrapper over database, so this set just
makes use of the database where it's already available (perform_cleanup)
and adds it where it's needed (perform_sstable_upgrade).

tests: unit(dev), nodetool upgradesstables
"

* 'br-remove-ss-from-compaction-3' of https://github.com/xemul/scylla:
  storage_service: Remove get_local_ranges helper
  compaction: Use database from options to get local ranges
  compaction: Keep database reference on upgrade options
  compaction: Keep database reference on cleanup options
  db: Factor out get_local_ranges helper
2020-08-23 17:58:32 +03:00
Pavel Emelyanov
06f4828b93 db: Factor out get_local_ranges helper
Storage service and repair code have identical helpers to get local
ranges for a keyspace. Move this helper's code into the database class; it
will later be reused in one more place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-21 14:58:40 +03:00
Benny Halevy
dd6d771331 database: keep const token_metadata&
No need to modify token_metadata from database code.
Also, get rid of mutable get_token_metadata variant.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
8b5c32c7a8 database: keyspace_metadata: pass const locator::token_metadata& around
No need to modify token_metadata on this path.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
4dba81cb92 replication_strategy: keep a const token_metadata&
Replication strategies don't need to change token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Avi Kivity
f6b66456fd Update seastar submodule
Contains patch from Rafael to fix up includes.

* seastar c872c3408c...7f7cf0f232 (9):
  > future: Consider result_unavailable invalid in future_state_base::ignore()
  > future: Consider result_unavailable invalid in future_state_base::valid()
  > Merge "future-util: split header" from Benny
  > docs: corrected some text and code-examples in streaming-rpc docs
  > future: Reduce nesting in future::then
  > demos: coroutines: include std-compat.hh
  > sstring: mark str() and methods using it as noexcept
  > tls: Add an assert
  > future: fix coroutine compilation
2020-08-19 17:18:57 +03:00
Dejan Mircevski
fb6c011b52 everywhere: Insert space after switch
Quoth @avikivity: "switch is not a function, and we celebrate that by
putting a space after it like other control-flow keywords."

https://github.com/scylladb/scylla/pull/7052#discussion_r471932710

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-18 14:31:04 +03:00
Piotr Jastrzebski
01ea159fde codebase wide: use try_emplace when appropriate
C++17 introduced try_emplace for maps to replace a pattern:
if(element not in a map) {
    map.emplace(...)
}

try_emplace is more efficient and results in more concise code.

This commit introduces usage of try_emplace when it's appropriate.
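
As an illustration of the pattern being replaced, here is a minimal sketch. The two helper functions are hypothetical and are not taken from the tree; they only contrast the check-then-emplace form with `std::map::try_emplace`.

```cpp
#include <map>
#include <string>

using string_map = std::map<std::string, std::string>;

// Hypothetical helper showing the older pattern: one lookup for the check
// and a second one inside emplace(). Returns true if the key was inserted.
bool insert_if_absent_old(string_map& m, const std::string& key, const std::string& value) {
    if (m.find(key) == m.end()) {
        m.emplace(key, value);
        return true;
    }
    return false;
}

// Same behaviour with try_emplace: a single lookup, and the existing
// element is left untouched when the key is already present.
bool insert_if_absent_new(string_map& m, const std::string& key, const std::string& value) {
    return m.try_emplace(key, value).second;
}
```

Both helpers leave an existing mapping unchanged; the second simply lets the container do the check and the insertion in one step.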

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4970091ed770e233884633bf6d46111369e7d2dd.1597327358.git.piotr@scylladb.com>
2020-08-16 14:41:09 +03:00
Piotr Jastrzebski
c001374636 codebase wide: replace count with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the `count` function was often used for this in various ways.

`contains` not only expresses the intent of the code better but also
does so in a more unified way.

This commit replaces all the occurrences of `count` with
`contains`.
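
A minimal sketch of the replacement, using a hypothetical `has_key()` helper that is not part of the tree. The preprocessor guard is only there so the sketch also builds with a pre-C++20 compiler; the commit itself simply switches to `contains`.

```cpp
#include <set>
#include <string>

// Hypothetical helper: reports whether `key` is present in the set.
bool has_key(const std::set<std::string>& s, const std::string& key) {
#if __cplusplus >= 202002L
    // C++20: states the intent directly.
    return s.contains(key);
#else
    // Older spelling of the same check.
    return s.count(key) != 0;
#endif
}
```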

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
2020-08-15 20:26:02 +03:00
Avi Kivity
3530e80ce1 Merge "Support md format" from Benny
"
This series adds support for the "md" sstable format.

Support is based on the following:

* Do not use clustering-based filtering in the presence
  of a static row or tombstones.
* Disable min/max column names in the metadata for
  formats older than "md".
* When updating the metadata, reset and disable min/max
  in the presence of range tombstones (as Cassandra does,
  and until we process them accurately).
* Fix the way we maintain min/max column names by
  keeping whole clustering key prefixes as min/max
  rather than calculating min/max independently for
  each component, as Cassandra does in the "md" format.
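
To illustrate the last point, here is a minimal sketch of the difference between the two ways of tracking min/max. It is not the sstable metadata code: `prefix` is a hypothetical simplification that models a clustering key prefix as a vector of integers of equal length, compared lexicographically.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical simplified clustering key prefix.
using prefix = std::vector<int>;

// Whole-prefix tracking: min and max are actual keys, chosen by
// lexicographic comparison of the entire prefix.
std::pair<prefix, prefix> min_max_whole(const std::vector<prefix>& keys) {
    assert(!keys.empty());
    auto bounds = std::minmax_element(keys.begin(), keys.end());
    return {*bounds.first, *bounds.second};
}

// Per-component tracking: each position is minimised and maximised
// independently, so the results need not correspond to any real key.
std::pair<prefix, prefix> min_max_per_component(const std::vector<prefix>& keys) {
    assert(!keys.empty());
    prefix lo = keys.front();
    prefix hi = keys.front();
    for (const auto& k : keys) {
        for (std::size_t i = 0; i < k.size() && i < lo.size(); ++i) {
            lo[i] = std::min(lo[i], k[i]);
            hi[i] = std::max(hi[i], k[i]);
        }
    }
    return {lo, hi};
}
```

For the keys (1, 9) and (2, 0), whole-prefix tracking yields the range (1, 9) to (2, 0), while per-component tracking yields (1, 0) to (2, 9), a wider range whose endpoints were never written.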

Fixes #4442

Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"

* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
  config: enable_sstables_md_format by default
  test: cql_query_test: add test_clustering_filtering unit tests
  table: filter_sstable_for_reader: allow clustering filtering md-format sstables
  table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
  table: filter_sstable_for_reader: adjust to md-format
  table: filter_sstable_for_reader: include non-scylla sstables with tombstones
  table: filter_sstable_for_reader: do not filter if static column is requested
  table: filter_sstable_for_reader: refactor clustering filtering conditional expression
  features: add MD_SSTABLE_FORMAT cluster feature
  config: add enable_sstables_md_format
  database: add set_format_by_config
  test: sstable_3_x_test: test both mc and md versions
  test: Add support for the "md" format
  sstables: mx/writer: use version from sstable for write calls
  sstables: mx/writer: update_min_max_components for partition tombstone
  sstables: metadata_collector: support min_max_components for range tombstones
  sstable: validate_min_max_metadata: drop outdated logic
  sstables: rename mc folder to mx
  sstables: may_contain_rows: always true for old formats
  sstables: add may_contain_rows
  ...
2020-08-11 13:29:11 +03:00