Commit Graph

1893 Commits

Author SHA1 Message Date
Botond Dénes
05ef13a627 Merge 'Add support to split large partitions across SSTables' from Raphael "Raph" Carvalho
Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundaries, so a large partition is stored in a single file. That can cause many problems, such as memory pressure (e.g. https://github.com/scylladb/scylladb/issues/4217), and incremental compaction cannot fulfill its promise either, because the file storing the large partition can only be released once exhausted.

The first step was to add clustering range metadata for the first and last partition keys (retrieved from the promoted index), which is crucial to determine disjointness at the clustering level, and also the order in which the disjoint files should be opened for incremental reading.

The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run.

The final step was to introduce the option for compaction to split a large partition being written once it exceeds the size threshold.

What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the disjoint invariant established by the second step above.

Closes #11233

* github.com:scylladb/scylladb:
  test: Add test for large partition splitting on compaction
  compaction: Add support to split large partitions
  sstable: Extend sstable_run to allow disjointness on the clustering level
  sstables: simplify will_introduce_overlapping()
  test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
  test: lib: Fix inefficient merging of mutations in make_sstable_containing()
  sstables: Keep track of first partition's first pos and last partition's last pos
  sstables: Rename min/max position_range to a descriptive name
  sstables_manager: Add sstable metadata reader concurrency semaphore
  sstables: Add ability to find first or last position in a partition
2022-09-15 16:08:56 +03:00
Kamil Braun
728161003a Merge 'raft server, abort on background errors' from Gusev Petr
Halted background fibers render raft server effectively unusable, so
report this explicitly to the clients.

Fix: #11352

Closes #11370

* github.com:scylladb/scylladb:
  raft server, status metric
  raft server, abort group0 server on background errors
  raft server, provide a callback to handle background errors
  raft server, check aborted state on public server public api's
2022-09-15 14:12:11 +02:00
Raphael S. Carvalho
20a6483678 test: Add test for large partition splitting on compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:23:19 -03:00
Raphael S. Carvalho
4bc24acf81 sstable: Extend sstable_run to allow disjointness on the clustering level
After commit 0796b8c97a, sstable_run won't accept a fragment
that introduces key overlapping. But once we split large partitions,
fragments in the same run may store disjoint clustering ranges
of the same partition. So we're extending sstable_run to look
at clustering dimension, so fragments storing disjoint clustering
ranges of the same large partition can co-exist in the same run.
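
A minimal sketch of the idea, not the actual implementation: if each
fragment is described by its first partition's first position and its
last partition's last position, disjointness can be checked on
(partition, clustering) pairs. The `Fragment` and `SSTableRun` classes
are hypothetical, and integer partition keys stand in for real
token-ordered keys.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A position is modelled as (partition_key, clustering_position); Python
# tuple comparison orders first by partition, then by clustering position.
Position = Tuple[int, int]


@dataclass(frozen=True)
class Fragment:
    first: Position  # first partition's first position
    last: Position   # last partition's last position


@dataclass
class SSTableRun:
    fragments: List[Fragment] = field(default_factory=list)

    def will_introduce_overlapping(self, frag: Fragment) -> bool:
        # Two closed ranges are disjoint only if one ends strictly
        # before the other starts.
        return any(not (frag.last < f.first or f.last < frag.first)
                   for f in self.fragments)

    def insert(self, frag: Fragment) -> bool:
        if self.will_introduce_overlapping(frag):
            return False
        self.fragments.append(frag)
        return True


run = SSTableRun()
# Partition 7 is large and was split after clustering position 100.
accepted_a = run.insert(Fragment(first=(1, 0), last=(7, 100)))
accepted_b = run.insert(Fragment(first=(7, 101), last=(9, 50)))
# This one overlaps the first fragment inside partition 7.
accepted_c = run.insert(Fragment(first=(7, 50), last=(7, 60)))
```

With key-only comparison the second fragment would be rejected, because
both fragments contain partition 7; comparing positions lets it in.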

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
13942ec947 test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
That's where it belongs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
5937765009 sstables: Keep track of first partition's first pos and last partition's last pos
With first partition's first position and last partition's last
position, we'll be able to determine which fragments composing a
sstable run store a large partition that was split.

Then sstable run will be able to detect if all fragments storing
a given large partition are disjoint in the clustering level.

Fixes #10637.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
e099a9bf3b sstables_manager: Add sstable metadata reader concurrency semaphore
Let's introduce a reader_concurrency_semaphore for reading sstable
metadata, to avoid an OOM due to unlimited concurrency.
The concurrency on startup is not controlled, so it's important
to enforce a limit on the amount of memory used by the parallel
readers.
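
The effect of bounding concurrency can be sketched with a plain counting
semaphore. This is an illustration under simplifying assumptions: it
uses Python's `asyncio.Semaphore`, limits only the number of concurrent
readers (not their memory), and `load_all_metadata` is a hypothetical
helper.

```python
import asyncio


async def load_all_metadata(sstable_names, max_concurrent):
    sem = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def load_one(name):
        nonlocal in_flight, peak
        async with sem:  # wait here if too many readers are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the actual metadata read
            in_flight -= 1
            return name

    loaded = await asyncio.gather(*(load_one(n) for n in sstable_names))
    return loaded, peak


names = [f"sst-{i}" for i in range(20)]
loaded, peak = asyncio.run(load_all_metadata(names, max_concurrent=4))
```

All 20 loads are started at once, but the observed peak never exceeds
the configured limit.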

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
9bcad9ffa8 sstables: Add ability to find first or last position in a partition
This new method allows sstable to load the first row of the first
partition and last row of last partition.

That's useful for incremental reading of sstable run which will
be split at clustering boundary.

To get the first row, it consumes the first row (which can be
either a clustering row or range tombstone change) and returns
its position_in_partition.
To get the last row, it does the same as above but in reverse
mode instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:48 -03:00
Petr Gusev
4ff0807cd0 raft server, status metric 2022-09-13 19:34:22 +04:00
Nadav Har'El
8ece63c433 Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails'
This series introduces two configurable options when working with TWCS tables:

- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.
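
The two checks could be sketched roughly as follows. This is a
simplified model, not the code from this series: `TriMode` and
`check_twcs` are hypothetical names, the three modes are assumed to be
false/warn/true, and the window count is assumed to be estimated as the
TTL divided by the window size.

```python
from enum import Enum


class TriMode(Enum):
    # Assumed modes of a tri_mode_restriction.
    FALSE = "false"  # no restriction
    WARN = "warn"    # allow, but warn the user
    TRUE = "true"    # reject the statement


def check_twcs(default_ttl_s, window_size_s,
               restrict_twcs_default_ttl=TriMode.WARN,
               twcs_max_window_count=50):
    """Return (allowed, messages) for a TWCS table definition."""
    if default_ttl_s == 0:
        msg = "TWCS table has no default_time_to_live"
        if restrict_twcs_default_ttl is TriMode.TRUE:
            return False, [msg]
        if restrict_twcs_default_ttl is TriMode.WARN:
            return True, [msg]
        return True, []
    # A threshold of 0 disables the window-count check.
    if twcs_max_window_count > 0:
        # Assumption: the window count is estimated as TTL / window size.
        window_count = default_ttl_s // window_size_s
        if window_count > twcs_max_window_count:
            return False, [f"{window_count} windows exceed the limit "
                           f"of {twcs_max_window_count}"]
    return True, []


DAY = 86400
no_ttl_default = check_twcs(0, DAY)
no_ttl_strict = check_twcs(0, DAY, restrict_twcs_default_ttl=TriMode.TRUE)
too_many = check_twcs(100 * DAY, DAY)
check_disabled = check_twcs(100 * DAY, DAY, twcs_max_window_count=0)
```

With the defaults, a table without a TTL is created with a warning, and
a 100-day TTL with one-day windows is rejected unless the limit is set
to 0.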

Refs: #6923
Fixes: #9029

Closes #11445

* github.com:scylladb/scylladb:
  tests: cql_query_test: add mixed tests for verifying TWCS guard rails
  tests: cql_query_test: add test for TWCS window size
  tests: cql_query_test: add test for TWCS tables with no TTL defined
  cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
  cql: add max window restriction for TimeWindowCompactionStrategy
  time_window_compaction_strategy: reject invalid window_sizes
  cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
2022-09-12 23:55:51 +03:00
Botond Dénes
9db940ff1b Merge "Make network_topology_strategy_test use topology" from Pavel Emelyanov
"
The test in question plays with snitches to simulate the topology
over which tokens are spread. This set replaces explicit snitch
usage with temporary topology object.

Some snitch traces are still left, but those are in token_metadata
internals, which still call the global snitch for DC/RACK.
"

* 'br-tests-use-topology-not-snitch' of https://github.com/xemul/scylla:
  network_topology_strategy_test: Use topology instead of snitch
  network_topology_strategy_test: Populate explicit topology
2022-09-12 09:40:17 +03:00
Avi Kivity
6c797587c7 dirty_memory_manager: region_group: remove sorting of subgroups
dirty_memory_manager tracks lsa regions (memtables) under region_group:s,
in order to be able to pick up the largest memtable as a candidate for
flushing.

Just as region_group:s contain regions, they can also contain other
region_group:s in a nested structure. It also tracks the nested region_group
that contains the largest region in a binomial heap.

This latter facility is no longer used. It saw use when we had the system
dirty_memory_manager nested under the user dirty_memory_manager, but
that proved too complicated so it was undone. We still nest a virtual
region_group under the real region_group, and in fact it is the
virtual region_group that holds the memtables, but it is accessed
directly to find the largest memtable (region_group::get_largest_region)
and so all the mechanism that sorts region_group:s is bypassed.

Start to dismantle this house of cards by removing the subgroup
sorting. Since the hierarchy has exactly one parent and one child,
it's clearly useless. This is seen by the fact that we can just remove
everything related.

We still need the _subgroups member to hold the virtual region_group;
it's replaced by a vector. I verified that the non-intrusive vector
is exception safe since push_back() happens at the very end; in any
case this is early during setup where we aren't under memory pressure.

A few tests that check the removed functionality are deleted.

Closes #11515
2022-09-12 09:29:08 +03:00
Petr Gusev
1b5fa4088e raft server, abort group0 server on background errors 2022-09-12 10:16:43 +04:00
Felipe Mendes
6a3d8607b4 tests: cql_query_test: add mixed tests for verifying TWCS guard rails
This patch adds a set of 10 scenarios that were uncovered during additional testing.
In particular, most of the scenarios cover ALTER TABLE statements, which - if not handled -
may break the guardrails safe-mode. The situations covered are:

- STCS->TWCS with no TTL defined
- STCS->TWCS with small TTL
- STCS->TWCS with large TTL value
- TWCS table with small to large TTL
- No TTL TWCS to large TTL and then small TTL
- twcs_max_window_count LiveUpdate - Decrease TTL
- twcs_max_window_count LiveUpdate - Switch CompactionStrategy
- No TTL TWCS table to STCS
- Large TTL TWCS table, modify attribute other than compaction and default_time_to_live
- Large TTL STCS table, fail to switch to TWCS with no TTL explicitly defined
2022-09-11 17:57:14 -03:00
Felipe Mendes
a7a91e3216 tests: cql_query_test: add test for TWCS window size
This patch adds a test for checking the validity of tables using TimeWindowCompactionStrategy
with an incorrect number of compaction windows.

The twcs_max_window_count LiveUpdate-able parameter is also disabled during the execution of the
test in order to ensure that users can effectively disable the enforcement, should they want.
2022-09-11 17:38:25 -03:00
Felipe Mendes
1c5d46877e tests: cql_query_test: add test for TWCS tables with no TTL defined
This patch adds a testcase for TimeWindowCompactionStrategy tables created with no
default_time_to_live defined. It makes use of the LiveUpdate-able restrict_twcs_default_ttl
parameter in order to determine whether TWCS tables without TTL should be forbidden or not.

The test replays all 3 possible variations of the tri_mode_restriction and verifies tables
are correctly created/altered according to the current setting on the replica which receives
the request.
2022-09-11 16:55:46 -03:00
Raphael S. Carvalho
f5715d3f0b replica: Move memtables to compaction_group
Now memtables live in compaction_group. Also introduced a function
that selects the group based on token, but today the table always
returns the single group managed by it. Once multiple groups are
supported, the function should interpret the token content to select
the group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
6717d96684 replica: move maintenance SSTable set to compaction_group
This commit is restricted to moving maintenance set into compaction_group.
Next, we'll introduce compound set into it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
65414e6756 test: sstable_compaction_test: Don't reference main sstable set directly
Preparatory change for main sstable set to be moved into compaction
group. After that, tests can no longer directly access the main
set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
4fa8159a13 test: sstable_compaction_test: remove needless usage of column_family_test::add_sstable
column_family_test::add_sstable will soon be changed to run in a thread,
and it's not needed in this procedure, so let's remove its usage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Kamil Braun
dba595d347 Merge 'Minimal implementation of Broadcast Tables' from Mikołaj Grzebieluch
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.

The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.

In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.

In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.

Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing

Closes #11164

* github.com:scylladb/scylladb:
  raft: broadcast_tables: add broadcast_kv_store test
  raft: broadcast_tables: add returning query result
  raft: broadcast_tables: add execution of intermediate language
  raft: broadcast_tables: add compilation of cql to intermediate language
  raft: broadcast_tables: add definition of intermediate language
  db: system_keyspace: add broadcast_kv_store table
  db: config: add BROADCAST_TABLES feature flag
2022-09-09 18:05:37 +02:00
Benny Halevy
d86810d22c mutation_partition: compact_for_compaction_v2: get tombstone_gc_state
To be passed down to compact_mutation_state in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
0627667a06 mutation_partition: compact_for_compaction: get tombstone_gc_state
And pass down to `do_compact`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
7e4612d3aa mutation_readers: pass tombstone_gc_state to compating_reader
To be passed further down to `compact_mutation_state` in
a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:14 +03:00
Benny Halevy
2cd3fc2f36 compaction: table_state: add virtual get_tombstone_gc_state method
and override it in table::table_state to get the tombstone_gc_state
from the table's compaction_manager.

It is going to be used in the next patches to pass the gc state
from the compaction_strategy down to sstables and compaction.

table_state_for_test was modified to just keep a null
tombstone_gc_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Pavel Emelyanov
398e9f8593 network_topology_strategy_test: Use topology instead of snitch
Most of the test's cases use the rack-inferring snitch driver and get
DC/RACK from it via the test_dc_rack() helper. The helper was introduced
in one of the previous sets to populate token metadata with some DC/RACK,
because manipulating normal tokens required the respective endpoint to
be in the topology.

This patch removes the usage of global snitch and replaces it with the
pre-populated topology. The pre-population is done in rack-inferring
snitch like manner, since token_metadata still uses global snitch and
the locations from snitch and this temporary topology should match.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-06 12:26:30 +03:00
Pavel Emelyanov
d8b2940cd8 network_topology_strategy_test: Populate explicit topology
There's a test case that makes its own snitch driver that generates
pre-calculated DC/RACK data for test endpoints. This patch replaces this
custom snitch driver with a standalone topology object.

Note: to get DC/RACK info from this topo the get_location() is used
since the get_rack()/get_datacenter() are still wrappers around global
snitch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-06 12:24:39 +03:00
Mikołaj Grzebieluch
5b1421cc33 db: config: add BROADCAST_TABLES feature flag
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
2022-09-05 11:11:08 +02:00
Botond Dénes
be9d1c4df4 sstables: crawling mx-reader: make on_out_of_clustering_range() no-op
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.

Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.

Fixes: #11421

Closes #11422
2022-09-04 20:02:50 +03:00
Avi Kivity
421557b40a Merge "Provide DC/RACK when populating topology" from Pavel E
"
The topology object maintains all sorts of node/DC/RACK mappings on
board. When new entries are added to it the DC and RACK are taken
from the global snitch instance which, in turn, checks gossiper,
system keyspace and its local caches.

This set makes the topology population API require DC and RACK via the
call argument. In most of the cases the populating code is the
storage service that knows exactly where to get those from.

After this set it will be possible to remove the dependency knot
consisting of snitch, gossiper, system keyspace and messaging.
"

* 'br-topology-dc-rack-info' of https://github.com/xemul/scylla:
  toplogy: Use the provided dc/rack info
  test: Provide testing dc/rack infos
  storage_service: Provide dc/rack for snitch reconfiguration
  storage_service: Provide dc/rack from system ks on start
  storage_service: Provide dc/rack from gossiper for replacement
  storage_service: Provide dc/rack from gossiper for remotes
  storage_service,dht,repair: Provide local dc/rack from system ks
  system_keyspace: Cache local dc-rack on .start()
  topology: Some renames after previous patch
  topology: Require entry in the map for update_normal_tokens()
  topology: Make update_endpoint() accept dc-rack info
  replication_strategy: Accept dc-rack as get_pending_address_ranges argument
  dht: Carry dc-rack over boot_strapper and range_streamer
  storage_service: Make replacement info a real struct
2022-08-31 12:53:06 +03:00
Tomasz Grabiec
ae8d2a550d db: schema_tables: Make table creation shadow earlier concurrent changes
Issuing two CREATE TABLE statements with a different name for one of
the partition key columns leads to the following assertion failure on
all replicas:

scylla: schema.cc:363: schema::schema(const schema::raw_schema&, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.

The reason is that once the create table mutations are merged, the
columns table contains two entries for the same position in the
partition key tuple.

If the schemas were the same, or not conflicting in a way which leads
to abort, the current behavior would be to drop the older table as if
the last CREATE TABLE was preceded by a DROP TABLE.

The proposed fix is to make CREATE TABLE mutation include a tombstone
for all older schema changes of this table, effectively overriding
them. The behavior will be the same as if the schemas were not
different: the older table will be dropped.

Fixes #11396
2022-08-29 12:06:02 +02:00
Pavel Emelyanov
10e8804417 test: Provide testing dc/rack infos
There's a test that's sensitive to correct dc/rack info for testing
entries. To populate them it uses global rack-inferring snitch instance
or a special "testing" snitch. To make it continue working add a helper
that would populate the topology properly (spoiler: next branch will
replace it with explicitly populated topology object).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 10:00:04 +03:00
Pavel Emelyanov
4cbe6ee9f4 topology: Require entry in the map for update_normal_tokens()
The method in question tries to be on the safest side and adds the
endpoint for which it updates the tokens into the topology. From now on
it's up to the caller to put the endpoint into topology in advance.

So most of what this patch does is places topology.update_endpoint()
into the relevant places of the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-26 09:44:08 +03:00
Pavel Emelyanov
1d437302a8 tests: Use one-by-one tokens updating method
Tests are the only users of batch tokens updating "sugar" which
actually makes things more complicated

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-08-24 08:24:21 +03:00
Avi Kivity
6ce5e9079c Merge 'utils/logalloc: consolidate lsa state in shard tracker' from Botond Dénes
Currently the state of LSA is scattered across a handful of global variables. This series consolidates all these into a single one: the shard tracker. Beyond reducing the number of globals (the fewer globals, the better), this paves the way for a planned de-globalization of the shard tracker itself.
There is one separate global left, the static migrators registry. This is left as-is for now.

Closes #11284

* github.com:scylladb/scylladb:
  utils/logalloc: remove reclaim_timer:: globals
  utils/logalloc: make s_sanitizer_report_backtrace global a member of tracker
  utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg
  utils/logalloc: move global stat accessors to tracker
  utils/logalloc: allocating_section: don't use the global tracker
  utils/logalloc: pass down tracker::impl reference to segment_pool
  utils/logalloc: move segment pool into tracker
  utils/logalloc: add tracker member to basic_region_impl
  utils/logalloc: make segment independent of segment pool
2022-08-23 18:51:14 +02:00
Tomasz Grabiec
0e5b86d3da Merge 'Optimize mutation consume of range tombstones in reverse' from Benny Halevy
Reversing the whole range_tombstone_list
into reversed_range_tombstones is inefficient
and can lead to reactor stalls with a large number of
range tombstones.

Instead, iterate over the range_tombstone_list in reverse
direction and reverse each range_tombstone as we go,
keeping the result in the optional cookie.reversed_rt member.
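
The difference between the two approaches can be illustrated with a
small model. This is only a sketch: a range tombstone is reduced to a
hypothetical `(start, end, timestamp)` tuple, and reversing one simply
swaps its bounds.

```python
# A range tombstone is modelled as (start, end, timestamp).
range_tombstone_list = [(0, 5, 100), (7, 9, 101), (12, 20, 102)]


def reverse_rt(rt):
    # Reversing a single tombstone swaps its bounds.
    start, end, ts = rt
    return (end, start, ts)


def reversed_eager(rts):
    # Old shape: materialize a fully reversed copy before consuming anything.
    return [reverse_rt(rt) for rt in reversed(rts)]


def reversed_lazy(rts):
    # New shape: walk the list backwards and reverse one tombstone per step.
    for rt in reversed(rts):
        yield reverse_rt(rt)


lazy = reversed_lazy(range_tombstone_list)
first = next(lazy)  # only one tombstone has been reversed so far
rest = list(lazy)
```

Both variants yield the same sequence; the lazy one does a bounded
amount of work per step, so a long list cannot cause one long
uninterrupted stretch of work.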

While at it, this series contains some other cleanups on this path
to improve code readability and perhaps make it easier for the
compiler to optimize the cleaned-up code.

Closes #11271

* github.com:scylladb/scylladb:
  mutation: consume_clustering_fragments: get rid of reversed_range_tombstones;
  mutation: consume_clustering_fragments: reindent
  mutation: consume_clustering_fragments: shuffle emit_rt logic around
  mutation: consume, consume_gently: simplify partition_start logic
  mutation: consume_clustering_fragments: pass iterators to mutation_consume_cookie ctor
  mutation: consume_clustering_fragments: keep the reversed schema in cookie
  mutation: clustering_iterators: get rid of current_rt
  mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
2022-08-23 10:05:39 +02:00
Botond Dénes
7d17d675af utils/logalloc: move global stat accessors to tracker
These pretend to be free functions but access globals in the background.
Make them members of the tracker instead, which has everything needed
locally to compute them. Callers still have to access these stats
through the global tracker instance, but this can be changed to happen
through a local instance. Soon....
2022-08-23 10:38:58 +03:00
Botond Dénes
331033adae Merge 'Fix frozen mutation consume ordering' from Benny Halevy
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.

This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.

Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.
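
The ordering problem and the fix can be sketched as a two-way merge.
This is an illustration only: fragments are hypothetical
`(position, kind, payload)` tuples with integer positions, and
`heapq.merge` stands in for the ordered visitor.

```python
import heapq

# Each fragment is (position, kind, payload); both inputs are sorted.
rows = [(2, "row", "a"), (5, "row", "b"), (9, "row", "c")]
range_tombstones = [(1, "rt", "[1,3)"), (6, "rt", "[6,8)")]


def consume_unordered(rts, rows):
    # Old behaviour: all range tombstones first, then all rows.
    return list(rts) + list(rows)


def consume_ordered(rts, rows):
    # New behaviour: interleave both sorted streams by position.
    return list(heapq.merge(rts, rows, key=lambda frag: frag[0]))


def is_monotonic(fragments):
    positions = [frag[0] for frag in fragments]
    return positions == sorted(positions)


unordered = consume_unordered(range_tombstones, rows)
ordered = consume_ordered(range_tombstones, rows)
```

Only the merged stream has monotonically increasing positions, which is
what a consumer tracking a lower bound needs.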

Add a unit test that verifies that.

Fixes #11198

Closes #11269

* github.com:scylladb/scylladb:
  mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable
  frozen_mutation: consume and consume_gently in-order
  frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc
  frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes
  frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally
  frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen
2022-08-23 06:37:04 +03:00
Mikołaj Sielużycki
b5380baf8a frozen_mutation: consume and consume_gently in-order
Currently, frozen_mutation is not consumed in position_in_partition
order as all range tombstones are consumed before all rows.

This violates the range_tombstone_generator invariants
as its lower_bound needs to be monotonically increasing.

Fix this by adding mutation_partition_view::accept_ordered
and rewriting do_accept_gently to do the same,
both making sure to consume the range tombstones
and clustering rows in position_in_partition order,
similar to the mutation consume_clustering_fragments function.

Add a unit test that verifies that.

Fixes #11198

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-22 20:12:20 +03:00
Piotr Sarna
484004e766 Merge 'Fix mutation commutativity with shadowable tombstone'
from Tomasz Grabiec

This series fixes lack of mutation associativity which manifests as
sporadic failures in
row_cache_test.cc::test_concurrent_reads_and_eviction due to differences
in mutations applied and read.

No known production impact.

Refs https://github.com/scylladb/scylladb/issues/11307

Closes #11312

* github.com:scylladb/scylladb:
  test: mutation_test: Add explicit test for mutation commutativity
  test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones
  db: mutation_partition: Drop unnecessary maybe_shadow()
  db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
  mutation_partition: row: make row marker shadowing symmetric
2022-08-20 16:46:32 +02:00
Benny Halevy
7747b8fa33 sstables: define run_identifier as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11321
2022-08-18 19:03:10 +03:00
Tomasz Grabiec
5a9df433c6 test: mutation_test: Add explicit test for mutation commutativity 2022-08-17 17:39:54 +02:00
Benny Halevy
017f9b4131 mutation_test: test_mutation_consume_position_monotonicity: test also consume_gently
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-17 14:43:52 +03:00
Piotr Sarna
cf30d4cbcf Merge 'Secondary index of collection columns' from Nadav Har'El
This pull request introduces global secondary-indexing for non-frozen collections.

The intent is to enable such queries:

```
CREATE TABLE test(id int, somemap map<int, int>, somelist list<int>, someset set<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));

-- index on test(c) is the same as index on (values(c))
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);

SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
```

We use the familiar materialized views (MVs) here. Scylla treats all
collections the same way: each is a list of (key, value) pairs. For
sets, the value type is a dummy one. For lists, the key type is
TIMEUUID. When describing the design, I will ignore the fact that there
is more than one collection type. Suppose that the columns in the base
table were as follows:

```
pkey int, ckey1 int, ckey2 int, somemap map<int, text>, PRIMARY KEY(pkey, ckey1, ckey2)
```

The MV schema is as follows (the names of columns which are not the same
as in base might be different). All the columns here form the primary
key.

```
-- for index over entries
indexed_coll (int, text), idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over keys
indexed_coll int, idx_token long, pkey int, ckey1 int, ckey2 int
-- for index over values
indexed_coll text, idx_token long, pkey int, ckey1 int, ckey2 int, coll_keys_for_values_index int
```

The reason for the last additional column is that the values from a collection might not be unique.
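
To see why, here is a rough model of how one base row expands into view
rows for each index kind. It is a sketch, not the view-update code:
`view_rows` is a hypothetical helper and `toy_token` is a stand-in for
the real token computation.

```python
def toy_token(pkey):
    # Stand-in for the real partitioner token; any deterministic value works.
    return pkey


def view_rows(kind, pkey, ckey1, ckey2, somemap):
    """Expand one base row into index view rows (all columns are key columns)."""
    base = (toy_token(pkey), pkey, ckey1, ckey2)
    if kind == "entries":
        return [((k, v),) + base for k, v in somemap.items()]
    if kind == "keys":
        return [(k,) + base for k in somemap]
    if kind == "values":
        # The trailing map key plays the role of coll_keys_for_values_index.
        return [(v,) + base + (k,) for k, v in somemap.items()]
    raise ValueError(f"unknown index kind: {kind}")


# Two map entries share the value "x".
rows = view_rows("values", pkey=1, ckey1=10, ckey2=20, somemap={1: "x", 2: "x"})
without_key = {row[:-1] for row in rows}
```

Without the extra column the two view rows would have identical primary
keys and collapse into one; with it, each map entry keeps its own row.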

Fixes #2962
Fixes #8745
Fixes #10707

This patch does not implement **local** secondary indexes for collection columns: Refs #10713.

Closes #10841

* github.com:scylladb/scylladb:
  test/cql-pytest: un-xfail yet another passing collection-indexing test
  secondary index: fix paging in map value indexing
  test/cql-pytest: test for paging with collection values index
  cql, view: rename and explain bytes_with_action
  cql, index: make collection indexing a cluster feature
  test/cql-pytest: failing tests for oversized key values in MV and SI
  cql: fix secondary index "target" when column name has special characters
  cql, index: improve error messages
  cql, index: fix default index name for collection index
  test/cql-pytest: un-xfail several collecting indexing tests
  test/cql-pytest/test_secondary_index: verify that local index on collection fails.
  docs/design-notes/secondary_index: add `VALUES` to index target list
  test/cql-pytest/test_secondary_index: add randomized test for indexes on collections
  cql-pytest/cassandra_tests/.../secondary_index_test: fix error message in test ported from Cassandra
  cql-pytest/cassandra_tests/.../secondary_index_on_map_entries,select_test: test ported from Cassandra is expected to fail, since Scylla assumes that comparison with null doesn't throw error, just evaluates to false. Since it's not a bug, but expected behavior from the perspective of Scylla, we don't mark it as xfail.
  test/boost/secondary_index_test: update for non-frozen indexes on collections
  test/cql-pytest: Uncomment collection indexes tests that should be working now
  cql, index: don't use IS NOT NULL on collection column
  cql3/statements/select_statement: for index on values of collection, don't emit duplicate rows
  cql/expr/expression, index/secondary_index_manager: needs_filtering and index_supports_expression rewrite to accomodate for indexes over collections
  cql3, index: Use entries() indexes on collections for queries
  cql3, index: Use keys() and values() indexes on collections for queries.
  types/tuple: Use std::begin() instead of .begin() in tuple_type_impl::build_value_fragmented
  cql3/statements/index_target: throw exception to signalize that we didn't miss returning from function
  db/view/view.cc: compute view_updates for views over collections
  view info: has_computed_column_depending_on_base_non_primary_key
  column_computation: depends_on_non_primary_key_column
  schema, index/secondary_index_manager: make schema for index-induced mv
  index/secondary_index_manager: extract keys, values, entries types from collection
  cql3/statements/: validate CREATE INDEX for index over a collection
  cql3/statements/create_index_statement,index_target: rewrite index target for collection
  column_computation.hh, schema.cc: collection_column_computation
  column_computation.hh, schema.cc: compute_value interface refactor
  Cql.g, treewide: support cql syntax `INDEX ON table(VALUES(collection))`
2022-08-16 14:18:51 +02:00
Avi Kivity
afa7960926 Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes
Currently, when detaching the table from the database, we force-evict all queriers for said table. This series broadens the scope of this force-evict to include all inactive reads registered at the semaphore. This ensures that any regular inactive read "forgotten" in the semaphore for any reason will not access a dangling table reference when it is destroyed later.
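
The idea can be sketched with a toy registry of inactive reads. The
`ToySemaphore` class below is hypothetical; only the method name
`evict_inactive_reads_for_table` mirrors the one added in this series,
and its real signature may differ.

```python
class ToySemaphore:
    """Keeps track of inactive reads, each tagged with the table it reads from."""

    def __init__(self):
        self._inactive = {}  # handle -> (table_id, reader)
        self._next_handle = 0

    def register_inactive_read(self, table_id, reader):
        handle = self._next_handle
        self._next_handle += 1
        self._inactive[handle] = (table_id, reader)
        return handle

    def evict_inactive_reads_for_table(self, table_id):
        # Evict every inactive read of the table, whoever registered it.
        victims = [h for h, (t, _) in self._inactive.items() if t == table_id]
        for handle in victims:
            del self._inactive[handle]
        return len(victims)

    def inactive_tables(self):
        return sorted(t for t, _ in self._inactive.values())


sem = ToySemaphore()
sem.register_inactive_read("ks.t1", "querier read")
sem.register_inactive_read("ks.t1", "forgotten regular read")
sem.register_inactive_read("ks.t2", "unrelated read")
evicted = sem.evict_inactive_reads_for_table("ks.t1")
```

After the call, no inactive read still refers to the detached table,
while reads of other tables are left alone.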

Fixes: https://github.com/scylladb/scylladb/issues/11264

Closes #11273

* github.com:scylladb/scylladb:
  querier: querier_cache: remove now unused evict_all_for_table()
  database: detach_column_family(): use reader_concurrency_semaphore::evict_inactive_reads_for_table()
  reader_concurrency_semaphore: add evict_inactive_reads_for_table()
2022-08-15 19:05:59 +03:00
Botond Dénes
92e5f438a4 querier: querier_cache: remove now unused evict_all_for_table() 2022-08-15 14:16:41 +03:00
Botond Dénes
e55ccbde8f reader_concurrency_semaphore: add evict_inactive_reads_for_table()
Allowing for evicting all inactive reads that belong to a certain table.
2022-08-15 14:16:41 +03:00
Botond Dénes
c8ef356859 test/lib: move convenience table config factory to sstable_test_env
All users of `column_family_test_config()`, get the semaphore parameter
for it from `sstable_test_env`. It is clear that the latter serves as
the storage space for stable objects required by the table config. This
patch just enshrines this fact by moving the config factory method to
`sstable_test_env`, so it can just get what it needs from members.
2022-08-15 11:23:59 +03:00
Michał Radwański
f572051ee9 test/boost/secondary_index_test: update for non-frozen indexes on
collections
2022-08-14 10:29:52 +03:00
Benny Halevy
d295d8e280 everywhere: define locator::host_id as a strong tagged_uuid type
So it can be distinguished from other uuid-based
identifiers in the system.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11276
2022-08-12 06:01:44 +03:00