Commit Graph

361 Commits

Author SHA1 Message Date
Avi Kivity
13a64d8ab2 Merge 'Remove all remaining restrictions classes' from Jan Ciołek
This PR removes all code that used classes `restriction`, `restrictions` and their children.

There were two fields in `statement_restrictions` that needed to be dealt with: `_clustering_columns_restrictions` and `_nonprimary_key_restrictions`.

Each function was reimplemented to operate on the new expression representaiion and eventually these fields weren't needed anymore.

After that the restriction classes weren't used anymore and could be deleted as well.

Now all of the code responsible for analyzing WHERE clause and planning a query works on expressions.

Closes #11069

* github.com:scylladb/scylla:
  cql3: Remove all remaining restrictions code
  cql3: Move a function from restrictions class to the test
  cql3: Remove initial_key_restrictions
  cql3: expr: Remove convert_to_restriction
  cql3: Remove _new from _new_nonprimary_key_restrictions
  cql3: Remove _nonprimary_key_restrictions field
  cql3: Reimplement uses of _nonprimary_key_restrictions using expression
  cql3: Keep a map of single column nonprimary key restrictions
  cql3: Remove _new from _new_clustering_columns_restrictions
  cql3: Remove _clustering_columns_restrictions from statement_restrictions
  cql3: Use a variable instead of dynamic cast
  cql3: Use the new map of single column clustering restrictions
  cql3: Keep a map of single column clustering key restrictions
  cql3: Return an expression in get_clustering_columns_restrctions()
  cql3: Reimplement _clustering_columns_restrictions->has_supporting_index()
  cql3: Don't create single element conjunction
  cql3: Add expr::index_supports_some_column
  cql3: Reimplement has_unrestricted_components()
  cql3: Reimplement _clustering_columns_restrictions->need_filtering()
  cql3: Reimplement num_prefix_columns_that_need_not_be_filtered
  cql3: Use the new clustering restrictions field instead of ->expression
  cql3: Reimplement _clustering_columns_restrictions->size() using expressions
  cql3: Reimplement _clustering_columns_restrictions->get_column_defs() using expressions
  cql3: Reimplement _clustering_columns_restrictions->is_all_eq() using expressions
  cql3: expr: Add has_only_eq_binops function
  cql3: Reimplement _clustering_columns_restrictions->empty() using expressions
2022-07-20 18:01:15 +03:00
Jan Ciolek
9d1ba07471 cql3: Reimplement uses of _nonprimary_key_restrictions using expression
All parts of the code that use _nonprimary_key_restrictions
are changed to use _new_nonprimary_key_restrictions instead.
I decided not to split this into multiple commits,
as there isn't a lot of changes and they are
analogous to the ones done before for partition
and clustering columns.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-20 09:10:30 +02:00
Jan Ciolek
2b7ffd57fb cql3: Return an expression in get_clustering_columns_restrctions()
get_clustering_columns_restrctions() used to return
a shared pointer to the clustering_restrictions class.

Now everything is being converted to expression,
so it should return an expression as well.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-19 16:02:01 +02:00
Pavel Emelyanov
62d95f09de view: De-futurize make_view_update_builder()
It doesn't sleep, just returns ready future with builder

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1384
       it's red because e-mail notification is broken (scylla-pkg#2988)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220718132529.30751-1-xemul@scylladb.com>
2022-07-18 17:15:48 +03:00
Botond Dénes
4d2ce5c304 mutation_compactor: remove emit_only_live_rows template parameter
Now that we use emit_only_live_rows::no everywhere we can remove this
template parameters. Only the template parameter is removed, the
internal logic around it is left in place (will be removed in a next
patch), by hard-wiring `only_live()`.
2022-07-12 08:43:49 +03:00
Botond Dénes
bedc82e52c tree: use emit_only_live_rows::no
emit_only_live_rows is a convenience so downstream consumers of the
mutation compactors don't have to check the `bool is_live` already
passed to them. This convenience however causes a template parameter and
additional logic for the compactor. As the most prominent of these
consumers (the query result builder) will soon have to switch to
emit_only_live_rows::no for other reasons anyway (it will want to count
tombstones), we take the opportunity to switch everybody to ::no. This
can be done with very little additional complexity to these consumer --
basically an additional if or two.
This prepares the ground for removing this template parameter and the
associate logic from the compactor.
2022-07-12 08:41:51 +03:00
Pavel Emelyanov
5526738794 view: Fix trace-state pointer use after move
It's moved into .mutate_locally() but it captured and used in its
continuation. It works well just because moved-from pointer looks like
nullptr and all the tracing code checks for it to be non-such.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1266/
       (CI job failed on post-actions thus it's red)

Fixes #11015

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220711134152.30346-1-xemul@scylladb.com>
2022-07-11 17:20:51 +03:00
Jan Ciolek
1339ff1c79 cql3: Use expression instead of _partition_key_restrictions in the remaining code
There are still some places that use partition_key_restrictions
instead of _new_partition_key_restrictions in statement_restrictions.
Change them to use the new representation

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-07-01 16:29:10 +02:00
Benny Halevy
81fa1ce9a1 Revert 'Compact staging sstables'
This patch reverts the following patches merged in
78750c2e1a "Merge 'Compact staging sstables' from Benny Halevy"

> 597e415c38 "table: clone staging sstables into table dir"
> ce5bd505dc "view_update_generator: discover_staging_sstables: reindent"
> 59874b2837 "table: add get_staging_sstables"
> 7536dd7f00 "distributed_loader: populate table directory first"

The feature causes regressions seen with e.g.
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/41/testReport/materialized_views_test/TestMaterializedViews/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split011___test_base_replica_repair/
```
AssertionError: Expected [[0, 0, 'a', 3.0]] from SELECT * FROM t_by_v WHERE v = 0, but got []
```

Where views aren't updated properly.
Apparently since `table::stream_view_replica_updates`
doesn't exclude the staging sstables anymore and
since they are cloned to the base table as new sstables
it seems to the view builder that no view updates are
required since there's no changes comparing to the base table.

Reopens #9559

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10890
2022-06-27 12:18:48 +03:00
Botond Dénes
78750c2e1a Merge 'Compact staging sstables' from Benny Halevy
This series decouples the staging sstables from the table's sstable set.

The current behavior keeps the sstables in the staging directory until view building is done. They are readable as any other sstable, but fenced off from compaction, so they don't go away in the meanwhile.

Currently, when views are built, the sstables are moved into the main table directory where they will then be compacted normally.

The problem with this design is that the staging sstables are never compacted, in particular they won't get cleaned up or scrubbed.

The cleanup scenario open a backdoor for data resurrection when the staging sstables are moved after view building while possibly containing stale partitions (#9559) which will not be cleaned up until next time cleanup compaction is performed.

With this series, SSTables that are created in or moved to the staging sub-directory are "cloned" into the base table directory by hard-linking the components there and creating a new sstable object which loads the cloned files.

The former, in the staging directory is used solely for view building and is not added to the table's sstable set, while the latter, its clone, behaves like any other sstable and is added either to the regular or maintenance set and is read and compacted normally.

When view building is done, instead of moving the staging sstable into the table's base directory, it is simply unlinked.
If its "clone" wasn't compacted away yet, then it will just remain where it is, exactly like it would be after it was moved there in the present state of things.  If it was already compacted and no longer exists, then unlinking will then free its storage.

Note that snapshot is based on the sstables listed by the table, which do not include the staging sstables with this change.
But that shouldn't matter since even today, the sstables in the snapshot has no notion of "staging" directory and it is expected that the MV's are either updated view `nodetool refresh` if restoring sstables from snapshot using the uploads dir, or if restoring the whole table from backup - MV's are effectively expected to be rebuilt from scratch (they are not included in automatic snapshots anyway since we don't have snapshot-coherency across tables).

A fundamental infrastructure change was done to achieve that which is to change the sstable_list which was a std::unordered_set<shared_sstable> into a std::unordered_map<generation_type, shared_sstable> that keeps the shared_sstable objects indexed by generation number (that must be unique).  With this model, sstables are supposed to be searched by the generation number, not by their pointer, since when the staging sstable is clones, there will be 2 shared_sstable objects with the same generation (and different `dir()`) and we must distinguish between them.

Special care was taken to throw a runtime_error exception if when looking up a shared sstable and finding another one with the same generation, since they must never exist in the same sstable_map.

Fixes #9559

Closes #10657

* github.com:scylladb/scylla:
  table: clone staging sstables into table dir
  view_update_generator: discover_staging_sstables: reindent
  table: add get_staging_sstables
  view_update_generator: discover_staging_sstables: get shared table ptr earlier
  distributed_loader: populate table directory first
  sstables: time_series_sstable_set: insert: make exception safe
  sstables: move_to_new_dir: fix debug log message
2022-06-24 08:05:38 +03:00
Avi Kivity
dab56b82fa Merge 'Per-partition rate limiting' from Piotr Dulikowski
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.

This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:

```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
	'max_writes_per_second': 100,
	'max_reads_per_second': 200
};
```

Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead.

Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.

The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.

Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:

- Write mode:
  ```
  8f690fdd47 (PR base):
  129644.11 tps ( 56.2 allocs/op,  13.2 tasks/op,   49785 insns/op)
  This PR:
  125564.01 tps ( 56.2 allocs/op,  13.2 tasks/op,   49825 insns/op)
  ```
- Read mode:
  ```
  8f690fdd47 (PR base):
  150026.63 tps ( 63.1 allocs/op,  12.1 tasks/op,   42806 insns/op)
  This PR:
  151043.00 tps ( 63.1 allocs/op,  12.1 tasks/op,   43075 insns/op)
  ```

Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected

Fixes: #4703

Closes #9810

* github.com:scylladb/scylla:
  storage_proxy: metrics for per-partition rate limiting of reads
  storage_proxy: metrics for per-partition rate limiting of writes
  database: add stats for per partition rate limiting
  tests: add per_partition_rate_limit_test
  config: add add_per_partition_rate_limit_extension function for testing
  cf_prop_defs: guard per-partition rate limit with a feature
  query-request: add allow_limit flag
  storage_proxy: add allow rate limit flag to get_read_executor
  storage_proxy: resultize return type of get_read_executor
  storage_proxy: add per partition rate limit info to read RPC
  storage_proxy: add per partition rate limit info to query_result_local(_digest)
  storage_proxy: add allow rate limit flag to mutate/mutate_result
  storage_proxy: add allow rate limit flag to mutate_internal
  storage_proxy: add allow rate limit flag to mutate_begin
  storage_proxy: choose the right per partition rate limit info in write handler
  storage_proxy: resultize return types of write handler creation path
  storage_proxy: add per partition rate limit to mutation_holders
  storage_proxy: add per partition rate limit info to write RPC
  storage_proxy: add per partition rate limit info to mutate_locally
  database: apply per-partition rate limiting for reads/writes
  database: move and rename: classify_query -> classify_request
  schema: add per_partition_rate_limit schema extension
  db: add rate_limiter
  storage_proxy: propagate rate_limit_exception through read RPC
  gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
  storage_proxy: pass rate_limit_exception through write RPC
  replica: add rate_limit_exception and a simple serialization framework
  docs: design doc for per-partition rate limiting
  transport: add rate_limit_error
2022-06-24 01:32:13 +03:00
Benny Halevy
597e415c38 table: clone staging sstables into table dir
clone staging sstables so their content may be compacted while
views are built.  When done, the hard-linked copy in the staging
subdirectory will be simply unlinked.

Fixes #9559

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
ce5bd505dc view_update_generator: discover_staging_sstables: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
59874b2837 table: add get_staging_sstables
We don't have to go over all sstables in the table to select the
staging sstables out of them, we can get it directly from the
_sstables_staging map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Benny Halevy
b8b14d76b3 view_update_generator: discover_staging_sstables: get shared table ptr earlier
It's potentially a bit more efficient since
t.get_sstables is called only once, while
t.shared_from_this() is called per staging sstable.

Also, prepare for the following patches that modify
this function further.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-23 16:55:27 +03:00
Botond Dénes
080ed590bf Merge "Obtain dc/rack from topology, not snitch" from Pavel Emelyanov
"
The way dc/rack info is maintained is very intricate.

The dc/rack strings originate at snitch, get propagated via gossiper,
get notified to storage service which, in turn, stores them into the
system keyspace and token metadata. Code that needs to get dc/rack
for a given endpoint calls snitch which tries to get the data from
gossiper and if failed goes and loads it from system keyspace cache.
Also there's "internal IP" thing hanging arond that loops messaging
service in both -- updating and getting the info.

The plan is to make topology (that currently sits on token metadata)
stay the only "source of truth" regarding the endpoints' dc/rack and
internal IP info. The dc/rack mappings are put into topology already,
but it cannot yet fully replace snitch for two reasons:

- it doesn't map internal IP to endpoint
- it doesn't get data stored in system keyspace

So what this patch set does is patches most of the dc/rack getters
to call topology methods. The topology is temporarily patched to
just call the respective snitch methods. This removes a big portion
of calls for global snitch instance.

After the set the places that still explicitly rely on snitch to
provide dc/rack are

- messaging service: needs internal IP knowledge on topology
- db/consistency_level: is all "global", needs heavier patching
- tests: just later
"

* 'br-get-dc-rack-from-topology-2' of https://github.com/xemul/scylla:
  proxy stats: Get rack/datacenter from topology
  proxy stats: Push topology arg to get_ep_stats
  api: Get rack/datacenter from topology
  hints: Remove snitch dependency
  hints: Get rack/datacenter from topology
  alternator: Get rack/datacenter from topology
  range_streamer: Get rack/datacenter from topology
  repair: Get rack/datacenter from topology
  view: Get rack/datacenter from topology
  storage_service: Get rack/datacenter from topology
  proxy: Get rack/datacenter from topology
  topology: Add get_rack/_datacenter methods
2022-06-23 10:01:36 +03:00
Piotr Dulikowski
e6beab3106 storage_proxy: add allow rate limit flag to mutate/mutate_result
Now, mutate/mutate_result accept a flag which decides whether the write
should be rate limited or not.

The new parameter is mandatory and all call sites were updated.
2022-06-22 20:16:49 +02:00
Piotr Sarna
bc3a635c42 view: exclude using static columns in the view filter
The code which applied view filtering (i.e. a condition placed
on a view column, e.g. "WHERE v = 42") erroneously used a wildcard
selection, which also assumes that static columns are needed,
if the base table contains any such columns.
The filtering code currently assumes that no such columns are fetched,
so the selection is amended to only ask for regular columns
(primary key columns are sent anyway, because they are enabled
via slice options, so no need to ask for them explicitly).

Fixes #10851

Closes #10855
2022-06-22 15:55:45 +03:00
Pavel Emelyanov
17128eb54b view: Get rack/datacenter from topology
The view code already gets token metadata from global proxy instance. Do
the same to get topology object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Raphael S. Carvalho
aa667e590e sstable_set: Fix partitioned_sstable_set constructor
The sstable set param isn't being used anywhere, and it's also buggy
as sstable run list isn't being updated accordingly. so it could happen
that set contains sstables but run list is empty, introducing
inconsistency.

we're fortunate that the bug wasn't activated as it would've been
a hard one to catch. found this while auditting the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>
2022-06-21 11:58:13 +03:00
Avi Kivity
c80999fab4 cql3: expr: push is_satisfied_by regular and static column extraction to callers
is_satisfied_by() rearranges the static and regular columns from
query::result_row_view form (which is a use-once iterator) to
std::vector<managed_bytes_opt> (which uses the standard value
representation, and allows random access which expression
evaluation needs). Doing it in is_saitisfied_by() means that it is
done every time an expression is evaluated, which is wasteful. It's
also done even if the expression doesn't need it at all.

Push it out to callers, which already eliminates some calls.

We still pass cql3::expr::selection, which is a layering violation,
but that is left to another time.

Note that in view.cc's check_if_matches(), we should have been
able to move static_and_regular_columns calculation outside the
loop. However, we get crashes if we do. This is likely due to a
preexisting bug (which the zero iterations loop avoids). However,
in selection.cc, we are able to avoid the computation when the code
claims it is only handling partition keys or clustering keys.
2022-06-12 16:12:41 +03:00
Avi Kivity
4b715226fe cql3: expr: convert is_satisfied_by() signature to evaluation_inputs
Callers are converted, but the internals are kept using the old
conventions until more APIs are converted.

Although the new API allows passing no query_options, the view code
keeps passing dummy query_options and improvement is left as a FIXME.
2022-06-12 12:53:44 +03:00
Michael Livshin
632b4e5a9a fix "ninja dev-headers"
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Piotr Sarna
c3a9658535 db,view: add delete ghost rows visitor
The visitor is used to traverse view rows, and if it detects
a ghost row it qualifies it for deletion. Qualification is based
on a base table read with cl=ALL: if the corresponding row is not
present in the base table, it is considered a ghost.
2022-05-19 10:11:50 +02:00
Avi Kivity
528ab5a502 treewide: change metric calls from make_derive to make_counter
make_derive was recently deprecated in favor of make_counter, so
make the change throughput the codebase.

Closes #10564
2022-05-14 12:53:55 +02:00
Avi Kivity
5937b1fa23 treewide: remove empty comments in top-of-files
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.

Closes #10562
2022-05-13 07:11:58 +02:00
Avi Kivity
a1df583dea db: view: explicitly ignore unused result
Otherwise, gcc complains.
2022-04-18 12:27:18 +03:00
Benny Halevy
6454c8d67f db: view_updates: coroutinize move_to
And allow yielding in-between freezing each update mutation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-08 11:29:25 +03:00
Benny Halevy
0e570d6ffa db: view_update_builder: build_some: maybe yield between updates
`update.move_to` freezes the mutation

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-08 11:22:41 +03:00
Benny Halevy
243ba2e976 db: view_update_builder: build_some: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-08 11:21:42 +03:00
Benny Halevy
3e376155ef db: view_update_builder: coroutinize build_some
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-04-08 11:20:35 +03:00
Botond Dénes
b029bd3db7 tree: remove mutation_reader.hh include
In most files it was unused. We should move these to the patch which
moved out the last interesting reader from mutation_reader.hh (and added
the corresponding new header include) but its probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide reader
concurrency semaphore and some other misc reader related definitions.
2022-03-30 15:42:51 +03:00
Botond Dénes
d0ea895671 readers: move multishard reader & friends to reader/multishard.cc
Since the multishard reader family weighs more than 1K SLOC, it gets
its own .cc file.
2022-03-30 15:42:51 +03:00
Mikołaj Sielużycki
6f1b6da68a compile: Fix headers so that *-headers targets compile cleanly.
Closes #10273
2022-03-25 16:19:26 +02:00
Botond Dénes
4b9219a209 replica/table: migrate populate_views() to v2 2022-03-17 10:51:05 +02:00
Botond Dénes
909be0b9d7 db/view: convert view_update_builder interface to v2
The constructor and the make_ factory method now take v2 readers.
Immediate users are patched, with conversions if needed.
2022-03-17 10:50:50 +02:00
Botond Dénes
0740019e4d db/view: migrate view_update_builder to v2
To avoid noise, the interface is left as v1 and inbound readers are
converted in the constructor.
2022-03-17 10:47:55 +02:00
Avi Kivity
39504dea61 Merge "Convert result builders to v2" from Botond
"
Namely the query result writer and the reconcilable result builder, used
for building results for regular queries and mutation queries (used in
read repair) respectively.
With this, there are no users left for the v1 output of the compactor,
so we remove that, making the compactor v2 all-the-way (and simpler).
This means that for regular queries, a downgrade phase is eliminated
completely, as regular queries don't store range tombstone in their
result, so no need to convert them.

Tests: unit(dev, release, debug)
"

* 'result-builders-v2/v1' of https://github.com/denesb/scylla:
  reconcilable_result_builder: remove v1 support
  query_result_builder: remove v1 support
  mutation_compactor: drop v1 related code-paths
  mutation_compactor: drop v1 support altogether from the API
  tree: migrate to the v2 consumer APIs
  test/boost/mutation_test: remove v1 specific test code
  querier: switch to v2 compactor output
  reconcilable_result_builder: add v2 support
  query_result_writer: add v2 support
  query_result_builder: make consume(range_tombstone) noop
2022-03-15 14:32:58 +02:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Botond Dénes
87ac2e9ab0 tree: migrate to the v2 consumer APIs 2022-03-11 09:24:05 +02:00
Nadav Har'El
ef43531fb6 materialized views: allow empty strings in views and indexes
Although Cassandra generally does not allow empty strings as partition
keys (note they are allowed as clustering keys!), it *does* allow empty
strings in regular columns to be indexed by a secondary index, or to
become an empty partition-key column in a materialized view. As noted in
issues #9375 and #9364 and verified in a few xfailing cql-pytest tests,
Scylla didn't allow these cases - and this patch fixes that.

The patch mostly *removes* unnecessary code: In one place, code
prevented an sstable with an empty partition key from being written.
Another piece of removed code was a function is_partition_key_empty()
which the materialized-view code used to check whether the view's
row will end up with an empty partition key, which was supposedly
forbidden. But in fact, should have been allowed like they are allowed
in Cassandra and required for the secondary-index implementation, and
the entire function wasn't necessary.

Note that the removed function is_partition_key_empty() was *NOT* required
for the "IS NOT NULL" feature of materialized views - this continues to
work as expected after this patch, and we add another test to confirm it.
Being null and being an empty string are two different things.

This patch also removes a part of a unit test which enshrined the
wrong behavior.

After this patch we are left with one interesting difference from
Cassandra: Though Cassandra allows a user to create a view row with an
empty-string partition key, and this row is fully visible in when
scanning the view, this row can *not* be queried individually because
"WHERE v=''" is forbidden when v is the partition key (of the view).
Scylla does not reproduce this anomaly - and such point query does work
in Scylla after this patch. We add a new test to check this case, and mark
it "cassandra_bug", i.e., it's a Cassandra behavior which we consider
wrong and don't want to emulate.

This patch relies on #9352 and #10178 having been fixed in previous patches,
otherwise the WHERE v='' does not work when reading from sstables.
We add to the already existing tests we had for empty materialized-views
keys a lookup with WHERE v='' which failed before fixing those two issues.

Fixes #9364
Fixes #9375

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-08 15:34:26 +02:00
Botond Dénes
05c48ee0cc db/view/view_updating_consumer: migrate to v2
Not a completely mechanical transition. The consumer has to generate its
mutation via a mutation_rebuilder_v2 as mutation fragment v2 cannot be
applied to mutations directly yet.
2022-02-21 12:29:24 +02:00
Botond Dénes
45b36d91c6 db/view/view_builder: use v2 reader 2022-02-21 12:27:55 +02:00
Benny Halevy
b56b10a4bb view_builder: do_build_step: handle unexpected exceptions
Exception are handled by do_build_step in principle,
Yet if an unhandled exception escapes handling
(e.g. get_units(_sem, 1) fails on a broken semaphore)
we should warn about it since the _build_step.trigger() calls
do no handle exceptions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-02 14:54:19 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
985403ab99 view: convert build_progress_virtual_reader to flat_mutation_reader_v2
build_progress_virtual_reader is a virtual reader that trims off
the last clustering key column from an underlying base table. It
is here converted to flat_mutation_reader_v2.

Because range_tombstone_change uses position_in_partition, not
clustering_key_prefix, we need a new adjust_ckey() overload.

Note the transformation is likely incorrect. When trimming the
last clustering key column, an inclusive bound changes should
change to exclusive. However, the original code did not do this,
so we don't fix it here. It's immaterial anyway since the base
table doesn't include range tombstones.

Test: unit (dev)   (which has a test for this reader)

Closes #9913
2022-01-17 10:31:37 +02:00
Michael Livshin
91d38ef2a9 view_update_generator: remove unneeded call to downgrade_to_v1()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-01-11 10:49:26 +02:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00