383 Commits

Author SHA1 Message Date
Avi Kivity
5a30f9b789 Merge 'Distributed aggregate query' from Michał Jadwiszczak
This PR extends #9209. It consists of 2 main points:

To enable parallelization of user-defined aggregates, a reduction function was added to the UDA definition. The reduction function is optional; it must be a scalar function that takes two arguments of the UDA's state type and returns the UDA's state.

All currently implemented native aggregates got a reducible counterpart that returns its state as the final result, so that it can be reduced with other results. Hence all native aggregates can now be distributed.
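
As a rough illustration of why a reduce function makes an aggregate distributable, the Python sketch below models an "average" aggregate. The names `sfunc`, `reducefunc`, `finalfunc`, and `aggregate_distributed` are hypothetical and are not taken from Scylla's code; the only point is that partial states computed independently can be merged into the same result as a sequential run.

```python
from functools import reduce

# Hypothetical model of a user-defined "average" aggregate.
# The state is a (sum, count) pair; this is an assumption for illustration.
INITCOND = (0, 0)

def sfunc(state, value):
    """State function: fold one input value into the running state."""
    total, count = state
    return (total + value, count + 1)

def reducefunc(left, right):
    """Reduce function: takes two states and returns a merged state."""
    return (left[0] + right[0], left[1] + right[1])

def finalfunc(state):
    """Final function: turn the state into the aggregate's result."""
    total, count = state
    return total / count if count else None

def aggregate_sequential(values):
    """Classic evaluation: one pass over all values."""
    return finalfunc(reduce(sfunc, values, INITCOND))

def aggregate_distributed(partitions):
    """Each partition builds its own state; the states are then reduced."""
    partial_states = [reduce(sfunc, part, INITCOND) for part in partitions]
    return finalfunc(reduce(reducefunc, partial_states, INITCOND))
```

Without a reduce function there is no generic way to combine two partial states, which is presumably why an aggregate lacking one cannot be evaluated this way.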

Local 3-node cluster made with current master. `node1` updated to this branch. Accessing node with `ccm <node-name> cqlsh`

I've tested the following from both the old and the new node:
- creating UDA with reduce function - not allowed
- selecting count(*) - distributed
- selecting other aggregate function - not distributed

Fixes: #10224

Closes #10295

* github.com:scylladb/scylla:
  test: add tests for parallelized aggregates
  test: cql3: Add UDA REDUCEFUNC test
  forward_service: enable multiple selection
  forward_service: support UDA and native aggregate parallelization
  cql3:functions: Add cql3::functions::functions::mock_get()
  cql3: selection: detect parallelize reduction type
  db,cql3: Move part of cql3's function into db
  selection: detect if selectors factory contains only simple selectors
  cql3: reducible aggregates
  DB: Add `scylla_aggregates` system table
  db,gms: Add SCYLLA_AGGREGATES schema features
  CQL3: Add reduce function to UDA
  gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature
2022-07-19 19:05:19 +03:00
Jadw1
d13f347621 DB: Add scylla_aggregates system table
Saving information about UDA's reduce function to `scylla_aggregates`
table and distributing it across cluster.
2022-07-18 15:25:37 +02:00
Jadw1
d8f3461147 CQL3: Add reduce function to UDA
Add optional field to UDA, that describes reduce function to allow
parallelization of UDA aggregates.
2022-07-18 14:18:48 +02:00
Benny Halevy
71aad45757 schema_tables: merge_tables_and_views: use drop_table_on_all_shards
So that the dropped table's directory can be
removed after it has been dropped on all shards
if it has no snapshots.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-17 14:33:34 +03:00
Avi Kivity
3b20407f25 Merge 'db: Avoid memtable flush latency on schema merge' from Tomasz Grabiec
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.

This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.

We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.

Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.

I reproduced bad behavior locally on my machine with a tired (high latency)
SSD disk, load driver remote. Under high load, I saw table alter (server-side part) taking
up to 10 seconds before. After the patch, it takes up to 200 ms (50:1 improvement).
Without load, it is 300ms vs 50ms.

Fixes #8272
Fixes #8309
Fixes #1459

Closes #10333

* github.com:scylladb/scylla:
  config: Introduce force_schema_commit_log option
  config: Introduce unsafe_ignore_truncation_record
  db: Avoid memtable flush latency on schema merge
  db: Allow splitting initiatlization of system tables
  db: Flush system.scylla_local on change
  migration_manager: Do not drop system.IndexInfo on keyspace drop
  Introduce SCHEMA_COMMITLOG cluster feature
  frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations
  db/commitlog: Improve error messages in case of unknown column mapping
  db/commitlog: Fix error format string to print the version
  db: Introduce multi-table atomic apply()
2022-07-07 16:03:50 +03:00
Benny Halevy
acae3cc223 treewide: stop use of deprecated coroutine::make_exception
Convert most use sites from `co_return coroutine::make_exception`
to `co_await coroutine::return_exception{,_ptr}` where possible.

In cases this is done in a catch clause, convert to
`co_return coroutine::exception`, generating an exception_ptr
if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10972
2022-07-07 15:02:16 +03:00
Tomasz Grabiec
6b316f267f db: Avoid memtable flush latency on schema merge
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.

This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.

We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.

Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.
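
The ordering argument above can be pictured with a toy model, written here in Python. It is only a sketch: `replay`, the `schema_version` field, and the list-based logs are hypothetical stand-ins, and the real replay logic is considerably more involved.

```python
def replay(schema_log, data_log):
    """Replay schema mutations first, then user writes.

    Returns (applied, rejected) lists of write keys. A write is rejected
    in this toy model when the schema version it depends on is unknown.
    """
    known_versions = set(schema_log)  # schema is fully recovered up front
    applied, rejected = [], []
    for write in data_log:
        target = applied if write["schema_version"] in known_versions else rejected
        target.append(write["key"])
    return applied, rejected


writes = [
    {"key": "a", "schema_version": "v1"},
    {"key": "b", "schema_version": "v2"},  # depends on the latest schema
]

# Latest schema recovered before user mutations: every write can be applied.
all_applied = replay(["v1", "v2"], writes)

# Only the old schema available: the write that needs "v2" cannot be applied.
with_old_schema = replay(["v1"], writes)
```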

One complication with this change is that replay_position markers are
commitlog-domain specific and cannot cross domains. They are recorded
in various places which survive node restart: sstables are annotated
with the maximum replay position, and they are present inside
truncation records. The former annotation is used by "truncate"
operation to drop sstables. To prevent old replay positions from being
interpreted in the context of the new schema commitlog domain, the
change refuses to boot if there are truncation records, and also
prohibits truncation of schema tables.

The boot sequence needs to know whether the cluster feature associated
with this change was enabled on all nodes. Features are stored in
system.scylla_local. Because we need to read it before initializing
schema tables, the initialization of tables now has to be split into
two phases. The first phase initializes all system tables except
schema tables, and later we initialize schema tables, after reading
stored cluster features.

The commitlog domain is switched only when all nodes are upgraded, and
only after new node is restarted. This is so that we don't have to add
risky code to deal with hot-switching of the commitlog domain. Cold
switching is safer. This means that after upgrade there is a need for
yet another rolling restart round.

Fixes #8272
Fixes #8309
Fixes #1459
2022-07-06 22:08:56 +02:00
Tomasz Grabiec
609bf1d547 migration_manager: Do not drop system.IndexInfo on keyspace drop
It's not needed anymore because system.IndexInfo is a virtual table
calculated from view info.

The drop accesses a table which is outside the system_schema keyspace,
so it crosses the commit log domain. This will trigger an internal error from
database::apply() on schema merge once the code switches to use
the schema commit log and require that all mutations which are
part of the schema change belong to a single commit log domain.

We could theoretically move system.IndexInfo to the schema commit log
domain. It's not easy though because table initialization at boot
needs to be split, and current functions for initialization work
at keyspace granularity, not table granularity.
2022-07-06 22:08:56 +02:00
Avi Kivity
4b53af0bd5 treewide: replace parallel_for_each with coroutine::parallel_for_each in coroutines
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurrences (i.e. caller is a coroutine).

One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.

Closes #10699
2022-05-31 09:06:24 +03:00
Avi Kivity
5937b1fa23 treewide: remove empty comments in top-of-files
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.

Closes #10562
2022-05-13 07:11:58 +02:00
Benny Halevy
5b4eb44795 database: add flush_on_all variants
Used by the API layer.

Will be used in a later patch to flush
on all shards before taking a snapshot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 09:56:44 +03:00
Botond Dénes
fd27fbfe64 Merge "Add user types carrier helper" from Pavel Emelyanov
"
There's a cql_type_parser::parse() method that needs to get user
types for a keyspace by its name. For this it uses the global
storage proxy instance as a place to get database from. This set
introduces an abstract user_types_storage helper object that's
responsible for providing the user types to the caller.

This helper, in turn, is provided to the parse() method by the
database itself or by the schema_ctxt object that needs parse()
to unfreeze schemas and doesn't have database at those times.

This removes one more get_storage_proxy() call.
"

* 'br-user-types-storage' of https://github.com/xemul/scylla:
  cql_type_parser: Require user_types_storage& in parse()
  schame_tables: Add db/ctxt args here and there
  user_types: Carry storage on database and schema_ctxt
  data_dictionary: Introduce user types storage
2022-05-09 17:38:52 +03:00
Pavel Emelyanov
0f698910e8 cql_type_parser: Require user_types_storage& in parse()
Right now to get user types the method in question gets global proxy
instance to get database from it and then peek a keyspace, its metadata
and, finally, the user types. There's also a safety check for proxy not
being initialized, which happens in tests.

Instead of messing with the proxy, the parse() method now accepts the
user_types_storage reference from which it gets the types. All the
callers already have the needed storage at hand -- in most of the cases
it's one shared between the database and schema_ctxt. In case of tests
it's a dummy storage, in case of schema-loader it's its local one.

The get_column_mapping() is special -- it doesn't expect any user types
to be parsed and passes a "" keyspace into it, nor does it have a db/ctxt to
get the types storage from, so it can safely use the dummy one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-05 13:11:18 +03:00
Pavel Emelyanov
44f38d4de2 schame_tables: Add db/ctxt args here and there
This is to have them in places that call cql_type_parser::parse.
Pure churn reduction for the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-05 13:11:18 +03:00
Pavel Emelyanov
2104d90dd0 user_types: Carry storage on database and schema_ctxt
The user types storage is needed in cql_type_parser::parse which is in
turn called with either replica::database or schema_ctxt at hand.

To facilitate the former case replica::database has its own user types
storage created in database constructor.

The latter case is a bit trickier. In many cases the ctxt is created as
a temporary object and the database is available at those places. Also
the ctxt object lives on the schema_registry instance which doesn't have
database nearby. However, that ctxt lifetime is the same as the registry
instance one and when it's created there's a database at hand (it's the
database constructor that calls schema_registry.init() passing "this"
into it). Thus, the solution is to make database's user types storage be
a shared pointer that's shared between database itself and all the ctxts
out there including the one that lives on schema_registry instance.

When database goes away it .deactivate()s its user types storage so that
any ctxts that may share it stay on the safe side and don't use database
after free. This part will go away when the schema_registry will be
deglobalized.
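
A minimal Python sketch of the sharing and deactivation idea follows. The classes `UserTypesStorage`, `Database`, and `SchemaCtxt` below are simplified, hypothetical stand-ins, and raising `LookupError` after deactivation is an assumption made for the example, not a description of what the real storage does.

```python
class UserTypesStorage:
    """Shared handle that resolves user types through a database."""

    def __init__(self, database):
        self._database = database

    def deactivate(self):
        # Drop the back-reference so nobody can reach the database any more.
        self._database = None

    def get(self, keyspace):
        if self._database is None:
            # Assumed behaviour for this sketch: refuse the lookup.
            raise LookupError(f"no user types available for {keyspace!r}")
        return self._database.keyspaces[keyspace]


class Database:
    def __init__(self, keyspaces):
        self.keyspaces = keyspaces            # keyspace name -> user types
        self.user_types = UserTypesStorage(self)

    def close(self):
        # Contexts that outlive the database keep the storage object,
        # but it no longer points at the freed database.
        self.user_types.deactivate()


class SchemaCtxt:
    def __init__(self, storage):
        self.user_types = storage             # same object the database holds


db = Database({"ks": {"address": "frozen<tuple<text, int>>"}})
ctxt = SchemaCtxt(db.user_types)
```

Because both owners hold the same object, a single `deactivate()` call protects every context at once.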

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-05 13:06:04 +03:00
Avi Kivity
19ab3edd77 gms: feature_service: remove variable/helper function duplication
Each feature has a private variable and a public accessor. Since the
accessor effectively makes the variable public, avoid the intermediary
and make the variable public directly.

To ease mechanical translation, the variable name is chosen as
the function name (without the cluster_supports_ prefix).

References throughout the codebase are adjusted.
2022-05-04 18:59:56 +03:00
Eliran Sinvani
a16b4e407d internal queries: add caching to some queries
Some of the internal queries didn't have caching enabled even though
the query is likely to execute in large bursts or relatively
often. An example of the former is `default_authorized::authorize`, and of
the latter `system_distributed_keyspace::get_service_levels`.

Fixes #10335

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-01 13:30:02 +03:00
Eliran Sinvani
e0c7178e75 query_processor: remove default internal query caching behavior
When executing internal queries, it is important that the developer
decides whether to cache the query internally, since internal
queries are cached indefinitely. It is also important that the programmer
is aware of whether caching is going to happen.
The code contained two "groups" of `query_processor::execute_internal`,
one group has caching by default and the other doesn't.
Here we add overloads to eliminate default values for caching behaviour,
forcing an explicit parameter for the caching values.
All the call sites were changed to reflect the original caching default
that was there.
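
The same idea can be shown as a Python analogy: a required keyword-only argument forces every caller to state the caching choice. The `QueryProcessor` class and its `cache` parameter below are hypothetical; the actual change uses C++ overloads of `query_processor::execute_internal`.

```python
class QueryProcessor:
    """Toy query processor; preparing a statement is the costly step."""

    def __init__(self):
        self._cache = {}
        self.prepare_count = 0

    def _prepare(self, query):
        self.prepare_count += 1
        return ("prepared", query)

    def execute_internal(self, query, *, cache):
        # `cache` has no default value, so a call site that omits it
        # fails with TypeError instead of silently picking a behaviour.
        if cache:
            statement = self._cache.get(query)
            if statement is None:
                statement = self._cache[query] = self._prepare(query)
        else:
            statement = self._prepare(query)
        return statement


qp = QueryProcessor()
qp.execute_internal("SELECT * FROM system.local", cache=True)
qp.execute_internal("SELECT * FROM system.local", cache=True)   # reuses the entry
qp.execute_internal("SELECT * FROM system.peers", cache=False)  # never cached
```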

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-01 08:33:55 +03:00
Avi Kivity
de0ee13f45 schema_tables: forward-declare user_function and user_aggerates
These bring in wasm.hh (though they really shouldn't) and make
everyone suffer. Forward declare instead and add missing includes
where needed.

Closes #10444
2022-04-28 07:22:02 +03:00
Piotr Sarna
fea18943cd schema_tables: drop leftover change to system_schema.keyspaces
Series 59d56a3fd7 introduced
an accidental backward incompatible regression by adding
a column to system_schema.keyspaces and then not even using
it for anything. It's a leftover from the original hackathon
implementation and should never have reached master in the first place.
Fortunately, the series isn't part of any stable release yet.

Fixes #10376
Tests: manual, verifying that the system_schema.keyspaces table
no longer contains the extraneous column.

Closes #10377
2022-04-18 12:00:43 +03:00
Piotr Sarna
91f130bd9c schema_tables: remove unnecessary throws
Throws are translated to passing the exception directly.
2022-04-12 13:09:27 +02:00
Piotr Sarna
58529591a9 database,cql3: add STORAGE option to keyspaces
The STORAGE option is designed to hold a map of options
used for customizing storage for given keyspace.
The option is kept in a system_schema.scylla_keyspaces table.
The option is only available if the whole cluster is aware
of it - guarded by a cluster feature.

Example of the table contents:
```
cassandra@cqlsh> select * from system_schema.scylla_keyspaces;

 keyspace_name | storage_options                                | storage_type
---------------+------------------------------------------------+--------------
           ksx | {'bucket': '/tmp/xx', 'endpoint': 'localhost'} |           S3
```
2022-04-08 09:17:01 +02:00
Piotr Sarna
7f02b188b7 db,schema_tables: add scylla_keyspaces table
The table holds scylla-specific information on keyspaces.
The first columns include storage_type and storage_options,
which will be used later to store storage information.
2022-04-08 09:17:00 +02:00
Jadw1
b560286ffe CQL3: check sfunc return type in UDA
The return type of the state function is now checked while creating a UDA.
Appropriate test added to cql-pytest.
2022-04-06 09:25:17 +02:00
Jadw1
c921efd1b3 cql3: allow no final_func and no initcond in UDA
Makes the final function and initial condition optional when
creating a UDA. No final function means the UDA returns its final state,
and the default initial condition is `null`.

Fixes: #10324
2022-04-06 09:08:50 +02:00
Pavel Emelyanov
c15359165d system_keyspace: Make update_schema_version non-static
It's called from two places -- .setup() and schema_tables code. Both
have the instance hanging around, so the method can be de-marked
static and set free from global qctx

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
b80d5f8900 schema_tables: Add sharded<system_keyspace> argument to update_schema_version_and_announce
All its (indirect) callers had been patched to have it, now it's
possible to have the argument in it. Next patch will make use of it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Tomasz Grabiec
7719f4cd91 Merge "Group 0 discovery: persist and restore peers" from Kamil
We add a `peers()` method to `discovery` which returns the peers
discovered until now (including seeds). The caller of functions which
return an output -- `tick` or `request` -- is responsible for persisting
`peers()` before returning the output of `tick`/`request` (e.g. before
sending the response produced by `request` back). The user of
`discovery` is also responsible for restoring previously persisted peers
when constructing `discovery` again after a restart (e.g. if we
previously crashed in the middle of the algorithm).

The `persistent_discovery` class is a wrapper around `discovery` which
does exactly that.
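
A minimal Python sketch of such a wrapper is shown below. The `Discovery` class here is a heavily simplified, hypothetical stand-in that only tracks peers, and a plain dict plays the role of the local table; none of this mirrors the real C++ interfaces.

```python
class Discovery:
    """Stand-in for the discovery algorithm; only peer tracking is modelled."""

    def __init__(self, peers):
        self._peers = set(peers)

    def peers(self):
        return set(self._peers)

    def request(self, new_peers):
        self._peers.update(new_peers)
        return sorted(self._peers)  # placeholder for the real response


class PersistentDiscovery:
    """Persists peers() before handing any output back to the caller."""

    def __init__(self, seeds, storage):
        self._storage = storage
        restored = storage.get("peers", set())   # restore after a restart
        self._inner = Discovery(set(seeds) | restored)

    def peers(self):
        return self._inner.peers()

    def request(self, new_peers):
        output = self._inner.request(new_peers)
        self._storage["peers"] = self._inner.peers()  # persist first
        return output                                 # then return


storage = {}
first = PersistentDiscovery({"n1"}, storage)
response = first.request({"n2", "n3"})

# Simulated restart: a fresh instance sees the peers persisted earlier.
restarted = PersistentDiscovery({"n1"}, storage)
```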

For storage we use a simple local table.

A simple bugfix is also included in the first patch.

* kbr/discovery-persist-v3:
  service: raft: raft_group0: persist discovered peers and restore on restart
  db: system_keyspace: introduce discovery table
  service: raft: discovery: rename `get_output` to `tick`
  service: raft: discovery: stop returning peer_list from `request` after becoming leader
2022-02-25 17:23:08 +01:00
Nadav Har'El
7be3129458 cdc: don't need current keyspace to create the log table
CDC registers to the table-creation hook (before_create_column_family)
to add a second table - the CDC log table - to the same keyspace.
The handler function (on_before_update_column_family() in cdc/log.cc)
wants to retrieve the keyspace's definition, but that does NOT WORK if
we create the keyspace and table in one operation (which is exactly what
we intend to do in Alternator to solve issue #9868) - because at the
time of the hook, the keyspace does not yet exist in the schema.

It turns out that on_before_update_column_family() does not REALLY need
the keyspace. It needed it to pass it on to make_create_table_mutations()
but that function doesn't use the keyspace parameter passed to it! All
it needs is the keyspace's name - which is in the schema anyway and
doesn't need to be looked up.

So in this patch we fix make_create_table_mutations() to not require the
unused keyspace parameter - and fix the CDC code not to look for the
keyspace that is no longer needed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220215162342.622509-1-nyh@scylladb.com>
2022-02-16 08:38:56 +02:00
Kamil Braun
5dbf86fa29 db: system_keyspace: introduce discovery table
This table will be used to persist the list of peers discovered by the
`discovery` algorithm that is used for creating Raft group 0 when
bootstrapping a fresh cluster.
2022-02-14 12:05:18 +01:00
Kamil Braun
fad72daeb4 db: system_keyspace: introduce system.group0_history table
This table will contain a history of all group 0 changes applied through
Raft. With each change is an associated unique ID, which also identifies
the state of all group 0 tables (including schema tables) after this
change is applied, assuming that all such changes are serialized through
Raft (they will be eventually).

We will use these state IDs to check if a given change is still
valid at the moment it is applied (in `group0_state_machine::apply`),
i.e. that there wasn't a concurrent change that happened between
creating this change and applying it (which may invalidate it).
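
The validity check can be sketched in Python as an optimistic concurrency test. The `Group0History` class and the `prev_state_id` / `new_state_id` field names are hypothetical and chosen only for this illustration.

```python
import uuid


class Group0History:
    """Toy history: an ordered list of state IDs, newest last."""

    def __init__(self):
        self.state_ids = []

    def last_state_id(self):
        return self.state_ids[-1] if self.state_ids else None

    def apply(self, change):
        # A change is valid only if nothing was applied since it was created.
        if change["prev_state_id"] != self.last_state_id():
            return False
        self.state_ids.append(change["new_state_id"])
        return True


def make_change(history, description):
    """Create a change on top of the state observed right now."""
    return {
        "prev_state_id": history.last_state_id(),
        "new_state_id": uuid.uuid4(),
        "description": description,
    }


history = Group0History()
change_a = make_change(history, "create table")
change_b = make_change(history, "alter table")   # created concurrently with A

applied_a = history.apply(change_a)   # succeeds
applied_b = history.apply(change_b)   # rejected: A was applied in between
retry_b = history.apply(make_change(history, "alter table"))  # recreated on top of A
```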
2022-01-24 15:20:37 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user, the old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
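
The four modes can be summarized with a small Python decision function. This is a sketch under assumptions: the function name and parameters are hypothetical, times are plain seconds, and subtracting the propagation delay from the last repair time is one plausible reading of the option, not a statement of the actual implementation.

```python
def can_gc_tombstone(mode, *, deletion_time, now, gc_grace_seconds,
                     last_repair_time, propagation_delay_in_seconds):
    """Decide whether a tombstone may be garbage collected.

    All times are seconds on the same clock. `last_repair_time` is the
    time of the last repair covering the tombstone's range, or None.
    """
    if mode == "timeout":
        return deletion_time + gc_grace_seconds <= now
    if mode == "disabled":
        return False
    if mode == "immediate":
        return True
    if mode == "repair":
        if last_repair_time is None:
            return False  # the covering range was never repaired
        # Assumption: leave room for writes still in flight at repair time.
        return deletion_time < last_repair_time - propagation_delay_in_seconds
    raise ValueError(f"unknown tombstone_gc mode: {mode!r}")


common = dict(deletion_time=10, now=1000, gc_grace_seconds=500,
              propagation_delay_in_seconds=50)
after_repair = can_gc_tombstone("repair", last_repair_time=100, **common)
never_repaired = can_gc_tombstone("repair", last_repair_time=None, **common)
```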

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Avi Kivity
d768e9fac5 cql3, related: switch to data_dictionary
Stop using database (and including database.hh) for schema related
purposes and use data_dictionary instead.

data_dictionary::database::real_database() is called from several
places, for these reasons:

 - calling yet-to-be-converted code
 - callers with a legitimate need to access data (e.g. system_keyspace)
   but with the ::database accessor removed from query_processor.
   We'll need to find another way to supply system_keyspace with
   data access.
 - to gain access to the wasm engine for testing whether user
   defined functions compile. We'll have to find another way to
   do this as well.

The change is a straightforward replacement. One case in
modification_statement had to change a capture, but everything else
was just a search-and-replace.

Some files that lost "database.hh" gained "mutation.hh", which they
previously had access to through "database.hh".
2021-12-15 13:54:23 +02:00
Avi Kivity
021c7593b8 data_dictionary: move user_types_metadata to new module data_dictionary
The new module will contain all schema related metadata, detached from
actual data access (provided by the database class). User types is the
first content to be moved to the new module.
2021-12-15 13:52:10 +02:00
Benny Halevy
5947de7674 keyspace: get a reference to the erm_factory
To be used for creating effective_replication_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-19 10:46:51 +02:00
Botond Dénes
5b3ac3147b db/schema_tables: merge_tables_and_views(): match old/new view with old/new base table
For altered tables, the above function creates schema objects
representing before/after (old/new) table states. In case of views,
there is a matching mechanism to set the base table field of the view to
the appropriate base table object. This works by iterating over the list
of altered tables and selecting the "new_schema" field of the first
instance matching the keyspace/name of the base-table. This ends up
pairing the after/new version of the base table to both the before and
after version of the view. This means the base attached to the view is
possibly incompatible with the view it is attached to.
This patch fixes this by passing the schema generation (before/after) to
the function responsible for this matching, so it can select the
appropriate version of the base table.
For example, given the following input to `merge_tables_and_views()`:

    tables_before = { t1_before }
    tables_after = { t1_after }
    views_before = { v1_before }
    views_after = { v1_after }

Before this patch, the `base_schema` field of `v1_before` would be
`t1_after`, while it obviously should be `t1_before`. This sounds scary
but has no practical implications currently as `v1_before` is only
computed and then discarded without being used.
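
The fixed matching can be sketched in Python as follows. The dict-based schemas and the `match_base` helper are hypothetical simplifications; the only point is that the generation ("before" or "after") is passed in and used to pick the base table.

```python
def match_base(view, altered_tables, generation):
    """Return the base table schema of the requested generation.

    `altered_tables` is a list of {"before": schema, "after": schema}
    entries; `generation` is either "before" or "after".
    """
    for altered in altered_tables:
        candidate = altered[generation]
        if (candidate["keyspace"], candidate["name"]) == (view["keyspace"], view["base_name"]):
            return candidate
    return None


t1_before = {"keyspace": "ks", "name": "t1", "version": "before"}
t1_after = {"keyspace": "ks", "name": "t1", "version": "after"}
altered_tables = [{"before": t1_before, "after": t1_after}]

v1_before = {"keyspace": "ks", "name": "v1", "base_name": "t1"}
v1_after = {"keyspace": "ks", "name": "v1", "base_name": "t1"}

base_of_v1_before = match_base(v1_before, altered_tables, "before")
base_of_v1_after = match_base(v1_after, altered_tables, "after")
```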

Tests: unit(dev)

Fixes: #9586
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211108124806.151268-1-bdenes@scylladb.com>
2021-11-09 09:13:51 +02:00
Botond Dénes
ccf5c31776 treewide: system tables: don't use make_shared_schema() for creating schemas
`make_shared_schema()` is a convenience method for creating a schema in
a single function call, however it doesn't have all the advanced
capabilities of `schema_builder`. So most users (which all happen to be
system tables) pass the schema created by it to schema builder
immediately to do some further tweaking, effectively building the schema
twice. This is wasteful.
This patch changes all these users to use the newly added
`schema_builder()` constructor which has the same signature (and
therefore ease-of-use) as `make_shared_schema()`.
2021-11-05 11:41:04 +02:00
Benny Halevy
3393df45eb token_metadata, storage_service: unify token_metadata_lock and merge_lock.
Serialize the metadata changes with
keyspace create, update, or drop.

This will become necessary in the following patch
when we update the effective_replication_map
on all keyspaces and we want instances on all shards
to end up with the same replication map.

Note that storage_service::keyspace_changed is called
from the schema merge path so it already holds
the merge_lock.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 13:01:25 +03:00
Avi Kivity
daf028210b build: enable -Winconsistent-missing-override warning
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.

Enable the warning and fix numerous violations.

Closes #9347
2021-09-15 12:55:54 +03:00
Avi Kivity
3f2c680b70 Merge 'Add initial support for WebAssembly in user-defined functions (UDF)' from Piotr Sarna
This series adds very basic support for WebAssembly-based user-defined functions.

This series comes with a basic set of tests which were used to designate a minimal goal for this initial implementation.

Example usage:
```cql
CREATE FUNCTION ks.fibonacci (n int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE xwasm
    AS ' (module
  (func $fibonacci (param $n i32) (result i32)
    (if
      (i32.lt_s (local.get $n) (i32.const 2))
      (return (local.get $n))
    )
    (i32.add
      (call $fibonacci (i32.sub (local.get $n) (i32.const 1)))
      (call $fibonacci (i32.sub (local.get $n) (i32.const 2)))
    )
  )
  (export "fibonacci" (func $fibonacci))
) '
```

Note that the language is currently called "xwasm" as in "experimental wasm", because its interface is still subject to change in the future.

Closes #9108

* github.com:scylladb/scylla:
  docs: add a WebAssembly entry
  cql-pytest: add wasm-based tests for user-defined functions
  main: add wasm engine instantiation
  treewide: add initial WebAssembly support to UDF
  wasm: add initial WebAssembly runtime implementation
  db: add wasm_engine pointer to database
  lang: add wasm_engine service
  import wasmtime.hh
  lua: move to lang/ directory
  cql3: generalize user-defined functions for more languages
2021-09-14 11:34:20 +03:00
Piotr Sarna
62e8c89a9c treewide: add initial WebAssembly support to UDF
This commit adds very basic support for user-defined functions
coded in wasm. The support is very limited (only a few types work)
and was not tested against reactor stalls and performance in general.
2021-09-13 19:03:58 +02:00
Avi Kivity
115d6d8d4c system_keyspace: prepare forward-declared members
In anticipation of making system_keyspace a class instead of a
namespace, rename any member that is currently forward-declared,
since one can't forward-declare a class member. Each member
is taken out of the system_keyspace namespace and gains a
system_keyspace prefix. Aliases are added to reduce code churn.

The result isn't lovely, but can be adjusted later.
2021-09-13 15:11:26 +03:00
Piotr Sarna
4e952df470 lua: move to lang/ directory
Support for more languages is coming, so let's group them
in a separate directory.
2021-09-13 11:01:33 +02:00
Piotr Sarna
46c6603fe0 cql3: generalize user-defined functions for more languages
In order to support more languages than just Lua in the future,
Lua-specific configuration is now extracted to a separate
structure.
2021-09-13 11:01:33 +02:00
Avi Kivity
c5f52f9d97 schema_tables: don't flush in tests
Flushing schema tables is important for crash recovery (without a flush,
we might have sstables using a new schema before the commitlog entry
noting the schema change has been replayed), but not important for tests
that do not test crash recovery. Avoiding those flushes reduces system,
user, and real time on tests running on a consumer-level SSD.

before:
real	8m51.347s
user	7m5.743s
sys	5m11.185s

after:
real	7m4.249s
user	5m14.085s
sys	2m11.197s

Note real time is higher than user+sys time divided by the number
of hardware threads, indicating that there is still idle time due
to the disk flushing, so more work is needed.

Closes #9319
2021-09-12 11:32:13 +03:00
Pavel Solodovnikov
8d3c0ee9b6 raft: new schema for storing raft snapshots
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
   documentation in the future.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Remove `snapshot_id` from clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    );

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:46 +03:00
Piotr Sarna
84876a165b db,schema_tables: add handling user-defined aggregates
Aggregates are propagated, created and dropped very similarly
to user-defined functions - a set of helper functions
for aggregates are added based on the UDF implementation.
2021-08-13 11:14:11 +02:00