Commit Graph

1491 Commits

Author SHA1 Message Date
Botond Dénes
0f60cc84f4 Merge 'replica: create a replica module' from Avi Kivity
Move the ::database, ::keyspace, and ::table classes to a new replica
namespace and replica/ directory. This designates objects that only
have meaning on a replica and should not be used on a coordinator
(but note that not all replica-only classes should be in this module,
for example compaction and sstables are lower-level objects that
deserve their own modules).

The module is imperfect - some additional classes like distributed_loader
should also be moved, but there is only one way to untie Gordian knots.

Closes #9872

* github.com:scylladb/scylla:
  replica: move ::database, ::keyspace, and ::table to replica namespace
  database: Move database, keyspace, table classes to replica/ directory
2022-01-07 13:37:40 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Avi Kivity
b850b34bcc build: reduce inline threshold on aarch64 to 300
We see coroutine miscompiles with 600.

Fixes #9881.

Closes #9883
2022-01-06 15:13:27 +02:00
Botond Dénes
015d09a926 tools: utils: add configure_tool_mode()
This configures seastar to act more appropriately for a tool app, i.e.
not to act as if it owns the place, taking over all system resources.
These tools are often run on a developer machine, or even next to a
running scylla instance, so we want them to be as unintrusive as
possible.
Also use the new tool mode in the existing tools.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211220143104.132327-1-bdenes@scylladb.com>
2022-01-05 15:33:57 +02:00
Nadav Har'El
dcc42d3815 configure.py: re-run configure.py if the build/ directory is gone
When you run "configure.py", the result is not only the creation of
./build.ninja - it also creates build/<mode>/seastar/build.ninja
and build/<mode>/abseil/build.ninja. After a "rm -r build" (or "ninja
clean"), "ninja" will no longer work because those files are missing
when Scylla's ninja tries to run ninja in those internal projects.

So we need to add a dependency, e.g., that running ninja in Seastar
requires build/<mode>/seastar/build.ninja to exist, and also say
that the rule that (re)runs "configure.py" generates those files.

After this patch,

        configure.py --with-some-parameters --of-your-choice
        rm -r build
        ninja

works - "ninja" will re-run configure.py with the same parameters
when it needs Seastar's or Abseil's build.ninja.
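
A minimal sketch of the dependency described above, written as a Python
helper that emits the relevant ninja statements. The helper name
`configure_rule`, the mode list, and the exact rule layout are assumptions
made for illustration; this is not the actual configure.py code.

```python
def configure_rule(modes, configure_args):
    """Return ninja text declaring that re-running configure.py
    regenerates the top-level and the sub-project build.ninja files."""
    # Assumed layout: one Seastar and one Abseil build.ninja per mode.
    outputs = ["build.ninja"]
    for mode in modes:
        outputs.append(f"build/{mode}/seastar/build.ninja")
        outputs.append(f"build/{mode}/abseil/build.ninja")
    lines = [
        "rule configure",
        f"  command = ./configure.py {configure_args}",
        "  generator = 1",
        # Listing the sub-project files as outputs lets ninja re-run
        # configure.py when they are missing, e.g. after "rm -r build".
        f"build {' '.join(outputs)}: configure | configure.py",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(configure_rule(["dev"], "--with-some-parameters --of-your-choice"))
```

Declaring the files as outputs of the same build statement is one way to
express "the rule that (re)runs configure.py generates those files".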

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211230133702.869177-1-nyh@scylladb.com>
2022-01-05 10:15:19 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of keeping
the database consistent onto the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system and can
be slowed down by higher priority workloads, so there is no
guarantee when a repair will finish. A gc_grace_seconds value that
used to work might stop working after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect against data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user, the old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the maximum time a write can be delayed before it
eventually arrives at a node.
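
The four modes can be summarized with a small Python sketch. The names
`TombstoneGcOptions` and `can_gc`, the default delay value, and the way the
propagation delay is subtracted from the repair time are assumptions made
for illustration; the commit text only states that a tombstone is removed
after the range covering it is repaired.

```python
from dataclasses import dataclass


@dataclass
class TombstoneGcOptions:
    # One of 'timeout', 'disabled', 'immediate' or 'repair'.
    mode: str = "timeout"
    # Assumed default; the commit message does not state one.
    propagation_delay_in_seconds: int = 3600


def can_gc(deletion_time, now, gc_grace_seconds, opts, last_repair_time=None):
    """Decide whether a tombstone may be garbage collected.

    All times are seconds on the same clock. last_repair_time is the
    time of the last repair covering the tombstone's range, or None.
    """
    if opts.mode == "timeout":
        # Classic behaviour: wait gc_grace_seconds after the deletion.
        return deletion_time + gc_grace_seconds <= now
    if opts.mode == "disabled":
        return False
    if opts.mode == "immediate":
        return True
    if opts.mode == "repair":
        if last_repair_time is None:
            return False
        # Assumption: writes delayed by up to the propagation delay may
        # not have been seen by the repair, so back off by that amount.
        return deletion_time < last_repair_time - opts.propagation_delay_in_seconds
    raise ValueError(f"unknown tombstone_gc mode: {opts.mode}")
```

In this sketch a table that was never repaired keeps its tombstones in
'repair' mode, which is the conservative choice.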

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Avi Kivity
5eccb42846 Merge "Host tool executables in the scylla main executable" from Botond
"
A big problem with scylla tool executables is that they include the
entire scylla codebase and thus they are just as big as the scylla
executable itself, making them impractical to deploy on production
machines. We could try to combat this by selectively including only the
actually needed dependencies but even ignoring the huge churn of
sorting out our dependency hell (which we should do at one point anyway),
some tools may genuinely depend on most of the scylla codebase.

A better solution is to host the tool executables in the scylla
executable itself, selecting the actual main function to run
in some way. The tools themselves don't contain a lot of code so
this won't cause any considerable bloat in the size of the scylla
executable itself.
This series does exactly this, folds all the tool executables into the
scylla one, with main() switching between the actual main it will
delegate to based on an argv[1] command line argument. If this is a known
tool name, the respective tool's main will be invoked.
If it is "server", missing or unrecognized, the scylla main is invoked.
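
The switching logic described above can be sketched in Python as follows.
The tool names in `TOOLS`, the stub main functions, and the choice to strip
the selector argument before delegating are assumptions for illustration
only; the real code is the C++ switch in main.cc.

```python
def scylla_main(argv):
    # Stub standing in for the server's main function.
    return "server"


def types_main(argv):
    # Stub standing in for a hypothetical "types" tool.
    return "types tool"


def sstable_main(argv):
    # Stub standing in for a hypothetical "sstable" tool.
    return "sstable tool"


# Hypothetical tool table, keyed by the argv[1] selector.
TOOLS = {"types": types_main, "sstable": sstable_main}


def dispatch(argv):
    """Pick the main function to run based on argv[1]."""
    if len(argv) > 1:
        name = argv[1]
        if name in TOOLS:
            # Known tool: drop the selector so the tool only sees
            # its own arguments (an assumption of this sketch).
            return TOOLS[name]([argv[0]] + argv[2:])
        if name == "server":
            return scylla_main([argv[0]] + argv[2:])
    # Missing or unrecognized argv[1]: fall back to the server main.
    return scylla_main(argv)
```

With this shape, `dispatch(["scylla", "--smp", "1"])` still reaches the
server main, which keeps existing command lines working.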

Originally this series used argv[0] as the means to switch between the
main to run. This approach was abandoned for the approach mentioned above
for the following reasons:
* No launcher script, hard link, soft link or similar games are needed to
  launch a specific tool.
* No packaging needed, all tools are automatically deployed.
* Explicit tool selection, no surprises after renaming scylla to
  something else.
* Tools are discoverable via scylla's description.
* Follows the trend set by modern command line multi-command or multi-app
  programs, like git.

Fixes: #7801

Tests: unit(dev)
"

* 'tools-in-scylla-exec-v5' of https://github.com/denesb/scylla:
  main,tools,configure.py: fold tools into scylla exec
  tools: prepare for inclusion in scylla's main
  main: add skeleton switching code on argv[1]
  main: extract scylla specific code into scylla_main()
2022-01-04 17:55:07 +02:00
Eliran Sinvani
6d9d00ec9c configure.py: Set seastar scheduling groups count explicitly
In order to have stability and also regression control, we set
the scheduling groups parameter explicitly.

Closes #9847
2021-12-27 15:48:45 +02:00
Botond Dénes
bb0874b28b main,tools,configure.py: fold tools into scylla exec
The infrastructure is now in place. Remove the proxy main of the tools,
and add appropriate `else if` statements to the executable switch in
main.cc. Also remove the tool applications from the `apps` list and add
their respective sources as dependencies to the main scylla executable.
With this, we now have all tool executables living inside the scylla
main one.
2021-12-20 18:27:25 +02:00
Avi Kivity
021c7593b8 data_dictionary: move user_types_metadata to new module data_dictionary
The new module will contain all schema related metadata, detached from
actual data access (provided by the database class). User types is the
first contents to be moved to the new module.
2021-12-15 13:52:10 +02:00
Avi Kivity
c519857beb build: rearrange -O3 and -f<optimization-option> options
It turns out that -O3 enables -fslp-vectorize even if it is
disabled before -O3 on the command line. Rearrange the code
so that -O3 is before the more specific optimization options.
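
A minimal sketch of the ordering rule, assuming the flags are held in a
plain Python list. The helper name `order_optimization_flags` is
hypothetical and does not reflect how configure.py is actually structured.

```python
def order_optimization_flags(flags):
    """Place the -O<level> flag before the more specific options, so
    that later -f options are not overridden by the level's defaults."""
    levels = [f for f in flags if f.startswith("-O")]
    specific = [f for f in flags if not f.startswith("-O")]
    return levels + specific


if __name__ == "__main__":
    print(" ".join(order_optimization_flags(["-fno-slp-vectorize", "-O3"])))
```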
2021-12-07 17:52:32 +02:00
Avi Kivity
04ad07b072 build: disable superword-level parallelism (slp) on clang
Clang (and gcc) can combine loads and stores of independent variables
into wider operations, often using vector registers. This reduces
instruction count and execution unit occupancy. However, clang
is too aggressive and generates loads that break the store-to-load
forwarding rules: a load must be the same size or smaller than the
corresponding store, or it will execute with a large penalty.

Disabling slp results in larger but faster code. Comparing
before and after on Zen 3:

slp:

226766.49 tps ( 75.1 allocs/op,  12.1 tasks/op,   45073 insns/op)
226679.57 tps ( 75.1 allocs/op,  12.1 tasks/op,   45074 insns/op)
226168.79 tps ( 75.1 allocs/op,  12.1 tasks/op,   45061 insns/op)
225884.34 tps ( 75.1 allocs/op,  12.1 tasks/op,   45068 insns/op)
225998.16 tps ( 75.1 allocs/op,  12.1 tasks/op,   45056 insns/op)

median 226168.79 tps ( 75.1 allocs/op,  12.1 tasks/op,   45061 insns/op)
median absolute deviation: 284.45
maximum: 226766.49
minimum: 225884.34

no slp:

228195.33 tps ( 75.1 allocs/op,  12.1 tasks/op,   45109 insns/op)
227773.76 tps ( 75.1 allocs/op,  12.1 tasks/op,   45123 insns/op)
228088.98 tps ( 75.1 allocs/op,  12.1 tasks/op,   45117 insns/op)
228157.43 tps ( 75.1 allocs/op,  12.1 tasks/op,   45129 insns/op)
228072.29 tps ( 75.1 allocs/op,  12.1 tasks/op,   45128 insns/op)

median 228088.98 tps ( 75.1 allocs/op,  12.1 tasks/op,   45117 insns/op)
median absolute deviation: 68.45
maximum: 228195.33
minimum: 227773.76
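
The summary lines for the two Zen 3 runs can be recomputed from the five
samples each. The sketch below uses only the tps values listed above; the
function names `mad` and `throughput_gain_percent` are introduced here for
illustration.

```python
from statistics import median

# tps samples copied from the "slp" and "no slp" runs above.
slp = [226766.49, 226679.57, 226168.79, 225884.34, 225998.16]
no_slp = [228195.33, 227773.76, 228088.98, 228157.43, 228072.29]


def mad(samples):
    """Median absolute deviation: median of |x - median(samples)|."""
    m = median(samples)
    return median(abs(x - m) for x in samples)


def throughput_gain_percent(before, after):
    """Relative change of the median throughput, in percent."""
    return (median(after) - median(before)) / median(before) * 100


if __name__ == "__main__":
    print(f"slp:    median {median(slp):.2f}, mad {mad(slp):.2f}")
    print(f"no slp: median {median(no_slp):.2f}, mad {mad(no_slp):.2f}")
    print(f"gain:   {throughput_gain_percent(slp, no_slp):.2f}%")
```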

Disabling slp increases the instruction count by ~60 instructions per op
(0.13%) but increases throughput by 0.85%. This shows the impact of the
violation is quite high. It can also be observed by the effect on
stalled cycles:

slp:

         44,932.70 msec task-clock                #    0.993 CPUs utilized
            13,618      context-switches          #  303.075 /sec
                33      cpu-migrations            #    0.734 /sec
             1,695      page-faults               #   37.723 /sec
   211,997,160,633      cycles                    #    4.718 GHz                      (71.67%)
     1,118,855,786      stalled-cycles-frontend   #    0.53% frontend cycles idle     (71.67%)
     1,258,837,025      stalled-cycles-backend    #    0.59% backend cycles idle      (71.66%)
   454,445,559,376      instructions              #    2.14  insn per cycle
                                                  #    0.00  stalled cycles per insn  (71.66%)
    83,557,588,477      branches                  #    1.860 G/sec                    (71.67%)
       174,313,252      branch-misses             #    0.21% of all branches          (71.67%)

no-slp:

         44,579.83 msec task-clock                #    0.986 CPUs utilized
            13,435      context-switches          #  301.369 /sec
                33      cpu-migrations            #    0.740 /sec
             1,691      page-faults               #   37.932 /sec
   210,070,080,283      cycles                    #    4.712 GHz                      (71.68%)
     1,066,774,628      stalled-cycles-frontend   #    0.51% frontend cycles idle     (71.68%)
     1,082,255,966      stalled-cycles-backend    #    0.52% backend cycles idle      (71.66%)
   455,067,924,891      instructions              #    2.17  insn per cycle
                                                  #    0.00  stalled cycles per insn  (71.68%)
    83,597,450,748      branches                  #    1.875 G/sec                    (71.65%)
       151,897,866      branch-misses             #    0.18% of all branches          (71.68%)

Note the differences in "backend cycles idle" and "stalled cycles
per insn".

I also observed the same pattern on a much older generation Intel (although
the baseline instructions per clock there are around 0.56).

slp:

42232.64 tps ( 75.1 allocs/op,  12.1 tasks/op,   44818 insns/op)
42318.87 tps ( 75.1 allocs/op,  12.1 tasks/op,   44849 insns/op)
42331.33 tps ( 75.1 allocs/op,  12.1 tasks/op,   44857 insns/op)
42315.89 tps ( 75.1 allocs/op,  12.1 tasks/op,   44875 insns/op)
42410.19 tps ( 75.1 allocs/op,  12.1 tasks/op,   44818 insns/op)

median 42318.87 tps ( 75.1 allocs/op,  12.1 tasks/op,   44849 insns/op)
median absolute deviation: 12.46
maximum: 42410.19
minimum: 42232.64

no-slp:

42464.18 tps ( 75.1 allocs/op,  12.1 tasks/op,   44886 insns/op)
42631.88 tps ( 75.1 allocs/op,  12.1 tasks/op,   44939 insns/op)
42783.95 tps ( 75.1 allocs/op,  12.1 tasks/op,   44961 insns/op)
42671.23 tps ( 75.1 allocs/op,  12.1 tasks/op,   44947 insns/op)
42487.82 tps ( 75.1 allocs/op,  12.1 tasks/op,   44875 insns/op)

median 42631.88 tps ( 75.1 allocs/op,  12.1 tasks/op,   44939 insns/op)
median absolute deviation: 144.06
maximum: 42783.95
minimum: 42464.18

slp:

         26,877.01 msec task-clock                #    0.989 CPUs utilized
            15,621      context-switches          #    0.581 K/sec
                 9      cpu-migrations            #    0.000 K/sec
            55,322      page-faults               #    0.002 M/sec
    96,084,360,190      cycles                    #    3.575 GHz                      (72.55%)
    71,435,545,235      stalled-cycles-frontend   #   74.35% frontend cycles idle     (72.57%)
    59,531,573,539      stalled-cycles-backend    #   61.96% backend cycles idle      (70.96%)
    53,273,420,083      instructions              #    0.55  insn per cycle
                                                  #    1.34  stalled cycles per insn  (72.55%)
    10,240,844,987      branches                  #  381.026 M/sec                    (72.57%)
        94,348,150      branch-misses             #    0.92% of all branches          (72.57%)

no-slp:

         26,381.66 msec task-clock                #    0.971 CPUs utilized
            15,586      context-switches          #    0.591 K/sec
                 9      cpu-migrations            #    0.000 K/sec
            55,318      page-faults               #    0.002 M/sec
    94,317,505,691      cycles                    #    3.575 GHz                      (72.59%)
    69,693,601,709      stalled-cycles-frontend   #   73.89% frontend cycles idle     (72.59%)
    57,579,078,046      stalled-cycles-backend    #   61.05% backend cycles idle      (58.08%)
    53,260,417,953      instructions              #    0.56  insn per cycle
                                                  #    1.31  stalled cycles per insn  (72.60%)
    10,235,123,948      branches                  #  387.964 M/sec                    (72.60%)
        96,002,988      branch-misses             #    0.94% of all branches          (72.62%)
2021-12-07 17:08:38 +02:00
Avi Kivity
595cc328b1 Merge 'cql3: Remove term, replace with expression' from Jan Ciołek
This PR finally removes the `term` class and replaces it with `expression`.

* There was some trouble with `lwt_cache_id` in `expr::function_call`.
  The current code works the following way:
  * for each `function_call` inside a `term` that describes a pk restriction, `prepare_context::add_pk_function_call` is called.
  * `add_pk_function_call` takes a `::shared_ptr<cql3::functions::function_call>`, sets its `cache_id` and pushes this shared pointer onto a vector of all collected function calls
  * Later when some condition is met we want to clear cache ids of all those collected function calls. To do this we iterate through shared pointers collected in `prepare_context` and clear cache id for each of them.

  This doesn't work with `expr::function_call` because it isn't kept inside a shared pointer.
  To solve this I put the `lwt_cache_id` inside a shared pointer and then `prepare_context` collects these shared pointers to cache ids.

  I also experimented with doing this without any shared pointers, maybe we could just walk through the expression and clear the cache ids ourselves. But the problem is that expressions are copied all the time, we could clear the cache in one place, but forget about a copy. Doing it using shared pointers more closely matches the original behaviour.
The experiment is on the [term2-pr3-backup-altcache](https://github.com/cvybhu/scylla/tree/term2-pr3-backup-altcache) branch
* `shared_ptr<term>` being `nullptr` could mean:
  * It represents a cql value `null`
  * That there is no value, like `std::nullopt` (for example in `attributes.hh`)
  * That it's a mistake, it shouldn't be possible

  A good way to distinguish between optional and mistake is to look for `my_term->bind_and_get()`, we then know that it's not an optional value.

* On the other hand `raw_value` cast to bool means:
   * `false` - null or unset
   * `true` - some value, maybe empty
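
The `lwt_cache_id` handling described earlier can be illustrated with a small Python sketch, where a mutable cell object plays the role of the shared pointer. The class names `CacheIdCell`, `FunctionCall` and `PrepareContext`, and the method `clear_pk_function_calls_cache`, are hypothetical stand-ins; only `add_pk_function_call` is a name taken from the description.

```python
import copy


class CacheIdCell:
    """Stands in for a shared pointer holding an optional cache id."""

    def __init__(self):
        self.value = None


class FunctionCall:
    """Toy expression node; copies share the same CacheIdCell."""

    def __init__(self, name):
        self.name = name
        self.lwt_cache_id = CacheIdCell()


class PrepareContext:
    def __init__(self):
        self._cells = []
        self._next_id = 0

    def add_pk_function_call(self, call):
        # Assign a cache id and remember the shared cell, not the node.
        call.lwt_cache_id.value = self._next_id
        self._next_id += 1
        self._cells.append(call.lwt_cache_id)

    def clear_pk_function_calls_cache(self):
        # Clearing through the cell reaches every copy of the node.
        for cell in self._cells:
            cell.value = None


ctx = PrepareContext()
call = FunctionCall("now")
ctx.add_pk_function_call(call)
dup = copy.copy(call)  # shallow copy: shares the cell, like a copied shared_ptr
ctx.clear_pk_function_calls_cache()
```

After the last line both `call` and `dup` observe a cleared cache id, which is the property that walking and clearing a single expression copy would not provide.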

I ran a simple benchmark on my laptop to see how performance is affected:
```
build/release/test/perf/perf_simple_query --smp 1 -m 1G --operations-per-shard 1000000 --task-quota-ms 10
```
* On master (a21b1fbb2f) I get:
  ```
  176506.60 tps ( 77.0 allocs/op,  12.0 tasks/op,   45831 insns/op)

  median 176506.60 tps ( 77.0 allocs/op,  12.0 tasks/op,   45831 insns/op)
  median absolute deviation: 0.00
  maximum: 176506.60
  minimum: 176506.60
  ```
* On this branch I get:
  ```
  172225.30 tps ( 75.1 allocs/op,  12.1 tasks/op,   46106 insns/op)

  median 172225.30 tps ( 75.1 allocs/op,  12.1 tasks/op,   46106 insns/op)
  median absolute deviation: 0.00
  maximum: 172225.30
  minimum: 172225.30
  ```

Closes #9481

* github.com:scylladb/scylla:
  cql3: Remove remaining mentions of term
  cql3: Remove term
  cql3: Rename prepare_term to prepare_expression
  cql3: Make prepare_term return an expression instead of term
  cql3: expr: Add size check to evaluate_set
  cql3: expr: Add expr::contains_bind_marker
  cql3: expr: Rename find_atom to find_binop
  cql3: expr: Add find_in_expression
  cql3: Remove term in operations
  cql3: Remove term in relations
  cql3: Remove term in multi_column_restrictions
  cql3: Remove term in term_slice, rename to bounds_slice
  cql3: expr: Remove term in expression
  cql3: expr: Add evaluate_IN_list(expression, options)
  cql3: Remove term in column_condition
  cql3: Remove term in select_statement
  cql3: Remove term in update_statement
  cql3: Use internal cql format in insert_prepared_json_statement cache
  types: Add map_type_impl::serialize(range of <bytes, bytes>)
  cql3: Remove term in cql3/attributes
  cql3: expr: Add constant::view() method
  cql3: expr: Implement fill_prepare_context(expression)
  cql3: expr: add expr::visit that takes a mutable expression
  cql3: expr: Add receiver to expr::bind_variable
2021-11-30 16:39:39 +02:00
Konstantin Osipov
c22f945f11 raft: (service) manage Raft configuration during topology changes
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.
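
A toy sketch of what idempotent means here, assuming the configuration is
just a set of server ids. The class `Group0Configuration` and its methods
are hypothetical and only illustrate the property.

```python
class Group0Configuration:
    """Toy stand-in for the Raft group 0 member set."""

    def __init__(self):
        self.members = set()

    def add_server(self, server_id):
        # Adding a server that is already a member is a no-op,
        # so the call can be repeated on every restart.
        if server_id in self.members:
            return False
        self.members.add(server_id)
        return True

    def remove_server(self, server_id):
        # Removing a server that is already gone is a no-op,
        # so a failed removal can simply be retried.
        if server_id not in self.members:
            return False
        self.members.remove(server_id)
        return True
```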

However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.

In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.

Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.

Until this is done, do our best to avoid a cluster state in which
a removed node, or a node whose addition failed, is stuck in the Raft
configuration but is no longer present in gossip-managed
system tables. In other words, keep gossip the primary source of
truth. For this purpose, carefully choose the timing when we
join and leave Raft group 0:

Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.

Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.

Add tracing.
2021-11-25 12:35:42 +03:00
Konstantin Osipov
8ee88a9d8a raft: (discovery) introduce leader discovery state machine
Introduce a special state machine used to find
a leader of an existing Raft cluster or create
a new cluster.

This state machine should be used when a new
Scylla node has no persisted Raft Group 0 configuration.

The algorithm is initialized with a list of seed
IP addresses, the IP address of this server, and
this server's Raft server id.

The IP addresses are used to construct an initial list of peers.

Then, the algorithm tries to contact each peer (excluding self) from
its peer list and share the peer list with this peer, as well as
get the peer's peer list. If this peer is already part of
some Raft cluster, this information is also shared. On a response
from a peer, the current peer's peer list is updated. The
algorithm stops when all peers have exchanged peer information or
one of the peers responds with id of a Raft group and Raft
server address of the group leader.

(If any of the peers fails to respond, the algorithm re-tries
ad infinitum with a timeout).

More formally, the algorithm stops when one of the following is true:
- it finds an instance with initialized Raft Group 0, with a leader
- all the peers have been contacted, and this server's
  Raft server id is the smallest among all contacted peers.
2021-11-25 11:50:38 +03:00
Benny Halevy
d2703eace7 test: remove gossip_test
First, it doesn't test the gossiper so
it's unclear why we have it at all.
And it doesn't test anything more than what we test
using the cql_test_env either.

For testing gossip there is test/manual/gossip.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211122081305.789375-2-bhalevy@scylladb.com>
2021-11-22 16:15:41 +02:00
Botond Dénes
d4d4c0ace7 redis: mv service.* -> controller.*
2021-11-17 13:58:49 +02:00
Avi Kivity
7a3930f7cf Merge 'More nodetool-replacing virtual tables' from Botond Dénes
This PR introduces 4 new virtual tables aimed at replacing nodetool commands, working towards the long-term goal of replacing nodetool completely at least for cluster information retrieval purposes.
As you may have noticed, most of these replacements are not exact matches. This is on purpose. I feel that the nodetool commands are somewhat chaotic: they might have had a clear plan on what command prints what, but after years of organic development they are a mess of fields that feel like they don't belong. In addition to this, they are centered on C* terminology which often sounds strange or doesn't make any sense for scylla (off-heap memory, counter cache, etc.).
So in this PR I tried to do a few things:
* Drop all fields that don't make sense for scylla;
* Rename/reformat/rephrase fields that have a corresponding concept in scylla, so that it uses the scylla terminology;
* Group information in tables based on some common theme;

With these guidelines in mind let's look at the virtual tables introduced in this PR:
* `system.snapshots` - replacement for `nodetool listsnapshots`;
* `system.protocol_servers`- replacement for `nodetool statusbinary` as well as `Thrift active` and `Native Transport active` from `nodetool info`;
* `system.runtime_info` - replacement for `nodetool info`, not an exact match: some fields were removed, some were refactored to make sense for scylla;
* `system.versions` - replacement for `nodetool version`, prints all versions, including build-id;

Closes #9517

* github.com:scylladb/scylla:
  test/cql-pytest: add virtual_tables.py
  test/cql-pytest: nodetool.py: add take_snapshot()
  db/system_keyspace: add versions table
  configure.py: move release.cc and build_id.cc to scylla_core
  db/system_keyspace: add runtime_info table
  db/system_keyspace: add protocol_servers table
  service: storage_service: s/client_shutdown_hooks/protocol_servers/
  service: storage_service: remove unused unregister_client_shutdown_hook
  redis: redis_service: implement the protocol_server interface
  alternator: controller: implement the protocol_server interface
  transport: controller: implement the protocol_server interface
  thrift: controller: implement the protocol_server interface
  Add protocol_server interface
  db/system_keyspace: add snapshots virtual table
  db/virtual_table: remove _db member
  db/system_keyspace: propagate distributed<> database and storage_service to register_virtual_tables()
  docs/design-notes/system_keyspace.md: add listing of existing virtual tables
  docs/guides: add virtual-tables.md
2021-11-07 16:55:31 +02:00
Botond Dénes
5c87263ff8 configure.py: move release.cc and build_id.cc to scylla_core
These two files were only added to the scylla executable and some
specific unit tests. As we are about to use the symbols defined in these
files in some scylla_core code, move them there.
2021-11-05 15:42:42 +02:00
Michael Livshin
60f76155a7 build: have configure.py create compile_commands.json
compile_commands.json (a.k.a. "compdb",
https://clang.llvm.org/docs/JSONCompilationDatabase.html) is intended
to help stand-alone C-family LSP servers index the codebase as
precisely as possible.

The actively maintained LSP servers with good C++ support are:
- Clangd (https://clangd.llvm.org/)
- CCLS (https://github.com/MaskRay/ccls)

This change causes a successful invocation of configure.py to create a
unified Scylla+Seastar+Abseil compdb for every selected build mode,
and to leave a valid symlink in the source root (if a valid symlink
already exists, it will be left alone).
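
A minimal sketch of what unifying compilation databases amounts to, given
that a compdb is a JSON array of entries with "directory", "file" and
"command" keys. The helper `merge_compdbs` and the sample entries are
made up for illustration and are not the code added to configure.py.

```python
import json


def merge_compdbs(databases):
    """Concatenate several compilation databases into one entry list."""
    merged = []
    for entries in databases:
        merged.extend(entries)
    return merged


# Made-up single-entry databases for two of the projects.
scylla = [{"directory": "/src/scylla", "file": "main.cc",
           "command": "clang++ -c main.cc -o build/dev/main.o"}]
seastar = [{"directory": "/src/scylla/seastar", "file": "src/core/reactor.cc",
            "command": "clang++ -c src/core/reactor.cc -o reactor.o"}]

unified = merge_compdbs([scylla, seastar])
# This text is what would be written to compile_commands.json.
compdb_text = json.dumps(unified, indent=2)
```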

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #9558
2021-11-05 11:28:37 +02:00
Jan Ciolek
e458340821 cql3: Remove term
term isn't used anywhere now. We can remove it and all classes that derive from it.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Jan Ciolek
dcd3199037 cql3: Rename prepare_term to prepare_expression
prepare_term now takes an expression and returns a prepared expression.
It should be renamed to prepare_expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-11-04 15:56:45 +01:00
Avi Kivity
e1817b536f build: clobber user/group info from node_exporter tarball
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid)

Reset the uid/gid information so this doesn't happen.

Closes #9579
2021-11-04 09:27:13 +02:00
Avi Kivity
1e1e4f4934 Update abseil submodule
* abseil 9c6a50f...f70eada (122):
  > Fix over-aligned layout test with older gcc compilers (#1049)
  > Export of internal Abseil changes
  > Initial support for Haiku (#1045)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Remove bazelbuild/rules_cc dependency (#1038)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Use FreeBSD macro definition for ElfW macro for compatibility. (#1037)
  > Export of internal Abseil changes
  > Fix hashing on big endian platforms (#1028)
  > Fix typedef of sig_t on AIX (#1030)
  > Export of internal Abseil changes
  > Fixed typo `constuct` to `construct` in 3 places. (#1022)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Initial support for AIX (#1021)
  > Export of internal Abseil changes
  > Update from_chars documentation with regard to whitespace (#1020)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Include immintrin.h instead of wmmintrin.h (#1015)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add -Wno-unknown-warning-option to ABSL_LLVM_FLAGS to disable warnings on unknown warning flags. (#1008)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add missing ABSL_DLL for a few functions (#1002)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Simplifies the construction of the value returned by GenerateRealFromBits() (#994)
  > CMake: option to use cxx_std_11 (minimum) that propagates. (#986)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix Bazel build on aarch64 (#984)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > CMake: add option to use Google Test already installed on system (#969)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Use CMAKE_INSTALL_FULL_{LIBDIR,INCLUDEDIR}. (#963)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Uses alignas for portability in dynamic_annotations.h (#947)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Call FailureSignalHandlerOptions.writenfn with nullptr at the end (#938)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add missing `add_subdirectory()` call for "cleanup" (#925)
  > Allowing to change the MSVC runtime (#921)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix C++/CLI build problem (#916)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add support for more Linux architectures (#904)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add support for m68k (#900)
  > Add support for sparc and sparc64 (#899)
  > Fix uc_mcontext register access on 32-bit PowerPC (#898)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
2021-10-28 16:22:18 +03:00
Benny Halevy
4062cd17e0 test: hashers_test: mutation_fragment_sanity_check: stop semaphore
To stop the semaphore as required, we need to run
the test in a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211024053402.990142-1-bhalevy@scylladb.com>
2021-10-24 11:29:23 +03:00
Avi Kivity
acfe0a3803 build: reinstate -Wunknown-attributes
The warning was disabled during the migration to clang, but now it
appears unnecessary (perhaps clang added support for the attributes
it did not have then). It is valuable for detecting misspelled
attributes, so enable it again.

Closes #9480
2021-10-14 14:26:56 +03:00
Nadav Har'El
33f8ec09df Merge 'treewide: improve compatibility with gcc 11' from Avi Kivity
Our source base drifted away from gcc compatibility; this mostly
restores the ability to build with gcc. An important exception is
coroutines that have an initializer list [1]; this still doesn't work.

We aim to switch back to gcc 11 if/when this gives us better
C++ compatibility and performance.

Test: unit (dev)

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98056

Closes #9459

* github.com:scylladb/scylla:
  test: radix_tree_printer: avoid template specialization in class context
  test: raft: avoid ignored variable errors
  test: reader_concurrency_semaphore_test: isolate from namespace of source_location
  test: cql_query_test: drop unused lambda assert_replication_not_contains
  test: commitlog_test: don't use deprecated seastar::unaligned_cast
  test: adjust signed/unsigned comparisons in loops and boost tests
  build: silence some gcc 11 warnings
  sstables: processing_result_generator: make coroutine support palatable for C++20 compilers
  managed_bytes: avoid compile-time loop in converting constructor
  service: service_level_controller: drop unused variable sl_compare
  raft: disambiguate promise name in raft::active_read
  locator: azure_snitch: use full type name in definition of globals
  cql3: statements: create_service_level_statement: don't ignore replace_defaults()
  cql3: statement_restrictions: adjust call to std::vector deduction guide
  types: remove recursive constraint in deserialize_value
  cql3: restrictions: relax constraint on visitor_with_binary_operator_content
  treewide: handle switch statements that return
  cql3: expr: correct type of captured map value_type
  cdc: adjust type of streams_count
  alternator: disambiguate attrs_to_get in table_requests
2021-10-11 16:54:01 +03:00
Pavel Emelyanov
42f83f6669 storage_service: Move the sstables loading code
Just cut-n-paste the code into sstables_loader.cc. No other
changes but replace storage service logger with its own one.
For now the code stays in storage_service class, but next
patch will relocate the code into the sstables_loader one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-11 11:07:39 +03:00
Avi Kivity
15ffd84473 build: silence some gcc 11 warnings
These warnings are valuable, but limit the noise for now by disabling
them.
2021-10-10 18:16:50 +03:00
Tomasz Grabiec
e89b9799b8 Merge 'sstable mx reader: implement reverse single-partition reads' from Kamil Braun
Until now reversed queries were implemented inside
`querier::consume_page` (more precisely, inside the free function
`consume_page` used by `querier::consume_page`) by wrapping the
passed-in reader into `make_reversing_reader` and then consuming
fragments from the resulting reversed reader.

The first couple of commits change that by pushing the reversing down below
the `make_combined_reader` call in `table::query`. This allows
working on improving reversing for memtables independently from
reversing for sstables.

We then extend the `index_reader` with functions that allow
reading the promoted index in reverse.

We introduce `partition_reversing_data_source`, which wraps an sstable data
file and returns data buffers with contents of a single chosen partition
as if the rows were stored in reverse order.

We use the reversing source and the extended index reader in
`mx_sstable_mutation_reader` to implement efficient (at least in theory)
reversed single-partition reads.

The patchset disables cache for reversed reads. Fast-forwarding
is not supported in the mx reader for reversed queries at this point.

Details in commit messages. Read the commits in topological order
for best review experience.

Refs: #9134
(not saying "Fixes" because it's only for single-partition queries
without forwarding)

Closes #9281

* github.com:scylladb/scylla:
  table: add option to automatically bypass cache for reversed queries
  test: reverse sstable reader with random schema and random mutations
  sstables: mx: implement reversed single-partition reads
  sstables: mx: introduce partition_reversing_data_source
  sstables: index_reader: add support for iterating over clustering ranges in reverse
  clustering_key_filter: clustering_key_filter_ranges owning constructor
  flat_mutation_reader: mention reversed schema in make_reversing_reader docstring
  clustering_key_filter: document clustering_key_filter_ranges::get_ranges
2021-10-04 15:37:34 +02:00
Wojciech Mitros
64e703bb54 sstables: mx: introduce partition_reversing_data_source
This patch adds an implementation of a data source that wraps an sstable
data file and returns data buffers with contents of one partition in the
sstable as if the rows of the partition were present in a reversed
order. In other words, to the user of the source the partition appears
to be reversed. We shall call this an 'intermediary' data source.

As part of the interface of the intermediary source the user is also
given read access to the source's current position over the data file,
and the constructor of the source takes a reference to `index_reader`.
This is necessary because the index operates directly on data file
offsets and we want the user to be able to use the index to skip
sequences of rows.

In order to ask the source to skip a sequence of rows - e.g. when jumping
between clustering ranges - the user must advance the index's upper bound
in reverse (to an earlier position). The source will then notice that
the end position of the index has changed and take appropriate action.

An alternative would be to translate the data positions of
`index_reader` to 'reversed positions' of the intermediary and then use
`skip_to` for skipping, as we do for forward reads. However this
solution would introduce more complexity to `index_reader` and the
intermediary source. One reason for the complexity in the input stream
is that we would have two kinds of skips: a single row skip,
and a skip to a clustering range. We know the offset of the next row,
so we could check that to differentiate them. We would also need to add
information about the position of the first clustering row and the end of
the last one to the index_reader. Skipping by checking the index seems
to be overall simpler.

For simplicity, the intermediary stream always starts with
parsing the partition header and (if present) the static row,
and returning the corresponding bytes as a result of the first
read.

After partition header and static row we must find the last row entry of
the requested range. If the range ends before the partition end (i.e.
there are more row entries after the range) we can use the 'previous
unfiltered size' of the row following the range; otherwise we must scan
the last promoted index block and take its last row.

After finding the data range of the last row, we parse rows
consecutively in reversed order.  We must parse the rows partially
to learn their lengths and the positions of previous rows. We're
using similar constructs as in the sstable parser, but it only
contains a small part of the parsing coroutine and doesn't perform
any correctness checks.  The parser for rows still turned out rather
big mostly because we can't always deduce the size of the clustering
blocks without reading the block header.

The parser allows reading rows while skipping their bodies also in
non-reversed order, which we are making use of while reading the
last promoted index block.

The intermediary data source has one more utility: reversing range
tombstones.  When we read a tombstone bound/boundary, we modify
the data buffer so that the resulting bound/boundary has the reversed
kind (so we don't read ends before starts) and the boundaries have their
before/after timestamps swapped.
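
The tombstone reversal described above can be sketched as follows. This is
only an illustration: the enum, the struct, and the function names are
hypothetical and are not taken from the sstable code, and the real source
rewrites serialized bytes in a data buffer rather than a struct.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical names for illustration; not the identifiers used in the sstable code.
enum class bound_kind {
    incl_start, excl_start,   // start of a range tombstone
    incl_end, excl_end,       // end of a range tombstone
    excl_end_incl_start,      // boundary: one tombstone ends, the next one starts
    incl_end_excl_start,
};

struct tombstone_marker {
    bound_kind kind;
    std::int64_t before_timestamp;  // tombstone in effect before this position
    std::int64_t after_timestamp;   // tombstone in effect after it (boundaries only)
};

// Starts become ends and vice versa; inclusiveness is preserved.
bound_kind reverse_kind(bound_kind k) {
    switch (k) {
    case bound_kind::incl_start: return bound_kind::incl_end;
    case bound_kind::excl_start: return bound_kind::excl_end;
    case bound_kind::incl_end: return bound_kind::incl_start;
    case bound_kind::excl_end: return bound_kind::excl_start;
    case bound_kind::excl_end_incl_start: return bound_kind::incl_end_excl_start;
    case bound_kind::incl_end_excl_start: return bound_kind::excl_end_incl_start;
    }
    return k;  // not reached for valid enumerators
}

bool is_boundary(bound_kind k) {
    return k == bound_kind::excl_end_incl_start || k == bound_kind::incl_end_excl_start;
}

// Reverse the kind; for boundaries also swap the before/after timestamps.
tombstone_marker reverse_marker(tombstone_marker m) {
    m.kind = reverse_kind(m.kind);
    if (is_boundary(m.kind)) {
        std::swap(m.before_timestamp, m.after_timestamp);
    }
    return m;
}
```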
2021-10-04 15:24:12 +02:00
Piotr Sarna
e2fe8559ca configure: temporarily disable wasm support for aarch64
There seems to be a problem with libwasmtime.a dependency on aarch64,
causing occasional segfaults during tests - specifically, tests
which exercise the path for halting wasm execution due to fuel
exhaustion. As a temporary measure, wasm is disabled on this
architecture to unblock the flow.

Refs #9387

Closes #9414
2021-09-30 14:57:04 +03:00
Piotr Sarna
d3edca4b43 Merge 'alternator: add stub implementation of TTL's API operations'
... from Nadav Har'El

This small series adds a stub implementation of Alternator's
UpdateTimeToLive and DescribeTimeToLive operations. These operations can
enable, disable, or inquire about, the chosen expiration-time attribute.
Currently, the information about the chosen attribute is only saved,
with no actual expiration of any items taking place.

Because this is an incomplete implementation of this feature, it is not
enabled unless an experimental flag is enabled on all nodes in the
cluster.

See the individual patches for more information on what this series
does.

Refs #5060.

Closes #9345

* github.com:scylladb/scylla:
  test/alternator: rename utility function test_table_name()
  alternator: stub TTL operations
  alternator: make three utility functions in executor.cc non-static
  test/alternator: test another corner case of TTL
2021-09-21 09:58:17 +02:00
Avi Kivity
15819e0304 Merge "Database start/stop code sanitation" from Pavel E
"
Currently database start and stop code is quite dispersed and
exists in two slightly different forms -- one in main and the
other one in cql_test_env. This set unifies both and makes
them look almost the perfect way:

    sharded<database> db;
    db.start(<dependencies>);
    auto stop = defer([&db] { db.stop().get(); });
    db.invoke_on_all(&database::start).get();

with all (well, most) other mentions of the "db" variable
being arguments for other services' dependencies.

tests: unit(dev, release), unit.cross_shard_barrier(debug)
       dtest.simple_boot_shutdown(dev)
refs: #2737
refs: #2795
refs: #5489

"

* 'br-database-teardown-unification-2' of https://github.com/xemul/scylla: (26 commits)
  main: Log when database starts
  view_update_generator: Register staging sstables in constructor
  database, messaging: Delete old connection drop notification
  database, proxy: Relocate connection-drop activity
  messaging, proxy: Notify connection drops with boost signal
  database, tests: Rework recommended format setting
  database, sstables_manager: Sow some noexcepts
  database: Eliminate unused helpers
  database: Merge the stop_database() into database::stop()
  database: Flatten stop_database()
  database: Equip with cross-shard-barrier
  database: Move starting bits into start()
  database: Add .start() method
  main: Initialize directories before database
  main, api: Detach set_server_config from database and move up
  main: Shorten commitlog creation
  database: Extract commitlog initialization from init_system_keyspace
  repair: Shutdown without database help
  main: Shift iosched verification upward
  database: Remove unused mm arg from init_non_system_keyspaces()
  ...
2021-09-20 10:26:13 +03:00
Nadav Har'El
4ffd8c1f2b alternator: stub TTL operations
This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive
operations to Alternator. These operations can enable, disable, or inquire
about, the chosen expiration-time attribute.

Currently, the information about the chosen attribute is only saved, with
no actual expiration of any items taking place.

Some of the tests for the TTL feature start to pass, so their xfail tag
is removed.

Because this new feature is incomplete, it is not enabled unless
the "alternator-ttl" experimental feature is enabled. Moreover, for
these operations to be allowed, the entire cluster needs to support
this experimental feature, because all nodes need to participate in the
data expiration - if some old nodes don't support Alternator TTL, some
of the data they hold won't get expired... So we don't allow enabling
TTL until all the nodes in the cluster support this feature.
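
The cluster-wide gating rule can be sketched roughly as below. This is a
minimal illustration, not Scylla's feature service: `feature_set` and
`cluster_supports` are hypothetical names, and each node is modelled as a
plain set of feature strings.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: each node advertises a set of supported feature names.
using feature_set = std::set<std::string>;

// A feature may be used only if every node in the cluster supports it.
// An empty cluster is treated as not supporting anything.
bool cluster_supports(const std::vector<feature_set>& nodes, const std::string& feature) {
    return !nodes.empty() &&
           std::all_of(nodes.begin(), nodes.end(), [&feature](const feature_set& fs) {
               return fs.count(feature) > 0;
           });
}
```

With one old node in the cluster, `cluster_supports(nodes, "alternator-ttl")`
is false, so the TTL operations would be rejected until that node is upgraded.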

The implementation is in a new source file, alternator/ttl.cc. This
source file will continue to grow as we implement the expiration feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-09-19 21:05:21 +03:00
Pavel Emelyanov
bb23986826 wasm: Localize it to database usage
The wasm::engine exists as a sharded<> service in main, but it's only
passed by local reference into database on start. There's not much benefit
in keeping it at main scope; things get much simpler if the
engine lives purely in database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:17 +03:00
Pavel Emelyanov
e324230648 utils: Introduce cross-shard barrier (with test)
Add a synchronization facility to let shards wait for each
other to pass through certain points in the code.
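
As a rough analogy only, the idea of such a barrier can be sketched with
standard threads. This is not the Seastar-based implementation added by the
patch (shards are not std::thread objects and the real facility does not
block); `simple_barrier` and `phases_are_ordered` are hypothetical names.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier: arrive_and_wait() returns once all participants called it.
class simple_barrier {
    std::mutex _mtx;
    std::condition_variable _cv;
    const std::size_t _participants;
    std::size_t _waiting = 0;
    std::size_t _generation = 0;
public:
    explicit simple_barrier(std::size_t participants) : _participants(participants) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lk(_mtx);
        const std::size_t gen = _generation;
        if (++_waiting == _participants) {
            // Last one in: reset for the next round and release everybody.
            _waiting = 0;
            ++_generation;
            _cv.notify_all();
        } else {
            _cv.wait(lk, [this, gen] { return _generation != gen; });
        }
    }
};

// Runs `workers` threads through two phases and returns true if no thread
// entered phase 2 before every thread had finished phase 1.
bool phases_are_ordered(std::size_t workers) {
    simple_barrier barrier(workers);
    std::atomic<std::size_t> phase1_done{0};
    std::atomic<bool> ordered{true};
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < workers; ++i) {
        threads.emplace_back([&barrier, &phase1_done, &ordered, workers] {
            phase1_done.fetch_add(1);        // phase 1
            barrier.arrive_and_wait();       // wait for the other workers
            if (phase1_done.load() != workers) {
                ordered.store(false);        // phase 2 started too early
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return ordered.load();
}
```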

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:35:12 +03:00
Avi Kivity
daf028210b build: enable -Winconsistent-missing-override warning
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.
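
A small sketch of the kind of mismatch in question, using hypothetical
classes that are not from the Scylla tree. In `broken_reader` the missing
`const` declares a new function instead of overriding; had it been marked
`override`, the compiler would reject it.

```cpp
#include <string>

struct reader {
    virtual ~reader() = default;
    virtual std::string name() const { return "reader"; }
};

// Meant to override reader::name(), but the signature differs (no const),
// so this silently declares an unrelated virtual function.
struct broken_reader : reader {
    virtual std::string name() { return "broken"; }
};

// With `override`, any signature mismatch becomes a compile error.
struct fixed_reader : reader {
    std::string name() const override { return "fixed"; }
};

// Calls through the base interface, as most callers would.
std::string name_via_base(const reader& r) {
    return r.name();
}
```

Calling `name_via_base()` on a `broken_reader` still reaches
`reader::name()`, which is the silent failure that consistently applied
`override` annotations guard against.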

Enable the warning and fix numerous violations.

Closes #9347
2021-09-15 12:55:54 +03:00
Avi Kivity
3f2c680b70 Merge 'Add initial support for WebAssembly in user-defined functions (UDF)' from Piotr Sarna
This series adds very basic support for WebAssembly-based user-defined functions.

This series comes with a basic set of tests which were used to designate a minimal goal for this initial implementation.

Example usage:
```cql
CREATE FUNCTION ks.fibonacci (input int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE xwasm
    AS ' (module
  (func $fibonacci (param $n i32) (result i32)
    (if
      (i32.lt_s (local.get $n) (i32.const 2))
      (return (local.get $n))
    )
    (i32.add
      (call $fibonacci (i32.sub (local.get $n) (i32.const 1)))
      (call $fibonacci (i32.sub (local.get $n) (i32.const 2)))
    )
  )
  (export "fibonacci" (func $fibonacci))
) '
```

Note that the language is currently called "xwasm" as in "experimental wasm", because its interface is still subject to change in the future.

Closes #9108

* github.com:scylladb/scylla:
  docs: add a WebAssembly entry
  cql-pytest: add wasm-based tests for user-defined functions
  main: add wasm engine instantiation
  treewide: add initial WebAssembly support to UDF
  wasm: add initial WebAssembly runtime implementation
  db: add wasm_engine pointer to database
  lang: add wasm_engine service
  import wasmtime.hh
  lua: move to lang/ directory
  cql3: generalize user-defined functions for more languages
2021-09-14 11:34:20 +03:00
Piotr Sarna
78afd518a8 wasm: add initial WebAssembly runtime implementation
The engine is based on wasmtime and is able to:
 - compile wasm text format to bytecode
 - run a given compiled function with custom arguments

This implementation is missing crucial features, such as support for
any types other than 32-bit integers. It serves as a skeleton
for a future full implementation.
2021-09-13 19:03:58 +02:00
Takuya ASADA
f93793da7e configure.py: remove $builddir/release/{scylla_product}-python3-{arch}-package.tar.gz from dist-python3 target
The '$builddir/release/{scylla_product}-python3-package.tar.gz' entry in the
dist-python3 target is for compat-python3; we forgot to remove it in 35a14ab.

Fixes #9333

Closes #9334
2021-09-13 18:48:10 +03:00
Piotr Sarna
5e6fa47198 lang: add wasm_engine service
WASM engine stores the wasm runtime engine for user-defined functions.
2021-09-13 11:01:33 +02:00
Piotr Sarna
4e952df470 lua: move to lang/ directory
Support for more languages is coming, so let's group them
in a separate directory.
2021-09-13 11:01:33 +02:00
Botond Dénes
6e78e6c97f tools: remove scylla-sstable-index
It is replaced by scylla-sstable --dump-index.
2021-09-07 17:10:44 +03:00
Botond Dénes
2c600e34aa tools: introduce scylla-sstable
A tool which can be used to examine the content of sstable(s) and
execute various operations on them. The currently supported operations
are:
* dump - dumps the content of the sstable(s), similar to sstabledump;
* index-dump - dumps the content of the sstable index(es), similar to
  scylla-sstable-index;
* writetime-histogram - generates a histogram of all the timestamps in
  the sstable(s);
* custom - a hackable operation for the expert user (until scripting
  support is implemented);
* validate - validate the content of the sstable(s) with the mutation
  fragment stream validator, same as scrub in validate mode;
2021-09-07 17:10:44 +03:00
Botond Dénes
23a56beccc tools: add schema_loader
A utility which can load a schema from a schema.cql file. The file has
to contain all the "dependencies" of the table: keyspace, UDTs, etc.
This will be used by the scylla-sstable-crawler in the next patch.
2021-09-07 15:47:22 +03:00
Takuya ASADA
729d0feef0 install-dependencies.sh: add scylla-driver to relocatable python3
Pass the --pip-packages option to tools/python3/reloc/build_reloc.sh and
add scylla-driver to the relocatable python3, which is required by
fix_system_distributed_tables.py.

[avi: regenerate toolchain]

Ref #9040
2021-09-02 11:52:47 +03:00
Pavel Emelyanov
e26a6c1acc btree, test: Test exception safety and non-leakness of btree::clone_from
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Nadav Har'El
bd4552fd57 configure.py: fix build-mode-specific targets to not build all modes
We have in our Ninja build file various targets which ask to build just
a single build mode. For example, "ninja dev" builds everything in dev
mode - including Scylla, tests, and distribution artifacts - but
shouldn't build anything in other build modes (debug, release, etc.),
even if they were previously configured by configure.py.

However, we had a bug where these build-mode-specific targets
nevertheless compiled *all* configured modes, not just the requested
mode.

The bug was introduced in commit edd54a9463 -
targets "dist-server-compat" and "dist-unified-compat" were introduced,
but instead of having per-build-mode versions of these targets, only
one of each was introduced building all modes. When these new targets
were used in a couple of places in per-build-mode targets, it forced
these targets to build all modes instead of just the chosen one.

The solution is to split the dist-server-compat target into multiple
dist-server-compat-{mode}, and similarly split dist-unified-compat.
The unsplit target is also retained - for use in targets that really
want all build modes.

Fixes #9260.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210829123418.290333-1-nyh@scylladb.com>
2021-08-29 15:38:27 +03:00
Avi Kivity
725065b066 cql3: term::raw, expr: add bridge between term::raw and expressions
A term_raw_expression is a term::raw that holds an expression. It will
be used to incrementally convert the source base to expressions, while
still exposing the result to the common interface of shared_ptr<term::raw>.
2021-08-26 14:14:18 +03:00